
I have nine 2 GB files containing approximately 12M string records, and I want to insert each record as a document into a local MongoDB instance (on Windows).

Right now I'm reading line by line and inserting every second line (the first of each pair is an unnecessary header), like this:

bool readingFlag = false;
foreach (var line in File.ReadLines(file))
{
    if (readingFlag)
    {
        // Build a JSON string and deserialize it straight into a BsonDocument.
        string json = "{'read':'" + line + "'}";
        var document = MongoDB.Bson.Serialization.BsonSerializer
            .Deserialize<BsonDocument>(json);

        await collection.InsertOneAsync(document);
        readingFlag = false;
    }
    else
    {
        readingFlag = true;
    }
}

This method works, but not as fast as I expected. I'm in the middle of the first file now, and I estimate it will take about 4 hours for just this one file (roughly 40 hours for all my data).

I think my bottleneck is the file reading, but since the files are very big I can't load one into memory (I get an out-of-memory exception).

Is there any other approach that I'm missing here?

– Yogevnn
  • Instead of inserting one at a time, why not batch insert? I believe the connection has an `InsertBatch` method available, so you can read in some number of lines and insert them in one go. – BenM Jun 24 '16 at 20:01
  • I just used `File.ReadLines()` to read a 960 MB UTF-8 text file that has 6.056 million lines, and for each line had it deserialize a string (your `document` string with `"test"` as the `line` for every line), and it only took 19 seconds. So I doubt your bottleneck is reading or even deserializing, but you can know for sure by timing it with [Stopwatch](https://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch(v=vs.110).aspx) and commenting out your `InsertOneAsync()` line. Chances are it only takes a minute or two once you comment that line out (a sketch of this timing check follows these comments). – Quantic Jun 24 '16 at 20:15
  • Can you share the sample file with at least 10 lines? – jOSe Jun 24 '16 at 23:07
  • @BenM I'll check the answer and update about that. @Quantic you are correct, I checked that now. @jOSe it's just 100 chars, all the same. – Yogevnn Jun 25 '16 at 11:30
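
A minimal sketch of the timing check Quantic suggests, assuming the same `file` and `collection` variables as the question's code:

var sw = System.Diagnostics.Stopwatch.StartNew();
foreach (var line in File.ReadLines(file))
{
    // Same work as the question's loop, minus the database round-trip.
    var document = MongoDB.Bson.Serialization.BsonSerializer
        .Deserialize<BsonDocument>("{'read':'" + line + "'}");
    // await collection.InsertOneAsync(document);  // commented out to time reading only
}
sw.Stop();
Console.WriteLine($"Read + deserialize took: {sw.Elapsed}");

If the loop finishes in a minute or two with the insert commented out, the bottleneck is the per-document insert, not the file reading.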

2 Answers


I think we could make use of the following:

  1. Read a bunch of lines and insert them in one batch with `InsertMany` (see the code below)
  2. Insert the data on a separate thread, since we don't need to wait for each batch to finish
  3. Use a typed class, `TextData`, to push serialization onto the other thread

You can play with the batch size (`limitAtOnce`), since the best value depends on the amount of data read from the file:

public class TextData
{
    public ObjectId _id { get; set; }
    public string read { get; set; }
}

public class Processor
{
    public async Task ProcessData()
    {
        var client = new MongoClient("mongodb://localhost:27017");
        var database = client.GetDatabase("test");

        var collection = database.GetCollection<TextData>("Yogevnn");
        var readingFlag = false;
        var listOfDocuments = new List<TextData>();
        var limitAtOnce = 100;
        var current = 0;

        foreach (var line in File.ReadLines(@"E:\file.txt"))
        {
            if (readingFlag)
            {
                var dataToInsert = new TextData { read = line };
                listOfDocuments.Add(dataToInsert);
                readingFlag = false;
                Console.WriteLine($"Current position: {current}");

                if (++current == limitAtOnce)
                {
                    current = 0;
                    Console.WriteLine("Inserting data");
                    var listToInsert = listOfDocuments;

                    // Fire the batch insert on a background task so reading can continue.
                    Task.Run(async () =>
                    {
                        Console.WriteLine("Inserting data START");
                        await collection.InsertManyAsync(listToInsert);
                        Console.WriteLine("Inserting data FINISH");
                    });
                    listOfDocuments = new List<TextData>();
                }
            }
            else
            {
                readingFlag = true;
            }
        }

        // Insert the remainder (the last, possibly partial, batch).
        if (listOfDocuments.Count > 0)
        {
            await collection.InsertManyAsync(listOfDocuments);
        }
    }
}
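
A minimal sketch of invoking this from a console app, assuming an async `Main` (available in C# 7.1+):

public class Program
{
    public static async Task Main()
    {
        // Runs the read-and-batch-insert pipeline defined above.
        var processor = new Processor();
        await processor.ProcessData();
    }
}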

Any comments welcome!

– profesor79
  • Remove all of those `Console.WriteLine()` calls after debugging and before doing the actual work, because each call is comparatively expensive. Adding 2 `Console.WriteLine()` calls to my `File.ReadLines()` test (as mentioned in my comment on the question) makes reading *only* 10,000 lines of my text file go from 0.094 seconds to 20 seconds. – Quantic Jun 24 '16 at 21:05
  • @Quantic thanks for that - I think Yogevnn will manage that :-) and will probably try it first with some debug output as well, to see how it is going. In my case this snippet read 116,666 lines out of the 233,332 lines in the file in less than 2 seconds. – profesor79 Jun 24 '16 at 21:14
  • Thank you for your time! What is the last line supposed to do? And how can you insert `TextData`? (I'm getting a conversion error.) – Yogevnn Jun 25 '16 at 12:01
  • @Yogevnn The last line inserts the remainder: if we have 50 lines left instead of 100, that remainder still gets inserted. `TextData` is an added class that makes the collection a typed one instead of a BSON one. Check that this class is added alongside your code. – profesor79 Jun 25 '16 at 14:20
  • @profesor79 I'm not sure I understand what the remainder is. Is `TextData` able to be pushed to MongoDB? – Yogevnn Jun 25 '16 at 15:34
  • Yes, when the Mongo driver sees a typed `collection`, it will serialize and deserialize results to `IEnumerable<TextData>` - that means we don't need to deal with BSON documents, but can use typed objects instead (see the sketch after these comments). – profesor79 Jun 25 '16 at 21:25
  • When the loop finishes and `current` is, let's say, 15, those 15 items still need to be inserted, since we only insert when `current == limitAtOnce`. To avoid truncating the data, we need to ensure the remainder is inserted as well at the end of the process. – profesor79 Jun 25 '16 at 21:27
  • @Yogevnn please see my comments – profesor79 Jun 25 '16 at 21:28
  • @profesor79 I got it now about the remainder; I forgot about the limit. About `TextData`: I'm getting an error when trying to do `collection.InsertManyAsync(listToInsert);` – Yogevnn Jun 25 '16 at 22:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115608/discussion-between-profesor79-and-yogevnn). – profesor79 Jun 25 '16 at 22:46
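
To illustrate the typed-collection point from the comments, a minimal sketch (it assumes the `database` handle and `TextData` class from the answer; the sample data is made up):

// With a typed collection the driver serializes TextData to BSON for us.
var collection = database.GetCollection<TextData>("Yogevnn");

await collection.InsertManyAsync(new List<TextData>
{
    new TextData { read = "first line" },
    new TextData { read = "second line" }
});

// Results come back as TextData objects; no BsonDocument handling is needed.
var docs = await collection.Find(_ => true).ToListAsync();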

In my experiments I found `Parallel.ForEach(File.ReadLines("path"), ...)` to be the fastest. The file size was about 42 GB. I also tried batching sets of 100 lines and saving each batch, but that was slower than `Parallel.ForEach` (a sketch follows below).
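
A minimal sketch of that approach, borrowing the `TextData` class and collection from the first answer (assumptions; note this version does not reproduce the question's every-second-line header skipping):

Parallel.ForEach(File.ReadLines(@"E:\file.txt"), line =>
{
    // IMongoCollection<T> is thread-safe, so concurrent single inserts are fine;
    // the driver's connection pool spreads them across connections.
    collection.InsertOne(new TextData { read = line });
});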

Another example: Read large txt file multithreaded?

– Azsgy