I have a limited amount of RAM on my server but a large amount of data I need to work with in memory in a console program. Are there any tricks that would let me get the same end result without needing so much RAM?

For this example I have 100 million email addresses in a string list. I need to find out whether any of the new emails I am comparing against it already exist in it. If a new email does not exist, add it; if it does, don't add it, so we always have a unique list of emails with no duplicates.

100 million emails in this example require approximately 17 GB of RAM.

Are there any tricks or tips you know of to reduce how much RAM is required while still being able to do a "does it exist in the collection?" comparison? The kinds of things that come to mind: a different type of collection, a third-party library that compresses data in memory but still lets you sort or compare on it, or perhaps a file-based database system that uses far less memory for the same amount of data.

I've written the code below to demonstrate the normal way of doing this, which causes 17 GB of RAM to be consumed.

using System;
using System.Collections.Generic;
using System.Linq;

namespace NewProgram
{
    class Program
    {
        public static List<string> emails = new List<string>(); 

        public static void Main(string[] args)
        {
            LoadAllEmails();

            Console.WriteLine(emails.Count() + " total emails"); //100000000 total emails

            AddEmailsThatDontExistInMasterList(
                new List<string>()
                {
                "something@test.com", //does not already exist, so it will be added to list
                "testingfirst.testinglast"+ (1234567).ToString() + "@testingdomain.com", //should already exist, won't be added
                "testingfirst.testinglast"+ (3333335).ToString() + "@testingdomain.com", //should already exist, won't be added
                "something2@test.com", //does not already exist, so it will be added to list
                "testingfirst.testinglast"+ (8765432).ToString() + "@testingdomain.com", //should already exist, won't be added
                });

            Console.WriteLine(emails.Count() + " total emails after"); //100000002 total emails

            Console.ReadLine();
        }


        public static void LoadAllEmails()
        {
            for (int i = 0; i < 100000000; i++)  //100,000,000 emails = approximately 17GB of memory
            {
                emails.Add("testingfirst.testinglast" + i.ToString() + "@testingdomain.com");
            }
        }

        public static void AddEmailsThatDontExistInMasterList(List<string> newEmails)
        {
            foreach (string email in newEmails)
            {
                if (emails.Contains(email) == false)
                {
                    emails.Add(email);
                }
            }
        }
    }
}

After adding 100,000,000 emails to the "emails" collection, it looks at 5 more emails in a new list being added to it. 2 will be added; 3 will not, since they are duplicates already in the list. The total when completed is 100,000,002 emails in the collection. This is only meant to demonstrate that my end goal is to be able to compare a value against an existing, very large collection to see whether it is a duplicate or already exists in that collection. The other goal is to get the total consumed RAM down from 17 GB to something much smaller.

SED
  • You could use a DB. – ikegami May 25 '19 at 04:57
  • Alternatively, work from a sorted list. Then your memory requirements drop to O(1). – ikegami May 25 '19 at 05:00
  • Database + index on email address = tiny memory usage, fast lookup. Please explain why the list needs to be in memory instead of doing this. Or just add another 32 GB of RAM to your server :) – Dave S May 25 '19 at 05:23
  • I doubt in-memory is really going to be much better than a simple, clean database unless you're trying to do a huge number of comparisons but you could try rolling your own hash or prefix table + compressed buffer per table scheme. For example a 16-bit hash slices the 17 GB into 65,000 separate sections, each of which is then compressed. You then hash the new email, decompress the one table entry's section and search that. – Dave S May 25 '19 at 05:31
  • If you are still going for the memory route, you could also consider storing the strings using UTF-8 encoding, this should give you a 2:1 compression on most stored addresses as many characters will be encoded as single bytes rather than 16-bits if using the default (UTF-16) encoding. – BlueStrat May 25 '19 at 05:42
  • Is there a specific NFR requiring you to do this in memory? The obvious answer is to use a database. If that isn't fast enough, use a bigger database server. If that isn't fast enough, use a database cluster. – John Wu May 25 '19 at 08:07
  • @DaveS yes I am trying to do a large number of comparisons, which is why in memory seems to me to be the better approach. I am exploring options that allow me to run this code on a server that only has 4 GB of memory, without needing to call an external database server for "does it exist in the collection?". – SED May 25 '19 at 15:44

3 Answers


Option 1: Use a Ternary Tree

This data structure is an efficient way to store words in memory: shared prefixes are stored only once, so it is compact and fast to search.
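
A minimal sketch of the idea (the class and member names are illustrative, not from any library; a production version would use a more compact node layout to keep per-node overhead down):

// Ternary search tree sketch: each node holds one character, so strings that
// share a prefix share the nodes for that prefix. Assumes non-empty strings.
class TstNode
{
    public char Ch;
    public bool IsEnd;          // true if a stored string ends at this node
    public TstNode Lo, Eq, Hi;  // children: less-than, equal, greater-than
}

class TernaryTree
{
    private TstNode root;

    public void Add(string s) { root = Add(root, s, 0); }

    private TstNode Add(TstNode node, string s, int i)
    {
        char c = s[i];
        if (node == null) node = new TstNode { Ch = c };
        if (c < node.Ch) node.Lo = Add(node.Lo, s, i);
        else if (c > node.Ch) node.Hi = Add(node.Hi, s, i);
        else if (i < s.Length - 1) node.Eq = Add(node.Eq, s, i + 1);
        else node.IsEnd = true;
        return node;
    }

    public bool Contains(string s)
    {
        TstNode node = root;
        int i = 0;
        while (node != null)
        {
            char c = s[i];
            if (c < node.Ch) node = node.Lo;
            else if (c > node.Ch) node = node.Hi;
            else if (i == s.Length - 1) return node.IsEnd;
            else { node = node.Eq; i++; }
        }
        return false;
    }
}

Splitting each address at the "@" and storing the domain part separately (as discussed in the comments below) would increase the amount of sharing further.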

Option 2: Use an in-memory hash and an on-disk file

Keep just a hash of each email in memory. If you get a hit in the hash table, go and confirm it on disk.
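
A rough sketch of how that could look, assuming the master list lives one address per line in a text file (the file name, class name and hash function below are my own illustration, not a prescribed design):

using System.Collections.Generic;
using System.IO;

class HashPlusDiskLookup
{
    // Only a 32-bit hash per address is kept in memory. For 100M entries a
    // HashSet<int> costs a couple of GB; a sorted int[] + Array.BinarySearch
    // would be ~400 MB. Either way, far less than ~17 GB of full strings.
    private readonly HashSet<int> hashes = new HashSet<int>();
    private readonly string masterFile;

    public HashPlusDiskLookup(string masterFile)
    {
        this.masterFile = masterFile;
        foreach (string line in File.ReadLines(masterFile))
            hashes.Add(GetStableHash(line));
    }

    public bool Contains(string email)
    {
        if (!hashes.Contains(GetStableHash(email)))
            return false;                      // definite miss, no disk access

        // Possible hit (or a collision): confirm against the file on disk.
        // A linear scan is used here for simplicity; a sorted file or a
        // per-hash offset index would make this step much faster.
        foreach (string line in File.ReadLines(masterFile))
            if (line == email)
                return true;
        return false;
    }

    // string.GetHashCode() is not guaranteed to be stable across runs, so use
    // a hand-rolled hash (FNV-1a style over the UTF-16 code units here).
    private static int GetStableHash(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }
}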

Option 3: Use a Bloom Filter and an on-disk file

See https://llimllib.github.io/bloomfilter-tutorial/

Ian Mercer
  • Do you mean [Ternary Tree](https://en.wikipedia.org/wiki/Ternary_tree) or a [Trie](https://en.wikipedia.org/wiki/Trie)? – Enigmativity May 25 '19 at 06:09
  • @Enigmativity I meant a ternary tree, but a trie could also be used. My assumption is that the letter distribution within email address is such that a ternary tree would be more space efficient. It might even make sense to split the email into two in order to get more commonality on the domain name. – Ian Mercer May 25 '19 at 06:14
  • Yes, I think reversing the email might actually lead to the most efficient Trie. – Enigmativity May 25 '19 at 06:22
  • @IanMercer I am not that familiar with Option 2 using an in memory hash. Does that lose its usefulness due to having 100 million emails, so many could possibly have the same hash values or sub values that it would generate so many false positives when doing a hash check? Can you point me in the right direction, even with just pseudo code, on how something like that could work? It sounds like a potential answer, if I did all hash checks against all emails, and all of them that got a false positive or accurate answer of being in the hash, I could compare just those to a file. – SED May 25 '19 at 15:48
  • @SED if you are getting that many hash collisions, pick a better hash key. But why not just try the default .NET hash code for strings and see how many collisions you actually get. A more advanced approach would be a Bloom Filter (see http://benwendt.ca/articles/a-bloom-filter-in-c/). – Ian Mercer May 25 '19 at 16:10
  • @IanMercer In a rough draft test of 10 randomly chosen emails, 6 that I know are in it and 4 that aren't, I get a 100% accuracy rate using bloom filters, so at first glance it is accurate enough and the total memory consumed is < 300 MB. I think we may have our winner with bloom filters, that is the type of answer I was looking for :) I'm going to do a little more testing first before marking this as the answer. Thanks for this suggestion! – SED May 25 '19 at 16:29
  • Do NOT use the default .NET string hash code to reference files saved on disk. If you restart your program, or start a new app domain, strings may produce different hashes. – ILMTitan May 25 '19 at 16:30
  • @IanMercer 34 false positives (which is acceptable, I can double verify those) out of 1,000,000 random emails attempted when testing against emails known to not exist. 0 false positives when testing 1,000,000 random emails known to already exist. Less than 1 second to compare against all 2 million emails. Bloom filter is awesome! – SED May 25 '19 at 16:41

You seem to be implying that you have 100 million email addresses in, say, a text file. I don't see the need to load the entire file into memory and loop through it; use a stream reader and read it line by line. For each line, check whether the email address just read is in the list of 10 you wish to import, and if it is, remove it from the import list.

At the end of the process you will have reduced your import list to just those addresses not in the huge file, and you'll never have read more than a single line at a time (well, the reader will cache a small number of kilobytes).

Adapted from Microsoft's example collection:

https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/file-system/how-to-read-a-text-file-one-line-at-a-time

// The small list of new addresses we want to import (kept in memory).
string[] emailsToImport = "a@b.com c@d.com".Split();

// Read the big file and process it line by line.
using (var file = new System.IO.StreamReader(@"c:\100million.txt"))
{
    string line;
    while ((line = file.ReadLine()) != null)
    {
        for (int i = 0; i < emailsToImport.Length; i++)
        {
            if (emailsToImport[i] == line)
                emailsToImport[i] = null;   // already in the master list, so don't import it
        }
    }
}

System.Console.WriteLine("new emails remaining to import: {0}",
    string.Join(",", System.Array.FindAll(emailsToImport, e => e != null)));

This is a quick and very dirty example that does not handle case differences; it is intended as a simple explanation of a concept, not production code.

I worked on the following assumptions:

You have a server with a low amount of RAM (e.g. 4 GB), and you have an infrequent need (e.g. once per 5 minutes) to add a small number of email addresses (e.g. 10) to a large list of 100M addresses, ensuring no duplication of the new addresses.

Read the file line by line, comparing each line to all 10 new addresses and deleting any that are already known. At the end of a single read of the file you are left with up to N of the addresses you started with, which you know do not exist in the main list.

I assert that the data behind your original statement, "I have a large amount of data I need to work with in memory", can, in this case, be worked with on disk instead.

Caius Jard
  • You're suggesting an `O(N^2)` solution. I don't think it's ideal. – Enigmativity May 25 '19 at 06:24
  • Please explain how this is N^2. Please elaborate on your definition of "ideal". Please feel free to suggest an "ideal" answer.. – Caius Jard May 25 '19 at 06:58
  • The OP's code shows that he is calling `foreach (string email in newEmails)` which then goes on to check the list of all emails (reading the file) for each email in `newEmails`. That's `O(n^2)`. Ideally it would be something that doesn't need to re-read the entire file each time. – Enigmativity May 25 '19 at 07:11
  • You should be posting your comments under the question if you want to talk to the OP – Caius Jard May 25 '19 at 07:16
  • So when you said "The OP's code" you meant I was the OP? You're quite confusing. This code reads the file once, and is looking for a small number of emails in it. It is not intending to dedupe 100M emails. It is not intending to import uniquely a list of 100M emails into an existing list of 100M emails. It's a naive solution that aims for a minimal memory footprint by leaving most of the file on disk, suited for use on a server with critically low memory. We'd all welcome you to post up some wisdom of your own rather than just telling everyone else what you think they got wrong.. – Caius Jard May 25 '19 at 07:23
  • I referenced the OP's code so that you could see that your code would repeatedly read the entire file again and again. I had hoped that was clear enough. My apologies if it were not. – Enigmativity May 25 '19 at 07:27
  • Put my code in a static void main, run it once and explain to me how my code repeatedly reads the file. It reads it once. If you swap the while and the for around then sure, it'll read it N times where N is the length of the new emails array but right now I think you've misread something – Caius Jard May 25 '19 at 07:30
  • Sorry, I see what you're doing - it's just different from how the OP shows that he wants to call the code. You may need to make it obvious as to how you would intend your code to work using the OP's code as a base. – Enigmativity May 25 '19 at 07:39
  • @CaiusJard thank you for the suggestion, I will explore this further. To help make things more clear: another goal I have is to keep this fast and efficient. If my server had enough RAM, I could load all 100 million emails into a "public static HashSet<string> emails = new HashSet<string>();" and any call to it of "emails.Contains(newEmail)" would result in a 0 millisecond response time for a true/false value. I'm concerned reading a 100-million-line file, line by line, every time I need to do a check would result in a significantly slower solution, in which case buying more RAM would be preferred. – SED May 25 '19 at 15:53
  • We can only work with the specifications given. If you have any wonder about the performance of any solution, you should test it before going off and spending a large amount of money on an assumption. This one is easy to test; take 5 minutes out to develop it (i've done most of it above - just put it in a static void main as a small test app and load 10, 50, 100 email addresses into the fixed array etc) and run it. Don't make assumptions like "line by line must be slow" - there is PLENTY of caching between your code and the spinning platter. Test it, optimize it.. You might well find that.. – Caius Jard May 25 '19 at 17:36
  • for a use case of loading 100 new email addresses every 30 mins, a "keep it on disk and do line by line" approach works just fine. If you do decide to spend money, consider carefully on what - for the price of a single server memory module you might be able to get a reasonable i5 mini desktop PC with enough resources to plop somewhere else on the network and run a redis cache to speed this process up.. – Caius Jard May 25 '19 at 17:38

To check whether an item is not in a very large list you can use a Bloom filter. It is a generalization of a hash table: for each input string it generates several hashes and, unlike a hash table, sets entries at several bucket positions. A single hash collision therefore does not produce a wrong answer by itself, because the other hash positions are checked as well.

You can still get false positives (filter.Contains("xxxx") returns true although the value was never added) but never false negatives.

By adjusting the capacity you can configure the error rate. If your application can live with a small false positive rate, then this approach should serve you well.

For an example check out this one.
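
As a rough illustration of the idea (the class name, seeds and sizing below are my own assumptions, not taken from the linked article), a Bloom filter is just a bit array plus a handful of hash functions:

using System.Collections;
using System.Collections.Generic;

class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    // Example sizing: ~960 million bits (~120 MB) and 7 hash functions give
    // roughly a 1% false positive rate for 100 million stored items.
    public BloomFilter(int bitCount, int hashCount)
    {
        bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            bits[index] = true;
    }

    // true means "probably present" (small false positive rate),
    // false means "definitely not present".
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!bits[index])
                return false;
        return true;
    }

    // Derive k bit positions from two base hashes (Kirsch-Mitzenmacher trick).
    private IEnumerable<int> Indexes(string item)
    {
        int h1 = Fnv1a(item, 2166136261);   // standard FNV offset basis
        int h2 = Fnv1a(item, 0x9E3779B9);   // second, arbitrary seed
        for (int i = 0; i < hashCount; i++)
        {
            long combined = (h1 + (long)i * h2) & 0x7FFFFFFF;  // force non-negative
            yield return (int)(combined % bits.Length);
        }
    }

    private static int Fnv1a(string s, uint seed)
    {
        unchecked
        {
            uint hash = seed;
            foreach (char c in s) { hash ^= c; hash *= 16777619; }
            return (int)hash;
        }
    }
}

Add every existing address once; for each incoming address, MightContain returning false means it can be appended without touching the disk at all, while true means you fall back to the on-disk check described below.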

I have tried a few trie implementations. Most of them use references in their node class, which is incredibly memory inefficient. One on SO seems to be pretty good: How to create a trie in c#. That one saves ca. 30% compared to a plain list.

Apart from data structures, I think you need to look at the overall goal. I guess your actual data is not email addresses, because a spam filter is a long-solved problem. But to toy with your example, there is structure in your data that you can take advantage of: each entry consists of a name and a domain. The first thing to do is to split your data into smaller files which contain only the mail addresses of one domain. Then you sort by name and create an index for each domain, where the file start position for each letter is stored in its header.

When a new mail arrives you can first do a quick pre-check with the Bloom filter. If it says the address is not in the list, you are finished in 99.5% of all cases already. Why 99.5%? You can configure your Bloom filter to have this property by calculating how much memory you are willing to spend for this accuracy.

That is great from a system perspective, because you can now make a conscious decision about how much memory, CPU and disk you want to spend on this task.

When you do get a hit, you can open the indexed file for that domain, jump directly to the sorted mail addresses and read until you either hit the bad guy or move past its position in the alphabetical sort order, at which point you can stop reading.
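
A simplified sketch of that layout (the file naming and helper methods are mine, and the per-letter header index is left out; the sorted order alone already lets the lookup stop early):

using System;
using System.IO;
using System.Linq;

static class DomainIndex
{
    // One-off preparation: split the master list into one sorted file per domain.
    // Shown in memory for brevity; for 100M addresses you would do this in
    // chunks, or on a machine with more RAM, since it only has to run once.
    public static void Build(string masterFile, string indexDir)
    {
        Directory.CreateDirectory(indexDir);
        var byDomain = File.ReadLines(masterFile)
                           .GroupBy(e => e.Substring(e.IndexOf('@') + 1));
        foreach (var group in byDomain)
        {
            // Assumes the domain is safe to use as a file name.
            string path = Path.Combine(indexDir, group.Key + ".txt");
            File.WriteAllLines(path, group.OrderBy(e => e, StringComparer.Ordinal));
        }
    }

    // Lookup: read only the file for this address's domain, in sorted order.
    public static bool Contains(string indexDir, string email)
    {
        string domain = email.Substring(email.IndexOf('@') + 1);
        string path = Path.Combine(indexDir, domain + ".txt");
        if (!File.Exists(path))
            return false;                          // unknown domain => not present

        foreach (string line in File.ReadLines(path))
        {
            int cmp = string.CompareOrdinal(line, email);
            if (cmp == 0) return true;             // found it
            if (cmp > 0) return false;             // past its sorted position, stop
        }
        return false;
    }
}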

If you have more domain knowledge you can get even smarter. Because you know that the number of valid senders for a company is pretty limited, you can create a much smaller, verified whitelist of valid senders. If the sender is on the whitelist, you can skip the other checks entirely.

Alois Kraus