I have limited RAM on my server, but my console program needs to work with a large amount of data in memory. Are there any tricks that would give me the same end result without needing so much RAM?
For this example, I have 100 million email addresses in a string list. For each new email, I need to check whether it already exists in the list. If it does not, add it. If it does, skip it. That way the list always stays unique, with no duplicates.
100 million emails in this example requires approximately 17GB of RAM.
Do you know of any tricks or tips to reduce the RAM required while still being able to answer "does it exist in the list?" Some examples that come to mind: a different type of collection, a third-party tool that compresses data in memory but still lets you sort or compare it, or a file-based database that uses far less memory for the same amount of data.
I've written code to demonstrate the normal way of doing this, which consumes 17GB of RAM.
    using System;
    using System.Collections.Generic;

    namespace NewProgram
    {
        class Program
        {
            public static List<string> emails = new List<string>();

            public static void Main(string[] args)
            {
                LoadAllEmails();
                Console.WriteLine(emails.Count + " total emails"); //100000000 total emails
                AddEmailsThatDontExistInMasterList(
                    new List<string>()
                    {
                        "something@test.com", //does not already exist, so it will be added to list
                        "testingfirst.testinglast" + (1234567).ToString() + "@testingdomain.com", //should already exist, won't be added
                        "testingfirst.testinglast" + (3333335).ToString() + "@testingdomain.com", //should already exist, won't be added
                        "something2@test.com", //does not already exist, so it will be added to list
                        "testingfirst.testinglast" + (8765432).ToString() + "@testingdomain.com", //should already exist, won't be added
                    });
                Console.WriteLine(emails.Count + " total emails after"); //100000002 total emails
                Console.ReadLine();
            }

            public static void LoadAllEmails()
            {
                for (int i = 0; i < 100000000; i++) //100,000,000 emails = approximately 17GB of memory
                {
                    emails.Add("testingfirst.testinglast" + i.ToString() + "@testingdomain.com");
                }
            }

            public static void AddEmailsThatDontExistInMasterList(List<string> newEmails)
            {
                foreach (string email in newEmails)
                {
                    if (!emails.Contains(email))
                    {
                        emails.Add(email);
                    }
                }
            }
        }
    }
After loading 100,000,000 emails into the "emails" collection, the program checks 5 more emails from a new list. 2 are added and 3 are skipped because they are duplicates already in the list. The total when completed is 100,000,002 emails in the collection. This is only meant to demonstrate my end goal: comparing against a very large existing collection to see whether a value already exists in it. The other goal is to get the total RAM consumed down from 17GB to something much smaller.
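To make the "different type of collection" idea concrete, here is a minimal sketch, not a measured solution. It assumes that only the existence check matters, so each email can be reduced to a 64-bit hash kept in a `HashSet<ulong>` instead of storing the string itself. The names `HashedEmailSet`, `Fnv1a64`, `AddIfNew` and `HashedEmailDemo` are hypothetical and are not part of the program above. The hash is a hand-written 64-bit FNV-1a. The loop is scaled down from 100,000,000 so the sketch runs quickly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers only a 64-bit hash of each email,
// so the original strings cannot be read back out of it.
public class HashedEmailSet
{
    private readonly HashSet<ulong> hashes = new HashSet<ulong>();

    public int Count { get { return hashes.Count; } }

    // 64-bit FNV-1a over the two bytes of every UTF-16 code unit.
    public static ulong Fnv1a64(string s)
    {
        unchecked // the multiplication is meant to wrap around
        {
            ulong hash = 14695981039346656037UL; // FNV offset basis
            foreach (char c in s)
            {
                hash ^= (byte)(c & 0xFF);        // low byte
                hash *= 1099511628211UL;         // FNV prime
                hash ^= (byte)(c >> 8);          // high byte
                hash *= 1099511628211UL;
            }
            return hash;
        }
    }

    // Returns true if the email was new and has now been recorded.
    // Assumption: a rare hash collision, which would make a new email
    // look like a duplicate, is acceptable.
    public bool AddIfNew(string email)
    {
        return hashes.Add(Fnv1a64(email));
    }
}

public static class HashedEmailDemo
{
    public static void Main()
    {
        var set = new HashedEmailSet();

        // Scaled down from 100,000,000 for a quick run.
        for (int i = 0; i < 100000; i++)
        {
            set.AddIfNew("testingfirst.testinglast" + i + "@testingdomain.com");
        }

        string[] newEmails =
        {
            "something@test.com",                              // new
            "testingfirst.testinglast12345@testingdomain.com", // already loaded
            "something2@test.com",                             // new
        };

        int added = 0;
        foreach (string email in newEmails)
        {
            if (set.AddIfNew(email)) added++;
        }
        Console.WriteLine(added + " added, " + set.Count + " total");
    }
}
```

The trade-off in this sketch is that the collection can only answer "have I seen this before?". It cannot list the emails, and two different emails that happen to share a hash would be treated as one. Each entry holds an 8-byte hash plus the set's own bookkeeping instead of a full string object, so the footprint should be much smaller than the string list, though the actual saving is unmeasured here.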