
We are currently developing a fairly large application that will have to work with huge numbers of records.

The idea is that e-mails will be stored (with attachments) and that users will be able to search their stored e-mails via a web API. Users should be able to search (within their own messages, i.e. those they have exported into the database/storage) on at least the following items:

  • from
  • to
  • subject
  • date (range)
  • attachments (names & types only)
  • message contents
  • (optional) mailbox / folder structure

The application should be able to work with large numbers of users and extreme numbers of e-mails (easily growing from millions to billions). Users should be able to download the whole original message (with attachments) so they can import it into their e-mail client.

I was thinking about indexing the e-mails in a database and storing the full e-mail (with attachments) as a package under a unique key in separate storage. This way I should keep the database load as low as possible, and therefore keep searches as quick as possible.
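The split described above (index the searchable fields in a database, store the raw message under a unique key elsewhere) can be sketched as follows. This is a minimal illustration using Python's standard `email` module; `blob_store` and `index_records` are hypothetical in-memory stand-ins for the object storage and the database:

```python
import email
import uuid
from email import policy

# Hypothetical stand-ins: the real system would use a database for
# index_records and an object store (e.g. S3) for blob_store.
blob_store = {}
index_records = []

def ingest(raw_message: bytes) -> str:
    """Store the raw message under a unique key and index its metadata."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    key = uuid.uuid4().hex
    blob_store[key] = raw_message  # full original, attachments included
    index_records.append({
        "blob_key": key,                     # lookup key for the stored original
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "attachments": [                     # names & types only, per the requirements
            {"name": part.get_filename(), "type": part.get_content_type()}
            for part in msg.iter_attachments()
        ],
    })
    return key
```

Searches then run only against the small `index_records` entries, and the returned `blob_key` is used to fetch the full original for download.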

I have found several database schemas for handling e-mail like this, but I couldn't find any database that is proven able to handle hundreds of millions, maybe even billions, of records (e-mails).

Is this the best way to keep it simple, efficient and fast, or am I forgetting anything?

Edit: The idea is to run this on the Amazon cloud (any suggestions related to that are welcome).

Floris

2 Answers


You can use a MongoDB database for this amount of data. Details are at http://www.mongodb.org/

In MongoDB, what MySQL calls a table is a collection, and a row is a document.

MongoDB stores data in a JSON-based document format (BSON).

One possible way to model the schema here:

from : string
to : string
subject: string
date (range): datetime
attachments (names & types only) : Object Array
message contents : string
(optional) mailbox / folder structure: string

For example:

{
  "from": "from@gmail.com",
  "to": "to@gmail.com",
  "subject": "test subject",
  "date": "current date",
  "attachments": [
    { "names": "attachments1", "types": "text" },
    { "names": "attachments2", "types": "pdf" }
  ]
}
Jaykesh Patel
  • Are there any available schemas for mongo db which I can use as a reference to provide such amounts of data? – Floris Feb 01 '13 at 09:46
  • [http://docs.mongodb.org/manual/core/data-modeling/](http://docs.mongodb.org/manual/core/data-modeling/) This contains details on how to design MongoDB schemas. – Jaykesh Patel Feb 01 '13 at 10:22
  • So if I'm getting this right, MongoDB will have no problem holding over a billion records in a single DB and processing millions of inserts and search queries daily, provided the schema is OK. As said, the idea is to run this in the Amazon cloud. As most of the size of the e-mails themselves will be attachments, my idea is to run a DB on Amazon and store the attachments in S3 storage. – Floris Feb 01 '13 at 10:29
  • It depends on how you design the database schema. Please check out http://stackoverflow.com/questions/9702643/mysql-vs-mongodb-1000-reads; this link might help you. – Jaykesh Patel Feb 01 '13 at 10:41
  • Currently I'm thinking about a simple but efficient schema like the one you provided above. E-mails are exported from the mail client to the application; the application will extract the indexable fields (from, to, subject, etc.) from each e-mail. Those fields will be inserted into the DB, and the e-mail itself (with attachments) will be stored in S3. The e-mail client will be able to query the application for search results based on the fields in the DB. Is this the way to go? – Floris Feb 01 '13 at 10:51

You don't want to store this sort of information in an RDBMS. Rather, you want to build on something like Lucene. For email, Solr has an email indexer. Hope that helps...
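If the mail fields were indexed into Solr, searching would come down to issuing queries against its select handler. A minimal sketch of building such a request URL (the host, the core name `mail`, and the field names `from`/`contents` are assumptions, not given in the answer):

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; core name "mail" is an assumption.
SOLR_SELECT = "http://localhost:8983/solr/mail/select"

def solr_search_url(sender=None, text=None, rows=10):
    """Build a Solr select URL searching the indexed mail fields."""
    clauses = []
    if sender:
        clauses.append(f'from:"{sender}"')      # assumed indexed field
    if text:
        clauses.append(f'contents:"{text}"')    # assumed indexed field
    params = {
        "q": " AND ".join(clauses) or "*:*",    # match all if no criteria
        "rows": rows,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)
```

The stored original message would still live outside the index (e.g. in S3, as discussed above), with Solr returning only the matching document keys.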

hd1
  • The idea is to let the e-mail client (ms outlook / apple mail) send the e-mail to the storage (application). – Floris Feb 01 '13 at 09:45