1

We need to keep some in-memory data structure to keep english word dictionary in memory. When the computer/wordweb starts,we need to read dictionary from disk into an in-memory data structure.

This question asks how do we populate from disk to in-memory data structure in typical real world dictionaries say wordweb?

Ideally we would like to keep dictionary in disk in the way, we require it in in-memory, so that we don't have to spend time building in-memory data structure, we just read it off the disk. But for linked lists, pointers etc, how do we store the same image in disk. Some relative addresses etc would help here?

Typically, is the entire dictionary read and stored in memory. or only part/handlers and leaf page IOs are done, when searching for a specific word.

If somebody wants to help with what that in-memory data structure is typically, please go ahead.

Thanks,

xyz
  • 8,607
  • 16
  • 66
  • 90
  • what language? You mentioned pointers... does that mean you're using C++? – Kiril Jun 08 '11 at 12:50
  • whatever..or some equivalent to C++..if a tree branches out, we have to get handler to children nodes. I am calling those handlers as pointers. – xyz Jun 08 '11 at 12:56
  • well, pointers in C++ are much harder to serialize. C# has really good XML serialization and it works with "pointers" (i.e. nodes in a tree), Java has good serialization too. So knowing the language will make it easier to answer. – Kiril Jun 08 '11 at 13:07
  • 1
    I don't know what your general requirements are and how you plan on using this, but you may want to have a look at a data structure called DAWG (Directed Acyclic Word Graph) or MA-FSA (Minimal Acyclic Finite State Automaton). Take a look at these: http://stevehanov.ca/blog/index.php?id=115 and http://en.wikipedia.org/wiki/Directed_acyclic_word_graph – eSniff Jun 08 '11 at 18:39
  • Also if you just need a quick way to find if a word is valid, try looking into bloom filters. For Example: If I was designing a spell checker (which I've never done) I may try using a bloomfilter to flag the miss spelled words and then use a BK-Tree to do a fuzzy search to find the correct spelling for each flagged word. Anyway Bloom filters can be stored on disk as a byte array and are very small. – eSniff Jun 08 '11 at 18:53

3 Answers3

2

You mentioned pointers, so I'm assuming you're using C++; if that's the case and you want to read directly from disk into memory without having to "rebuild" your data structure, then you might want to look into serialization: How do you serialize an object in C++?

However, you generally don't want to load the entire dictionary anyway, especially if it's a user application. If the user is looking up dictionary words, then reading from disk happens so fast that the user will never notice the "delay." If you're servicing hundreds or thousands of requests, then it might make sense to cache the dictionary into memory.

So how many users do you have?
What kind of load are you expecting to have on the application?

Community
  • 1
  • 1
Kiril
  • 39,672
  • 31
  • 167
  • 226
  • It was a very elementary question from my side. I just began to wonder how wordweb on my windows xp is implemented. so it has only 1 user - that's me. So here it looks like reading from disk directly would suffice when a word is searched and user will never notice delay. next level of understanding would be how it happens for say an online site offering dictionary where there will be many requests per second.. – xyz Jun 08 '11 at 13:16
  • @p2pnode, if it's an online service, then they would cache the dictionary... probably even loading the entire thing in memory. Good servers have tons of RAM (16+ GB) so I'm pretty sure wordweb can fit into memory (although I'm not sure exactly how big is it). Your other question regarding the trie is probably on the right track for representing the dictionary in memory. – Kiril Jun 08 '11 at 13:29
0

Else why don't you create a JSON format File containing all the meaning creating a short form of Dictionary ?

Krishna
  • 673
  • 3
  • 6
  • 21
0

Wordweb is using Sqlite Database at backend. It makes sense to me to use a Database system to store the content so its easier to GET the content which the user is looking for quickly.

Wordweb has Word prediction as well... so it will be a query to database like

select word from table where word='ab%';

on the other hand, when the user presses enter for the word

select meaning from table where word='abandon'

You do not want to be Serializing the content from disk to memory while the user is typing or after he has pressed Enter to search. Since the data will be large (Dictionary), Serialization will probably take time more then the user will tolerate for every word search.

Pheonix
  • 6,049
  • 6
  • 30
  • 48