4

I posted a similar question earlier but don't think I explained my requirements very clearly. Basically, I have a .NET application that writes out a bunch of HTML files ... I additionally want this application to index these HTML files for full-text searching, in a way that JavaScript code in the HTML files can query the index (based on search terms entered by a user viewing the files offline in a web browser).

The idea is to create all this and then copy it to something like a thumb drive or CD-ROM to distribute for viewing on a device that has a web browser but not necessarily internet access.

I used Apache Solr for a proof of concept, but that requires running a web server.

The closest I've gotten to a viable solution is JSSindex (jssindex.sourceforge.net), which uses Lush, but our users' environment is Windows and we don't want to require them to install Cygwin.

user1263226
  • 250
  • 3
  • 12
  • possible duplicate of [Full-text search for local/offline web "site"](http://stackoverflow.com/questions/10356532/full-text-search-for-local-offline-web-site) – epascarello May 11 '12 at 23:00
  • Yes that's my original question that I made reference to in this question ... I'm pretty new to SO, guess I could've just completely re-worded the older one instead of posting this new one? – user1263226 May 11 '12 at 23:40
  • BTW JSSindex looks like exactly what you want - "Lush.. not required by end-users to perform search queries". – Alexei Levenkov May 12 '12 at 00:50
  • Let me explain ... I have sort of two "tiers" of end users ... Tier 1 is people who use the application to create the HTML files (and hopefully at some point, the search index) and Tier 2 is the folks who just browse and interact with the output in their browsers. So basically, my team writes the .NET application code as well as the HTML templates with CSS, JS, etc. for the output that will be browsed. There is some overlap between the two tiers of end users, but I don't want either of them to have to install Cygwin (for Lush). – user1263226 May 12 '12 at 01:40

3 Answers

2

It looks like your main problem is making the index accessible to local HTML. A quick and dirty way to do it: put the index in a JS file and reference it from the HTML pages.

var index = [{ word: "home", files: ["f.html", "bb.html"] }, /* ... */];
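
A minimal sketch of how the page's JavaScript might query an index like that (the multi-word handling and the `search` function name are illustrative assumptions, not part of the answer above):

    // Assumes the file containing the var above is loaded via a <script> tag.
    // Naive lookup: exact word match, intersecting file lists across terms.
    function search(query) {
      var terms = query.toLowerCase().split(/\s+/).filter(Boolean);
      var results = null;
      terms.forEach(function (term) {
        var entry = index.filter(function (e) { return e.word === term; })[0];
        var files = entry ? entry.files : [];
        results = results === null
          ? files.slice()
          : results.filter(function (f) { return files.indexOf(f) !== -1; });
      });
      return results || [];
    }

    // search("home") would return ["f.html", "bb.html"] for the sample entry above.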
Alexei Levenkov
  • 98,904
  • 14
  • 127
  • 179
  • I don't think that is scalable for my needs, each dataset will have around 1,000 HTML files with 50MB total of text content ... but I don't think that's the main problem anyway, I can just put the info in your answer in an XML file in a known location right? And use XSL to "query" it and present the results? Or am I missing your whole point? – user1263226 May 11 '12 at 23:34
  • XML would work too... JavaScript may be friendlier for offline HTML. You can also split the index across multiple files... All assuming you really want to build it yourself. – Alexei Levenkov May 12 '12 at 00:48
  • Yeah, I was kinda hoping to avoid building the indexing routine myself ... but it seems I might not have a choice – user1263226 May 12 '12 at 01:34
1

Ladders could be a solution, as it provides on-the-spot indexing. But with 1,000 files or more, I don't know how well it would scale... Sadly, I am not sure JS is the answer here. I'd go for a custom (compiled) app that serves as both front-end (HTML display) and back-end (text search and indexing).

dda
  • 6,030
  • 2
  • 25
  • 34
  • Yes, I already looked at that but doubt it will scale for my needs ... I have the flexibility to do the indexing in a custom (compiled) app and want to take advantage of the efficiency of doing so ... but the search itself (i.e. querying the index) needs to be done in the browser. There's no technical reason why I shouldn't be able to do so (I just feel like I'd be reinventing the wheel if I coded it all from scratch myself); after all, I am able to query a Solr index with JavaScript REST requests to the server for my demo/proof of concept ... – user1263226 May 21 '12 at 20:22
0

Use a trie: tries are ridiculously compact, very scalable, and dead handy for text matching.

There is a great article covering performance and design strategies. They're slower to boot up than a dictionary, but take up a lot less room, particularly when you're working with larger datasets.

I'd tackle it as follows:

  1. In your .NET code, index all the keywords that are important to you (track their document and offset).
  2. Generate your trie structure from an alphabetically sorted list of keywords.
  3. Decorate the terminal nodes with information about the documents in which the words they represent can be found.

      C
     A
    R  T [{docid,[hit offsets]},...]
    

You don't have to store the offsets, but keeping them would allow you to search for words by proximity or order.

Your .NET guys could build the trie.

It will take a while to generate the map, but once it's done and you've serialised it to JSON, your JavaScript application will race through it.
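
A rough sketch of what the client-side lookup might look like once the trie is in JSON; the node shape (a `children` map plus `hits` on terminal nodes) is an assumption for illustration, not the answerer's actual format:

    // Hypothetical node shape:
    //   { children: { "c": {...}, ... }, hits: [{ docid: "f.html", offsets: [3, 17] }] }
    // The trie itself would be loaded from a JSON file the .NET indexer wrote out.
    function lookup(trie, word) {
      var node = trie;
      for (var i = 0; i < word.length; i++) {
        node = node.children && node.children[word.charAt(i)];
        if (!node) return [];            // word not in the index
      }
      return node.hits || [];            // documents (and offsets) for this word
    }

    // lookup(trie, "cat") -> e.g. [{ docid: "bb.html", offsets: [42] }]

Prefix search falls out of the same walk: stop at the node for the typed prefix and collect the hits of every terminal node beneath it.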

web_bod
  • 5,728
  • 1
  • 17
  • 25
  • I can sling some code together if it would help, but I reckon your guys can handle it, just gimme a shout. – web_bod May 21 '12 at 23:51
  • Hmmm ... you mention serializing to JSON, which I'm new to ... you mean serializing the index (represented by a trie map) to JSON? Is there then not a way to serialize something like a Lucene/Lucene.NET index to JSON too? If so, why would it be better to do it as a trie instead, just a more efficient data structure? – user1263226 May 23 '12 at 17:15