0

I want to set up a search in Lucene (actually Lucene.NET, but I can convert from Java as necessary) using the following logic:

  1. Search string is: A B C
  2. Search one field in the index for anything that matches A, B, or C. (Query: (field1:A field1:B field1:C))
  3. For each term that didn't match in step 2, search a second field for it while keeping the results from the first search (Query: (+(field1:A) +(field2:B field2:C)))
  4. For each term that didn't match in step 3, search a third field...
  5. Continue until running out of fields, or there's a search which has used every term.

Currently, my code can test whether a given search produces NO results, and ANDs together all the ones that do produce results. But I have no way to stop it before it tests against every field (which unnecessarily limits the results) - it's currently ending up with a query like: (+(field1:A field1:B field1:C) +(field3:A field3:B field3:C)) when I want it to be (+(field1:A field1:C) +(field3:B)). I can't just look at the results from the first search and remove words from the search string because the Analyzer mangles the words when it parses it for search, and I have no way to un-mangle them to figure out which of the original search terms it corresponds to.

Any suggestions?


Edit: Ok, generally I prefer describing my problems in the abstract, but I think some part of it is getting lost in the process, so I'll be more concrete.

I'm building a search engine for an site which needs to have several layers of search logic. A few example searches which I'll trace out are:

  1. Headphones
  2. Monster Headphones
  3. White Monster Headphones
  4. White Foobar Headphones

The index contains documents with seven fields - the relevant ones to this example are:

  • "datattype": A string representing what type of item this document represents (product, category, brand), so we know how to display it
  • "brand": The brand(s) that are relevant (categories have multiple brands, products and brands have one each)
  • "path": The path to a given category (i.e. "Audio Headphones In-Ear" for "Audio > Headphones > In-Ear")
  • "keywords": Various things that describe the product that don't go anywhere else.

In general, the logic for each step of the search is as follows:

  1. Check to see if we have a match.
  2. If so, filter the results based on that match, and continue parsing the rest of the search terms in the next step.
  3. If not, parse the search terms in the next step.

Each step is something like:

  • Search for a category
  • Search for a brand
  • Search for keywords

So here's how those three example searches should play out:

  1. Headphones
    • Search for a category: +path:headphones +datatype:Category
    • There are matches (the Headphone category), and no words from the original query are left, so we return it.
  2. Monster Headphones
    • Search for a category: `+(path:monster path:headphones) +datatype:Category
    • Matches were found for path:headphones and datatype:Category, leaving "Monster" unmatched
    • Search for a brand: +path:headphones +brand:monster
    • Matches were found for path:headphones and brand:monster, and no words from the original query are left, so we return all the headphones by Monster.
  3. White Monster Headphones
    • Search for a category: +(path:monster path:headphones path:white) +datatype:Category
    • Matches were found for path:headphones, and datatype:Category, leaving "White" and "Monster" unmatched
    • Search for a brand: +path:headphones +(brand:monster +brand:white)
    • Matches were found for path:headphones and brand:monster, leaving "White" unmatched
    • Search keywords: +path:headphones +brand:monster +keywords:white
    • There are matches, and no words from the original query are left, so we return them.
  4. White Foobar Headphones
    • Search for a category: +(path:foobar path:headphones path:white) +datatype:Category
    • Matches were found for path:headphones, and datatype:Category, leaving "White" and "Foobar" unmatched
    • Search for a brand: +path:headphones +(brand:foobar +brand:white)
    • Nothing was found, so we continue.
    • Search keywords: +path:headphones +(keywords:white keywords:foobar)
    • Matches were found for path:headphones and keywords:white, leaving "Foobar" unmatched
    • ... (continue searching other fields, including product description) ...
    • There are search terms still unmatched ("Foobar"), return "No results found"

The problem I have is twofold:

  1. I don't want the matches to continue once everything's matched (only products have descriptions, so once it reaches that step we'll never return something that's not a product). I could manage this by using denis's GetHitTerms from here, except that I then end up searching for the first matched term in all subsequent fields until everything matches (i.e. in example #2, I'd have +path:headphones +(brand:headphones brand:monster)).
  2. Despite my example above, my actual search query on the path field looks like +path:headphon +datatype:Taxonomy because I'm mangling it for searching. So I can't take the matched term and just remove that from the original query (because "headphon" != "headphones").

Hopefully that makes it clearer what I'm looking for.

Community
  • 1
  • 1
Bobson
  • 13,498
  • 5
  • 55
  • 80
  • 1
    Have you looked at this? https://cwiki.apache.org/LUCENENET/simple-faceted-search.html There may be ideas you can use. – L.B Nov 07 '11 at 21:57
  • @denis: At first glance, that does look like what I need. I'll need to delve into it more to be sure, but it does look promising. – Bobson Nov 07 '11 at 22:31
  • @denis: Good news and bad news. It does seem to be what I need, but my data set is far, far too large for it to handle. Even if I bump up the MAX_FACETS const. I have 455 brand terms, 251 path terms, and 10094 keyword terms, just from the fields above. If it faceted everything except descriptions, it weighs in at almost 4 *trillion* facets. – Bobson Nov 08 '11 at 20:12
  • I bet you haven't read `References to adding faceted to Lucene.Net section` in that page. There are other tricks utilizing Collector class to make a faceted search. See http://mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/201106.mbox/%3C003b01cc27a9$1218e740$364ab5c0$@com%3E – L.B Nov 08 '11 at 20:50
  • @denis: I've read it now, but I don't understand what to do with the result, or how to tell it I want multiple fields to be part of my search. – Bobson Nov 08 '11 at 23:45

2 Answers2

1

I don't understand your use case, but you sound like you're asking about the BooleanQuery API. You can get the clauses of your query by calling getClauses.

A simple example:

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("field1","a")), BooleanClause.Occur.SHOULD)
bq.add(new TermQuery(new Term("field1","b")), BooleanClause.Occur.SHOULD)

BooleanClause[] clauses = bq.getClauses();

EDIT: maybe you're just asking for a search algorithm. In pseudocode:

generate_query (qs_that_matched, qs_that_didnt_match, level):
   new_query = qs_that_matched AND level:qs_that_didnt_match
   qs_still_unmatched = ...
   qs_which_just_matched = ...
   if qs_still_unmatched != null:
      return generate_query(qs_that_matched AND qs_which_just_matched, qs_still_unmatched, level+1)
   else:
      return qs_that_matched AND qs_which_just_matched
Xodarap
  • 11,581
  • 11
  • 56
  • 94
  • That's just the first step. I want to run `bq`, see that only field1:a matched, and then _do something else_ with b. Specifically, I want to try (+field1:a +field2:b) and see whether that matches both a and b. Does that clarify? – Bobson Nov 04 '11 at 17:24
  • @Bobson: I've edited my answer to give a search algorithm since maybe that's what you're asking for? – Xodarap Nov 04 '11 at 19:58
  • That's closer to what I need. I had been thinking something along those lines, but I couldn't figure out how to do it as elegantly as this (I was thinking iteration instead of recursion). However, the **...** sections are stumping me. Based on my previous question (http://stackoverflow.com/questions/7896183/get-matched-terms-from-lucene-query) I know how to **get** the matched terms, but not how to get the part of the query string which generated the terms that I just matched (or failed to match). It's not just string-matching, because they get mangled by the Analyzer. – Bobson Nov 04 '11 at 20:06
  • @Bobson: If my pseudocode is right, you will never have nested queries. So the ellipses can be filled in as the conjunction/union of terms which match/don't match. – Xodarap Nov 04 '11 at 21:11
  • I edited the question to provide a more concrete example. Does that help clarify what I'm looking for? – Bobson Nov 08 '11 at 20:13
  • @Bobson: if you want to scale facets, use [bobo](http://linkedin.jira.com/wiki/display/BOBO/Home). It's what powers LinkedIn's facets, so I assume it will scale to what you need. – Xodarap Nov 10 '11 at 21:46
0

In the end, I built a QueryTree class and stored the queries in a tree structure. It stores a reference to a function that takes a query, a list of terms to pump into that query, whether it should AND or OR those terms, and a list of children (which represent unique combinations of matching terms).

To perform the next level of searching, I just call Evaluate(Func<string, QueryParser.Operator, Query> newQuery) on the deepest nodes in my tree, with a reference to a function which takes terms and an operator and returns the correct Query for that set of logic. The Evaluate function then tests that new query against the list of unmatched terms that have been passed down to it and the result sets of all ancestral Querys (by ANDing with the parent, which ANDs with it's parent and so on). It then creates children for each set of matching terms, using GetHitTerms, and gives the unmatched terms to the child. Repeat for each level of search.


I suspect that there's a better way to do this - I didn't even look into Bobo that Xodarap mentioned, and I never really got faceted searching (as per denis) working. However, it's working, which means it's time to move on to other aspects of the site.

Community
  • 1
  • 1
Bobson
  • 13,498
  • 5
  • 55
  • 80
  • If anyone wants it, I can post the code in question, but I'm not sure whether it would add anything, and it'd make the answer significantly longer. – Bobson Nov 14 '11 at 15:14