2

Given a Lucene search query like: +(letter:A letter:B letter:C) +(style:Capital), how can I tell which of the three letters actually matched any given document? I don't care where they match, or how many times they match, I just need to know whether they matched.

The intent is to take the initial query ("A B C"), remove the terms which successfully matched (A and B), and then do further processing on the remainder (C).

Bobson

5 Answers

10

Although the sample is in C#, the Lucene APIs are very similar (apart from some upper/lower case differences), so I don't think it would be hard to translate to Java.

This is the usage

List<Term> terms = new List<Term>();    // will be filled with non-matched terms
List<Term> hitTerms = new List<Term>(); // will be filled with matched terms
GetHitTerms(query, searcher, docId, hitTerms, terms);

And here is the method

// Recursively walks the query tree and sorts each leaf term into hitTerms or rest.
void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest)
{
    if (query is TermQuery)
    {
        // Explain() reports whether this single term matched the document.
        if (searcher.Explain(query, docId).IsMatch())
            hitTerms.Add((query as TermQuery).GetTerm());
        else
            rest.Add((query as TermQuery).GetTerm());
        return;
    }

    if (query is BooleanQuery)
    {
        BooleanClause[] clauses = (query as BooleanQuery).GetClauses();
        if (clauses == null) return;

        foreach (BooleanClause bc in clauses)
        {
            GetHitTerms(bc.GetQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    if (query is MultiTermQuery)
    {
        // Rewrite multi-term queries (e.g. wildcard or prefix) into a BooleanQuery of TermQuery clauses.
        if (!(query is FuzzyQuery)) // FuzzyQuery doesn't support SetRewriteMethod
            (query as MultiTermQuery).SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        GetHitTerms(query.Rewrite(searcher.GetIndexReader()), searcher, docId, hitTerms, rest);
    }
}
L.B
  • This is exactly what I needed, and since I'm already working with Lucene.net, I don't even need to translate it. Thanks! (I realize I should probably have used the lucene.net tag, but I figured it was a general question about the system, rather than an issue with the .net flavor, since they work the same). – Bobson Oct 26 '11 at 15:57
1

Based on the answer given by @L.B, here is the code converted to Java, which works for me:

void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest) throws IOException
{
    if (query instanceof TermQuery)
    {
        if (searcher.explain(query, docId).isMatch())
            hitTerms.add(((TermQuery) query).getTerm());
        else
            rest.add(((TermQuery) query).getTerm());
        return;
    }

    if (query instanceof BooleanQuery)
    {
        for (BooleanClause clause : (BooleanQuery) query) {
            GetHitTerms(clause.getQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    if (query instanceof MultiTermQuery)
    {
        if (!(query instanceof FuzzyQuery)) // FuzzyQuery doesn't support setRewriteMethod
            ((MultiTermQuery) query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        GetHitTerms(query.rewrite(searcher.getIndexReader()), searcher, docId, hitTerms, rest);
    }
}
Christoph H.
PVR
1

I basically used the same approach as @L.B, but updated it for Lucene 7.4.0. Note: FuzzyQuery now supports .setRewriteMethod, which is why I removed the if.

I also added handling for BoostQuery, and I store the matched words as strings in a HashSet, rather than storing the Terms, to avoid duplicates.

private void saveHitWordInList(Query query, IndexSearcher indexSearcher,
    int docId, HashSet<String> hitWords) throws IOException {
  if (query instanceof TermQuery)
    if (indexSearcher.explain(query, docId).isMatch())
      // text() returns the term text without the field name, even if the text contains ':'
      hitWords.add(((TermQuery) query).getTerm().text());
  if (query instanceof BooleanQuery) {
    for (BooleanClause clause : (BooleanQuery) query) {
      saveHitWordInList(clause.getQuery(), indexSearcher, docId, hitWords);
    }
  }

  if (query instanceof MultiTermQuery) {
    ((MultiTermQuery) query)
        .setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
    saveHitWordInList(query.rewrite(indexSearcher.getIndexReader()),
        indexSearcher, docId, hitWords);
  }

  if (query instanceof BoostQuery)
    saveHitWordInList(((BoostQuery) query).getQuery(), indexSearcher, docId,
        hitWords);
}
Christoph H.
1

Here is a simplified and non-recursive version with Lucene.NET 4.8.
Unverified, but this should also work on Lucene.NET 3.x

IEnumerable<Term> GetHitTermsForDoc(Query query, IndexSearcher searcher, int docId)
{
    //Rewrite query into simpler internal form, required for ExtractTerms
    var simplifiedQuery = query.Rewrite(searcher.IndexReader);
    HashSet<Term> queryTerms = new HashSet<Term>();
    simplifiedQuery.ExtractTerms(queryTerms);

    List<Term> hitTerms = new List<Term>();
    foreach (var term in queryTerms)
    {
        var termQuery = new TermQuery(term);

        var explanation = searcher.Explain(termQuery, docId);
        if (explanation.IsMatch)
        {
            hitTerms.Add(term);
        }
    }
    return hitTerms;
}
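
The method above returns only the matched terms, while the question asks for the unmatched remainder. A minimal Java sketch of that last step, assuming the terms have already been reduced to plain strings (the `RemainingTerms` class and its `remaining` method are hypothetical names, not part of Lucene or Lucene.NET):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Lucene API: plain strings stand in for terms.
class RemainingTerms {
    // Returns the query terms that are absent from hitTerms, in their original order.
    static List<String> remaining(List<String> queryTerms, Set<String> hitTerms) {
        Set<String> rest = new LinkedHashSet<>(queryTerms); // keeps order, drops duplicates
        rest.removeAll(hitTerms);
        return new ArrayList<>(rest);
    }

    public static void main(String[] args) {
        // "A B C" was searched and A and B matched the document.
        List<String> rest = remaining(List.of("A", "B", "C"), Set.of("A", "B"));
        System.out.println(rest); // the unmatched terms, left for further processing
    }
}
```

The same set difference could be taken over Term objects directly, provided the term type implements equality the way the collection expects.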
farlee2121
  • I'm no longer at that job, so I can't test it, but this definitely does look cleaner. Has something been added since 2011 that specifically enables this? – Bobson Jan 13 '20 at 15:17
  • I haven't tried running it on lucene.net 3.0.3, but I think it should be possible. The two key methods here are `query.Rewrite` and `query.ExtractTerms`. Both of these methods exist in Lucene.net 3.0.3 – farlee2121 Jan 13 '20 at 15:40
-1

You could use a cached filter for each of the individual terms, and quickly check each doc id against their BitSets.
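
As a rough illustration of the idea only, here is a Java sketch in which `java.util.BitSet` stands in for the doc-id set a cached filter would produce. The `TermBitSetCache` class and its methods are hypothetical, and how the bit sets are actually obtained from Lucene depends on the version and is not shown:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: java.util.BitSet stands in for a cached filter's doc-id set.
class TermBitSetCache {
    private final Map<String, BitSet> docsByTerm = new LinkedHashMap<>();

    // Records that the given term matches the given doc ids (hypothetical population step).
    void put(String term, int... docIds) {
        BitSet bits = docsByTerm.computeIfAbsent(term, t -> new BitSet());
        for (int id : docIds) {
            bits.set(id);
        }
    }

    // Returns the cached terms whose bit for docId is set, in insertion order.
    List<String> termsMatching(int docId) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, BitSet> entry : docsByTerm.entrySet()) {
            if (entry.getValue().get(docId)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Each lookup is a single bit test per term, so checking many documents against the same small set of terms stays cheap once the bit sets are built.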

A. Coady
  • It might be because I'm new to Lucene, but I don't see how I'd use that. I don't have a filter right now - just a Query, and the query terms change each time - it could be "A B" one time, and "A F B Q" another. I can create a QueryFilter from the Query, but that seems redundant (although possibly necessary). I also don't have an indexReader for the filter's GetDocIdSet() function, just an IndexSearcher(). Can you provide an example of usage? – Bobson Oct 26 '11 at 13:28
  • Actually, I just realized how I could get the IndexReader - I'm already doing it elsewhere in my code. The rest of the comment still applies, though. – Bobson Oct 26 '11 at 13:50