
I have developed a search engine that lets the user search for a word across hundreds of thousands of documents.

All documents are split into words. Each distinct word is stored in the table "word", and one row is added to the table "urlword" for every (document, word) pair, with the number of occurrences in the "count" column. The "score" column is calculated with the tf-idf algorithm.
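
For clarity, here is a minimal sketch of how the score could be derived from the tables below; the exact weighting shown is an approximation of what I do, and in reality the scores are filled in at indexing time:

-- Sketch of the tf-idf weighting behind "score" (assumed formula, not the actual indexing code):
--   tf  = `count` / total word occurrences in the document
--   idf = LN(total documents / documents containing the word)
UPDATE urlword uw
JOIN (SELECT url_id, SUM(`count`) AS doc_len
      FROM urlword GROUP BY url_id) d ON d.url_id = uw.url_id
JOIN (SELECT word_id, COUNT(*) AS doc_freq
      FROM urlword GROUP BY word_id) w ON w.word_id = uw.word_id
CROSS JOIN (SELECT COUNT(*) AS total_docs FROM url) t
SET uw.score = (uw.`count` / d.doc_len) * LN(t.total_docs / w.doc_freq);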

To search for one word, the query is simple and fast:

SELECT url_id FROM urlword WHERE word_id = [word id] ORDER BY score DESC LIMIT 10

However, the queries become much slower when we want to find documents that contain multiple words. For example:

SELECT url_id FROM urlword
WHERE word_id IN (21, 28)
GROUP BY url_id
HAVING COUNT(*) = 2
ORDER BY SUM(score) DESC
LIMIT 10

or

SELECT uw1.url_id FROM urlword uw1
INNER JOIN urlword uw2 ON uw2.word_id = 28 AND uw1.url_id = uw2.url_id
WHERE uw1.word_id = 21
ORDER BY uw1.score + uw2.score DESC
LIMIT 10

Both take dozens of seconds when the words have many occurrences.
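
For reference, the execution plans can be inspected with a plain EXPLAIN on the queries above, e.g.:

EXPLAIN
SELECT url_id FROM urlword
WHERE word_id IN (21, 28)
GROUP BY url_id
HAVING COUNT(*) = 2
ORDER BY SUM(score) DESC
LIMIT 10;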

So, is there a more efficient way to do this with MySQL?

Is Elasticsearch the only way to go? Would Elasticsearch even perform well here? Or is there another tool better suited to this?


The MySQL schema:

CREATE TABLE `url` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` text COLLATE utf8_unicode_ci,
  `urlhash` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `title` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  UNIQUE KEY `urlhash` (`urlhash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE `urlword` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `count` int(11) DEFAULT NULL,
  `score` decimal(20,6) DEFAULT NULL,
  `url_id` int(11) DEFAULT NULL,
  `word_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UNIQ_DBA1BC7981CFDAE7E357438D` (`url_id`,`word_id`),
  KEY `IDX_DBA1BC7932993751` (`score`),
  KEY `word_id_url_id_score` (`word_id`,`url_id`,`score`),
  KEY `word_id_score` (`word_id`,`score`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE `word` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UNIQ_C3F17511C3F17511` (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
  • You should be using MySQL's full text indexes: http://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html. However, I'm surprised that the `group by` version gets much slower as the number of rows increases. – Gordon Linoff Jul 14 '16 at 12:27
  • I have a FTS example [here](http://stackoverflow.com/a/30677347) with MyISAM. FT Indexes were added to INNODB in version [5.6](https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html). It needs volume, training, and a focus on [Stop Words](http://dev.mysql.com/doc/refman/5.7/en/fulltext-stopwords.html). For more sophisticated search, I use `solr` – Drew Jul 14 '16 at 13:04
  • @GordonLinoff I'll give it a try with fulltext and see if it fits my needs. As for the `group by` version, it's slowed down by `order by sum(score)`. – leyou Jul 14 '16 at 13:08
  • @leyou . . . I can see that it would be slower. But as the number of words increases, the number of matching rows should decrease -- and the sort should be faster. The `group by` itself might take a bit more time, but I'm surprised that it would grow so fast. – Gordon Linoff Jul 14 '16 at 23:38
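
A minimal sketch of the MySQL full-text approach suggested in the comments above, assuming the raw document text were kept in a hypothetical `content` column on `url` (InnoDB supports FULLTEXT indexes as of MySQL 5.6); the search terms `word1 word2` are placeholders:

-- hypothetical column holding the raw document text
ALTER TABLE url ADD COLUMN content MEDIUMTEXT COLLATE utf8_unicode_ci;
ALTER TABLE url ADD FULLTEXT INDEX ft_content (content);

-- "+" in BOOLEAN MODE requires every word to be present,
-- similar to the HAVING COUNT(*) = 2 query above
SELECT id,
       MATCH(content) AGAINST('+word1 +word2' IN BOOLEAN MODE) AS relevance
FROM url
WHERE MATCH(content) AGAINST('+word1 +word2' IN BOOLEAN MODE)
ORDER BY relevance DESC
LIMIT 10;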
