I have developed a search engine that lets the user search a word through hundreds of thousands of documents.
All documents are split into words. All words are stored in the table "word", and one row is added to the table "urlword" for each word found in a document. The column "score" is calculated with the tf-idf algorithm.
To search for one word, the query is simple and fast:
SELECT url_id FROM urlword WHERE word_id = [word id] ORDER BY score DESC LIMIT 10
However it gets exponentially slower when we want to find documents that include multiple words. For example:
select url_id from urlword uw
where word_id in (21,28)
group by url_id
having count(*) = 2
order by sum(score) DESC
limit 10
or
select uw1.url_id from urlword uw1
inner join urlword uw2 on uw2.word_id=28 and uw1.url_id=uw2.url_id
where uw1.word_id=21
order by uw1.score+uw2.score DESC
limit 10
are very slow, dozens of seconds, when the words have many occurrences.
So is there a more optimized way to do it with mysql?
Is elasticsearch the only way to do? And would even elasticsearch perform well here? Or maybe another tool?
The mysql schema:
CREATE TABLE `url` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text COLLATE utf8_unicode_ci,
`urlhash` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` text COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `urlhash` (`urlhash`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `urlword` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`count` int(11) DEFAULT NULL,
`score` decimal(20,6) DEFAULT NULL,
`url_id` int(11) DEFAULT NULL,
`word_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_DBA1BC7981CFDAE7E357438D` (`url_id`,`word_id`),
KEY `IDX_DBA1BC7932993751` (`score`),
KEY `word_id_url_id_score` (`word_id`,`url_id`,`score`),
KEY `word_id_score` (`word_id`,`score`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `word` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_C3F17511C3F17511` (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;