PHP library for word clustering/NLP?

Question

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.

After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.

Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?

Cluster them based on what? What's a meaningful group to you? — netcoder, Nov 08 '11 at 01:49
@netcoder: in a general purpose clustering library, that shouldn't matter. The choice of features should determine what kind of groups are produced. — Fred Foo, Nov 11 '11 at 10:44

score 5 · Accepted Answer · answered Nov 14 '11 at 03:02

Like this:

Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, and sort in descending order.

The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation so that each mark becomes a separate word first, e.g. "Something, like this." -> "Something , like this ." Or you can just remove all punctuation.
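The first option, turning each punctuation mark into its own word, could look roughly like the sketch below. The helper name separate_punctuation is hypothetical, and the [[:punct:]] class is just one reasonable choice of what counts as punctuation.

```php
<?php
// Hypothetical helper: pad every punctuation mark with spaces so that it
// becomes a separate token when the text is later split on whitespace.
function separate_punctuation(string $text): string
{
    // Surround each punctuation character with spaces.
    $padded = preg_replace('/([[:punct:]])/', ' $1 ', $text);
    // Collapse the extra whitespace introduced by the padding.
    return trim(preg_replace('/\s+/', ' ', $padded));
}

echo separate_punctuation('Something, like this.'), "\n";
```

With the punctuation marks added to the stopword list, each mark then acts as a phrase boundary in the loop that follows.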

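$content=strtolower($content); // lowercase first, otherwise the a-z filter below drops capitals
$content=preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
$words=preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split the string into words

$stopwords='the|and|is|your|me|for|where|etc...';
$stopwords=explode('|',$stopwords);
$stopwords=array_flip($stopwords); // flip so that isset() can be used for fast lookups

$result=array(); $temp=array();
foreach ($words as $s)
 {
 if (isset($stopwords[$s]) OR strlen($s)<3)
  {
  // a stopword or very short word ends the current phrase
  if (sizeof($temp)>0)
   {
   $result[]=implode(' ',$temp);
   $temp=array();
   }
  } else $temp[]=$s;
 }
if (sizeof($temp)>0) $result[]=implode(' ',$temp); // flush the last phrase

$phrases=array_count_values($result); // phrase => number of occurrences
arsort($phrases); // most frequent first, keys preserved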

Now you have an associative array that maps each phrase to its count, ordered by how frequently the phrase occurs in your input data.

How you do the matching is up to you, and it depends largely on the length of the strings in the input data.

I would check whether any of the top 3 array keys for one result match any of the top 3 keys for any other result in the data. Results that share a key then form a group, and the shared key is the group's name.
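A minimal sketch of that matching step, assuming the phrase counting above has been run once per search result. The function name group_by_top_phrases, the $phrasesPerResult structure and the sample data are all made up for illustration.

```php
<?php
// Sketch: group results that share any of their top-N phrases.
// $phrasesPerResult is a hypothetical structure: result id => phrase counts,
// already sorted in descending order (e.g. with arsort()).
function group_by_top_phrases(array $phrasesPerResult, int $topN = 3): array
{
    $groups = array();
    foreach ($phrasesPerResult as $id => $phrases) {
        // array_slice() with preserve_keys keeps the phrase keys intact.
        $top = array_keys(array_slice($phrases, 0, $topN, true));
        foreach ($top as $phrase) {
            $groups[$phrase][] = $id;
        }
    }
    // A phrase only names a group if at least two results share it.
    return array_filter($groups, function ($ids) {
        return count($ids) > 1;
    });
}

// Made-up sample data for three search results.
$phrasesPerResult = array(
    'r1' => array('munich sightseeing' => 3, 'beer garden' => 2, 'old town' => 1),
    'r2' => array('beer garden' => 4, 'oktoberfest' => 2),
    'r3' => array('old town' => 2, 'munich sightseeing' => 1),
);
print_r(group_by_top_phrases($phrasesPerResult));
```

Note that a result can land in several groups this way, which matches how "clouds" on search sites usually behave.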

Let me know if you have any trouble with this.

I forgot to mention to strtolower() first, though it should be obvious. — Alasdair, Nov 15 '11 at 04:07

score 2 · Answer 2 · edited May 23 '17 at 12:03


"... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.

For starters you could look into K-Means clustering.
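To give an idea of what K-Means involves, here is a minimal sketch in plain PHP. It is not taken from any library or from the site linked in this answer. It assumes each search result has already been turned into a numeric vector of term counts, and it seeds the centroids with the first k points purely to keep the example deterministic (random or k-means++ seeding is more usual).

```php
<?php
// Squared Euclidean distance between two equal-length numeric vectors.
function squared_distance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return $sum;
}

// Minimal k-means sketch: returns the cluster index assigned to each point.
// $points must be a 0-indexed list of numeric vectors.
function kmeans(array $points, int $k, int $maxIterations = 100): array
{
    $centroids = array_slice($points, 0, $k); // assumption: first $k points as seeds
    $assignments = array();
    for ($iter = 0; $iter < $maxIterations; $iter++) {
        // Assignment step: attach each point to its nearest centroid.
        $newAssignments = array();
        foreach ($points as $p => $point) {
            $best = 0;
            $bestDistance = INF;
            foreach ($centroids as $c => $centroid) {
                $d = squared_distance($point, $centroid);
                if ($d < $bestDistance) {
                    $bestDistance = $d;
                    $best = $c;
                }
            }
            $newAssignments[$p] = $best;
        }
        if ($newAssignments === $assignments) {
            break; // converged: no point changed cluster
        }
        $assignments = $newAssignments;
        // Update step: move each centroid to the mean of its members.
        foreach ($centroids as $c => $centroid) {
            $members = array();
            foreach ($assignments as $p => $cluster) {
                if ($cluster === $c) {
                    $members[] = $points[$p];
                }
            }
            if (count($members) === 0) {
                continue; // leave the centroid of an empty cluster where it is
            }
            foreach ($centroid as $i => $unused) {
                $centroids[$c][$i] = array_sum(array_column($members, $i)) / count($members);
            }
        }
    }
    return $assignments;
}

// Made-up term-count vectors: each column is the count of one term.
$points = array(array(5, 0), array(0, 4), array(4, 1), array(1, 5));
print_r(kmeans($points, 2));
```

K-Means only produces unnamed clusters, so naming the groups (for example by the most frequent term in each cluster) remains a separate step.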

Have a look at this page and website:

PHP/ir: Information Retrieval and other interesting topics

EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.

EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!

Dmoz/Monster algorithme to calculate count of each category and sub category?

answered Nov 02 '11 at 11:35 by zaf (edited May 23 '17 at 12:03 by Community)

Thanks, I had found that one already … While an interesting read and good example code, it's far from being a library. As for "meaningful groups", [this Yippy search (mind what they call "clouds")](http://search.yippy.com/search?input-form=clusty-simple&v%3Asources=webplus-ns-aaf&v%3Aproject=clusty&query=sightseeing+munich) illustrates what I'm trying to implement pretty well. – vzwick Nov 02 '11 at 11:43
@vzwick: You mean... faceting? – netcoder Nov 08 '11 at 01:50
@vzwick Ah, the example site explains all. The simple answer is no - you won't find a library to automatigically do that for you. – zaf Nov 08 '11 at 08:23

score 1 · Answer 3 · edited May 23 '17 at 11:47

If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.

Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.

score 1 · Answer 4 · answered Nov 09 '11 at 09:44

You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.

Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.

score 0 · Answer 5 · answered Nov 09 '11 at 15:56

This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.

I've used this library a few times in php and it's always been quite easy to work with.

Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?

score 0 · Answer 6 · answered Nov 12 '11 at 11:52

If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.

Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.

You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.

When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
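As a rough illustration of the filter step, the sketch below uses an in-memory array as a stand-in for the URL-to-tag join table. The URLs, tags and the function name top_filters are invented for the example, and it reads "top results" as the tags that occur most often in the current result set, which is one interpretation of the paragraph above.

```php
<?php
// Hypothetical stand-in for the many-to-many URL/tag tables.
$tagsByUrl = array(
    'http://example.com/a' => array('museums', 'munich'),
    'http://example.com/b' => array('munich', 'beer'),
    'http://example.com/c' => array('museums'),
    'http://example.com/d' => array('hiking'),
);

// Count how often each predefined tag occurs among the current results
// and return the most frequent tags as the filters to display.
function top_filters(array $resultUrls, array $tagsByUrl, int $limit = 3): array
{
    $counts = array();
    foreach ($resultUrls as $url) {
        // URLs without any tags simply contribute nothing.
        foreach ($tagsByUrl[$url] ?? array() as $tag) {
            $counts[$tag] = ($counts[$tag] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent tag first, keys preserved
    return array_slice($counts, 0, $limit, true);
}

// Made-up result set for the current search.
$currentResults = array(
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/c',
);
print_r(top_filters($currentResults, $tagsByUrl, 2));
```

In a real setup the counting would more likely be a GROUP BY query over the join table, restricted to the URLs in the current result set.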

I'll work on query examples if you want.
