
I am trying to get a DistributedLDAModel from the Spark ml LDA library. I only see examples using the mllib LDA, not the ml LDA. Is there any sample code I can follow?

Nabs
  • Do you know that there is more to learning a framework than just examples? Try to find it in the Spark scaladoc, try something, and we will correct you if you run into issues! – eliasah Aug 10 '16 at 06:17
  • Hi eliasah, check out my Scala code at http://stackoverflow.com/questions/38818879/matcherror-while-accessing-vector-column-in-spark-2-0/38819323#38819323 . My next step is to find out which topics are being discussed. Let me know if you need more details. – Nabs Aug 10 '16 at 06:49

2 Answers


To get a DistributedLDAModel instead of a LocalLDAModel, you need to use the Expectation-Maximization (EM) optimizer instead of the default Online Variational Bayes (online) one.

Concretely, call setOptimizer("em") on your LDA estimator to get a distributed model:

val lda = new LDA().setOptimizer("em")
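
A minimal sketch of the full flow, assuming Spark 2.0's ml API and an existing DataFrame `dataset` with a vector column named `"features"` (both names are placeholders for your own data): fit with the EM optimizer, then cast the returned LDAModel down to DistributedLDAModel, since fit() is typed as returning the base class:

```scala
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}

// With the "em" optimizer, fit() produces a DistributedLDAModel;
// the default "online" optimizer would produce a LocalLDAModel instead.
val lda = new LDA()
  .setK(10)
  .setMaxIter(10)
  .setOptimizer("em")
  .setFeaturesCol("features") // assumed column name

// fit() is declared to return LDAModel, so downcast to the distributed subclass
val distributedModel = lda.fit(dataset).asInstanceOf[DistributedLDAModel]

// If needed, convert to a local model (e.g. to score new documents)
val localModel = distributedModel.toLocal
```

Note that you cannot construct a DistributedLDAModel directly (its constructor is private); the cast after fit() is the intended way to obtain one.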
Derlin

I am sharing sample code from the Spark ml documentation.

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT} // use org.apache.spark.ml.linalg in Spark 2.x
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StructField, StructType}

val FEATURES_COL = "features"

// Load the data: one space-separated vector of doubles per line
// (`input` is the path to your text file; sc and sqlContext come from the shell)
val rowRDD = sc.textFile(input).filter(_.nonEmpty)
  .map(_.split(" ").map(_.toDouble)).map(Vectors.dense).map(Row(_))
val schema = StructType(Array(StructField(FEATURES_COL, new VectorUDT, false)))
val dataset = sqlContext.createDataFrame(rowRDD, schema)

// Train an LDA model
val lda = new LDA()
  .setK(10)
  .setMaxIter(10)
  .setFeaturesCol(FEATURES_COL)
val model = lda.fit(dataset)
val transformed = model.transform(dataset)

val ll = model.logLikelihood(dataset)
val lp = model.logPerplexity(dataset)

// describeTopics
val topics = model.describeTopics(3)

// Shows the result
topics.show(false)
transformed.show(false)

You can find the complete code here.

Rohit Raj
  • Hi Rohit, where are you invoking the DistributedLDAModel? – Nabs Aug 10 '16 at 07:09
  • This is copy-pasted code from the documentation! – eliasah Aug 10 '16 at 07:16
  • Hi eliasah. I know it's code from the example, which is why I shared the link as well, and I think that's what the user asked for. @user1733690 currently I am developing a topic model for a real-time tag recommender system. – Rohit Raj Aug 10 '16 at 07:28
  • Rohit, I am aware of this code; if you browse to the link in my question you will see that. As eliasah suggested, there is documentation for LDA, LDAModel, DistributedLDAModel... The one you mentioned is for LDA; I wanted DistributedLDAModel. The only issue is I am not finding any example, so I am running into issues. For example, I tried to create a DistributedLDAModel object and here is what I got: – Nabs Aug 10 '16 at 07:36
  • import org.apache.spark.ml.clustering.DistributedLDAModel val lda = new DistributedLDAModel() .setK(3) .setMaxIter(10) .setFeaturesCol("features") :68: error: constructor DistributedLDAModel in class DistributedLDAModel cannot be accessed in class $iwC val lda = new DistributedLDAModel() – Nabs Aug 10 '16 at 07:36
  • So Basically I am not sure if I am doing a right thing or not – Nabs Aug 10 '16 at 07:37
  • Here is the documentation link for scala spark which I am reffering http://spark.apache.org/docs/latest/api/scala/#package pretty elaborate but it would have been more informative if a small sample would have been given. – Nabs Aug 10 '16 at 07:41
  • Just to point out another confusing discrepancy in the docs: check out the source link at http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.clustering.DistributedLDAModel — the source references the mllib version of DistributedLDAModel, but I expected it to reference the ml version. It would be great if anyone could explain what is going on here :( – Nabs Aug 10 '16 at 07:47
  • @user1733690 I just want to ask why you want to implement this in ml when it is already implemented in mllib. For my purpose I implemented the whole topic model on Spark, used mllib for distributed modelling (```import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}```), and used LocalLDAModel for prediction, since you can't predict topics for a new document with DistributedLDAModel. – Rohit Raj Aug 10 '16 at 08:54
  • One major reason is performance: mllib works with RDDs, which are slow compared to ml's DataFrames. Also, I don't want to give up after coming this far. If there is documentation for the ml DistributedLDAModel, then there should be an implementation too. It seems no one has explored it yet. – Nabs Aug 10 '16 at 14:15