Running regression on several columns in parallel

Question

I have a very wide dataframe with label columns. I want to run a logistic regression for each column independently. I'm trying to find the most efficient way to run this in parallel.

+----------+--------+--------+--------+-----+------------+
| features | label1 | label2 | label3 | ... | label30000 |
+----------+--------+--------+--------+-----+------------+

My initial thought was to use ThreadPoolExecutor, get the result for each column, and join:

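from concurrent import futures

from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

extract_prob = udf(lambda x: float(x[1]), FloatType())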

def lr_for_column(argm):
    col_name = argm[0]
    test_res = argm[1]
    lr = LogisticRegression(featuresCol="features", labelCol=col_name, regParam=0.1)
    lrModel = lr.fit(tfidf)
    res = lrModel.transform(test_tfidf)
    test_res = test_res.join(res.select('id', 'probability'), on="id")
    test_res = test_res.withColumn(col_name, extract_prob('probability')).drop("probability")
    return test_res.select('id', col_name)


with futures.ThreadPoolExecutor(max_workers=100) as executor:
    future_results = [executor.submit(lr_for_column, [colname, test_res]) for colname in list_of_label_columns]
    futures.wait(future_results)
    for future in future_results:
        test_res = test_res.join(future.result(), on="id")

but this method is not very performant. Is there a faster way to do this?

How many partitions does the data have, and how many cores have you allocated in total? Also, how much memory per core? — Alper t. Turker, May 19 '18 at 13:40
@user9613318 200 partitions, 8-node cluster, each node has 4 cores and 28 GB RAM — Kertis van Kertis, May 19 '18 at 15:04

Alper t. Turker · Accepted Answer · 2018-05-19T17:16:18.653

Taking into account the available resources, you have nothing to gain by using ThreadPoolExecutor: with 32 cores in total and 200 partitions you can process only ~16% of your data at the same time, and this fraction can only get worse if the data grows.

If you want to train 30000 models and use the default number of iterations (100, probably too low in practice), your Spark program will submit around 3 000 000 jobs (each iteration creates a separate one), and only a fraction of each can be processed concurrently. This doesn't give much hope for improvement unless you add more resources.
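The two estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the cluster numbers come from the question's comments, and the variable names are made up for illustration.

```python
# Cluster and workload figures quoted in the question and its comments.
nodes = 8
cores_per_node = 4
partitions = 200
label_columns = 30000
max_iter = 100  # default number of iterations mentioned in the answer

# Each core processes one partition at a time.
total_cores = nodes * cores_per_node
concurrent_fraction = total_cores / partitions

# Rough estimate: one Spark job per iteration per model.
approx_jobs = label_columns * max_iter

print(f"total cores: {total_cores}")
print(f"fraction of partitions processed at once: {concurrent_fraction:.0%}")
print(f"approximate number of jobs: {approx_jobs:,}")
```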

Despite that, there are some things you can try:

Make sure that the final features don't have to be recomputed. If necessary, write the data to persistent storage and load it back, and make sure that the data passed to the model is cached.
Consider applying some dimensionality reduction algorithm. The number of features (300000) is not only high, but also close to the number of records (500000). This is not only computationally expensive, but can also result in serious overfitting.
If you decide to reduce dimensions, consider sampling to further reduce the size of your training data, and consequently reduce the number of partitions and increase overall throughput. If there are strong linear trends in your data, they should be visible even on a smaller sample, without significant loss of precision.
Consider replacing the expensive pyspark.ml algorithm with a variant that doesn't require multiple jobs, for example using some combination of tools from spark-sklearn (you could create an ensemble model by fitting an sklearn model on each partition).
Consider oversubscribing cores. For example, if you have 4 physical cores per node, allow 8 or 16 to account for IO wait time.
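The advice about persisting, sampling, and caching could be combined roughly as follows. This is a minimal sketch, not a tuned recipe: `prepare_training_data` is a hypothetical helper, `spark` is assumed to be an existing SparkSession, `df` a DataFrame with the features already computed, and the `path`, `fraction`, and `num_partitions` values are placeholders.

```python
def prepare_training_data(spark, df, path, fraction=0.1, num_partitions=32, seed=42):
    """Persist features once, then return a cached, down-sampled copy.

    `spark` is a SparkSession and `df` a DataFrame whose feature column has
    already been computed. `path`, `fraction` and `num_partitions` are
    illustrative values, not recommendations.
    """
    # Write the computed features out so their lineage is never replayed.
    df.write.mode("overwrite").parquet(path)
    stored = spark.read.parquet(path)

    # Sample without replacement and shrink the partition count to match.
    sampled = stored.sample(withReplacement=False, fraction=fraction, seed=seed)
    sampled = sampled.coalesce(num_partitions)

    # Cache and force materialization before any model is fitted.
    sampled = sampled.cache()
    sampled.count()
    return sampled
```

The returned DataFrame would then be the one passed to `lr.fit(...)` for every label column, so that each fit reads cached data instead of recomputing the features.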
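To illustrate the per-partition ensemble idea, here is a dependency-free sketch. It does not use spark-sklearn or sklearn: a tiny hand-written logistic regression stands in for the sklearn model, plain Python lists stand in for partitions, and all function names are hypothetical. The point is only the shape of the approach: one model per partition, fitted in a single pass over the data, with predictions averaged afterwards.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_partition(rows, n_features, epochs=200, lr=0.5, reg=0.1):
    """Fit one small L2-regularized logistic regression on a single partition.

    `rows` is an iterable of (features, label) pairs, which is the shape a
    callback passed to RDD.mapPartitions would receive (assuming the rows
    were mapped to such pairs first). Yields one (weights, bias) tuple.
    """
    data = list(rows)
    if not data:
        return  # empty partition: contribute no model
    w = [0.0] * n_features
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        # Batch gradient step with L2 penalty on the weights only.
        w = [wi - lr * (g / n + reg * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    yield w, b


def predict_ensemble(models, x):
    # Average the probabilities predicted by the per-partition models.
    probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for w, b in models]
    return sum(probs) / len(probs)


def make_row():
    # Synthetic example: the label is 1 when the two features sum to > 0.
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    return x, 1.0 if x[0] + x[1] > 0 else 0.0


# Local stand-in for partitions. With Spark this would be roughly:
#   models = rdd.mapPartitions(lambda it: fit_partition(it, n_features)).collect()
random.seed(0)
partitions = [[make_row() for _ in range(50)] for _ in range(4)]
models = [m for part in partitions for m in fit_partition(part, n_features=2)]

print(predict_ensemble(models, [0.9, 0.9]))
print(predict_ensemble(models, [-0.9, -0.9]))
```

Because each partition is fitted locally, the whole ensemble for one label needs a single pass over the data rather than one job per iteration. The trade-off is that each model sees only a fraction of the records.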

4 cores times 8 nodes gives 32 cores in total. If the data has 200 partitions, then 200 * 0.16 = 32 (each core can process one partition at a time). Even if you include IO wait, it doesn't look good. — Alper t. Turker, May 19 '18 at 17:13
Does this mean that I should also decrease the number of partitions? — Kertis van Kertis, May 19 '18 at 17:19
Partitions that are too large are in general no good, so probably not. For a broader discussion, see [How to calculate the best numberOfPartitions for coalesce?](https://stackoverflow.com/q/40865326/9613318) — Alper t. Turker, May 19 '18 at 17:22
