
I have a data frame with some groups, and I want to fill each missing value in the score column with the average of the last previously available value and the next available value, i.e. (previous value + next value) / 2.

I want to group by state, school, class, and subject, and then fill the values within each group.

If the first value in the score column is not available, fill it with the next available value; if the last value is not available, fill it with the previously available value. This needs to be followed for each group.

This is a complex data-imputation problem. I searched online and found that pandas has some functionality for this, i.e. pandas.core.groupby.DataFrameGroupBy.ffill, but I don't know how to use it in this case.

I am thinking of solving this in Python, PySpark, or SQL!

My data frame looks like this:

[image: Missing values]

[image: Data Imputation]

  • Shouldn't the math score be 46 for all of the records in the Texas Crescent school? – George Joseph Jul 26 '20 at 04:14
  • Nope, you are missing the class group! – Code run Jul 26 '20 at 04:46
  • Do you want to fill them in that way only? What about interpolating them after grouping? – Pygirl Jul 26 '20 at 04:47
  • That would solve my problem if I could group and interpolate. The problem I am facing is how to group by and fill the information. @Pygirl – Code run Jul 26 '20 at 04:53
  • See this: https://stackoverflow.com/questions/37057187/pandas-interpolate-within-a-groupby – Pygirl Jul 26 '20 at 04:57
  • Nope, this won't solve the problem. I need to fill the values in the above manner only; grouping and filling doesn't let me create a new column. I was thinking of creating two columns by performing the same grouping with different logic, then merging the two to create the final df – Code run Jul 26 '20 at 05:08
  • I know of a way to solve this in sql, but not sure if that is what you need here – George Joseph Jul 26 '20 at 05:21
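The SQL route mentioned in the last comment is not spelled out in the thread; the following is a minimal sketch in Spark SQL (run here from Scala), assuming a registered view named scores and a hypothetical id column that fixes the row order within each group:

    // Sketch only: the "scores" view and the "id" ordering column are assumptions
    val result = spark.sql("""
      SELECT state, school, class, subject, score,
             (COALESCE(prev_score, next_score) + COALESCE(next_score, prev_score)) / 2 AS new_score
      FROM (
        SELECT *,
               LAST(score, true) OVER (PARTITION BY state, school, class, subject ORDER BY id
                                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS prev_score,
               FIRST(score, true) OVER (PARTITION BY state, school, class, subject ORDER BY id
                                        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS next_score
        FROM scores
      ) t
    """)

Each null is replaced by the average of the nearest non-null scores before and after it within the group; at the group edges, COALESCE falls back to whichever neighbour exists.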

1 Answer


Perhaps this is helpful -

Load the test data
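The construction of df2 is not shown above; here is a minimal sketch, assuming a local SparkSession (the names are illustrative only), that builds a test frame matching the output below:

    import org.apache.spark.sql.SparkSession

    // Assumed setup: a local SparkSession for testing
    val spark = SparkSession.builder().appName("impute-test").master("local[*]").getOrCreate()
    import spark.implicits._

    // Test rows mirroring the frame printed below; None becomes null in the DataFrame
    val df2 = Seq(
      ("A", None), ("A", Some(46)), ("A", None), ("A", None), ("A", Some(35)),
      ("A", None), ("A", None), ("A", None), ("A", Some(46)), ("A", None),
      ("A", None), ("B", Some(78)), ("B", None), ("B", None), ("B", None),
      ("B", None), ("B", None), ("B", Some(56)), ("B", None)
    ).toDF("class", "score")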

    df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+
      * |class|score|
      * +-----+-----+
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |A    |35   |
      * |A    |null |
      * |A    |null |
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |B    |78   |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |56   |
      * |B    |null |
      * +-----+-----+
      *
      * root
      * |-- class: string (nullable = true)
      * |-- score: integer (nullable = true)
      */

Impute null values in the score column (check the new_score column)


    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Last non-null score from the start of the partition up to the current row
    val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    // First non-null score from the current row to the end of the partition
    val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
      .withColumn("next", first("score", ignoreNulls = true).over(w2))
      // Both frames include the current row, so non-null scores are kept as-is;
      // at group edges coalesce falls back to whichever neighbour exists
      .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
      .drop("next", "previous")
      .show(false)

    /**
      * +-----+-----+---------+
      * |class|score|new_score|
      * +-----+-----+---------+
      * |A    |null |46.0     |
      * |A    |46   |46.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |35   |35.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |46   |46.0     |
      * |A    |null |46.0     |
      * |A    |null |46.0     |
      * |B    |78   |78.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |56   |56.0     |
      * |B    |null |56.0     |
      * +-----+-----+---------+
      */
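Note that the example partitions by class alone to keep the data small; for the question's actual grouping, the same windows would partition by all four columns. A sketch, with a hypothetical roll_number column assumed to make the row order within each group deterministic:

    // Same technique over the question's full grouping columns.
    // "roll_number" is an assumed ordering column; substitute whatever fixes row order.
    val w1 = Window.partitionBy("state", "school", "class", "subject")
      .orderBy("roll_number")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val w2 = Window.partitionBy("state", "school", "class", "subject")
      .orderBy("roll_number")
      .rowsBetween(Window.currentRow, Window.unboundedFollowing)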
  • Get the last available value for each row as previous, ignoring nulls. Similarly, get the next value present, and then compute the average `(previous + next) / 2` – Som Jul 26 '20 at 12:27
  • Thanks, I haven't used $ in my PySpark code; I wanted to confirm whether this is PySpark or Scala code? – Code run Jul 26 '20 at 12:30
  • This is Scala, but it can be implemented the same way in PySpark with minimal change. Replace $"..." with F.col(...) – Som Jul 26 '20 at 12:49
  • df.withColumn("previous", last("score", ignorenulls=True).over(w1)).withColumn("next", first("score", ignorenulls=True).over(w2)).withColumn("new_score", (coalesce(F.col("previous"), F.col("next")) + coalesce(F.col("next"), F.col("previous"))) / 2).drop("next", "previous").show(truncate=False) – Code run Jul 26 '20 at 13:04
  • I already accepted and upvoted the answer. Found it useful! However, I might need to change the above code to achieve the desired result – Code run Jul 26 '20 at 18:16
  • This is how one should accept an answer: https://meta.stackexchange.com/a/5235/767994 – Som Jul 27 '20 at 03:17
  • Thanks for telling me that; I didn't know this earlier. I did it. Can you upvote the question also? – Code run Jul 27 '20 at 03:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/218660/discussion-between-code-run-and-someshwar-kale). – Code run Jul 27 '20 at 06:23