I have two Scala DataFrames which I am testing for similarities. I want to be able to pick a specific row number, and compare each value of that row between the two DataFrames. For example:

DataFrame 1: df1

+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob  | 12  |   Blue    |
| Bil  | 17  |   Red     |
| Ron  | 13  |   Brown   |
+------+-----+-----------+

DataFrame 2: df2

+------+-----+-----------+
| Name | Age | Eye Color |
+------+-----+-----------+
| Bob  | 12  |   Blue    |
| Bil  | 14  |   Blue    |
| Ron  | 13  |   Brown   |
+------+-----+-----------+

Input: Row 2, output: Age, Eye Color.

Ideally, the output would also show the values that differ. I have considered the option here, but my DataFrames are very large (in excess of 200,000 rows), so it takes far too long. Is there a simpler way to select a specific row of a DataFrame in Scala?

David Boulton
  • The outcome in the sample you have given compares two rows based on **Name** property. Is that what you want to do? Or you strictly want to give your program a row number? – jrook Oct 22 '20 at 16:44
  • `zipWithIndex` is the only way to get continuous, incrementing values across two different DataFrames. It should have worked, though, as it is parallelised. – Sanket9394 Oct 22 '20 at 17:03
  • Secondly, your use case of comparing two rows of two different DataFrames makes sense only if you first sort both DataFrames by some common column. – Sanket9394 Oct 22 '20 at 17:04
  • @jrook I want to strictly give the program a row number as I need to compare all fields in that row – David Boulton Oct 23 '20 at 08:47
  • @Sanket9394 Both databases are sorted and should be identical so that shouldn't be an issue. I will try using zipWithIndex and see how long it takes. Thanks – David Boulton Oct 23 '20 at 08:48
  • @DavidBoulton, what do you mean by "databases are sorted"? Are df1 and df2 loaded from a database? – Sanket9394 Oct 23 '20 at 12:00
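
For reference, the `zipWithIndex` suggestion from the comments could be sketched roughly as below. This is only a minimal illustration, not a tested solution for large data: the names `RowDiff`, `rowAt`, `diffValues` and `diffRow` are hypothetical, the index is assumed to be zero-based, and both DataFrames are assumed to have the same columns in the same order and an identical, stable row ordering.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical helper object; none of these names come from the question.
object RowDiff {

  // Fetch the row at a zero-based position. This relies on the DataFrame's
  // current row order, so both inputs must already be sorted the same way.
  def rowAt(df: DataFrame, index: Long): Option[Row] =
    df.rdd.zipWithIndex()
      .filter { case (_, i) => i == index }
      .map { case (row, _) => row }
      .collect()
      .headOption

  // Pure comparison step: (column name, left value, right value)
  // for every column whose values differ.
  def diffValues(columns: Seq[String], left: Seq[Any], right: Seq[Any]): Seq[(String, Any, Any)] =
    columns.zip(left.zip(right)).collect {
      case (name, (a, b)) if a != b => (name, a, b)
    }

  // Compare the same row position in two DataFrames.
  // Returns an empty result if either DataFrame has no row at that index.
  def diffRow(df1: DataFrame, df2: DataFrame, index: Long): Seq[(String, Any, Any)] =
    (rowAt(df1, index), rowAt(df2, index)) match {
      case (Some(r1), Some(r2)) => diffValues(df1.columns.toSeq, r1.toSeq, r2.toSeq)
      case _                    => Seq.empty
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("RowDiff").getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val df1 = Seq(("Bob", 12, "Blue"), ("Bil", 17, "Red"), ("Ron", 13, "Brown"))
      .toDF("Name", "Age", "Eye Color")
    val df2 = Seq(("Bob", 12, "Blue"), ("Bil", 14, "Blue"), ("Ron", 13, "Brown"))
      .toDF("Name", "Age", "Eye Color")

    // "Row 2" in the question is one-based, so it is index 1 here.
    diffRow(df1, df2, 1L).foreach { case (col, a, b) => println(s"$col: $a vs $b") }
    // prints:
    // Age: 17 vs 14
    // Eye Color: Red vs Blue

    spark.stop()
  }
}
```

Only the two selected rows are collected to the driver, and the column-by-column comparison runs locally on them. The filter still has to scan each DataFrame once, so how well this performs on 200,000+ rows would need to be measured.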

0 Answers