Is there any alternative for df[100, c("column")] in Scala Spark DataFrames? I want to select a specific row from a column of a Spark DataFrame, for example the 100th row, as in the equivalent R code above.

- Possible duplicate of [How to read specific lines from sparkContext](http://stackoverflow.com/questions/35221033/how-to-read-specific-lines-from-sparkcontext) – Daniel Darabos Feb 08 '16 at 17:30
- This is about DataFrames, and [How to read specific lines from sparkContext](http://stackoverflow.com/questions/35221033/how-to-read-specific-lines-from-sparkcontext) is about RDDs – Josiah Yoder Aug 16 '16 at 21:12
9 Answers
Firstly, you must understand that DataFrames are distributed, which means you can't access them in a typical procedural way; you have to do an analysis first. Although you are asking about Scala, I suggest you read the PySpark documentation, because it has more examples than the docs for the other languages. Continuing with my explanation, I would use some methods of the RDD API, because every DataFrame has an RDD as an attribute. Please see my example below, and notice how I take the 2nd record.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

myIndex = 1
values = (df.rdd.zipWithIndex()                      # pair each Row with its position
          .filter(lambda pair: pair[1] == myIndex)   # keep only the row at myIndex
          .map(lambda pair: pair[0])                 # drop the index again
          .collect())

print(values[0])
# Row(letter='b', name=2)
Hopefully, someone gives another solution with fewer steps.
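In the meantime, since the question asks about Scala, a rough sketch of the same zipWithIndex approach in Scala (assuming an analogous DataFrame named df) would be:
val myIndex = 1
val values = df.rdd
  .zipWithIndex()                          // pair each Row with its position
  .filter { case (_, i) => i == myIndex }  // keep only the row at myIndex
  .map { case (row, _) => row }            // drop the index again
  .collect()
println(values(0))
// [b,2]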

- Your link is dead. It should probably be this: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html?highlight=dataframe#pyspark.sql.DataFrame – MyrionSC2 Mar 03 '21 at 13:31
This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less coding:
val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
val myRow7th = parquetFileDF.rdd.take(7).last

- Will the output change depending on how many nodes the data is clustered across? – bshelt141 Oct 19 '17 at 17:49
In PySpark, if your dataset is small (it can fit into the memory of the driver), you can do
df.collect()[n]
where df is the DataFrame object and n is the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
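Since the question asks about Scala, a sketch of the same idea in Scala would look like the following; the column name myColumn and the String type are only placeholders:
// Pulls the whole DataFrame to the driver, so the data must fit in memory
val n = 100                                // row of interest
val row = df.collect()(n)                  // an org.apache.spark.sql.Row
val value = row.getAs[String]("myColumn")  // placeholder column name and type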

The getrows() function below should get the specific rows you want. For completeness, I have written down the full code in order to reproduce the output.
# Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()
# Create the dataframe
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
# Function to get rows at `rownums`
def getrows(df, rownums=None):
    return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])
# Get rows at positions 0 and 2.
getrows(df, rownums=[0, 2]).collect()
# Output:
#> [Row(letter='a', name=1), Row(letter='c', name=3)]
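A hedged Scala counterpart of getrows (an untested sketch using the same zipWithIndex logic, assuming an analogous Scala DataFrame named df) could look like this:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.rdd.RDD

def getRows(df: DataFrame, rowNums: Set[Long]): RDD[Row] =
  df.rdd.zipWithIndex()
    .filter { case (_, i) => rowNums.contains(i) }  // keep only the requested positions
    .map { case (row, _) => row }                   // drop the index again

getRows(df, Set(0L, 2L)).collect()
// Array([a,1], [c,3])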

There is a Scala way (if you have enough memory on the working machine):
val arr = df.select("column").rdd.collect
println(arr(100))
If the DataFrame schema is unknown, and you know the actual type of the "column" field (for example Double), then you can get arr as follows:
import spark.implicits._  // for the $-column syntax and the Double encoder (spark is your SparkSession)
val arr = df.select($"column".cast("Double")).as[Double].rdd.collect

You can do it simply with the following single line of code:
val arr = df.select("column").collect()(99)
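Note that this returns a Row rather than a raw value; to pull the value out you can use getAs (the Double type here is only an assumed example):
val value = arr.getAs[Double]("column")   // or arr.get(0) for an untyped Any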

When you want to fetch the maximum value of a date column from a DataFrame, just the value without the object type or the Row wrapper, you can use the code below.
from pyspark.sql.functions import max   # use Spark's max, not Python's built-in
table = "mytable"
max_date = df.select(max('date_col')).first()[0]
This returns 2020-06-26 instead of Row(max(reference_week)=datetime.date(2020, 6, 26)).
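Since the question is about Scala, a roughly equivalent sketch (date_col is just the example column name from above) would be:
import org.apache.spark.sql.functions.max
val maxDate = df.select(max("date_col")).first().getDate(0)   // a java.sql.Date, not a Row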

Following is a Java/Spark way to do it: 1) add a sequentially incrementing column, 2) select the row by its id, 3) drop the column.
import static org.apache.spark.sql.functions.*;
..
ds = ds.withColumn("rownum", monotonically_increasing_id());
ds = ds.filter(col("rownum").equalTo(99));
ds = ds.drop("rownum");
N.B. monotonically_increasing_id starts from 0.
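Because the generated IDs are monotonically increasing but not consecutive (see the comment below), filtering on a fixed id such as 99 can miss the row you want. A hedged Scala alternative is to assign consecutive numbers with row_number over a Window; the ordering column here is only an assumed placeholder:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, col}

// "someOrderingColumn" is a placeholder: row_number needs an explicit ordering
val w = Window.orderBy("someOrderingColumn")
val hundredthRow = df.withColumn("rownum", row_number().over(w))
  .filter(col("rownum") === 100)   // row_number starts at 1
  .drop("rownum")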

- `monotonically_increasing_id` - The generated ID is guaranteed to be monotonically increasing and unique, but not **consecutive**. – Gowrav Mar 12 '20 at 15:07