I have a very large PySpark dataframe in which every cell is one of the letters A, B, C, or D. What's the most efficient way to count how many times each letter appears across the entire dataframe?
A very small example:
col1 col2 col3 col4
A C A C
B D C A
I want a count that looks like:
col count
A 3
B 1
C 3
D 1
The result does not have to be sorted.
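To pin down the semantics I'm after, here is the same count on the small example in plain Python. This is only an illustration of the desired result, not a Spark solution; the `rows` list is just the example above retyped by hand.

```python
from collections import Counter

# The small example from above, one inner list per row (col1..col4).
rows = [
    ["A", "C", "A", "C"],
    ["B", "D", "C", "A"],
]

# Count every cell, ignoring which column it came from.
counts = Counter(cell for row in rows for cell in row)

for letter in sorted(counts):
    print(letter, counts[letter])
```
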
For example, I was thinking of combining the columns into one and calling groupby and count on that:
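from pyspark.sql.functions import array, col, explode
one_column_df = df.withColumn("mycol", array("col1", "col2", "col3", "col4")).select(explode(col("mycol")))
one_column_df.groupBy("col").count().show()  # explode() names its output column "col" by default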
Is there a more efficient way than having to call array and explode?