My goal is to one-hot encode a list of categorical columns using Spark DataFrames, similar to what the get_dummies() function does in Pandas.
The data set, bureau.csv, was originally taken from the Kaggle competition Home Credit Default Risk. Here is a sample of my entry table, say entryData, filtered to KEY = 100001 only.
# primary key
KEY = 'SK_ID_CURR'
data = spark.read.csv("bureau.csv", header=True, inferSchema=True)
# sample from bureau.csv (1,716,428 rows in total)
entryData = data.select([KEY] + columnList).where(F.col(KEY) == 100001)
entryData.show()
+----------+-------------+---------------+---------------+
|SK_ID_CURR|CREDIT_ACTIVE|CREDIT_CURRENCY| CREDIT_TYPE|
+----------+-------------+---------------+---------------+
| 100001| Closed| currency 1|Consumer credit|
| 100001| Closed| currency 1|Consumer credit|
| 100001| Closed| currency 1|Consumer credit|
| 100001| Closed| currency 1|Consumer credit|
| 100001| Active| currency 1|Consumer credit|
| 100001| Active| currency 1|Consumer credit|
| 100001| Active| currency 1|Consumer credit|
+----------+-------------+---------------+---------------+
I'm looking to one-hot encode the columns in the list columnList by building a function catg_encode(entryData, columnList), where columnList is:
columnList = cols_type(entryData, obj=True)[1:]
print(columnList)
['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']
Note that cols_type() is a function that returns a list of columns, either the categorical ones (if obj=True) or the numerical ones (if obj=False).
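It is roughly along these lines (a simplified sketch based on df.dtypes, shown only so the question is self-contained; the real helper may differ in details):

def cols_type(df, obj=True):
    # simplified stand-in: string columns are treated as categorical
    if obj:
        return [name for name, dtype in df.dtypes if dtype == 'string']
    return [name for name, dtype in df.dtypes if dtype != 'string']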
I have succeeded in one-hot encoding the first column, 'CREDIT_ACTIVE', but I couldn't do it for all the columns at once, i.e. build the function catg_encode.
# import necessary modules
from pyspark.sql import functions as F
# look for all distinct categories within a given feature (here 'CREDIT_ACTIVE')
categories = entryData.select(columnList[0]).distinct().rdd.flatMap(lambda x: x).collect()
# one-hot encode the categories
exprs = [F.when(F.col(columnList[0]) == category, 1).otherwise(0).alias(category) for category in categories]
# nice table with encoded feature 'CREDIT_ACTIVE'
oneHotEncode = entryData.select(KEY, *exprs)
oneHotEncode.show()
+----------+--------+----+------+------+
|SK_ID_CURR|Bad debt|Sold|Active|Closed|
+----------+--------+----+------+------+
| 100001| 0| 0| 0| 1|
| 100001| 0| 0| 0| 1|
| 100001| 0| 0| 0| 1|
| 100001| 0| 0| 0| 1|
| 100001| 0| 0| 1| 0|
| 100001| 0| 0| 1| 0|
| 100001| 0| 0| 1| 0|
+----------+--------+----+------+------+
Here the feature 'CREDIT_ACTIVE' has 4 distinct categories: ['Bad debt', 'Sold', 'Active', 'Closed'].
Note that I have even tried IndexToString and OneHotEncoderEstimator, but they didn't help for this specific task.
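For context, the OneHotEncoderEstimator route looks roughly like this (a sketch; the _idx/_ohe suffixes are just placeholder names). It produces one SparseVector column per feature rather than a separate 0/1 column per category, which is not the get_dummies()-style layout I need:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

# index each categorical column, then one-hot encode the indices
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in columnList]
encoder = OneHotEncoderEstimator(
    inputCols=[c + "_idx" for c in columnList],
    outputCols=[c + "_ohe" for c in columnList])

pipeline = Pipeline(stages=indexers + [encoder])
encoded = pipeline.fit(entryData).transform(entryData)
# each *_ohe column is a SparseVector, not separate 0/1 columns
encoded.select(KEY, *[c + "_ohe" for c in columnList]).show(truncate=False)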
I am expecting the following output:
+----------+--------+----+------+------+----------+----------+----------+----------+----------+---
|SK_ID_CURR|Bad debt|Sold|Active|Closed|currency 1|currency 2|currency 3|currency 4|..........|...
+----------+--------+----+------+------+----------+----------+----------+----------+----------+---
| 100001| 0| 0| 0| 1| 1| 0| 0| 0| ..|
| 100001| 0| 0| 0| 1| 1| 0| 0| 0| ..|
| 100001| 0| 0| 0| 1| 1| 0| 0| 0| ..|
| 100001| 0| 0| 0| 1| 1| 0| 0| 0| ..|
| 100001| 0| 0| 1| 0| 1| 0| 0| 0| ..|
| 100001| 0| 0| 1| 0| 1| 0| 0| 0| ..|
| 100001| 0| 0| 1| 0| 1| 0| 0| 0| ..|
+----------+--------+----+------+------+----------+----------+----------+----------+----------+---
The dots ... stand for the remaining categories of the feature 'CREDIT_TYPE', which are:
['Loan for the purchase of equipment', 'Cash loan (non-earmarked)', 'Microloan', 'Consumer credit', 'Mobile operator loan', 'Another type of loan', 'Mortgage', 'Interbank credit', 'Loan for working capital replenishment', 'Car loan', 'Real estate loan', 'Unknown type of loan', 'Loan for business development', 'Credit card', 'Loan for purchase of shares (margin lending)']
Remark: I have seen the post E-num / get Dummies in pyspark, but it doesn't automate the process for many columns, which is what the big-data case requires. That post's solution writes separate code for each categorical feature, which is not what my problem calls for.
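For completeness, this is a rough, untested sketch of the kind of generalization I have in mind for catg_encode: it simply loops the single-column logic above over columnList (and assumes category labels do not collide across columns). I'm not sure it is the right or most efficient approach for big data, since every column triggers its own distinct() and collect():

def catg_encode(df, cols, key=KEY):
    # keep the key column, then append one 0/1 column per category of each feature
    exprs = [F.col(key)]
    for c in cols:
        # distinct categories of the current column
        categories = df.select(c).distinct().rdd.flatMap(lambda x: x).collect()
        exprs += [F.when(F.col(c) == category, 1).otherwise(0).alias(category)
                  for category in categories]
    return df.select(*exprs)

oneHotEncoded = catg_encode(entryData, columnList)
oneHotEncoded.show()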