
I'm currently transforming some date data in a dataframe which looks like:

+---------+--------------+
|first_col|sec_col       |
+---------+--------------+
|a        |28-04-2021    |
|a        |01-03-2017    |
|a        |"Feb 23, 2012"|
|a        |"May 01, 2019"|
+---------+--------------+

I now want to transform the last two rows into a better date format, like so: 23-Feb-2012. I thought of doing this with regular expressions, but the following code doesn't work:

from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, regexp_extract

# (a lot of stuff happens here which is not important for the question, so I left it out)

input_df = (input_df.withColumn("sec_col", input_df.sec_col.cast("string"))
    .withColumn("sec_col2",
        f.when(input_df.sec_col.rlike(r"\"\w{3} \d{2}, \d{4}\""),
               f.concat(regexp_extract("sec_col", r"\"(\w{3}) (\d{2}), (\d{4})\"", 2), f.lit("-"),
                        regexp_extract("sec_col", r"\"(\w{3}) (\d{2}), (\d{4})\"", 1), f.lit("-"),
                        regexp_extract("sec_col", r"\"(\w{3}) (\d{2}), (\d{4})\"", 3)))
         .otherwise(f.col("sec_col"))))

Can anyone help?

Aquen

1 Answer


Instead of trying to parse the date format with regex, you can directly convert the column to a date using to_date, since you already know the date formats to parse, and then fetch the first non-null value with coalesce.

Data Preparation

from pyspark.sql import functions as F

# spark is the active SparkSession
sparkDF = spark.createDataFrame([("28-04-2021",),
                                 ("01-03-2017",),
                                 ("Feb 23, 2012",),
                                 ("May 01, 2019",)
                                ], ['timestamp'])

sparkDF.show()

+------------+
|   timestamp|
+------------+
|  28-04-2021|
|  01-03-2017|
|Feb 23, 2012|
|May 01, 2019|
+------------+

Parsing Dates


sparkDF = sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"MMM dd, yyyy"))\
                 .withColumn('p2',F.to_date(F.col('timestamp'),"dd-MM-yyyy"))

sparkDF.show()


+------------+----------+----------+
|   timestamp|        p1|        p2|
+------------+----------+----------+
|  28-04-2021|      null|2021-04-28|
|  01-03-2017|      null|2017-03-01|
|Feb 23, 2012|2012-02-23|      null|
|May 01, 2019|2019-05-01|      null|
+------------+----------+----------+

Coalesce

sparkDF = sparkDF.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
                 .drop(*['p1','p2'])

sparkDF.show()

+------------+----------------+
|   timestamp|timestamp_parsed|
+------------+----------------+
|  28-04-2021|      2021-04-28|
|  01-03-2017|      2017-03-01|
|Feb 23, 2012|      2012-02-23|
|May 01, 2019|      2019-05-01|
+------------+----------------+
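
Formatting

If you specifically want the 23-Feb-2012 representation from the question, you can additionally render the parsed date as a string with date_format; a minimal sketch, assuming the timestamp_parsed column from above and the dd-MMM-yyyy pattern:

sparkDF = sparkDF.withColumn('timestamp_formatted',
                             F.date_format(F.col('timestamp_parsed'), 'dd-MMM-yyyy'))

sparkDF.show()

Note that date_format returns a string column, so timestamp_parsed remains a proper date while timestamp_formatted holds values such as 23-Feb-2012.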

Furthermore, you can check the various parsing formats available in the Spark datetime pattern documentation, and a similar question can be found on Stack Overflow.
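
For illustration, a small hedged sketch of two other patterns that to_date accepts; the sample strings below are invented purely for this example:

# hypothetical sample data, purely to illustrate other patterns
demoDF = spark.createDataFrame([("2021/04/28",), ("20170301",)], ['d'])

demoDF = demoDF.withColumn('slash_fmt', F.to_date(F.col('d'), 'yyyy/MM/dd'))\
               .withColumn('compact_fmt', F.to_date(F.col('d'), 'yyyyMMdd'))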

Vaebhav