I'm currently transforming some date data in a DataFrame that looks like this:
+---------+--------------+
|first_col|sec_col       |
+---------+--------------+
|a        |28-04-2021    |
|a        |01-03-2017    |
|a        |"Feb 23, 2012"|
|a        |"May 01, 2019"|
+---------+--------------+
I now want to transform the last two rows into a better date format, e.g. 23-Feb-2012. I thought of doing this with regular expressions, but the following code doesn't work:
from pyspark.sql import functions as f
from pyspark.sql.functions import regexp_replace, regexp_extract

# (a lot of stuff happens here which is not important for the question, so I left it out)

input_df = (
    input_df
    .withColumn("sec_col", input_df.sec_col.cast("string"))
    .withColumn(
        "sec_col2",
        f.when(
            input_df.sec_col.rlike(r'"\w{3} \d{2}, \d{4}"'),
            f.concat(
                regexp_extract("sec_col", r'"(\w{3}) (\d{2}), (\d{4})"', 2), f.lit("-"),
                regexp_extract("sec_col", r'"(\w{3}) (\d{2}), (\d{4})"', 1), f.lit("-"),
                regexp_extract("sec_col", r'"(\w{3}) (\d{2}), (\d{4})"', 3),
            ),
        ),
    )
    .otherwise(f.col("sec_col"))
)
Can anyone help?