I am relatively new here, so I will try to follow the norms of SO.
I am working with Spark on Databricks, on the following data:
Distinct_Id                   Event         Date
*some alphanumerical value*   App Access    2018-01-09
*some alphanumerical value*   App Opened    2017-23-01
...                           ...           ...
What the data means:
Every distinct_id identifies a distinct user. There are four main events: App Access, App Opened, App Launched, and mediaReady.
The problem:
I am trying to find, for each distinct_id:
- the first app access date, where app access is defined as event IN ('App Access', 'App Opened', 'App Launched');
- the first app viewed date, where app viewed is defined as event == 'mediaReady'.
My data is stored in Parquet files and the volume is large (two years of data).
I tried the following to find the first app access date (the string passed to spark.sql needs triple quotes to span multiple lines):

temp_result = spark.sql("""
    WITH cte AS (
        SELECT gaid,
               event,
               event_date,
               RANK() OVER (PARTITION BY gaid ORDER BY event_date) AS rnk
        FROM df_raw_data
        WHERE upper(event) IN ('APP LAUNCHED', 'APP OPENED', 'APP ACCESS')
        GROUP BY gaid, event, event_date
    )
    SELECT DISTINCT gaid, event_date, event FROM cte WHERE rnk = 1
""")
I am trying to write a robust query that will keep scaling as the data grows. I hope I've described the problem clearly.