I am relatively new here, so I will try to follow the norms of SO.
I am working with Spark on Databricks, on the following data:
Distinct_Id                   Event         Date
*some alphanumerical value*   App Access    2018-01-09
*some alphanumerical value*   App Opened    2017-23-01
...                           ...           ...
What the data means:
Every distinct_id identifies a distinct user. There are four main events: App Access, App Opened, App Launched, and mediaReady.
The problem:
I am trying to find, for each distinct_id:
- the first app access date, where app access is defined as event IN ('App Access', 'App Opened', 'App Launched');
- the first app viewed date, where app viewed is defined as event == 'mediaReady'.
My data is stored in Parquet files and the volume is large (two years of data).
I tried the following to find the first app access date (the string passed to spark.sql needs triple quotes to span multiple lines):

temp_result = spark.sql("""
    WITH cte AS (
        SELECT gaid,
               event,
               event_date,
               RANK() OVER (PARTITION BY gaid ORDER BY event_date) AS rnk
        FROM df_raw_data
        WHERE upper(event) IN ('APP LAUNCHED', 'APP OPENED', 'APP ACCESS')
        GROUP BY gaid, event, event_date
    )
    SELECT DISTINCT gaid, event_date, event FROM cte WHERE rnk = 1
""")
I am trying to write a robust query that will keep scaling as the data grows. I hope I've described the problem clearly.