1

I'm attempting to re-create Google Analytics "Last Non-Direct Click" marketing attribution methodology on top of Amplitude data set. In basic terms, this means if a user has a session where the channel is direct, it looks backwards in time and finds the most recent non-direct channel value.

I have a table that looks roughly like this. I also have the timestamp of the session start. Session ID is in order with which the session occurred, ASC.

Session ID Marketing Channel
1 direct
2 paid
3 direct
4 organic search
5 direct

I want to create a third column where the values would look like this

Session ID Marketing Channel Last Non-Direct Click Channel
1 direct direct
2 paid paid
3 direct paid
4 organic search organic search
5 direct organic search
6 direct organic search
7 direct organic search

I need to take any value in the Channel Tactic column where = to "Direct" and replace it with the previous Channel Tactic value that is not <> direct. It has to be the most recent one historically, it cannot be just any that is not direct from the past. So in the above example, row 5 "direct" cannot be either paid or organic search, it must be organic search.

I was thinking of leveraging a window function join but I can't think of way to make sure I'm only grabbing the most recent non-direct channel, not just any random non-direct channel. Keep in mind my data set is huge and most users have many sessions with multiple different marketing channels.

I'm looking for a solution that will be compatible in Snowflake, thank you!

Mogsdad
  • 44,709
  • 21
  • 151
  • 275
  • What is the definition of most recent, as there is not date column, is it defined by session id order? Also, is the data for multiple users with multiple sessions? – Pankaj Jul 11 '22 at 00:36
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Jul 11 '22 at 12:54

3 Answers3

1

Edit: I thought of a way to implement this as a single window function without self-joining.

create or replace table T1(USER_ID int, SESSION_ID int, CHANNEL string);

insert into T1(USER_ID, SESSION_ID, CHANNEL) values 
(1, 1,  'direct'),
(1, 2,  'paid'),
(1, 3,  'direct'),
(1, 4,  'organic search'),
(1, 5,  'direct'),
(1, 6,  'direct'),
(1, 7,  'direct');

select   T1.* 
        ,case 
            when SESSION_ID = 1 or CHANNEL <> 'direct' then CHANNEL
            else lag(iff(CHANNEL = 'direct', null, CHANNEL)) ignore nulls 
                    over (partition by USER_ID order by SESSION_ID) 
         end as LAST_NON_DIRECT_CHANNEL
from T1;

This works because for the purposes of the lag function, we're treating the word direct as null using the iff function. The case statement indicates that if it's the first row, use the CHANNEL even if it is direct. From then on, if the CHANNEL is direct, lag back to the previous non-null (non direct using the iff function) value.

Greg Pavlik
  • 10,089
  • 2
  • 12
  • 29
0

Greg's solution is pretty clever, but here is an alternative if you prefer a 2-step solution for readability.

with cte as
(select *, max(case when channel <> 'direct' then session_id end) over (partition by user_id order by session_id) as last_non_drct_sess_id
 from t)

select *, max(case when last_non_drct_sess_id = session_id  or last_non_drct_sess_id is null then channel end) over (partition by user_id, last_non_drct_sess_id) as last_non_drct_chnl
from cte
Radagast
  • 5,102
  • 3
  • 12
  • 27
0

In bigquery case statement will look like this:

create or replace table `dataset.T1` (USER_ID int, SESSION_ID int, CHANNEL string);

INSERT INTO `dataset.T1` (USER_ID, SESSION_ID, CHANNEL) values 
(1, 1,  'direct'),
(1, 2,  'paid'),
(1, 3,  'direct'),
(1, 4,  'organic search'),
(1, 5,  'direct'),
(1, 6,  'direct'),
(1, 7,  'direct');

select   T1.* 
        ,    CASE 
        WHEN trafficSource.source != "(direct)" THEN trafficSource.source
        WHEN trafficSource.source = "(direct)" THEN LAST_VALUE( NULLIF(trafficSource.source, "(direct)") IGNORE NULLS ) OVER(PARTITION BY user.clientId ORDER BY user.visitNumber) 
    END AS last_non_direct_source
from T1;
ninetor
  • 1
  • 1
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 12 '22 at 17:04