I ran a count of attempts by (user,app) over a time window of day(86400). I want to extract the rows with latest timestamp with the count and remove unnecessary previous counts. Make sure your answer considers the time window. One user with 1 device can do make multiple attempts a day or a week, I wanna be able to retrieve those particular moments with the final count in every specific window.
My intial dataset is like this:
val df = sc.parallelize(Seq(
("user1", "iphone", "2017-12-22 10:06:18", "Success"),
("user1", "iphone", "2017-12-22 11:15:12", "failed"),
("user1", "iphone", "2017-12-22 12:06:18", "Success"),
("user1", "iphone", "2017-12-22 09:15:12", "failed"),
("user1", "iphone", "2017-12-20 10:06:18", "Success"),
("user1", "iphone", "2017-12-20 11:15:12", "failed"),
("user1", "iphone", "2017-12-20 12:06:18", "Success"),
("user1", "iphone", "2017-12-20 09:15:12", "failed"),
("user1", "android", "2017-12-20 09:25:20", "Success"),
("user1", "android", "2017-12-20 09:44:22", "Success"),
("user1", "android", "2017-12-20 09:58:22", "Success"),
("user1", "iphone", "2017-12-20 16:44:20", "Success"),
("user1", "iphone", "2017-12-20 16:44:25", "Success"),
("user1", "iphone", "2017-12-20 16:44:35", "Success")
)).toDF("username", "device", "date_time", "status")
The code I ran and what I got.
// Basically I'm looking 1 day which is 86400 seconds
val w1 = Window.partitionBy("username", "device")
.orderBy(col("date_time").cast("date_time").cast("long").desc)
.rangeBetween(-86400, 0)
val countEveryAttemptDF = df.withColumn("attempts", count("device").over(w1))
Now I have
// countEveryAttemptDF.show
+--------+--------------+---------------------+-------+--------+
|username|. device| date_time| status|attempts|
+--------+--------------+---------------------+-------+--------+
| user1| android| 2017-12-20 09:58:22|Success| 1|
| user1| android| 2017-12-20 09:44:22|Success| 2|
| user1| android| 2017-12-20 09:25:20|Success| 3|
| user1| iphone| 2017-12-22 12:06:18|Success| 1|
| user1| iphone| 2017-12-22 11:15:12| failed| 2|
| user1| iphone| 2017-12-22 10:06:18|Success| 3|
| user1| iphone| 2017-12-22 09:15:12| failed| 4|
| user1| iphone| 2017-12-20 16:44:35|Success| 1|
| user1| iphone| 2017-12-20 16:44:25|Success| 2|
| user1| iphone| 2017-12-20 16:44:20|Success| 3|
| user1| iphone| 2017-12-20 12:06:18|Success| 4|
| user1| iphone| 2017-12-20 11:15:12| failed| 5|
| user1| iphone| 2017-12-20 10:06:18|Success| 6|
| user1| iphone| 2017-12-20 09:15:12| failed| 7|
+--------+--------------+---------------------+-------+--------+
What I want. So I want the latest timestamp along with its count by making sure I'm in the same time window.
+--------+--------------+---------------------+-------+--------+
|username|. device| date_time| status|attempts|
+--------+--------------+---------------------+-------+--------+
| user1 | android | 2017-12-20 09:25:20|Success| 3|
| user1 | iphone | 2017-12-22 09:15:12| failed| 4|
| user1 | iphone | 2017-12-20 09:15:12| failed| 7|
+--------+--------------+---------------------+-------+--------+**