I am running into the same situation as described in "Spark structured streaming from Kafka - last message processed again after resume from checkpoint": when I restart my Spark job after a failure, the last message gets processed again. One of the answers suggests that the sink has to be idempotent, but I am not sure I understand what that means in practice.
Right now I write to an ES sink, and the three writer methods are implemented as follows:
- the open method always returns true
- the process method does an HTTP POST to ES
- the close method closes the connection
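To make the setup concrete, here is roughly what my writer does, with EsClient and all other names below as illustrative stand-ins rather than my real code (the real one posts over HTTP to ES); it mirrors the open / process / close contract described above:

```python
class EsClient:
    """In-memory stand-in for the HTTP client talking to ES."""
    def __init__(self):
        self.posted = []

    def post(self, doc):
        # stand-in for an HTTP POST; ES assigns a fresh _id each time,
        # so posting the same record twice creates two documents
        self.posted.append(doc)

    def shutdown(self):
        pass


class EsSinkWriter:
    def __init__(self, client):
        self.client = client

    def open(self, partition_id, version):
        # always True, so every partition/version is (re)processed
        return True

    def process(self, record):
        # one POST per record -> a replayed record becomes a duplicate
        self.client.post(record)

    def close(self, error):
        self.client.shutdown()
```

With this shape, replaying the last message after a restart simply posts it a second time, which is exactly the duplication I am seeing.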
I would like to know how to make the ES sink idempotent, and also how to use the two parameters of the open method, partitionId and version, to return false when the data has already been processed.
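My current understanding of "idempotent" here (please correct me if I am wrong) is: derive a stable document id from the record, for example from its Kafka topic/partition/offset, and upsert by that id, so a replayed batch overwrites the same documents instead of appending duplicates. A minimal sketch of that idea, where a dict stands in for the ES index and all names are hypothetical:

```python
def doc_id(topic, partition, offset):
    # stable across replays: the same record always maps to the same id
    return f"{topic}-{partition}-{offset}"


class IdempotentEsIndex:
    """Dict-backed stand-in for an ES index written to by id."""
    def __init__(self):
        self.index = {}

    def put(self, _id, doc):
        # upsert by deterministic id: same id -> same slot, no duplicate
        self.index[_id] = doc
```

Is this the right idea, and is keying documents by offset the usual way to do it with an ES sink?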
Thanks in advance.