I have to perform batch queries (basically in a loop) from Kafka via Spark, each time starting from the last offset read in the previous iteration, so that I only read new data. This is what I have so far:
Dataset<Row> df = spark
    .read()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-reader")
    .option("enable.auto.commit", true)
    .option("kafka.group.id", "demo-reader") // not sure which of the two to use
    .option("group.id", "demo-reader")
    .option("startingOffsets", "latest")
    .load();
It seems that latest is not supported in batch queries. I'm wondering if it is possible to do something similar in another way (without dealing directly with offsets).
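To clarify what I mean by "dealing directly with offsets": the only fallback I can see is to track them myself between iterations and feed them back through the startingOffsets JSON. Below is a rough, untested sketch of that approach; the single partition 0 and the fixed iteration count are assumptions just for the example.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

public class OffsetLoopSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("demo-reader").getOrCreate();

        // First iteration reads the whole topic; later ones resume from the tracked offset.
        String startingOffsets = "earliest";

        for (int i = 0; i < 10; i++) { // fixed bound only for the sketch
            Dataset<Row> df = spark
                .read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test-reader")
                .option("startingOffsets", startingOffsets)
                .option("endingOffsets", "latest")
                .load();

            // ... process df ...

            // Resume right after the highest offset read in this batch
            // (assumes the topic has only partition 0; with more partitions
            // the JSON would need one entry per partition).
            Row maxRow = df.agg(max(col("offset"))).first();
            if (!maxRow.isNullAt(0)) {
                long next = maxRow.getLong(0) + 1;
                startingOffsets = "{\"test-reader\":{\"0\":" + next + "}}";
            }
        }
    }
}

This works, but it is exactly the offset bookkeeping I was hoping Spark could handle for me (the way group.id / auto-commit would in a plain Kafka consumer).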
EDIT: earliest seems to retrieve all the data contained in the topic.