Why do real-time queries include duplicates or are incomplete?

New Member

I am completely new to Splunk and experimenting with it. I am trying to use the Java API (from Scala, but that shouldn't matter) to run a real-time query, following these instructions: http://dev.splunk.com/view/java-sdk/SP-CAAAEHQ#realtimejob

Normal queries work, and so do real-time queries the first time. I uploaded my 955 records three times and logged what the job saw:

[2017-07-17 19:39:12,700] INFO seen=2, job.getNumPreviews=6, job.getResultPreviewCount=2, job.getEventCount=2, offs=0
[2017-07-17 19:39:20,795] INFO seen=955, job.getNumPreviews=7, job.getResultPreviewCount=955, job.getEventCount=955, offs=2
[2017-07-17 19:39:25,818] INFO seen=955, job.getNumPreviews=9, job.getResultPreviewCount=955, job.getEventCount=955, offs=955
[2017-07-17 19:39:30,849] INFO seen=957, job.getNumPreviews=10, job.getResultPreviewCount=957, job.getEventCount=957, offs=955
[2017-07-17 19:39:36,487] INFO seen=1408, job.getNumPreviews=11, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=957
[2017-07-17 19:39:41,512] INFO seen=1408, job.getNumPreviews=12, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:39:46,540] INFO seen=1408, job.getNumPreviews=13, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:39:51,559] INFO seen=1408, job.getNumPreviews=14, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:39:56,580] INFO seen=1408, job.getNumPreviews=15, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:40:01,602] INFO seen=1408, job.getNumPreviews=16, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:06,622] INFO seen=1408, job.getNumPreviews=17, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:11,645] INFO seen=1408, job.getNumPreviews=18, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:16,670] INFO seen=1408, job.getNumPreviews=19, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:21,695] INFO seen=1408, job.getNumPreviews=20, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:26,719] INFO seen=1408, job.getNumPreviews=21, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:31,740] INFO seen=1408, job.getNumPreviews=22, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:36,760] INFO seen=1408, job.getNumPreviews=23, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:41,785] INFO seen=1408, job.getNumPreviews=24, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:46,808] INFO seen=1408, job.getNumPreviews=25, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:51,834] INFO seen=1408, job.getNumPreviews=26, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:56,855] INFO seen=1408, job.getNumPreviews=27, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408

So the event count looks right (955, 1910, 2865), but the previews do not contain all the events. If I keep a running offset, some of the records seem to vanish from the previews, so the offset ends up too high and I never see the new records. If I set the offset to 0, I get repeated events.
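To illustrate the offset-0 case described above: one possible workaround for the repeats is to read every preview from the start and drop events that were already handled. The following is only a sketch under assumptions. `SeenFilter` is a hypothetical helper, not part of the Splunk SDK, and the event key is assumed to be some stable identifier per event (for example a hash of the raw text plus its timestamp). It only suppresses duplicates; it does nothing about events that never show up in a preview.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the Splunk SDK): remembers the most
// recently inserted event keys so that repeated events can be skipped.
final class SeenFilter {
    private final Map<String, Boolean> seen;

    SeenFilter(final int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        // Insertion-ordered map that evicts the oldest key once the
        // capacity is exceeded, so memory use stays bounded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a key is offered, and false for a
    // repeat that is still remembered.
    boolean firstTime(String eventKey) {
        return seen.put(eventKey, Boolean.TRUE) == null;
    }
}
```

Each event read from a preview would be forwarded only when `firstTime(key)` returns true. The capacity has to be larger than the number of events that can reappear across previews, otherwise an evicted key is treated as new again.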

There is a comment in the docs:

Depending on the time range to search, the number of events that are arriving to be indexed, and the count of previews to retrieve, the previews from one set to the next might include duplicates or be incomplete.

So I suppose this is expected. I have both start and end set to rt, with no window.

What I am aiming to do is process every event exactly once (and put it on a Kafka topic). So my primary question is: how do I do this?
Should I just query a set of known time windows instead? Or am I missing something, like a commit for real-time queries, that would make them work?
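
For reference, the "known time windows" idea could be sketched roughly as follows. This is only an illustration under assumptions: `WindowPlanner` is a hypothetical helper, not part of the Splunk SDK, times are epoch seconds, and the lag is an assumed safety margin that gives events time to be indexed before their window is queried.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the Splunk SDK): plans consecutive,
// non-overlapping [earliest, latest) windows in epoch seconds.
final class WindowPlanner {
    private final long windowSeconds;
    private final long lagSeconds;
    private long checkpoint; // everything before this time has been handed out

    WindowPlanner(long startEpoch, long windowSeconds, long lagSeconds) {
        if (windowSeconds <= 0 || lagSeconds < 0) {
            throw new IllegalArgumentException("window must be > 0 and lag >= 0");
        }
        this.checkpoint = startEpoch;
        this.windowSeconds = windowSeconds;
        this.lagSeconds = lagSeconds;
    }

    // Returns every complete window that ends at or before (now - lag)
    // and advances the checkpoint past the returned windows.
    List<long[]> nextWindows(long nowEpoch) {
        List<long[]> windows = new ArrayList<>();
        long safeEnd = nowEpoch - lagSeconds;
        while (checkpoint + windowSeconds <= safeEnd) {
            windows.add(new long[] {checkpoint, checkpoint + windowSeconds});
            checkpoint += windowSeconds;
        }
        return windows;
    }

    // Value to persist so that a restart can resume without gaps or overlap.
    long checkpoint() {
        return checkpoint;
    }
}
```

Each returned pair would be used as the earliest and latest time of a normal (non-real-time) search job, presumably via `JobArgs.setEarliestTime` and `JobArgs.setLatestTime` in the Java SDK. Because the windows never overlap, each event timestamp belongs to exactly one window. Events that are indexed later than the lag allows would still be missed, so the lag is a trade-off between latency and completeness.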

The secondary question, out of curiosity: what is the point of the real-time API if it randomly misses events?


New Member

This question seems closely related, but the answer is not definitive: https://answers.splunk.com/answers/243607/whats-the-correct-way-to-get-real-time-continuous.html

This one, https://answers.splunk.com/answers/218075/how-to-create-247-real-time-search-using-the-java.html, is exactly what I am trying to do, but it would be good to get some official confirmation of the best way to do this.