Splunk Search

Why are real-time query previews duplicated or incomplete?

nigelbrown
New Member

I am absolutely new to Splunk and having a play. I was trying to use the Java API (through Scala, but that shouldn't matter) to run a real-time query based on these instructions: http://dev.splunk.com/view/java-sdk/SP-CAAAEHQ#realtimejob

Normal queries work, and so do real-time queries the first time. I upload my 955 records three times and log what the job sees:

[2017-07-17 19:39:12,700] INFO seen=2, job.getNumPreviews=6, job.getResultPreviewCount=2, job.getEventCount=2, offs=0
[2017-07-17 19:39:20,795] INFO seen=955, job.getNumPreviews=7, job.getResultPreviewCount=955, job.getEventCount=955, offs=2
[2017-07-17 19:39:25,818] INFO seen=955, job.getNumPreviews=9, job.getResultPreviewCount=955, job.getEventCount=955, offs=955
[2017-07-17 19:39:30,849] INFO seen=957, job.getNumPreviews=10, job.getResultPreviewCount=957, job.getEventCount=957, offs=955
[2017-07-17 19:39:36,487] INFO seen=1408, job.getNumPreviews=11, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=957
[2017-07-17 19:39:41,512] INFO seen=1408, job.getNumPreviews=12, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:39:46,540] INFO seen=1408, job.getNumPreviews=13, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:39:51,559] INFO seen=1408, job.getNumPreviews=14, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:39:56,580] INFO seen=1408, job.getNumPreviews=15, job.getResultPreviewCount=1408, job.getEventCount=1910, offs=1408
[2017-07-17 19:40:01,602] INFO seen=1408, job.getNumPreviews=16, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:06,622] INFO seen=1408, job.getNumPreviews=17, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:11,645] INFO seen=1408, job.getNumPreviews=18, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:16,670] INFO seen=1408, job.getNumPreviews=19, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:21,695] INFO seen=1408, job.getNumPreviews=20, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:26,719] INFO seen=1408, job.getNumPreviews=21, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:31,740] INFO seen=1408, job.getNumPreviews=22, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:36,760] INFO seen=1408, job.getNumPreviews=23, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:41,785] INFO seen=1408, job.getNumPreviews=24, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:46,808] INFO seen=1408, job.getNumPreviews=25, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:51,834] INFO seen=1408, job.getNumPreviews=26, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408
[2017-07-17 19:40:56,855] INFO seen=1408, job.getNumPreviews=27, job.getResultPreviewCount=1408, job.getEventCount=2865, offs=1408

So the event count seems OK (it reaches 2865 = 3 × 955), but I don't get all the events in the previews: the preview count stalls at 1408. If I try to keep a running offset, some of the records seem to vanish, so my count ends up too high and I don't see the new records. If I set the offset to 0, I get repeated events.

There is a comment in the docs:

"Depending on the time range to search, the number of events that are arriving to be indexed, and the count of previews to retrieve, the previews from one set to the next might include duplicates or be incomplete."

So I suppose this is expected. I have both start and end set to rt, with no window.
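If duplicates between previews are expected, one client-side workaround might be to remember which events have already been forwarded and skip them. Below is a minimal sketch in plain Java, assuming each event can be reduced to some unique string key. The class name PreviewDeduplicator and the key format are hypothetical and not part of the Splunk SDK. It only addresses repeated events, not events that never show up in a preview.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Filters overlapping preview snapshots down to events not forwarded before. */
final class PreviewDeduplicator {
    // Keys of events already forwarded. Unbounded in this sketch;
    // a long-running job would need some form of eviction.
    private final Set<String> seenKeys = new HashSet<>();

    /**
     * @param previewKeys one (hypothetical) unique key per event in the current preview
     * @return the keys that did not appear in any earlier preview, in preview order
     */
    List<String> newEvents(List<String> previewKeys) {
        List<String> fresh = new ArrayList<>();
        for (String key : previewKeys) {
            // Set.add returns false when the key was already present.
            if (seenKeys.add(key)) {
                fresh.add(key);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        PreviewDeduplicator dedup = new PreviewDeduplicator();
        System.out.println(dedup.newEvents(Arrays.asList("a", "b")));      // prints [a, b]
        System.out.println(dedup.newEvents(Arrays.asList("b", "c", "a"))); // prints [c]
    }
}
```

With something like this, the offset could stay at 0 on every poll and only the returned keys would be sent on.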

What I am aiming to do is process every event exactly once (and put it on a Kafka topic). So my primary question is: how do I do this?
Should I just query a set of known time windows? Or am I missing something, like a commit for real-time queries, that would make them work?
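To make the time-window idea concrete: instead of a real-time job, ordinary searches could be run over consecutive, non-overlapping windows that trail the current time by a safety lag, so that late-arriving events have a chance to be indexed first. The sketch below only plans the windows. WindowPlanner, its parameters and the lag are assumptions for illustration, and the bounds would still have to be passed as the start and end times of a normal search.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Plans consecutive, non-overlapping search windows that trail "now" by a lag. */
final class WindowPlanner {
    /** A half-open interval [earliest, latest). */
    static final class Window {
        final Instant earliest;
        final Instant latest;

        Window(Instant earliest, Instant latest) {
            this.earliest = earliest;
            this.latest = latest;
        }

        @Override
        public String toString() {
            return "[" + earliest + ", " + latest + ")";
        }
    }

    /**
     * Returns every complete window of the given size that starts at
     * lastProcessed and ends no later than (now - lag).
     */
    static List<Window> plan(Instant lastProcessed, Instant now, Duration size, Duration lag) {
        if (size.isZero() || size.isNegative()) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<Window> windows = new ArrayList<>();
        Instant limit = now.minus(lag);
        Instant start = lastProcessed;
        // Emit a window only when it ends at or before the limit.
        while (!start.plus(size).isAfter(limit)) {
            Instant end = start.plus(size);
            windows.add(new Window(start, end));
            start = end; // the next window begins exactly where this one ended
        }
        return windows;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Window> windows = plan(now.minusSeconds(100), now,
                Duration.ofSeconds(30), Duration.ofSeconds(10));
        for (Window w : windows) {
            System.out.println(w);
        }
    }
}
```

After each window is searched, its latest bound would be stored as the new lastProcessed, so a restart resumes without gaps or overlap. Events indexed later than the lag allows would still be missed, so the lag is a trade-off between latency and completeness.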

The secondary question, out of curiosity: what is the point of the real-time API if it randomly misses events?


nigelbrown
New Member

This question seems closely related, but the answer is not definitive: https://answers.splunk.com/answers/243607/whats-the-correct-way-to-get-real-time-continuous.html

This one, https://answers.splunk.com/answers/218075/how-to-create-247-real-time-search-using-the-java.html, is exactly what I am trying to do, but it would be good to get some official confirmation of the best way to do it.
