Re: Why is the Splunk Java SDK returning duplicate...

wlawrence_inter · ‎04-19-2016

Hello Splunk community,

I'm using the current Java SDK (1.5.0) to programmatically export large amounts of data from Splunk.

I'm doing this by submitting an export job;

JobExportArgs exportArgs = new JobExportArgs();
exportArgs.setEarliestTime(new DateTime(startMsEpoch).toString());
exportArgs.setLatestTime(new DateTime(endMsEpoch).toString());
exportArgs.setOutputMode(JobExportArgs.OutputMode.XML);
exportArgs.setSearchMode(JobExportArgs.SearchMode.NORMAL);

exportJobStream = splunkService.export("search " + searchFilters, exportArgs);

(Note that the code samples here are not 100% complete; I haven't yet created the minimum code required to reproduce the problem.)

Then I create a multi-results reader, and a results iterator;

splunkResultsReader = new MultiResultsReaderXml(exportJobStream);
splunkResultIterator = splunkResultsReader.iterator();

After this, I essentially check if splunkResultIterator.hasNext(), and if so, I grab splunkEventIterator = splunkResultIterator.next().iterator().
While splunkEventIterator.hasNext(), I grab and process Splunk events. When that iterator no longer hasNext(), I move on to the next value in splunkResultIterator via .hasNext() and .next(). When that iterator no longer has values, I consider the search complete, and all data extracted.
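
The nested loop described above can be sketched roughly as follows. This is a minimal illustration using stand-in types rather than the SDK's classes: ResultBatch, its preview flag, and collectFinalEvents are hypothetical names introduced only for this example. The preview check is included because previews come up later in this thread; whether the SDK's own result-set type exposes a comparable flag is an assumption to verify against the SDK documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NestedResultLoop {
    /** Hypothetical stand-in for one result set in the export stream; NOT an SDK class. */
    static final class ResultBatch {
        final boolean preview;
        final List<String> events;

        ResultBatch(boolean preview, String... events) {
            this.preview = preview;
            this.events = Arrays.asList(events);
        }
    }

    /** Walks every batch in order and collects events from non-preview batches only. */
    static List<String> collectFinalEvents(Iterable<ResultBatch> batches) {
        List<String> out = new ArrayList<>();
        for (ResultBatch batch : batches) {      // outer iterator: one element per result set
            if (batch.preview) {
                continue;                        // skip batches flagged as previews
            }
            for (String event : batch.events) {  // inner iterator: events within that set
                out.add(event);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<ResultBatch> stream = Arrays.asList(
                new ResultBatch(true, "e1"),          // preview batch, ignored
                new ResultBatch(false, "e1", "e2"),
                new ResultBatch(false, "e3"));
        System.out.println(collectFinalEvents(stream));
    }
}
```

If preview batches were not skipped in this toy stream, "e1" would be collected twice, which is one way extra records could appear.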

This has mostly worked fine for us, but when we search for more than ~250K records, we start getting ~10K more records than the Splunk UI returns. For ~660K records, I get ~25K more than expected.

If I say, save the event_id values for those events to a file, and take a look at it, I find that those extra 10K records are all duplicates. I can also run the search query on the Splunk UI along with | stats count by event_id and I never see duplicates, so I'm a little confused.

I'm looking for any suggestions and/or insight into the problem. I will continue to test, and will try to set up a minimum set of code to reproduce the issue, rather than taking snippets out of our code base.

--Wes

wlawrence_inter · ‎04-26-2016

I have written some code that reproduces the issue for me, and uploaded it to github, here;

https://github.com/FileTrek/SplunkJavaSdkDuplicateRepoduction

The README outlines what needs to be changed to have the tests work in your own environments.

tthrockm · ‎04-19-2016

Recently found/fixed a similar issue with the Python SDK, related to previews being mixed in with the results:

https://answers.splunk.com/answers/391949/unreliable-export-using-python-sdk.html#answer-391220

I don't know offhand how to account for this using the Java SDK - take a look at http://docs.splunk.com/Documentation/JavaSDK if you haven't already - maybe set JobExportArgs.OutputMode to RAW ?

wlawrence_inter · ‎04-20-2016

OK. I re-ran my tests using JobExportArgs.OutputMode.JSON, but read the data manually and parsed each line with Jackson, rather than using a MultiResultsReaderJson.

Each record did have a "preview":... value. However, it was always false, so previews are not causing duplicates it seems.

Thanks for the help though.

wlawrence_inter · ‎04-20-2016

You know what, maybe there's something to the preview stuff here. The MultiResultsReaders may be removing any 'preview' part by the time it gets to me.

According to this; http://dev.splunk.com/view/java-sdk/SP-CAAAEPZ#export

A reporting (transforming) search returns a set of previews followed by the final events, each as separate elements.
A non-reporting (non-transforming) search returns events as they are read from the index, each as separate elements.
A real-time search returns multiple sets of previews, each preview as a separate element.

But through the SDK, I'm only allowed to choose REALTIME or NORMAL. Maybe normal can return some preview events? Depending on if it's reporting or non-reporting?

wlawrence_inter · ‎04-20-2016

Good thoughts, and thanks for the advice, but sadly doesn't seem to help my case =(

Apparently, only JobExportArgs.SearchMode.REALTIME may return previews, whereas JobExportArgs.SearchMode.NORMAL shouldn't (as far as I can tell from reading http://dev.splunk.com/view/java-sdk/SP-CAAAEHQ).

I also tracked both the original, and duplicated Splunk Events in debug, and neither seems to be marked as 'preview', at least, as far as com.splunk.Event objects can tell...

I can't try JobExportArgs.OutputMode.RAW AFAIK, since there isn't a MultiResultsReaderRaw? I'll poke around, maybe there's a different way to parse RAWs rather than MultiResultsReaders...

Thanks again, gives me more to think about, and poke around with. 😃

wlawrence_inter · ‎04-19-2016

It may be worth noting that the entire data set is >10M records, but with the search query we're using, the results are in the hundreds of thousands range.

ineeman · ‎04-26-2016

If you just run the same search via curl (export and everything) and dump it all into a file (until it completes), do you also see the duplicates there? My main question is whether we're seeing duplicates that the SDK is generating due to some error or whether Splunk is returning unexpected results.

By the way, is event_id something in your data, or something you are seeing Splunk return?

Finally, keep in mind that it's not necessarily correct to compare what you see in the UI vs what you see from the SDK - the UI passes in quite a few different parameters that the SDK does not by default (for a variety of reasons), and the UI is not running an export search.

wlawrence_inter · ‎04-26-2016

1)
Here's the UI search I use right now;

index="wineventlog" source="WinEventLog:Security" (EventCode=4624 OR EventCode=4625 OR EventCode=4648 OR EventCode=4634 OR EventCode=4647 OR EventCode=4768 OR EventCode=4769 OR EventCode=4770 OR EventCode=4771 OR EventCode=4772 OR EventCode=4773 OR EventCode=4776 OR EventCode=4777) | fields dvc,event_id | fields - _raw

If I add | stats count, I get 377,173 records. If I do | stats count by event_id, I never see a count over 1 (no duplicates on the event_id key).

If I do a cURL export like this

curl -k -u <username>:<password> -d "output_mode=csv" -o /data/splunkCurlExport/results.csv https://<server>:8089/servicesNS/<username>/search/search/jobs/export --data-urlencode 'search=search earliest="04/01/2016:00:00:00" latest="04/09/2016:00:00:00" index="wineventlog" source="WinEventLog:Security" (EventCode=4624 OR EventCode=4625 OR EventCode=4648 OR EventCode=4634 OR EventCode=4647 OR EventCode=4768 OR EventCode=4769 OR EventCode=4770 OR EventCode=4771 OR EventCode=4772 OR EventCode=4773 OR EventCode=4776 OR EventCode=4777) | table _time,dvc,event_id'

I get a CSV file with 377,173 records. Performing some quick stats with cat /data/splunkCurlExport/results.csv | egrep -o "[A-Z0-9]+,[0-9]+" | sort | uniq -c, I see no duplicates in this data, and get 377,173 records.

The Java SDK export job uses the same search as the above, and currently produces 382,918 records in total (for me at least). When tracking duplicates (based on the dvc and event_id keys), there are 5,745 duplicates (382,918 - 5,745 = 377,173).

2) event_id is a key specific to Windows Event Logs. I'm actually using dvc + event_id as a unique key for a Windows Event Log entry (with the UI search, I've just been using event_id so far).

3) As far as comparing data between the UI and the SDK, fair enough. Most importantly, I'm trying to figure out why the results are different, and why the results seem to contain duplicate records.
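
For anyone who wants to run the same kind of check, the duplicate tracking mentioned in 1) can be sketched along these lines. This is only an illustration of the idea, not the code from our code base: the DuplicateCounter class and its method names are made up for this example, and it assumes dvc and event_id have already been extracted from each event as strings.

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateCounter {
    private final Set<String> seen = new HashSet<>();
    private long total = 0;
    private long duplicates = 0;

    /** Records one event; returns true if this dvc + event_id pair was already seen. */
    boolean record(String dvc, String eventId) {
        total++;
        // The NUL separator keeps ("a", "bc") distinct from ("ab", "c");
        // it assumes neither field contains a NUL character.
        String key = dvc + '\u0000' + eventId;
        boolean isDuplicate = !seen.add(key);
        if (isDuplicate) {
            duplicates++;
        }
        return isDuplicate;
    }

    long total() { return total; }

    long duplicates() { return duplicates; }

    long unique() { return seen.size(); }

    public static void main(String[] args) {
        DuplicateCounter counter = new DuplicateCounter();
        counter.record("HOST-A", "1001");
        counter.record("HOST-A", "1002");
        counter.record("HOST-B", "1001"); // same event_id, different dvc: not a duplicate
        counter.record("HOST-A", "1001"); // repeat of the first event
        System.out.println("total=" + counter.total()
                + " unique=" + counter.unique()
                + " duplicates=" + counter.duplicates());
    }
}
```

With this accounting, total minus duplicates equals the unique count, which is the same arithmetic as 382,918 - 5,745 = 377,173 above. Keeping every key in memory is fine at a few hundred thousand records but would need rethinking for much larger exports.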

ineeman · ‎04-26-2016

Thanks Wes - this is useful information. Can you try one more thing, which is to do the export as XML and not as CSV from curl? I imagine it'll be the same, but I just want to double check.

We have no known issue with duplicates in the SDK, but that doesn't mean there isn't one. I tried to reproduce this on a large-ish dataset (>1M records), but no luck, with similar code.

wlawrence_inter · ‎04-26-2016

I'm getting weird results with the XML export. There seem to be previews mixed in? I don't have an easy way to filter out previews in the XML export at the moment, but it looks like ~168K records came back, of which only ~50K might be non-preview.

For the Java SDK, I typically use JSON (I was trying XML as well when I wrote this question). So I also ran a cURL export to json.

That job returns 618,701 results, but I can more easily grep for "preview:false", and I see 377,173 non-preview records.

I'm touching up the minimum-code-to-reproduce and will publish it to GitHub. I use JSON export mode for those tests. Some of them use the MultiResultsReaderJson, but more importantly, I also have tests that read the export stream directly and parse the JSON with Jackson instead of the MultiResultsReader. In that case preview is always false for those records, yet more records than expected are always returned (~382K vs ~377K expected)...
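
The preview count from the JSON export can also be done in Java instead of grep. The sketch below is a rough stand-in, not the Jackson-based code from my tests: PreviewLineCounter is a made-up name, and it assumes each line of the export is one JSON object carrying a top-level "preview" field set to true or false. It matches on text rather than parsing JSON, so an event whose own data happens to contain the same text would be miscounted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PreviewLineCounter {
    private static final Pattern PREVIEW_TRUE = Pattern.compile("\"preview\"\\s*:\\s*true");
    private static final Pattern PREVIEW_FALSE = Pattern.compile("\"preview\"\\s*:\\s*false");

    /** Returns {previewLines, nonPreviewLines, otherLines}; blank lines are ignored. */
    static long[] count(Iterable<String> lines) {
        long preview = 0;
        long nonPreview = 0;
        long other = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            if (PREVIEW_TRUE.matcher(line).find()) {
                preview++;
            } else if (PREVIEW_FALSE.matcher(line).find()) {
                nonPreview++;
            } else {
                other++; // no recognizable preview flag on this line
            }
        }
        return new long[] {preview, nonPreview, other};
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed one-object-per-line shape.
        List<String> sample = Arrays.asList(
                "{\"preview\":true,\"offset\":0,\"result\":{\"event_id\":\"1\"}}",
                "{\"preview\":false,\"offset\":0,\"result\":{\"event_id\":\"1\"}}",
                "{\"preview\":false,\"offset\":1,\"result\":{\"event_id\":\"2\"}}");
        long[] counts = count(sample);
        System.out.println("preview=" + counts[0]
                + " nonPreview=" + counts[1]
                + " other=" + counts[2]);
    }
}
```

For a real export file, the lines could be supplied with java.nio.file.Files.readAllLines, memory permitting.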

ineeman · ‎04-26-2016

Great - once you have the repro, ping me. My email is my username here @splunk.com 🙂

Why is the Splunk Java SDK returning duplicate events compared to Splunk Web?
