I have a web application that produces a fairly complicated log structure that looks something like the following.
{ "total":6789, data:[{e1}. {e2}, {e3}] }
I have a Python script that scrapes the application every few minutes to get the JSON out of the web app and onto the file system; the file on disk has the same structure as the JSON above.
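For context, the scraper is roughly the following (a simplified sketch; the URL, file path, and function name are placeholders, not the real values):

    # Simplified sketch of the scraper -- URL and output path are placeholders.
    import json
    import urllib.request

    def scrape_once(url="http://myapp.example.com/api/events",
                    out_path="/var/log/myapp/events.json"):
        # Fetch the payload: { "total": N, "data": [ {e1}, {e2}, {e3} ] }
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        # Append the whole payload as one line to the file Splunk monitors.
        with open(out_path, "a") as f:
            f.write(json.dumps(payload) + "\n")
        return payload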
I've been able to break the events out of the data array so that Splunk can index the individual {e1}, {e2}, {e3} events (roughly as in the sketch below). The problem I'm facing is that each time the scraping script runs, I get duplicate events: I seem to get the same event repeated n times until it rolls out of the log.
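If it helps to picture it, the effect of breaking the events out is the same as if each element of data had been written on its own line, something like this (again a simplified sketch with placeholder names and paths; not necessarily how the splitting is actually wired up):

    # Illustrative only: one JSON object per line, so each element of "data"
    # would be indexed as a separate Splunk event. Path is a placeholder.
    import json

    def write_events(payload, out_path="/var/log/myapp/events_split.log"):
        with open(out_path, "a") as f:
            for event in payload.get("data", []):
                f.write(json.dumps(event) + "\n")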
I think the problem is that, over time, the events 'move' through the log files, so to Splunk it looks like the file is always changing.
Over time, the files look something like the following:
{ "total":6743, data:[{e1}. {e2}, {e3}] }
{ "total":6522, data:[{e2}. {e3}, {e4}] }
{ "total":6456, data:[{e3}. {e4}, {e5}] }
This seems to make Splunk index e3 three times.
Is there an easy way to keep Splunk from reindexing events it has already seen, without having to do a bunch of diffing in my script to filter out the duplicates?
Thanks,
Dan