I've been reading the docs and other questions, and from what I can tell, Splunk is supposed to take an MD5 hash of every incoming event; if an event matches one already in the index, it will drop it rather than duplicate it. However, I'm getting the exact opposite result, and it's important for my project not to spend extra resources on unnecessary work such as reindexing the exact same events over and over. I've included a screenshot of what I'm talking about: on the second run of the same Go file, which sends data via TCP to my index cve, I took an MD5 of the incoming _raw field. As you can see, the hashes of the duplicated events are exactly the same, yet the events are indexed twice. Any help is appreciated.
You've been misinformed. There is little in Splunk to prevent duplication of events. File monitors track their position in a file to avoid re-reading the same data, and a CRC checksum is calculated over the beginning of the file to detect whether it has changed, but there is nothing like what you describe. Splunk does not hash each event, and it certainly does not search all existing events for a matching hash before indexing new data. It's up to the user to handle duplicates, either at index time or at search time.
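For reference, the file-monitor checksum behavior mentioned above is tunable in inputs.conf. The settings below are real Splunk options, though the monitored path here is just an illustration (this doesn't apply to your TCP input, but it shows the only "have I seen this data before" mechanism Splunk has):

```
[monitor:///var/log/example.log]
# Number of bytes at the start of the file used for the CRC (default 256)
initCrcLength = 256
# Mix the source path into the CRC so files with identical headers
# are still treated as distinct sources
crcSalt = <SOURCE>
```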
If you'll share how this data is onboarded we can offer ways to avoid the duplicates.
--- If this reply helps you, an upvote would be appreciated.
My data is added via a TCP port (9400 for this index), and I have a stream of data coming in from a Go program, with a single newline marking the end of each event. The data is collected from the Mitre CVE GitHub page by a web scraper/crawler that gathers the link to the CVE*.json file, the title of the file, the time the CVE was reported, and then the actual contents of the JSON file. All 4 of these fields are comma-delimited, and the first three (called link, title, and contenttime respectively) have field extractions set. For example, the first CVE ever in their database (CVE-1999-0001.json) will output a stream that looks like this:
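A sketch of how such a record might be assembled on the Go side, matching the field order described above. The values here are placeholders, not the actual CVE-1999-0001 contents:

```go
package main

import (
	"fmt"
	"strings"
)

// buildRecord joins the four comma-delimited fields described above
// (link, title, contenttime, then the raw JSON body) and terminates
// the event with the single newline that marks the end of the event.
func buildRecord(link, title, contenttime, jsonBody string) string {
	return strings.Join([]string{link, title, contenttime, jsonBody}, ",") + "\n"
}

func main() {
	// Placeholder values; the real program scrapes these from GitHub.
	rec := buildRecord(
		"https://example.com/CVE-1999-0001.json",
		"CVE-1999-0001",
		"1999-12-30",
		`{"id":"CVE-1999-0001"}`,
	)
	fmt.Print(rec)
}
```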
Optimally, there would be something I could put in props.conf (or whichever settings file applies) that would prevent duplicates, triplicates, etc. of data matching a hash already in the index at index time. I know how to use dedup and the like at search time, but there has to be a way (hopefully?) to prevent duplicates in the first place, so I don't waste I/O indexing them and then CPU/memory/I/O running a search at the end of the day to delete the duplicate events myself.
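Since Splunk has no built-in per-event index-time dedup, one option is to drop duplicates in the Go sender before they ever reach the TCP input. This is a minimal sketch, assuming the set of seen hashes fits in the sender's memory (for long-running processes you'd want to persist or bound it):

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Deduper remembers the MD5 of every event it has already passed
// through and reports whether a new event should be sent.
type Deduper struct {
	seen map[string]bool
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]bool)}
}

// ShouldSend returns true the first time an event is observed and
// false on every repeat, so duplicates never reach Splunk.
func (d *Deduper) ShouldSend(event string) bool {
	sum := md5.Sum([]byte(event))
	key := hex.EncodeToString(sum[:])
	if d.seen[key] {
		return false
	}
	d.seen[key] = true
	return true
}

func main() {
	d := NewDeduper()
	events := []string{"event-A", "event-B", "event-A"}
	for _, e := range events {
		if d.ShouldSend(e) {
			// In the real program this would be conn.Write([]byte(e))
			// on the TCP connection to port 9400.
			fmt.Println("sending:", e)
		} else {
			fmt.Println("skipping duplicate:", e)
		}
	}
}
```

This moves the hashing you expected Splunk to do into the one place that sees every event before it is indexed, so no I/O is spent on the repeats at all.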