My data arrives via a TCP input (port 9400 for this index) as a stream from a Go program, with a single newline marking the end of each event. The data is collected from the MITRE CVE GitHub page by a web scraper/crawler that gathers the link to the CVE*.json file, the title of the file, the time the CVE was reported, and the actual contents of the JSON file. All four of these fields are comma delimited, and the first three (called link, title, and contenttime respectively) have field extractions set up. For example, the first CVE ever in their database (CVE-1999-0001.json) produces a stream that looks like this:
https://raw.githubusercontent.com/CVEProject/cvelist/master/1999/0xxx/CVE-1999-0001.json, CVE-1999-0001.json, 30 Dec 99 00:00 -0600, { "CVE_data_meta": { "ASSIGNER": "cve@mitre.org", "ID": "CVE-1999-0001", "STATE": "PUBLIC" }, "affects": { "vendor": { "vendor_data": [ { "product": { "product_data": [ { "product_name": "n/a", "version": { "version_data": [ { "version_value": "n/a" } ] } } ] }, "vendor_name": "n/a" } ] } }, "data_format": "MITRE", "data_type": "CVE", "data_version": "4.0", "description": { "description_data": [ { "lang": "eng", "value": "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets." } ] }, "problemtype": { "problemtype_data": [ { "description": [ { "lang": "eng", "value": "n/a" } ] } ] }, "references": { "reference_data": [ { "name": "http://www.openbsd.org/errata23.html#tcpfix", "refsource": "CONFIRM", "url": "http://www.openbsd.org/errata23.html#tcpfix" }, { "name": "5707", "refsource": "OSVDB", "url": "http://www.osvdb.org/5707" } ] }}
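For illustration, here is a minimal sketch of how the sender builds each event line before writing it to the TCP connection — the formatEvent helper is a hypothetical name (my real scraper does this inline), and the JSON body is truncated:

```go
package main

import "fmt"

// formatEvent joins the 4 scraped fields with comma delimiters and a
// trailing newline, which is what marks the end of the event on the
// TCP stream. (Hypothetical helper, not the actual scraper code.)
func formatEvent(link, title, contenttime, contents string) string {
	return fmt.Sprintf("%s, %s, %s, %s\n", link, title, contenttime, contents)
}

func main() {
	line := formatEvent(
		"https://raw.githubusercontent.com/CVEProject/cvelist/master/1999/0xxx/CVE-1999-0001.json",
		"CVE-1999-0001.json",
		"30 Dec 99 00:00 -0600",
		`{"data_type": "CVE"}`, // real events carry the full JSON body
	)
	// in the real program this is written to the TCP conn on port 9400
	fmt.Print(line)
}
```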
Ideally, there would be something I could put in props.conf (or whichever settings file applies) that prevents duplicates, triplicates, etc. of data whose hash matches an event already in the index, at index time. I know how to use dedup and similar commands at search time, but there has to be a way (hopefully?) to prevent duplicates in the first place, so I don't waste I/O indexing them and then spend CPU/memory/I/O at the end of the day running a search to find and delete the duplicate events myself.
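For context, this is roughly the fallback I'd resort to if no index-time setting exists: deduping in the Go sender itself, before the data ever reaches Splunk. A sketch under my own assumptions — the seen map, hashEvent, and shouldSend names are all hypothetical, and a real version would need to persist the hash set across runs:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seen tracks hashes of events already sent (in-memory only; a real
// implementation would persist this between scraper runs).
var seen = make(map[string]bool)

// hashEvent returns the hex SHA-256 digest of the raw event line.
func hashEvent(event string) string {
	sum := sha256.Sum256([]byte(event))
	return hex.EncodeToString(sum[:])
}

// shouldSend reports whether this event's hash has not been seen yet,
// recording it as seen on the first occurrence.
func shouldSend(event string) bool {
	h := hashEvent(event)
	if seen[h] {
		return false // duplicate: skip it instead of indexing it
	}
	seen[h] = true
	return true
}

func main() {
	fmt.Println(shouldSend("event-1")) // first time: send
	fmt.Println(shouldSend("event-1")) // duplicate: skip
}
```

The obvious downside is that this only catches duplicates the sender itself has seen, which is exactly why I'm hoping Splunk can do it at index time instead.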
Thanks again for the help