My data arrives via a TCP input (port 9400 for this index) as a stream from a Go program, with a single newline marking the end of each event. The data is collected from the MITRE CVE GitHub page by a web scraper/crawler that gathers the link to the CVE*.json file, the title of the file, the time the CVE was reported, and the actual contents of the JSON file. All four of these fields are comma delimited, and the first three (called link, title, and contenttime respectively) have field extractions set up. For example, the first CVE ever in their database (CVE-1999-0001.json) produces a stream that looks like this:
https://raw.githubusercontent.com/CVEProject/cvelist/master/1999/0xxx/CVE-1999-0001.json, CVE-1999-0001.json, 30 Dec 99 00:00 -0600, { "CVE_data_meta": { "ASSIGNER": "cve@mitre.org", "ID": "CVE-1999-0001", "STATE": "PUBLIC" }, "affects": { "vendor": { "vendor_data": [ { "product": { "product_data": [ { "product_name": "n/a", "version": { "version_data": [ { "version_value": "n/a" } ] } } ] }, "vendor_name": "n/a" } ] } }, "data_format": "MITRE", "data_type": "CVE", "data_version": "4.0", "description": { "description_data": [ { "lang": "eng", "value": "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets." } ] }, "problemtype": { "problemtype_data": [ { "description": [ { "lang": "eng", "value": "n/a" } ] } ] }, "references": { "reference_data": [ { "name": "http://www.openbsd.org/errata23.html#tcpfix", "refsource": "CONFIRM", "url": "http://www.openbsd.org/errata23.html#tcpfix" }, { "name": "5707", "refsource": "OSVDB", "url": "http://www.osvdb.org/5707" } ] }}
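For illustration, here is a minimal sketch of how the sender builds each event line before writing it to the TCP connection — the formatEvent helper is a hypothetical name (my real scraper does this inline), and the JSON body is truncated:

```go
package main

import "fmt"

// formatEvent joins the 4 scraped fields with comma delimiters and a
// trailing newline, which is what marks the end of the event on the
// TCP stream. (Hypothetical helper, not the actual scraper code.)
func formatEvent(link, title, contenttime, contents string) string {
	return fmt.Sprintf("%s, %s, %s, %s\n", link, title, contenttime, contents)
}

func main() {
	line := formatEvent(
		"https://raw.githubusercontent.com/CVEProject/cvelist/master/1999/0xxx/CVE-1999-0001.json",
		"CVE-1999-0001.json",
		"30 Dec 99 00:00 -0600",
		`{"data_type": "CVE"}`, // real events carry the full JSON body
	)
	// in the real program this is written to the TCP conn on port 9400
	fmt.Print(line)
}
```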
Ideally, there would be something I could put in props.conf (or whichever settings file applies) that prevents duplicates, triplicates, etc. of data whose hash matches an event already in the index, at index time. I know how to use dedup and similar commands at search time, but there has to be a way (hopefully?) to prevent duplicates in the first place, so I don't waste I/O indexing them and then spend CPU/memory/I/O at the end of the day running a search to find and delete the duplicate events myself.
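For context, this is roughly the fallback I'd resort to if no index-time setting exists: deduping in the Go sender itself, before the data ever reaches Splunk. A sketch under my own assumptions — the seen map, hashEvent, and shouldSend names are all hypothetical, and a real version would need to persist the hash set across runs:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seen tracks hashes of events already sent (in-memory only; a real
// implementation would persist this between scraper runs).
var seen = make(map[string]bool)

// hashEvent returns the hex SHA-256 digest of the raw event line.
func hashEvent(event string) string {
	sum := sha256.Sum256([]byte(event))
	return hex.EncodeToString(sum[:])
}

// shouldSend reports whether this event's hash has not been seen yet,
// recording it as seen on the first occurrence.
func shouldSend(event string) bool {
	h := hashEvent(event)
	if seen[h] {
		return false // duplicate: skip it instead of indexing it
	}
	seen[h] = true
	return true
}

func main() {
	fmt.Println(shouldSend("event-1")) // first time: send
	fmt.Println(shouldSend("event-1")) // duplicate: skip
}
```

The obvious downside is that this only catches duplicates the sender itself has seen, which is exactly why I'm hoping Splunk can do it at index time instead.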
Thanks again for the help