<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Duplicate entries with continuous csv indexing in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445944#M168899</link>
    <description>&lt;P&gt;What I failed to mention is what my input file looks like:&lt;BR /&gt;</description>
&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/5160iADAC77EEECC4F4E4/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;I rebuild this file from scratch with a python script, based on the xml file, every minute, and this csv file is monitored continuously in Splunk. On each update the values for a given MAC address change: signal strength, last seen, and so on. Yet every time I rebuild the file, the csv in Splunk shows a new entry for each MAC address, even if it was already indexed. My main point is not to add a new MAC entry but to update the signal value and the other fields.&lt;/P&gt;</description>
    <pubDate>Wed, 13 Jun 2018 18:04:08 GMT</pubDate>
    <dc:creator>haker146</dc:creator>
    <dc:date>2018-06-13T18:04:08Z</dc:date>
    <item>
      <title>Duplicate entries with continuous csv indexing</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445942#M168897</link>
      <description>&lt;P&gt;Hello, I am writing to you with a small problem. I am building a Wi-Fi monitoring system for my diploma thesis. I use the Kismet software, which creates a netxml file that I then parse to csv. I want to monitor this csv file continuously in Splunk and watch the changes in signal strength over time. Unfortunately, every time my netxml file is regenerated and a new csv is created under the same name, more and more duplicates of the same network appear, as shown in the figure below. &lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/5162iF5F8B80D3044975D/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;Please help: what should I do so that these duplicates do not arise and there is only one original entry per network?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jun 2018 17:32:29 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445942#M168897</guid>
      <dc:creator>haker146</dc:creator>
      <dc:date>2018-06-13T17:32:29Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate entries with continuous csv indexing</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445943#M168898</link>
      <description>&lt;P&gt;When you reboot, it generates a new log file under the same name. Does it still contain the old entries? Are you able to control the name of the file it generates? It sounds like each time you reboot, the entire file is reindexed. You should be able to control this using your inputs.conf. Could you share what your inputs.conf for this input looks like?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jun 2018 17:46:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445943#M168898</guid>
      <dc:creator>mdsnmss</dc:creator>
      <dc:date>2018-06-13T17:46:31Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate entries with continuous csv indexing</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445944#M168899</link>
      <description>&lt;P&gt;What I failed to mention is what my input file looks like:&lt;BR /&gt;
&lt;span class="lia-inline-image-display-wrapper" image-alt="alt text"&gt;&lt;img src="https://community.splunk.com/t5/image/serverpage/image-id/5160iADAC77EEECC4F4E4/image-size/large?v=v2&amp;amp;px=999" role="button" title="alt text" alt="alt text" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;I rebuild this file from scratch with a python script, based on the xml file, every minute, and this csv file is monitored continuously in Splunk. On each update the values for a given MAC address change: signal strength, last seen, and so on. Yet every time I rebuild the file, the csv in Splunk shows a new entry for each MAC address, even if it was already indexed. My main point is not to add a new MAC entry but to update the signal value and the other fields.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jun 2018 18:04:08 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445944#M168899</guid>
      <dc:creator>haker146</dc:creator>
      <dc:date>2018-06-13T18:04:08Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate entries with continuous csv indexing</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445945#M168900</link>
      <description>&lt;P&gt;By adding a timestamp to each event in the csv, and treating each csv as a snapshot in time, the duplicates can then be handled within Splunk. This also gives the advantage of being able to plot how the signal changes over time in Splunk. Splunk can be thought of as a time series database, so adding events with the same data but different timestamps is fine.&lt;/P&gt;

&lt;P&gt;Here is a generated-event query to show the concept of how signal strength changes over time:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| makeresults count=4 
| streamstats count 
| eval _time = _time - (count*3600) 
| eval mac_address = "00:1D:0F:FB:40:4A", channel=6, signal=(random()%10)-74 
| timechart max(signal) as Signal span=1h
&lt;/CODE&gt;&lt;/PRE&gt;
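
&lt;P&gt;Building on the same generated events, a de-duplicated table (one row per network, latest values only) can then be produced with stats, for example:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| makeresults count=4 
| streamstats count 
| eval _time = _time - (count*3600) 
| eval mac_address = "00:1D:0F:FB:40:4A", channel=6, signal=(random()%10)-74 
| stats latest(channel) AS channel, latest(signal) AS signal by mac_address
&lt;/CODE&gt;&lt;/PRE&gt;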

&lt;P&gt;Also, within Splunk you will be able to query the first seen and last seen values, so there is no need to generate these fields in the extract itself:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;| makeresults count=4 
| streamstats count 
| eval _time = _time - (count*3600) 
| eval mac_address = "00:1D:0F:FB:40:4A", channel=6, signal=count-74 
| eval time=strftime(_time,"%y/%m/%d %H:%M:%S") 
| stats min(time) As first_seen, max(time) AS last_seen by mac_address
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;To map a field in the csv extract to a Splunk timestamp, see &lt;A href="https://docs.splunk.com/Documentation/Splunk/7.1.1/Data/HowSplunkextractstimestamps"&gt;https://docs.splunk.com/Documentation/Splunk/7.1.1/Data/HowSplunkextractstimestamps&lt;/A&gt;&lt;/P&gt;
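
&lt;P&gt;As a rough sketch only (the sourcetype stanza and the column name "time" are illustrative, assuming the csv has a header row and a time column written in the format used above), the props.conf side of that could look like this:&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;[kismet_csv]
INDEXED_EXTRACTIONS = csv
TIMESTAMP_FIELDS = time
TIME_FORMAT = %y/%m/%d %H:%M:%S
&lt;/CODE&gt;&lt;/PRE&gt;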

&lt;P&gt;The _time field in Splunk is where the timestamp is held.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jun 2018 08:13:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445945#M168900</guid>
      <dc:creator>msivill_splunk</dc:creator>
      <dc:date>2018-06-14T08:13:28Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate entries with continuous csv indexing</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445946#M168901</link>
      <description>&lt;P&gt;@msivill &lt;BR /&gt;
Thank you so much for your help. I still have one question: what if I want to make a table without repeating entries?&lt;/P&gt;</description>
      <pubDate>Mon, 18 Jun 2018 16:40:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445946#M168901</guid>
      <dc:creator>haker146</dc:creator>
      <dc:date>2018-06-18T16:40:43Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate entries with continuous csv indexing</title>
      <link>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445947#M168902</link>
      <description>&lt;P&gt;There is no concept of updating an event in Splunk. If you send the same data to Splunk twice, you will end up with two events. Using a timestamp when the events are saved into Splunk will help differentiate between them. The above example produces a view without repeating entries (though there will still be duplicate events within Splunk itself).&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jun 2018 08:00:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/Duplicate-entries-with-continuous-csv-indexing/m-p/445947#M168902</guid>
      <dc:creator>msivill_splunk</dc:creator>
      <dc:date>2018-06-26T08:00:31Z</dc:date>
    </item>
  </channel>
</rss>

