Hi,
I'm a first time splunk user trying to figure out how to do the following:
I have data describing devices, the devices are either "on" or "off" on any given day. I want to search for the ID's of the devices that are "off" on a specific day (let's call them my "new_set") and then track those (and only those) devices over time and output a count of how many of those devices in new_set are "on" and how many are "off" on each subsequent day.
Illustration, on day 0 I might find 200,000 devices that are "off" out of a total of 500,000 devices; then I want to count on each subsequent day how many of those specific 200,000 devices are "on" and how many are "off". It may look like this
Day 0: 200,000 off; 300,000 on
Then, I find the ID's of those 200,000 and call this new_set and then track only them ( I no longer care about the devices that are not in "new_set"):
Day 0: 200,000 off; 0 on
Day 1: 199,000 off: 1,000 on
Day 2: 197,000 off: 3,000 on
Day n: .....
I have tried using a Join command (inner), which eventually works, however the search is very slow / inefficient and I think there must be a more efficient way?
Does anyone have some recommendations or thoughts?
I think you have couple of option
*Option 1 (preferred/more efficient)*
Get a list of all devices off on Day 0 and store them in a lookup csv (http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/Usefieldlookups). Your search for this will look something like this
index=xyz earliest=day0 latest=day0 state="off" | stats count by id | fields - count | outputlookup listofdevicesoffonday0.csv
Once you have this file, you can use it to filter, like this
index=xyz earliest=day1 [| inputlookup listofdevicesoffonday0.csv ] | timechart span=1d count by state | fields - on
*Option 2, less efficient*
Use sub-search (http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/Useasubsearch) to get the filter list. So your search will look like this
index=xyz earliest=day1 [search index=xyz earliest=day0 latest=day0 state="off" | stats count by id | table id ] | timechart span=1d count by state | fields - on
I think you have couple of option
*Option 1 (preferred/more efficient)*
Get a list of all devices off on Day 0 and store them in a lookup csv (http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/Usefieldlookups). Your search for this will look something like this
index=xyz earliest=day0 latest=day0 state="off" | stats count by id | fields - count | outputlookup listofdevicesoffonday0.csv
Once you have this file, you can use it to filter, like this
index=xyz earliest=day1 [| inputlookup listofdevicesoffonday0.csv ] | timechart span=1d count by state | fields - on
*Option 2, less efficient*
Use sub-search (http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/Useasubsearch) to get the filter list. So your search will look like this
index=xyz earliest=day1 [search index=xyz earliest=day0 latest=day0 state="off" | stats count by id | table id ] | timechart span=1d count by state | fields - on
Hey,
Thank you for the help, that definitely got me further. However, it appears that my output is giving me the total "On"/"Off" counts from the original population ( the 500,000 in the above example) rather than from the sample of "new_set" (where "new_set" are those devices that were "off" on day=0 i.e. the 200,000). Any idea what I'm doing wrong?
Also, in my data there appear to be duplicates of the device ID's - so I would need to remove duplicates, would I use a "dedup" command and if so, where in the query?
Can you share your query
Thank you!
index=xyz source=abc mac!=pqr
[search index=xyz source=abc mac!=pqr timeStamp="day0" state= “off”
[ search source=uvw product=lmn timeStamp="day0"
| dedup Id
| table Id]
| stats count by Id
| dedup time,mac
| table Id]
| bin_time span=1d
| dedup_time,mac
| stats count by _time,operStatus
| sort_time
I changed some of your recommendations; for example, timechart was giving me "0" for weekend days, so I used the bin and sort instead. And "fields - on" wasn't giving me the comparison I needed of seeing "off" and "on" together for a day.
To limit your search, you need at add 'earliest=' in your base search. So in your example you should add day1 as the earliest OR timestamp>day0
Also, I notice you have two subsearches, not sure you need both. Combine them for faster searches
To eliminate duplicates, you can do dedup, or stats dc(id) as distinc_ids
Can I ask why you recommended field lookup.csv rather than a KV store in this scenario?
KV store will work as well. .csv is the first thing that came to mind.
If this helped you with your query, please mark it as answered to close it out.
Thank you.
Thank you, appreciate your help.