I am currently trying to show a graphical representation of the number of times a specific thing happens x number of times. Whenever an event in our system is processed and fails, we retry another 15 times, so if it fails completely there will be 16 entries in Splunk. This all happens within a couple of seconds. The log entry contains the GUID of the event, which can be identified in Splunk.
What I have done so far:
1 - Created a custom field that identifies the GUID in the log entry; let's call it "eventid".
2 - Created a search that filters based on source and event type, groups by "eventid", and keeps only the groups that contain 16 of those events. Finally, it shows the result in a time chart.
sourcetype="mysource" "IdentifyCorrectEvent" | stats values(_time) as _time, count by eventid | search count = 16 | timechart count | timechart per_hour(count)
This works in so far as it shows a visual representation of the number of times that this happens. For example, if we had one failure (16 errors) in an hour it would show a count of 16, two in an hour would show a count of 32, and so on.
How do I get the chart to show the number of times there were 16 errors for a single event? This is my first effort with Splunk, so feel free to say it is all wrong and I should have done xyz.
Will this work?
sourcetype="mysource" "IdentifyCorrectEvent"
| timechart span=1m count
| eval count = ceiling(count/16)
The only problem that I can see with this solution occurs if the events of a single failure are split across a minute boundary, for example exactly 8 in one period and 8 in the next. Each period would then evaluate to ceiling(8/16) = 1, so the one failure would be counted twice, since the time is sliced on minute boundaries.
Here is an alternate solution - in this case failures are defined based on eventid, but failures are also separated based on the time gap between events. In the example, if more than 10 seconds elapse between two events with the same id, they are considered different failures. This is a nice solution, but it will slow down significantly for huge numbers of events. (You could run the report over a shorter time period to compensate.)
sourcetype="mysource" "IdentifyCorrectEvent"
| transaction eventid maxpause=10s
| where eventcount > 15
| eval errorCount = round(eventcount / 16, 0)
| timechart sum(errorCount) as failure by eventid
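If the transaction search turns out to be too slow, a stats-based variant may be worth trying. This is only a sketch, and it rests on an assumption that is not stated in the question: each eventid fails at most once within the search window, so no time-gap separation is needed. The "failures" name and the one-hour span are illustrative choices.

```spl
sourcetype="mysource" "IdentifyCorrectEvent"
| stats min(_time) as _time, count by eventid
| where count >= 16
| timechart span=1h count as failures
```

The stats step collapses each eventid into a single row stamped with the time of its first entry, so the final timechart counts one row per failed event rather than one per log entry. Because the grouping is done per eventid before any time slicing, a failure whose entries straddle a minute or hour boundary should still be counted once.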
We happened to have a Splunk trainer in the building and he came up with pretty much the same solution. I don't have enough points to edit your answer, so I will put it in here:
sourcetype="mysource" "IdentifyCorrectEvent" | transaction maxspan=5s eventid | where eventcount>=16 | table _time eventid eventcount | timechart count