Solved: How to create a status time chart based on only change events with splits?

dustinhartje · ‎10-14-2016

I have what seems like a fairly simple analytical problem that I'm having real trouble wrapping into Splunk commands. I hope someone smarter out there can help me work this out as I've run into similar requests a few times and always fallen short solving them!

The goal

Determine the health of each host at any given time based on the status of multiple components, each of which logs only status change events and does so in a different place. This is needed for both forensics and SLA reporting, mostly per host, though there are far too many hosts (thousands) to build individual searches for. Essentially, someone needs to be able to check the health of a given host at any point in time (which will probably fall somewhere between change events), as well as chart the health of all of these hosts over time.

The data

The data is pretty straightforward in this mockup:

logA.log

2016-01-01 00:01 changeA=OK
2016-01-01 00:05 changeA=BAD
...

logB.log

2016-01-04 00:03 changeB=BAD
2016-01-04 00:04 changeB=OK
...etc

Getting there

To simplify, I'll focus on just the chart structure we need to produce so that we can apply health logic to it with eval, and assume we have only 3 status components and 2 hosts to deal with. To form a dataset we can effectively build views against, the columns would need to be laid out something like this:

_time              | host  | statusA | statusB | statusC
2016-01-01 00:01   | hostA | OK      | BAD     | OK
2016-01-01 00:01   | hostB | BAD     | BAD     | OK
2016-01-01 00:02   | hostA | OK      | OK      | OK
2016-01-01 00:02   | hostB | OK      | BAD     | BAD
2016-01-01 00:03   | hostA | OK      | BAD     | BAD
2016-01-01 00:03   | hostB | BAD     | BAD     | OK

We can then use an eval to apply our logic to every host for every minute and decide (based on how statusA-C relate) what the overall health is. This would give us something like:

_time              | host  | statusA | statusB | statusC | hostHealthThisMinute
2016-01-01 00:01   | hostA | OK      | BAD     | OK      | TBD with eval logic
2016-01-01 00:01   | hostB | BAD     | BAD     | OK      | TBD with eval logic
2016-01-01 00:02   | hostA | OK      | OK      | OK      | TBD with eval logic
2016-01-01 00:02   | hostB | OK      | BAD     | BAD     | TBD with eval logic
2016-01-01 00:03   | hostA | OK      | BAD     | BAD     | TBD with eval logic
2016-01-01 00:03   | hostB | BAD     | BAD     | OK      | TBD with eval logic
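
As a rough illustration of what that eval could look like, here is a minimal sketch run against a single made-up row. The health rules and the labels DOWN, DEGRADED and HEALTHY are assumptions for the example only; the real logic depends on how statusA-C relate in a given environment.

```spl
| makeresults
| eval host="hostA", statusA="OK", statusB="BAD", statusC="OK"
| eval hostHealthThisMinute=case(
    statusA=="BAD" AND statusB=="BAD", "DOWN",
    statusA=="BAD" OR statusB=="BAD" OR statusC=="BAD", "DEGRADED",
    true(), "HEALTHY")
| table _time host statusA statusB statusC hostHealthThisMinute
```

case() returns the value for the first condition that matches, so the most severe rules go first and true() acts as the catch-all.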

Timechart fails me because I end up with the host in the column name rather than as a column and I really need both _time and host on the same axis (I think...).

| timechart latest(changeA) AS statusA latest(changeB) AS statusB latest(changeC) AS statusC by host

gives me:

_time              | hostA:statusA | hostA:statusB | hostA:statusC | hostB:statusA | hostB:statusB | ...
2016-01-01 00:01   | OK            | BAD           | OK            | BAD           | BAD           | ...
2016-01-01 00:02   | OK            | OK            | OK            | OK            | BAD           | ...
2016-01-01 00:03   | OK            | BAD           | OK            | BAD           | BAD           | ...

Because we only get change events, the closest I've gotten is to stats/chart the change events I DO have and then use streamstats to fill in some of the gaps:

index=myindex source=logA OR source=logB OR source=logC
| bucket span=1m _time
| stats latest(changeA) AS changeA latest(changeB) AS changeB latest(changeC) AS changeC by _time host
| streamstats latest(changeA) AS statusA latest(changeB) AS statusB latest(changeC) AS statusC by host
| fields _time host status*
| sort _time host

This fills all the columns for every 1m time slice where AT LEAST ONE status change event occurred. However, it still leaves gaps for any minute in which there were no events: the initial stats/chart command doesn't emit a row for each host (with empty statuses) in every time slice, so there is nothing there for streamstats to fill. Note the timestamps in the simulated results below:

_time              | host  | statusA | statusB | statusC | overallStatus
2016-01-01 00:01   | hostA | OK      | BAD     | OK      | ...TBD with eval
2016-01-01 00:01   | hostB | BAD     | BAD     | OK      | ...TBD with eval
2016-01-01 00:03   | hostA | OK      | OK      | OK      | ...TBD with eval
2016-01-01 00:04   | hostA | OK      | BAD     | OK      | ...TBD with eval
2016-01-01 00:07   | hostB | BAD     | BAD     | OK      | ...TBD with eval
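
To illustrate why the missing rows matter, here is a small sketch on synthetic data (the minute counter and the two change events are invented for the example). When a row exists for every minute, streamstats should be able to carry the last seen value forward into the minutes that had no change event:

```spl
| makeresults count=5
| streamstats count AS minute
| eval _time=strptime("2016-01-01 00:00", "%Y-%m-%d %H:%M") + minute*60
| eval host="hostA"
| eval changeA=case(minute==1, "OK", minute==4, "BAD")
| streamstats latest(changeA) AS statusA by host
| table _time host changeA statusA
```

In the real search, the rows for the quiet minutes never exist after the stats command, so streamstats has nothing to fill.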

I've tried every way of applying splits in stats and chart that I can think of, and dug through commands like makecontinuous and transpose looking for ways to get those gaps filled without a timechart command involved. I have failed entirely 😞

Can anyone tell me how I might approach this problem better by:
1. Using timechart in a way that leaves the host as a value per row rather than a column header
2. Using bucket/stats/timechart and filling in the gaps where no events occurred
3. Accomplishing our goals in some completely different way, maybe without building such a table at all?

sundareshr · ‎10-14-2016

Try this

index=myindex source=logA OR source=logB OR source=logC
| bucket _time
| stats latest(changeA) AS changeA latest(changeB) AS changeB latest(changeC) AS changeC by _time host
| timechart span=1m cont=t latest(changeA) AS changeA latest(changeB) AS changeB latest(changeC) AS changeC latest(host) as host
| streamstats latest(changeA) AS statusA latest(changeB) AS statusB latest(changeC) AS statusC by host
| fields _time host status*
| sort _time host

dustinhartje · ‎10-21-2016

Unfortunately, using latest(host) as host rather than splitting by host in the timechart causes changes on different hosts within the same minute to be combined, so they appear to have happened on a single host instead.

somesoni2 · ‎10-14-2016

Try this (should give you continuous time)

index=myindex source=logA OR source=logB OR source=logC
| timechart latest(changeA) AS statusA latest(changeB) AS statusB latest(changeC) AS statusC by host limit=0
| untable _time host_status value
| eval host=mvindex(split(host_status,":"),1)
| eval status=mvindex(split(host_status,":"),0)
| eval {status}=value
| fields - status host_status
| table _time host status*

dustinhartje · ‎10-21-2016

This ALMOST got me where I needed to go, though the | untable command removes the empty time slices where no changes occurred, leaving gaps similar to my previous attempts. However, this answer does solve the trickiest part of my problem: getting the host names back down into a column after splitting by them in the timechart. After a bit of further experimentation, I was able to populate the empty values in a way that works for my purposes and keeps those time slices from being dropped during the untable. I then used stats to recombine them into single lines per host for each time slice at the end, like so:

index=myindex source=logA OR source=logB OR source=logC
| timechart span=1m cont=t limit=0 latest(changeA) AS changeA latest(changeB) AS changeB by host
| filldown
| fillnull value=null
| untable _time host_status value
| eval host=mvindex(split(host_status,":"),1)
| eval change=mvindex(split(host_status,":"),0)
| eval {change}=value
| stats latest(changeA) AS statusA latest(changeB) AS statusB by _time host
| eval statusA=if(match(statusA,"null"),null(),statusA)
| eval statusB=if(match(statusB,"null"),null(),statusB)
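
With one row per host per minute in place, the health logic and an SLA-style rollup could be layered on top. The following is only a sketch on synthetic rows: the field names health, healthyMinutes, totalMinutes and pctHealthy, the made-up statuses, and the rule that a host is healthy only when both components are OK are all assumptions for illustration.

```spl
| makeresults count=4
| streamstats count AS minute
| eval host="hostA", statusA=if(minute<=3, "OK", "BAD"), statusB="OK"
| eval health=case(
    isnull(statusA) OR isnull(statusB), "UNKNOWN",
    statusA=="OK" AND statusB=="OK", "HEALTHY",
    true(), "UNHEALTHY")
| stats count(eval(health=="HEALTHY")) AS healthyMinutes count(eval(health!="UNKNOWN")) AS totalMinutes by host
| eval pctHealthy=round(100 * healthyMinutes / totalMinutes, 2)
```

To try this against real data, the first three lines would be dropped and the remaining eval/stats/eval lines appended to the end of the search above. Minutes before a component's first change event have no status, which is why they are labeled UNKNOWN and left out of the denominator here.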
