I have a search like this:
| mstats span=1d sum(_value) as "ClosedTime" WHERE index=metrics_prod metric_name=com.foo.timeClosed | eval ClosedTimeinMin = ((ClosedTime/1000)/60) | table _time ClosedTimeinMin
It basically shows how long a device in a shop was closed (out of order) a day:
"_time",ClosedTimeinMin "2019-09-25T00:00:00.000+0200","0.27218333333" "2019-09-26T00:00:00.000+0200","528.49098333333" "2019-09-27T00:00:00.000+0200","1077.4227000000" "2019-09-30T00:00:00.000+0200","3410.40781666667" "2019-10-01T00:00:00.000+0200","533.04851666667"
The problem is that it contains those time periods as well when the shop was closed. When the shop is closed and the device is off it is not a problem so this time period should not be in the report. I can call a REST API which returns the opening hours of the shop. This Python script (which calls the REST API) is written according to the rules described here (external lookup): https://docs.splunk.com/Documentation/Splunk/7.3.1/Knowledge/Configureexternallookups
It was written by me so it can return the opening hours in any format. How could I use this script in the search in order to exclude the hours when the shop was closed? Can I use the "lookup" command together with "mstats" command? And if yes how should I phrase the query in order to achieve the desired result?
it should work like this and my apologies for using "meta-code" to sketch the idea:
| mstats span=1d sum(_value) as "ClosedTime" WHERE index=metrics_prod metric_name=com.foo.timeClosed | <external.py returning fields openingTime, closingTime as epoch (integer)> <<< Add the opening hours | where _time>=openingTime AND _time<=closingTime <<< Select only the events that fall between open and close | stats sum(ClosedTimeinMin) as TotalOutage <<< Sum up all outages to get the total Outage time
Alternatively, you could add some additional parameter to the metrics, detailing the failed device/source and add a by to the stats to get the outages per device.
This is just a rough sketch. To make it perfect, you would rather test if the end of the outage was inside the opening hours, clipping the outage length to the openingTime whenever the outage started before store open. The same could be done for all outages that start while open: clip their length at shop close. It gets tricky if an outage spans several days... but you get the point.
Hope it helps
Thanks for your answer. I am waiting for our Splunk admin to let me upload the script then I try what you suggested.
I tried to use an already existing lookup though and it didn't work:
| mstats span=1d sum(_value) as "ClosedTime" WHERE index=till_sre_metrics_prod metric_name=com.tesco.ui.prod.timeOnClosedPage | lookup store_lookup.csv host AS host OUTPUTNEW store_number AS store_number | fields store_number
the "store_number" field doesn't appear. Why? "Host" field should be available even in case of stats.