Trying to determine Best Practices for the following, and I don't want to reinvent the wheel if a Splunker has already resolved this issue.
This is for a printer dashboard.
This is a minimized small scale of reality.
• 5 printers: A, B, C, D, E
• 2 printer statuses: UP, DOWN
• Dashboard will be refreshed every 5 minutes searching for the latest status of printers A – E
• The 1st 5 minutes, printers A – E show status as UP
• The 2nd 5 minutes, printers A – D show status as UP, E as DOWN
• The 3rd 5 minutes, printers A – D are UP, E is ???? This is because the print server has not received any events from printer E; therefore, neither have the Splunk indexers
Possible solution (50,000 feet)
1. Lookup table that stores the last status ingested for each printer, including time
2. The next time the search is run (5 minutes later), any printer missing a status ("No Printer Events!") will be looked up in the lookup table
3. The dashboard will be populated with the lookup status for the printer
4. Once the dashboard is fully populated, the lookup table will be cleared of all rows and repopulated from the dashboard status (status saved in a token with time for each printer)
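Steps 1–4 might collapse into a single scheduled search, without needing dashboard tokens. Something like this, where the index name and lookup file name are made up for illustration:

```
index=printers earliest=-5m
| stats max(_time) as _time latest(status) as status by printer
| inputlookup append=true printer_last_status.csv
| sort 0 - _time
| dedup printer
| outputlookup printer_last_status.csv
```

The idea: merge the last 5 minutes of events with the previously saved table, keep only the newest row per printer, and write the table back, so the dashboard just reads the lookup.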
I think this will work, but it will be a lot of coding.
A first response to the above might be to increase the search time range from 5 minutes to 60 minutes, 4 hours, 24 hours, etc. The problem is that, at some point, a printer will have last sent its status before that new time range begins.
Below is the reality. Case in point.
Because of the limitation on how many images I can upload, these 3 time ranges (15 mins, 4 hours, all-time) have been combined into one image.
Notice the different statuses, especially for oix21. This printer went offline between 16 minutes and 4 hours ago. If the Helpdesk only had the 15-minute view, they would not know this printer is down, because a down printer does not write logs to the print server.
Now we discover printer rv44's status is “toner low”. According to our 15-minute view, this printer had no recent events.
And this is with only 2 possible statuses; there are over 15 (door open, out of paper, etc.).
I hope I did not convolute things with my explanation.
Is there a Splunk Best Practice for storing latest status (and time thereof), updating it when a new status is learned? Hmm kvstore perhaps?????
If you can capture all the printer statuses in one search, you can write the result to a lookup table. In the above dashboard, you can then look directly at the lookup table for the last known status. This will give you the most accurate result.
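A minimal sketch of that pattern (your base search and the lookup file name are placeholders):

```
your base search over all printers
| stats max(_time) as _time latest(status) as status by printer
| outputlookup printer_status.csv
```

The dashboard panel then reads the table directly instead of re-searching the raw events:

```
| inputlookup printer_status.csv
| table printer status _time
```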
Secondly, it would be worthwhile having a drilldown where you can show a time chart of the printer's different events, so you can trace through what might have been the reason for the failure.
For the second part, read this docs page, which will explain how the drill downs work:
With the drilldown a new panel(s) will be populated at the bottom of the dashboard or in a new dashboard.
Examples of drilldown you can add:
- identify the user groups that make the most use of the printer
- identify the last known fix action for each printer
- identify the list of common issues with the printer
- identify the jobs executed in the last 24 hrs that led to the printer's current failed state
- display the maintenance history of the printer
This can assist further with troubleshooting and also ongoing maintenance.
Note: these are just examples; you should extend the capabilities as per your requirements.
Thanks for the clarification about drilldowns. I am using one where, when the user clicks a printer name, a new panel opens at the top of the dashboard displaying more details, including contact info, the floor plan, and a link to the printer device itself.
Concerning the lookup, what do you think would be best practice? Lookup table or KV store?
Thanks and God bless,
In order to answer the KV store versus lookup table question, you should ask: How much data do I plan on holding? Do you want to be able to CRUD (Create, Read, Update, Delete) a single record, or are you going to grab everything and update the whole table each time?
If you don't need single event CRUD or you have <100,000 rows in this lookup, I would probably avoid KV store.
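For reference, if you did go the KV store route, most of the work is configuration rather than SPL. A sketch, with all the names invented for illustration:

```
# collections.conf -- defines the KV store collection
[printer_status]
field.printer = string
field.status  = string
field.time    = number

# transforms.conf -- exposes the collection as a lookup
[printer_status_kv]
external_type = kvstore
collection    = printer_status
fields_list   = _key, printer, status, time
```

With that in place, outputlookup printer_status_kv and inputlookup printer_status_kv behave much like the CSV versions, and the KV store additionally supports per-record CRUD through the REST API.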
You are essentially creating a stateful table in Splunk. I have done this before and here are the pitfalls I ran into:
You can avoid those pitfalls by ensuring you are monitoring rolled-over log files, that you do not have crcSalt enabled, and that you use overlapping search time periods in your scheduled search to keep the file updated. This means if you have a 5-minute cron like 1-59/5 * * * *, I'd suggest an earliest time of at least -15m, or even more, depending on what your worst-case observed ingestion delay is on your events.
As for the search that populates and keeps the status up to date: on the first run I would go back at least 30 days to ensure you get all of your possible hosts. It would be something like this:
search that populates a list of statuses
| stats max(_time) as _time latest(status) as status by printer
| outputlookup LatestPrinterStatus.csv
I would then create a saved search that runs every 5 minutes with a 15-minute look-back:
search that populates a list of statuses earliest=-15m latest=now
| inputlookup append=true LatestPrinterStatus.csv
| stats max(_time) as _time latest(status) as status by printer
| outputlookup LatestPrinterStatus.csv
The inputlookup that appends the table on line 2 ensures you are always including every printer that has ever reported previously, so if a host doesn't report within that 5-minute query period, it will still be accounted for.
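One refinement worth considering for the dashboard panel that reads the lookup: flag printers whose last event is older than your refresh window, so "no news" is visibly distinct from a genuine UP. A sketch (the 15-minute threshold is just a guess; tune it to your ingestion delay):

```
| inputlookup LatestPrinterStatus.csv
| eval age_min = round((now() - _time) / 60)
| eval status  = if(age_min > 15 AND status="UP", "No Printer Events!", status)
| table printer status age_min
```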