Dashboards & Visualizations

Are real-time dashboards for NOC putting too much strain on the Splunk Cloud environment?

Explorer

We have a NOC and have started to use some Splunk dashboards to show the performance of specific applications. The dashboards show 1-2 hours of data and refresh every 60 seconds. Recently we have been getting mixed information about the viability of doing this, and we have been told that refreshing every 60 seconds puts too much strain on the Splunk Cloud environment.

  1. Can Splunk Cloud handle real-time dashboarding?
  2. What are the best practices for NOC dashboards that need to show data quickly enough to make it actionable?
1 Solution

SplunkTrust

If the question is, "How can I make the dashboards take less CPU time and other resources when they refresh?", then here is the answer.

Instead of chewing up the same 1-2 hours of data every minute, create a scheduled search that summarizes the data into a summary index, and have your dash look at the summarized data instead of the detailed data. This can be done in a number of different ways, for example with accelerated data models or with scheduled searches that collect to a summary index.

Then your main dash displays the summarized data, and you set up drilldowns, as I said in the other answer, to explore the detailed records.
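
To make the idea concrete, here is a rough Python sketch (not Splunk code) of why summarizing helps. It assumes hypothetical events made of a timestamp and a response time. The `summarize` function stands in for the scheduled search, the dictionary it returns stands in for the summary index, and `dashboard_rows` stands in for what the dash reads on each refresh. All of these names are made up for illustration.

```python
from collections import defaultdict

def summarize(events, bucket_seconds=60):
    """Roll raw (epoch_seconds, response_ms) events into per-bucket count and sum.

    Stands in for a scheduled search that writes to a summary index.
    """
    summary = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for ts, response_ms in events:
        bucket = ts - (ts % bucket_seconds)  # start of the bucket this event falls in
        summary[bucket]["count"] += 1
        summary[bucket]["total_ms"] += response_ms
    return dict(summary)

def dashboard_rows(summary, now, window_seconds=7200):
    """What the dash reads on each refresh: one (bucket, count, avg_ms) row per bucket."""
    start = now - window_seconds
    return [
        (bucket, row["count"], row["total_ms"] / row["count"])
        for bucket, row in sorted(summary.items())
        if start <= bucket < now
    ]

# Hypothetical load: 10 events per second for 2 hours, each taking 100 ms.
events = [(t, 100) for t in range(7200) for _ in range(10)]
summary = summarize(events)
rows = dashboard_rows(summary, now=7200)
print(len(events), "raw events vs", len(rows), "summary rows per refresh")
```

In this sketch the refresh touches 120 small rows instead of 72,000 raw events. A real scheduled search would also only need to summarize the newest minute on each run, rather than the whole window.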

Splunk can "guarantee" real time to the exact same degree any competitor can, that is, to the degree that you deploy the resources to handle the level of throughput. The process of limiting the resources required, by tuning the searches themselves, by building dashboards that don't add unnecessary strain and clutter to the system, and so on, could fill a whole library.

Even one search that is badly written - for example a search with a regular expression that contains catastrophic backtracking - can stress the entire system. This will be true for pretty much any vendor.
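
As a small illustration of catastrophic backtracking, here is a sketch using Python's `re` module rather than a Splunk search. The two patterns are textbook examples picked for this sketch, not taken from any real dashboard. Both accept exactly the same strings, but the nested quantifier in the first one gives the engine many ways to split the same run of characters, and it tries them all before giving up on an input that does not match.

```python
import re

# Nested quantifiers: on a non-matching input the engine tries every way of
# splitting the run of "a" characters between the inner and the outer "+".
risky = re.compile(r"^(a+)+$")

# Matches the same strings, but there is only one way to match each of them.
safe = re.compile(r"^a+$")

good = "a" * 18        # matches both patterns
bad = "a" * 18 + "!"   # matches neither; this is the input that hurts

for name, pattern in (("risky", risky), ("safe", safe)):
    print(name, bool(pattern.match(good)), bool(pattern.match(bad)))
```

Both patterns give the same answers here. The difference is in the work done on `bad`: with the risky pattern, each extra "a" roughly doubles the time to reject the input, so the length is kept short on purpose.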

Explorer

Thank you again, @DalJeanis. I can work with data models and saved searches to make my dashboards.

SplunkTrust

Yes, every simultaneous search takes up CPU, so you want them infrequent, not real time. They should be matched to the business need.

Ask what the SLA to respond to an event is. If there is no SLA, then real time is not required. If the SLA is more than 15 minutes, then real time is not required.

The next question is, is a person going to be watching that dash live, or is it just going to be reviewed occasionally? If no one will be monitoring it every minute, then real time is not required.

The next question is, under what circumstances would the actions of the monitor personnel change if the minute-to-minute details shown on the dash were to change?

Listen carefully to the answer. With this question, you are trying to prove that the above conclusions were wrong, and they DO need a real-time graph. They almost never do... and they certainly would not need it open all the time.

Along with the above, there is the question of, "What happens if an occurrence on the dash is not noticed by the monitor personnel, and it continues without being investigated?" Again, this is contextual information. If someone will die or go to jail, then a real-time search or alert may be reasonable.

Other than that, the probable alternative is to chew up the data every 2, 3, 5 or 10 minutes, depending on the real need or the real SLA. Then, if there is something actually happening, they should be able to open an actual real-time search... but it should be severely limited in scope and duration.
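
The questions above can be sketched as a small decision helper. This is only an illustration in Python: the function name, its parameters, and especially the "refresh at most every half of the SLA" rule are assumptions made for this sketch, not a Splunk feature or an official guideline.

```python
# Candidate intervals from the text above: "every 2, 3, 5 or 10 minutes".
CANDIDATE_INTERVALS_MIN = (2, 3, 5, 10)

def refresh_plan(sla_minutes, watched_live):
    """Return ("real-time", None) or ("scheduled", interval_minutes).

    sla_minutes: response SLA in minutes, or None if there is no SLA.
    watched_live: True if a person watches the dash continuously.
    """
    # No SLA, an SLA over 15 minutes, or nobody watching: real time is not required.
    if sla_minutes is None or sla_minutes > 15 or not watched_live:
        return ("scheduled", max(CANDIDATE_INTERVALS_MIN))
    # Assumption for this sketch: refresh at most every half of the SLA,
    # using the longest candidate interval that fits.
    fitting = [i for i in CANDIDATE_INTERVALS_MIN if i <= sla_minutes / 2]
    if fitting:
        return ("scheduled", max(fitting))
    # The SLA is too tight for any candidate interval: a real-time search,
    # limited in scope and duration, may be justified.
    return ("real-time", None)

print(refresh_plan(sla_minutes=10, watched_live=True))
```

With a 10 minute SLA and someone watching, the sketch suggests a 5 minute scheduled refresh. Only a very tight SLA with a live watcher falls through to real time.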

Explorer

@DalJeanis, thank you very much for your answer.

Currently our NOC is full of real-time dashboards that watch a number of our major applications. A real-time dashboard is required to see when performance starts to degrade so that people can react to that outage. If the data comes in every 10 minutes, then we could potentially be 10 minutes behind a problem. Those of us who work in operations/support all know that being 10 minutes behind a problem is almost useless.

I feel that Splunk needs to figure this out because their competition is already guaranteeing real-time dashboarding for exactly the scenario I state above. I feel that .conf should have some data regarding this, and I hope it's as good as it can be.

SplunkTrust

@gt_dev - You asked a specific question, and I gave you the answer to that question. I specifically told you how to go about finding out what the real SLA needed to be for any given dash. I'll give you a second answer to answer the question, "How can I make the dashboards take less CPU time when they refresh?"

Explorer

Sorry, maybe there has been some confusion about my first question. My first question specifically asks whether Splunk can support real-time dashboards that run every minute. "Support" in this case means that it can sustain a dashboard (hopefully multiple) running at that frequency. Other vendors are doing this, and it seems odd that Splunk would not be focused on this.
