
Optimizing dashboard performance, looking for the best design

guilmxm
SplunkTrust

Hi,

I am currently finalizing a Splunk application for my company, and I am looking for the best way to optimize dashboard performance.

My application manages raw monitoring data collected by various Nagios collectors (networking and security components) to provide complex reports with charts; this can represent a large number of data lines for Splunk to analyse.

Here are some examples of the searches I use to build my reporting dashboards (I use lookups against CSV files to define various fields and values):

Example 1, a simple aggregation of stats:

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=5m eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session

Example 2, a more standard representation by host:

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CPU" monitor_label="cpu" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | timechart span=5m eval(round(mean(value),2)) As Average_CPU eval(round(max(value),2)) As Max_CPU by hostname

Example 3, a more complex search managing a networking counter type with multiple series:

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" functionnal_zone="XXXXX" traffic_sense="IN" | dedup _time hour hostname monitor monitor_label value | streamstats current=f global=f window=1 first(value) as next_value, first(_time) as next_time by monitor_label, hostname | eval dt=next_time-_time | eval deltavalue=next_value-value | eval realvalue=deltavalue/dt | where realvalue>=0 | eval realvalue=round(realvalue,2) | eval value=realvalue | eval value=value*8/1000000 | bucket _time span=5m | stats max(value) As ValMax by _time,monitor_label,hostname | eval ValMax=round(ValMax,2) | eval s1="max" | makemv s1 | mvexpand s1 | eval yval=case(s1=="max",ValMax) | eval series=hostname+":"+monitor_label+":"+s1 | xyseries _time,series,yval | makecontinuous _time

About span / bin:

As I was not fully satisfied with automatic span charting (and I need the charts to be as granular as possible), after testing various approaches I found the best solution to be JavaScript (using a Sideview custom behavior) that defines the span value depending on the selected time range (the span value is then passed downstream). This requires inline searches inside advanced XML views:

Content of application.js (the main code comes from http://pastebin.com/jqDktMhC):

//Assign CustomBehavior triggers
if(typeof(Sideview)!="undefined"){
        $(document).bind("allModulesInHierarchy",function(){

                Sideview.utils.forEachModuleWithCustomBehavior("GatherBins",function(b,a){

//isReadyForContextPush -- don't push to the next modules, since the bins aren't assigned yet.
                        a.isReadyForContextPush = function(){
                                if(!this.RetrievedBinCount) return Splunk.Module.DEFER;
                                if (this.getLoadState() < Splunk.util.moduleLoadStates.HAS_CONTEXT) return false;
                                return true;
                        }


//onJobProgress -- Actually figure out the number of bins.
                        a.onJobProgress = function() {
                                var c=this.getContext();
                                //This will be the upstream * | head 1 search job, which will give us absolute values for the TimeRangePicker          
                                var d=c.get("search").job;


                                var Bins = 0;
                                var Binsize = "";
                                var Span = "";
                                var Showspan = "";
                                var latest = new Date(d._latestTime);
                                var earliest = new Date(d._earliestTime);
                                //Handle latestTime = 0 (Not sure how often this should happen -- came up when I was testing)
                                if(latest.valueOf() == 0){
                                        latest = new Date();
                                }

                                //Calculate difference in seconds
                                var Difference = (latest.valueOf() - earliest.valueOf()) / 1000;


                                //Figure out how many bins to assign, based on the range. The below is for 10 minute data increments.
                                //If you had only hourly data, and were searching over 10 years, you might need to add an additional layer of summary.

                                if(Difference > (730*24*60*60)){
                                        //alert("More than 730 days -- summarize four days");
                                        Bins = parseInt(Difference / (96*60*60))+2;
                                        Binsize = "Four Day";
                                        Showspan = "4 jours";
                                        Span = "4d";
                                }else if(Difference > (450*24*60*60)){
                                        //alert("More than 450 days -- summarize two days");
                                        Bins = parseInt(Difference / (48*60*60))+2;
                                        Binsize = "Two Day";
                                        Showspan = "2 jours";
                                        Span = "2d";
                                }else if(Difference > (150*24*60*60)){
                                        //alert("More than 150 days -- summarize daily");
                                        Bins = parseInt(Difference / (24*60*60))+2;
                                        Binsize = "One Day";
                                        Showspan = "1 jour";
                                        Span = "1d";
                                }else if(Difference > (100*24*60*60)){
                                        //alert("More than 100 days -- summarize 12 hourly");
                                        Bins = parseInt(Difference / (12*60*60))+2;
                                        Binsize = "12 Hour";
                                        Showspan = "12 heures";
                                        Span = "12h";
                                }else if(Difference > (50*24*60*60)){
                                        //alert("More than 50 days -- summarize 8 hourly");
                                        Bins = parseInt(Difference / (8*60*60))+2;
                                        Binsize = "8 Hour";
                                        Showspan = "8 heures";
                                        Span = "8h";
                                }else if(Difference > (14*24*60*60)){
                                        //alert("More than 14 days -- summarize 4 hourly");
                                        Bins = parseInt(Difference / (4*60*60))+2;
                                        Binsize = "4 Hour";
                                        Showspan = "4 heures";
                                        Span = "4h";
                                }else if(Difference > (6*24*60*60)){
                                        //alert("More than 6 days -- summarize hourly");
                                        Bins = parseInt(Difference / (60*60))+2;
                                        Binsize = "One Hour";
                                        Showspan = "1 heure";
                                        Span = "1h";
                                }else if(Difference > (2*24*60*60)){
                                        //alert("More than 2 days -- summarize half-hourly");
                                        Bins = parseInt(Difference / (30*60))+2;
                                        Binsize = "30 Minute";
                                        Showspan = "30 minutes";
                                        Span = "30m";
                                }else if(Difference > (1*24*60*60)){
                                        //alert("More than 1 day -- summarize 10 minutes");
                                        Bins = parseInt(Difference / (10*60))+2;
                                        Binsize = "10 Minute";
                                        Showspan = "10 minutes";
                                        Span = "10m";       
                                }else{
                                        //alert("Less or equal to 1 day -- summarize to 5 minutes");
                                        Bins = parseInt(Difference / (5*60))+2;
                                        Binsize = "5 Minute";
                                        Showspan = "5 minutes";
                                        Span = "5m";
                                }


                                // Assign to context                           
                                this.Bins = Bins;
                                this.Binsize = Binsize;
                                this.Span = Span;
                                this.Showspan = Showspan;
                                this.RetrievedBinCount = true;

                                //Now that we have everything we need, we're ready to roll on to the next modules.
                                this.pushContextToChildren();

                        }

//getModifiedContext -- put the Bins into $Bins$

                        a.getModifiedContext=function(){
                                var context=this.getContext();
                                context.set("Bins", this.Bins);
                                context.set("Binsize", this.Binsize);
                                context.set("Span", this.Span);
                                context.set("Showspan", this.Showspan);
                                return context;
                        }
                })
        })
}

Here is an example XML view (using search example 1):

Note: for context, a home page with a time range picker gives the user the time range selection, which is passed downstream to the view being called.

<view autoCancelInterval="90" isVisible="False" onunloadCancelJobs="True" template="dashboard.html" stylesheet="dashboard_customsize.css" isSticky="False">

<!-- Version = 0.1 / Last update = March 9, 2013 -->

  <label>INTERNET - FW N1</label>

<!-- standard splunk chrome at the top -->
  <module name="AccountBar" layoutPanel="appHeader"/>
  <module name="AppBar" layoutPanel="navigationHeader"/>
  <module name="SideviewUtils" layoutPanel="appHeader" />

  <module name="Message" layoutPanel="messaging">
    <param name="filter">*</param>
    <param name="clearOnJobDispatch">False</param>
    <param name="maxSize">1</param>
  </module>

  <module name="URLLoader" layoutPanel="panel_row1_col1" autoRun="True">

   <module name="HTML" layoutPanel="panel_row1_col1">
      <param name="html"><![CDATA[

       <p></p>
       <h1>Capacity Planning: $title$</h1>

      ]]></param>
    </module>

<!-- Global TimeRangePicker --> 
    <module name="TimeRangePicker" layoutPanel="splSearchControls-inline">
        <param name="searchWhenChanged">True</param>

  <module name="Search" autoRun="True">
            <param name="search">* | head 1</param>

     <module name="CustomBehavior">
    <param name="customBehavior">GatherBins</param>
     <param name="requiresDispatch">True</param>     

    <module name="HTML" layoutPanel="panel_row1_col1">
      <param name="html"><![CDATA[

       <p></p>
       <h2>Laps de temps d'analyse: $Showspan$</h2>

      ]]></param>
    </module>


    <module name="SearchControls" layoutPanel="panel_row1_col1">
        <param name="sections">print</param>
    </module>       

<!-- ########################################       BEGIN OF SECTIONS           ######################################## -->

<!-- ########################################       FIREWALL N1         ######################################## -->

<!-- #####################    SESSIONS    ##################### -->

<!-- Using custom size -->

        <module name="Search" layoutPanel="panel_row2_col1" autoRun="True">
            <param name="search">index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=$Span$ eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session
            </param>

            <module name="HTML" layoutPanel="panel_row2_col1">
                <param name="html"><![CDATA[
                <h3>Vision Datacenter - Pics (Valeur Max) du nombre de sessions simultanées</h3>
                ]]></param>
            </module>           

            <module name="HiddenFieldPicker">
            <param name="strictMode">True</param>
            <module name="JobProgressIndicator">
              <module name="EnablePreview">
                <param name="display">False</param>
                <param name="enable">True</param>
                <module name="HiddenChartFormatter">
                  <param name="charting.legend.placement">bottom</param>
                  <param name="charting.chart.nullValueMode">connect</param>
                  <param name="charting.chart">line</param>line
                  <param name="charting.axisTitleX.text">Periode</param>
                  <param name="charting.axisTitleY.text">Sessions</param>
                  <module name="JSChart">
                    <param name="width">100%</param>
            <param name="height">300px</param>
                    <module name="ConvertToDrilldownSearch">
                      <module name="ViewRedirector">
                        <param name="viewTarget">flashtimeline</param>
                      </module>
                    </module>
                  </module>
                  <module name="ViewRedirectorLink">
                    <param name="viewTarget">flashtimeline</param>
                  </module>
                </module>
              </module>
            </module>
          </module>
        </module>

<!-- ########################################       END OF SECTIONS         ######################################## -->

        </module> <!-- CustomBehavior -->
      </module> <!-- Search -->

    </module> <!-- TimeRangePicker -->

  </module> <!-- URLLoader -->

</view>

This works very well and perfectly answers my needs regarding chart granularity.

Now I am looking for the best approach to optimize dashboard performance and reduce the number of jobs and their CPU cost.

1. Scheduled saved searches

This was my first approach: defining specific time ranges to be scheduled (such as All time and Last 30 days, for example).

Any time the user selects one of the predefined scheduled time ranges, a specific version of the view is called and executed (that view contains the corresponding saved searches).

Any other time range selected by the user calls a "timerange" version of the view that uses ad-hoc inline searches.

Advantages:
- Works well; the dashboard loads very quickly when it reuses previously executed jobs
- Keeps as much CPU as possible free for other users

Constraints:
- Several XML file versions to maintain for the same dashboard
- Many saved searches, which become hard to maintain and implement as the number of dashboards grows

This approach is definitely too complex and limited, and very hard to keep clean as time passes and dashboards are added. Not satisfying.

2. Summary indexing

As far as I understand how Splunk works, summary indexing is one of the logical ways to achieve this optimization.

Unfortunately, all my configuration tests show worse performance than using the normal index and searches... (perhaps my fault!)

I have tried with and without the "si" commands, with almost the same results.

I defined a scheduled saved search to generate data into a dedicated summary index; let's call it "xxx_summary".

All data is collected each night around 2 AM (my dashboards report on day -1), so I don't need to run the saved searches often to populate the summary index.

Based on search example 1, I used the following scheduled search to populate the summary index with the lowest span value I need (5 minutes):

[INTERNET_FW_N1_sessions_sum_XXX]
action.email.inline = 1
alert.digest_mode = True
alert.suppress = 0
alert.track = 1
cron_schedule = */55 * * * * 
description = INTERNET_FW_N1_sessions_sum_XXX
dispatch.earliest_time = -1d@d
dispatch.latest_time = now
enableSched = 1
realtime_schedule = 0
auto_summarize = 0
auto_summarize.dispatch.earliest_time = 0
action.summary_index = 1
action.summary_index._name = xxx_summary
action.summary_index.report = INTERNET_FW_N1_sessions_sum_XXX
search = index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=5m eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session

Then, in my view, I use the following inline search:

<param name="search">
index="xxx_summary" report="INTERNET_FW_N1_sessions_sum_XXX" | timechart span=$Span$ eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session by hostname
</param>

I used the fill_summary_index.py Python script to backfill previous periods, for example:

./splunk cmd python fill_summary_index.py -app My_Application -name "INTERNET_FW_N1_sessions_sum_XXX" -et -7d@d -lt @d -j 8

Then, when the search is called, everything works fine and I get my chart as expected.

But performance is strangely worse than with the raw-data index, even though that index contains millions of events while the summary only contains one backfilled report covering a few days!

Performance test with search example 1:

Using the normal index and a normal search, I get this execution time:

This search has completed and has returned 276 results by scanning 1,654 events in 0.838 seconds

Using the summary-indexed search (populated with normal commands), I get:

This search has completed and has returned 276 results by scanning 19,021 events in 4.027 seconds.

I get almost the same kind of performance using the "si" commands.

What am I missing? I don't really understand why Splunk has to scan so many more events when using the summary index search, nor why the request takes so long to execute.

In this perf test, my main index contains 22.82 million events while my summary index only contains 0.06 million events. Shouldn't we expect better results from the summary?

It seems Splunk is scanning all events in the summary before getting the expected result. Isn't the "report" filter enough to prevent this?

Thank you in advance for any help you could provide me.

1 Solution

sideview
SplunkTrust

OK. I'm not going to try and organize this answer very much - it's pretty much just notes that I took as I was reading through all your descriptions.


The dedup in your "Example 1" and "Example 2" searches looks very weird and/or problematic.

If you ever have more than one event for any given combination of those 6 fields, and

A) it's just because you're indexing the same data more than once.... then don't index the same data more than once! that's bad, and it slows down your index.

B) If you ever have more than one event for any given combination of those 6 fields and it's not just duplicate events, then the search will be throwing real data away and your chart can be arbitrarily wrong.

C) if you never have more than one event for each combination, then the dedup will be doing nothing so just remove it.
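
A quick way to check which of these cases applies (just a sketch reusing the six fields from your dedup clause) is to count events per combination and look for anything above 1:

index="xxx_index" sourcetype="xxx_source" | stats count BY _time, hour, hostname, monitor, monitor_label, value | where count > 1

If that returns nothing over a representative time range, the dedup is doing no work and can simply be removed.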


You're using the old style of customBehavior declarations from 1.3. This is OK but if you're using Sideview 2.X already you should switch to the newer declaration style. The newer style is a little better for a couple obscure reasons.


You might find this interesting. It'll save you some busywork and you won't have to touch properties like _earliest and _latest that are (supposed to be thought of as) private.
var job=c.get("search").job;
var duration = job.getTimeRange().getDuration() / 1000;


To be honest, instead of this whole customBehavior I would just piggyback a postProcess onto this existing search: use the addinfo command to get the earliest and latest time bounds, use a little eval with case() to build string-valued fields holding the correct binSize and span arguments in the search language, then use ResultsValueSetter to pull those string arguments down. At that point I'm done, because I can plug those arguments into my searches downstream. That would replace the entire CustomBehavior you have here, meaning you can delete all this JavaScript and replace it with roughly 5 lines of XML, and the need for all this async logic and DEFERs would go away (kudos on figuring them out though!!)
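
As a minimal sketch of that chain (the thresholds here are illustrative, not the full table from the JavaScript):

* | head 1 | addinfo | eval Difference = info_max_time - info_min_time | eval Span=case(Difference > (6*24*60*60), "1h", Difference > (1*24*60*60), "10m", true(), "5m")

addinfo attaches info_min_time and info_max_time from the dispatched time range, and a ResultsValueSetter module with its fields param set to Span would then make $Span$ available to the downstream searches.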


DANGER: You have an autoRun="True" on a module that is nested inside another module that also has an autoRun="True" attribute.

There's a common misconception that each Search module in a page needs an autoRun="True" but that's not the case at all. No matter how many levels deep your XML is, you only need one autoRun="True" at the top to kick it all off, and more than one will cause problems.
At best, 2 or more will slow down your page with some weird http aborts() and/or POST's right after page load.
At worst, this can gum up the works and break the page causing bugs where tokens don't seem to be getting passed around correctly. Keep the more upstream of the two autoRun's.


Summary indexing

At the level of understanding you're at, you should ignore the si commands. You know your way around stats, and you can do what the si commands do more efficiently with explicit stats commands on the rows going into the summary index, and explicit stats commands on the retrieved summary rows later.

UPDATE: I used the word "efficiently" when that's not really accurate. The reason not to use the si commands is that stats will be no less efficient, and far safer. The si commands, if the clause on the si side doesn't match the clause on the retrieval side, can go sideways on you; strange things might happen and you'll never know. Whereas stats is always simple and explicit.
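
To make that concrete, here is a hedged sketch of the explicit-stats pattern, reusing the fields from search example 1 (filters abbreviated; the populating search stores raw 5-minute sums, the retrieval re-aggregates them):

Populating search (scheduled, written to the summary index):
index="xxx_index" sourcetype="xxx_source" monitor="CONNEXIONS" | bucket _time span=5m | stats sum(value) AS value BY _time, hostname

Retrieval search (in the dashboard):
index="xxx_summary" report="INTERNET_FW_N1_sessions_sum_XXX" | stats sum(value) AS value BY _time | timechart span=$Span$ eval(round(avg(value),0)) AS Datacenter_Average_Session eval(round(max(value),0)) AS Datacenter_Max_Session

Keeping hostname in the stored rows costs a few extra rows but preserves the option of per-host retrieval later.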


Note that you generally want to keep 'counts' and 'sums' in your SI rows, and not averages. The reason being that keeping averages inevitably leads to taking averages of averages, and averages of averages are weird and not very trustworthy. If your data is extremely spiky they can be quite wrong.
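
For example, a sketch of that principle (field names illustrative): store sums and counts per bucket, and rebuild the average only at retrieval time from the original totals.

Into the summary:
index="xxx_index" sourcetype="xxx_source" | bucket _time span=5m | stats sum(value) AS sum_value, count AS n BY _time, hostname

Out of the summary:
index="xxx_summary" | bucket _time span=$Span$ | stats sum(sum_value) AS total, sum(n) AS n BY _time | eval Average=round(total/n, 0)

Because total and n are genuine sums, the final Average is exact for any retrieval span, which an average-of-averages is not.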


Your scheduled summary index search looks very peculiar to me. You're having it run every 55 minutes, but it's retrieving a whole day's worth of data each time. The data it gets will be the same every time, so you'll have the exact same rows in there roughly 26 times. But maybe I'm missing something.


I don't quite understand how your inline search that uses the summary index is working at all. It shouldn't. You're saving timechart output to the summary rows, which means your "events" in the summary index will look like timechart rows. And when on the retrieval side you pipe those rows back into that | timechart span=$Span$ eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session by hostname, that shouldn't be working. At least, you have no "value" field in the summary index rows -- you have only Datacenter_Average_Session and Datacenter_Max_Session and _time, so the final timechart should be all zeros.
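
If you did want to keep timechart output in the summary rows, the retrieval side would at least have to reference the fields that actually exist there, something like this sketch:

index="xxx_summary" report="INTERNET_FW_N1_sessions_sum_XXX" | timechart span=$Span$ avg(Datacenter_Average_Session) AS Datacenter_Average_Session max(Datacenter_Max_Session) AS Datacenter_Max_Session

Note the by hostname clause can't work either way, since the stats sum(value) by _time step in the populating search already discarded hostname. And avg() of the stored averages reintroduces the average-of-averages problem described above, which is another reason to store sums and counts instead.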


Performance issues in Summary indexes:

If you end up putting tons and tons of separate events that happen in a single second into a summary index, you can get some very poor performance. This might be happening. Take a close look at your SI data by running this search:

index="xxx_summary" | stats count by _time report | sort - count

If you are sending on the order of 10,000 events into single seconds, something is extremely wrong. Also, frankly, if you're sending more than 100 events into single seconds, you may be missing an opportunity to aggregate the data further before sending it into the summary index.


It does sound strange though. Something has gone very wrong with that summary index. I have seen things like this where a summary index just goes weird and becomes immensely, soul-crushingly slow. I don't know any trick other than to clean it and slowly layer pieces back in, watching out for problems step by step.



guilmxm
SplunkTrust

Hi,

Thank you very much for your answer, I really appreciate the time you spent studying my request!

To complete and/or answer your great comments:

Dedup line

Yes, I understand it would be better without it, especially since it may slow down execution and cost CPU.
But sometimes you can't control everything 🙂 Indeed, I automatically index Nagios raw data extracted by a scheduled task run by the team responsible for Nagios monitoring.

It has happened that this job was accidentally modified or re-run (mainly human mistakes) and wrongly extracted the same data several times, which meant duplicated data in my Splunk application when the daily file was indexed.

This is why I had to insert the dedup, to prevent duplicated data from being analysed.
In normal operation, it is impossible to get the same data duplicated.

After some searching, I found a solution here: http://splunk-base.splunk.com/answers/67033/how-to-remove-duplicate-events-in-search-results-without...

In my case, using the main fields extracted from the raw data:

index=xxx_index | streamstats count by _time,type,hostname,monitor,monitor_label,data_unity,value,_raw | where count > 1

gives me duplicated lines.

I was thinking about scheduling such a search each night as a user with delete privileges, but it seems the delete cannot be achieved that way (error message about non-streaming commands used in conjunction with delete...).


CustomBehavior to control bin and span definition depending on the TimeRange, with the goal of better chart granularity

Based on your very pertinent suggestion, I finally converted it into a search macro, as follows:

Macro: I added the following code inside a macro (macros.conf):

[define_span]
definition = * | head 1 | addinfo\
| eval searchStartTime=strftime(info_min_time,"%a %d %B %Y %H:%M") \
| eval searchEndTime=strftime(info_max_time,"%a %d %B %Y %H:%M") \
| eval earliest=info_min_time \
| eval latest=info_max_time \
| eval Difference = (latest - earliest) \
| eval Span=case(\
Difference > (730*24*60*60),"4d",\
Difference > (450*24*60*60),"2d",\
Difference > (150*24*60*60),"1d",\
Difference > (100*24*60*60),"12h",\
Difference > (50*24*60*60),"8h",\
Difference > (29*24*60*60),"4h",\
Difference > (14*24*60*60),"2h",\
Difference > (6*24*60*60),"1h",\
Difference > (2*24*60*60),"30m",\
Difference > (1*24*60*60),"10m",\
Difference <= (24*60*60),"5m"\
)\
| eval Showspan=case(\
Difference > (730*24*60*60),"4 Jours",\
Difference > (450*24*60*60),"2 Jours",\
Difference > (150*24*60*60),"1 Jour",\
Difference > (100*24*60*60),"12 Heures",\
Difference > (50*24*60*60),"8 Heures",\
Difference > (29*24*60*60),"4 Heures",\
Difference > (14*24*60*60),"2 Heures",\
Difference > (6*24*60*60),"1 Heure",\
Difference > (2*24*60*60),"30 Minutes",\
Difference > (1*24*60*60),"10 Minutes",\
Difference <= (24*60*60),"5 Minutes"\
)
iseval = 0

Some remarks:

  • I was wondering if it is possible to assign values inside the same "case" for both of my two vars, Span and Showspan? It would prevent me from running two cases with the same conditions just to assign two variables ^^ If you have any suggestions, don't hesitate (one possible trick is sketched just after this list)

  • for the information of anyone reading this thread, the "\" character is required only inside the macros.conf file (or the macro won't work);
    when using the search inline, don't add it

  • To debug and check this, I used the "return" command to facilitate things: "| return earliest,latest,Difference,Span,Showspan"

  • The first two "strftime" evals are not required; I added them because later (in my view) I show the search time range in a human-readable form
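
On the first remark, one possible trick (a sketch, not the only way) is to have the case() return both values as a single delimited string and split it afterwards. Only three branches are shown here; the full table from the macro would slot in the same way:

| eval pair=case(Difference > (730*24*60*60), "4d|4 Jours", Difference > (450*24*60*60), "2d|2 Jours", Difference <= (24*60*60), "5m|5 Minutes") | eval Span=mvindex(split(pair, "|"), 0), Showspan=mvindex(split(pair, "|"), 1)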

In the XML view:

In my view, I added/replaced with this simple code:

<!-- Calling macro "define_span" to define the Span and Showspan values dynamically depending on the selected TimeRange (see macros.conf) -->
  <module name="Search">
            <param name="search">`define_span`</param>  

<!-- Downstream values of Span and Showspan to all modules -->
<module name="ResultsValueSetter">
    <param name="fields">Span,Showspan</param>

I run this simple search once at the beginning of my view, and all my other searches are nested inside it.
Then I reference the variables inside my Search modules.

This works very well, better than the JavaScript solution. It even solved some bugs I occasionally had with the JavaScript when the user changed the time range inside the view (rather than coming from the home page, which passes the TimeRange downstream): from time to time jobs were strangely cancelled and the expected span value wasn't applied, producing truncated charts (the page needed refreshing).

This has been solved using this solution, THANK YOU!

Summary indexing (and its poor performance)

Indeed, perhaps I had an issue with my summary index, maybe caused by several data deletions during tests (though I never saw any specific error message in the logs or console).

I am still wondering about the value of summaries (or saved searches) in the context of my application. My goal is to manage and generate numerous stats for numerous networking devices, and managing all these saved searches and summary populating may not be as accurate or pertinent as having inline searches inside the dashboards' XML code.

I also have various views that let users select the networking devices to analyse themselves (including multi-selection), automatically generate the list of available monitors and interfaces, and then generate charts with various options (kind of chart, kind of stats...).
Summary indexing would be very difficult to implement in this context, and maybe simply not suited to it.


Duplicated autoRun="True" entries

Thanks 🙂 I've corrected my views.

Indeed, this is not something very clear...

Thank you very much for the help you provided!



guilmxm
SplunkTrust

Hi,

Thank you very much for your answer, I really appreciate the time you spent studying my request!

I have a full answer with various comments and info to post, but I can't... each time I try, I get no response to the post request and the web page just keeps loading endlessly...
