Splunk Search

Can you help me design an event count summary?

tjago11
Communicator

Currently, we have about 100 applications writing about 50 million events per day to a single logging index/sourcetype. It works fine when you are looking at a specific application at specific times, which is the most common scenario. It works okay when you want to get a more general view of an application over the course of about a week. Much beyond that and things start to get slow and annoying.

I tried setting up a summary index that would only pull unique combinations of logging events — something like this:

index=logging sourcetype=mylogs
| sistats count by application, eventcode, eventtext, host

The summary ran daily and brought the event count from ~50 million down to a little over 1 million. Success!! Now the users can run queries like this:

index=summary source=loggingSummary application=myApplication
| bin _time span=1d
| stats sum(count) by _time

This worked just fine for the first couple of weeks of data, but then things got slow again. Also, the summary job takes over an hour to run, and we get warnings in the internal logs like this:

11-02-2018 00:33:32.701 +0000 WARN  AggregatorMiningProcessor - Too many events (1200K) with the same timestamp: incrementing timestamps 12 second(s) into the future to insure retrievability - data_source="loggingSummary", data_host="mySearchHead", data_sourcetype="stash_new"

Also, because the job takes so long to complete, the last time we restarted the search head the job got killed and that run's data was lost.

Ultimately, the users want to know the following (a rough sketch of the query this boils down to is shown after the list):

  • When was the first time this event happened?
  • When was the most recent time this event happened?
  • What is the total count of these events (daily count is a nice to have)?
  • Keep at least one year of data
  • Be able to compare current data with historical data to find new events that have never happened before
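
For a single event, that works out to something roughly like this against the current summary (the eventcode value is just a made-up example):

index=summary source=loggingSummary application=myApplication eventcode=500
| stats min(_time) as first_seen, max(_time) as last_seen, sum(count) as total_count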

I thought about maintaining a running total of the events and their first/last dates in a lookup, but it seems like we have way too many rows for a CSV. I also considered the KV store, but haven't tried that yet.
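
Roughly what I had in mind for the lookup idea was a scheduled search along these lines (event_totals.csv is just a placeholder name and would need to be created once up front):

index=summary source=loggingSummary earliest=-1d@d latest=@d
| stats min(_time) as first_seen, max(_time) as last_seen, sum(count) as total by application, eventcode, eventtext
| inputlookup append=t event_totals.csv
| stats min(first_seen) as first_seen, max(last_seen) as last_seen, sum(total) as total by application, eventcode, eventtext
| outputlookup event_totals.csv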

I might also need to just give up on putting all the application data in one spot and find a way to divide it up. It could be that 1 million+ rows of summary data per day is just too much. Thanks.

1 Solution

tjago11
Communicator

Okay, so I ended up having to change how the grouping worked and use a bit of trickery to preserve the granularity I wanted. In essence, I pulled fields out of the "by" clause and moved them into values() columns. The part that gave me grief was being able to get back to the granular counts.

So I took this query:

index=logging sourcetype=mylogs
| sistats count by application, eventcode, eventtext, host

Then I moved a couple of the "by" fields into values() clauses, like this:

index=logging sourcetype=mylogs
| sistats count, values(host) as host, values(eventcode) as eventcode by application, eventtext

This caused a problem, though, because now I can't easily get the count by host. I did notice that the summary contains an internal field called psrsvd_vm_host with values like this:
server1 ; 2 ; server2 ; 4 ; server3 ; 6 ;

I'm pretty sure I could use that field to get back to the original counts by host, but what I need is the combination of host AND eventcode, and the SI fields work individually but not in tandem. I ended up modifying the query to look like this:

index=logging sourcetype=mylogs
| eval errorLocation_{host}_{eventcode} = 1
| sistats sum(errorLocation_*) as *, values(host) as host, values(eventcode) as eventcode by application, eventtext

By doing this I'm able to get back to the host+eventcode counts, and so far it's working well. The number of summary events has decreased dramatically, and query performance against the summary seems good. It looks like having fewer, wider events performs better than having more, narrower ones. Thanks.
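
For anyone trying the same thing, getting those counts back out is basically a wide-to-long conversion. A rough sketch of the query-back, assuming the summarized fields keep the errorLocation_ prefix (adjust the wildcard and the rex if the "as *" rename strips it) and that host names don't contain underscores:

index=summary source=loggingSummary application=myApplication
| stats sum(errorLocation_*) as errorLocation_* by eventtext
| untable eventtext column total
| rex field=column "^errorLocation_(?<host>[^_]+)_(?<eventcode>.+)$"
| where isnotnull(host)
| stats sum(total) as total by host, eventcode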

martin_mueller
SplunkTrust

Getting a count over time split by some fields is a perfect first use case for datamodel acceleration. See http://docs.splunk.com/Documentation/Splunk/7.2.0/Knowledge/Aboutdatamodels for more details.

In short, create a data model with the search index=logging sourcetype=mylogs and the fields application, eventcode, eventtext on top of default fields such as _time, host, source, sourcetype. Accelerate the data model over a year. Have a lunch break or two. Once the data model is mostly accelerated, try searches like this:

| tstats min(_time) as first_time from datamodel=mymodel.myobject where myobject.eventcode=42
| tstats count from datamodel=mymodel.myobject where myobject.application=myApplication by _time span=1d myobject.eventcode

[Syntax untested, but should mostly be correct...]
You can also use the Pivot interface to build simple datamodel-backed reports like that to ease the hurdle of learning tstats syntax.
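
For the requirement of spotting events that have never happened before, something along these lines (equally untested, field names assuming the model layout above) compares each event's first-seen time against a cutoff:

| tstats min(_time) as first_time from datamodel=mymodel.myobject by myobject.application myobject.eventcode myobject.eventtext
| where first_time >= relative_time(now(), "-1d@d")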

martin_mueller
SplunkTrust

There have been data model improvements in many versions, but very high-cardinality splits will always take some time to run. The split by eventtext in particular sounds expensive, but it's hard to be sure without knowing your specific use case and data.

tjago11
Communicator

We used data models before without much success. The jobs to build them took a long time and the results were not that fast, though admittedly we were not using tstats to query the data back and were on version 6.4.x.

Do you know if there have been any data model acceleration improvements up to our current version, 7.0.5? Also, is it possible that this problem would be a good fit for a metrics index? I don't have much experience there either. Thanks.
