Splunk Search

Can you help me design an event count summary?

Communicator

Currently, we have about 100 applications writing roughly 50 million events per day to a single logging index/sourcetype. It works fine when you are looking at a specific application at specific times, which is the most common scenario. It works okay when you want a more general view of an application over the course of about a week. Much beyond that and things start to get slow and annoying.

I tried setting up a summary index that would only pull unique combinations of logging events — something like this:

index=logging sourcetype=mylogs
| sistats count by application, eventcode, eventtext, host

The summary ran daily and brought the event count down from ~50 million to a little over 1 million. Success!! Now the users can run queries like this:

index=summary source=loggingSummary application=myApplication
| bin _time span=1d
| stats sum(count) by _time

This worked just fine for the first couple of weeks of data, but then things got slow again. Also, the summary job takes over an hour to run, and we get warnings in the audit log like this:

11-02-2018 00:33:32.701 +0000 WARN  AggregatorMiningProcessor - Too many events (1200K) with the same timestamp: incrementing timestamps 12 second(s) into the future to insure retrievability - data_source="loggingSummary", data_host="mySearchHead", data_sourcetype="stash_new"

Also, because the job takes so long to complete, the last time we restarted the search head the job was killed and its data was lost.
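One idea I haven't tried yet: the timestamp warning presumably happens because every sistats result gets stamped with the scheduled run time, so binning _time inside the summary search itself should spread the timestamps out (at the cost of more summary rows). Something like:

index=logging sourcetype=mylogs
| bin _time span=1h
| sistats count by _time, application, eventcode, eventtext, host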

Ultimately the users want to know the following:

  • When was the first time this event happened?
  • When was the most recent time this event happened?
  • What is the total count of these events (a daily count is a nice-to-have)?
  • Keep at least one year of data
  • Be able to compare current data with historical data to find new events that have never happened before
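For reference, the first three could be answered from the current summary with something roughly like this (field names assumed from the summary search above; first/last times are only as granular as the daily summary runs):

index=summary source=loggingSummary application=myApplication
| stats min(_time) as first_seen, max(_time) as last_seen, sum(count) as total by eventcode, eventtext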

I thought about maintaining a running total of the events and the dates in a lookup, but it seems like we have far too many rows for a CSV. I also considered the KV store, but haven't tried that yet.
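If we did go the lookup route, I imagine the nightly job would look something like this, merging each day's results into a running state file (event_history.csv is a made-up name, and this is untested):

index=summary source=loggingSummary earliest=-1d@d latest=@d
| stats min(_time) as first_seen, max(_time) as last_seen, sum(count) as total by application, eventcode
| inputlookup append=t event_history.csv
| stats min(first_seen) as first_seen, max(last_seen) as last_seen, sum(total) as total by application, eventcode
| outputlookup event_history.csv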

I might also need to give up on putting all the application data in one spot and find a way to divide it up. It could be that 1 million+ rows of summary data per day is just too much. Thanks.

0 Karma
1 Solution

Communicator

Okay, so I ended up having to change how the grouping worked, and used a bit of trickery to preserve the granularity I wanted. In essence, I pulled fields out of the "by" clause and put them into "values" columns. The part that gave me grief was getting back to the granular counts.

So I took this query:

index=logging sourcetype=mylogs
| sistats count by application, eventcode, eventtext, host

Then I moved a couple of the "by" groupings into "values" clauses, like this:

index=logging sourcetype=mylogs
| sistats count, values(host) as host, values(eventcode) as eventcode by application, eventtext

This caused a problem, though, because now I can't easily get the count by host. I did notice that the summary contained a preserved field called psrsvd_vm_host with values like this:
server1 ; 2 ; server2 ; 4 ; server3 ; 6 ;

I'm pretty sure I could use that field to get back to the original counts by host, but what I need is the combination of host AND eventcode. The preserved fields work individually but not in tandem. I ended up modifying the query to look like this:

index=logging sourcetype=mylogs
| eval errorLocation_{host}_{eventcode} = 1
| sistats sum(errorLocation_*) as *, values(host) as host, values(eventcode) as eventcode by application, eventtext

By doing this I'm able to get back to the host+eventcode counts, and so far it's working well. The number of summary events has decreased dramatically, and query performance against the summary seems good. It looks like fewer, wider events perform better than more, narrower ones. Thanks.
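For anyone curious, querying the per-host/eventcode counts back out would look roughly like this. Note it assumes the prefix is kept in the summary (sistats sum(errorLocation_*) as errorLocation_* rather than as *, so the columns can be matched with a wildcard) and that host names contain no underscores, since the field name is split on them:

index=summary source=loggingSummary application=myApplication
| stats sum(errorLocation_*) as errorLocation_* by eventtext
| untable eventtext location count
| rex field=location "errorLocation_(?<host>[^_]+)_(?<eventcode>.+)"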


0 Karma


SplunkTrust

Getting a count over time split by some fields is a perfect first use case for datamodel acceleration. See http://docs.splunk.com/Documentation/Splunk/7.2.0/Knowledge/Aboutdatamodels for more details.

In short, create a data model with the search index=logging sourcetype=mylogs and the fields application, eventcode, eventtext on top of default fields such as _time, host, source, sourcetype. Accelerate the data model over a year. Have a lunch break or two. Once the data model is mostly accelerated, try searches like this:

| tstats min(_time) as first_time from datamodel=mymodel.myobject where myobject.eventcode=42

| tstats count from datamodel=mymodel.myobject where myobject.application=myApplication by _time span=1d myobject.eventcode

[Syntax untested, but should mostly be correct...]
You can also use the Pivot interface to build simple datamodel-backed reports like that to ease the hurdle of learning tstats syntax.
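For the "events that have never happened before" requirement, the same approach works — find each combination's first appearance and keep only the recent ones (model and field names assumed as above, syntax equally untested):

| tstats min(_time) as first_seen from datamodel=mymodel.myobject by myobject.eventcode, myobject.eventtext
| where first_seen >= relative_time(now(), "-1d@d")

Anything surviving the where clause first appeared within the last day.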

0 Karma

SplunkTrust

There have been data model improvements in many versions, but very high-cardinality splits will always take some time to run. The split by eventtext in particular sounds expensive, but it's hard to be sure without knowing your specific use case and data.

0 Karma

Communicator

We used data models before without much success. The jobs to build them took a long time, and the results were not that fast. Admittedly, though, we were not using tstats to query the data back, and we were on version 6.4.x.

Do you know if there have been any data model acceleration improvements up to our current version, 7.0.5? Also, might this problem be a good fit for a metrics index? I don't have much experience there either. Thanks.
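If we do experiment with metrics, I'm imagining something like this to turn the hourly counts into metric data points via mcollect, which needs a metric_name and _value on each result (untested; the index name is made up):

index=logging sourcetype=mylogs
| bin _time span=1h
| stats count by _time, application, eventcode, host
| eval metric_name="app.event.count", _value=count
| mcollect index=my_metrics application eventcode host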

0 Karma