Summary Index for non-stats data?

mfrost8
Builder

I'm trying to create searches that parse a large set of events and return daily reports: essentially, a count of the distinct users for some subset of applications, with the events stored in a big-ish index.

I wanted to
a) make these searches run quickly
b) be able to store this information longer than we would normally retain data from the raw index
c) also get quick results for the same info in a monthly and weekly report

I figured this sounded like a good case for summary indexing -- fast and can retain small amounts of data a long time.

Originally, I ran something like "stats dc(users) ..." and stuffed the result into the summary index hourly. It then occurred to me that this wasn't going to work, because you can't derive the distinct count for a day, a week, or a month from hourly counts alone. That is, you can't tell whether the 30 users from two hours ago include some of the same users as the 50 from the last hour.
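
To make the problem concrete, here is a small Python sketch of the same arithmetic outside Splunk. The user names and the size of the overlap (20 users) are made up for illustration; only the 30 and 50 come from the example above.

```python
# Hypothetical user sets for two consecutive hours (names are invented).
hour_1 = {f"user{i}" for i in range(30)}      # 30 distinct users
hour_2 = {f"user{i}" for i in range(10, 60)}  # 50 distinct users, 20 also seen in hour_1

# Adding the stored hourly distinct counts double-counts the overlap.
sum_of_counts = len(hour_1) + len(hour_2)

# The real distinct count needs the user identities, not just the counts.
true_distinct = len(hour_1 | hour_2)

print(sum_of_counts, true_distinct)  # 80 60
```

With only the two numbers 30 and 50 stored, any answer between 50 and 80 would be consistent with the data, which is why the hourly counts can't be rolled up.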

Now it seems to me that it would make more sense to get the list of unique users seen in each hour and write that to the index. The reporting search would then effectively run something like "timechart span=1d dc(user) ..." over the summary, which is a much smaller set of data. The summary would be generated by an hourly search like

... | dedup user | table user

The thing is, in a general sense this is summarized data: I've whittled the raw events down to a smaller set containing only the field I want. In a more specific sense, though, I'm not using an actual stats command. I couldn't find anything that says whether only the output of stats-like commands can be written to a summary index, or whether writing anything else is a good idea at all.

The only alternative I can think of is to run some very long searches over the raw data, and then only be able to report on a much shorter time range than we'd like.

Am I going the wrong way?

Thanks


DalJeanis
Legend

A summary index can contain literally anything you want, whether it is really a summary of anything or not. It could be a synthetic field, a copy of an entire event, or whatever.

Sounds like you just want _time (binned by hour) and user. Then you can just read your summary index records for the period in question, and do dc(user) over that time range.
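
As a rough illustration of that idea outside Splunk, the Python sketch below reduces raw events to unique (hour, user) rows and then counts distinct users per day from those rows. The function names, the event tuples, and the use of UTC days are all assumptions for the example, not anything Splunk does for you.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_summary(events):
    """Reduce raw (epoch_seconds, user) events to unique (hour_start, user) rows."""
    return sorted({(ts - ts % 3600, user) for ts, user in events})

def distinct_users_per_day(summary_rows):
    """Count distinct users per UTC day from the hourly summary rows."""
    users_by_day = defaultdict(set)
    for hour_start, user in summary_rows:
        day = datetime.fromtimestamp(hour_start, tz=timezone.utc).date().isoformat()
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}

# Made-up events; base is 2024-01-01 00:00:00 UTC.
base = 1704067200
events = [
    (base + 10, "alice"),
    (base + 20, "alice"),       # same hour as above, collapses into one row
    (base + 3700, "alice"),     # next hour, new row
    (base + 3800, "bob"),
    (base + 86400 + 5, "bob"),  # next day
]

rows = hourly_summary(events)         # 4 summary rows from 5 raw events
daily = distinct_users_per_day(rows)  # {'2024-01-01': 2, '2024-01-02': 1}
print(rows)
print(daily)
```

Because each row keeps the user identity, the same rows can be regrouped by week or month without going back to the raw events.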

Depending on what use cases you see, you might consider a count of records, and/or a first(_time) and last(_time) for the hour. It wouldn't be much more space, and the range of available reporting would be expanded quite a bit. Your call.
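
A hedged sketch of what those extra fields buy you, again in plain Python rather than SPL. The field names "count", "first", and "last" and the helper functions are hypothetical; the point is only that counts add up across hours and first/last times combine with min and max, so they roll up cleanly in a way a distinct count does not.

```python
def hourly_summary_with_stats(events):
    """Per (hour_start, user): event count plus first and last raw timestamps."""
    summary = {}
    for ts, user in events:
        key = (ts - ts % 3600, user)
        if key in summary:
            row = summary[key]
            row["count"] += 1
            row["first"] = min(row["first"], ts)
            row["last"] = max(row["last"], ts)
        else:
            summary[key] = {"count": 1, "first": ts, "last": ts}
    return summary

def rollup_for_user(summary, user):
    """Combine one user's hourly rows over the whole range."""
    rows = [row for (_, u), row in summary.items() if u == user]
    return {
        "events": sum(r["count"] for r in rows),         # counts are additive
        "first_seen": min(r["first"] for r in rows),     # earliest of the firsts
        "last_seen": max(r["last"] for r in rows),       # latest of the lasts
    }

# Made-up events as (epoch_seconds, user).
events = [(7205, "alice"), (8100, "alice"), (10830, "alice"), (7260, "bob")]
summary = hourly_summary_with_stats(events)
alice = rollup_for_user(summary, "alice")
print(alice)  # {'events': 3, 'first_seen': 7205, 'last_seen': 10830}
```

With these fields in the summary, reports such as events per user per day or first and last activity in a period become possible from the summary alone.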


mfrost8
Builder

Thanks. Good to know that I’m not veering off in the wrong direction.

I am a bit unclear about the _time stuff, though. You’re saying I should really write out

... | table _time, user

Then? For some reason I was thinking that the time range the search ran over would be captured in the summary index on its own, and that I wouldn't have to write it out explicitly.

I am also unclear about the first and last time thing you suggested. How would I use that?

Mark
