Solved: Method to force the implied search in first line to give only the latest events without using stats latest()?

ClubMed

This is just a fun optimization question. The benefit may be very little in fact!

My Splunk searches are already optimized: they join 24 million events across 3 sourcetypes in about 40 seconds, searching over 30 days, by using the stats method for joining data (https://conf.splunk.com/files/2019/slides/FNC2751.pdf).

However, before I do all the join operations using stats, I have to first use stats latest() to ensure each event is the latest.

That is because all my sourcetypes contain historical data, but each has unique identifiers. Not all sourcetypes have data every single day, so I have to look back at least 30 days to get a reasonably complete picture.

Here's an example stats latest():

<initial search>
| fields _time, xxx, xxx, <pick your required fields>
| eval coalesced_primary_key=coalesce(sourcetype_1_primary, sourcetype_2_primary, sourcetype_3_primary)
| stats latest(*) AS * by coalesced_primary_key

The total events in the index before the implicit search (first line) is run are 24,000,000 events.

After the implicit search, but before stats latest() is run, I have 13,000,000 events total.

After stats latest() is run, total becomes 750,000 events.

What if the "stats latest" pipe could be skipped altogether, by somehow making the implied search (first line) return only the latest events? In other words, cutting the event total from 24,000,000 to 750,000 events directly. If that is possible, it could make the query much faster.

I have the unique primary keys for each sourcetype already, so it would be the idea of using latest(sourcetype_1_primary) but in the first line implicit search.

I'm afraid my Splunk knowledge doesn't help me there, and googling doesn't seem to pull up anything.

PickleRick

No, you can't. Search just pulls data from the index. It doesn't do any inter-event comparisons and such so you can't just get latest event. That's what stats is for.

Also remember that in a clustered environment the latest event will come from just one of the indexers. The search command is a distributable streaming command, so it gets distributed to all search peers and runs independently on each of them. How would one peer return the latest event from a particular index without knowing whether other peers have a more recent one? And since search is a distributable streaming command, subsequent commands which do not move the processing to the SH tier (other distributable streaming commands, most notably eval) will also get executed on all indexers taking part in the search.

So no, search is for searching, stats is for aggregation (and latest() is a form of aggregation).


bowesmana

One additional optimisation that may be possible in your case...

You are expecting 13m events from 24m that satisfy the search criteria, so you want to totally ignore 11m events if possible, so they are never even scanned.

If you look at the job properties in the job inspector you will see scanCount and eventCount. One key way to improve performance is to reduce the scanCount, which counts the events where the indexers have to go and look at the raw event data to find out whether your search matches.

This can be done using the TERM(x) directive, where x is a piece of data that is a TERM, i.e. surrounded by major breakers in the data, so that it will be recorded in Splunk's tsidx files. When you use the TERM(x) directive, it will search the tsidx files for the given term and if not found will not even look at the bucket raw data for that term.

If your search criteria have constraints that can be converted to TERM directives, try that.
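As a rough sketch of what such a conversion can look like, assume the raw events literally contain the string status=active with a space (or another major breaker) on either side. The index name, sourcetype, and the status=active pair are placeholders for illustration, not taken from the original search. The first search is the plain form, the second the TERM form:

```
index=my_index sourcetype=sourcetype_1 status=active

index=my_index sourcetype=sourcetype_1 TERM(status=active)
```

The TERM version only matches if status=active appears exactly like that in the raw text. It will not match status="active" or a JSON-style "status": "active", because quotes and spaces are major breakers and split the string into separate terms. Compare scanCount in the job inspector before and after to see whether it actually helps on your data.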

There's a really good talk about this topic from .conf20 here:

https://conf.splunk.com/files/2020/slides/PLA1089C.pdf

PickleRick

Yes, both @bowesmana's and @yuanliu's methods can be used to speed up your search.

Another way to help Splunk with searching such stuff is summary indexing. You can run an incremental scheduled search which finds your latest value from, for example, every five-minute or one-hour window and stores it in an auxiliary index. Then you only have to search a small set of summarized data to find the "true latest" one. Of course it means a fairly constant load on your system to build those summaries, but you do it only once per batch of data. After that you just have a relatively fast search against the summary index.
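A minimal sketch of that approach, assuming an hourly schedule and a summary index called summary_latest (a made-up name; the index has to exist before collect can write to it). The scheduled search summarizes only the previous full hour, and latest(_time) is carried explicitly on the assumption that the latest(*) wildcard does not pick up _time:

```
<initial search> earliest=-1h@h latest=@h
| fields _time, xxx, xxx, <pick your required fields>
| eval coalesced_primary_key=coalesce(sourcetype_1_primary, sourcetype_2_primary, sourcetype_3_primary)
| stats latest(_time) AS _time latest(*) AS * by coalesced_primary_key
| collect index=summary_latest
```

The reporting search then runs over the much smaller summary instead of the raw events:

```
index=summary_latest earliest=-30d
| stats latest(*) AS * by coalesced_primary_key
```

The window size is a trade-off: shorter windows keep each scheduled run cheap but store more summary rows per key, longer windows do the opposite.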

So while Splunk cannot directly do what you want in the initial search there are some advanced techniques which can help you write a fairly efficient search to do a similar thing another way.

yuanliu

As I always say, which search is the best data analytics solution depends on the data. That is as true in general as what @PickleRick explained, and even more so in an ambiguous case like yours. A discussion last year showed me possibilities that I hadn't known before, but whether they will help your use case depends on a lot of things. So let me first put out some qualifiers that immediately come to mind. There can be many others.

1. Does every field of interest appear in every event in which sourcetype_1_primary, sourcetype_2_primary, or sourcetype_3_primary is present?
2. Are sourcetype_1_primary, sourcetype_2_primary, and sourcetype_3_primary already extracted at search time, i.e., your <initial search> does not have to extract any of them?
3. Gain from such optimization also depends on how many calculations are to be performed between index search and stats.

This is not to say that failing these qualifiers will preclude potential benefits from similar strategies, but the following is based on them.

The idea is to limit search intervals using subsearches. For this to work, of course, employed subsearches must be extremely light. Hence tstats. Here is a little demonstration.

Original:

index=_introspection component=* earliest=-4h
| stats latest(*) as * by component

With time filters:

index=_introspection component=* earliest=-4h
    [| tstats max(_time) as latest where index=_introspection earliest=-4h
    by component index
    | eval earliest = latest - 0.1, latest = latest + 0.1]
| stats latest(*) as * by component

I tested them on a standalone instance on my laptop. That is to say there are few events (only 10 components); instead of 0.1s shifts, I use 1s. Even so, the baseline is extremely unstable, ranging from 0.76s to 1.8s. The biggest gain I saw was from 1.8s to 0.6s. Smaller gains were like from 0.75s to 0.68s.
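To see what the subsearch actually contributes to the outer search, it can be run on its own with format appended. This is only an inspection aid, not part of the optimized search; it uses the same _introspection demo data as above:

```
| tstats max(_time) as latest where index=_introspection earliest=-4h by component index
| eval earliest = latest - 0.1, latest = latest + 0.1
| format
```

The result is a single search field holding one parenthesized group per component, each with its own earliest and latest bounds, joined by OR. Keep in mind that tstats can only group by indexed fields, and that subsearches are subject to result-count and runtime limits, so this pattern gets harder to apply as the number of distinct keys grows.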

Back to your correlation search. Assuming your <initial search> is a combined search, try something like this:

(sourcetype=sourcetype_1 sourcetype_1_primary=*
    [| tstats max(_time) as latest where sourcetype=sourcetype_1 by sourcetype_1_primary
    | eval earliest = latest - 0.1, latest = latest + 0.1])
    OR (sourcetype=sourcetype_2 sourcetype_2_primary=*
    [| tstats max(_time) as latest where sourcetype=sourcetype_2 by sourcetype_2_primary
    | eval earliest = latest - 0.1, latest = latest + 0.1])
    OR (sourcetype=sourcetype_3 sourcetype_3_primary=*
    [| tstats max(_time) as latest where sourcetype=sourcetype_3 by sourcetype_3_primary
    | eval earliest = latest - 0.1, latest = latest + 0.1])
| fields _time, xxx, xxx, <pick your required fields>
| eval coalesced_primary_key=coalesce(sourcetype_1_primary, sourcetype_2_primary, sourcetype_3_primary)
| stats latest(*) AS * by coalesced_primary_key

ClubMed

This is great, and long story short for your two qualifiers: yes to both (#1 and #2). I was indeed using a combined search as well.

Now for tstats, I really like your idea. The concern I had is, let's say I do have a sourcetype_1 with over 1,000,000 unique sourcetype_1_primary keys.

This sourcetype is also incremental, so any "net-new" changes for any of the 1,000,000 primary keys are dumped into Splunk once every 24 hours, and not all of the 1,000,000 keys are updated every day.

My rule of thumb is to look back a maximum of 30 days to catch all the changes and use stats latest() to create the latest data for each of the 1,000,000 primary keys.

So your tstats example seems to work only for sourcetypes that get a full data dump each day, where the interval between earliest and latest is known, and not for incremental sourcetypes. Otherwise, I could have set earliest=-24h and been done with it.

It's actually kind of ironic, knowing how Splunk searches work with timeframes. Assuming you're searching with the 'earliest' time modifier and latest is now(), Splunk searches backwards, from latest to earliest.

You can see the Splunk search working backwards in real time by observing the 'Timeline' under the ad-hoc search pane.

Given that Splunk searches backwards, I just wish there were a way, while Splunk is doing the index search, to tell it to keep only the latest event for each unique value of a field.

For example: When doing Index searches, tell Splunk to keep only the first occurring event of each unique value in the field sourcetype_1_primary. Splunk is to ignore any subsequent duplicate values as Splunk continues to search backwards.

Edit: I'm describing the streamstats command, aren't I?

Edit2: I converted my stats latest() to streamstats latest() and did not see improvements. Additionally, streamstats appears to break the stats join when I switch from stats values() to streamstats values(). It seems streamstats works correctly only for latest(), but not when joining data.

