Splunk Search

Is it possible to speed up a search or improve performance by increasing the number of sourcetypes?

Minghao
Explorer

I have logs that I search like this:

index=login sourcetype=login new_user=1

I also have logs without the new_user field:

index=login sourcetype=aa

What is the difference between a search that specifies the sourcetype, like

index=login sourcetype=login new_user=1

and one that does not specify the sourcetype and filters only on new_user, like

index=login new_user=1

I wonder which one is faster or performs better, and why?

And if I make new_user=1 its own sourcetype, does

index=login sourcetype=new_user

perform better than

index=login sourcetype=login new_user=1

Thank you in advance.

0 Karma

PickleRick
SplunkTrust
SplunkTrust

There are many factors affecting search speed. The most obvious and most effective way to reduce search time is to narrow the time range - fewer buckets processed means less time needed, of course.
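
For example (a sketch reusing the index, sourcetype and field names from the question; the 24-hour window is an arbitrary illustration), the time range can be pinned down right in the search:

index=login sourcetype=login new_user=1 earliest=-24h latest=now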

Other than that - indexed fields (like Splunk's default internal fields source and sourcetype) do speed up searches, at the price of heavily reduced flexibility: the field is extracted (or defined) at ingestion time and stays that way forever in the index.
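
As an illustration (a sketch only, using the names from the question): because sourcetype is an indexed field, a count broken down by it can be answered with tstats straight from the index files, without reading the raw events:

| tstats count where index=login sourcetype=login by sourcetype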

And lastly (unless we're getting into territory of reports acceleration, summary indexing or datamodels acceleration) - it's good to use "sparse" fields for searching

Contrary to a typical RDBMS, for example, Splunk doesn't work by simply looking through the values of a given field and matching them against a condition - Splunk extracts most fields dynamically, so it would be hugely inefficient to parse every single event at search time and only then check the contents of the field. Without digging too deeply into the underlying mechanics: the event's raw data is split into "terms" which are indexed simply as a bunch of strings. Then, if you search for, let's say, "login=my_user", simplifying things a bit, Splunk looks up where it has encountered the string "my_user", and only after it locates the events in which that string appears does it parse those events to see whether the occurrence was within the field named "login".

That's why you can relatively effectively look for strings with a wildcard at the end ("my_user*") but should not search for values with a wildcard at the beginning ("*my_user") - such a search has to look through every single event and check whether the actual value of the parsed field matches the string, so it's highly inefficient (a bit like searching over an unindexed column in a relational database).
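
To make that concrete (a sketch reusing the my_user example above; the index and field names are just placeholders): the first search below can use the indexed terms, so only the candidate events get parsed, while the second one, with its leading wildcard, forces Splunk to parse every event in the time range.

index=login login=my_user*

index=login login=*my_user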

The problem arises when a value is very common (for example, 70% of your events contain login=my_user), and it becomes even worse if the same value occurs often across many different fields, because Splunk has to read and parse many events until it finds the matching ones. That's when you want your fields indexed, regardless of their "inconveniences".
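
For reference, a rough sketch of what making new_user an indexed (index-time) field could look like, assuming the raw events literally contain the text new_user=1 and arrive with the sourcetype login from the question; the transform name is made up, so treat this as an outline rather than a tested configuration:

# transforms.conf
[new_user_indexed]
REGEX = new_user=(\d+)
FORMAT = new_user::$1
WRITE_META = true

# props.conf
[login]
TRANSFORMS-new_user = new_user_indexed

# fields.conf
[new_user]
INDEXED = true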

So, after this - a bit long - introduction, from a performance point of view, the search

index=a sourcetype=subsourcetype

should be as fast as, or faster than,

index=a sourcetype=generalsourcetype myfield=myvalue

unless myfield is an indexed field.

But you have to take other factors into account, like usability. Spawning many different sourcetypes can make building searches harder and analysing your data less convenient. Usually sourcetypes get "split" out of a single general sourcetype based on the actual type of event coming from the source, not on a specific value contained in it (like status=success vs. status=failed - for that you use tags), but on specific event formats (after all, different sourcetypes usually mean different parsing rules).
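
For completeness, a hedged sketch of the tag approach mentioned above (the eventtype name, the status field and the tag are hypothetical examples, not something defined in this thread):

# eventtypes.conf
[login_success]
search = index=login sourcetype=login status=success

# tags.conf
[eventtype=login_success]
success = enabled

After that you can search with index=login tag=success instead of repeating the literal field value everywhere.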

0 Karma

Minghao
Explorer

Thank you for your detailed introduction, which was very useful for understanding the basics of Splunk search, and it actually solved my question about performance optimization. Just to make sure I get your points: can I summarize that the ways to speed up a search in Splunk are narrowing the time window and specifying the fields as strictly as possible? Thanks again!

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Modern Splunk is sometimes fairly clever about optimizing searches, so

index=whatever | search field=whatever

will be equivalent to simply searching for

index=whatever field=whatever

but that only works in relatively simple cases where the whole processing pipeline can be easily expanded and simplified.

In the general case, it's best to:

1) Limit the time range. It's the single most important factor in speeding up a search. Since I/O is more expensive than memory operations, it's best to limit the amount of data that has to be read from disk.

2) Limit the events as much as you can before processing them further. Every operation further down the pipeline uses memory and processing power, so there's no point in calculating something you know you're going to filter out in the next step. Which means it's better to do

index=whatever myfield IN ("whatever1","whatever2","whatever3") | stats count by myfield

than

index=whatever  | stats count by myfield | search myfield IN ("whatever1","whatever2","whatever3")

It's best to be as specific as you can in selecting your events from the start.

3) If using wildcards, use them at the end of the search term (i.e. field=value*), try not to use them in the middle of the term (field=val*ue), and don't use them at the beginning (field=*value).

Then there are some more advanced hints about search optimization that apply in distributed environments and so on, but I think that's a topic for another time 😉

0 Karma

Minghao
Explorer

Thank you for your patient answer; it's a great help to me and my team! I want to express the depth of my gratitude to you!

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @Minghao,

good for you, see you next time!

Ciao and happy splunking

Giuseppe

P.S.: Karma Points are appreciated by all the Contributors;-)

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @Minghao,

as you can suppose, the more conditions you use, the better performance you get.

In your case, you could use something like this:

index=login ((sourcetype=login new_user=1) OR sourcetype=aa)
| ...

indexed fields (such as sourcetype, source, or host) are very useful for improving performance in your searches.
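
For instance (a sketch only; the host pattern is a made-up placeholder, not something from this thread), combining several indexed fields narrows the set of events that have to be read even further:

index=login sourcetype=login host=web* new_user=1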

Ciao.

Giuseppe

0 Karma

Minghao
Explorer

Hi @gcusello,

Thank you for your answer! I think my description was not clear enough and confused you. Actually, I was wondering whether an additional sourcetype can speed up a search compared with using a field value. Another answer has solved my problem. Thank you again!

0 Karma

gcusello
SplunkTrust
SplunkTrust

Hi @Minghao,

good for you, it was a pleasure to help you.

Please accept the right answer, for the benefit of other people in the Community.

Ciao and happy splunking.

Giuseppe

P.S.: Karma Points are appreciated by all the Contributors 😉

0 Karma

Minghao
Explorer

Thank you all the same!

0 Karma