I ran the search below over 11 million records to plot a graph:-
sourcetype="syslog" | search query: | timechart count by host
The search above is already bounded by earliest and latest times to the last 15 minutes only.
Consumption across the 4 CPUs is at 50%, and iostat -xNh
shows that both reads per second and writes per second are low, under 30 each.
However, the search is very slow and takes about 3 minutes to complete.
Is there any way to speed up the search?
The point is that I have 3 similar searches running together on the same screen, which pushes all 4 CPUs to 100% consumption and results in more than 3 minutes of waiting time for the searches to complete.
Splunk's query/search optimization is, compared to, say, an Oracle database, pretty weak. If you want to understand how and why different searches take so much time (or don't), I strongly recommend learning the basics of relational algebra (for a starting point, check out Wikipedia: https://en.wikipedia.org/wiki/Relational_algebra ). To give you a rough oversimplification of the basic idea: you want to keep the interim results you are operating on as small as possible, as early as possible. Let's say you have some data with two fields:
Fields: "a" and "b".
And (to keep it simple here) let's say we know the following:
Field "a":
- 60% of all events got a="yes"
- 40% got a="no"
Field "b":
- 90% of all events got b="foo"
- 5% got b="bar"
- The rest is a wild mix of values or unknown.
And now let's say you want to search for something like this:
search a=yes | search b=bar
(I use the pipe intentionally here)
If this search is performed by Splunk, the following will happen (hint: I'm not sure whether Splunk would optimize this kind of search on its own; however, I'm trying to explain the idea behind the optimization):
Let's say we search 10^6 (= 1,000,000) Events: search a=yes | search b=bar
- First: "a=yes": 1,000,000 events are searched, 600,000 events are left
- Second: "b=bar": 600,000 events are searched
- In total: 1,600,000 search operations
If, however, we switch the search order, we would have this: search b=bar | search a=yes
- First: "b=bar": 1,000,000 events are searched, 50,000 events are left
- Second: "b=bar": 50,000 events are searched
- In total: 1,050,000 search operations
Compared, Splunk had to search about 34% fewer events ((1,600,000 - 1,050,000) / 1,600,000 ≈ 34%; assuming search time is linear in the number of events searched, that's about 34% faster, too). However, please don't fixate on this simple example; just take away the idea behind it. As Runals already said, a good starting point is to add an index (if you have more than one) to reduce the interim results. Having some statistical information about your data (as in our example) may help, too. The reason the TERM(...) syntax could help is that it reduces the work to a single search (instead of breaking it up and joining the results afterwards).
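To make the reordering concrete in SPL: instead of chaining two search commands, you can usually push both filters into the base search so Splunk evaluates them together in one pass (a minimal sketch using the toy fields a and b from above; index=main is just a placeholder name):
index=main b=bar a=yes
The base search is where Splunk can use the index to discard events cheaply, so the earlier a filter lands there, the smaller every interim result becomes.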
Another idea to optimize your search is to always apply projection before selection (those are terms from relational algebra; seriously, learning it will actually help you a lot; see the short notation aside after the two definitions below).
- A selection is filtering rows
- A projection is filtering columns
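In the usual notation (just as an aside, with E standing for the set of events), selection is written σ_condition(E) and projection π_columns(E). "Projection before selection" then reads σ_text="query:"(π_text(E)): first drop the irrelevant columns, then filter the remaining rows.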
When using search query: Splunk will have to check all fields (columns) of your events. Whereas using a term like search text=query: would be
- 1st a projection (search in column text, ignore all other columns)
- 2nd a selection (select only rows matching query:)
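Applied to the search from the question, that means replacing the bare | search query: with a field-scoped filter. As a sketch only, assuming (hypothetically) that the keyword ends up in an extracted field named text:
sourcetype="syslog" text="query:*" | timechart count by host
(The wildcard is there because field=value in Splunk matches against the whole field value; adjust the field name and pattern to your actual data.)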
If you check out the "Quick tips for optimization" (here: http://docs.splunk.com/Documentation/Splunk/6.2.0/Search/Quicktipsforoptimization ) you'll see basically the same advice (though without the explanation of why, which is much more important to understand).
How many indexes do you have? I'd start by putting in index=(index where the syslog data is). Then I'd take out the pipe and the search command and simply put "query:" in before your first pipe. Leave the quotes in, assuming there is a space before the q and after the colon. If there IS a space you could put that in the query like
index=foo sourcetype=syslog TERM(query:)
To learn more about TERM you'll likely have to search on this board.
The main point, though, is that the way your search is constructed, it first pulls back ALL the syslog data and only then looks for just the logs that contain query:, which is inefficient.
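Putting the pieces together, the suggested rewrite would look something like this (index=foo is a placeholder for whichever index your syslog data lives in):
index=foo sourcetype=syslog TERM(query:) | timechart count by host
The filtering now happens in the base search, against the index and before events are pulled back, instead of in a piped search command afterwards.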
Edit: Thinking more about your question - when you say 4 cores, is that all that is available on the system, and is that system also processing logs? The general rule of thumb is that Splunk uses 4 cores to ingest data and 1 core per active search (pre 6.3, or 6.3 without turning some knobs and flipping some switches). The next bit is what sort of IOPS your system gets. The reference architecture is something like 12 cores, 12 GB of RAM, and 800 IOPS (or is that 1200 now? maybe just 'enterprise level'). At any rate, there are things that can be done to speed up your search, but if the system is overtaxed then there is only so much that can be done.
Thanks Runals for the tips. I tried with TERM() but it doesn't help much; I think that's because more than 95% of the syslog events contain the keyword, and all the searching is within the same index since I only have one index with all the data in it.
Below are my results, and I find that by eliminating the search query: step, I get the fastest completion time:-
sourcetype="syslog" TERM(query:) earliest=-15m | timechart count by host 03:15
sourcetype="syslog" earliest=-15m | search query: | timechart count by host 02:29
sourcetype="syslog" earliest=-15m | timechart count by host 1:41
However, some of the events are without query:, so in reality I can't eliminate query: from the search string.