I have two indexed fields. When I search using the 1st field, the performance is great. However, when I search using the 2nd field, the performance is bad, as slow as if it's not an indexed field.
I have done lots of troubleshooting but I still couldn't figure out why. Let me show what I did.
My transforms.conf:
[nw-uriBasePhp]
REGEX = ^[^\t]+\t[^\t]+\t[^\t]+\t(?:[A-Z]+)\ (\/[^\s\?]*\.php)
FORMAT = uriBasePhp::$1
WRITE_META = True
[nw-apiClass2]
REGEX = ^[^\t]+\t[^\t]+\t[^\t]+\t(?:[A-Z]+)\ \/[^\s\?]*api\.php\S*[?&]class\=([^\s&]+)
FORMAT = apiClass2::$1
WRITE_META = True
And below are related fragments from my props.conf and fields.conf:
props.conf:
[my_web3]
TRANSFORMS-nw = nw-uriBasePhp,nw-apiClass2
fields.conf:
[uriBasePhp]
INDEXED=true
[nw-apiClass2]
INDEXED=true
I have verified that both fields are successfully indexed by using the tstats
command.
| tstats count WHERE index=indextest5 uriBasePhp="/product.php"
| tstats count WHERE index=indextest5 apiClass2="product"
Both searches return non-zero, which is good as it shows they're both indexed fields.
Then I run a search using the first field.
index=indextest5 uriBasePhp="/product.php" | stats count
The performance is good (5 times faster than a non-indexed version of the search).
Duration Component Invocations Input Count Output Count
10.69 command.search 64 - 769,860
9.51 command.search.rawdata 63 - -
0.35 command.search.kv 63 - -
0.31 command.search.filter 63 - -
However, when I run a similar search using the other indexed field:
index=indextest5 apiClass2="product" | stats count
The performance is not good (it's just similar to what I got if I search using a non-indexed field):
Duration Component Invocations Input Count Output Count
19.76 command.search 334 - 384,286
9.08 command.search.rawdata 333 - -
5.38 command.search.filter 333 - -
4.08 command.search.kv 333 - -
What I don't understand is, each search involves only a single indexed field, but somehow the 2nd search spent lots of time in command.search.filter
and command.search.kv
, which is totally unexpected.
What else can I do to figure out what things caused those command.search.filter and command.search.kv in my 2nd search?
The root problem is found (thanks to @Iguinn and @cyue from Splunk)
In fields.conf, I made a typo.
Instead of:
[nw-apiClass2]
INDEXED=true
It should be:
[apiClass2]
INDEXED=true
That explains why although apiClass2 was found in the index, but Splunk didn't use it as an indexed field during search. Silly typo mistake from me.
The root problem is found (thanks to @Iguinn and @cyue from Splunk)
In fields.conf, I made a typo.
Instead of:
[nw-apiClass2]
INDEXED=true
It should be:
[apiClass2]
INDEXED=true
That explains why although apiClass2 was found in the index, but Splunk didn't use it as an indexed field during search. Silly typo mistake from me.
Please accept an answer.
Additional info (thanks to lguinn):
- Number of events in the sample data: 11M
- 1st field - uriBasePhp - has 614 unique values
- 2nd field - apiClass2 - has 18 unique values
The relative performance of indexed field extractions is highly dependent on the uniqueness of field values. (A list of random numbers has a high "uniqueness factor": we expect to see few, if any, duplications. A list of 1s and 0s has an extremely low uniqueness factor.)
In Splunk, this uniqueness factor needs to be considered across all fields. Splunk creates an index for each keyword that appears in the input stream; this means that every field value is indexed. If a value (like 8969087) appears only once in all the data, then a search will find this value very rapidly. Making the associated field an indexed extraction may cause the overall keyword index size to grow dramatically. Ultimately, you will increase indexing time, but you will not improve search time; in fact, search speed could degrade.
In the opposite case, assume that a value (like "WARN") appears in many different places in the data. Especially if the value appears in many different fields, the keyword index is less helpful. Splunk might retrieve 10,000 occurrences of "WARN" and then have to figure out which occurrences correspond to the field of interest. (Example, X="WARN" vs log_level="WARN"). In this case, using an indexed extraction will still increase the overall index size, at least somewhat, and increase indexing time. But you may see a significant improvement in search time.
I suggest that you choose your index-time field extractions carefully. In almost all cases, search-time field extractions are preferable: they are dynamic, they are applied to all data, and they generally perform very well. I would use an index-time field extraction only if:
(1) there is low variability in data, which is causing slow search times, or (2) the data format itself is highly variable (occasionally happens when log files provide format via headers).
lguinn, thanks a lot for the great explanation. For the indexed field (apiClass2) which doesn't give good performance, it has a very low uniqueness (only 18 distinct values, among a sample data of 11M events), and that's why we think we could speed things up a lot by creating an indexed field.
For the faster field (uriBasePhp), which we saw a 5x speed improvement after using indexed-field, it actually contains 614 distinct values, which is also low IMO.
Both fields will be used in real searches regularly, and that's why I was puzzled why indexing apiClass2 didn't help the perf at all.
Thanks for that, but after I inspected the search log of both searches, I hope I am a step closer to the root cause.
Comparing the search log from the fast search (using uriBasePhp) and the slow search (using apiClass2), I found a big difference:
Fast:
10-18-2016 16:12:12.550 INFO LocalCollector - Final required fields list = subsecond,index,prestats_reserved,psrsvd_,uriBasePhp
10-18-2016 16:12:12.550 INFO UserManager - Unwound user context: admin -> NULL
10-18-2016 16:12:12.550 INFO UserManager - Setting user context: admin
10-18-2016 16:12:12.550 INFO UserManager - Done setting user context: NULL -> admin
10-18-2016 16:12:12.550 INFO UnifiedSearch - snapped earliest=1472659201 based on index min times
10-18-2016 16:12:12.550 INFO BatchSearch - Recategorizing indextest5~2~662F7CD7-EFD9-432E-867D-4901863BDEDC as non-restartable for responsiveness.
10-18-2016 16:12:12.550 INFO BatchSearch - Searching index:indextest5 with LISPY:'[ AND uribasephp::/product.php ]'
10-18-2016 16:12:12.550 INFO DatabaseDirectoryManager::Bucket - use_bloomfilter = true
10-18-2016 16:12:12.677 INFO SearchOperator:kv - no fields required. Running only non-optimizeable extractions ...
Slow:
10-18-2016 15:57:48.658 INFO LocalCollector - Final required fields list = BUFFER_POOL_AND_MEMORY,Message,raw,_subsecond,apiClass2,dest_ip,dest_mac,dest_nt_host,host,index,prestats_reserved,psrsvd_,source,src_dns,src_ip,uri
10-18-2016 15:57:48.658 INFO UserManager - Unwound user context: admin -> NULL
10-18-2016 15:57:48.658 INFO UserManager - Setting user context: admin
10-18-2016 15:57:48.658 INFO UserManager - Done setting user context: NULL -> admin
10-18-2016 15:57:48.658 INFO UnifiedSearch - snapped earliest=1472659201 based on index min times
10-18-2016 15:57:48.658 INFO BatchSearch - Recategorizing indextest5~2~662F7CD7-EFD9-432E-867D-4901863BDEDC as non-restartable for responsiveness.
10-18-2016 15:57:48.658 INFO BatchSearch - Searching index:indextest5 with LISPY:'[ AND product ]'
As seen in the bold part of the log, in the "slow" case, somehow Splunk decided to:
- Get more fields than I ask it to get (e.g. dest_ip, dest_mac, src_ip, etc.)
- Not to search from the indexed field apiClass2
But why??? How can I fix it? I have also shared the two search log files in https://goo.gl/6X02r2
@patng323 - Just so you know your comments got caught in the site spam filter (if I had to guess probably a combination of a lot of punctuation and repeated words in the logs you included). You should have gotten a message about it needing approval from a moderator prior to being posted. In the future when this happens I'd ask that you don't repost everything from your comment, and just add additional thoughts as separate comments. Otherwise it leaves our volunteer moderators in the position of trying to figure out which of your comments has the most information in it to keep and discard the others for cleanliness. I hope I picked the correct one, but you may want to review it. Keep calm and non-spam will be approved, usually on US waking hours time.
@acharlieh, thanks for being the moderator, and sorry if I was causing trouble. However, when I post my comment, it went thru successfully without any warning.. But when I refreshed the page, my comment suddenly disappeared. That's why I did repeated reposts because I thought there was a bug in the system (actually the lack of warning was indeed a bug). Next time when it happened again I will know it's a forum bug and will wait.
Yeah, that shouldn't be what happens with that workflow; I'll see if I can prod someone on it. Thanks for letting us know your experience 🙂
But the determination is not just "how many unique values does this field have" but also "how common and these field values across ALL fields?"
Another factor, which I did not mention, is whether the field values contain punctuation or whitespace. This changes the segmentation of the values and could affect index size/performance as well.