I have a simple search that is not performing well over a large dataset.
Processors: 12 cores in total:
Six-Core AMD Opteron(tm) Processor 2435, CPU 1
Six-Core AMD Opteron(tm) Processor 2435, CPU 2
Index in question: 2 million records per minute (550 GB per day)
A simple search on this data, say over a 1-minute period, takes in excess of 10 minutes to complete.
From this it is evident that even summary indexing will not work, as I would end up with 10 concurrent searches just to keep the summary populated.
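To make that concurrency concern concrete, here is a back-of-envelope sketch of the arithmetic (the figures come from the numbers above; the variable names are illustrative only, not anything Splunk-specific):

```python
# Back-of-envelope: how many summary-populating searches would need
# to run concurrently just to keep up with real time.
# All figures are taken from the question; nothing here is measured.

events_per_minute = 2_000_000      # stated indexing rate
window_minutes = 1                 # each summary search covers 1 minute of data
search_runtime_minutes = 10        # observed wall-clock time per search

# Each search covers 1 minute of data but takes ~10 minutes to run,
# so ~10 searches must be in flight at once to keep pace.
required_concurrency = search_runtime_minutes / window_minutes
print(required_concurrency)        # → 10.0

# Events each 1-minute search has to scan:
events_per_window = events_per_minute * window_minutes
print(events_per_window)           # → 2000000
```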
zfs iostat shows periods when the disks are not being accessed when the search is running.
mpstat shows a single CPU running at user=100%, with no io-wait at all.
prstat shows splunkd using only 25% of user CPU, with no iowait.
From this I conclude that there is no I/O bottleneck, so what is causing this search to run so slowly that I cannot even populate a summary index?
Any assistance or direction on where to look for the bottleneck, and possible fixes, would be gratefully received.
It would be helpful to know what the search looks like, what the data looks like, and perhaps what storage the index(es) are kept on and how it is configured.
In particular, it would be important to know the "density" of the search; the easiest way to gauge that is to see the search itself and understand how much of the data the search terms are expected to match.
If your search is dense (retrieving a large percentage of the index), then what you're seeing is to be expected. The only way to deal with this (other than summarization) is to run multiple instances of Splunk. Splunk isn't designed for maximum search performance on a single server; rather, it aims to be reasonably fast there and to scale horizontally across many nodes.
I would also expect that if you're indexing at a rate of 550GB/day, you should see 2 or 3 cores running at 100% for indexing alone. You may be seeing this, but didn't mention it.
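For context, the figures in the question imply a fairly small average event size. This is a rough sanity check only, and it assumes the 2M events/minute rate is sustained around the clock (the question doesn't say whether ingest is flat):

```python
# Rough per-event size implied by the stated ingest figures.
# Assumption: a flat 2M events/min rate for the whole day.
GB = 10**9                               # treating GB as decimal gigabytes

daily_volume_bytes = 550 * GB            # 550 GB/day, from the question
events_per_day = 2_000_000 * 60 * 24     # 2M events/min, sustained

bytes_per_event = daily_volume_bytes / events_per_day
print(round(bytes_per_event))            # → 191 (bytes per raw event)
```

At roughly 190 bytes per event, the slow search is far more likely bound by the sheer event count per minute than by raw data volume.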