Indexing and Searching Performance issues

cbauerlein · ‎04-11-2016

Hi,

I'm writing here out of desperation. We're having significant performance issues with our Splunk environment. I'll share as much info as I can, and any input or suggestions are very welcome:

2 standalone search heads
- 1 ES
- 1 non-ES Searching and Reporting

7 indexers
2 heavy forwarders

~8000 UFs

All boxes have 20 cores and 48 GB RAM and run Ubuntu on ESX in a dedicated UCS farm with no overprovisioning. We're using shared VMAX storage for the indexers and shared Nimble storage everywhere else.

All of our indexing and forwarding queues are more than 90% full, and our indexing is hours, in some cases days, behind.

We're struggling to identify the root cause. Any feedback is hugely appreciated.

Thank you in advance.

kerryc · ‎10-11-2016

Did you get this resolved? Have you limited your datamodels to build their summaries only from the indexes they require? This saved us a huge amount of resources when we implemented it.

joshuascott94 · ‎04-11-2016

All systems are virtual? You mentioned no over-provisioning, but do you actually have reserved CPU/Memory allocated in ESX? Make sure the reservations are explicitly set per guest system.

cbauerlein · ‎04-12-2016

My indexers spend the most processing time on the indexer processor.

All of my data queues are 100% full except the parsing queues, which range from 85% to 97% full.

We have not set explicit reservations. My VM team has assured us this is not necessary as we have one virtual host per blade and have left 4 cores and 16 GB mem for ESX.

Here is the output from top on one of my indexers:

PID   USER PR NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
2541  root 20  0 2037788 882972  8768 S 103.0  1.8  11:33.48 splunkd
8307  root 20  0  253304 110608  8156 S 100.3  0.2   3:59.04 splunkd
20956 root 20  0 2100908 895840  8760 S 100.3  1.8 131:29.97 splunkd
30105 root 20  0 2693144 1.257g  8744 S 100.3  2.7  42:34.67 splunkd
2185  root 20  0 1687216 385472  8992 S 100.0  0.8  11:22.53 splunkd
23300 root 20  0  622288 532196  7500 S 100.0  1.1   0:53.69 splunkd
23885 root 20  0 1175232 252336  7736 S 100.0  0.5   0:32.61 splunkd
8336  root 20  0  183676 110536  7304 S  99.7  0.2   3:59.64 splunkd
23878 root 20  0 1350480 198208  7896 S  98.7  0.4   0:31.95 splunkd
18148 root 20  0 1652732 159784  8172 S  98.0  0.3   2:14.20 splunkd
26549 root 20  0 1445884 260176  8236 S  94.7  0.5  13:12.25 splunkd
25556 root 20  0 2524964 450532 14332 S  59.1  0.9 578:43.78 splunkd
24804 root 20  0  122236  47516  5488 S  21.6  0.1   0:00.65 splunkd
25691 root 20  0   93304  11660  9112 S   0.7  0.0   1:29.14 splunkd
8     root 20  0       0      0     0 S   0.3  0.0  73:30.09 rcu_sched
15    root 20  0       0      0     0 S   0.3  0.0  10:30.61 rcuos/6
21    root 20  0       0      0     0 S   0.3  0.0  12:38.55 rcuos/12
22    root 20  0       0      0     0 S   0.3  0.0  11:49.96 rcuos/13
24    root 20  0       0      0     0 S   0.3  0.0  10:48.43 rcuos/15
18158 root 20  0   62820   5856   528 S   0.3  0.0   0:00.39 splunkd
21499 root 20  0   17100   1736  1084 R   0.3  0.0   0:00.23 top
23879 root 20  0   62820   5868   528 S   0.3  0.0   0:00.08 splunkd

and vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
16 1 144848 3053864 20 32453728 0 0 31 458 1 1 22 2 76 0 0

martin_mueller · ‎04-11-2016

How much volume are you pushing to your 7 indexers? In ES environments, 100 GB/day per indexer can be too much.

Are you routing all 8000 UFs through two HFs? If so, only two indexers receive data at any one time, which makes poor use of the available boxes. Consider adding more HFs, or upgrade to 6.3+ and use multiple parallel output pipeline sets to send to multiple indexers per HF.

Which processors take the most time in the indexing performance view of the Distributed Management Console? For example, if regexreplacement takes lots of time, you probably have inefficient regular expressions running at index time.
Make sure to also include the HFs' logs in this, as the HFs may be doing most of the heavy lifting during parsing.
To narrow this down further, check where the full queues "end": the last processors in the pipeline that have full input queues are usually the culprits.
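To make the "where do the full queues end" check concrete, here is a minimal Python sketch. It assumes metrics.log lines of the `group=queue` kind with `name`, `current_size_kb`, and `max_size_kb` fields, and the usual queue order parsingqueue, aggqueue, typingqueue, indexqueue. The sample lines and their numbers are made up for illustration and are not taken from this environment; the 0.9 threshold is arbitrary.

```python
import re

# Queue order along the indexing pipeline, first to last (assumed order).
PIPELINE_ORDER = ["parsingqueue", "aggqueue", "typingqueue", "indexqueue"]

# Illustrative lines in the style of metrics.log; all values are made up.
SAMPLE = """\
04-12-2016 09:15:01.123 -0400 INFO  Metrics - group=queue, name=parsingqueue, max_size_kb=6144, current_size_kb=5400, current_size=910, largest_size=1002, smallest_size=0
04-12-2016 09:15:01.123 -0400 INFO  Metrics - group=queue, name=aggqueue, max_size_kb=1024, current_size_kb=1024, current_size=2100, largest_size=2100, smallest_size=0
04-12-2016 09:15:01.123 -0400 INFO  Metrics - group=queue, name=typingqueue, max_size_kb=500, current_size_kb=500, current_size=1040, largest_size=1040, smallest_size=0
04-12-2016 09:15:01.123 -0400 INFO  Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=500, current_size=1038, largest_size=1038, smallest_size=0
"""

KEY_VALUE = re.compile(r"(\w+)=([^,\s]+)")


def fill_ratios(lines):
    """Return {queue name: current_size_kb / max_size_kb}, keeping the latest sample."""
    ratios = {}
    for line in lines:
        fields = dict(KEY_VALUE.findall(line))
        if fields.get("group") != "queue":
            continue
        try:
            name = fields["name"]
            ratio = float(fields["current_size_kb"]) / float(fields["max_size_kb"])
        except (KeyError, ValueError, ZeroDivisionError):
            continue  # skip lines that lack the expected fields
        ratios[name] = ratio
    return ratios


def last_full_queue(ratios, threshold=0.9):
    """Return the last queue in pipeline order at or above the threshold, or None."""
    full = [q for q in PIPELINE_ORDER if ratios.get(q, 0.0) >= threshold]
    return full[-1] if full else None


ratios = fill_ratios(SAMPLE.splitlines())
for queue in PIPELINE_ORDER:
    print(f"{queue}: {ratios.get(queue, 0.0):.0%} full")
print("last full queue:", last_full_queue(ratios))
```

With these sample values the last full queue is indexqueue, which would point at the processor reading from it rather than at anything earlier in the pipeline.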

Is it just indexing, or also searching that is affected? If it's just indexing and you have plenty of idle cores, you can consider upgrading to 6.3+ and use multiple parallel indexing pipeline sets to speed up indexing at the expense of using more cores.

In general, update Splunk, and especially ES and any standard TAs you have, to current versions. TAs such as those for Windows, Unix, and Oracle have recently been updated with greatly improved performance. Splunk 6.4.0 has also made some improvements to core itself for handling ES-type workloads more efficiently.

martin_mueller · ‎04-12-2016

If you're spending tons of time on the indexer processor, I'd blindly blame the actual writing-to-disk being slow... assuming that's the indexers' processor, not the HFs.
"Down there" in the pipeline, all the complicated stuff has already been done, all that's left is to write it to disk: http://wiki.splunk.com/Community:HowIndexingWorks

Does that affect all seven indexers at the same time, or is it "two overloaded, five idle"? If the former, blame storage again. If the latter, add more HFs to balance things out.

Note, imbalanced indexing also can affect search performance. If indexers 1 and 2 get all the data to index, indexers 1 and 2 also have to serve all search requests on their own.

cbauerlein · ‎04-12-2016

Yes, THP has been disabled and the ulimits have been increased or set to unlimited.

cbauerlein · ‎04-11-2016

Hi - we do about 350 GB per day.
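As a rough sanity check, the daily volume can be divided across the indexers and compared with the per-indexer figures quoted elsewhere in this thread. A minimal Python sketch, assuming a perfectly even spread in the first case and, as a pessimistic comparison, all data landing on only two indexers in the second; the function name is illustrative:

```python
def per_indexer_volume(daily_gb: float, indexers: int) -> float:
    """Average daily ingest per indexer, assuming an even spread."""
    if indexers <= 0:
        raise ValueError("indexers must be positive")
    return daily_gb / indexers


daily_gb = 350   # daily volume stated in this post
indexers = 7     # indexer count from the original post

even = per_indexer_volume(daily_gb, indexers)
# Equivalent rate if only two indexers were receiving data (one per HF).
concentrated = per_indexer_volume(daily_gb, 2)

print(f"even spread: {even:.0f} GB/day per indexer")
print(f"two indexers only: {concentrated:.0f} GB/day per indexer")
```

An even spread works out to 50 GB/day per indexer, so the total volume alone does not look excessive; how evenly the data is actually distributed matters more.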

Yes, all UFs send directly to the HFs. We are running 6.3.1 and ES 4.

I will have to get back to you on the processors and queues.

Both searching and indexing are issues. My main concern is indexing though because I don't want to be in a situation where I'm losing data.

Searching is extremely slow, and CPU load on the indexers drops by about 90% when we power down the SHs.

We've been told by colleagues that our shared storage could potentially be the culprit here but I just find that hard to believe.

Thanks for your quick responses.

starcher · ‎04-12-2016

Do you have transparent huge pages disabled on the systems?
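For anyone wanting to verify this on a box, a minimal Python sketch follows. It assumes the standard Linux sysfs location for the THP setting (some distributions use a different path) and relies on the kernel marking the active mode with square brackets, as in `always madvise [never]`.

```python
from pathlib import Path

# Standard sysfs location; assumed here, may differ on some distributions.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode, e.g. 'never' from 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


if __name__ == "__main__":
    if THP_PATH.exists():
        print("THP mode:", active_thp_mode(THP_PATH.read_text()))
    else:
        print(f"{THP_PATH} not found on this system")
```

A result of `never` means THP is disabled for the running kernel; the setting has to be in effect before splunkd starts to matter for Splunk.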

vasildavid · ‎04-12-2016

Regarding storage being the culprit: what type of CPU load are you seeing when looking at your process list in 'top' or 'vmstat'? Is it mostly USER time or is it SYSTEM? If you are seeing a lot of CPU time spent on the SYSTEM side, you could be IO bound. If it is all tied up in USER land you are CPU bound. For IO-bound workloads, you will need to increase your available storage IOPS/bandwidth. For CPU-bound workloads, add processors/cores.
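The user/system split can be read straight off a vmstat line. Below is a minimal Python sketch that parses the vmstat sample posted earlier in this thread and reports which of `us`, `sy`, and `wa` is largest. The `wa` (I/O wait) column is included because it is worth reading alongside `sy` when storage is suspected. Note that a single vmstat run with no interval reports averages since boot, so a repeated sample such as `vmstat 5` is likely to be more telling; the function names are illustrative.

```python
# Column header and value line copied from the vmstat output in this thread.
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
VALUES = "16 1 144848 3053864 20 32453728 0 0 31 458 1 1 22 2 76 0 0"


def parse_vmstat(header: str, values: str) -> dict:
    """Map vmstat column names to integer values."""
    names = header.split()
    numbers = [int(v) for v in values.split()]
    if len(names) != len(numbers):
        raise ValueError("header and value column counts differ")
    return dict(zip(names, numbers))


def dominant_cpu_state(sample: dict) -> str:
    """Return whichever of us, sy, wa holds the largest CPU share."""
    return max(("us", "sy", "wa"), key=lambda name: sample[name])


sample = parse_vmstat(HEADER, VALUES)
print("us={us}% sy={sy}% id={id}% wa={wa}%".format(**sample))
print("largest non-idle share:", dominant_cpu_state(sample))
```

For the posted sample, user time is the largest non-idle share (22% against 2% system and 0% I/O wait), though as a since-boot average it may hide what happens under load.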

How many of your ES datamodels are you really using, and how many do you have accelerated? Try disabling acceleration on datamodels that you are not using. You might also want to limit the number of datamodel acceleration jobs that can run concurrently, as well as the backfill_time, in your datamodels.conf:

[default]
acceleration.max_concurrent = 1
acceleration.backfill_time = -1h

starcher · ‎04-11-2016

If using ES, I would aim for closer to 70-80 GB/day per indexer. On 6.3+ I would definitely add more HFs with an input pipeline count of 2 in front of your indexers. Then make the same pipeline improvement on the indexers once you get more HFs added. You might also consider forcing time-based load balancing from the HFs to the indexers.
