Hello everybody,
I just started with Splunk and I am already having some serious performance problems.
My system:
* AIX 5300-08-02-0822
* 4 GB Memory (free 2.5GB)
* CPU (could use up to 5 POWER6 CPUs)
* Splunk 4.1.4
* 3 indexes
* Total amount of data: 10 GB
* Daily volume: around 1 GB
My problem is this: if I stop all the forwarders and there is no search activity, the server has no load (100% idle, no I/O). If I start a search, for example the license calculation for the last 7 days, it takes over 2 minutes to finish. All other queries also take very long, and this is with maybe 1% of our expected data size.
If I look at the system resources, I can see a few I/Os (maybe 45 to 100) and about one POWER6 CPU working (this seems normal, as one search can't be split across multiple CPUs). If I test the disk, I get an average of 1800 I/Os.
And now the magic question: what am I doing wrong?
thx christian
Hi Christian, I am answering here so others could participate.
Another customer with great Splunk performance checked his AIX settings for me and allowed me to publish them (thanks, Aaron 😉):
8 CPUs
entitled 0.4
minperm 3.0
maxperm 90.0
numperm 35.5
4 CPUs
entitled 0.2
minperm 3.0
maxperm 90.0
numperm 55.5
If I understood correctly, about 86% of their RAM is usually used for file caching. I hope this helps you compare with your own settings. Cheers, Meno
Hello all,
Thanks for your support, I think we solved the problem. The problem was neither I/O nor CPU; it was just the parameter maxclient%, which controls how much memory can be consumed by the filesystem cache. To complete this question, here is the formatted output of vmstat -v:
4194304 memory pages
4040624 lruable pages
2224021 free pages
2 memory pools
485051 pinned pages
80.0 maxpin percentage
1.0 minperm percentage
80.0 maxperm percentage
31.6 numperm percentage
1277976 file pages
0.0 compressed percentage
0 compressed pages
31.6 numclient percentage
80.0 maxclient percentage
1277976 client pages
0 remote pageouts scheduled
370 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
3261 client filesystem I/Os blocked with no fsbuf
894 external pager filesystem I/Os blocked with no fsbuf
0 Virtualized Partition Memory Page Faults
0.00 Time resolving virtualized partition memory page fault
The problem with the license calculation still exists, but it seems to be a different problem. Search performance is OK for the moment.
Setting the vmo parameters:
vmo -p -o maxperm%=80
vmo -p -o maxclient%=80
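As a rough aid for reading the vmstat -v output above, here is a minimal Python sketch that parses the "value label" lines and reports two things: the share of lruable pages holding file pages, and any non-zero "blocked with no ..." counters. The function names are made up for illustration. The assumption that numperm is file pages divided by lruable pages is inferred from the posted numbers (it reproduces the 31.6 shown), not something stated in this thread, and the blocked counters are likely cumulative, so a single snapshot says little about current behaviour.

```python
import re

# Abridged sample copied from the vmstat -v output posted above.
SAMPLE = """\
4194304 memory pages
4040624 lruable pages
2224021 free pages
1277976 file pages
31.6 numperm percentage
80.0 maxclient percentage
370 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
"""


def parse_vmstat_v(text):
    """Turn each 'value label' line into a label -> float entry."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d.]+)\s+(.+?)\s*$", line)
        if match:
            stats[match.group(2)] = float(match.group(1))
    return stats


def file_cache_percent(stats):
    """File pages as a percentage of lruable pages (assumed meaning of numperm)."""
    return 100.0 * stats["file pages"] / stats["lruable pages"]


def blocked_counters(stats):
    """Return only the 'blocked with no ...' counters that are non-zero."""
    return {
        label: int(value)
        for label, value in stats.items()
        if "blocked with no" in label and value > 0
    }


stats = parse_vmstat_v(SAMPLE)
print(f"file cache: {file_cache_percent(stats):.1f}% of lruable pages")
for label, count in blocked_counters(stats).items():
    print(f"{count:>8}  {label}")
```

On a live system the same functions could be fed the captured output of vmstat -v instead of the sample string.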
Hi,
Since you have problems with search performance, I think you should give the LPAR more CPU entitlement. As a test, you could also decrease the number of logical CPUs (lcpu), ideally while fewer users are connected and fewer scheduled saved searches are running.
Can you try to verify the performance against one of the applications provided, like the *nix app?
Hi, I tried it with the following setup: entitlement 2.0, uncapped, able to grow to 8.0. As recommended, I also increased the memory to 8 GB (usage, even with the filesystem cache, is around 2-3 GB). The result is the same. As a reference for performance I always use the license calculation, so I don't think it is our search app that is causing the problem, or am I wrong?
Hi, I have now checked the settings and also increased maxclient percentage to 75%. This gives a bit more performance, and the filesystem cache is now being used. But it is still too slow: the block size Splunk 4 uses during the search is between 4 KB and 27 KB, which is not very large.
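To put those block sizes in perspective, here is a small back-of-the-envelope sketch in Python. It combines figures from different posts in this thread (45 to 100 I/Os per second seen during a search, roughly 1800 I/Os per second in the disk test, and 4 KB to 27 KB blocks), so treat the pairing as an assumption rather than a measurement. The helper name is made up for illustration.

```python
def throughput_mb_per_s(iops, block_kb):
    """Approximate throughput in MB/s for a given I/O rate and block size."""
    return iops * block_kb / 1024.0


# I/O rates quoted in the thread: during a search vs. the raw disk test.
for label, iops in (("search, low", 45), ("search, high", 100), ("disk test", 1800)):
    for block_kb in (4, 27):
        rate = throughput_mb_per_s(iops, block_kb)
        print(f"{label:>12} @ {block_kb:>2} KB blocks: {rate:6.2f} MB/s")
```

Even at the high end of the observed search load (100 I/Os of 27 KB each), this works out to less than 3 MB/s, far below what the disk test suggests the storage can deliver.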
This sounds like an LPAR environment. Make sure your VIOS LPARs (if you have them) have enough CPU entitlement - Splunk is very I/O heavy. (It might help to share what type of disk you are using and how it's attached)
Is your Splunk LPAR capped or uncapped, and what is its CPU entitlement?
One tool we've found useful on our network is LPAR2RRD - it pulls HMC utilization data and loads it into an RRDtool database. http://sourceforge.net/projects/lpar2rrd/
Output of vmstat -v
2097152 memory pages
1996704 lruable pages
1451545 free pages
2 memory pools
189140 pinned pages
80.0 maxpin percentage
1.0 minperm percentage
80.0 maxperm percentage
14.2 numperm percentage
284766 file pages
0.0 compressed percentage
0 compressed pages
14.2 numclient percentage
75.0 maxclient percentage
284766 client pages
0 remote pageouts scheduled
0 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
3261 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
0 Virtualized Partition Memory Page Faults
0.00 Time resolving virtualized partition memory page faults
Hi, you are right. I adjusted the value for maxclient and this gives a bit more performance (but it is still well short of what I would expect). The filesystem cache is now working and the numperm value is increasing.
Hmm, just read @Meno's comment below... AIX JFS2 uses 'client' memory and not 'perm' memory for its filesystem cache. (Original JFS1 used 'perm' memory) Any chance you could just update your question above with full output of vmstat -v ?
4% numperm means that only 4% of your 4GB of memory is being used for filesystem cache. That, coupled with the large amount of 'free' memory makes me wonder exactly what's going on. Just as a curiosity, is the filesystem with your Splunk data on it mounted with the "dio" or "cio" options?
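For a sense of scale, the arithmetic behind that remark can be sketched in a few lines of Python. This is only an approximation: it applies the 4% to the full 4 GB quoted in the question, whereas the percentage may actually be relative to a slightly smaller page count, and the variable names are purely illustrative.

```python
TOTAL_MEMORY_GIB = 4.0    # memory size quoted in the question
NUMPERM_PERCENT = 4.0     # numperm value reported by the poster

# Memory currently holding cached file pages, in MiB.
cache_mib = TOTAL_MEMORY_GIB * 1024 * NUMPERM_PERCENT / 100

# Daily indexing volume quoted in the question, in MiB.
daily_volume_mib = 1.0 * 1024

print(f"file cache: about {cache_mib:.0f} MiB")
print(f"that is {100 * cache_mib / daily_volume_mib:.0f}% of one day's data volume")
```

In other words, the cache would hold only a fraction of a single day's indexed data, which fits with searches having to go back to disk.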
Hi, the values are: 1.0 minperm percentage, 80.0 maxperm percentage, 4.0 numperm percentage. That looks okay to me, as far as I understand the settings. greetz christian
The 2.5GB of free memory is interesting. Normally, AIX tries very hard to keep truly 'free' memory near zero, using it for filesystem cache instead. You can tune minperm/maxperm and related settings to influence that. Splunk depends heavily on the OS file cache and does not cache data itself (unlike, say, DB2 or Oracle). What does your "vmstat -v" say about numperm, minperm, and maxperm?
Hi, thanks for your answer. Yes, this is an LPAR, and the disk is a SAN disk connected to the VIO server via FC; the VIO server and the LPAR are connected through vscsi. After running various tests (writing to disk, reading from disk), I'm sure there is enough performance available (as I said, there is an average of 2000 I/Os).
The CPU is uncapped with 0.5 entitlement and 5 virtual processors; the weight is at the second-highest setting, and only the VIO servers are higher. There are enough CPU resources available. I know this looks like there are not enough I/Os, but they are there 🙂
I'm curious to see what kind of response you get to this question.
I run in an environment similar to yours- Indexer running on an LPAR running AIX 6.1 (Was on 5.3 just a few weeks ago tho), 1.5 POWER6 CPUs capped, 4 GB of RAM. We haven't experienced the performance problem that you're experiencing, but we're not indexing as much data as you.
(Calculating the same license usage for us takes only about 12 seconds, not the 2+ minutes you're experiencing.)
How long have you been running Splunk?
I do notice that Splunk becomes a CPU hog when it is first started, but it settles down after a little while. But it sounds like your problem goes far beyond that.
Out of curiosity, what kind of disk are you using?
Hi, we have only just started with Splunk. The amount of data should not be a problem, as there are larger environments around. It has never worked well for us with version 4.x. I will try to run Splunk 3.x on the same server, because we have another server where we once installed Splunk 3.x and it seems to run much faster there, though that is a different machine. Thanks anyway for your support.