Hi,
Other users are unable to open Splunk screens for up to a minute while one user is running a large base search.
We have one search head and one indexer.
The base search takes 2 minutes to run off data models.
While this search is running, other users can't even open Splunk to do anything.
Would increasing the number of search heads or indexers help?
Any help would be great, as I don't want to revert from base searches.
Thanks in advance,
Robert Lynch
Is your base search ending with a transforming command, or can you post what your base search looks like?
No transformation command:
| tstats summariesonly=true
    max(MXTIMING.Elapsed) AS Elapsed
    max(MXTIMING.CPU) AS CPU
    max(MXTIMING.CPU_PER) AS CPU_PER
    values(MXTIMING.RDB_COM1) AS RDB_COM
    values(MXTIMING.RDB_COM_PER1) AS RDB_COM_PER
    max(MXTIMING.Memory) AS Memory
    max(MXTIMING.Elapsed_C) AS Elapsed_C
    values(source) AS source_MXTIMING
    avg(MXTIMING.Elapsed) AS average,
    count(MXTIMING.Elapsed) AS count,
    stdev(MXTIMING.Elapsed) AS stdev,
    median(MXTIMING.Elapsed) AS median,
    exactperc95(MXTIMING.Elapsed) AS perc95,
    exactperc99.5(MXTIMING.Elapsed) AS perc99.5,
    min(MXTIMING.Elapsed) AS min,
    earliest(_time) as start,
    latest(_time) as stop
    FROM datamodel=MXTIMING_V8_5_Seconds WHERE
host=QCST_RSAT_40
AND MXTIMING.Elapsed > 5
GROUPBY _time MXTIMING.Machine_Name MXTIMING.Context+Command MXTIMING.NPID MXTIMING.Date MXTIMING.Time MXTIMING.MXTIMING_TYPE_DM source MXTIMING.UserName2 MXTIMING.source_path MXTIMING.Command3 MXTIMING.Context3 span=1s
| rename MXTIMING.Context+Command as Context+Command
| rename MXTIMING.NPID as NPID
| rename MXTIMING.MXTIMING_TYPE_DM as TYPE
| rename MXTIMING.Date as Date
| rename MXTIMING.Time as Time
| rename MXTIMING.Machine_Name as Machine_Name
| rename MXTIMING.UserName2 as UserName
| rename MXTIMING.source_path as source_path
| eval Date=strftime(strptime(Date,"%Y%m%d"),"%d/%m/%Y")
| eval Time = Date." ".Time
| eval FULL_EVENT=Elapsed_C
| eval FULL_EVENT=replace(FULL_EVENT,"\d+.\d+","FULL_EVENT")
| join Machine_Name NPID type=left
[| tstats summariesonly=true count(SERVICE.NPID) AS count2 values(source) AS source_SERVICES FROM datamodel=SERVICE_V5 WHERE ( host=QCST_RSAT_40 earliest=1525269600 latest=1525357584) AND SERVICE.NICKNAME IN ()
GROUPBY SERVICE.Machine_Name SERVICE.NICKNAME SERVICE.NPID
| rename SERVICE.NPID AS NPID
| rename SERVICE.NICKNAME AS NICKNAME
| rename SERVICE.Machine_Name as Machine_Name
| table NICKNAME NPID source_SERVICES Machine_Name ]
| lookup MXTIMING_lookup_Base Context_Command AS "Context+Command" Type as "TYPE" OUTPUT Tags CC_Description Threshold Alert
| appendpipe
[| where isnull(Threshold)
| rename TYPE AS BACKUP_TYPE
| eval TYPE=""
| lookup MXTIMING_lookup_Base Context_Command AS "Context+Command" Type as "TYPE" OUTPUT Tags CC_Description Threshold Alert
| rename BACKUP_TYPE AS TYPE]
| dedup Time, NPID,Context+Command
| where Elapsed > Threshold OR isnull('Threshold')
| fillnull Tags
| eval Tags=if(Tags=0,"PLEASE_ADD_TAG",Tags)
| makemv Tags delim=","
| eval Tags=split(Tags,",")
| search Tags IN (*)
| eval source_SERVICES_count=mvcount(split(source_SERVICES, " "))
| eval NICKNAME=if(source_SERVICES_count > 1, "MULTIPLE_OPTIONS_FOUND",NICKNAME)
Did you set ulimits and THP?
index=_internal ulimit shows them every time Splunk restarts.
Hi,
THP is off:
grep -e AnonHugePages /proc/*/smaps | awk '{ if($2>4) print $0} ' | awk -F "/" '{print $0; system("ps -fp " $3)} ' | grep splunk
However, when I run index=_internal ulimit (over All Time), I get the results below, and I'm not sure what I'm looking at.
05-09-2018 13:51:41.284 +0200 INFO ulimit - Linux transparent hugetables support, enabled="always" defrag="always"
host = dell425srv index = _internal linecount = 1 source = /dell845srv/apps/splunk_forwarder/splunkforwarder_NT_PAC/var/log/splunk/splunkd.log sourcetype = splunkd splunk_server = dell425srv
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: cpu time: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: user processes: 790527 processes
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: open files: 65536 files
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: data file size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: core file size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: stack size: 8388608 bytes [hard maximum: unlimited]
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: resident memory size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: data segment size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: virtual address space size: unlimited
(the same host/source metadata appears on every event)
Looks like THP is enabled: the log line above reports enabled="always" defrag="always". I'm not sure if that would cause this behavior, though.
Hi,
I check for THP and to me it's off. Why do you think it is on?
I check all the processes and it returns nothing:
grep -e AnonHugePages /proc/*/smaps | awk '{ if($2>4) print $0} ' | awk -F "/" '{print $0; system("ps -fp " $3)} ' | grep splunk
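Independent of the grep above (which only finds processes currently mapping huge pages), the kernel exposes the configured THP state directly. A minimal sketch, assuming the standard RHEL sysfs path; the thp_active helper is hypothetical and just extracts the bracketed (active) value from the kernel's one-line format:

```shell
#!/bin/sh
# Hypothetical helper: the kernel reports THP state as e.g.
# "always madvise [never]", with the active setting in brackets.
thp_active() {
  echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# On a live box (path assumed from stock RHEL):
#   thp_active "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
thp_active "always madvise [never]"   # prints "never"
```

If this prints "always", THP is on at the OS level even when no splunkd process currently maps AnonHugePages, which would explain the grep returning nothing while the splunkd log still reports enabled="always".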
What happens during the "unable" period? Do you get a "queueing" message or some other update, or is it just a blank screen?
Hi Woodcock, hope you are well 🙂
The behavior can change.
Sometimes I can't log into any new screen while the ~2-minute big search is running; Chrome will just keep spinning.
Sometimes I can use screens that are already open, but they don't work that well.
Other times it seems to work fine-ish, but slower.
It's a funny one.
I've experienced a limitation on the number of sockets available. When a single dashboard opens many simultaneous searches, it can prevent other users from receiving a socket for their dashboards.
To see if that is the case, ask the users who are "dead" to note any messages they may receive.
Also, check the dashboard and see how the base search works. This may be a candidate for the technique where the base search runs and then saves the job ID of its results; the remaining searches, instead of using the base search as such, use loadjob with the ID that was returned.
Interesting. How do we use the job ID rather than the base-search concept in subsequent searches?
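A minimal sketch of the loadjob approach, with a hypothetical job ID: the base search runs once, you capture the sid of its finished job (from the Job Inspector, or a dashboard token populated when the job completes), and each subsequent panel replays the cached results instead of re-running the search:

```
| loadjob 1525357584.12345
| stats count BY Machine_Name
```

The sid 1525357584.12345 and the stats line are placeholders; the point is that every follow-on search starts from | loadjob <sid> rather than repeating the tstats base search.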
If this is the case, be aware that sockets are also inodes, so you may be suffering from inode exhaustion.
OK, I am seeing "waiting on available sockets" when I click "Inspect". This is the first time I have spotted this.
Inodes are controlled by ulimit; see what @jkat54 said.
So, how do I check if I am running out of sockets?
Perhaps try:
index=_internal "HttpListener - Can't handle request for" sourcetype=splunkd
I haven't seen this particular issue, so I'm not 100% sure that's the correct search, but the above will work, for example, on a deployment server receiving too many connections.
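To put numbers on socket usage, you can also compare a process's open file descriptors (sockets included) against its open-files limit via /proc. A sketch assuming Linux; the fd_usage helper and the pgrep pattern are illustrative, not a standard tool:

```shell
#!/bin/sh
# Hypothetical check: sockets count against a process's open-files limit,
# so compare /proc/<pid>/fd entries with the soft limit in /proc/<pid>/limits.
fd_usage() {
  pid=$1
  used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
  echo "pid=$pid used=$used limit=$limit"
}

# Example: the main splunkd process (oldest matching PID)
# fd_usage "$(pgrep -o splunkd)"
fd_usage $$   # demo against the current shell
```

If "used" approaches "limit" (65536 files per your ulimit output) while the dashboard is running, socket/fd exhaustion is a plausible culprit.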
Run top on your search head while the user executes the search; I expect you will see utilization jump to 100%. This sounds like a resource-utilization problem where your hardware cannot keep up with the demand.
Hi,
I can't get the box over 20%.
It's a big, big box... but I am trying to push it.
I think I might have to add indexers and search heads. I have not done it before, but I will try.
vmstat and top output:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
19 0 0 907140 4304 375184512 0 0 28 104 0 3 4 1 95 0 0
15 0 0 3216580 4304 374960896 0 0 336 462 518492 24231 22 6 72 0 0
21 0 0 3744904 4304 374987200 0 0 0 29153 436836 30985 22 7 71 0 0
14 0 0 3243016 4304 375016384 0 0 0 17118 450371 29062 23 7 69 0 0
17 0 0 3439856 4304 375056768 0 0 0 427 517405 22007 24 6 70 0 0
15 0 0 3589468 4304 375072640 0 0 0 1417 499387 40596 24 5 71 0 0
15 0 0 3413696 4304 375063040 0 0 0 11410 473863 25533 23 5 72 0 0
25 0 0 3953944 4304 375152448 0 0 0 21670 492277 35373 23 6 70 0 0
13 0 0 4148080 4304 375221440 0 0 0 16422 373584 40051 20 5 75 0 0
6 0 0 4738224 4304 375207680 0 0 0 66 52522 22534 11 3 86 0 0
11 1 0 4355948 4304 375392992 0 0 0 78 54571 23997 10 4 86 0 0
10 0 0 4507112 4304 375828352 0 0 0 217 51837 17508 10 4 86 0 0
8 0 0 3708840 4304 375964480 0 0 0 46526 60016 17598 11 4 85 0 0
7 0 0 3552936 4304 376003744 0 0 0 87206 40021 10533 9 3 87 0 0
12 0 0 1958064 4304 376103168 0 0 0 806 56752 16889 23 6 71 0 0
10 0 0 2568120 4304 376076704 0 0 0 1843 60440 16156 15 4 81 0 0
9 0 0 2318016 4304 376160064 0 0 0 2604 51968 15052 14 3 82 0 0
8 0 0 2563528 4304 376129504 0 0 0 282 40177 15186 12 3 85 0 0
10 0 0 3183756 4304 376142976 0 0 0 37471 55299 14889 11 4 85 0 0
top - 15:47:42 up 2 days, 7:39, 4 users, load average: 14.84, 12.97, 10.54
Tasks: 658 total, 2 running, 656 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.5 us, 6.5 sy, 0.0 ni, 77.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 39595564+total, 977992 free, 54651488 used, 34032617+buff/cache
KiB Swap: 67108860 total, 67108860 free, 0 used. 33537020+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5254 autoeng+ 20 0 47.470g 0.036t 24340 S 680.1 9.7 777:05.55 splunkd
2274 autoeng+ 20 0 5913624 985.0m 12344 S 199.7 0.3 34:07.72 splunkd
661 autoeng+ 20 0 5807112 1.508g 12336 S 99.7 0.4 5:55.52 splunkd
40803 autoeng+ 20 0 5688312 1.542g 12412 S 99.7 0.4 7:18.46 splunkd
304 root 20 0 0 0 0 S 21.2 0.0 0:04.37 kswapd1
303 root 20 0 0 0 0 S 20.5 0.0 0:04.52 kswapd0
42315 autoeng+ 20 0 92588 41836 15156 R 8.3 0.0 0:00.25 splunkd
10272 autoeng+ 20 0 6034912 5.297g 12348 S 6.6 1.4 329:38.43 splunkd
19849 autoeng+ 20 0 2262496 1.902g 12376 S 6.3 0.5 274:43.51 splunkd
19818 autoeng+ 20 0 394720 152636 12348 S 4.3 0.0 138:57.36 splunkd
42264 autoeng+ 20 0 813372 19864 7740 S 4.3 0.0 0:00.13 java
41966 root 0 -20 0 0 0 S 2.0 0.0 1:49.34 kworker/6:0H
42261 autoeng+ 20 0 107900 13508 5096 S 1.7 0.0 0:00.05 python
19845 autoeng+ 20 0 292320 48484 12364 S 0.7 0.0 1:37.45 splunkd
25248 root 0 -20 0 0 0 S 0.7 0.0 2:06.24 kworker/9:2H
25469 root 20 0 155552 95620 95288 S 0.7 0.0 16:09.59 systemd-journal
39096 root 0 -20 0 0 0 S 0.7 0.0 0:03.88 kworker/10:2H
42073 autoeng+ 20 0 52772 2724 1440 R 0.7 0.0 0:00.14 top
Think about beefing up your servers with proper/recommended hardware. If you're already there, then I would increase both search heads and indexers. What's your current hardware configuration, BTW?
Hi,
2 CPUs with 14 cores each and hyper-threading on: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, so 56 logical cores.
RAM size: 384 GB
6 TB SSD
Red Hat
When you say increase, you mean add more? At the moment I have one of each; I think I need to add more.