Archive

Splunk freezing for other users when running a large base search

robertlynch2020
Motivator

Hi

Other users are unable to open Splunk screens for up to a minute while one user is running a large base search.

We have one search head and one indexer.
The base search takes 2 minutes to run against data models.

While this search is running, other users can't even open Splunk to do anything.

Would increasing the number of search heads or indexers help?

Any help would be great, as I don't want to stop using base searches.

Thanks in advance
Robert Lynch

0 Karma

ssadanala1
Contributor

Is your base search ending with a transforming command, or can you post what your base search looks like?


robertlynch2020
Motivator

No transforming command - the search is below:

| tstats summariesonly=true max(MXTIMING.Elapsed) AS Elapsed max(MXTIMING.CPU) AS CPU max(MXTIMING.CPU_PER) AS CPU_PER values(MXTIMING.RDB_COM1) AS RDB_COM values(MXTIMING.RDB_COM_PER1) AS RDB_COM_PER max(MXTIMING.Memory) AS Memory max(MXTIMING.Elapsed_C) AS Elapsed_C values(source) AS source_MXTIMING avg(MXTIMING.Elapsed) AS average, count(MXTIMING.Elapsed) AS count, stdev(MXTIMING.Elapsed) AS stdev, median(MXTIMING.Elapsed) AS median, exactperc95(MXTIMING.Elapsed) AS perc95, exactperc99.5(MXTIMING.Elapsed) AS perc99.5, min(MXTIMING.Elapsed) AS min,earliest(_time) as start, latest(_time) as stop FROM datamodel=MXTIMING_V8_5_Seconds WHERE
host=QCST_RSAT_40
AND MXTIMING.Elapsed > 5
GROUPBY _time MXTIMING.Machine_Name MXTIMING.Context+Command MXTIMING.NPID MXTIMING.Date MXTIMING.Time MXTIMING.MXTIMING_TYPE_DM source MXTIMING.UserName2 MXTIMING.source_path MXTIMING.Command3 MXTIMING.Context3 span=1s
| rename MXTIMING.Context+Command as Context+Command
| rename MXTIMING.NPID as NPID
| rename MXTIMING.MXTIMING_TYPE_DM as TYPE
| rename MXTIMING.Date as Date
| rename MXTIMING.Time as Time
| rename MXTIMING.Machine_Name as Machine_Name
| rename MXTIMING.UserName2 as UserName
| rename MXTIMING.source_path as source_path
| eval Date=strftime(strptime(Date,"%Y%m%d"),"%d/%m/%Y")
| eval Time = Date." ".Time
| eval FULL_EVENT=Elapsed_C
| eval FULL_EVENT=replace(FULL_EVENT,"\d+.\d+","FULL_EVENT")
| join Machine_Name NPID type=left
[| tstats summariesonly=true count(SERVICE.NPID) AS count2 values(source) AS source_SERVICES FROM datamodel=SERVICE_V5 WHERE ( host=QCST_RSAT_40 earliest=1525269600 latest=1525357584) AND SERVICE.NICKNAME IN ()
GROUPBY SERVICE.Machine_Name SERVICE.NICKNAME SERVICE.NPID
| rename SERVICE.NPID AS NPID
| rename SERVICE.NICKNAME AS NICKNAME
| rename SERVICE.Machine_Name as Machine_Name
| table NICKNAME NPID source_SERVICES Machine_Name ]
| lookup MXTIMING_lookup_Base Context_Command AS "Context+Command" Type as "TYPE" OUTPUT Tags CC_Description Threshold Alert
| appendpipe
[| where isnull(Threshold)
| rename TYPE AS BACKUP_TYPE
| eval TYPE=""
| lookup MXTIMING_lookup_Base Context_Command AS "Context+Command" Type as "TYPE" OUTPUT Tags CC_Description Threshold Alert
| rename BACKUP_TYPE AS TYPE]
| dedup Time, NPID,Context+Command
| where Elapsed > Threshold OR isnull('Threshold')
| fillnull Tags
| eval Tags=if(Tags=0,"PLEASE_ADD_TAG",Tags)
| makemv Tags delim=","
| eval Tags=split(Tags,",")
| search Tags IN (*)
| eval source_SERVICES_count=mvcount(split(source_SERVICES, " "))
| eval NICKNAME=if(source_SERVICES_count > 1, "MULTIPLE_OPTIONS_FOUND",NICKNAME)


jkat54
SplunkTrust

Did you set ulimits and THP?

index=_internal ulimit shows it every time Splunk restarts.


robertlynch2020
Motivator

Hi

THP is off. I check with:
grep -e AnonHugePages /proc/*/smaps | awk '{ if($2>4) print $0} ' | awk -F "/" '{print $0; system("ps -fp " $3)} ' | grep splunk

However, when I run index=_internal ulimit (over All Time), I get the results below, and I am not sure what I am looking at:

05-09-2018 13:51:41.284 +0200 INFO ulimit - Linux transparent hugetables support, enabled="always" defrag="always"
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: cpu time: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: user processes: 790527 processes
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: open files: 65536 files
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: data file size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: core file size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: stack size: 8388608 bytes [hard maximum: unlimited]
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: resident memory size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: data segment size: unlimited
05-09-2018 13:51:41.284 +0200 INFO ulimit - Limit: virtual address space size: unlimited

(all events: host = dell425srv, index = _internal, source = /dell845srv/apps/splunk_forwarder/splunkforwarder_NT_PAC/var/log/splunk/splunkd.log, sourcetype = splunkd, splunk_server = dell425srv)


jkat54
SplunkTrust

Looks like THP is enabled. I'm not sure if that would cause this behavior, though.


robertlynch2020
Motivator

Hi

I ran a check for THP, and to me it's off. Why do you think it is on?

I checked all the processes and it returns nothing:
grep -e AnonHugePages /proc/*/smaps | awk '{ if($2>4) print $0} ' | awk -F "/" '{print $0; system("ps -fp " $3)} ' | grep splunk
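For reference, a simpler check reads the kernel's THP setting directly - a sketch using the standard Linux interface, which is the same setting splunkd logs at startup:

```shell
# Print the kernel's transparent-hugepage setting; the bracketed value is active,
# e.g. "always madvise [never]". Falls back to a message if THP is unsupported.
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo "THP interface not available"
```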


woodcock
Esteemed Legend

What happens during the "unable" period? Do you get a "queueing" message, or some other update or is it just a blank screen or what?


robertlynch2020
Motivator

Hi Woodcock - Hope you are well 🙂

The behavior can change.
Sometimes I can't log into any new screen while the ~2-minute big search is running - Chrome will just keep spinning.
Sometimes I can use screens that are already open, but they don't work that well.

Other times, it seems to work fine-ish, but slower.
It's a funny one.


DalJeanis
SplunkTrust

I've experienced a limitation on the number of sockets available. When a single dashboard opens many simultaneous searches, it can prevent other users from receiving a socket for their dashboard.

To see if that is the case, ask the users who are "dead" to note any messages they may receive.

Also, check the dashboard and see how the base search works. This may be a candidate for the technique where the base search runs and then saves the job id of the results. The remaining searches, instead of using the base search as such, use loadjob with the id that was returned.
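This loadjob technique can be sketched roughly as follows (illustrative SPL; the saved-search name and owner/app prefix here are hypothetical). Schedule the base search as a saved search, then have each panel reuse its cached results instead of re-running it:

```spl
| loadjob savedsearch="robert:my_app:MXTIMING_Base_Search"
| stats max(Elapsed) AS Elapsed BY Machine_Name
```

Each panel then only pays the cost of its own post-processing, not the 2-minute tstats run.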


macadminrohit
Contributor

Interesting - how do we use the job id, rather than the base search concept, in subsequent searches?


woodcock
Esteemed Legend

If this is the case, be aware that sockets are also inodes, so you may be suffering from inode exhaustion.


robertlynch2020
Motivator

OK - I am seeing "waiting on available sockets" when I click on "Inspect".
This is the first time I have spotted this.


woodcock
Esteemed Legend

Inodes are controlled by ulimit; see what @jkat54 said.


robertlynch2020
Motivator

So, how do I check if I am running out of sockets?


gjanders
SplunkTrust

Perhaps try:
index=_internal "HttpListener - Can't handle request for" sourcetype=splunkd

Although I haven't seen this particular issue, so I'm not 100% sure that's the correct search. The above will fire, for example, on a deployment server receiving too many connections.
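On the OS side, one rough way to check for file-descriptor (and therefore socket) pressure is to compare a process's open descriptors against its soft limit. A sketch, using the current shell's PID ($$) as a stand-in for the splunkd PID:

```shell
# Count open file descriptors for a process and compare to the soft limit.
# On a real search head, substitute the splunkd PID for $$
# (e.g. via something like pgrep -o splunkd).
fd_count=$(ls /proc/$$/fd | wc -l)
fd_limit=$(ulimit -n)
echo "open fds: $fd_count / limit: $fd_limit"
```

If fd_count approaches fd_limit while the base search runs, the "waiting on available sockets" message would be consistent with descriptor exhaustion.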


David_Naylor
Path Finder

Run top on your search head while the user executes the search; I expect you will see utilization jump to 100%. This sounds like a resource-utilization problem where your hardware cannot keep up with the demand.


robertlynch2020
Motivator

Hi

I can't get the box over 20%.

It's a big, big box, but I am looking to push it.
I think I might have to add indexers and search heads - I have not done it before, but I will try.


robertlynch2020
Motivator

vmstat and top output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
19 0 0 907140 4304 375184512 0 0 28 104 0 3 4 1 95 0 0
15 0 0 3216580 4304 374960896 0 0 336 462 518492 24231 22 6 72 0 0
21 0 0 3744904 4304 374987200 0 0 0 29153 436836 30985 22 7 71 0 0
14 0 0 3243016 4304 375016384 0 0 0 17118 450371 29062 23 7 69 0 0
17 0 0 3439856 4304 375056768 0 0 0 427 517405 22007 24 6 70 0 0
15 0 0 3589468 4304 375072640 0 0 0 1417 499387 40596 24 5 71 0 0
15 0 0 3413696 4304 375063040 0 0 0 11410 473863 25533 23 5 72 0 0
25 0 0 3953944 4304 375152448 0 0 0 21670 492277 35373 23 6 70 0 0
13 0 0 4148080 4304 375221440 0 0 0 16422 373584 40051 20 5 75 0 0
6 0 0 4738224 4304 375207680 0 0 0 66 52522 22534 11 3 86 0 0
11 1 0 4355948 4304 375392992 0 0 0 78 54571 23997 10 4 86 0 0
10 0 0 4507112 4304 375828352 0 0 0 217 51837 17508 10 4 86 0 0
8 0 0 3708840 4304 375964480 0 0 0 46526 60016 17598 11 4 85 0 0
7 0 0 3552936 4304 376003744 0 0 0 87206 40021 10533 9 3 87 0 0
12 0 0 1958064 4304 376103168 0 0 0 806 56752 16889 23 6 71 0 0
10 0 0 2568120 4304 376076704 0 0 0 1843 60440 16156 15 4 81 0 0
9 0 0 2318016 4304 376160064 0 0 0 2604 51968 15052 14 3 82 0 0
8 0 0 2563528 4304 376129504 0 0 0 282 40177 15186 12 3 85 0 0
10 0 0 3183756 4304 376142976 0 0 0 37471 55299 14889 11 4 85 0 0

top - 15:47:42 up 2 days, 7:39, 4 users, load average: 14.84, 12.97, 10.54
Tasks: 658 total, 2 running, 656 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.5 us, 6.5 sy, 0.0 ni, 77.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 39595564+total, 977992 free, 54651488 used, 34032617+buff/cache
KiB Swap: 67108860 total, 67108860 free, 0 used. 33537020+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5254 autoeng+ 20 0 47.470g 0.036t 24340 S 680.1 9.7 777:05.55 splunkd
2274 autoeng+ 20 0 5913624 985.0m 12344 S 199.7 0.3 34:07.72 splunkd
661 autoeng+ 20 0 5807112 1.508g 12336 S 99.7 0.4 5:55.52 splunkd
40803 autoeng+ 20 0 5688312 1.542g 12412 S 99.7 0.4 7:18.46 splunkd
304 root 20 0 0 0 0 S 21.2 0.0 0:04.37 kswapd1
303 root 20 0 0 0 0 S 20.5 0.0 0:04.52 kswapd0
42315 autoeng+ 20 0 92588 41836 15156 R 8.3 0.0 0:00.25 splunkd
10272 autoeng+ 20 0 6034912 5.297g 12348 S 6.6 1.4 329:38.43 splunkd
19849 autoeng+ 20 0 2262496 1.902g 12376 S 6.3 0.5 274:43.51 splunkd
19818 autoeng+ 20 0 394720 152636 12348 S 4.3 0.0 138:57.36 splunkd
42264 autoeng+ 20 0 813372 19864 7740 S 4.3 0.0 0:00.13 java
41966 root 0 -20 0 0 0 S 2.0 0.0 1:49.34 kworker/6:0H
42261 autoeng+ 20 0 107900 13508 5096 S 1.7 0.0 0:00.05 python
19845 autoeng+ 20 0 292320 48484 12364 S 0.7 0.0 1:37.45 splunkd
25248 root 0 -20 0 0 0 S 0.7 0.0 2:06.24 kworker/9:2H
25469 root 20 0 155552 95620 95288 S 0.7 0.0 16:09.59 systemd-journal
39096 root 0 -20 0 0 0 S 0.7 0.0 0:03.88 kworker/10:2H
42073 autoeng+ 20 0 52772 2724 1440 R 0.7 0.0 0:00.14 top


somesoni2
Revered Legend

Think about beefing up your servers with the recommended hardware. If you're already there, then I would increase both search heads and indexers. What's your current hardware configuration, by the way?


robertlynch2020
Motivator

Hi

2 CPUs x 14 cores, with hyper-threading on: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz - so 56 logical cores.
RAM: 384 GB
6 TB SSD
Red Hat

When you say increase, do you mean add more? At the moment I have one of each. I think I need to add more.
