Getting Data In

Is the Splunk 6.3 universal forwarder using 90% of your CPUs?

lycollicott
Motivator

I'm not sure how long it has been happening, but I began to see it across our UFs today.

0 Karma
1 Solution

lycollicott
Motivator

Version 6.3.2 seems to have done the trick.

jgbricker
Contributor

Luckily, I was able to remove the wildcards from my stanzas, which solved my problem. In your case, though, that doesn't seem like a good solution. You're probably already aware of your options here: http://docs.splunk.com/Documentation/Splunk/6.1/Data/Whitelistorblacklistspecificincomingdata
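
For illustration, the whitelist and blacklist settings that doc describes are per-stanza regexes in inputs.conf, matched against each file's full path; the path and patterns below are only hypothetical examples:

[monitor://D:\apps\logs]
# index only .log files found under the monitored directory
whitelist = \.log$
# skip anything under a temp subdirectory
blacklist = \\temp\\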

0 Karma

lycollicott
Motivator

I removed the wildcards too, but a new problem arose: on some UFs the CPU flapped instead of pegging.

Support logged bug SPL-108954 for the issues.

0 Karma

freni59
New Member

I'm seeing this as well on some of my UFs. I have 4 servers on which this has occurred 1-3 times during the last week. There is nothing special about these 4 boxes: they run the same version and the same config as the rest of the bunch (~105 Windows boxes). They're in the same Windows domain, but so are others; they are not on the same VMware host; and it doesn't occur simultaneously on all 4. What they do have in common is that they complain about the time in their splunkd.log, but only one system has logged any other time-related errors in its event logs.

0 Karma

lycollicott
Motivator

What was their complaint about time?

I eliminated the wildcards from all my 6.3 UFs and splunkd.exe stopped pegging every host at 100% CPU, but I still had some problems: splunkd.exe's CPU usage flapped wildly on some hosts, but not others.

Support sent my original case off to development and they are working on it as a bug, but in the meantime I have downgraded.

0 Karma

freni59
New Member

Log example:
11-23-2015 13:03:15.529 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694994ms.
11-23-2015 13:03:15.607 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694963ms.
11-23-2015 13:03:15.623 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3695009ms.
11-23-2015 13:03:15.733 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694244ms.
11-23-2015 13:03:16.248 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3711931ms.
11-23-2015 13:03:18.154 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3703228ms.
11-23-2015 13:07:07.951 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 4293229ms.
11-23-2015 13:00:19.506 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3691155ms.
11-23-2015 13:00:19.506 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690904ms.
11-23-2015 13:00:19.756 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690920ms.
11-23-2015 13:00:19.772 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690920ms.
11-23-2015 13:00:20.553 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3675326ms.
11-23-2015 13:00:26.240 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3681920ms.
11-23-2015 13:05:36.024 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3091934ms.

As you can see above, I typically get 6-7 of the first type of log entry, with nothing in the log before them that is close in time and could explain a message of that kind. Then come typically 7 of the second type of entry, and then nothing until the Splunk forwarder service is restarted. Notice the time difference between entries 7 and 8: the log time actually jumps backwards. The number of milliseconds doesn't necessarily match the visible shift backwards in time, though; I have another box with roughly the same millisecond values in the messages, but its log entries jump back about 45 minutes.

Interesting that you mention the CPU flapping, because I have the impression that after the first "Restart-Service splunkforwarder" it actually flaps all over the place, but calms down after a second restart of the service.

Since this showed up in 6.3, I've gone the other way from you and upgraded to 6.3.1 yesterday, and I'm keeping a close eye on it today. The funny thing is that this didn't show up directly after the upgrade to 6.3, but some weeks later.

(BTW, when you say you've eliminated the wildcards, what exactly do you mean by that?)

0 Karma

aosso
Path Finder

We are having this problem on some servers. It has actually happened on the same day and at the same time every week since we upgraded our UFs to 6.3.0, so we are currently trying to narrow down the cause.

In any case, this did not happen with 6.2.5... maybe we'll downgrade some UFs to 6.2.6 and see what happens.

lycollicott
Motivator

We saw the high CPU behavior almost all the time on 6.3, but as soon as we downgraded to 6.2.5 the CPU ran at 0-1%.

0 Karma

lycollicott
Motivator

I may wave the flag of surrender. I wanted to create a diag tarball for 6.2.5 and for 6.3.0, so after I made the first one I upgraded one node back to 6.3.0, and its CPU would not peg. I did the same procedure on a second node and it did peg.

I must say that this lacks some consistency.

0 Karma

aosso
Path Finder

Yeah, it's totally random... I just did a downgrade to 6.2.6 (we went 6.2.5 > 6.3.0 > 6.2.6) to see what happens. Today the Splunk UF totally bogged down two database servers.

0 Karma

aosso
Path Finder

Well, the problem did not reproduce again after downgrading the affected servers to 6.2.6. However, I have another Splunk environment with different servers running 6.3.0, and this problem does not happen there.

I'll keep installing 6.2.6 for now on new deployments.

0 Karma

DaClyde
Contributor

Have you tried using the blacklist feature in inputs.conf to specifically blacklist the pagefile.sys file?

0 Karma

lycollicott
Motivator

Yes, I did, but it made no difference.

0 Karma

DaClyde
Contributor

What regex did you use to try and blacklist the pagefile.sys?

0 Karma

lycollicott
Motivator

Actually it was just
blacklist = pagefile.sys

0 Karma

DaClyde
Contributor

It is expecting a regular expression, so try this (to escape the dot):

blacklist = pagefile\.sys
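
For context, blacklist is a per-stanza setting in inputs.conf whose value is a regex matched against each file's full path; a minimal sketch, assuming a monitor that ends up scanning the drive root (the path here is hypothetical):

[monitor://c:\]
# the escaped dot matches a literal "." rather than any character
blacklist = pagefile\.sys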
0 Karma

lycollicott
Motivator

I did, but it made no difference. I also tried just pagefile, but to no avail.

0 Karma

lycollicott
Motivator

More info....

This has been happening semi-consistently for 3 weeks since the exact date of the 6.3 UF upgrades.
We have monitors in place like this (which did not cause high CPU in 6.2):

[monitor://j:\tomcat-inst*\logs\]

because we have many Tomcat directories like these:

j:\tomcat-inst1a
j:\tomcat-inst1b
j:\tomcat-inst2a
j:\tomcat-inst2b

Apparently that makes Splunk read all of j:, where it trips over pagefile.sys, which causes these messages in splunkd.log:

FilesystemChangeWatcher - error getting attributes of path "j:\pagefile.sys": The process cannot access the file
because it is being used by another process.

The wildcard is the cause of the CPU distress, but inputs.conf won't use a regex without a wildcard in the same segment, which defeats the whole regex in that case. There are simply too many unique j:\tomcat-inst* directories, so we don't want to hard-code them all.
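
For reference, the two shapes discussed above would look roughly like this in inputs.conf; neither has been verified here to fix the CPU problem:

# Option 1: drop the wildcard, monitor the drive, and filter with whitelist/blacklist regexes
# (the forwarder may still walk all of j:\, so this might not reduce the scanning)
[monitor://j:\]
whitelist = \\tomcat-inst[^\\]+\\logs\\
blacklist = pagefile\.sys

# Option 2: hard-code one stanza per instance (avoided above because there are too many)
[monitor://j:\tomcat-inst1a\logs]
[monitor://j:\tomcat-inst1b\logs]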

Any ideas?

0 Karma