Is the Splunk 6.3 universal forwarder using 90% of your CPUs?

lycollicott
Motivator

I'm not sure how long it has been happening, but I began to see it across our UFs today.

0 Karma
1 Solution

lycollicott
Motivator

Version 6.3.2 seems to have done the trick.

jgbricker
Contributor

Luckily, I was able to remove the wildcards from my stanzas, and that solved my problem. However, in your case that doesn't seem to be a good option. You are probably already aware of your choices here: http://docs.splunk.com/Documentation/Splunk/6.1/Data/Whitelistorblacklistspecificincomingdata
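
For reference, a minimal sketch of what that doc page describes, using a whitelist regex on a monitor stanza (the path and pattern below are invented for illustration, not an actual config from this thread):

[monitor://D:\apps\]
# hypothetical: only index files whose full path ends in .log
whitelist = \.log$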

0 Karma

lycollicott
Motivator

I removed the wildcards too, but a new problem arose where some UFs flapped CPU usage instead of pegging it.

Support logged bug SPL-108954 for the issues.

0 Karma

freni59
New Member

I'm seeing this as well on some of my UFs. I have 4 servers on which this has occurred 1-3 times during the last week. There is nothing special about these 4 boxes; they run the same version and the same config as the rest of the bunch (~105 Windows boxes). They're in the same Windows domain, but so are others; they are not on the same VMware host, and it doesn't occur simultaneously on all 4. What they do have in common is that they complain about the time in their splunkd.log, but only one system has logged any other time-related errors in its event logs.

0 Karma

lycollicott
Motivator

What was their complaint about time?

I eliminated the wildcards from all my 6.3 UFs and splunkd.exe stopped pegging every host at 100% CPU, but I still had some problems: splunkd.exe flapped CPU wildly on some hosts, but not others.

Support sent my original case off to development and they are working on it as a bug, but in the meantime I have downgraded.

0 Karma

freni59
New Member

Log example:
11-23-2015 13:03:15.529 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694994ms.
11-23-2015 13:03:15.607 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694963ms.
11-23-2015 13:03:15.623 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3695009ms.
11-23-2015 13:03:15.733 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694244ms.
11-23-2015 13:03:16.248 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3711931ms.
11-23-2015 13:03:18.154 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3703228ms.
11-23-2015 13:07:07.951 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 4293229ms.
11-23-2015 13:00:19.506 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3691155ms.
11-23-2015 13:00:19.506 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690904ms.
11-23-2015 13:00:19.756 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690920ms.
11-23-2015 13:00:19.772 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690920ms.
11-23-2015 13:00:20.553 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3675326ms.
11-23-2015 13:00:26.240 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3681920ms.
11-23-2015 13:05:36.024 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3091934ms.

As you can see above, I typically get 6-7 of the first type of log entry; there is nothing earlier in the log that is close in time and could result in a message of that kind. Then come typically 7 of the second type of entry, and then nothing until the Splunk forwarder service is restarted. Notice the time difference between entries 7 and 8: the log time actually jumped backwards. That said, the number of milliseconds doesn't necessarily add up to the shift backwards in time, as I have another box with roughly the same number of milliseconds in the message but the log entry actually jumping ~45 minutes back in time.

It's interesting that you mention the CPU flapping, because I have the impression that after the first "Restart-Service splunkforwarder" it is actually flapping all over the place, but that it calms down after a second Restart-Service.

Since this showed up in 6.3, I've gone the other way compared to you and upgraded to 6.3.1 yesterday, and I'm keeping a close eye on it today. The funny thing is that this showed up not directly after the upgrade to 6.3, but some weeks later.

(BTW, when you say you've eliminated the wildcards, what exactly do you mean by that?)

0 Karma

aosso
Path Finder

We are having this problem on some servers. It has actually happened on the same day and at the same time every week since we upgraded our UFs to 6.3.0, so we are currently trying to narrow down the cause.

In any case, this did not happen with 6.2.5... maybe we'll downgrade some UFs to 6.2.6 and see what happens.

lycollicott
Motivator

We saw the high CPU behavior almost all the time on 6.3, but as soon as we downgraded to 6.2.5 the CPU ran at 0-1%.

0 Karma

lycollicott
Motivator

I may wave the flag of surrender. I wanted to create a diag tarball for 6.2.5 and for 6.3.0, so after I made the first one I upgraded back to 6.3.0 on one node... the CPU did not and would not peg. I followed the same procedure on a second node and it did peg.

I must say that this lacks some consistency.

0 Karma

aosso
Path Finder

Yeah, it's totally random... I just did a downgrade to 6.2.6 (we went 6.2.5 > 6.3.0 > 6.2.6) to see what happens. Today the Splunk UF totally bogged down two database servers.

0 Karma

aosso
Path Finder

Well, the problem has not reproduced since downgrading the affected servers to 6.2.6. However, I have another Splunk instance with different servers running 6.3.0, and this problem does not happen there.

I'll keep installing 6.2.6 for now on new deployments.

0 Karma

DaClyde
Contributor

Have you tried using the blacklist feature in inputs.conf to specifically blacklist the pagefile.sys file?

0 Karma

lycollicott
Motivator

Yes, I did, but it made no difference.

0 Karma

DaClyde
Contributor

What regex did you use to try and blacklist the pagefile.sys?

0 Karma

lycollicott
Motivator

Actually it was just
blacklist = pagefile.sys

0 Karma

DaClyde
Contributor

It is expecting a regular expression, so try this (to escape the dot):

blacklist = pagefile\.sys
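
In the context of a full monitor stanza that would look roughly like this (the path below is just a placeholder, not your real input):

[monitor://C:\example\path\]
# hypothetical stanza; the backslash stops the dot from matching any character
blacklist = pagefile\.sys
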
0 Karma

lycollicott
Motivator

I did, but it made no difference. I also tried just pagefile, but to no avail.

0 Karma

lycollicott
Motivator

More info....

This has been happening semi-consistently for 3 weeks since the exact date of the 6.3 UF upgrades.
We have monitors in place like this (which did not cause high CPU in 6.2):

[monitor://j:\tomcat-inst*\logs\]

because we have many Tomcat directories like these:

j:\tomcat-inst1a
j:\tomcat-inst1b
j:\tomcat-inst2a
j:\tomcat-inst2b

Apparently that is making Splunk read all of j:, where it stumbles over pagefile.sys, which causes entries like these in splunkd.log:

FilesystemChangeWatcher - error getting attributes of path "j:\pagefile.sys": The process cannot access the file because it is being used by another process.

The wildcard is the cause of the CPU distress, but inputs.conf won't use a regex without a wildcard in the same segment, which defeats the whole point of the regex in that case. There are simply too many unique j:\tomcat-inst* directories, so we don't want to hard-code them all.
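
For comparison, the hard-coded alternative would be one stanza per instance directory, e.g.:

[monitor://j:\tomcat-inst1a\logs\]
[monitor://j:\tomcat-inst1b\logs\]
[monitor://j:\tomcat-inst2a\logs\]
[monitor://j:\tomcat-inst2b\logs\]

and in reality there are far more than the four directories shown here, so that doesn't scale for us.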

Any ideas?

0 Karma