I'm not sure how long it has been happening, but I began to see it across our UFs today.
Version 6.3.2 seems to have done the trick.
Luckily, I was able to remove the wildcards from my stanzas, which solved my problem. However, in your case that doesn't seem to be a good solution. You are probably already aware of your options here: http://docs.splunk.com/Documentation/Splunk/6.1/Data/Whitelistorblacklistspecificincomingdata
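Roughly, those options boil down to whitelist/blacklist regexes on a monitor stanza, along these lines (just a sketch; the path and patterns are placeholders, not taken from your environment):
[monitor://c:\example\logs]
# placeholder example: keep .log files, skip backup/temp files
whitelist = \.log$
blacklist = \.(bak|tmp)$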
I removed the wildcards too, but a new problem arose where some UFs flapped cpu instead of pegging it.
Support logged bug SPL-108954 for the issues.
So I'm seeing this as well on some of my UFs. I have 4 servers on which this has occurred 1-3 times during the last week. There is nothing special about these 4 boxes: they run the same version and the same config as the rest of the bunch (~105 Windows boxes). They're in the same Windows domain, but so are others; they are not on the same VMware host, and it doesn't occur simultaneously on all 4. What is common, though, is that they complain about the time in their splunkd.log, but only one system has logged any other time-related errors in its event logs.
What was their complaint about time?
I eliminated the wildcards from all my 6.3 UFs and splunkd.exe stopped pegging every host at 100% cpu, but I still had some problems. splunkd.exe flapped cpu wildly on some hosts, but not others.
Support sent my original case off to development and they are working on it as a bug, but in the meantime I have downgraded.
Log example:
11-23-2015 13:03:15.529 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694994ms.
11-23-2015 13:03:15.607 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694963ms.
11-23-2015 13:03:15.623 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3695009ms.
11-23-2015 13:03:15.733 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3694244ms.
11-23-2015 13:03:16.248 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3711931ms.
11-23-2015 13:03:18.154 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 3703228ms.
11-23-2015 13:07:07.951 +0100 WARN TimeoutHeap - Either time adjusted forwards by, or event loop was descheduled for 4293229ms.
11-23-2015 13:00:19.506 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3691155ms.
11-23-2015 13:00:19.506 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690904ms.
11-23-2015 13:00:19.756 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690920ms.
11-23-2015 13:00:19.772 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3690920ms.
11-23-2015 13:00:20.553 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3675326ms.
11-23-2015 13:00:26.240 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3681920ms.
11-23-2015 13:05:36.024 +0100 WARN TimeoutHeap - Detected system time adjusted backwards by 3091934ms.
As you can see above, I typically get 6-7 of the first type of log entry; there is nothing in the log before them that is close in time and could result in a message of that kind. Then come typically 7 of the second type of entry, and after that nothing until the Splunk forwarder service is restarted. Notice the time difference between entries 7 and 8: the log time actually jumped backwards. The number of milliseconds doesn't necessarily add up to the backwards shift in time, though; I have another box with roughly the same number of milliseconds in the message, but its log entries jump ~45 minutes back in time.
It's interesting that you mention the CPU flapping, because I have the impression that after the first "Restart-Service splunkforwarder" it actually flaps all over the place, but calms down after issuing a second restart of the service.
Since this showed up in 6.3, I've gone the other way compared to you: I upgraded to 6.3.1 yesterday and am keeping a close eye on it today. The funny thing is that this didn't show up directly after the upgrade to 6.3, but some weeks later.
(BTW, when you say you've eliminated the wildcards, what exactly do you mean by that?)
We are having this problem on some servers. It has actually happened on the same day and at the same time every week since we upgraded our UFs to 6.3.0, so we are currently trying to narrow down the cause.
In any case, this did not happen with 6.2.5... maybe we'll downgrade some UFs to 6.2.6 and see what happens.
We saw the high CPU behavior almost all the time on 6.3, but as soon as we downgraded to 6.2.5 the CPU ran at 0-1%.
I may wave the flag of surrender. I wanted to create a diag tarball for 6.2.5 and for 6.3.0, so after I made the first one I upgraded one node back to 6.3.0, and its CPU did not and would not peg. I did the same procedure on a second node and it did peg.
I must say that this lacks some consistency.
Yeah, it's totally random... I just did a downgrade to 6.2.6 (we went 6.2.5>6.3.0>6.2.6) to see what happens. Today splunk uf totally bogged down two database servers.
Well, the problem did not reproduce again after downgrading the affected servers to 6.2.6. However, I have another Splunk instance with different servers with 6.3.0 installed and this problem does not happen there.
I'll keep installing 6.2.6 for now on new deployments.
Have you tried using the Blacklist feature in the inputs.conf to specifically blacklist the pagefile.sys file?
Yes, I did, but it made no difference.
What regex did you use to try and blacklist the pagefile.sys?
Actually it was just
blacklist = pagefile.sys
It is expecting a regular expression, so try this (to escape the dot):
blacklist = pagefile\.sys
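For placement, that blacklist line would sit inside the monitor stanza it should apply to, something like this (the monitor path here is purely illustrative):
[monitor://c:\]
blacklist = pagefile\.sys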
I did, but it made no difference. I also tried just pagefile, but to no avail.
More info....
This has been happening semi-consistently for 3 weeks since the exact date of the 6.3 UF upgrades.
We have monitors in place like this (which did not cause high CPU in 6.2):
[monitor://j:\tomcat-inst*\logs\]
because we have many Tomcat directories like these:
j:\tomcat-inst1a
j:\tomcat-inst1b
j:\tomcat-inst2a
j:\tomcat-inst2b
Apparently that makes Splunk read all of j:, where it stumbles over pagefile.sys, which causes these entries in splunkd.log:
FilesystemChangeWatcher - error getting attributes of path "j:\pagefile.sys": The process cannot access the file because it is being used by another process.
The wildcard is the cause of the CPU distress, but inputs.conf won't use a regex without a wildcard in the same path segment, which defeats the whole point of the regex in this case. There are simply too many unique j:\tomcat-inst* directories, so we don't want to hard-code them all.
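The only alternative I can think of is an untested sketch: monitor the drive root and push the pattern into an explicit whitelist regex instead of the path wildcard. I don't know whether that behaves any differently from the implicit whitelist the wildcard path creates, though:
[monitor://j:\]
# untested sketch - the regexes are my guesses at matching \tomcat-inst<anything>\logs\ and skipping pagefile.sys
whitelist = \\tomcat-inst[^\\]+\\logs\\
blacklist = pagefile\.sys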
Any ideas?