This is an FYI for anyone else who may have seen abnormally high CPU utilization on their universal forwarders over the past few days. The system time was adjusted for the "leap second" that occurred on June 30, 2012 at 23:59:60 UTC. On *nix systems, the NTP daemon detects that a leap second is due and adjusts the system clock by adding an extra second to the minute. This caused problems all over the world, and it turned out to cause issues for us as well.
On our servers, the leap-second change applied by ntpd resulted in a constant stream of timer interrupts that we traced back to our Splunk universal forwarders (both 4.3.1 and 4.3.3). Since the leap second occurred, top shows splunkd at around 100-120% CPU usage (one full core's worth of CPU or more), whereas it was typically at 1-3% on these servers before the leap second. Restarting and even reinstalling the forwarder makes no difference. We had tried to prevent our systems from getting the leap-second adjustment by shutting down NTP in advance, but we missed a few systems, and those are the only ones that have this issue.
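A rough sketch of how one might spot affected hosts from a script rather than by eyeballing top. The function names (`filter_high_cpu`, `high_cpu_splunkd`) and the default threshold of 90 are illustrative assumptions, not part of the original report. Note that `ps` reports CPU averaged over the life of the process, so its figures will not match the instantaneous numbers in top.

```shell
#!/bin/sh
# Sketch only: function names and the default threshold are illustrative.

# Read "PID %CPU COMMAND" lines on stdin and print "PID %CPU" for every
# splunkd process whose CPU figure exceeds the threshold (default 90).
filter_high_cpu() {
    awk -v t="${1:-90}" '$3 == "splunkd" && ($2 + 0) > (t + 0) { print $1, $2 }'
}

# Feed the filter from ps. ps reports CPU averaged over the process
# lifetime, so the numbers will not match top exactly.
high_cpu_splunkd() {
    ps -eo pid=,pcpu=,comm= | filter_high_cpu "$1"
}
```

Running `high_cpu_splunkd 90` on a host prints one line per splunkd process above 90% CPU, and nothing if the forwarder is behaving normally.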
The following commands instantly fixed the CPU usage by the universal forwarder:
/etc/init.d/ntp stop
(date +"%H:%M:%S" |perl -pe 'chomp';echo `date +"%N"` / 999999999|bc -l) | sudo perl -ne 'chomp;system ("date","-s",$_);'
/etc/init.d/ntp start
Note that a simpler date command has been referenced in other articles on the web, but we found it to be less accurate than the above, which includes fractional seconds.
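For readers who find the one-liner hard to parse, the same idea (stop NTP, set the clock to the current time including a sub-second fraction, start NTP again) can be sketched as a small script. This is a simplified illustration, not the exact command from the post: the function names `current_time_with_fraction` and `reset_clock` are made up here, and it assumes GNU `date` (for `%N`) and the `/etc/init.d/ntp` init script used above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; assumes GNU date (%N support).

# Print the current time as HH:MM:SS.NNNNNNNNN (nanosecond fraction).
current_time_with_fraction() {
    date +"%H:%M:%S.%N"
}

# Stop NTP, set the clock to "now", then start NTP again.
# Must be run as root; it is defined here but never called automatically.
reset_clock() {
    /etc/init.d/ntp stop
    date -s "$(current_time_with_fraction)"
    /etc/init.d/ntp start
}
```

Calling `reset_clock` as root would perform the three steps in order; `current_time_with_fraction` on its own is safe to run and only prints the string that would be handed to `date -s`.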
As hexx recommended, posting this as an answer:
The following commands instantly fixed the CPU usage by the universal forwarder:
/etc/init.d/ntp stop
(date +"%H:%M:%S" |perl -pe 'chomp';echo `date +"%N"` / 999999999|bc -l) | sudo perl -ne 'chomp;system ("date","-s",$_);'
/etc/init.d/ntp start
Note that a simpler date command has been referenced in other articles on the web, but we found it to be less accurate than the above, which includes fractional seconds.
A little late to the party, but for anyone wondering how to run this without bc (which isn't installed by default on Ubuntu Lucid), you can use the following instead:
(date +"%H:%M:%S" |perl -pe 'chomp';echo ".$(expr $(date +"%N")000000000 / 999999999)") | sudo perl -ne 'chomp;system ("date","-s",$_);'
The OS kernel bug in 2012 caused a "livelock" in the "futex" code (the code responsible for handling user-land mutexes in multithreaded programs). The most commonly affected processes on servers seem to have been Java processes. However, like Java, Splunk is heavily multithreaded, so it can fall victim to the bug as well.
So the leap-second issue was never with Splunk, but rather with *nix distributions, most of which are likely no longer in use. From Splunk's perspective, there isn't anything in our code to fix, and we don't expect issues. Even in 2012 it wasn't our bug; we just worked around it. We recommend working with your OS vendors on upgrading them as needed. A few possibly useful links: https://bugzilla.redhat.com/show_bug.cgi?id=836803 , https://lkml.org/lkml/2012/7/1/176 . Please google for your OS context, for example https://blogs.oracle.com/java-platform-group/entry/the_2015_leap_second_s , https://access.redhat.com/articles/15145 , ...
Thanks for filling in some gaps for us here, and confirming Splunk as victim, not perp.
Our engineering folks have been rolling out Red Hat's patch(es) over the last few weeks, so we'll expect that patched machines will have no issues. Inevitably, there will be some stragglers, so we're thinking we may just deploy some dummy package to all forwarders that will call for restartSplunkd.
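As a rough illustration of the dummy-package idea, a deployment server entry might look something like the fragment below. The server class and app names (`leapsecond_restart`, `leapsecond_dummy`) are hypothetical, and the whitelist is an assumption about targeting every forwarder. `whitelist.0` and `restartSplunkd` are serverclass.conf settings, but check the spec file for your Splunk version before relying on this.

```
# serverclass.conf on the deployment server (names are hypothetical)
[serverClass:leapsecond_restart]
# Match all deployment clients
whitelist.0 = *

[serverClass:leapsecond_restart:app:leapsecond_dummy]
# Ask each client to restart splunkd after the app is installed
restartSplunkd = true
```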
We have another leap-second coming this year: June 30, 2015 23:59:60 UTC.
I'd like to know whether we should expect the same behavior from forwarders (or any other Splunk component), or whether anyone is aware of things that have changed in the last 3 years, such as Splunk building in a fix or workaround for whatever presumed logic issue causes the high-CPU condition.
I have no AIX systems to test with, hope you found a fix, jnhth!
Seems to work on Linux, but what about AIX?
You should really split the solution and post it as an answer.
Really glad I found this; it fixed the issue. I wasted some time reinstalling Splunk as well.
Great one, thanks.