Solved: Re: Universal Forwarder High CPU after Leap Second Correction

gcoles · 07-03-2012

This is an FYI for anyone else who may be seeing abnormally high CPU utilization on their universal forwarders over the past few days. A "leap second" correction to system time occurred on June 30, 2012 at 23:59:60 UTC. On *nix systems, the NTP daemons detect that a leap second is due and adjust the system clock by adding an extra second to the minute. This caused problems all over the world, and it turned out to cause issues for us as well.

On our servers, the leap-second change by ntpd resulted in a constant stream of timer interrupts that we traced back to our Splunk universal forwarders (both 4.3.1 and 4.3.3). Since the leap second occurred, top shows splunkd at around 100-120% CPU usage (one full core's worth of CPU or more), whereas it typically sat at 1-3% on these servers before the leap second. Restarting and even reinstalling made no difference. We had tried to keep our systems from getting the leap-second adjustment by shutting down NTP in advance, but we missed a few systems, and those are the only ones that have this issue.

The following commands instantly fixed the CPU usage by the universal forwarder:

/etc/init.d/ntp stop
(date +"%H:%M:%S" |perl -pe 'chomp';echo `date +"%N"` / 999999999|bc -l) | sudo perl -ne 'chomp;system ("date","-s",$_);'
/etc/init.d/ntp start

Note that a simpler date command has been referenced in other articles on the web, but we found it to be less accurate than the above (which includes fractional seconds).
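
For anyone who would rather not chain perl and bc, here is a rough sketch of the same stop / set / start sequence. It assumes GNU date (for %N, and for fractional seconds in date -s) and the same /etc/init.d/ntp script as above. The names current_time_string, reset_clock and DRY_RUN are made up for illustration, and this has not been tested against the original leap-second condition.

```shell
#!/bin/sh
# Sketch only. Assumes GNU date and a SysV-style /etc/init.d/ntp script.
# current_time_string, reset_clock and DRY_RUN are illustrative names.

# Print the current time as HH:MM:SS.NNNNNNNNN from a single date call,
# so the seconds and the fraction come from the same clock reading.
current_time_string() {
    date +"%H:%M:%S.%N"
}

# Stop NTP, set the clock to the time it already shows, start NTP again.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
reset_clock() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: /etc/init.d/ntp stop"
        echo "would run: date -s $(current_time_string)"
        echo "would run: /etc/init.d/ntp start"
    else
        /etc/init.d/ntp stop
        date -s "$(current_time_string)"
        /etc/init.d/ntp start
    fi
}
```

Calling reset_clock on its own only prints the three commands. Running it as root with DRY_RUN=0 actually stops NTP, sets the clock and starts NTP again.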

gcoles · 11-02-2012

As hexx recommended, posting this as an answer:

The following commands instantly fixed the CPU usage by the universal forwarder:

/etc/init.d/ntp stop
(date +"%H:%M:%S" |perl -pe 'chomp';echo `date +"%N"` / 999999999|bc -l) | sudo perl -ne 'chomp;system ("date","-s",$_);'
/etc/init.d/ntp start

Note that a simpler date command has been referenced in other articles on the web, but we found it to be less accurate than the above (which includes fractional seconds).

nemski · 08-17-2015

A little late to the party, but for anyone wondering how to run this without bc (which isn't installed by default on Ubuntu Lucid), you can use the following:

(date +"%H:%M:%S" |perl -pe 'chomp';echo ".$(expr $(date +"%N")000000000 / 999999999)") | sudo perl -ne 'chomp;system ("date","-s",$_);'
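
One caveat with the expr version, as far as I can tell: expr prints a plain integer, so any leading zeros in the %N value are lost (012345678 comes out as 12345678) and the fraction passed to date ends up too large. Below is a minimal sketch of a zero-padded alternative that needs only sed and printf. The function name nanos_to_fraction is made up for illustration.

```shell
# Sketch only; nanos_to_fraction is an illustrative name.
# Turn a nanosecond count, such as the output of `date +%N`, into a
# zero-padded ".NNNNNNNNN" fraction.
nanos_to_fraction() {
    # Strip leading zeros so printf cannot read the value as octal.
    n=$(printf '%s' "$1" | sed 's/^0*//')
    # An all-zero input strips to an empty string, so fall back to 0,
    # then pad back out to nine digits.
    printf '.%09d\n' "${n:-0}"
}

# Example of building the full string for date -s (GNU date assumed):
# echo "$(date +%H:%M:%S)$(nanos_to_fraction "$(date +%N)")"
```

The padding is the whole point here: nine digits after the dot regardless of how small the nanosecond value is.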

uuppuluri_splun · 06-30-2015

The OS kernel bug in 2012 caused a "livelock" in the "futex" code (the code responsible for handling user-land mutexes in multithreaded programs). The most commonly affected processes on servers seem to have been Java processes. However, like Java, Splunk is heavily multithreaded, so it can be victimized as well.

So the issue with the leap second was never with Splunk, but rather with *nix distributions, most of which are likely no longer in use. From the perspective of Splunk, there isn't anything in our code to fix and we don't expect issues. Even in 2012, it wasn't our bug; we just worked around it. We recommend working with your OS vendors on upgrading them as needed. A few possibly useful links: https://bugzilla.redhat.com/show_bug.cgi?id=836803 and https://lkml.org/lkml/2012/7/1/176. Please search for your OS context: https://blogs.oracle.com/java-platform-group/entry/the_2015_leap_second_s, https://access.redhat.com/articles/15145, ...

kscher · 07-01-2015

Thanks for filling in some gaps for us here, and confirming Splunk as victim, not perp.

Our engineering folks have been rolling out Red Hat's patch(es) over the last few weeks, so we'll expect that patched machines will have no issues. Inevitably, there will be some stragglers, so we're thinking we may just deploy some dummy package to all forwarders that will call for restartSplunkd.

kscher · 04-24-2015

We have another leap-second coming this year: June 30, 2015 23:59:60 UTC.

I'd like to know whether we should expect the same behavior from forwarders (or any other Splunk component), or whether anyone is aware of things that have changed in the last 3 years, such as Splunk building in a fix or workaround for whatever presumed uncaught logic issue causes the high-CPU condition.

uuppuluri_splun · 06-11-2015

The OS kernel bug in 2012 caused a "livelock" in the "futex" code (the code responsible for handling user-land mutexes in multithreaded programs). The most commonly affected processes on servers seem to have been Java processes. However, like Java, Splunk is heavily multithreaded, so it can be victimized as well.

So the issue with the leap second was never with Splunk, but rather with *nix distributions, most of which are likely no longer in use. From the perspective of Splunk, there isn't anything in our code to fix and we don't expect issues. Even in 2012, it wasn't our bug; we just worked around it. We recommend working with your OS vendors on upgrading them as needed. A couple of possibly useful links: https://bugzilla.redhat.com/show_bug.cgi?id=836803 and https://lkml.org/lkml/2012/7/1/176. Please search for your OS context.

gcoles · 11-01-2012

I have no AIX systems to test with, hope you found a fix, jnhth!

jnhth · 08-17-2012

Seems to work on Linux, but what about AIX?

hexx · 08-08-2012

You should really split the solution and post it as an answer.

akom · 07-05-2012

Really glad I found this, fixed the issue. Wasted some time on splunk reinstallation as well.

yannK · 07-03-2012

Great one, thanks.
