Universal Forwarder High CPU after Leap Second Correction

gcoles
Communicator

This is an FYI for anyone else who may be seeing abnormally high CPU utilization on their universal forwarders over the past few days. A leap second was inserted on June 30, 2012 at 23:59:60 UTC. On *nix systems, the NTP daemon detects that a leap second is due and adjusts the system clock by adding an extra second to the last minute of the day. This caused problems all over the world, and it turned out to cause issues for us as well.

On our servers, the leap-second change by ntpd resulted in a constant stream of timer interrupts that we traced back to our Splunk universal forwarders (both 4.3.1 and 4.3.3). Since the leap second occurred, top shows splunkd at around 100-120% CPU usage (one full core's worth of CPU or more), whereas it was typically at 1-3% on these servers before the leap second. Restarting and even reinstalling the forwarder makes no difference. We had tried to prevent our systems from getting the leap-second adjustment by shutting down NTP in advance, but we missed a few systems, and these are the only ones that have this issue.
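
To check whether a forwarder is in this state, a minimal sketch along these lines may help. It assumes the procps version of ps (for the -C option); high_cpu is a hypothetical helper name and the threshold of 90 is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and print
# only those whose %CPU value is at or above the given threshold.
high_cpu() {
    awk -v limit="$1" '$2 + 0 >= limit { print }'
}

# List splunkd processes at or above 90% CPU.
# Note: ps reports CPU averaged over the process lifetime, so the figure
# can differ from the instantaneous value that top displays.
ps -C splunkd -o pid=,pcpu=,comm= | high_cpu 90
```

No output means no splunkd process crossed the threshold.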

The following commands instantly fixed the CPU usage by the universal forwarder:

/etc/init.d/ntp stop
(date +"%H:%M:%S" |perl -pe 'chomp';echo `date +"%N"` / 999999999|bc -l) | sudo perl -ne 'chomp;system ("date","-s",$_);'
/etc/init.d/ntp start

Note that a simpler date command has been referenced in other articles on the web, but we found it to be less accurate than the above, which preserves fractional seconds.

1 Solution

gcoles
Communicator

As hexx recommended, posting this as an answer:

The following commands instantly fixed the CPU usage by the universal forwarder:

/etc/init.d/ntp stop
(date +"%H:%M:%S" |perl -pe 'chomp';echo `date +"%N"` / 999999999|bc -l) | sudo perl -ne 'chomp;system ("date","-s",$_);'
/etc/init.d/ntp start

Note that a simpler date command has been referenced in other articles on the web, but we found it to be less accurate than the above, which preserves fractional seconds.
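
For readers who find the one-liner hard to follow, here is a rough sketch of the same idea: stop NTP, set the clock to its own current value (including the fractional part), and start NTP again. It assumes GNU date (for %N) and the same SysV-style ntp init script as above. current_timestamp, reset_clock, and the DRY_RUN variable are hypothetical names, not part of any tool.

```shell
#!/bin/sh
# Print the current time as HH:MM:SS.NNNNNNNNN (GNU date assumed for %N).
current_timestamp() {
    date +"%H:%M:%S.%N"
}

# Stop NTP, set the clock to "now", start NTP again.
# With DRY_RUN=1 (the default here) the commands are only printed.
# With DRY_RUN=0 they are executed and must be run as root.
reset_clock() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "/etc/init.d/ntp stop"
        echo "date -s \"$(current_timestamp)\""
        echo "/etc/init.d/ntp start"
    else
        /etc/init.d/ntp stop
        date -s "$(current_timestamp)"
        /etc/init.d/ntp start
    fi
}

reset_clock
```

The dry-run default is a deliberate choice so that the sketch can be read and tried without touching the system clock.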

nemski
Explorer

A little late to the party, but for anyone wondering how to run this without bc (which isn't installed by default on Ubuntu Lucid), you can use the following:

(date +"%H:%M:%S" |perl -pe 'chomp';echo ".$(expr $(date +"%N")000000000 / 999999999)") | sudo perl -ne 'chomp;system ("date","-s",$_);'
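
One detail to watch with integer arithmetic: expr prints its result without leading zeros, so a nanosecond value below 100000000 would yield a fraction with too few digits. A minimal sketch of zero-padding the value, assuming GNU date for %N (nanos_to_fraction is a hypothetical helper name):

```shell
#!/bin/sh
# Hypothetical helper: turn an integer nanosecond count (0..999999999)
# into a zero-padded fractional-second suffix such as ".012345678".
nanos_to_fraction() {
    printf '.%09d\n' "$1"
}

# 12345678 ns is 0.012345678 s, so the padded form keeps the leading zero.
nanos_to_fraction 12345678    # prints .012345678

# Build a full timestamp without bc. Leading zeros are stripped first so
# that printf does not interpret the value as octal.
nanos=$(date +"%N" | sed 's/^0*//')
echo "$(date +"%H:%M:%S")$(nanos_to_fraction "${nanos:-0}")"
```

The resulting string has the same HH:MM:SS.fraction shape that the commands above pass to date -s.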
0 Karma

uuppuluri_splun
Splunk Employee

The OS kernel bug in 2012 caused a "livelock" in the "futex" code (the code responsible for handling user-land mutexes in multithreaded programs). The most commonly affected software on servers seems to have been Java processes. However, like Java, Splunk is heavily multithreaded, so it can fall victim to the same bug.

So the leap-second issue was never with Splunk, but rather with *nix distributions, most of which are likely no longer in use. From Splunk's perspective, there isn't anything in our code to fix, and we don't expect issues. Even in 2012 it wasn't our bug; we just worked around it. We recommend working with your OS vendors on upgrading them as needed. A few possibly useful links: https://bugzilla.redhat.com/show_bug.cgi?id=836803 and https://lkml.org/lkml/2012/7/1/176. Please also search for information specific to your OS, for example https://blogs.oracle.com/java-platform-group/entry/the_2015_leap_second_s, https://access.redhat.com/articles/15145, ...
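
As a rough aid when talking to an OS vendor, the following sketch checks whether the running Linux kernel logged a leap-second insertion and prints the kernel version. It assumes the kernel's "Clock: inserting leap second" message is still in the dmesg ring buffer and that the current user is allowed to read it; leap_second_logged is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel log text on stdin records
# a leap-second insertion.
leap_second_logged() {
    grep -q 'Clock: inserting leap second'
}

# The ring buffer may have rotated or be unreadable, so a negative result
# does not prove that no leap second was applied.
if dmesg 2>/dev/null | leap_second_logged; then
    echo "leap second insertion found in kernel log; running kernel: $(uname -r)"
else
    echo "no leap second insertion found in the current kernel log"
fi
```

Whether a given kernel version contains the fix is something only the vendor's advisory can confirm.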

0 Karma

kscher
Path Finder

Thanks for filling in some gaps for us here, and confirming Splunk as victim, not perp.

Our engineering folks have been rolling out Red Hat's patch(es) over the last few weeks, so we expect that patched machines will have no issues. Inevitably, there will be some stragglers, so we're thinking we may just deploy a dummy package to all forwarders that calls for restartSplunkd.
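
A sketch of what such a deployment could look like, assuming a deployment server with a serverclass.conf: the helper below prints a fragment that maps a placeholder app to every client and sets restartSplunkd. The class name leapsecond_restart and the app name dummy_restart_app are made up for illustration, and serverclass_fragment is a hypothetical helper. Check the serverclass.conf documentation for your Splunk version before using anything like this.

```shell
#!/bin/sh
# Hypothetical helper: print a serverclass.conf fragment that assigns an
# app to all deployment clients and restarts splunkd when it is installed.
serverclass_fragment() {
    class="$1"
    app="$2"
    cat <<EOF
[serverClass:$class]
whitelist.0 = *

[serverClass:$class:app:$app]
restartSplunkd = true
EOF
}

# Class and app names are placeholders, not real packages.
serverclass_fragment leapsecond_restart dummy_restart_app
```

Printing the fragment rather than writing it in place leaves the decision of where to merge it to the administrator.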

0 Karma

kscher
Path Finder

We have another leap-second coming this year: June 30, 2015 23:59:60 UTC.

I'd like to know whether we should expect the same behavior from forwarders (or any other Splunk component), or whether anyone is aware of anything that has changed in the last 3 years, such as Splunk building in a fix or workaround for whatever uncaught logic issue causes the high-CPU condition.

0 Karma

gcoles
Communicator

I have no AIX systems to test with, hope you found a fix, jnhth!

0 Karma

jnhth
Explorer

Seems to work on Linux, but what about AIX?

0 Karma

hexx
Splunk Employee

You should really split the solution and post it as an answer.

0 Karma

akom
New Member

Really glad I found this; it fixed the issue. I wasted some time reinstalling Splunk as well.

0 Karma

yannK
Splunk Employee

Great one, thanks.

0 Karma