"Events may not be returned in sub-second order due to search memory limits configured in limits.conf:[search]:max_rawsize_perchunk." error after upgrade to Splunk 6.0.1

Nicolo_Figiani
Path Finder

Hi everyone,

I successfully upgraded all my Splunk instances (a distributed non-pooled infrastructure made of 2 indexers and 1 search head) from Splunk 5.0.2 to Splunk 6.0.1.

However, when running some searches from the search head, the following error pops up:

Events may not be returned in sub-second order due to search memory limits configured in limits.conf:[search]:max_rawsize_perchunk. See search.log for more information.

Maybe this issue is already addressed as a "known issue" (SPL-74818):

A search returning lots of large events with multikv applied can crash the indexer's splunkd process. 

I've read that you can work around this issue by getting splunkd to search smaller chunks of data: reduce the value of max_rawsize_perchunk on all indexers and search heads by editing the following stanza in limits.conf:

[search]
max_rawsize_perchunk = $smaller_value_than_previous$
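
For illustration only (I believe the shipped default is 100000000 bytes, roughly 100 MB, so any value smaller than whatever is currently in place counts; the number below is purely hypothetical), a filled-in stanza might look like:

[search]
# illustrative value only - roughly half the default; tune for your data
max_rawsize_perchunk = 50000000

splunkd typically needs a restart on each instance before limits.conf changes take effect.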

I've edited limits.conf as explained but I keep getting this issue.

Has anyone faced the same problem? How can I fix this once and for all?

Thanks a lot,
cheers.

1 Solution

sideview
SplunkTrust

What it means is that while your events are coming back, in cases where a set of events all happen within a single second, Splunk may not be returning them in the correct order.

Splunk stores events at second granularity, but in such a way that it does actually preserve the subsecond ordering (I think to milliseconds by default, but I'm not sure). However, if you have X events living in a single second, then to do the magic of always returning them in subsecond order, the way it is implemented means Splunk has to load all of those events into memory at once. If X is very large and the events are also very large, as in big multikv events, this can create some nasty memory usage problems in the search process.

So what it's saying is that while events are still coming back in the right seconds, if you look closely you might spot that some events within a single second are slightly out of order from reality.

This also comes up, and has a very good explanation by hexx, over in this answer: http://answers.splunk.com/answers/90576/what-does-events-may-not-be-returned-in-sub-second-order-due...

You can manually investigate whether you have large numbers of events being indexed into single seconds.

On any given timerange,

<your search terms> | stats count by _time | sort 0 - count | head 100

will tell you the 100 seconds that have the most indexed events in them. Or, if you want to see a table with both event counts and raw bytes/KB, here's a quick and dirty way:

<your search terms> | eval bytes=len(_raw) | eval KB=bytes/1024 | stats count sum(bytes) as bytes sum(KB) as KB by _time | sort 0 - count | head 100

It's always possible that some timestamping is just misconfigured, which sometimes dumps lots of events into single seconds, and that more reliable timestamp extraction will make the problem go away.
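
If timestamping does turn out to be the culprit, the usual fix is tightening timestamp extraction in props.conf on the indexers (or heavy forwarders). A minimal sketch, where the sourcetype name and the format string are placeholders you'd swap for whatever your data actually looks like:

# hypothetical sourcetype - substitute your own, and adjust the format to match your data
[my_sourcetype]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 30

With an explicit TIME_FORMAT that includes the subsecond part, events stop piling up on whole-second boundaries, assuming the raw data actually carries subsecond precision.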


ShaunBaker
Path Finder

Would this error the OP mentioned cause actual events to not show in the search head? I have this error popping up with the Tenable add-on and app, and don't have as many logs/events coming into Splunk as can be seen in SecCenter. It's like some of the data gets ingested and some does not. Or, in this case, it is ingested and in the DB, but the search head cannot list it all.


sideview
SplunkTrust

With more recent versions of Splunk, I would use a search like this to see how much of a problem this is with your data, and where in time the problem might be occurring.

| tstats prestats=t count WHERE index=cisco_cdr by _time source host span=1s | stats count by _time source host | sort - count

It'll tell you at what times and in which sources you're getting lots of events in single seconds. Any values of 10,000 or 10,001 are where you're hitting the limit and bad / not-great things are happening.

It's a little weird to use tstats with prestats=t and then an explicit stats on top of it. Why I'm doing that is a) you can get some wonky time-bucketing behavior out of tstats, and b) prestats=t plus an explicit stats command has often (for me at least) seemed to make some other tstats wonkiness go away in edge cases. Since 10K events in a single second is implicitly an edge case, a + b = this search is just being careful.
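
If you just want the problem spots, here's a variant of the same search (the index name is a placeholder for your own, and the 10000 threshold is the chunk limit mentioned above) that keeps only the seconds at or over the limit:

| tstats prestats=t count WHERE index=<your_index> by _time source host span=1s | stats count by _time source host | where count >= 10000 | sort - count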
