Sometimes specific events are missing from an accelerated data model. These events can be found with a regular SPL search.
When searching the data model, the events are only returned when the data model is not accelerated. Once the acceleration is enabled, the events do not show up any more. Of course, we checked that the DMA's status is "100% completed". Rebuilding the accelerated data model does not help.
What is really strange: events sometimes disappear from the accelerated data model even though they showed up just fine mere minutes earlier. I noticed this today when I refreshed a dashboard and some charts were suddenly empty (which led me to investigate this again and post this question).
Here is an example of a search that is not returning all expected events (we have noticed this issue with different datasets in the past, too):
| pivot uberAgent Process_NetworkTargetPerformance
count(Process_NetworkTargetPerformance) as "Event count"
splitrow AppName
filter host is "Client11"
We have seen this issue on various versions of Splunk Enterprise 7.1.x and 7.2.x (including 7.2.1).
Screenshots of the above search, first with acceleration disabled (9 results), then with acceleration enabled (6 results):
I ran a search to query scheduler.log over a time range of 8 hours, which covered the time of the issue:
index=_internal source=*scheduler.log
| stats count by log_level
Result:
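Beyond the breakdown by log_level, the acceleration jobs themselves can be checked in the same log. Assuming the summarization searches follow the usual _ACCELERATE_ naming, something like this shows how they finished:
index=_internal source=*scheduler.log savedsearch_name="_ACCELERATE_*"
| stats count by savedsearch_name status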
Dominik's answer fixed the problem for Splunk versions 7.x. However, it is back in Splunk 8 (we tested 8.0 and 8.0.1).
We just recently chased this very issue, though our root cause may be different from yours; I'm not certain.
The issue manifested itself in some of our dashboards, which are powered by accelerated datamodels. Many began to run very slowly, while others (backed by different datamodels) were fine.
During RCA, we found that the event count for a given datamodel differed depending on whether acceleration was used. This is easy to check by toggling summariesonly=t / f (see the sketch below). We also found gaps in the acceleration, even though the UI was showing 100% accelerated.
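As a rough sketch of that check (the datamodel name below is just a placeholder), run these two searches over the same time range and compare the counts; if the first number is lower, the summaries are missing events:
| tstats summariesonly=t count from datamodel=Your_Datamodel
| tstats summariesonly=f count from datamodel=Your_Datamodel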
In any case, check your internal logs for PCRE_ERROR_MATCHLIMIT entries:
index=_internal source=*splunkd.log* PCRE_ERROR_MATCHLIMIT | cluster showcount=t | table cluster_count _raw | sort -cluster_count
What we found was that in certain periods of the accelerated date/time range, the forwarder had encountered corrupted events. Though the data was still indexed, one or more datamodel acceleration jobs/searches would fail because a regex exceeded the PCRE match limit (PCRE_ERROR_MATCHLIMIT; the default limit is 100k). Later jobs would pick up at a later point in time and succeed, which caused the gap(s) in our acceleration.
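If cleaning up the offending data is not an option, the search-time PCRE match limit can in principle be raised in limits.conf. We did not go that route, it trades CPU for fewer MATCHLIMIT aborts, and the stanza below reflects the 7.x spec as I understand it, so verify it against your version:
[rex]
match_limit = 200000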
So, when a user would open a dashboard and pick a date/time range that included the "gap", the query would drop back to searching non-accelerated data, causing the slowness.
In our case, a single corrupted event was the cause. It was full of hex/binary data. We narrowed this down using timechart and a span of 1 second. Note that the "problem" event was just prior to the gap in acceleration, not within the gap itself.
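Roughly, the hunt looked like this (index and sourcetype are placeholders). First, bucket the suspect time range into 1-second slices to find the odd second:
index=your_index sourcetype=your_sourcetype
| timechart span=1s count
Then pull the events around that second and sort by raw length; an abnormally large, binary-looking event tends to stand out:
index=your_index sourcetype=your_sourcetype
| eval raw_length=len(_raw)
| sort - raw_length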
We are on Splunk 7.2.6, no other errors were thrown, and the datamodel acceleration showed 100% complete.
Hope this helps.
Forgot to mention, our fix was to pipe that single event to "delete" and rebuild dm acceleration.
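For completeness, the cleanup was roughly this shape (the search terms are illustrative only; running | delete requires a role with the can_delete capability, and it only hides the event from search rather than reclaiming disk space):
index=your_index sourcetype=your_sourcetype "unique_fragment_from_the_corrupt_event"
| delete
After that, we rebuilt the datamodel acceleration from Settings > Data models.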
For finding the corrupt event, you said you "narrowed this down using timechart and a span of 1 second." Could you elaborate? I'm seeing something similar, and I'm not sure I have a corrupt event or not... but not sure how to reveal it.
That is an interesting issue you have described, @codebuilder, it seems to be different from what we are seeing, though.
Splunk support told us it is a known bug that will be fixed in a future release. They came up with the following workaround, which fixes the problem for us:
Please edit/create the file $SPLUNK_HOME/etc/system/local/limits.conf.
Add the following stanza:
[directives]
read_summary = disabled
Restart the instance for changes to take effect.
In a distributed environment this should be done on the search head (SH).
Which Splunk version & OS?
This fixes the problem for Splunk 7.x. However, the issue is back in 8.x and the read_summary setting does not help.
Splunk 7.x on Windows. Our Ubuntu Splunk servers are not affected.
We did some further testing and it seems that the problem only occurs when Splunk is running on Windows. Splunk on Ubuntu is fine.
Good to hear that you at least got the log fixed. Now that you have scheduler logs, what's your skip ratio in the Monitoring Console (MC)?
For most hours, the skip ratio is between 0% and 0.5%.
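For reference, roughly the same figure can be computed straight from scheduler.log; this is just a sketch using the standard status field values:
index=_internal source=*scheduler.log status=*
| timechart span=1h count(eval(status="skipped")) as skipped count as total
| eval skip_ratio_pct=round(100*skipped/total, 2)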
Dominik's environment had an almost-empty scheduler.log, indicating something really fishy going on with the scheduler.
Dominik here. The empty scheduler.log was the result of a misconfiguration in $SPLUNK_HOME\etc\log.cfg. It is fixed now and scheduler.log is filling up again, but the problem Helge described is still there.
I also gave the scheduler more resources by raising max_searches_perc and auto_summary_perc from 50 to 80 in limits.conf (see the snippet below).
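For reference, both settings live in the [scheduler] stanza of limits.conf, so the change looked roughly like this:
[scheduler]
max_searches_perc = 80
auto_summary_perc = 80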
Splunk version: 7.2.1
According to this machine's scheduler.log, Splunk has been busy (in a good way): lots of DEBUG and INFO messages, nearly no errors. I updated the question with details.
Did you get your lack of scheduler logs fixed?
Hi @martin_mueller, I am not sure I understand your question. However, I just re-checked (several hours after my initial post) and the missing events "magically" reappeared. I do not consider that a solution, because we have seen this before, and I cannot have searches sometimes showing incomplete results.
Even more interesting, the acceleration was turned off in these past hours. When I just turned it back on, it immediately showed a status of "100% completed" and it immediately returned the events that had been missing.
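For what it's worth, the completion figure the UI shows can also be pulled via REST. This sketch is based on what the Monitoring Console's datamodel views use, so treat the endpoint argument and field names as assumptions to verify on your version:
| rest /services/admin/summarization by_tstats=t splunk_server=local
| table summary.id summary.complete summary.size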