Sometimes specific events are missing from an accelerated data model. These events can be found with a regular SPL search.
When searching the data model, the events are only returned when the data model is not accelerated. Once the acceleration is enabled, the events do not show up any more. Of course, we checked that the DMA's status is "100% completed". Rebuilding the accelerated data model does not help.
What is really strange: events sometimes disappear from the accelerated data model even though they showed up just fine mere minutes earlier. I noticed this today when I refreshed a dashboard and some charts were suddenly empty (which led me to investigate this again and post this question).
Here is an example of a search that is not returning all expected events (we have noticed this issue with different datasets in the past, too):
| pivot uberAgent Process_NetworkTargetPerformance
count(Process_NetworkTargetPerformance) as "Event count"
splitrow AppName
filter host is "Client11"
We have seen this issue on various versions of Splunk Enterprise 7.1.x and 7.2.x (including 7.2.1).
Screenshots of the above search, first with acceleration disabled (9 results), then with acceleration enabled (6 results):
I ran a search to query scheduler.log over a time range of 8 hours, which covered the time of the issue:
index=_internal source=*scheduler.log
| stats count by log_level
Result:
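Beyond the breakdown by log_level, the acceleration jobs themselves can be checked in the same log. Assuming the summarization searches follow the usual _ACCELERATE_ naming, something like this shows how they finished:
index=_internal source=*scheduler.log savedsearch_name="_ACCELERATE_*"
| stats count by savedsearch_name status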
Dominik's answer fixed the problem for Splunk versions 7.x. However, it is back in Splunk 8 (we tested 8.0 and 8.0.1).
We just recently chased this very issue, though our root cause may be different from yours; I'm not certain.
The issue manifested itself in some of our dashboards, which are powered by accelerated datamodels. Many began to run very slowly, while others (backed by different datamodels) were fine.
During RCA, we found that the event count for a given datamodel differed depending on whether acceleration was used. This is easy to check by toggling summariesonly=t / f (see the sketch below). We also found gaps in the acceleration, even though the UI was showing 100% accelerated.
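As a rough sketch of that check (the datamodel name below is just a placeholder), run these two searches over the same time range and compare the counts; if the first number is lower, the summaries are missing events:
| tstats summariesonly=t count from datamodel=Your_Datamodel
| tstats summariesonly=f count from datamodel=Your_Datamodel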
In any case, check your internal logs for PCRE_ERROR_MATCHLIMIT entries:
index=_internal source=*splunkd.log* PCRE_ERROR_MATCHLIMIT | cluster showcount=t | table cluster_count _raw | sort -cluster_count
What we found was that in certain periods of the accelerated date/time range, the forwarder had encountered corrupted events. Though the data was still indexed, one or more datamodel acceleration jobs/searches would fail because a regex exceeded the PCRE match limit (PCRE_ERROR_MATCHLIMIT; the default limit is 100k). Later jobs would pick up at a later point in time and succeed, which caused the gap(s) in our acceleration.
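If cleaning up the offending data is not an option, the search-time PCRE match limit can in principle be raised in limits.conf. We did not go that route, it trades CPU for fewer MATCHLIMIT aborts, and the stanza below reflects the 7.x spec as I understand it, so verify it against your version:
[rex]
match_limit = 200000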
So, when a user would open a dashboard and pick a date/time range that included the "gap", the query would drop back to searching non-accelerated data, causing the slowness.
In our case, a single corrupted event was the cause. It was full of hex/binary data. We narrowed this down using timechart and a span of 1 second. Note that the "problem" event was just prior to the gap in acceleration, not within the gap itself.
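Roughly, the hunt looked like this (index and sourcetype are placeholders). First, bucket the suspect time range into 1-second slices to find the odd second:
index=your_index sourcetype=your_sourcetype
| timechart span=1s count
Then pull the events around that second and sort by raw length; an abnormally large, binary-looking event tends to stand out:
index=your_index sourcetype=your_sourcetype
| eval raw_length=len(_raw)
| sort - raw_length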
We are on Splunk 7.2.6, no other errors were thrown, and the datamodel acceleration showed 100% complete.
Hope this helps.
Forgot to mention, our fix was to pipe that single event to "delete" and rebuild dm acceleration.
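For completeness, the cleanup was roughly this shape (the search terms are illustrative only; running | delete requires a role with the can_delete capability, and it only hides the event from search rather than reclaiming disk space):
index=your_index sourcetype=your_sourcetype "unique_fragment_from_the_corrupt_event"
| delete
After that, we rebuilt the datamodel acceleration from Settings > Data models.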
For finding the corrupt event, you said you "narrowed this down using timechart and a span of 1 second." Could you elaborate? I'm seeing something similar, and I'm not sure I have a corrupt event or not... but not sure how to reveal it.
That is an interesting issue you have described, @codebuilder, it seems to be different from what we are seeing, though.
Splunk support told us it is a known bug that will be fixed in a future release. They came up with the following workaround, which fixes the problem for us:
Please edit/create the file $SPLUNK_HOME/etc/system/local/limits.conf.
Add the following stanza:
[directives]
read_summary = disabled
Restart the instance for changes to take effect.
In a distributed environment this should be done on the search head (SH).
Which Splunk version & OS?
This fixes the problem for Splunk 7.x. However, the issue is back in 8.x and the read_summary setting does not help.
Splunk 7.x on Windows. Our Ubuntu Splunk servers are not affected.
We did some further testing and it seems that the problem only occurs when Splunk is running on Windows. Splunk on Ubuntu is fine.
Good to hear that you at least got the log fixed. Now that you have scheduler logs, what's your skip ratio in the Monitoring Console (MC)?
For most hours, the skip ratio is between 0% and 0.5%.
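For reference, roughly the same figure can be computed straight from scheduler.log; this is just a sketch using the standard status field values:
index=_internal source=*scheduler.log status=*
| timechart span=1h count(eval(status="skipped")) as skipped count as total
| eval skip_ratio_pct=round(100*skipped/total, 2)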
Dominik's environment had an almost-empty scheduler.log, indicating something really fishy going on with the scheduler.
Dominik here. The empty scheduler.log was the result of a misconfiguration in $SPLUNK_HOME\etc\log.cfg. It is fixed now and scheduler.log is filling up again, but the problem Helge described is still there.
I also gave the scheduler more resources by raising max_searches_perc and auto_summary_perc from 50 to 80 in limits.conf (see the snippet below).
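For reference, both settings live in the [scheduler] stanza of limits.conf, so the change looked roughly like this:
[scheduler]
max_searches_perc = 80
auto_summary_perc = 80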
Splunk version: 7.2.1
According to this machine's scheduler.log, Splunk has been busy (in a good way): lots of DEBUG and INFO messages, nearly no errors. I updated the question with details.
Did you get your lack of scheduler logs fixed?
Hi @martin_mueller, I am not sure I understand your question. However, I just re-checked (several hours after my initial post) and the missing events "magically" reappeared. I do not consider that a solution, because we have seen this before, and I cannot have searches sometimes showing incomplete results.
Even more interesting, the acceleration was turned off in these past hours. When I just turned it back on, it immediately showed a status of "100% completed" and it immediately returned the events that had been missing.
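For what it's worth, the completion figure the UI shows can also be pulled via REST. This sketch is based on what the Monitoring Console's datamodel views use, so treat the endpoint argument and field names as assumptions to verify on your version:
| rest /services/admin/summarization by_tstats=t splunk_server=local
| table summary.id summary.complete summary.size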