With the help of a few hints from a friendly contact at Splunk, I believe that I have managed to get this working. I have tested numerous configurations (single server vs. a 3-node SHC with a 6-node indexer cluster, generating vs. eventing base searches, with and without previews, localop and parallel on indexers) and all seem to work. Sometimes you must tune a timing parameter (throttleusec) that helps the custom command child process throttle the results it passes back to the Splunk parent daemon, but I have gotten this version to work with hundreds of millions of events very reliably.
The solution is embodied in the new echo custom command implemented in this change...
Also, Splunk reports on my service ticket that this underlying timing bug will be resolved in a future release.
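For anyone hitting the same wall, the throttle mentioned above boils down to sleeping briefly before each batch of results is flushed back to splunkd, exposed as a search-time option. A minimal sketch of the idea follows; this is not the actual echo command: the class name, batch size, and default value are illustrative, and self.flush() assumes a reasonably recent python SDK.

import sys
import time

from splunklib.searchcommands import dispatch, StreamingCommand, Configuration, Option, validators

@Configuration()
class EchoCommand(StreamingCommand):
    # Microseconds to sleep before each flush, so the child process paces the
    # rate at which results are pushed back to the parent Splunk daemon.
    throttleusec = Option(require=False, default=50000,
                          validate=validators.Integer(minimum=0))

    def stream(self, records):
        count = 0
        for record in records:
            yield record
            count += 1
            if count % 10000 == 0:                    # arbitrary batch size
                time.sleep(self.throttleusec / 1e6)   # pace the parent daemon
                self.flush()                          # push buffered results now

dispatch(EchoCommand, sys.argv, sys.stdin, sys.stdout, __name__)

With an option like this in place, the throttle can be adjusted straight from the search bar (e.g. ... | echo throttleusec=200000) if a particular environment still hits the error.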
I have spent a couple of days attempting to better understand and work around this problem. At the end of my efforts, I have concluded that there is a bug in the Splunk daemon itself that behaves somewhat differently (timing-wise) from version to version and machine to machine. Along the way, I found multiple opportunities to enhance/improve the python SDK, but my fixes did not ultimately prevent the underlying problem from recurring. Details below.
We initially observed this problem in production as a scheduled job began consistently failing after working fine for months. The problem signature was (as stated in the question here):
<timestamp> ERROR ChunkedExternProcessor - Failure writing result chunk, buffer full. External process possibly failed to read its stdin.
<timestamp> ERROR ChunkedExternProcessor - Error in '<our_custom_cmd>' command: Failed to send message to external search command, see search.log.
Once this error appeared, it occurred consistently. The exact timing of when it occurred relative to the search launch time varied somewhat. We were quickly able to reproduce the problem on local, much simpler single-machine Splunk workstation installs using | makeresults or even index=_* | head 1000000 | table _time host source sourcetype style base searches connected to our custom command. As also stated here, we quickly determined that reducing the custom command to the simplest possible implementation (one that simply yielded back its input) still produced the problem. During these rounds of testing, we found that the error was not 100% consistent, and varying the number of events sent to the custom command and the size of those events seemed to change the frequency of the reported error. Additionally, debugging code, especially logging, added to the custom command affected the likelihood of hitting the error.
Given the text of the error, we began to suspect that our custom command was somehow allowing a pipe to fill, causing this issue. Especially suspicious was the custom command's stdin, which we knew to be connected to the Splunk daemon that was reporting the error. Reviewing the implementation strategy of the command, including the python SDK base classes, presented a few potential optimizations.
First, the python SDK currently reads the entirety of the input data into RAM before processing (due to the implementation of _read_chunk() in SearchCommand). This seemed problematic for multiple reasons (memory usage of the custom command, no true streaming for large data sets). We first attempted to repair this by building a "chunk-aware" input and processing the events as we read them from stdin. This timing change (reading the input records more slowly and producing output records while doing so) seemed to trigger the buffer full failure much more quickly, so, while we still think this is the best implementation, we abandoned it. For reference, the core of that reader was a byte-limited wrapper over the chunk body, roughly like this (the class name below is illustrative):
class ChunkedInput(object):    # illustrative name
    def __init__(self, infile, limit):
        self._file = infile
        self._limit = limit    # bytes remaining in the current chunk body

    def readline(self):
        if self._limit <= 0:   # chunk body fully consumed
            return b''
        line = self._file.readline()
        self._limit -= len(line)
        return line
Instead, we repaired the "read everything into RAM" problem by implementing a new class, StreamingSearchCommand, derived from SearchCommand. In this implementation, we reworked _read_chunk() and _records_protocol_v2() to download the incoming events into a gzip'd file in the dispatch directory and then reopen and stream them back from there. This greatly reduced the memory footprint required by our simple test custom SPL command, but the buffer full errors continued.
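To give a flavor of that change, the spooling portion amounts to something like the sketch below. The function name, block size, and file handling are illustrative; the real implementation parses the chunked-protocol headers rather than copying stdin wholesale.

import gzip
import os
import tempfile

def spool_then_stream(stdin, dispatch_dir):
    # Illustrative only: copy the incoming payload (binary stdin) into a
    # gzip'd spool file in the search's dispatch directory instead of
    # accumulating it in RAM.
    fd, path = tempfile.mkstemp(suffix='.gz', dir=dispatch_dir)
    with os.fdopen(fd, 'wb') as raw:
        with gzip.GzipFile(fileobj=raw, mode='wb') as spool:
            for block in iter(lambda: stdin.read(65536), b''):
                spool.write(block)
    # Reopen the spool and stream records back one line at a time, keeping the
    # command's memory footprint flat no matter how many events arrive.
    with gzip.open(path, 'rb') as spool:
        for line in spool:
            yield line
    os.remove(path)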
Next, we imagined that, since our command was not continuing to monitor and read from stdin once it had collected all of the incoming events, the pipe attached to our stdin might be filling to the point where the Splunk daemon would block attempting to write into it. We imagined that this condition could underlie the error reported here. We repaired this oversight by improving our custom SmartStreamingCommand class to also occasionally poll (select, actually) the stdin file descriptor and read out a chunk if data was present. Testing of this implementation confirmed that the Splunk daemon did, in fact, continue to occasionally write things to us through this pipe, even after we had collected all input records. Still, this improvement did not completely prevent the dreaded buffer full error.
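The polling we added looks roughly like this (a sketch; the helper name, caller-supplied buffer, and 64 KB read size are arbitrary):

import os
import select
import sys

def drain_stdin(pending):
    # Illustrative: poll stdin without blocking. If splunkd has written more
    # data into the pipe since we last looked, read it out so the pipe cannot
    # fill up and block the daemon's writer.
    fd = sys.stdin.fileno()
    readable, _, _ = select.select([fd], [], [], 0)   # zero timeout == poll, never block
    if readable:
        data = os.read(fd, 65536)
        if data:
            pending.append(data)   # keep it; it may belong to a later protocol chunk

We call something like this periodically from the output loop, so the pipe from splunkd keeps getting drained even after all expected input records have been read.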
Finally, reviewing the python SDK implementation further, we were concerned that it might not be flushing records in a streaming fashion, but instead waiting until the generator chain ( self._record_writer.write_records(process(self._records(ifile))) ) drained. So, we added an occasional flush() call to our SmartStreamingCommand implementation. Unfortunately, we continued to hit buffer full errors.
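That change amounts to flushing every so many records from inside the record-producing method, along these lines (a sketch; the batch size is arbitrary and self.flush() assumes a recent SDK):

def stream(self, records):
    # Sketch: emit records as usual, but flush every N records so buffered
    # output reaches splunkd steadily instead of all at once at the end.
    for count, record in enumerate(records, 1):
        yield record
        if count % 10000 == 0:
            self.flush()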
At this point, we decided to bring out the big guns ( strace ) and we started by monitoring our process. We could easily see that we were regularly monitoring stdin and reading it quickly if data was present. Everything on the python SDK/custom command process side seemed okay, so we switched to strace ing the Splunk daemon itself. We found that attaching strace had the (Heisenberg) effect of eliminating the problem altogether. Excellent.
After numerous tries, changing the number and list of syscalls that we were intercepting, we finally managed to catch the failure in action one time. We expected review of this precious, captured output to show a system call returning EWOULDBLOCK or similar, allowing us to work backwards to understand the condition that caused the Splunk daemon to become upset and produce this error. Unfortunately, after quite a bit of time tracing file descriptors, futexes and signals across threads in the Splunk daemon, all of the system calls looked fine and no clear culprit was illuminated.
Additional inspection of the many search.log examples generated during the testing and evaluation of this issue did seem to show a pattern. Specifically, the Splunk daemon would fairly consistently issue this error approximately 80ms after the custom command had flushed a batch of records back to it. We believe this suggests that the splunkd code that reads event records back from the custom command is implicated in this issue. We attempted, with some but not complete success, to leverage this observation by adding a slight sleep() before flushing each batch of records.
After numerous attempts to work around this issue, including building lower-footprint, more efficient python SDK replacements, we remain stuck, unable to build custom commands that process more than a million or so events without failing. This is a significant weakness in our current "big data" infrastructure and is blocking us on a few fronts. We would welcome advice or collaboration aimed at working towards a solution. I will file a Splunk support ticket referencing this item on Splunk Answers...