I've recently found myself looking at the pipelines in Splunk, through the How Indexing Works wiki page, or @amrit and @Jag's conf2014 talk. It seems like the input through indexing pipelines are very modular, and that they're wired together through various .xml files in $SPLUNK_HOME/etc that wind up merged in $SPLUNK_HOME/var/run/splunk/composite.xml.
If an event lands in the indexing pipeline, I have a number of options: I could send the event to a remote Splunk instance (or the raw event over TCP to some other server), I could send the event to a remote syslog server, or I could just index the event locally. (And depending on the steps taken on the event in previous pipelines and configurations, I could do any one or all three of these.)
But I have been asked: what if I wanted to ship raw logs during the indexing pipeline to another service that didn't take just raw TCP or UDP, for example if I needed to wrap the event in some other protocol before sending? Is there a guide on how one could develop a custom processor, install it with a Splunk App, and inject it into the indexing pipeline (or alternatively create a queue and a custom pipeline before the indexing pipeline)?
Yes, there is a danger that my bad processor code could back up my indexing pipeline. But if I send TCP to a remote instance that cannot accept any more input, I could back up my indexing pipeline as well, so in that sense the choice between being in Splunk directly and standing up a custom server that accepts TCP and sends it on seems a bit moot. However, I would think that if I could be in my Splunk pipeline, I should be able to take advantage of some of the additional metadata around the events as well (and not just have the raw event, or have to parse things again), which feels like it would be an advantage.
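For comparison, here is a minimal sketch of the "custom server that accepts TCP and sends it on" alternative. It assumes one raw event per line, and the envelope (a 4-byte length prefix plus a JSON body with "host" and "raw" fields), the forward() placeholder, and the port are all made up for illustration; none of this is a Splunk API. Note that the only metadata it has is the peer address, which is exactly the limitation described above.

```python
import json
import socketserver
import struct


def wrap_event(raw: str, host: str) -> bytes:
    """Wrap one raw event in a hypothetical envelope:
    4-byte big-endian length prefix followed by a compact JSON body."""
    body = json.dumps({"host": host, "raw": raw}, separators=(",", ":")).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def unwrap_event(frame: bytes) -> dict:
    """Inverse of wrap_event, useful for checking the framing."""
    (length,) = struct.unpack(">I", frame[:4])
    return json.loads(frame[4:4 + length].decode("utf-8"))


def forward(frame: bytes) -> None:
    # Placeholder: a real relay would write the frame to the downstream service.
    print(len(frame), "bytes wrapped")


class RelayHandler(socketserver.StreamRequestHandler):
    """Reads newline-delimited raw events and forwards each one wrapped."""

    def handle(self):
        peer = self.client_address[0]  # the only "metadata" available here
        for line in self.rfile:
            raw = line.decode("utf-8", errors="replace").rstrip("\r\n")
            if raw:
                forward(wrap_event(raw, peer))


def serve(port: int = 9999) -> None:
    # Port is arbitrary; it would be the target of a raw TCP output from Splunk.
    with socketserver.TCPServer(("0.0.0.0", port), RelayHandler) as server:
        server.serve_forever()
```

Calling serve() starts the relay. If the downstream service stalls inside forward(), the relay stops reading and the sender backs up, which is the same failure mode as a slow in-pipeline processor.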
Yes, I could let the event be indexed, then have a scheduled / real-time search ship off the event either with a custom command or an alert script. While I could join and transform the event with more sources of data, if my goal is simply to ship certain events off to another system, that means I'm potentially taking up search head processing time (and licensing) seemingly unnecessarily for this task.
Giving an upvote because it's a rather interesting app, and I can see implementing it in other places (and possibly enabling Splunk to replace parts of an upstream system 😄). However, I don't think the use case quite fits (and if you agree, we can make this a comment instead of an answer).
My inputs into Splunk are standard monitors on UFs and (where I would want to implement such routing) a HF / Indexer that's receiving splunktcp input from UFs. So to use PDI to solve my routing problem, I would have to reimplement the splunktcp receiver, which at first impression seems awkward at best and impossible at worst (requiring access to source code or a wrapper from Splunk, Inc. to keep up with improvements in the splunktcp protocol, and having to stay in lockstep with Splunk versions).
Additionally, I feel that injection after the typing pipeline is ideal, since this would put us after linebreaking and would enable per-event routing if needed. If implemented with a custom queue and pipeline, we could use regexprocessor to route events to customQueue instead of indexQueue or nullQueue, which feels very modular and injects us into the correct part of processing. (I think the only source code needed from Splunk, Inc. would be APIs for implementing processors, including reading/manipulating pipeline data, and probably better documentation around how to define the custom queue and pipeline.)
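To make the routing idea concrete, here is a toy model of per-event, regex-based queue selection. It is not Splunk code: "customQueue" is the hypothetical queue from this post, the two patterns are made-up examples, and the last-match-wins behaviour reflects my understanding that each matching transform overwrites the queue key in order, which is an assumption here.

```python
import re
from collections import defaultdict

# Hypothetical rules: (pattern, destination queue), applied in order.
RULES = [
    (re.compile(r"sshd"), "customQueue"),
    (re.compile(r"DEBUG"), "nullQueue"),
]


def route(event: str, rules=RULES, default: str = "indexQueue") -> str:
    """Return the destination queue for one event.

    Every matching rule overwrites the previous choice, so the last match wins;
    events that match nothing go to the default queue.
    """
    queue = default
    for pattern, destination in rules:
        if pattern.search(event):
            queue = destination
    return queue


def dispatch(events):
    """Group events by destination queue, discarding anything sent to nullQueue."""
    queues = defaultdict(list)
    for event in events:
        queue = route(event)
        if queue != "nullQueue":
            queues[queue].append(event)
    return dict(queues)
```

In this model a custom pipeline would simply be whatever consumes the "customQueue" list, while everything else continues on to indexing untouched.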