I am creating an app for Splunk 4.1 that has a scripted input that retrieves data from a database. On the first run it will fetch the entire database into Splunk, and on subsequent runs it will only get events that are not already indexed by Splunk.
To accomplish this, every event contains a field holding the primary key of the database row it was generated from. On each run, the script performs a CLI search to find the highest such value already indexed in Splunk, and then fetches all rows with higher values from the database.
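For illustration, the incremental fetch looks roughly like the sketch below; the table name, column names, and use of sqlite3 are stand-ins, not details from my actual app.

# Rough sketch of the incremental fetch (table/column names are made up).
import sqlite3

def fetch_new_events(db_path, last_id):
    # Pull only rows newer than the highest primary key already in Splunk.
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        "SELECT id, created_at, message FROM events WHERE id > ? ORDER BY id",
        (last_id,))
    for row_id, created_at, message in cur:
        # Scripted inputs write events to stdout; embed the primary key so
        # later runs can search for the high-water mark.
        print("%s unique_id=%d %s" % (created_at, row_id, message))
    conn.close()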
Some events are, despite this, still being indexed more than once. As far as I have been able to determine, this is because the indexer has not finished indexing the previous block of events before the scripted input starts up again and makes its query.
How can I ensure that all events delivered by the scripted input have been made available to search and are not still in the process of being indexed? Is there any alternative other than simply setting interval to a higher value in inputs.conf and hoping for the best?
I actually managed to come up with a satisfactory conclusion to this dilemma on my own, after some more contemplation.
Simply put, I made sure the input script would not terminate until the latest event showed up in a search, i.e. something along these lines:
while splunk_search("unique_id=%s | head 1" % last_id) == "":
    time.sleep(5)
Here, splunk_search() is a function that dispatches a CLI search to the Splunk instance and returns the response as a string.
This ensures the input script never terminates until the event carrying that "unique_id" value is available to search, and has thus been indexed. This was a satisfactory solution for me, and "| head 1" keeps the search reasonably fast.
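For reference, one way to implement such a splunk_search() is to shell out to the Splunk CLI; the -auth credentials below are placeholders for your own installation.

# One possible splunk_search(), shelling out to the Splunk CLI.
# The -auth credentials are placeholders; adjust for your installation.
import subprocess

def splunk_search(query):
    # "splunk search <query>" prints matching events to stdout, so an
    # empty return value means the event is not yet searchable.
    proc = subprocess.Popen(
        ["splunk", "search", query, "-auth", "admin:changeme"],
        stdout=subprocess.PIPE)
    output, _ = proc.communicate()
    return output.decode("utf-8", "replace").strip()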
Looks like you have a solution. However, as a general solution I would record your pointer (an id number or whatever marks the location you've already read) in some other location that is tied to your script. Yes, unfortunately this means that your script must keep state around. For low-throughput applications it's okay to query Splunk to find the most recently indexed data and to hold the state in memory. However, this isn't very stable if you, e.g., restart Splunk or the script. And if the input needs to run at a very high rate, this approach will not be able to keep up, as you won't be able to send data into Splunk asynchronously, but will have to block and wait in batches.
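As a rough illustration of keeping that pointer outside Splunk (the path and format here are arbitrary), the script could persist it to a small state file between runs:

# Illustrative only: persist the last-read primary key in a small state file
# so the script survives restarts of either Splunk or the script itself.
import os

STATE_FILE = "/opt/splunk/var/lib/my_db_input.state"  # arbitrary location

def load_last_id():
    if not os.path.exists(STATE_FILE):
        return 0
    with open(STATE_FILE) as f:
        return int(f.read().strip() or 0)

def save_last_id(last_id):
    # Write to a temp file and rename so a crash can't leave half-written state.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(last_id))
    os.rename(tmp, STATE_FILE)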
A scripted input is going to use up at least one file descriptor full-time. Unlike with a file, Splunk can't close the input pipe and come back to it later. On the other hand, it's just a file descriptor: those are cheap if nothing is coming through them, unless you're bumping up against the OS per-process limit.
Yeah, I was wondering about the restart scenario as well--I could see that leading to duplicate events. In terms of resources, do you know what the impact is of having an input script block for a period of time? I know with monitor inputs there are maximum open file descriptor considerations. I'm guessing there is probably something similar with scripted inputs as well. Any idea how that works?
You are correct, there is a delay between when your script writes out events and when they are actually available via a search.
You could increase your polling interval... but that may not be ideal.
I would recommend a different approach. You would probably be better off creating a state file of some kind. (It could be stored on the file system or in the database itself).
If your database has an increasing numeric id as the primary key (which it sounds like you do), then all you have to do is store that value between runs and start there on the next invocation. That's a great approach when your table is set up that way. Or if you have a timestamp you can track, that can work well too.
I've built a hybrid of these two approaches before: I had a 24-character string as a primary key (with no recognizable sort order) and a timestamp column. I ran slightly overlapping queries based on the timestamp column on each invocation, and before writing out an event for Splunk to pick up, I checked its primary key against a list of the last 100 (or so) previously seen primary keys. If the key had been seen before, the event was silently skipped, and all new primary keys were saved back to disk when the script was done.
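A stripped-down sketch of that hybrid approach, with invented names, a JSON state file, and a 100-key window chosen just for illustration:

# Sketch of the hybrid approach: query with a slightly overlapping time window,
# then suppress rows whose primary keys were already emitted on a previous run.
import json
import os

SEEN_FILE = "/opt/splunk/var/lib/my_db_input.seen"  # arbitrary location
MAX_SEEN = 100

def load_seen():
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return json.load(f)
    return []

def emit_new_rows(rows, seen):
    # rows: iterable of (primary_key, timestamp, message) from the overlapping query
    for key, ts, message in rows:
        if key in seen:
            continue  # already indexed on an earlier run; skip silently
        print("%s primary_key=%s %s" % (ts, key, message))
        seen.append(key)
    # Keep only the most recent keys so the state file stays small.
    with open(SEEN_FILE, "w") as f:
        json.dump(seen[-MAX_SEEN:], f)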
It may also be helpful if you provide a little more detail about what kind of events you're trying to pull into Splunk. Are they log-like events? Or are you trying to load reference or lookup tables? (If so, you should really check out the lookup capabilities that Splunk provides.)