I have a Python script that retrieves data from an external source and stores it in several .csv files. I have added the necessary entries to transforms.conf and savedsearches.conf so that the lookup function in search can find the data mappings. The .csv files are stored in the apps//lookups directory. This is working as expected.
I plan to run the Python script once per hour to refresh the data in the .csv files, but I'm looking for the recommended way to do this.
- What is the best way to run the script on a schedule?
- Is there a specific entry I should make in savedsearches.conf? Should the script be placed in the apps//bin directory?
- Is it advisable to use inputs.conf, send the tables to stdout, and have Splunk index them directly? (I really only want one copy of the data; it is not time-based.)
- When performing the lookups, does Splunk cache the .csv data?
- If the .csv file is updated on the fly, does Splunk know to refresh its internal representation?
- Is there a reduction in efficiency if the lookup tables grow very large? I expect 10K-20K rows.
I would recommend setting up a scripted input for this in inputs.conf, like so (the script path in the stanza header is a placeholder; point it at your script in the app's bin directory):

[script://./bin/yourscript.py]
disabled = false
## once per week on Wednesday; using cron so execution isn't tied to the start time
interval = 0 0 * * 3

For your hourly refresh, a cron schedule of 0 * * * * would run the script at the top of every hour.
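Since a search may read the lookup file while the script is refreshing it, one way to make the hourly update safe is to write the rows to a temporary file in the same directory and then rename it into place. Here is a minimal sketch; the file name, field names, and sample row are hypothetical stand-ins for your real data:

```python
import csv
import os
import tempfile

def write_lookup(rows, dest_path, fieldnames):
    """Write lookup rows to a temp file, then atomically rename into place.

    The rename means a search reading the lookup mid-refresh sees either
    the old file or the new one, never a half-written CSV.
    """
    dir_name = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_path, dest_path)  # atomic replace on POSIX and Windows
    except Exception:
        os.remove(tmp_path)  # don't leave a stray temp file behind
        raise

# Hypothetical example data; the real script would fetch from the external source.
write_lookup(
    rows=[{"host": "web01", "owner": "ops"}],
    dest_path="my_lookup.csv",
    fieldnames=["host", "owner"],
)
```

The temp file is created in the destination directory (not /tmp) so that the rename stays on one filesystem and remains atomic.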
For schedules, you can use an interval specified as a number of seconds between executions, or a cron schedule. I think the approach you are using, generating a .csv and using it as a lookup within Splunk, is the correct one.

I don't believe Splunk caches the .csv data, so the contents will be read from disk per invocation. Updates to the .csv should take immediate effect in Splunk. 10K-20K rows should not be a problem.

There are considerations for distributed environments, as the lookup file will by default be replicated down to the indexers. If the lookup is invoked via "| lookup" rather than automatically through props.conf, you can add the .csv to the distsearch.conf replication blacklist and use "| lookup local=true", which makes the lookup run only on your search head.
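As a sketch of that last point, the distsearch.conf entry on the search head might look like the following; the attribute name and the lookup file name are placeholders, and the exact pattern syntax should be checked against the distsearch.conf documentation for your Splunk version:

```
# distsearch.conf on the search head: keep this CSV out of the
# knowledge bundle replicated to the indexers
[replicationBlacklist]
mybiglookup = apps/*/lookups/my_lookup.csv
```

Then, in search, force the lookup to run locally on the search head:

```
... | lookup local=true my_lookup_definition host OUTPUT owner
```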