Hi,
I'd like to use Splunk itself to generate a .csv lookup file to enrich one of my data sources. The data for this lookup comes from an XML file that changes several times per week.
Now I want to automate this as much as possible and thought of the following solution:
a) Configure Splunk to monitor a folder into which I'd copy the new XML file once it has undergone one or more changes. The sourcetype for that file is configured to treat every line of the XML file as a new event.
b) Use a scheduled saved search (I have already finished the search string) to filter down and process the events from the XML file and generate a new lookup file (overwriting the old one) with the | outputlookup command.
My problem is that the scheduled search somehow has to use only the newest data from the XML file. Say I run a scheduled search every day over the last 24 hours and no new XML file was uploaded - outputlookup would overwrite the .csv file with an empty one. And if I uploaded two new XML files in one day, I'd end up with a .csv file containing duplicate entries.
Does anyone have an idea how to solve this problem?
Regards,
Leo
I happened to work this out for someone else yesterday. What you really want is "the lookup file contains the results of the last non-empty job run", right? So basically, you include the time of the run (or of the data set, or whatever) in the lookup. In the example below, I just use the job run time:
... | eval jobtime=now() | inputlookup append=true mylookup | sort 0 - jobtime | streamstats max(jobtime) as latestjobtime | head (jobtime==latestjobtime) null=t | fields - latestjobtime | outputlookup mylookup
Now the trick is that the part of the search before eval (the ...) needs to either return the last valid set, or else no results at all. So if you have a daily upload and you run your base search from -1d@d to @d each day, that should work fine.
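If you want to convince yourself of the keep-latest behavior without any real data, here's a self-contained sketch using makeresults (the field names jobtime and value are just placeholders; the two makeresults legs stand in for an old run and a new run):

```spl
| makeresults count=2
| eval jobtime=relative_time(now(), "-1d"), value="old"
| append [| makeresults count=2 | eval jobtime=now(), value="new"]
| sort 0 - jobtime
| streamstats max(jobtime) as latestjobtime
| head (jobtime==latestjobtime) null=t
| fields - latestjobtime
```

Only the two "new" rows survive: sort puts the latest jobtime first, streamstats carries the running maximum, and head stops emitting rows as soon as the expression becomes false.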
Ah! Thanks for this comment. I think I've now solved it:
sourcetype=xml_input "Service" | eval hostname=substr(name, 1, len(name)-25) | rex field=hostname "^(?
Thanks for your help!
Looks to me like you left out the inputlookup part. Also, I'd recommend you do your table, rename, and other field manipulations before adding eval jobtime, etc.
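For reference, here's roughly how your search could look with inputlookup folded in (the ... stands for your own extractions, table, and rename steps, and id_lookup.csv is the lookup file name from your post; adjust as needed):

```spl
sourcetype=xml_input "Service"
| eval hostname=substr(name, 1, len(name)-25)
| ... your field extractions, table, rename ...
| eval jobtime=now()
| inputlookup append=true id_lookup.csv
| sort 0 - jobtime
| streamstats max(jobtime) as latestjobtime
| head (jobtime==latestjobtime) null=t
| fields - latestjobtime
| outputlookup id_lookup.csv
```

The key is the inputlookup append=true line: it pulls the previous lookup contents back into the result set before the latest-jobtime filter runs, so an empty base search leaves the old rows in place instead of wiping the file.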
Hm, sounds logical, but it doesn't quite work yet. I modified my search string so it now looks like this:
sourcetype=xml_input "Service" | eval hostname=substr(name, 1, len(name)-25) | rex field=hostname "^(?
If I run it over a time period in which I have not uploaded new data, the lookup file "id_lookup.csv" gets overwritten with an empty file. I'd like to avoid this.