Hi,
I'd like to use Splunk itself to generate a .csv lookup file to enrich one of my data sources. The data for this lookup comes from an XML file that changes several times per week.
Now I want to automate this as far as possible, and I thought of the following solution:
a) Configure Splunk to monitor a folder into which I'd copy the new XML file whenever it has changed. The sourcetype for that file is configured to treat every line of the XML file as a separate event (see the config sketch below).
b) Use a scheduled saved search (I have already finished the search string) to filter and process the events from the XML file and generate a new lookup file (overwriting the old one) with the `| outputlookup` command.
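Roughly, I'm thinking of something like this for the monitoring part in a); the folder path and index are just placeholders, not my real setup:

```
# inputs.conf -- watch the drop folder for new XML files (path is a placeholder)
[monitor:///opt/splunk_drop/xml_lookup]
sourcetype = xml_input
index = main

# props.conf -- treat every line of the XML file as its own event
[xml_input]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
```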
My problem is that I somehow have to restrict the scheduled search to only use the newest data from the XML file. Say I run a scheduled search every day over the last 24 hours and no new XML file was uploaded: outputlookup would overwrite the .csv file with an empty one. And if I uploaded two new XML files in one day, I'd end up with a .csv file containing duplicate entries.
Does anyone have an idea how to solve this?
Regards,
Leo
(Splunk Employee)
I happened to work this out for someone else yesterday. What you really want is "the lookup file contains the results of the last non-empty job run", right? So basically, you include in the lookup the time of the run (or of the data set, or whatever). In the example below, I just use the job run time:
```
... 
| eval jobtime=now() 
| inputlookup append=true mylookup 
| sort 0 - jobtime 
| streamstats max(jobtime) as latestjobtime 
| head (jobtime==latestjobtime) null=t 
| fields - latestjobtime 
| outputlookup mylookup
```
Now the trick is that the part of the search before `eval` (the `...`) needs to either return the last valid set, or else no results at all. So if you have a daily upload and you run your base search from `-1d@d` to `@d` each day, that should work fine.
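If it helps, here's a sketch of how the scheduling side could look in savedsearches.conf; the stanza name and cron schedule are just examples:

```
# savedsearches.conf -- example schedule: run once a day, shortly after midnight
[update_mylookup]
enableSched = 1
cron_schedule = 15 0 * * *
# search only yesterday's data, so a run sees the latest upload or nothing at all
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
search = <your base search> | eval jobtime=now() | inputlookup append=true mylookup | sort 0 - jobtime | streamstats max(jobtime) as latestjobtime | head (jobtime==latestjobtime) null=t | fields - latestjobtime | outputlookup mylookup
```

That way each run only sees the most recent day's upload (if any), and the jobtime logic keeps the previous lookup contents when the base search comes back empty.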
Ah! Thanks for this comment. I think I have now solved it:
```
sourcetype=xml_input "Service" | eval hostname=substr(name, 1, len(name)-25) | rex field=hostname "^(?
```
Thanks for your help!
(Splunk Employee)
It looks to me like you left out the `inputlookup` part. Also, I'd recommend doing your `table`, `rename`, and other field manipulations before adding `eval jobtime`, etc. Roughly like the sketch below.
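A rough sketch of the whole thing in that order, assuming your lookup is id_lookup.csv; the `rex` extraction and the `host_id` field are placeholders, not your actual regex:

```
sourcetype=xml_input "Service"
| eval hostname=substr(name, 1, len(name)-25)
| rex field=hostname "^(?<host_id>\w+)"
| table host_id hostname
| eval jobtime=now()
| inputlookup append=true id_lookup.csv
| sort 0 - jobtime
| streamstats max(jobtime) as latestjobtime
| head (jobtime==latestjobtime) null=t
| fields - latestjobtime
| outputlookup id_lookup.csv
```

Note that the field manipulations (`rex`, `table`) come before `eval jobtime`, and `inputlookup append=true` pulls the previous lookup contents back in so they survive runs where the base search returns nothing.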
Hm, sounds logical, but it doesn't quite work yet. I modified my search string so it now looks like this:
```
sourcetype=xml_input "Service" | eval hostname=substr(name, 1, len(name)-25) | rex field=hostname "^(?
```
If I run it over a time period in which I haven't uploaded new data, the lookup file "id_lookup.csv" gets overwritten with an empty one. I'd like to avoid this.