Hey everyone. I am trying to put together an application and need some ideas. Right now my situation involves taking in data from a variety of sources across an enterprise network. The problem is a number of the sources only have the ability to export data to a single location, and that has to be dedicated to billing. So, before I can get the data, I have to wait for the billing department to do their thing with it, and then export it so I can use it.
I have one source feeding me directly via the universal forwarder. So in total I have one source in near real time, one source once every 24 hours, and the final source arriving as late as 72-96 hours. To have a complete record of an event, I will need all of the sources.
So for each of the sources, as they come in, I will periodically run a scheduled search to toss the contents I'm interested in into a summary index specific to that sourcetype. Then, whenever a user searches against our dashboard, it will correlate all of the records in the different summary indexes in real time. The fields from data sources we currently have will be filled in. The fields we don't have will be blank. This will be updated as more stuff comes in.
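To make the search-time correlation concrete, here is a rough Python sketch of the idea: each summary index holds a partial record keyed by a shared event ID, and the dashboard search stitches them together, leaving fields blank for sources that haven't exported yet. The index names, field names, and event IDs below are all hypothetical, not anything from a real deployment.

```python
def correlate(indexes, fields_per_index):
    """Merge partial records from several summary indexes by event_id."""
    merged = {}
    for index_name, events in indexes.items():
        for event in events:
            row = merged.setdefault(event["event_id"], {})
            for field in fields_per_index[index_name]:
                row[field] = event.get(field, "")  # blank if missing
    # blank out fields for sources that have not reported at all yet
    all_fields = [f for fs in fields_per_index.values() for f in fs]
    for row in merged.values():
        for field in all_fields:
            row.setdefault(field, "")
    return merged

indexes = {
    "summary_realtime": [{"event_id": "42", "user": "alice", "action": "login"}],
    "summary_daily":    [{"event_id": "42", "billing_code": "B-7"}],
    "summary_slow":     [],  # the 72-96 hour source has not arrived yet
}
fields_per_index = {
    "summary_realtime": ["user", "action"],
    "summary_daily":    ["billing_code"],
    "summary_slow":     ["invoice_total"],
}
rows = correlate(indexes, fields_per_index)
print(rows["42"])  # invoice_total stays "" until the slow source exports
```

As more data lands in each summary index, re-running the same correlation fills in the blanks, which is exactly the "updated as more stuff comes in" behavior described above.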
I would love for there to be an automated way to do this, however I've been told that a scheduled search that outputs to a summary index can't backfill the missing data; it just creates new events in the summary index. Is this correct?
The issue is finding a way for this to be done via an API. Suppose I have a saved search which returns the most current data in real time. Can you pass it arguments via the REST API like you can from a dashboard?
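For what it's worth, Splunk's REST API does expose an endpoint for dispatching a saved search, and argument-style form fields can be sent with it. The sketch below (stdlib only) just builds the request URL and body rather than sending anything; the host, owner, app, saved-search name, and the "region" argument are all made up, so check the REST API reference for your Splunk version before relying on the exact parameter names.

```python
from urllib.parse import urlencode, quote

def build_dispatch_request(host, owner, app, saved_search, args):
    """Build the URL and form body for a saved-search dispatch POST."""
    url = (f"https://{host}:8089/servicesNS/{owner}/{app}"
           f"/saved/searches/{quote(saved_search)}/dispatch")
    # args.<name> form fields supply values for tokens in the saved search
    body = urlencode({f"args.{k}": v for k, v in args.items()})
    return url, body

url, body = build_dispatch_request(
    "splunk.example.com", "admin", "search",
    "my_correlation_search", {"region": "us-east"})
print(url)
print(body)  # args.region=us-east
```

You would then POST that body with your credentials; the response includes a search job SID you can poll for results under the search/jobs endpoints.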
Well, technically a scheduled search CAN backfill data, provided you execute it from the command-line interface (CLI). All the details are in the docs.
There is a Python script that will run a saved search over a period of time in the past (bounded by -et and -lt), executing it as many times as it would have run on its normal schedule.
./splunk cmd python fill_summary_index.py -app search -name *your_saved_search* -et -2d@d -lt @d -j 8 -owner admin -auth admin:changeme
The -dedup parameter might suit your need not to overwrite data that is already there. I think this could be very helpful in your case.
Summary indexes are not a sort of Excel sheet in which you populate some cells of the same row at different times.
Deduping works at the "row" level, not the column level.
However, it all depends on the search you use to read the summary data and present it to the user, so fields from different summary index events can still end up in the same row of a table (e.g. transaction, or | bucket _time span=1h | stats first(field1), first(field6) by _time, ...)
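To illustrate the distinction, here is a rough Python analogue of that SPL: dedup removes whole identical events ("rows"), while a stats first(...) grouped by time bucket merges fields from different summary events into one output row. The field names and timestamps are illustrative only.

```python
def dedup(events):
    """Row-level dedup: drop events that are exact duplicates."""
    seen, out = set(), []
    for e in events:
        key = tuple(sorted(e.items()))
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

def stats_first_by_bucket(events, fields, span=3600):
    """Roughly: | bucket _time span=1h | stats first(<f>) by _time."""
    rows = {}
    for e in sorted(events, key=lambda e: e["_time"]):
        bucket = e["_time"] - e["_time"] % span
        row = rows.setdefault(bucket, {})
        for f in fields:
            if f in e and f not in row:  # first() keeps the earliest value
                row[f] = e[f]
    return rows

events = [
    {"_time": 1000, "field1": "a"},  # from the real-time summary
    {"_time": 1000, "field1": "a"},  # exact duplicate -> removed by dedup
    {"_time": 1200, "field6": "x"},  # same hour, from a later-arriving source
]
merged = stats_first_by_bucket(dedup(events), ["field1", "field6"])
print(merged)  # {0: {'field1': 'a', 'field6': 'x'}}
```

Note that the merge into one row happens at search time here; the two partial events still sit side by side in the summary index.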
I think you might want to create a test summary index and experiment with this a little. I'm not sure I got your use case right.
Also, to make sure I understand... suppose source 1 is the one I get in real time. Everything is based off that, and that's where the final summary index gets its first 5 fields. Source type two comes in once a day; I get the second 5 fields from that. Then there is the other one that comes in every 72 hours, and I get the last 5 fields from that. Source one comes in, the search runs from the CLI, and it puts the data from source one into the index, leaving sources 2 and 3's fields blank. Then source 2 comes in, the search runs, and that data gets added to the index, in the same event as source 1's event?
So to verify, running something like this with dedup would just fill in the missing fields with any additional data that may have come in for that particular complete record? I want to make sure I understand, because this could fundamentally change my system design - in a good way! I could run the search every 5 minutes, and it would correlate the additional source data as it becomes available instead of having to do everything at search time, which would speed everything up for everyone.