Currently, about a million logs arrive at a single location daily. At the end of every month, these logs are indexed, and a report based on the search results is created. Since all thirty million logs are processed in one block, indexing takes a long time - and searching takes even longer.
Some approaches I've considered:
• Run a single search over the course of the month that indexes new logs as they arrive, searches them, and appends all results to one large XML, CSV, or similar file.
• Set an alert that triggers on detecting new files to be indexed, i.e. if files are not already indexed, index them, immediately run the search on these new files, then append the resulting search data to the file.
• Run tscollect daily on data that is not already indexed in a .tsidx file to collect a relevant subset of the raw data, then process it in a block at end-of-month using the quicker tstats command to create the report.
• Simply set a scheduled search (over the last 24h) to run daily after the logs are indexed, appending results to a file.
Thanks for the help!
Eh!? Why do you want all this complicated trigger-detection logic? Splunk's data collection in inputs.conf, using "monitor" and "batch", can do this for you automatically, i.e. watch for files as they arrive, index them instantly, and check whether they have already been indexed, etc.
In your case, I would just use a monitor input and it will be all good:

[monitor:///your/location/with/read/permission/*.log]
disabled = 0
sourcetype = mycustom_logs
index = my_index
Also, please ensure you set up props & transforms to do index-time and search-time extraction for your sourcetype.
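As a rough sketch, a search-time setup in props.conf for this sourcetype could look like the following (the timestamp format and the status field/regex are hypothetical; adjust them to your actual log layout):

# props.conf
[mycustom_logs]
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25
EXTRACT-status = status=(?<status>\d{3})

EXTRACT-* entries do search-time field extraction; index-time extractions would instead go through TRANSFORMS-* entries referencing stanzas in transforms.conf.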
Cool, thanks for the response! I have a follow up question:
Can I leverage monitor to make sure that new incoming logs are only searched once? At the moment, the search restarts and reads every indexed log. This becomes a problem as the month's logs grow very large; the search begins to take many hours to complete.
That is correct. Splunk's monitor input is quite powerful and performs a lot of sanity checks. Please read the inputs.conf details and look for initCrcLength and the redundancy checks Splunk does by default (you can of course change them, but it is normally not required).
I'm sure that in your case Splunk will index from the point where it last saw data, so it shouldn't be an issue at all.
If you find the answer is good, please upvote & accept. Cheers!
Sorry Koshyk, I should clarify what I mean to ask. I'm fine with how the raw data is indexed - what I'm asking is whether or not it is possible to search the indexed logs only once.
I want to perform a search, and output the resulting data to a .csv file. Here is some example code for that:
index="myIndex" mySearchTerms | outputcsv myCsvFile.csv append=true
But if I perform this search once a day, logs in the index will get searched more than once, which leads to longer processing times (as the index grows larger) and redundant data making its way into the .csv output.
Is it possible to perform a search only on data that has not been searched yet?
For searching, it's up to your logic. What we normally do is use "scheduled searches", i.e. run saved searches on a schedule/cron:
- Run a search every 30 minutes over (earliest=-1h latest=-30m)
- Run it continuously and it will cover whatever it hasn't searched before
- You can alert on anything particular from this, write to a summary index, or put it into an outputcsv as you wish
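A minimal sketch of such a scheduled search in savedsearches.conf, reusing the example search from earlier in the thread (the stanza name, index, and search terms are placeholders):

# savedsearches.conf
[daily_log_report]
search = index="myIndex" mySearchTerms | outputcsv myCsvFile.csv append=true
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
enableSched = 1
cron_schedule = 0 1 * * *

This runs daily at 01:00 over exactly the previous calendar day, so each event falls into one and only one run; the one-hour gap between @d and the run time leaves some slack for indexing lag.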
Also, Splunk searches very fast. For example, if your indexer is capable (or clustered), Splunk can search a billion events within about 30 seconds (a rough estimate).
Awesome, didn't know that about scheduled searches! Thank you for all the help 🙂