Getting Data In

Performing a procedural search on daily-indexed data, appending results as they appear.

jacksonmcarthur
Engager

Just looking for the best-practice solution to the problem below. I'm pretty new to Splunk, so I suspect the answer might be quite simple.

The problem:
Currently, a million logs come into a location daily. At the end of every month, these logs are indexed, and a report based on the search results is created. Since all thirty million logs are processed in one block, they take a long time to index, and even longer to search.

The fix:
A single search runs over the course of the month, indexing new logs as they arrive, searching them, and appending all results to one large XML, CSV, or similar file.

The implementations:
• Set an alert that triggers when new files to be indexed are detected, i.e. if they are not already indexed, index them, immediately run the search on them, and append the resulting search data to a file.

• Run tscollect daily on data that is not already indexed, collecting a relevant subset of the raw data into a .tsidx namespace, then process it in a block at end-of-month to create the report using the quicker tstats (rough sketch after this list).

• Simply set a scheduled search (over the last 24h) to run daily after the logs are indexed, appending results to a file.
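
To illustrate the second option, here is roughly what I have in mind (the index name, search terms, and the monthly_subset namespace are placeholders for whatever a real deployment would use). A daily run could collect the new events into a namespace:

index="myIndex" mySearchTerms earliest=-24h | tscollect namespace=monthly_subset

Then at end-of-month the report would come from the accumulated namespace via the quicker tstats, e.g.:

| tstats count from monthly_subset by sourcetype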

Thanks for the help!

1 Solution

koshyk
Super Champion

Eh!? Why do you want to build all this complicated detection-and-triggering logic?

Splunk's collection via inputs.conf, using the "monitor" and "batch" input types, can do all of this for you automatically, i.e. watch for files as they arrive, index them instantly, check whether they have already been indexed, etc.

https://docs.splunk.com/Documentation/Splunk/8.0.2/Data/Monitorfilesanddirectorieswithinputs.conf

In your case, I would just use a monitor input and it will be all good:

[monitor:///your/location/with/read/permission/*.log]
disabled = 0
sourcetype = mycustom_logs
index = my_index

Also, please ensure you set up props & transforms for index-time and search-time extraction for your sourcetype.
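
Just as a rough sketch, not a drop-in config (the sourcetype matches the monitor stanza above; the TIME_FORMAT and the extraction regex are made-up examples you would replace with your real log layout):

# props.conf
[mycustom_logs]
# timestamp recognition: timestamp at the start of each line
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
# search-time field extraction; index-time extractions would instead go
# through a TRANSFORMS- entry here plus a stanza in transforms.conf
EXTRACT-status = status=(?<status>\d+)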


jacksonmcarthur
Engager

Cool, thanks for the response! I have a follow-up question:

Can I leverage monitor to make sure that new incoming logs are only searched once? At the moment, the search restarts and reads every indexed log. This becomes a problem as the month's logs grow very large; the search begins to take many hours to complete.


koshyk
Super Champion

That is correct. Splunk's monitor is quite powerful and performs a lot of sanity checks. Please read the inputs.conf details and look at initCrcLength and the redundancy checks Splunk does by default. (You can of course change them, but it is normally not required.)
I'm sure that in your case Splunk will index from the point where it last saw data, so it shouldn't be an issue at all.
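
For example, if many of your log files begin with an identical header block, you can make Splunk checksum more of the file head so it can tell them apart (the default initCrcLength is 256 bytes; 1024 here is just an illustrative value):

[monitor:///your/location/with/read/permission/*.log]
initCrcLength = 1024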

If you find the answer good, please upvote & accept. Cheers.


jacksonmcarthur
Engager

Sorry koshyk, I should clarify what I meant to ask. I'm all okay with indexing the raw data; what I'm asking is whether it is possible to search the indexed logs only once.

I want to perform a search and output the resulting data to a .csv file. Here is some example code for that:

index="myIndex" mySearchTerms | outputcsv myCsvFile.csv append=true

But if I perform this search once a day, logs in the index get searched more than once, which leads to longer processing times (as the index grows) and redundant data making its way into the .csv output.

Is it possible to perform a search only on data that has not been searched yet?


koshyk
Super Champion

For searching, it's up to your logic. What we normally do is "scheduled searches", i.e. run saved searches on a schedule/cron.

e.g.
- Run a search every 30 mins over the window (earliest=-1h latest=-30m)
- Run it continuously and each run picks up only what hasn't been searched before
- From there you can alert on anything in particular, feed a summary index, or write to an outputcsv as you wish (see the sketch below)
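
As a sketch, such a scheduled search could be defined in savedsearches.conf along these lines (the stanza name is made up, and the search reuses your earlier example; you could equally create it in the UI as a saved search):

[monthly_report_feed]
search = index="myIndex" mySearchTerms | outputcsv append=true myCsvFile.csv
dispatch.earliest_time = -1h
dispatch.latest_time = -30m
cron_schedule = */30 * * * *
enableSched = 1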

Also, Splunk searches very fast. For example, if your indexer is capable (or clustered), Splunk can search a billion events in about 30 seconds (a rough estimate).

jacksonmcarthur
Engager

Awesome, didn't know that about scheduled searches! Thank you for all the help 🙂
