Hi developers, I am trying to analyse some logs by extracting them in JSON format and feeding them to Splunk.
I have millions of these logs, each resulting in a JSON file of 4-5 KB.
How can I monitor these files effectively so that Splunk picks up each file?
Thanks.
A major issue can be the ulimit
for open files. Please read the great post by @yannk: how to tune ulimit on my server?
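If you want a quick way to check this from a script, something like the following (Python, just as an illustration) prints the current soft and hard limits for open files and raises the soft limit:

```python
import resource

# Current per-process limits for open file descriptors (Linux/macOS).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# The soft limit can usually be raised up to the hard limit without root;
# raising the hard limit itself is done via limits.conf / the ulimit shell builtin.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```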
I see two main options: keep writing the files to disk and index them with a regular file monitor or batch input, or skip the intermediate files and send the events straight to Splunk via the HTTP Event Collector.
I don't have experience myself with such huge amounts of files, but unless you get some specific recommendations here, I'd suggest just giving it a try (ideally in a test setup, of course) and seeing what issues you run into. Then you can always post back here to get help resolving those issues.
Hi @FrankVl, I tried the HTTP Event Collector method and found it useful.
Now the issue is that I have to run a curl command for each file. I get millions of files to process every day, so would running curl that many times be too much overhead?
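For context, each call is essentially the equivalent of this (shown in Python instead of curl; the URL and token are placeholders):

```python
import json

import requests

# Placeholder HEC endpoint and token - replace with your own.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEADERS = {"Authorization": "Splunk 00000000-0000-0000-0000-000000000000"}

def send_file(path):
    # One HTTP POST per JSON file, i.e. the same work each curl invocation does.
    with open(path) as f:
        event = json.load(f)
    resp = requests.post(HEC_URL, headers=HEADERS,
                         json={"event": event, "sourcetype": "_json"},
                         timeout=30)
    resp.raise_for_status()
```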
I also have an idea of merging all the JSON records into one file, separated by [EOF], sending that file across to Splunk, and breaking events on [EOF].
But it is not getting indexed by Splunk, because [EOF] is not valid JSON.
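To make the idea concrete, the merge step I have in mind looks roughly like this (the paths are placeholders):

```python
import glob

SEPARATOR = "[EOF]\n"  # the marker I want Splunk to break events on

# Concatenate the individual JSON files into one file, with [EOF] between records.
with open("/data/merged/all_events.log", "w") as out:    # placeholder output path
    for path in sorted(glob.glob("/data/logs/*.json")):  # placeholder input directory
        with open(path) as f:
            out.write(f.read().rstrip("\n") + "\n")
        out.write(SEPARATOR)
```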
Any other solutions?
I don't think curl itself should add too much overhead, but you should be able to see for yourself whether it causes problematic load.
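If the per-file calls ever do become a bottleneck, one thing to try is batching several events into a single HEC request instead of one request per file. A rough sketch in Python (URL, token, directory and batch size are placeholders; I haven't tested this at your scale):

```python
import glob
import json

import requests

# Placeholder endpoint, token and directory - adjust to your environment.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEADERS = {"Authorization": "Splunk 00000000-0000-0000-0000-000000000000"}
BATCH_SIZE = 500  # number of files to combine into one HTTP request

def send_batch(events):
    # HEC accepts multiple event objects concatenated in one request body,
    # so a single POST can carry a whole batch of events.
    payload = "\n".join(json.dumps({"event": e, "sourcetype": "_json"}) for e in events)
    resp = requests.post(HEC_URL, headers=HEADERS, data=payload, timeout=30)
    resp.raise_for_status()

batch = []
for path in sorted(glob.glob("/data/logs/*.json")):
    with open(path) as f:
        batch.append(json.load(f))
    if len(batch) >= BATCH_SIZE:
        send_batch(batch)
        batch = []
if batch:
    send_batch(batch)  # flush the remainder
```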
As for your other idea: I don't completely follow what you tried and what exactly is failing.