Getting Data In

What is the most efficient way to send a large number of raw CSV files to Splunk?

ilaila
New Member

I have a network share with a huge number of directories and files (.csv). Files are constantly being added and periodically removed for archiving. When I created a Directory data source for this share, I found that Splunk opens a lot of file handles as it watches the files and monitors them for changes. When I restart my instance, it also takes a very long time for this monitoring to start working; I was told this is because Splunk needs to "catch up" on any missed changes by scanning the entire share.

So I tried using Splunk's HEC to send my .csv files instead. If I send each row of the CSV as an individual event (sourcetype = _json), the overhead of repeating the CSV headers for every row quickly adds up: these CSVs are often very large, with more than 40 headers, and the headers are not very static either. It works, but it unnecessarily burns through the license limit.
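For reference, here is roughly what that per-row approach looks like (a simplified sketch; the endpoint, token, and paths are placeholders for my real settings):

```python
import csv
import json
import requests  # third-party: pip install requests

# Placeholder HEC endpoint and token -- substitute real values.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

def send_csv_rows(path):
    """Send each CSV row as its own JSON event.

    DictReader repeats every column header as a field name in every
    event, which is where the license overhead comes from on wide CSVs.
    """
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            payload = {"sourcetype": "_json", "event": row}
            requests.post(
                HEC_URL,
                headers={"Authorization": f"Splunk {HEC_TOKEN}"},
                data=json.dumps(payload),
            )
```

Batching several events into one POST would cut the HTTP overhead, but the repeated headers would still count against the license.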

If I send the raw string content of the CSV file as a single event (sourcetype = csv), it isn't interpreted correctly: the fields aren't detected. I'm not even certain this is supposed to work in the first place.
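From what I can tell, CSV header detection is driven by INDEXED_EXTRACTIONS in props.conf, so in principle a custom sourcetype along these lines is what I'd need; whether HEC actually applies it (it may depend on the endpoint and version) is part of what I'm unsure about:

```
# props.conf -- hypothetical sourcetype; the name is a placeholder
[my_csv]
INDEXED_EXTRACTIONS = csv
HEADER_FIELD_LINE_NUMBER = 1
```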

So both the Directory data source and HEC seem inefficient for my scenario. Are there any other options I can try (preferably out of the box, or an official app)? Or perhaps tweaks to the above methods (preferably not undocumented settings)?

tiagofbmm
Influencer

What is the problem with putting one or more monitor stanzas over those files in that directory?

ilaila
New Member

Do you mean a stanza for each folder? The folders that get added have random GUID names, and they are removed regularly as well.

tiagofbmm
Influencer

No, just put a monitor stanza on that folder and Splunk will read everything in that directory substructure by default, so you don't have to worry about it.

By the way, the option is called recursive, and it defaults to true.
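Something like this in inputs.conf on the forwarder, assuming the share is mounted at /mnt/share (the path and sourcetype are placeholders):

```
# inputs.conf -- one stanza covers the whole directory tree
[monitor:///mnt/share]
sourcetype = csv
recursive = true      # the default; shown here for clarity
whitelist = \.csv$    # regex on the full path; only pick up CSV files
```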

I believe this solves your issue. Let me know.

ilaila
New Member

Unless I'm missing something, that doesn't sound any different from my existing set-up.

The problem with that is that the monitor is too aggressive with the I/O operations it performs on the share, and there is also the "catch up" time I mentioned whenever the instance is rebooted.

tiagofbmm
Influencer

Yes, you are correct: the monitor stanza I'm talking about is what you called a Directory data source.

Monitor is the correct way to ingest files when you actually have the ability to use it. I'm not sure what you mean by "aggressive", though. Is it taking too long to index the data?

The "catch up" time is normal specially if you have very big files and a large number of them.

Maybe you need a second pipeline in that Universal Forwarder: https://docs.splunk.com/Documentation/Forwarder/7.0.2/Forwarder/Configureaforwardertohandlemultiplep...
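The setting behind that link is parallelIngestionPipelines; a sketch of what it looks like in server.conf on the forwarder:

```
# server.conf on the Universal Forwarder
[general]
parallelIngestionPipelines = 2
```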

tiagofbmm
Influencer

Please let me know if the answer was useful for you. If it was, accept it and upvote. If not, give us more input so we can help you further.
