I have a network share folder with a huge number of directories and .csv files. Files are constantly being added and periodically removed for archiving. When I created a Directory data source for this share, I found that Splunk opens a lot of file handles as it tries to watch the files and monitor for changes. After I restart my instance, it also takes a very long time for monitoring to start working again; I was told this is because it needs to "catch up" on any missed changes by scanning the entire share.
So I tried using Splunk's HEC to send my .csv files. If I send each row of the csv as an individual event (sourcetype = _json), the overhead of repeating the csv headers for every row quickly builds up and unnecessarily burns through the license limit. These csvs are often very large and have more than 40 headers, and the headers are not very static either. It does work, though.
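To make the per-row overhead concrete, here is a minimal sketch of how each csv row could be turned into one HEC event payload. The function names and the sample data are hypothetical, and nothing is actually sent to Splunk; it only compares the size of the original csv text with the size of the JSON body that would be posted.

```python
import csv
import io
import json


def csv_to_hec_events(csv_text, sourcetype="_json"):
    """Turn each CSV row into one HEC-style event payload (a dict).

    Every payload repeats all header names as JSON keys.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"sourcetype": sourcetype, "event": row} for row in reader]


def payload_size(events):
    """Size in bytes of the events serialized as newline-separated JSON."""
    body = "\n".join(json.dumps(e) for e in events)
    return len(body.encode("utf-8"))


# Hypothetical sample data, not taken from the real share.
sample = "host,status,latency_ms\nweb01,200,12\nweb02,500,340\n"
events = csv_to_hec_events(sample)

print(events[0]["event"])
print(len(sample.encode("utf-8")), "bytes as csv")
print(payload_size(events), "bytes as per-row JSON events")
```

With 40 or more headers the repeated keys dominate the payload, which matches the overhead described above.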
If I send the raw string content of the csv file as a single event (sourcetype = csv), it isn't interpreted correctly (the fields aren't detected). I'm not even certain this is supposed to work in the first place.
So both the Directory data source and HEC seem to be inefficient for my scenario. Are there any other options I can try (preferably out of the box or an official app)? Or perhaps tweaks to the above methods (preferably not undocumented settings)?
What is the problem with putting one or more monitor stanzas over those files in that directory?
Do you mean a stanza for each folder? The folders that are added have random guid names and they get removed regularly as well.
No, just put a monitor on that folder and Splunk will read everything in that directory substructure by default and you don't have to worry about it.
By the way the option is called recursive and the default is true.
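As a rough sketch, such a stanza in inputs.conf might look like the following. The mount path, index and sourcetype are placeholders, not values from this thread, and `recursive = true` is spelled out only for clarity since it is already the default.

```
# inputs.conf (sketch; path, index and sourcetype are placeholders)
[monitor:///mnt/share/csv_drop]
# Descend into all subdirectories (this is the default)
recursive = true
# Only pick up files whose path ends in .csv
whitelist = \.csv$
sourcetype = csv
index = main
```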
I believe this solves your issue. Let me know.
Unless I'm missing something, that doesn't sound any different from my existing set-up.
The problem with that is that the monitor is too aggressive with the IO operations it performs on the share. There is also the "catch up" time that I mentioned when the instance is rebooted.
Yes, you are correct: the monitor stanza I'm talking about is what you called the Directory data source.
Monitor is the correct way to ingest files when you actually have the ability to use it. I'm not sure what you mean by "aggressive". Is it taking too long to index data?
The "catch up" time is normal, especially if you have very big files and a large number of them.
Maybe you need a second pipeline in that Universal Forwarder: https://docs.splunk.com/Documentation/Forwarder/7.0.2/Forwarder/Configureaforwardertohandlemultiplep...
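For reference, a second pipeline is normally enabled with the `parallelIngestionPipelines` setting in the forwarder's server.conf. A minimal sketch, assuming the forwarder host has spare CPU and memory; the value 2 is only an example:

```
# server.conf on the forwarder (sketch; the value is an example)
[general]
parallelIngestionPipelines = 2
```

The forwarder has to be restarted for the change to take effect.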
Please let me know if the answer was useful for you. If it was, accept it and upvote. If not, give us more input so we can help you with that.