I am developing a Splunk app and wanted to hear from someone what is considered best practice when it comes to sending events to Splunk to be processed and indexed.
Basically, I am concerned that sending events into Splunk as soon as they are available would take a toll on the indexer, because there will be a constant flow of data every few seconds. On the other hand, waiting for all the data to come in before indexing is not an option, because events could keep arriving for days and I can't wait that long to see the data in the system. My best guess is to cap the number of events indexed at a time: for example, I would wait for 10000 events to accumulate and then send them into Splunk for processing. Could someone offer advice on this?
You're overthinking it. Splunk indexers are specifically designed to handle a constant stream of incoming data. In fact, if for some reason the indexer slows down, the forwarders will simply queue up the data until it catches up.
How are you sending the data to Splunk? I ask because in normal usage of Splunk you should never have to worry about this topic.
If I remember correctly, the indexing process uses about one CPU core, so the other cores are available for searching that data. If that throughput is insufficient, increase the number of indexing pipelines (which uses more cores) or add more indexers (for better data distribution).
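For reference, the pipeline count mentioned above is controlled in server.conf on the indexer. A minimal sketch, assuming you're on a Splunk Enterprise version that supports parallel ingestion pipelines and the host has spare cores:

```
# server.conf on the indexer -- sketch only; raise this only if you have idle cores
[general]
parallelIngestionPipelines = 2
```

Each additional pipeline consumes roughly another core's worth of CPU, so check capacity before increasing it.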
Also, won't users be misled if they run reports on the data without realizing that it's incomplete because a batch hasn't been sent yet?
I'm not sure if you can tell, but I'm very concerned by the question. I am confident that any means of modulating the data flow will provide a terrible experience with the Splunk platform.
Respond back with more info and I'm happy to answer other concerns about this.
I don't know about best practices in this area, but IMO, data should be indexed as soon as it's available. Splunk can't act on data it doesn't have.
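If you're pushing events from your app yourself rather than going through a forwarder, the natural way to do "index as soon as it's available" is the HTTP Event Collector. A minimal sketch, assuming HEC is enabled and using hypothetical host and token values:

```python
import json
import urllib.request

def build_hec_payload(event, sourcetype="my_app:events"):
    """Wrap a raw event in the envelope the HEC /services/collector/event endpoint expects."""
    return {"event": event, "sourcetype": sourcetype}

def send_to_hec(events, host="splunk.example.com", token="HYPOTHETICAL-TOKEN"):
    """POST events as they become available -- no artificial batching or caps.

    host and token are placeholders; substitute your own HEC endpoint and token.
    Multiple events can go in one request as newline-separated JSON objects.
    """
    body = "\n".join(json.dumps(build_hec_payload(e)) for e in events)
    req = urllib.request.Request(
        f"https://{host}:8088/services/collector/event",
        data=body.encode("utf-8"),
        headers={"Authorization": f"Splunk {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Called with whatever handful of events is ready at the moment, e.g. `send_to_hec([{"msg": "user logged in"}])`, this lets the indexer see data within seconds instead of waiting for a 10000-event batch.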