I would like to ask you something, and I'm hoping you can help me with this.
I'm sending data from one universal forwarder to two environments (one standalone instance and one cluster).
For some reason, when I have 100 events on the standalone instance for a given time window (earliest=-3m@m latest=-m@m), the cluster shows only 80, and only some minutes later do I get all 100 events on the cluster.
Why could this be happening? The source types are the same, and I wasn't able to find any errors in _internal.
Thank you for your time, and I hope you can help me.
So, here are the steps I would go through to triage this one...
1) Identify the latency on each side.
(your search that gets the data in question) | eval latency = _indextime - _time | stats count as event_count avg(latency) as latency_avg stdev(latency) as latency_stdev by splunk_server
Look at the patterns in the above. You can also replace the stats with a timechart, using min or max latency, to see what it tells you about each machine's performance...
| timechart span=30s max(latency) by splunk_server
Now, the above should make one of two things obvious: either one or more of the machines is slow to index the data, showing a high level of latency, or there is no such latency.
If there is high latency, then investigate what is slowing down the machine(s) in question. Maybe they are CPU bound, or I/O bound, or virtual machines fighting for resources (either oversubscribed or overspecified, either of which can cause performance problems).
If there is no high latency, then look for network issues. That could be slow replication, slow transmission, inability to reach a particular server, and so on.
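One place you can look at the network picture without leaving Splunk is the _internal index, which carries the indexers' metrics.log. A sketch (assuming internal logging is on its defaults; the group and field names here come from Splunk's metrics.log and are worth double-checking against your version):

index=_internal source=*metrics.log* group=tcpin_connections | timechart span=1m sum(kb) as kb_received by sourceIp

Run that against each indexer in turn. If one indexer shows the forwarder's traffic arriving late, in bursts, or dropping out entirely for stretches, that points at the network path (or the forwarder's load balancing) rather than indexing speed.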
If none of this pans out, then get onto the Splunk Slack channel linked here -- https://answers.splunk.com/answers/443734/is-there-a-splunk-slack-channel.html -- (I'm linking the Answers post rather than the invite directly, because that post will be kept up to date) -- then join the #index_clustering subchannel and ask the question there, so you know what to investigate next.
Please let us know how it turns out.
Hello DalJeanis, and thanks for the reply.
I have run those searches and, as I said, there is no significant latency; but looking at the event count distribution, indexer 1 has double the events of indexer 2.
If this is a network issue, is there a way to check it from Splunk before asking the administrator to look into it?
1) Is that proportion consistent over time?
2) I assume you mean "no significant latency", because there is ALWAYS latency. If the latency is zero, then your data is getting the wrong timestamp.
3) If you run the same search verbose with fixed earliest and latest times, then run it again later, you can compare the two results against each other and see which events were delayed. Might be some information there.
4) It sounds like you have only two indexers in that cluster. What are your search factor and replication factor (SF/RF)?
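For question 1, a quick way to check whether that 2:1 split is stable (a sketch, reusing the same base search from earlier):

(your search that gets the data in question) | timechart span=5m count by splunk_server

If the proportion holds steady across every span, that's more likely a load-balancing pattern from the single forwarder than a fault; a single forwarder switching targets on a timer can easily produce an uneven split over short windows. If the split swings wildly, or one indexer goes to zero for stretches, that's worth chasing as a connectivity problem.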