Hi All;
Noticed something very interesting, and I can't seem to find the smoking gun. Today I was alerted by one of my users that he saw some duplicate events, most recently at 2:20 PM today. I took a look, and sure enough I was able to find a duplicated event. I narrowed it down with the following search:
index=app_harmony sourcetype=harmony:*:access "[04/Jan/2017:14:20*" host=cimasked0047 | eval idxtime=_indextime | table _time idxtime host source splunk_server _raw
The results were as follows:
_time idxtime host source splunk_server _raw
2017-01-04 14:20:02 1483557602 hostname /usr/local/openresty/nginx/logs/access.log splunkindex0006 xx.xx.xxx.xxx - - [04/Jan/2017:14:20:02 -0500] "GET /api/harmony/v1/User/G_IAS_6e680df53be6a06f9e11faf40812dc8c?domain=internal HTTP/1.1" 200 317 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36" "-"
2017-01-04 14:20:02 1483557602 hostname /usr/local/openresty/nginx/logs/access.log splunkindex0009 xx.xx.xx.xxx - - [04/Jan/2017:14:20:02 -0500] "GET /api/harmony/v1/User/G_IAS_6e680df53be6a06f9e11faf40812dc8c?domain=internal HTTP/1.1" 200 317 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36" "-"
The raw event is exactly the same, as is the index time. The only difference between the two events is the indexer. We have a cluster of 5 indexers with an RF of 3.
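One way to confirm these are true duplicate indexed copies (rather than a search-time artifact) is to group identical events and count distinct indexers. A sketch, using the same fields as the table above:

```
index=app_harmony sourcetype=harmony:*:access host=cimasked0047 "[04/Jan/2017:14:20*"
| stats count dc(splunk_server) AS num_indexers values(splunk_server) AS indexers BY _raw _indextime
| where count > 1
```

If the same _raw with the same _indextime shows up with num_indexers > 1, two separate copies were indexed as searchable primaries, which points at the forwarding layer rather than at replication or search.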
I double checked the file, and there isn't more than 1 event at that time.
I also verified that I don't see any stray '.filepart' or temp files appearing.
The input for that particular file is pretty strict, as follows:
[monitor:///usr/local/openresty/nginx/logs/access.log]
index = app_harmony
sourcetype = harmony:openresty:access
I took a look at the internal logs, and didn't see any errors around that time.
Actually, you have an unnecessary second tcpout stanza in there. I don't know if that's the problem, but you should remove one of them.
Do you use a master server? If so, change your tcpout stanzas to something like this (plus the other settings you'd like to have):
[tcpout]
defaultGroup = companyPVSNewIndexers
indexAndForward = false

[tcpout:companyPVSNewIndexers]
indexerDiscovery = YourIndexerCluster
# additional SSL settings are required here when SSL is enabled

[indexer_discovery:YourIndexerCluster]
# pass4SymmKey must match the key defined on the master
pass4SymmKey = YourKey
master_uri = https://YourMasterServer:8089
If you don't use a master (which I don't think is recommended), just merge both tcpout stanzas into one. I'm not sure about the autoLB settings, though.
Edit: typo
Oh, very interesting! We do have a cluster master. We actually had a consultant set up our outputs.conf that way. I wonder why he did it that way instead of relying on the CM to delegate the indexers.
I think that was the issue: the duplicate stanza! Thanks @skalliger for pointing that out! I sent a note to the admins to promote this as the correct answer.
Glad it helped you! Just wondering why you reported my comment though, haha.
Because you hurt my feelings with your answer. Lol, no, I reported it as 'other' to get the mods' attention so they'd promote it to an 'accepted answer'.
Ah, alright. I get way too many Splunk Answers notifications, haha. I converted it. 🙂
You rock, thanks again.
I've had a number of issues with indexer discovery, and I prefer not to use it now!
Your duplicate tcpout stanza might be an issue, though... that's an easy fix.
FYI, I have tried running the forwarder with useACK=false, with no success. Any other ideas would be greatly appreciated, as this is still happening for the index 😞
So just to confirm: the same source, sourcetype, index, and host is producing duplicate log entries?
If you count the number of lines per hour (or similar), does that add up to more than the number of lines in the source file on the server?
If it is consistently duplicated, then it would be worth checking the configuration of your outputs.conf...
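For the file-side count, note that `grep -c` counts matching lines (while `grep -o ... | wc -l` counts every occurrence, which can overcount). A minimal sketch with a hypothetical sample file standing in for the real access.log:

```shell
# Hypothetical sample log; substitute your real access.log path and timestamp prefix.
printf '%s\n' \
  'x.x.x.x - - [13/Jan/2017:08:00:01 -0500] "GET /a HTTP/1.1" 200 10' \
  'x.x.x.x - - [13/Jan/2017:08:59:59 -0500] "GET /b HTTP/1.1" 200 10' \
  'x.x.x.x - - [13/Jan/2017:09:00:00 -0500] "GET /c HTTP/1.1" 200 10' \
  > /tmp/access_sample.log

# -c reports the number of matching lines, which maps one-to-one to events
grep -c '13/Jan/2017:08:' /tmp/access_sample.log   # 2 lines match
```

That line count is what should be compared against the Splunk event count for the same hour.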
So here is what I got:
In Splunk: index=app_harmony source=/usr/local/openresty/nginx/logs/access.log "13/Jan/2017:08:"
events = 4678
Grepping the actual log file: less access.log | grep -o '13/Jan/2017:08' | wc -l
lines = 1080
If you run a dedup in Splunk like so:
index=app_harmony source=/usr/local/openresty/nginx/logs/access.log "13/Jan/2017:08:" | dedup _raw
Do you see far fewer events? The numbers are strange; you have well over double the number of events in Splunk. Also, I notice you did not narrow down to a host. Are there multiple hosts that this log comes from, or just one?
Ah jeez!! You are 100% correct, sorry; retried:
[splkadmn@cmasked0083 logs]$ less access.log | grep "13/Jan/2017:08:" | wc -l
1140
index=app_harmony host=cmasked0083 source=/usr/local/openresty/nginx/logs/access.log "13/Jan/2017:08:"
1,959 events
index=app_harmony host=cmasked0083 source=/usr/local/openresty/nginx/logs/access.log "13/Jan/2017:08:" | dedup _raw
1,140 events
Interesting. During the time period where you have the duplicates, are there any warning/error messages from splunkd? CRC checksum or duplication errors in particular...
I wish there were, but nothing in the splunkd on those forwarders 😞
Does the forwarder have useACK=true in its outputs.conf?
One of the known side effects of this setting can be data duplication in rare cases.
Search for this message in the _internal splunkd.log on the forwarder:
WARN TcpOutputProc - Possible duplication of events with
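From a search head, the same warning can be checked for across _internal. A sketch; substitute your actual forwarder host name for the placeholder:

```
index=_internal sourcetype=splunkd log_level=WARN host=<your_forwarder>
    "Possible duplication of events"
```

Widening the time range here (e.g. the full week) helps rule out duplication warnings that happened outside the window you first looked at.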
I took another look and did notice that I do have 'useACK' set to true. This is a default on all of our forwarders:
[tcpout]
defaultGroup = companyPVSNewIndexers
# These two options below are required for forwarders when clustering.
# Max queue size ensures that the forwarder has enough of a buffer while
# waiting for the ACK from the indexer; without useACK, the search head
# will spout yellow warning banners in a clustered environment.
maxQueueSize = 7MB
useACK = true
Should I try running without useACK for this one forwarder?
Maybe it's also worth checking the server settings in outputs.conf, because if they are wrong, this can end in duplicated events as well.
So here is the whole outputs file. To save space in this post, I have removed all the settings that were commented out.
# Company BASE SETTINGS
[tcpout]
defaultGroup = companyPVSNewIndexers
# These two options below are required for forwarders when clustering.
# Max queue size ensures that the forwarder has enough of a buffer while
# waiting for the ACK from the indexer; without useACK, the search head
# will spout yellow warning banners in a clustered environment.
maxQueueSize = 7MB
useACK = true
[tcpout]
defaultGroup = companyPVSNewIndexers
indexAndForward = false
# When indexing a large continuous file that grows very large, a universal
# or light forwarder may become "stuck" on one indexer, trying to reach
# EOF before being able to switch to another indexer. The symptoms of this
# are congestion on *one* indexer in the pool while others seem idle, and
# possibly uneven loading of the disk usage for the target index.
# In this instance, forceTimebasedAutoLB can help!
# ** Do not enable if you have events > 64kB **
forceTimebasedAutoLB = true
# Correct an issue with the default outputs.conf for the Universal Forwarder
# or the SplunkLightForwarder app; these don't forward _internal events.
forwardedindex.2.whitelist = (_audit|_introspection|_internal)
[tcpout:companyPVSNewIndexers]
server = masked6.com:9997, masked7.com:9997, masked8.com:9997, masked9.com:9997, masked10.com:9997
autoLB = true
autoLBFrequency = 40
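Since the file above has two [tcpout] stanzas (the likely culprit discussed earlier in the thread), a merged single-stanza version might look like this. A sketch only, carrying over just the values already shown above:

```
[tcpout]
defaultGroup = companyPVSNewIndexers
indexAndForward = false
maxQueueSize = 7MB
useACK = true
forceTimebasedAutoLB = true
forwardedindex.2.whitelist = (_audit|_introspection|_internal)

[tcpout:companyPVSNewIndexers]
server = masked6.com:9997, masked7.com:9997, masked8.com:9997, masked9.com:9997, masked10.com:9997
autoLB = true
autoLBFrequency = 40
```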
I checked the servers, and they are correct.
Thanks for the quick response. I checked for that yesterday around the time of the duplication and didn't see anything. I also tried a week-long search to verify.