I've got a metrics alert that runs every hour and sends me an email when the volume in my dhcp index is over a certain amount. The alert uses the stats command to sum MB (via
| eval MB=round(kb/1024,3)) for series="dhcp" and then uses
where MB > max. This tripped yesterday so I went to do a top by MAC and DevName in the dhcp index and nothing returned in search for the last hour. So there are two curiosities here:
Meanwhile, I checked for a connection on the indexer from the forwarder on the SSL port we use to take logs, and it was there (established). Wondering whether it was a forwarder problem (which has never been problematic before) or something wrong on the indexer, I finally decided to restart splunkd on the indexer (mainly because the admin for the forwarder was out sick). Voila, I started seeing logs in the dhcp index again.
This morning when I came in, I checked the dhcp index and no logs. I ran this search for metrics (pretty much the alert search without the where clause) with timepicker set to Last 60 minutes:
index=_internal source="/opt/splunk/var/log/splunk/metrics.log*" group=per_index_thruput series="dhcp" | rename series as index | eval MB=round(kb/1024,3) | stats sum(MB) as MB by index
It came back with 200 MB. Crazy weird just like yesterday morning. According to this search the most recently received logs are the current time:
| metadata type=hosts index=dhcp | search host=dhcp*
| stats count by host totalCount firstTime recentTime
| convert timeformat="%m/%d/%Y %T" ctime(recentTime) ctime(firstTime)
| rename totalCount as Count recentTime as "Last Update" firstTime as "Earliest Update"
| fieldformat Count=tostring(Count, "commas")
| fields - count
As in when I just ran this it was 10:15 and 'Last Update' came back with 10:15:19. Yet
index=dhcp search comes back with nothing.
Also, looking at Settings->indexes, I see that the Latest Event for the dhcp index is shown as four hours ago (now 10:33 in the morning).
I took a look at splunkd.log, greping it for the IP of the forwarder. There are some fairly recent (8:28 this morning) ClientSessionsManager action=Phonehome result=Ok checksum=0 log entries, and then there is this shorthereafterater:
10-23-2018 08:29:09.302 -0700 ERROR TcpInputProc - Error encountered for connection from src=IP_redacted:39683. Local side shutting down
I've redacted the IP address in this, but it is the IP of the forwarder.
The admin for the forwarder (a solaris 10 box running 22.214.171.124 currently) was in today and I got a look at the logs. I see it connected to my indexer yesterday after I restarted splunkd there, then I'm seeing this in the forwarder's logs from this morning (nothing after that connection yesterday until these):
10-23-2018 08:28:19.635 -0700 INFO TcpOutputProc - Connection to 128.195.xxx.xxx:9998 closed. default Error in SSL_read = 131, SSL Error = error:00000000:lib(0):func(0):reason(0) 10-23-2018 08:28:19.799 -0700 INFO TcpOutputProc - Connected to idx=128.195.xxx.xxx:9998 10-23-2018 08:28:35.707 -0700 INFO TcpOutputProc - Connection to 128.195.xxx.xxx:9998 closed. default Error in SSL_read = 131, SSL Error = error:00000000:lib(0):func(0):reason(0) 10-23-2018 08:28:35.909 -0700 INFO TcpOutputProc - Connected to idx=128.195.xxx.xxx:9998 10-23-2018 08:29:07.386 -0700 INFO TcpOutputProc - Connection to 128.195.xxx.xxx:9998 closed. sock_error = 32. SSL Error = error:00000000:lib(0):func(0):reason(0) 10-23-2018 08:29:07.518 -0700 WARN TcpOutputProc - Applying quarantine to ip=128.195.xxx.xxx port=9998 _numberOfFailures=2 10-23-2018 08:29:54.178 -0700 INFO TcpOutputProc - Removing quarantine from idx=128.195.xxx.xxx:9998 10-23-2018 08:29:54.339 -0700 INFO TcpOutputProc - Connected to idx=128.195.xxx.xxx:9998
I asked the admin to restart splunkd on the solaris box, and he did. I see a 'INFO PubSubSvr - Subscribed' phonehome in my splunkd logs and I see a new connection from the forwarder on my indexer. I don't see any errors involving the forwarder's IP address in my indexer's splunkd logs. However, I still do not see any logs in the dhcp index.
Oh, wait. Another quick look at the Settings->Indexes shows the Latest Event for the dhcp index as three hours ago -- it was four the last time I looked. Switching the search on the dhcp index to All time (real time) I can see logs streaming in with timestamps from earlier this morning -- it is playing "catch up," with events coming in ~8,000/minute.
My indexer is running RHEL 7 and:
$ cat etc/splunk.version
The forwarder has been faithfully sending me logs until this business began yesterday.
Any ideas what has been going wrong?
More trouble last night. After working well all day, you see this in the logs (on the indexer/deployment server):
10-23-2018 11:01:07.879 -0700 ERROR TcpInputProc - Error encountered for connection from src=128.200.xx.xxx:35411. Local side shutting down
10-23-2018 11:01:21.860 -0700 ERROR TcpInputProc - Error encountered for connection from src=128.200.xx.xxx:43490. Local side shutting down
10-23-2018 11:01:40.685 -0700 ERROR TcpInputProc - Error encountered for connection from src=128.200.xx.xxx:43684. Local side shutting down
10-23-2018 11:01:40.803 -0700 ERROR TcpInputProc - Error encountered for connection from src=128.200.xx.xxx:43946. Local side shutting down
10-23-2018 11:02:10.822 -0700 INFO ClientSessionsManager - Adding client: ip=128.200.xx.xxx uts=sunos-sparc id=118563b1987538f45848645415b48e19 name=ECB32768-B062-47DC-B652-34D79B6B2B45
10-23-2018 11:02:10.822 -0700 INFO ClientSessionsManager - ip=128.200.xx.xxx name=ECB32768-B062-47DC-B652-34D79B6B2B45 New record for sc=OIT_SC_GLOBAL_CONF-FILES app=OIT_DA_GLOBAL_CONF-FILES: action=Phonehome result=Ok checksum=0
10-23-2018 11:02:10.822 -0700 INFO ClientSessionsManager - ip=128.200.xx.xxx name=ECB32768-B062-47DC-B652-34D79B6B2B45 New record for sc=OIT_NSP_INDEX_01 app=OIT_NSP_INDEX_DHCP_01: action=Phonehome result=Ok checksum=0
10-23-2018 11:02:10.822 -0700 INFO ClientSessionsManager - ip=128.200.xx.xxx name=ECB32768-B062-47DC-B652-34D79B6B2B45 New record for sc=OIT_NSP_INDEX_01 app=OIT_NSP_INDEX_IDM_01: action=Phonehome result=Ok checksum=0
10-23-2018 11:02:10.823 -0700 INFO ClientSessionsManager - ip=128.200.xx.xxx name=ECB32768-B062-47DC-B652-34D79B6B2B45 New record for sc=OIT_NSP_INDEX_01 app=OIT_NSP_INDEX_LDAP_01: action=Phonehome result=Ok checksum=0
I checked the file handles thing after reading this and that all looks okay. Thanks for the suggestion.
Here is another interesting one (like from that one): https://answers.splunk.com/answers/43285/error-tcpinputproc-error-encountered-for-connection-from-ti...
I was rummaging around Splunk answers looking for information about the heartbeat function that forwarders use. in the Splunk Answers topic What does the heartbeatFrequency setting do in outputs.conf? rphillips_splunk states:
1. How often (in seconds) to send a heartbeat packet to the receiving server.
2. Heartbeats are only sent if sendCookedData=true.
3. Defaults to 30 (seconds).
Heartbeat is a mechanism for the forwarder to know that the receiver (i.e. indexer) is alive. If the indexer does not send a return packet to the forwarder, the forwarder will declare this receiver unreachable and not forward data to it. By default a packet is sent every 30s.
I have been under the impression that the indexer does not need to communicate with forwarders except on port 8089 to send Deployment App bundles. I'm wondering what port the indexer uses to talk to the forwarder to convey this return packet rphillips_splunk mentions.
sendCookedData is a parameter in the tcpout stanza in outputs.conf. It is by default true, so it is set for all newly installed forwarders by default. I'm wondering what the effect would be on the forwarder if a firewall rule was blocking the indexer from getting the return packet to the forwarder after receiving the heartbeat transmission.