I've got a metrics alert that runs every hour and sends me an email when the volume in my dhcp index is over a certain amount. The alert uses the stats command to sum MB (via | eval MB=round(kb/1024,3) ) for series="dhcp" and then uses where MB > max. This tripped yesterday, so I went to do a top by MAC and DevName in the dhcp index, and nothing came back in search for the last hour. So there are two curiosities here: why do the metrics data show indexed log data, and why doesn't the dhcp index have indexed log data during the same period?
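For reference, the alert search is essentially this (a sketch; the real threshold isn't shown here, so I'm using 100 MB as a stand-in):
index=_internal source="/opt/splunk/var/log/splunk/metrics.log*" group=per_index_thruput series="dhcp"
| eval MB=round(kb/1024,3)
| stats sum(MB) as MB by series
| where MB > 100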
Meanwhile, I checked on the indexer for a connection from the forwarder on the SSL port we use to receive logs, and it was there (ESTABLISHED). Wondering whether it was a forwarder problem (the forwarder has never been problematic before) or something wrong on the indexer, I finally decided to restart splunkd on the indexer (mainly because the admin for the forwarder was out sick). Voila, I started seeing logs in the dhcp index again.
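(For what it's worth, besides netstat I also cross-checked forwarder connections from the indexer's own metrics.log -- something like this sketch, assuming the usual group=tcpin_connections fields sourceIp and hostname:)
index=_internal source=*metrics.log* group=tcpin_connections
| stats latest(_time) as lastSeen sum(kb) as KB by sourceIp hostname
| convert ctime(lastSeen)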
This morning when I came in, I checked the dhcp index and found no logs. I ran this search for metrics (pretty much the alert search without the where clause) with the time picker set to Last 60 minutes:
index=_internal source="/opt/splunk/var/log/splunk/metrics.log*" group=per_index_thruput series="dhcp"
| rename series as index
| eval MB=round(kb/1024,3)
| stats sum(MB) as MB by index
It came back with 200 MB. Crazy weird, just like yesterday morning. And according to this search, the most recently received logs are from the current time:
| metadata type=hosts index=dhcp | search host=dhcp*
| stats count by host totalCount firstTime recentTime
| convert timeformat="%m/%d/%Y %T" ctime(recentTime) ctime(firstTime)
| rename totalCount as Count recentTime as "Last Update" firstTime as "Earliest Update"
| fieldformat Count=tostring(Count, "commas")
| fields - count
As in, when I just ran this it was 10:15, and 'Last Update' came back with 10:15:19. Yet an index=dhcp search comes back with nothing.
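(In case the mismatch is between event time and arrival time, here's a sketch of a check against index time instead, using the _index_earliest/_index_latest time modifiers:)
index=dhcp _index_earliest=-60m _index_latest=now
| stats count min(_time) as oldestEvent max(_time) as newestEvent
| convert ctime(oldestEvent) ctime(newestEvent)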
Also, looking at Settings -> Indexes, I see that the Latest Event for the dhcp index is shown as four hours ago (it's now 10:33 in the morning).
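(As a cross-check of that 'Latest Event' value, dbinspect can report the newest event time across the index's buckets -- a sketch, assuming the endEpoch field:)
| dbinspect index=dhcp
| stats max(endEpoch) as latestEventEpoch
| eval latestEvent=strftime(latestEventEpoch, "%m/%d/%Y %T")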
I took a look at splunkd.log, grepping it for the IP of the forwarder. There are some fairly recent (8:28 this morning) ClientSessionsManager action=Phonehome result=Ok checksum=0 log entries, and then there is this shortly thereafter:
10-23-2018 08:29:09.302 -0700 ERROR TcpInputProc - Error encountered for connection from src=IP_redacted:39683. Local side shutting down
I've redacted the IP address in this, but it is the IP of the forwarder.
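(The same errors can also be pulled from _internal rather than grepping splunkd.log on disk -- roughly this, relying on the log_level and component fields of sourcetype=splunkd:)
index=_internal sourcetype=splunkd component=TcpInputProc (log_level=ERROR OR log_level=WARN)
| sort - _time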
The admin for the forwarder (a Solaris 10 box, currently running 6.2.1.4) was in today, and I got a look at the logs. I see it connected to my indexer yesterday after I restarted splunkd there; then I'm seeing this in the forwarder's logs from this morning (nothing after that connection yesterday until these):
10-23-2018 08:28:19.635 -0700 INFO TcpOutputProc - Connection to 128.195.xxx.xxx:9998 closed. default Error in SSL_read = 131, SSL Error = error:00000000:lib(0):func(0):reason(0)
10-23-2018 08:28:19.799 -0700 INFO TcpOutputProc - Connected to idx=128.195.xxx.xxx:9998
10-23-2018 08:28:35.707 -0700 INFO TcpOutputProc - Connection to 128.195.xxx.xxx:9998 closed. default Error in SSL_read = 131, SSL Error = error:00000000:lib(0):func(0):reason(0)
10-23-2018 08:28:35.909 -0700 INFO TcpOutputProc - Connected to idx=128.195.xxx.xxx:9998
10-23-2018 08:29:07.386 -0700 INFO TcpOutputProc - Connection to 128.195.xxx.xxx:9998 closed. sock_error = 32. SSL Error = error:00000000:lib(0):func(0):reason(0)
10-23-2018 08:29:07.518 -0700 WARN TcpOutputProc - Applying quarantine to ip=128.195.xxx.xxx port=9998 _numberOfFailures=2
10-23-2018 08:29:54.178 -0700 INFO TcpOutputProc - Removing quarantine from idx=128.195.xxx.xxx:9998
10-23-2018 08:29:54.339 -0700 INFO TcpOutputProc - Connected to idx=128.195.xxx.xxx:9998
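(One more indexer-side check: if the forwarder is sending anything at all, its own _internal logs should be arriving too. A sketch, with the forwarder's host name as a placeholder:)
index=_internal host=<forwarder_hostname>
| stats count latest(_time) as lastReceived
| convert ctime(lastReceived)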
I asked the admin to restart splunkd on the Solaris box, and he did. I see an 'INFO PubSubSvr - Subscribed' phonehome in my splunkd logs, and I see a new connection from the forwarder on my indexer. I don't see any errors involving the forwarder's IP address in my indexer's splunkd logs. However, I still do not see any logs in the dhcp index.
Oh, wait. Another quick look at Settings -> Indexes shows the Latest Event for the dhcp index as three hours ago -- it was four the last time I looked. Switching the search on the dhcp index to All time (real time), I can see logs streaming in with timestamps from earlier this morning -- it is playing "catch up," with events coming in at ~8,000/minute.
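(You can actually watch the catch-up by comparing event time to index time -- a sketch using the default _indextime field:)
index=dhcp _index_earliest=-30m
| eval lag_min=round((_indextime - _time)/60, 1)
| timechart span=1m count avg(lag_min) as avg_lag_min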
My indexer is running RHEL 7 and:
$ cat etc/splunk.version
VERSION=6.5.2
BUILD=67571ef4b87d
PRODUCT=splunk
PLATFORM=Linux-x86_64
The forwarder had been faithfully sending me logs until this business began yesterday.
Any ideas what has been going wrong?