Splunk Search

Splunk indexes 1,000s of events from one host in the same millisecond, then receives nothing until we restart Splunk

pSull
Engager

I have a Splunk deployment monitoring a fair number of network devices. One in particular has had an issue for the past few weeks wherein Splunk shows well over a thousand events with the same _time down to the millisecond, after which the host shows no events in Splunk until we reset the service.

Some sample data follows, along with the query used to gather it. I have redacted device info from _raw beyond the device's event_id and device_time.

host=*redacted*
| eval indextime = strftime(_indextime, "%Y-%m-%d %H:%M:%S.%3N")
| eval latency = _indextime - _time
| sort device_time
| table _time, indextime, latency, _raw

ID   | _time                   | indextime               | latency   | _raw
1    | 2021-11-18 17:02:03.000 | 2021-11-18 17:02:03.000 | 0         | 235463: 236034: Nov 18 2021 17:01:42.197: <other data>
2    | 2021-11-18 17:01:57.236 | 2021-11-18 17:04:07.000 | 129.764   | 235465: 236036: Nov 18 2021 17:02:14.200: <other data>
...  |                         |                         |           |
147  | 2021-11-18 17:01:57.236 | 2021-11-18 17:22:40.000 | 1242.764  | 235607: 236178: Nov 18 2021 17:22:39.196: <other data>
148  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235609: 236180: Nov 18 2021 Nov 18 2021 17:22:40.008: <other data>
149  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235610: 236181: Nov 18 2021 Nov 18 2021 17:22:40.226: <other data>
150  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235611: 236182: Nov 18 2021 Nov 18 2021 17:22:41.099: <other data>
151  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235612: 236183: Nov 18 2021 Nov 18 2021 17:22:54.084: <other data>
152  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:53.000 | 133.801   | 235613: 236184: Nov 18 2021 Nov 18 2021 17:23:15.428: <other data>
...  |                         |                         |           |
160  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:53.000 | 133.801   | 235621: 236192: Nov 18 2021 Nov 18 2021 17:23:26.087: <other data>
161  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:56.000 | 136.801   | 235622: 236193: Nov 18 2021 Nov 18 2021 17:23:26.087: <other data>
...  |                         |                         |           |
1329 | 2021-11-18 17:22:39.199 | 2021-11-18 21:29:24.000 | 14804.801 | 236781: 237364: Nov 18 2021 21:29:23.516: <other data>

 

Everything is working prior to ID 1, and after ID 1329 we have no data for about an hour or so until we reset Splunk.

From ID 2 through ID 147, you can see the _time value is exactly the same, while the indextime continues to increment more or less appropriately with the device_time given in _raw.

IDs 148-151 show events with the same _time and _indextime values despite properly incrementing device_time from _raw.

IDs 152-160 show the same pattern.

Then, through to the end of the sample data, there are 1,173 events that get _time = 2021-11-18 17:22:39.199, with an average gap between _indextime and device_time of 16.115s.

 

From what I can see, it looks like the host is sending perfectly fine data to Splunk, which is correctly indexing the events (_indextime) while assigning an incorrect event time (_time).
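To put numbers on that, a query along these lines compares _time and _indextime against the timestamp the device itself writes into _raw (just a rough sketch - device_time, device_epoch, parse_skew and index_lag are field names I'm making up here, and the rex assumes the counter-prefix format shown in the sample above):

host=*redacted*
| rex field=_raw "(?<device_time>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3})"
| eval device_epoch = strptime(device_time, "%b %d %Y %H:%M:%S.%3N")
| eval parse_skew = _time - device_epoch
| eval index_lag = _indextime - device_epoch
| table _time, device_time, parse_skew, index_lag, _raw

For the stuck events, parse_skew grows to hours by the end of the sample while index_lag stays in the seconds-to-minutes range, which is what makes me think indexing itself is fine and only the assigned event time is off.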

Looking around here and trying to figure out what might be going wrong, I thought there might be an issue with some time settings somewhere. We have Splunk and the host in question set to the same timezone, and the host uses NTP to keep its clock synchronized. Checking NTP, we haven't seen any issues around the time of these events.

 

We are quite open to any ideas here.


PickleRick
SplunkTrust

The first noticeable thing is that _time is different from the timestamp included in the raw event. That means the _time value is not parsed from the log but instead - as a last resort - is created by the ingesting forwarder from the local server's timestamp.

You have to configure proper time parsing for your sourcetype.
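A minimal sketch of what that could look like in props.conf - the sourcetype name is just a placeholder for whatever this feed is actually stamped with:

# props.conf on the first Splunk component that parses this feed
[cisco:syslog]
# skip the two leading counters, e.g. "235463: 236034: "
TIME_PREFIX = ^\d+:\s+\d+:\s+
# matches "Nov 18 2021 17:01:42.197"
TIME_FORMAT = %b %d %Y %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 40

Also note that your later sample events repeat the date twice ("Nov 18 2021 Nov 18 2021 17:22:40.008"); a strict TIME_FORMAT like this won't match those lines, and that duplication may well be what trips up timestamp recognition in the first place.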

Second - the file(s?) is indeed read in batches - the same timestamp (luckily for debugging purposes, the timestamps are local to the forwarder) means that events were read from the file or other input at the same time.

I don't see any info about what kind of input you are using for this, so it's hard to say why that is.

EDIT: There is of course also the possibility that _time is explicitly created by another input component (for example, a source sending events with an explicitly set timestamp via HEC, or a modular input).


pSull
Engager

The data is all syslog data coming directly from networking devices. There is a line in the device's configuration setting up logging directly to Splunk over TCP.

My understanding is that everything that will show in the terminal on the device is sent directly as-is to our Splunk deployment.


PickleRick
SplunkTrust

This is not an RFC-compliant syslog format, for a start.

And it's not advisable to receive syslog directly on Splunk components. You should rather use an intermediate solution (SC4S, rsyslog) to receive the syslog data and send it on to Splunk.


richgalloway
SplunkTrust

What are the inputs.conf and props.conf settings for the problematic device?

What is the data path from device to Splunk?  Is it syslog, universal forwarder, heavy forwarder, SC4S?

When you say "reset Splunk" what exactly are you doing and to which instance(s) are you doing it?

---
If this reply helps you, Karma would be appreciated.

pSull
Engager

There is nothing in either inputs.conf or props.conf specifically for the device. The device has two lines in its configuration that direct logging output straight to our Splunk deployment.
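For context, I'd guess the receiving side would need something like the following if we made it explicit - a rough sketch, with the port number and sourcetype name as placeholders rather than anything we actually have configured:

# inputs.conf - dedicated TCP port for this device's syslog feed (placeholder values)
[tcp://9514]
sourcetype = cisco:syslog
connection_host = ip

props.conf could then key on that sourcetype for the time-format settings suggested above.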

When I say "reset Splunk" we SSH into our server and systemctl restart splunk
