Splunk Search

Splunk indexes 1,000s of events from one host in the same millisecond, then receives nothing until we restart Splunk

pSull
Engager

I have a Splunk deployment monitoring a fair number of network devices. One in particular has had an issue for the past few weeks wherein Splunk shows well over a thousand events with the same _time down to the millisecond, after which the host shows no events in Splunk until we reset the service.

Some sample data follows, along with the query used to gather it. I have redacted device info from _raw beyond the device's event_id and device_time.

host=*redacted*
| eval indextime = strftime(_indextime, "%Y-%m-%d %H:%M:%S.%3N")
| eval latency = _indextime - _time
| sort device_time
| table _time, indextime, latency, _raw

ID   | _time                   | indextime               | latency   | _raw
1    | 2021-11-18 17:02:03.000 | 2021-11-18 17:02:03.000 | 0         | 235463: 236034: Nov 18 2021 17:01:42.197: <other data>
2    | 2021-11-18 17:01:57.236 | 2021-11-18 17:04:07.000 | 129.764   | 235465: 236036: Nov 18 2021 17:02:14.200: <other data>
...  |                         |                         |           |
147  | 2021-11-18 17:01:57.236 | 2021-11-18 17:22:40.000 | 1242.764  | 235607: 236178: Nov 18 2021 17:22:39.196: <other data>
148  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235609: 236180: Nov 18 2021 Nov 18 2021 17:22:40.008: <other data>
149  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235610: 236181: Nov 18 2021 Nov 18 2021 17:22:40.226: <other data>
150  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235611: 236182: Nov 18 2021 Nov 18 2021 17:22:41.099: <other data>
151  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:51.000 | 131.801   | 235612: 236183: Nov 18 2021 Nov 18 2021 17:22:54.084: <other data>
152  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:53.000 | 133.801   | 235613: 236184: Nov 18 2021 Nov 18 2021 17:23:15.428: <other data>
...  |                         |                         |           |
160  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:53.000 | 133.801   | 235621: 236192: Nov 18 2021 Nov 18 2021 17:23:26.087: <other data>
161  | 2021-11-18 17:22:39.199 | 2021-11-18 17:24:56.000 | 136.801   | 235622: 236193: Nov 18 2021 Nov 18 2021 17:23:26.087: <other data>
...  |                         |                         |           |
1329 | 2021-11-18 17:22:39.199 | 2021-11-18 21:29:24.000 | 14804.801 | 236781: 237364: Nov 18 2021 21:29:23.516: <other data>

 

Everything is working prior to ID 1, and after ID 1329 we have no data for about an hour or so until we reset Splunk.

From ID 2 through ID 147, you can see the _time value is exactly the same, while the indextime continues to increment more or less appropriately with the device_time given in _raw.

IDs 148-151 show events with the same _time and _indextime values despite properly incrementing device_time from _raw.

IDs 152-160 show the same pattern.

Then, through to the end of the sample data, there are 1,173 events that get _time = 2021-11-18 17:22:39.199, with an average gap between _indextime and device_time of 16.115s.

 

From what I can see, it looks like the host is sending perfectly fine data to Splunk, which is correctly indexing the events (_indextime) while assigning an incorrect event time (_time).
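To put numbers on that, a query along these lines compares _time and _indextime against the timestamp the device itself writes into _raw (just a rough sketch - device_time, device_epoch, parse_skew and index_lag are field names I'm making up here, and the rex assumes the counter-prefix format shown in the sample above):

host=*redacted*
| rex field=_raw "(?<device_time>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3})"
| eval device_epoch = strptime(device_time, "%b %d %Y %H:%M:%S.%3N")
| eval parse_skew = _time - device_epoch
| eval index_lag = _indextime - device_epoch
| table _time, device_time, parse_skew, index_lag, _raw

For the stuck events, parse_skew grows to hours by the end of the sample while index_lag stays in the seconds-to-minutes range, which is what makes me think indexing itself is fine and only the assigned event time is off.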

Looking around here and trying to figure out what might be going wrong, I thought there might be an issue with some time settings somewhere. We have Splunk and the host in question set to the same timezone, and the host uses NTP to keep its clock synchronized. Checking NTP, we haven't seen any issues around the time of these events.

 

We are quite open to any ideas here.


PickleRick
SplunkTrust

The first noticeable thing is that _time is different from the timestamp included in the raw event. That means the _time value is not parsed from the log but instead - as a last resort - is created by the ingesting forwarder from the local server's timestamp.

You have to configure proper time parsing for your sourcetype.
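A minimal sketch of what that could look like in props.conf - the sourcetype name is just a placeholder for whatever this feed is actually stamped with:

# props.conf on the first Splunk component that parses this feed
[cisco:syslog]
# skip the two leading counters, e.g. "235463: 236034: "
TIME_PREFIX = ^\d+:\s+\d+:\s+
# matches "Nov 18 2021 17:01:42.197"
TIME_FORMAT = %b %d %Y %H:%M:%S.%3N
MAX_TIMESTAMP_LOOKAHEAD = 40

Also note that your later sample events repeat the date twice ("Nov 18 2021 Nov 18 2021 17:22:40.008"); a strict TIME_FORMAT like this won't match those lines, and that duplication may well be what trips up timestamp recognition in the first place.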

Second - the file(s?) is indeed read in batches - the same timestamp (luckily for debugging purposes, the timestamps are local to the forwarder) means that events were read from the file or other input at the same time.

I don't see any info about what kind of input you are using for this, so it's hard to say why that is.

EDIT: There is of course also the possibility that _time is explicitly created by another input component (for example, a source sending events with an explicitly set timestamp via HEC, or a modular input).


pSull
Engager

The data is all syslog data coming directly from networking devices. There is a line in the device's configuration setting up logging directly to Splunk over TCP.

My understanding is that everything that will show in the terminal on the device is sent directly as-is to our Splunk deployment.


PickleRick
SplunkTrust

This is not an RFC-compliant syslog format, for a start.

And it's not advisable to receive syslog directly on Splunk components. You should rather use an intermediate solution (SC4S, rsyslog) to receive the syslog data and send it on to Splunk.


richgalloway
SplunkTrust

What are the inputs.conf and props.conf settings for the problematic device?

What is the data path from device to Splunk?  Is it syslog, universal forwarder, heavy forwarder, SC4S?

When you say "reset Splunk" what exactly are you doing and to which instance(s) are you doing it?

---
If this reply helps you, Karma would be appreciated.

pSull
Engager

There is nothing in either inputs.conf or props.conf specifically for the device. The device has two lines in its configuration that direct logging output straight to our Splunk deployment.
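For context, I'd guess the receiving side would need something like the following if we made it explicit - a rough sketch, with the port number and sourcetype name as placeholders rather than anything we actually have configured:

# inputs.conf - dedicated TCP port for this device's syslog feed (placeholder values)
[tcp://9514]
sourcetype = cisco:syslog
connection_host = ip

props.conf could then key on that sourcetype for the time-format settings suggested above.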

When I say "reset Splunk" we SSH into our server and systemctl restart splunk
