Splunk Search

Determining Logging Lag (and Device Feed Monitoring)

rshoward
Path Finder

How do you track log and index lag with little overhead? Per device would be awesome and maybe throw in some kind of trending and graphs while we are at it.

1 Solution

Lowell
Super Champion

In Splunk 6+ you can fairly efficiently measure per-event index delays using tstats against a specific index and/or sourcetype using a search like this:

| tstats count as events WHERE index=perfmon sourcetype="Perfmon:Processor" BY _time,_indextime span=1s | eval indexlag=_indextime-_time | bucket indexlag span=log | chart sum(events) by indexlag

Of course you can group by all the standard fields, host/source/sourcetype, to break this information down however you want.

View solution in original post

mattymo
Splunk Employee
Splunk Employee

Meta Woot all day for this one, gets the basics done for you and you can enrich nicely...

https://splunkbase.splunk.com/app/2949/

shout out to the DI team

- MattyMo

Lowell
Super Champion

In Splunk 6+ you can fairly efficiently measure per-event index delays using tstats against a specific index and/or sourcetype using a search like this:

| tstats count as events WHERE index=perfmon sourcetype="Perfmon:Processor" BY _time,_indextime span=1s | eval indexlag=_indextime-_time | bucket indexlag span=log | chart sum(events) by indexlag

Of course you can group by all the standard fields, host/source/sourcetype, to break this information down however you want.

rshoward
Path Finder

As of 6+ this is the way to go. Long overdue move to what the "acceptable" answer is here now 😛

0 Karma

mataharry
Communicator

what about measuring the lag between the event timestamp and the indextime ?

* | eval lag_sec=_indextime-_time | timechart span=1h min(lag_sec) avg(lag_sec) max(lag_sec) by source

* | eval lag_sec=_indextime-_time | stats min(lag_sec) avg(lag_sec) max(lag_sec) by host

rshoward
Path Finder

NOTE: I'm updating this and accepting a different answer now due to tstats being the way to go as of v6+. Metadata command is cool and all but tstats will give more granularity, let you use indexed extraction'd fields, and also, the metadata command sometimes glitches out and gives silly values for times in some cases that throw charts off. I had switched all this over to tstats a long time ago and completely forgot about this answer.

---original answer---

Here is little trick for feed lag and device tracking that I picked up from working with Splunk a while ago and I've been selfish (or lazy?) and never shared. Here goes..

Splunk keeps awesome track of host information in a meta index of sorts. It is incredibly fast to search most of the time and gives us instant information on how data is being indexed on a per-host basis. Let's pretend I have five (5) high visibility hosts that have shoddy feeds and getting too far behind will cause problems (say, a SOC or intel feed). In the end, I will show you how to determine how to quickly tell what the last date was in the last log that was indexed and what time the index was updated on a per host basis with little overhead (in most cases).

First, make sure all your timezones and time zone corrections are in order...

Then, lets start out with the metadata search for hosts being indexed.

| metadata type=hosts

(don't forget that leading pipe, time range doesn't matter for metadata pulls)

This will show you some cool stuff by host.

"lastTime" is the latest timestamp seen in the indexed data.
"recentTime" is the latest time the index was updated for that host.

Using this we can create a report on how old the data is being indexed, and how stale the host is in indexes. (is it actively logging/feeding?)

| metadata type=hosts | search host="*" | eval loglag=(now()-lastTime)/60 | eval indexlag=(now()-recentTime)/60 | fields host, loglag, indexlag | sort host

(use host="*" to filter in/out any hosts you want for this report)

You now have a list of your hosts and a few new fields derived form lastTime and recentTime; loglag and indexlag respectively.

loglag tells us how many minutes behind the "feed" is. This will fluctuate for batched logs or sparse realtime feeds. Think about what an acceptable delay would be. If your logs are dropped to a directory every 5 minutes, 5-10 minutes is an acceptable delay. If the indexer cannot keep up with volume or the upstream device goes offline, this will increase by the minute every minute etc.

index lag tell us how many minutes behind the indexer indexed something for that host. no matter how far behind loglag is, if the indexer is actively working on indexing data for a host, its indexlag will be near zero. For batched data it will lul then snap to zero again when the file drops. If a device or network issue affects the feed for any extended period of time, index and log lag will increase.

You can further read into the data and develop a few scenarios. For example, if a feed goes out for an hour, indexlag and log lag will be at about 60. Once the feed returns, indexlag will snap to 0 and loglag will slowly approach near 0 (assuming the indexer indexes the data in order and there are no timezone issues of course)

Also, looking at |metadata searches in realtime windows can cause some weird time glitches.

So now, lets make some reports and alerts!

I personally like to have a history of what my feeds lag looks like. So I made these into a summary index. Save the search above with ALL hosts and have it run every 1-5 minutes and save to a summary index. For this example I will use the index of "lagsummary" for simplicity. Create and name your summary index something else if like. If you use a summary with multiple types of searches going into it, you will have to filter them out.

Once this summary index gathers enough data, we can trend it. (last 4 hours)

index="lagsummary" | search orig_host = "*" | timechart span="5m" avg(loglag) by orig_host

(for longer time ranges you will need to increase timechart's bin span. Use orig_host here to make tailored reports on specific sets of hosts.)

Switch to graph view and admire your work. Change loglag to indexlag to see the indexlag trended. You can use these searches to make dials, charts, graphs... all kinds of fun stuff.

Here is a quick test I started doing on a sparkline chart for a quick and easy glance at how my feeds are doing.

index=lagsummary | search orig_host="*-px" | chart sparkline(avg(loglag),5m) as loglag_trend, earliest(loglag), latest(loglag) as l_lag by orig_host | fields orig_host, loglag_trend, l_lag

I look at this one in a realtime 4 hour window. unlike metadata searches, realtime windows on summary indexes are fine since they are fed by the base search. I have the base search running every minute to keep the summary up to date more often.

Now that you have a handle on how up to date all of your devices are feeding with individual statistics, you can be on top of any issues with email alerts and reports and dashboards. Remember how I told you to start thinking about what acceptable lag would be? Save a new search with the first metadata search above and set some alerting constraints on what's important. Is it a realtime feed that should never stop? alert on met condition 'search indexlag > 10'. Is it a time sensitive batched log that will cause an analyst issues if the logs are too old? Alert on met condition 'search loglag > 60'. Remember not to make your summary filling saved searches alerting searches. Make dedicated saved searches for alerting.

Happy Splunking.

P.S. Apologies for the format of this "question". I can post a better how to with pictures if the need arises. This was more of a severely overdue braindump.

Get Updates on the Splunk Community!

Automatic Discovery Part 1: What is Automatic Discovery in Splunk Observability Cloud ...

If you’ve ever deployed a new database cluster, spun up a caching layer, or added a load balancer, you know it ...

Real-Time Fraud Detection: How Splunk Dashboards Protect Financial Institutions

Financial fraud isn't slowing down. If anything, it's getting more sophisticated. Account takeovers, credit ...

Splunk + ThousandEyes: Correlate frontend, app, and network data to troubleshoot ...

 Are you tired of troubleshooting delays caused by siloed frontend, application, and network data? We've got a ...