Splunk Search

Locate latest record by some grouping field

user-mcuserface
Engager

With a query like the following (I've simplified it a little here and renamed some fields):

```
index="my-test-index" project="my-project"
| eval _time = strptime(my_timestamp, "%Y-%m-%dT%H:%M:%S.%N+00:00")
| stats latest(my_timestamp) latest(_time) latest(my_count) as my_count by project
```

I see behaviour that surprised me:

1. If I repeatedly issue the query, the value of my_count varies
2. It appears the rows from which my_count is taken are always those without a _time value resulting from the eval in my query (because either `my_timestamp` did not match the strptime format, or that field was not present when the record was ingested into splunk -- my data has both cases)
3. In the output of the search, the value of my_timestamp returned does not always come from the same ingested record as my_count.
4. In fact, the value of my_timestamp in the search output is always taken from the same single record: it doesn't change when I repeatedly issue the query.

I guess 1. and 2. are because "null" (or empty or some similar concept) _time values aren't really expected and happen to sort latest.

I guess 3. is because function `latest` operates field-by-field, and is not selecting a whole row -- combined again with the fact that some _time values are null.

4. I don't understand, but perhaps it's a coincidence that isn't reliably true in general outside of my data set; I'm not sure.

What I really want is to find the ingested record with the latest value of `my_timestamp` for a given `project`, so I can present fields like `my_count` by `project` in a "most recent counts" table. I don't really want to operate on individual fields' "latest" values as in the query above, but rather latest entire records.

How can I best achieve that in splunk?


PickleRick
SplunkTrust

0. Please post your searches and such in a preformatted paragraph or a code block. Makes it easier to read.

1. There are no miracles. If a repeated search yields different results (and it doesn't contain any random element), something must be varying across the separate runs - either you're running it over different time windows (for example, "earliest=-1m latest=now" will contain different events depending on when it's run), or your events for a given time window change (maybe there are events backfilling the index due to latency problems). Sometimes you might have connectivity problems and not be getting all results from all indexers (but that should throw a warning), or have performance problems and have your searches finalized before they fully complete.

2. You are forcefully overwriting the _time field (honestly, I have no idea why - you could just as well use another field name; if you want it for automatic formatting, you could rename it at the very end of your search).

3. As @yuanliu already pointed out - there seems to be a problem with the quality of your data. The process of data onboarding includes finding the proper "main" timestamp within the event (some events can have multiple timestamps; in such a case you need to decide which is the primary one for the event) and making sure it's getting parsed out properly, so that the event is indexed at the proper point in time. That's one of the most important - if not the most important - parts of onboarding events: you must know where to look for your data. Otherwise you have no way of knowing what data you have, how much data you have, where it is, and how to look for it.

4. Yes, latest(X) looks for the latest value of field X. It doesn't mind any other fields. So latest(X) and latest(Y) will show you the latest seen values of fields X and Y respectively, but they don't have to come from the same event. If one event had only field X, and another had only field Y, you'd still get both of them in your results, since each was the last occurrence of its respective field.
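This effect can be reproduced with synthetic events (a minimal sketch using makeresults; the field names X and Y and the test values are made up):

```
| makeresults count=2
| streamstats count as n
| eval _time = _time + n
| eval X = if(n==1, "from_older_event", null())
| eval Y = if(n==2, "from_newer_event", null())
| stats latest(X) as latest_X, latest(Y) as latest_Y
```

Here latest_X is taken from the older event and latest_Y from the newer one - two different events in a single stats row.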


user-mcuserface
Engager

@PickleRick wrote:

3. [...] data onboarding includes finding the proper "main" timestamp within the event (some events can have multiple timestamps; in such case you need to decide which is the primary timestamp for the event) and making sure it's getting parsed out properly so that the event is indexed at the proper point in time

That field is `my_timestamp`

I'm currently using sourcetype `_json`, in the hope that I don't need to get into parsing data too much. However, in order to fetch a latest metric by `my_timestamp`, I guess I either need to somehow tell Splunk that this field is a timestamp, or just treat this (ISO format) timestamp as a string for that purpose (since ISO strings do sort correctly)? If the former, perhaps I need to define a new sourcetype?

4. Yes, latest(X) looks for the latest value of field X. It doesn't mind any other fields. So latest(X) and latest(Y) will show you the latest seen values of fields X and Y respectively, but they don't have to come from the same event. If one event had only field X, and another had only field Y, you'd still get both of them in your results, since each was the last occurrence of its respective field.


How should I approach this to get a latest record, rather than the latest field?


PickleRick
SplunkTrust

OK. Don't use the _json sourcetype. It's there so that in a poorly configured environment data is somehow at least partially correctly processed but in a production scenario it shouldn't be used. You should define your own sourcetype. As you're probably not using indexed extractions (and you generally shouldn't use them), you need to set proper timestamp extraction settings in your config along with other settings from the so-called great 8.

https://lantern.splunk.com/Splunk_Platform/Product_Tips/Data_Management/Configuring_new_source_types
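A props.conf stanza along these lines covers the timestamp settings and the rest of the "great 8" (a sketch only: the sourcetype name is made up, TIME_PREFIX depends on how `my_timestamp` actually appears in the raw JSON, and the subsecond precision in TIME_FORMAT may need adjusting):

```
[my_project_json]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
TIME_PREFIX = "my_timestamp"\s*:\s*"
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%z
MAX_TIMESTAMP_LOOKAHEAD = 40
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)
KV_MODE = json
```

With the timestamp parsed at index time, _time is already correct and the strptime workaround in the search becomes unnecessary.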

Finding the latest/oldest event (or any other ordered-first/last event) can be done, for example, by using the head or tail commands (optionally sorting the data first; remember that by default Splunk returns events in reverse chronological order - newest first - so sorting might not always be necessary).
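Following that advice, one way to get the latest whole record per project is sort plus dedup (a sketch using the index and field names from the question; it relies on ISO-8601 strings sorting chronologically as plain strings):

```
index="my-test-index" project="my-project"
| sort 0 - my_timestamp
| dedup project
| table project, my_timestamp, my_count
```

The `sort 0` removes the default result limit, and dedup keeps only the first (here: newest) event per project, so every field in a row comes from the same ingested record.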

yuanliu
SplunkTrust
2. It appears the rows from which my_count is taken are always those without a _time value resulting from the eval in my query (because either `my_timestamp` did not match the strptime format, or that field was not present when the record was ingested into splunk -- my data has both cases)

This is to say that you have bad data.  Bad data leads to bad results.  You need to find a way to fix your data, or at least fix how you extract from my_timestamp if you cannot fix this field in data.
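Before fixing the data, it may help to quantify how bad it is (a sketch; the format string is copied from the question and `parsed_time`/`status` are made-up field names):

```
index="my-test-index" project="my-project"
| eval parsed_time = strptime(my_timestamp, "%Y-%m-%dT%H:%M:%S.%N+00:00")
| eval status = if(isnull(parsed_time), "bad_timestamp", "ok")
| stats count by status
```

Events counted under "bad_timestamp" are the ones where `my_timestamp` is missing or doesn't match the expected format.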
