Splunk Search

Help with an aggregation search across different fields

heorhii12412
Explorer

Hello everyone! I have Splunk events in the following format: 

   activity_time: 2023-06-29T12:45:06Z
   event_time: 2023-06-29T14:49:42.787Z
   shipment_status: delivered
   timestamp: 2023-06-29T14:49:51.069Z
   tracking_number:95AAEC4900000

Every event also has a time value on top of the fields shown above, and shipment_status can contain values like 'delivered', 'in_process', and a few others.

I need to find the percentage of events, grouped by 'tracking_number', for which the event with shipment_status: in_process came before the event with shipment_status: delivered. 
In other words, I need to group events by tracking_number, keep only the events with shipment_status: in_process or delivered, compare the times of those two events, and count them toward the overall percentage of the filtered events if the in_process event came first, for example 1 hour before the delivered event.

I'm quite confused by the operators Splunk uses for this kind of filtering and calculation logic. Could someone please help me put the query together?


yuanliu
SplunkTrust

I know the feeling of not being able to see how the question I want answered can be expressed in SPL.  In this case, though, you also need a command that people hesitate to use, namely transaction.

But first, is there more domain-specific knowledge that volunteers will need?  For example, should any of the three timestamps, or any relationship among them, be considered?  When calculating which event comes first, which time value should be used?  In the following, I will only consider the _time that Splunk already has.

Another example.  For each tracking_number in consideration, how many events can there be in each shipment_status?  I will guess that only one event per tracking_number-shipment_status combination is possible.  If there can be more, you will need to specify how you want to group them.
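
If there can be more than one event per combination, a minimal sketch of one way to keep only the earliest event for each tracking_number and shipment_status pair before the rest of the search - purely as an illustration, assuming the field names from your sample - would be

``` keep only the earliest event per tracking_number / shipment_status pair ```
| dedup tracking_number shipment_status sortby + _time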

So, let's look at the question at hand: you want to group events by 'tracking_number' in the order in_process -> delivered, and see how many events end up grouped and how many do not.

An easy way to think about this is to treat the transition from in_process to delivered as a transaction.  SPL's transaction command produces two useful fields: closed_txn tells you whether the transaction is closed, i.e., whether the group satisfies all the requirements you specified, and eventcount tells you how many events are in that group.

Once you have the transactions, you calculate how many events are in the closed group and the unclosed group (keepevicted=true keeps the unclosed ones), then compare the count in the closed group to the total.  Like this:

| transaction tracking_number startswith="shipment_status=in_process" endswith="shipment_status=delivered" keepevicted=true
| stats sum(eventcount) as eventcount by closed_txn
| eventstats sum(eventcount) as totalcount
| where closed_txn == 1 ``` remaining eventcount only includes complete transactions ```
| eval percentage = eventcount * 100 / totalcount

Hope this helps.

The following is a data emulation that you can play with and compare with your real data:

| makeresults
| eval data = mvappend("  activity_time: \"2023-06-29T12:45:06Z\"
   event_time: \"2023-06-29T14:49:42.787Z\"
   shipment_status: delivered
   timestamp: \"2023-06-29T14:49:51.069Z\"
   tracking_number:95AAEC4900000", "  activity_time: \"2023-06-29T12:25:06Z\"
   event_time: \"2023-06-29T14:49:22.787Z\"
   shipment_status: in_process
   timestamp: \"2023-06-29T14:29:51.069Z\"
   tracking_number:95AAEC4900000", "  activity_time: \"2023-06-29T11:45:06Z\"
   event_time: \"2023-06-29T13:49:42.787Z\"
   shipment_status: delivered
   timestamp: \"2023-06-29T13:49:51.069Z\"
   tracking_number:95AAEC4900000")
| mvexpand data
| rename data as _raw
| extract kvdelim=":" pairdelim="
"
| eval _time = strptime(timestamp, "%FT%H:%M:%S.%3N%Z")
``` data emulation above ```


heorhii12412
Explorer

Thanks a lot, you really saved me! I hadn't even looked at the transaction command; I was trying to do it with just the eval and eventstats commands.


yuanliu
SplunkTrust

I'm glad that transaction solves the problem at hand.  People hesitate to use this command for a good reason.  As you gain more experience with stats, eventstats, and some other efficient ways to aggregate data, you may one day find a better method than transaction for this.  There are some good examples that you can find in this community as well.
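
For illustration only, a rough sketch of what such a stats-based alternative could look like here, assuming at most one in_process and one delivered event per tracking_number and the field names from this thread (a starting point to adapt, not a drop-in replacement):

| search shipment_status=in_process OR shipment_status=delivered
| stats count as eventcount
    min(eval(if(shipment_status=="in_process", _time, null()))) as in_process_time
    min(eval(if(shipment_status=="delivered", _time, null()))) as delivered_time
    by tracking_number
``` a tracking_number counts as matched when its in_process event precedes its delivered event ```
| eval matched = if(isnotnull(in_process_time) AND isnotnull(delivered_time) AND in_process_time < delivered_time, 1, 0)
| stats sum(eventcount) as totalcount sum(eval(if(matched == 1, eventcount, 0))) as matchedcount
| eval percentage = matchedcount * 100 / totalcount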

heorhii12412
Explorer

I also have a question related to time spans for the query you provided earlier:

| transaction tracking_number startswith="shipment_status=in_transit" endswith="shipment_status=delivered" keepevicted=true
| stats sum(eventcount) as eventcount by closed_txn
| eventstats sum(eventcount) as totalcount
| where closed_txn == 1 ``` remaining eventcount only includes complete transactions ```
| eval percentage = eventcount * 100 / totalcount

Is it possible to add a clause so that only transactions whose events span more than 1 hour are included in the eventcount, using the _time field for example? 
The transaction command's maxspan and maxpause options don't suit these calculations, and I couldn't get it to work with mvindex or strptime.
For example, if the in_transit event was received at 7/3/23 9:07:00.000 AM and the delivered event came after it at 7/3/23 11:10:00.000 AM, this transaction should be considered complete and counted toward eventcount.
I would be very grateful for your help!


yuanliu
SplunkTrust

To confirm, even if a shipment is delivered, the percentage count should only include "long" transactions.  Is this correct?

The transaction command produces a "duration" field for closed transactions, so it is easy to filter by duration in addition to the closed state.  The only catch is that duration has to be evaluated before the stats aggregation, because stats only keeps the fields you ask it for.

| transaction tracking_number startswith="shipment_status=in_transit" endswith="shipment_status=delivered" keepevicted=true
``` flag complete transactions that lasted longer than 1 hour ```
| eval long_txn = if(closed_txn == 1 AND duration > 3600, 1, 0)
| stats sum(eventcount) as eventcount by long_txn
| eventstats sum(eventcount) as totalcount
| where long_txn == 1
``` remaining eventcount only includes complete transactions that lasted longer than 1 hour ```
| eval percentage = eventcount * 100 / totalcount

PickleRick
SplunkTrust

+1 to that.

transaction is a relatively straightforward command to use, but it has its limitations (both in terms of functionality and performance), so it's often better to use other means to get the desired results. But sometimes transaction can be the only reasonable way to do something - in such cases it's good to be able to filter your data as much as possible before feeding it to the transaction command.
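
For example, a minimal sketch of that kind of pre-filtering, with a placeholder index name and the status values from the follow-up question:

index=your_index (shipment_status=in_transit OR shipment_status=delivered)
``` keep only the fields the transaction needs ```
| fields _time tracking_number shipment_status
| transaction tracking_number startswith="shipment_status=in_transit" endswith="shipment_status=delivered" keepevicted=true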
