Splunk Search

How to join customer ID data from different sources and create a time-based report where x-axis is total time and y-axis is total events?

gowthamkb
Explorer

I want to correlate data from 2 sources. First data source contains store_events (source1=store_events) and second source contains book_events (source2=book_events). Source store_events contains a field called store_start_time and source book_events contains a field called store_end_time. Both sources contain a field called customerid.

How can I join the data based on customerid and create a report with my x-axis showing total_time (store_start_time + store_end_time) and y-axis showing total events ?

Any help is appreciated.

1 Solution

sideview
SplunkTrust
SplunkTrust

MuS has posted a great link as well, but I thought I'd just write the query for you here directly, since your example gives a nice simple solution without a lot of fiddly bits.

Here is a search that does a lot of the work, getting the total time spent per customer id:

source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time 

However the fun part of this question I think may concern your reporting phrasing -- "x-axis showing total_time (store_start_time + store_end_time) and y-axis showing total events "

1) The way I'm reading this, is that you're asking for the report to "bucket the total time into sensible buckets like "0-10 seconds", "10-20 seconds" etc, and then display a column chart where those bucketed times are on the x-axis and the total customers are on the y-axis. That would look like this:

source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart dc(customerid) over time_spent

2) However, what if customerid X appears in 100 transactions, 99 of which take only 5 seconds, and one of which takes 15 seconds. His effect on the chart will be to increment the first and second buckets each by 1, which seems wrong. So I think possibly you want to keep track of the raw event count as well? That would look like this:

source=store_events OR source2=book_events | stats count values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart sum(count) as count over time_spent

The only difference is that my first stats has an extra "count" field coming out of it, and my subsequent chart command has a weird looking sum(count) as count in it. But the net difference is to count the individual transactions, and not just the distinct count of customerid.

3) You might also want to display the raw data as scatter chart. This could be done too but a) scatter charts are never as cool as you think they're going to be and b) in such a report "total customers" would be meaningless because each point would pretty much just represent one customer.

View solution in original post

sideview
SplunkTrust
SplunkTrust

MuS has posted a great link as well, but I thought I'd just write the query for you here directly, since your example gives a nice simple solution without a lot of fiddly bits.

Here is a search that does a lot of the work, getting the total time spent per customer id:

source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time 

However the fun part of this question I think may concern your reporting phrasing -- "x-axis showing total_time (store_start_time + store_end_time) and y-axis showing total events "

1) The way I'm reading this, is that you're asking for the report to "bucket the total time into sensible buckets like "0-10 seconds", "10-20 seconds" etc, and then display a column chart where those bucketed times are on the x-axis and the total customers are on the y-axis. That would look like this:

source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart dc(customerid) over time_spent

2) However, what if customerid X appears in 100 transactions, 99 of which take only 5 seconds, and one of which takes 15 seconds. His effect on the chart will be to increment the first and second buckets each by 1, which seems wrong. So I think possibly you want to keep track of the raw event count as well? That would look like this:

source=store_events OR source2=book_events | stats count values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart sum(count) as count over time_spent

The only difference is that my first stats has an extra "count" field coming out of it, and my subsequent chart command has a weird looking sum(count) as count in it. But the net difference is to count the individual transactions, and not just the distinct count of customerid.

3) You might also want to display the raw data as scatter chart. This could be done too but a) scatter charts are never as cool as you think they're going to be and b) in such a report "total customers" would be meaningless because each point would pretty much just represent one customer.

gowthamkb
Explorer

Hi. Thank you. Your answer certainly helped me to get the desired o/p

0 Karma

sideview
SplunkTrust
SplunkTrust

I had another thought, and updated my answer - see #2 in the updated answer above. I think you probably want to count total customers, not just distinct count of customerid as I was doing in #1.

0 Karma

gowthamkb
Explorer

you are right. I want to get the total customers count. Thank you !!

0 Karma

MuS
Legend

Hi gowthamkb,

Take a look at this answer https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... which is all about this topic.

Hope this helps ...

cheers, MuS

gowthamkb
Explorer

Hi Mus. Thanks for sharing the link. it's helpful.

0 Karma
Get Updates on the Splunk Community!

Easily Improve Agent Saturation with the Splunk Add-on for OpenTelemetry Collector

Agent Saturation What and Whys In application performance monitoring, saturation is defined as the total load ...

Explore the Latest Educational Offerings from Splunk [January 2025 Updates]

At Splunk Education, we are committed to providing a robust learning experience for all users, regardless of ...

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...