I want to correlate data from 2 sources. The first data source contains store_events (source1=store_events) and the second contains book_events (source2=book_events). The store_events source contains a field called store_start_time, and the book_events source contains a field called store_end_time. Both sources contain a field called customerid.
How can I join the data based on customerid and create a report with the x-axis showing total_time (store_start_time + store_end_time) and the y-axis showing total events?
Any help is appreciated.
MuS has posted a great link as well, but I thought I'd just write the query for you here directly, since your example gives a nice simple solution without a lot of fiddly bits.
Here is a search that does a lot of the work, getting the total time spent per customer id:
source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time
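If it helps to see the mechanics outside Splunk, here is a rough Python sketch (with invented sample events) of what the stats-by-customerid step and the eval are doing. It is only an illustration of the idea, not Splunk's actual implementation:

```python
# Illustration only: simulate how `stats values(...) by customerid`
# correlates the two sources, using made-up sample events.
store_events = [{"customerid": "c1", "store_start_time": 100},
                {"customerid": "c2", "store_start_time": 200}]
book_events = [{"customerid": "c1", "store_end_time": 130},
               {"customerid": "c2", "store_end_time": 260}]

# Group all events by customerid, merging fields from both sources
merged = {}
for e in store_events + book_events:
    merged.setdefault(e["customerid"], {}).update(e)

# eval time_spent = store_end_time - store_start_time
for row in merged.values():
    row["time_spent"] = row["store_end_time"] - row["store_start_time"]

print(merged["c1"]["time_spent"])  # 30
print(merged["c2"]["time_spent"])  # 60
```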
However, the fun part of this question, I think, concerns your reporting phrasing: "x-axis showing total_time (store_start_time + store_end_time) and y-axis showing total events".
1) The way I'm reading this, you're asking for the report to bucket the total time into sensible buckets like "0-10 seconds", "10-20 seconds", etc., and then display a column chart where those bucketed times are on the x-axis and the total customers are on the y-axis. That would look like this:
source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart dc(customerid) over time_spent
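The bucketing that bin does can be sketched like this (simplified Python with made-up numbers; Splunk's actual bin command rounds to human-friendly widths, which this sketch does not):

```python
# Rough sketch of `bin time_spent bins=20`: split the observed range
# of time_spent into (at most) 20 equal-width buckets.
times = [3, 7, 12, 18, 45, 60]  # invented time_spent values, in seconds

# Simplified width calculation; Splunk picks "nice" spans instead
width = max(1, (max(times) - min(times)) // 20)
bucketed = [f"{(t // width) * width}-{(t // width) * width + width}"
            for t in times]
print(bucketed)
```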
2) However, what if customerid X appears in 100 transactions, 99 of which take only 5 seconds and one of which takes 15 seconds? That customer's effect on the chart will be to increment the first and second buckets each by 1, which seems wrong. So I think you may want to keep track of the raw event count as well. That would look like this:
source=store_events OR source2=book_events | stats count values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart sum(count) as count over time_spent
The only difference is that my first stats command has an extra "count" field coming out of it, and my subsequent chart command has a weird-looking sum(count) as count in it. But the net effect is to count the individual transactions, not just the distinct count of customerid.
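To make the dc(customerid) versus sum(count) difference concrete, here is a small Python sketch using hypothetical per-customer rows as they would look after the stats step (names and numbers invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical rows after `stats count ... by customerid`:
# (customerid, event_count, time_spent_seconds)
rows = [("X", 100, 5), ("Y", 1, 5), ("Z", 2, 15)]

def bucket(t, width=10):
    """Assign a time to a fixed-width bucket label, e.g. 5 -> '0-10'."""
    lo = (t // width) * width
    return f"{lo}-{lo + width}"

# chart dc(customerid) over time_spent: one tick per customer per bucket
dc = Counter(bucket(t) for _, _, t in rows)

# chart sum(count) over time_spent: weight each customer by event count
total = defaultdict(int)
for _, count, t in rows:
    total[bucket(t)] += count

print(dict(dc))     # {'0-10': 2, '10-20': 1}
print(dict(total))  # {'0-10': 101, '10-20': 2}
```

Customer X's 100 transactions count as 1 under dc(customerid) but as 100 under sum(count), which is the distinction #2 is making.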
3) You might also want to display the raw data as a scatter chart. This could be done too, but a) scatter charts are never as cool as you think they're going to be, and b) in such a report "total customers" would be meaningless, because each point would pretty much represent a single customer.
Hi. Thank you. Your answer certainly helped me get the desired output.
I had another thought and updated my answer; see #2 in the updated answer above. I think you probably want to count total customers, not just the distinct count of customerid as I was doing in #1.
You are right, I want to get the total customers count. Thank you!
Hi gowthamkb,
Take a look at this answer https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... which is all about this topic.
Hope this helps ...
cheers, MuS
Hi MuS. Thanks for sharing the link; it's helpful.