I want to correlate data from 2 sources. The first data source contains store_events (source1=store_events) and the second contains book_events (source2=book_events). The store_events source contains a field called store_start_time, and the book_events source contains a field called store_end_time. Both sources contain a field called customerid.
How can I join the data based on customerid and create a report with the x-axis showing total_time (store_start_time + store_end_time) and the y-axis showing total events?
Any help is appreciated.
MuS has posted a great link as well, but I thought I'd just write the query for you here directly, since your example gives a nice simple solution without a lot of fiddly bits.
Here is a search that does a lot of the work, getting the total time spent per customer id:
source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time
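If it helps to see the mechanics outside Splunk, here is a rough Python sketch (with invented sample events) of what the stats-by-customerid step and the eval are doing. It is only an illustration of the idea, not Splunk's actual implementation:

```python
# Illustration only: simulate how `stats values(...) by customerid`
# correlates the two sources, using made-up sample events.
store_events = [{"customerid": "c1", "store_start_time": 100},
                {"customerid": "c2", "store_start_time": 200}]
book_events = [{"customerid": "c1", "store_end_time": 130},
               {"customerid": "c2", "store_end_time": 260}]

# Group all events by customerid, merging fields from both sources
merged = {}
for e in store_events + book_events:
    merged.setdefault(e["customerid"], {}).update(e)

# eval time_spent = store_end_time - store_start_time
for row in merged.values():
    row["time_spent"] = row["store_end_time"] - row["store_start_time"]

print(merged["c1"]["time_spent"])  # 30
print(merged["c2"]["time_spent"])  # 60
```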
However, the fun part of this question, I think, concerns your reporting phrasing: "x-axis showing total_time (store_start_time + store_end_time) and y-axis showing total events".
1) The way I'm reading this, you're asking for the report to bucket the total time into sensible buckets like "0-10 seconds", "10-20 seconds", etc., and then display a column chart where those bucketed times are on the x-axis and the total customers are on the y-axis. That would look like this:
source=store_events OR source2=book_events | stats values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart dc(customerid) over time_spent
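The bucketing that bin does can be sketched like this (simplified Python with made-up numbers; Splunk's actual bin command rounds to human-friendly widths, which this sketch does not):

```python
# Rough sketch of `bin time_spent bins=20`: split the observed range
# of time_spent into (at most) 20 equal-width buckets.
times = [3, 7, 12, 18, 45, 60]  # invented time_spent values, in seconds

# Simplified width calculation; Splunk picks "nice" spans instead
width = max(1, (max(times) - min(times)) // 20)
bucketed = [f"{(t // width) * width}-{(t // width) * width + width}"
            for t in times]
print(bucketed)
```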
2) However, what if customerid X appears in 100 transactions, 99 of which take only 5 seconds and one of which takes 15 seconds? That customer's effect on the chart will be to increment the first and second buckets each by 1, which seems wrong. So I think you may want to keep track of the raw event count as well. That would look like this:
source=store_events OR source2=book_events | stats count values(store_start_time) as store_start_time values(store_end_time) as store_end_time by customerid | eval time_spent=store_end_time - store_start_time | bin time_spent bins=20 | chart sum(count) as count over time_spent
The only difference is that my first stats command has an extra "count" field coming out of it, and my subsequent chart command has a weird-looking sum(count) as count in it. But the net effect is to count the individual transactions, not just the distinct count of customerid.
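To make the dc(customerid) versus sum(count) difference concrete, here is a small Python sketch using hypothetical per-customer rows as they would look after the stats step (names and numbers invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical rows after `stats count ... by customerid`:
# (customerid, event_count, time_spent_seconds)
rows = [("X", 100, 5), ("Y", 1, 5), ("Z", 2, 15)]

def bucket(t, width=10):
    """Assign a time to a fixed-width bucket label, e.g. 5 -> '0-10'."""
    lo = (t // width) * width
    return f"{lo}-{lo + width}"

# chart dc(customerid) over time_spent: one tick per customer per bucket
dc = Counter(bucket(t) for _, _, t in rows)

# chart sum(count) over time_spent: weight each customer by event count
total = defaultdict(int)
for _, count, t in rows:
    total[bucket(t)] += count

print(dict(dc))     # {'0-10': 2, '10-20': 1}
print(dict(total))  # {'0-10': 101, '10-20': 2}
```

Customer X's 100 transactions count as 1 under dc(customerid) but as 100 under sum(count), which is the distinction #2 is making.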
3) You might also want to display the raw data as a scatter chart. This could be done too, but a) scatter charts are never as cool as you think they're going to be, and b) in such a report "total customers" would be meaningless, because each point would pretty much represent a single customer.
Hi. Thank you. Your answer certainly helped me get the desired output.
I had another thought and updated my answer; see #2 in the updated answer above. I think you probably want to count total customers, not just the distinct count of customerid as I was doing in #1.
You are right, I want to get the total customers count. Thank you!
Hi gowthamkb,
Take a look at this answer https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... which is all about this topic.
Hope this helps ...
cheers, MuS
Hi MuS. Thanks for sharing the link; it's helpful.