Splunk Search

Cumulative sum of a distinct count over time.

howyagoin
Contributor

I've got a variable, call it "flowers," related to orders from a shop. I'm trying to get a chart of the number of unique flower types purchased over a time range, as well as the cumulative total.

I've found another answer here which gave the hint of:

searchterm | stats values(flowers) as fcount dc(flowers) by date_minute | streamstats dc(fcount) as "Cumulative total"

And whilst that works, it breaks down the number of unique flowers per a given minute, across the date range -- all fine if that range is within a given hour, but, otherwise it's not quite the right way to break this down.

What I'm after though is a way to look at 36 hours worth of flower sales, and for each hour, or minute, see the number of unique flower types sold, but also have the cumulative total.

As I interpret the above example, it's recycling "minutes" (or hours if I use date_hour) so that the 12th minute across two days is only represented by a single value..

Missing something obvious here.....

Tags (1)

gkanapathy
Splunk Employee
Splunk Employee
...
| sort _time 
| streamstats global=f window=0 current=t 
  values(flowers) as v_flowers
| eval v_flowers=mvjoin(v_flowers,";")
| timechart span=1h dc(flowers) as cur_dc, last(v_flowers) as cum_dc
| eval cum_dc=mvcount(split(cum_dc,";"))

The main tricks are (a) you need to sort and get the cumulative count first, and (b) convert the list of items from a multivalue field since it seems that the timechart's last() function doesn't preserve multivalues.

howyagoin
Contributor

I think I follow the logic here, will have to experiment. I ran into a 10000 row limit on this one though. I did find a "hack" that seems to do it:

...| bin _time span=6h as hour | stats values(flowers) as cflowers dc(flowers) as "Flowers per hour" by hour | streamstats dc(cflowers) as "Cumulative total" | eval Time=strftime(hour, "%m-%d.%H") | table Time, "Flowers per hour", "Cumulative total"

0 Karma

hexx
Splunk Employee
Splunk Employee

I think you may want this :

... earliest=-36h | timechart span=1h dc(flower) AS "# of flower types purchased" | streamstats sum("# of flower types purchased") AS total

I'm not sure how relevant the cumulative total will be though, as it is only representative of the sum of distinct counts over time, which obviously will have duplicate flower types.

You might find the results of this search more interesting as they will expose similar information :

... earliest=-36h | timechart span=1h count by flower | addtotals | appendcols [search ... | timechart span=1h dc(flower) AS "# of flower types purchased"]

This will provide the number of purchases per flower type, the total number of purchases, and the distinct count of flower types purchased for every hour.

For more detais and examples, check the Search Reference manual records for timechart, appendcols and addtotals.

0 Karma

hexx
Splunk Employee
Splunk Employee

@howyagoin : Oh ok, you want a distinct count over the entire period, not an on-going cumulative sum of the item sales for each cycle.

0 Karma

gkanapathy
Splunk Employee
Splunk Employee

Try what I came up and posted now.

0 Karma

gkanapathy
Splunk Employee
Splunk Employee

hmm. maybe not. streamstats doesn't work with sistats as I had hoped.

0 Karma

gkanapathy
Splunk Employee
Splunk Employee

This is a bit tricky, but possible. See my answer.

0 Karma

hexx
Splunk Employee
Splunk Employee

I see. How about :

... earliest=-36h | timechart span=1h count, dc(flower) AS "Distinct count of flower types purchased" | streamstats sum(count) AS "Total flowers purchased" | fields - count

0 Karma

howyagoin
Contributor

Both of these are interesting, thanks, gives me something to play with. What I was hoping to accomplish though was to have a graph of time on the X axis, number of flowers on the Y, with one line representing the number of unique flowers per that increment of time (hour/minute, whatever) -- but a second line representing the cumulative total over all time, rather than just for that unit of time.

So the second line would never, in theory, go down; if there were 2 unique flowers at noon on Monday, and 4 at 1300, and a total of 5 unique between those times, the second line should be at "5" on Y

0 Karma
Get Updates on the Splunk Community!

Routing logs with Splunk OTel Collector for Kubernetes

The Splunk Distribution of the OpenTelemetry (OTel) Collector is a product that provides a way to ingest ...

Welcome to the Splunk Community!

(view in My Videos) We're so glad you're here! The Splunk Community is place to connect, learn, give back, and ...

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Elevating Digital Service Excellence: The Synergy of Real User Monitoring and Application Performance ...