Reliable way to specify span to bucket numeric val...

sideview · ‎05-01-2015

OK. A bit of a journey here. I am searching for a good reliable method of bucketing numeric field values into categorical ranges, with no a-priori knowledge of what the range actually is, and preferably where I can specify based on the number of bins rather than with explicit span.

First step, bin command (aka bucket) The problem here is that bin doesn't represent the intervening buckets for which no data exists. Seen from the perspective of the individual search commands, the behavior makes sense. But it's unacceptable for use in charts.

<search terms> | bin foo bins=10 | stats count by foo

Next stop: chart command. It can actually do this... sometimes. timechart can have a "bins" or "span" command not just in the beginning to control _time, but also after the split by field, to control on-the-fly bucketing of a numeric split-by. Chart command is even nuttier, allowing on the fly numeric-to-categorical bucketing for both the group-by and the split-by field. Here it is in action:

<search terms> | chart bins=10 count over foo

And it works! Yay. The intervening buckets of 1-2, 2-3, 3-4 all show up in this particular case.

Except... in most cases this feature completely loses its mind. Here is a good illustration. It's quite interesting behavior but unfortunately answers will only let me post two images so I have tried to find a representatively nutty one for you:

<search terms> | chart bins=50 count over foo

So... Is there any explanation for what makes chart command unreliable here, and is there a way to beat it with a stick to make it reliably bucket things? Or has anyone found a good flexible way to add in the empty buckets that bin/bucket omits?

Bad solutions:
- In other questions I've seen makecontinuous put forward as a solution, but much effort seems to suggest that makecontinuous does not do anything useful, at least not in 6.2. =/ If you can make makecontinuous solve this though, please tell me how.
- Various tricks that allow for | append [| stats count | eval foo=mvrange(0,50) ] | stats sum(count) as count by foo to glue on a dummy row for each value being bucketed. While these work for particular cases, it's super clunky and I believe all such tricks require a-priori knowledge that I don't have.

martin_mueller · ‎05-01-2015

Here's how you'd use makecontinuous:

| stats count | eval foo = mvrange(0,100) | mvexpand foo | eval foo = random()%100 | rename comment as "dummy"
| bin bins=100 foo | stats count by foo | makecontinuous foo | fillnull count

That'll create the gaps in foo, and fillnull lets the count be zero for those gaps. Works fine in 6.2.3.

Do you have dummy data to reproduce the wonky chart? Attempting something similar suggests chart doesn't even fill in the blanks:

| stats count | eval foo = mvrange(0,100) | mvexpand foo | eval foo = (random()%500)/100
| chart bins=50 count over foo

martin_mueller · ‎05-02-2015

I think the reporting-makecontinuous order on your second example is wrong, but it still goes bananas:

| stats count | eval foo=mvrange(1,10) | mvexpand foo | eval foo=random()%10 | bin foo bins=200 | stats count by foo | makecontinuous foo

I've observed something even more weird: If you run the search up to bin|stats and then do |loadjob | makecontinuous it seems to behave differently, most filled-in buckets seem fine but there seems to be some added 6.9-7.0 buckets instead of some 7.0-7.1 buckets...

sideview · ‎05-01-2015

Thanks for your answer. I understand makecontinuous better now and didn't realize it will only have a useful effect on incoming rows whose values are already bucketed, "1-5","6-10", etc..

However, digging deeper, makecontinuous has the same behavior as chart, and the key is that both of them lose their minds if the bucket size becomes less than 1.

I suppose, in theory I could multiply the values by some very large power of 10, then use something truly evil like rex mode=sed to move the decimal point back over. !! Say it aint so. There has to be a better way. Unfortunately here I definitely have to support bucketing ranges into buckets of size<1. eg: many variables I need to bucket for my users in this reporting UI are floats in [0,1].

Reliable way to specify span to bucket numeric values

Introducing the 2024 SplunkTrust!

Introducing the 2024 Splunk MVPs!

Splunk Custom Visualizations App End of Life