OK. A bit of a journey here. I am searching for a good reliable method of bucketing numeric field values into categorical ranges, with no a-priori knowledge of what the range actually is, and preferably where I can specify based on the number of bins rather than with explicit span.
bin command (aka bucket) The problem here is that bin doesn't represent the intervening buckets for which no data exists. Seen from the perspective of the individual search commands, the behavior makes sense. But it's unacceptable for use in charts.
<search terms> | bin foo bins=10 | stats count by foo
Next stop: chart command. It can actually do this... sometimes. timechart can have a "bins" or "span" command not just in the beginning to control _time, but also after the split by field, to control on-the-fly bucketing of a numeric split-by. Chart command is even nuttier, allowing on the fly numeric-to-categorical bucketing for both the group-by and the split-by field. Here it is in action:
<search terms> | chart bins=10 count over foo
And it works! Yay. The intervening buckets of 1-2, 2-3, 3-4 all show up in this particular case.
Except... in most cases this feature completely loses its mind. Here is a good illustration. It's quite interesting behavior but unfortunately answers will only let me post two images so I have tried to find a representatively nutty one for you:
<search terms> | chart bins=50 count over foo
So... Is there any explanation for what makes chart command unreliable here, and is there a way to beat it with a stick to make it reliably bucket things? Or has anyone found a good flexible way to add in the empty buckets that bin/bucket omits?
- In other questions I've seen
makecontinuous put forward as a solution, but much effort seems to suggest that makecontinuous does not do anything useful, at least not in 6.2. =/ If you can make makecontinuous solve this though, please tell me how.
- Various tricks that allow for
| append [| stats count | eval foo=mvrange(0,50) ] | stats sum(count) as count by foo to glue on a dummy row for each value being bucketed. While these work for particular cases, it's super clunky and I believe all such tricks require a-priori knowledge that I don't have.
Here's how you'd use
| stats count | eval foo = mvrange(0,100) | mvexpand foo | eval foo = random()%100 | rename comment as "dummy" | bin bins=100 foo | stats count by foo | makecontinuous foo | fillnull count
That'll create the gaps in
fillnull lets the
count be zero for those gaps. Works fine in 6.2.3.
Do you have dummy data to reproduce the wonky
chart? Attempting something similar suggests
chart doesn't even fill in the blanks:
| stats count | eval foo = mvrange(0,100) | mvexpand foo | eval foo = (random()%500)/100 | chart bins=50 count over foo
Thanks for your answer. I understand makecontinuous better now and didn't realize it will only have a useful effect on incoming rows whose values are already bucketed, "1-5","6-10", etc..
However, digging deeper, makecontinuous has the same behavior as chart, and the key is that both of them lose their minds if the bucket size becomes less than 1.
Here's a canned example of chart working perfectly, bucketing correctly, and putting in the intervening empty buckets --
| stats count | eval foo=mvrange(1,10) | mvexpand foo | eval foo=random()%10 | chart count over foo bins=10 | sort foo
and here's a canned example of makecontinuous losing it's mind.
| stats count | eval foo=mvrange(1,10) | mvexpand foo | eval foo=random()%10 | bin foo bins=200 | makecontinuous foo | chart count over foo | sort foo
I suppose, in theory I could multiply the values by some very large power of 10, then use something truly evil like rex mode=sed to move the decimal point back over. !! Say it aint so. There has to be a better way. Unfortunately here I definitely have to support bucketing ranges into buckets of size<1. eg: many variables I need to bucket for my users in this reporting UI are floats in [0,1].
I think the reporting-makecontinuous order on your second example is wrong, but it still goes bananas:
| stats count | eval foo=mvrange(1,10) | mvexpand foo | eval foo=random()%10 | bin foo bins=200 | stats count by foo | makecontinuous foo
I've observed something even more weird: If you run the search up to
bin|stats and then do
|loadjob | makecontinuous it seems to behave differently, most filled-in buckets seem fine but there seems to be some added 6.9-7.0 buckets instead of some 7.0-7.1 buckets...