I'm trying to achieve the following and hoped someone could help?
I have a multivalue field that contains values that are colors, and would like to know how many fields contain duplicate colors, and what the value of those colors are.
e.g. my data
| colors |
| blue blue red |
| yellow red |
| blue red blue |
| red red green green |
Would return something like:
| duplicate_color | duplicate_count |
| blue | 2 |
| red | 1 |
| green | 1 |
Because 'blue' is present as a duplicate in two entries, 'red' in one entry, and 'green' in one entry. 'yellow' is omitted because it is not a duplicate.
Thank you very much for any help
Steve
The solution becomes more obvious if I restate the problem like this: In addition to colors, you must have another field with four distinct values. Let's call the additional field group, and give them values "a", "b", "c", and "d".
| colors | group |
| blue | a |
| blue | a |
| red | a |
| yellow | b |
| red | b |
| blue | c |
| red | c |
| blue | c |
| red | d |
| red | d |
| green | d |
| green | d |
When data structure is clear, what you are asking is to
Hence,
| stats count by colors group
| where count > 1
| stats dc(group) as duplicate_count by colors
Here is a data emulation you can play with and compare with real data
| makeresults format=csv data="colors,group
blue,a
blue,a
red,a
yellow,b
red,b
blue,c
red,c
blue,c
red,d
red,d
green,d
green,d"
``` data emulation above ```
String the two together, you get
| colors | duplicate_count |
| blue | 2 |
| green | 1 |
| red | 1 |
@yuanliu Thank you for your answer. This rephrasing of the problem is great and this solution helps solve my issue using an additional group (i have a `uuid` field for each row that I can use). Many many thanks.
@bowesmana Thank you. I do have a `uuid` field for each row, that I did not have in the question, and have gone ahead and used that.
If you do not have a 'group' value per event, then you can simply create one by putting
| streamstats c as groupbefore the first stats mentioned by @yuanliu
The solution becomes more obvious if I restate the problem like this: In addition to colors, you must have another field with four distinct values. Let's call the additional field group, and give them values "a", "b", "c", and "d".
| colors | group |
| blue | a |
| blue | a |
| red | a |
| yellow | b |
| red | b |
| blue | c |
| red | c |
| blue | c |
| red | d |
| red | d |
| green | d |
| green | d |
When data structure is clear, what you are asking is to
Hence,
| stats count by colors group
| where count > 1
| stats dc(group) as duplicate_count by colors
Here is a data emulation you can play with and compare with real data
| makeresults format=csv data="colors,group
blue,a
blue,a
red,a
yellow,b
red,b
blue,c
red,c
blue,c
red,d
red,d
green,d
green,d"
``` data emulation above ```
String the two together, you get
| colors | duplicate_count |
| blue | 2 |
| green | 1 |
| red | 1 |