I'm trying to achieve the following and hoped someone could help?
I have a multivalue field that contains values that are colors, and would like to know how many fields contain duplicate colors, and what the value of those colors are.
e.g. my data
colors |
blue blue red |
yellow red |
blue red blue |
red red green green |
Would return something like:
duplicate_color | duplicate_count |
blue | 2 |
red | 1 |
green | 1 |
Because 'blue' is present as a duplicate in two entries, 'red' in one entry, and 'green' in one entry. 'yellow' is omitted because it is not a duplicate.
Thank you very much for any help
Steve
The solution becomes more obvious if I restate the problem like this: In addition to colors, you must have another field with four distinct values. Let's call the additional field group, and give them values "a", "b", "c", and "d".
colors | group |
blue | a |
blue | a |
red | a |
yellow | b |
red | b |
blue | c |
red | c |
blue | c |
red | d |
red | d |
green | d |
green | d |
When data structure is clear, what you are asking is to
Hence,
| stats count by colors group
| where count > 1
| stats dc(group) as duplicate_count by colors
Here is a data emulation you can play with and compare with real data
| makeresults format=csv data="colors,group
blue,a
blue,a
red,a
yellow,b
red,b
blue,c
red,c
blue,c
red,d
red,d
green,d
green,d"
``` data emulation above ```
String the two together, you get
colors | duplicate_count |
blue | 2 |
green | 1 |
red | 1 |
@yuanliu Thank you for your answer. This rephrasing of the problem is great and this solution helps solve my issue using an additional group (i have a `uuid` field for each row that I can use). Many many thanks.
@bowesmana Thank you. I do have a `uuid` field for each row, that I did not have in the question, and have gone ahead and used that.
If you do not have a 'group' value per event, then you can simply create one by putting
| streamstats c as group
before the first stats mentioned by @yuanliu
The solution becomes more obvious if I restate the problem like this: In addition to colors, you must have another field with four distinct values. Let's call the additional field group, and give them values "a", "b", "c", and "d".
colors | group |
blue | a |
blue | a |
red | a |
yellow | b |
red | b |
blue | c |
red | c |
blue | c |
red | d |
red | d |
green | d |
green | d |
When data structure is clear, what you are asking is to
Hence,
| stats count by colors group
| where count > 1
| stats dc(group) as duplicate_count by colors
Here is a data emulation you can play with and compare with real data
| makeresults format=csv data="colors,group
blue,a
blue,a
red,a
yellow,b
red,b
blue,c
red,c
blue,c
red,d
red,d
green,d
green,d"
``` data emulation above ```
String the two together, you get
colors | duplicate_count |
blue | 2 |
green | 1 |
red | 1 |