Solved: Pulling out duplicates from multivalue fields and ...

steve_b_88 · ‎03-27-2024

I'm trying to achieve the following and hoped someone could help?

I have a multivalue field that contains values that are colors, and would like to know how many fields contain duplicate colors, and what the value of those colors are.

e.g. my data

colors

blue
blue
red

yellow
red

blue
red
blue

red
red
green
green

Would return something like:

duplicate_color	duplicate_count
blue	2
red	1
green	1

Because 'blue' is present as a duplicate in two entries, 'red' in one entry, and 'green' in one entry. 'yellow' is omitted because it is not a duplicate.

Thank you very much for any help
Steve

yuanliu · ‎03-27-2024

The solution becomes more obvious if I restate the problem like this: In addition to colors, you must have another field with four distinct values. Let's call the additional field group, and give them values "a", "b", "c", and "d".

colors	group
blue	a
blue	a
red	a
yellow	b
red	b
blue	c
red	c
blue	c
red	d
red	d
green	d
green	d

When data structure is clear, what you are asking is to

Find values of colors that appear more than once with each group value.
Count how many distinct values of group for each of duplicated values of colors.

Hence,

| stats count by colors group
| where count > 1
| stats dc(group) as duplicate_count by colors

Here is a data emulation you can play with and compare with real data

| makeresults format=csv data="colors,group
blue,a
blue,a
red,a
yellow,b
red,b
blue,c
red,c
blue,c
red,d
red,d
green,d
green,d"
``` data emulation above ```

String the two together, you get

colors	duplicate_count
blue	2
green	1
red	1

View solution in original post

steve_b_88 · ‎03-28-2024

@yuanliu Thank you for your answer. This rephrasing of the problem is great and this solution helps solve my issue using an additional group (i have a `uuid` field for each row that I can use). Many many thanks.

@bowesmana Thank you. I do have a `uuid` field for each row, that I did not have in the question, and have gone ahead and used that.

bowesmana · ‎03-27-2024

If you do not have a 'group' value per event, then you can simply create one by putting

| streamstats c as group

before the first stats mentioned by @yuanliu

yuanliu · ‎03-27-2024

The solution becomes more obvious if I restate the problem like this: In addition to colors, you must have another field with four distinct values. Let's call the additional field group, and give them values "a", "b", "c", and "d".

colors	group
blue	a
blue	a
red	a
yellow	b
red	b
blue	c
red	c
blue	c
red	d
red	d
green	d
green	d

When data structure is clear, what you are asking is to

Find values of colors that appear more than once with each group value.
Count how many distinct values of group for each of duplicated values of colors.

Hence,

| stats count by colors group
| where count > 1
| stats dc(group) as duplicate_count by colors

Here is a data emulation you can play with and compare with real data

| makeresults format=csv data="colors,group
blue,a
blue,a
red,a
yellow,b
red,b
blue,c
red,c
blue,c
red,d
red,d
green,d
green,d"
``` data emulation above ```

String the two together, you get

colors	duplicate_count
blue	2
green	1
red	1

Pulling out duplicates from multivalue fields and counting

field extraction

fields

stats

subsearch

Join Us for Splunk University and Get Your Bootcamp Game On!

.conf24 | Learning Tracks for Security, Observability, Platform, and Developers!

Announcing Scheduled Export GA for Dashboard Studio