How do I dedup or filter out data with a condition?
For example:
Below, I want to filter out rows that contain name="name0".
The condition should handle any IPs in the ip field because the IPs can change; the real data contains many more IPs. The name0 rows are not in any particular order. The dedup/filter should not be applied to IPs that don't contain "name0", AND it should not be applied to a unique IP that only has "name0".
Thank you for your help.
Data:
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
Expected output:
ip | name | location |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4
1.1.1.5, name0, location-0
1.1.1.6, name0, location-0"
This looks like a simple
| dedup ip
If there is some other logic involved, please explain.
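For readers who want to trace the behavior outside Splunk, here is a rough Python sketch of what `| dedup ip` does: keep only the first row seen for each ip value. This is only an illustration using a few of the sample rows, not Splunk's implementation.

```python
# Illustrative sketch of "| dedup ip": keep the first row seen for each ip.
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
]

seen_ips = set()
deduped = []
for ip, name, location in rows:
    if ip not in seen_ips:
        seen_ips.add(ip)
        deduped.append((ip, name, location))
```

Note that this keeps whichever row happens to come first for each ip (here the name0 rows), which is why plain `dedup ip` alone does not produce the expected output above.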
Hi @PickleRick
Thank you for your help. I also updated the original post.
The name0 rows are not in any particular order. The dedup/filter should not be applied to IPs that don't contain "name0".
Data:
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
Expected output:
ip | name | location |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4"
Then just filter out all events with name="name0"
| where name!="name0"
or even
| search name!="name0"
Then you can dedup if needed.
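The same filter expressed as a plain Python sketch over the eight sample rows (illustrative only):

```python
# Illustrative sketch of `| where name!="name0"` on the sample rows.
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
    ("1.1.1.3", "name0", "location-3"),
    ("1.1.1.3", "name3", "location-3"),
    ("1.1.1.4", "name4", "location-4"),
    ("1.1.1.4", "name4b", "location-4"),
]

filtered = [(ip, name, loc) for ip, name, loc in rows if name != "name0"]
```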
Hi @PickleRick
Sorry I missed another condition. I also updated the initial post.
The name0 rows are not in any particular order. The dedup/filter should not be applied to IPs that don't contain "name0", AND it should not be applied to a unique IP that only has "name0".
So, unique IPs like 1.1.1.5 and 1.1.1.6 that only have "name0" need to remain in the data. What I do now is filter them out statically, but another IP could show up with the same pattern. Thank you again for your help.
Data:
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
Expected output:
ip | name | location |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4
1.1.1.5, name0, location-0
1.1.1.6, name0, location-0"
You can help yourself by checking how many distinct values there are in the name field.
| eventstats dc(name) as dc
| where name!="name0" OR (name=="name0" AND dc=1)
Then you can
| dedup name
if needed.
Hi @PickleRick
When I ran the following command, the dc returned 6 for each row
| eventstats dc(name) as dc
dc | ip | location | name |
6 | 1.1.1.1 | location-1 | name0 |
6 | 1.1.1.1 | location-1 | name1 |
6 | 1.1.1.2 | location-2 | name2 |
6 | 1.1.1.2 | location-20 | name0 |
6 | 1.1.1.3 | location-3 | name0 |
6 | 1.1.1.3 | location-3 | name3 |
6 | 1.1.1.4 | location-4 | name4 |
6 | 1.1.1.4 | location-4 | name4b |
6 | 1.1.1.5 | location-0 | name0 |
6 | 1.1.1.6 | location-0 | name0 |
So, the output is still missing 1.1.1.5 and 1.1.1.6.
Only name0 rows on IPs that also have other names should be removed. Thanks for your help.
| where name!="name0" OR (name=="name0" AND dc=1)
output
dc | ip | location | name |
6 | 1.1.1.1 | location-1 | name1 |
6 | 1.1.1.2 | location-2 | name2 |
6 | 1.1.1.3 | location-3 | name3 |
6 | 1.1.1.4 | location-4 | name4 |
6 | 1.1.1.4 | location-4 | name4b |
Expected output:
ip | location | name |
1.1.1.1 | location-1 | name1 |
1.1.1.2 | location-2 | name2 |
1.1.1.3 | location-3 | name3 |
1.1.1.4 | location-4 | name4 |
1.1.1.4 | location-4 | name4b |
1.1.1.5 | location-0 | name0 |
1.1.1.6 | location-0 | name0 |
Of course. That's what you get when you're writing faster than you're thinking 😉
That eventstats should have a "BY ip" clause so you get the count of distinct values per each separate ip. So
| eventstats dc(name) AS dc BY ip
The rest stays the same.
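To make the per-ip logic concrete, here is a Python sketch of what `eventstats dc(name) AS dc BY ip` followed by the `where` clause computes on the sample rows (illustrative only, not how Splunk implements it):

```python
from collections import defaultdict

# Sample rows: (ip, name, location)
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
    ("1.1.1.3", "name0", "location-3"),
    ("1.1.1.3", "name3", "location-3"),
    ("1.1.1.4", "name4", "location-4"),
    ("1.1.1.4", "name4b", "location-4"),
    ("1.1.1.5", "name0", "location-0"),
    ("1.1.1.6", "name0", "location-0"),
]

# eventstats dc(name) AS dc BY ip: distinct name count per ip
names_per_ip = defaultdict(set)
for ip, name, location in rows:
    names_per_ip[ip].add(name)

# where name!="name0" OR (name=="name0" AND dc=1)
result = [
    (ip, name, location)
    for ip, name, location in rows
    if name != "name0" or len(names_per_ip[ip]) == 1
]
```

This keeps the name0 rows only for IPs whose distinct name count is 1 (1.1.1.5 and 1.1.1.6), matching the expected output.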
Hi @PickleRick ,
I tested your suggestion and it worked. Thank you for your help.
1) I added one more case where the IP has an empty name. I added a condition in the where clause (dc=0) and it worked. I am afraid that if I used isnull(name), it might miss cases where the field contains " " (an empty/whitespace string). Please let me know if this is doable.
2) Is it possible to do this without using eventstats?
I have already used eventstats in the search, but for a different field
Will that cause any delays or issues? Have you ever used multiple eventstats in one search?
Thank you so much for your help
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
1.1.1.7 | | location-7 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4
1.1.1.5, name0, location-0
1.1.1.6, name0, location-0
1.1.1.7,,location-7"
| eventstats dc(name) AS dc BY ip
| where name!="name0" OR dc=0 OR (name=="name0" AND dc=1)
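As a Python sketch of the same per-ip count with the empty-name case included (illustrative only; note one SPL-specific wrinkle called out in the comments):

```python
from collections import defaultdict

# Sample rows, including 1.1.1.7 with an empty name: (ip, name, location)
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
    ("1.1.1.3", "name0", "location-3"),
    ("1.1.1.3", "name3", "location-3"),
    ("1.1.1.4", "name4", "location-4"),
    ("1.1.1.4", "name4b", "location-4"),
    ("1.1.1.5", "name0", "location-0"),
    ("1.1.1.6", "name0", "location-0"),
    ("1.1.1.7", "", "location-7"),
]

# dc() ignores null values; here we also skip whitespace-only names,
# which is exactly the case an isnull() check would miss.
names_per_ip = defaultdict(set)
for ip, name, location in rows:
    if name.strip():
        names_per_ip[ip].add(name)

# Keep the row unless it is a name0 on an ip that also has other names.
# (In SPL, a null name makes name!="name0" not evaluate to true, which is
# why the extra dc=0 clause is needed there; in Python, "" != "name0" is
# already True, so one condition covers both cases.)
result = [
    (ip, name, location)
    for ip, name, location in rows
    if name != "name0" or len(names_per_ip[ip]) <= 1
]
```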
1) I'm honestly not sure what Splunk will do if there is a whitespace-filled value. Try and see 🙂 You're not changing your indexed data, after all - you can search it any way you want (if you have too much data to search, just narrow your time range for quick tests).
2) You could try some streamstats tricks as long as you can make some assumptions about your data (like it being sorted the right way), but that would be bending over backwards. True, eventstats has some limitations when it comes to data set size. You can, however, have as many eventstats within a single search as you want 🙂
See
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Eventstats#Usage
Hi @PickleRick
I accepted your previous suggestion using eventstats as solution
Thank you for your help. Thanks for bringing to my attention that eventstats also has a memory limitation, another limits.conf setting that only an admin can modify 🙂
The default setting is max_mem_usage_mb = 200 (MB).
How do I know if my dataset will ever hit 200 or not?
Does streamstats have limitation?
How do I use streamstats in my case? When I changed eventstats to streamstats, it showed incremental counts instead of totals, so I need to figure out how to filter on that, if it's possible.
Thanks
Streamstats result
dc | ip | location | name |
1 | 1.1.1.1 | location-1 | name0 |
2 | 1.1.1.1 | location-1 | name1 |
1 | 1.1.1.2 | location-2 | name2 |
1 | 1.1.1.3 | location-3 | name0 |
2 | 1.1.1.3 | location-3 | name3 |
1 | 1.1.1.4 | location-4 | name4 |
2 | 1.1.1.4 | location-4 | name4b |
1 | 1.1.1.5 | location-0 | name0 |
1 | 1.1.1.6 | location-0 | name0 |
0 | 1.1.1.7 | location-7 | |
All those commands have some limitations. Some are more explicitly configured, some implicitly derived from operating system limits.
Anyway.
Simply replacing eventstats with streamstats obviously won't do. With streamstats you'd need to do some copying over of values, and I don't have a ready solution. I have a hunch it's possible, but it would be ugly and possibly even less efficient than the eventstats-based one.
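A minimal Python sketch of why a straight swap fails (assuming the rows arrive in the order shown): a streamstats-style running distinct count only reflects the rows seen so far, matching the dc=1 shown on the first 1.1.1.1 name0 row in the streamstats result above.

```python
from collections import defaultdict

rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
]

# streamstats-style running distinct name count per ip
seen = defaultdict(set)
running_dc = []
for ip, name, location in rows:
    seen[ip].add(name)
    running_dc.append(len(seen[ip]))

# The first 1.1.1.1 row (a name0 that must be dropped) gets a running
# count of 1, so a "keep name0 only when dc=1" filter would wrongly keep
# it: the final count is only known after all rows for that ip have
# streamed by, which is what makes the second pass (eventstats) natural.
```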