Splunk Search

dedup or filter row with condition

LearningGuy
Builder


How do I dedup or filter out data with condition?
For example:
Below I want to filter out row that contains name="name0".  
The condition should be able to handle any IPs on the ip field because the IP could change, in the real data the IPs are a lot more.   The name0 is not in order. The dedup/filter should not be applied  to IPs that doesn't contain "name0" AND it should not be applied to unique IP that has "name0"
Thank you for your help.

Data:

ipnamelocation
1.1.1.1name0location-1
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.2name0location-20
1.1.1.3name0location-3
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4
1.1.1.5name0location-0
1.1.1.6name0location-0


Expected output:

ipnamelocation
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4
1.1.1.5name0location-0
1.1.1.6name0location-0

 

 

 

| makeresults format=csv data="ip, name, location
1.1.1.1,	name0,	location-1
1.1.1.1,	name1,	location-1
1.1.1.2,	name2,	location-2
1.1.1.2,	name0,	location-20
1.1.1.3,	name0,	location-3
1.1.1.3,	name3,	location-3
1.1.1.4,	name4,  location-4
1.1.1.4,	name4b, location-4
1.1.1.5,	name0,	location-0
1.1.1.6,	name0,	location-0"

 

 



Labels (3)
Tags (3)
0 Karma
1 Solution

PickleRick
SplunkTrust
SplunkTrust

Of course. That's what you get when you're writing faster than you're thinking 😉

That eventstats should have "BY ip clause" so you get count of distinct values per each separate ip. So

| eventstats dc(name) AS dc BY ip

The rest stays the same.

View solution in original post

PickleRick
SplunkTrust
SplunkTrust

This looks like simple

| dedup ip

If there is some other logic involved, please explain.

LearningGuy
Builder

Hi @PickleRick 

Thank you for your help. I also updated the original post.
The name0 is not in order. The dedup/filter should not be applied  to IP that doesn't contain "name0"

Data:

ipnamelocation
1.1.1.1name0location-1
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.2name0location-20
1.1.1.3name0location-3
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4


Expected output:

ipnamelocation
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4
| makeresults format=csv data="ip, name, location
1.1.1.1,	name0,	location-1
1.1.1.1,	name1,	location-1
1.1.1.2,	name2,	location-2
1.1.1.2,	name0,	location-20
1.1.1.3,	name0,	location-3
1.1.1.3,	name3,	location-3
1.1.1.4,	name4,  location-4
1.1.1.4,	name4b, location-4"

 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

Then just filter out all events with name="name0"

| where name!="name0"

or even

| search name!="name0"

Then you can dedup if needed.

LearningGuy
Builder

Hi @PickleRick 
Sorry I missed another condition. I also updated the initial post.
The name0 is not in order. The dedup/filter should not be applied  to IPs that doesn't contain "name0" AND it should not be applied to unique IP that has "name0"
So,  unique IP like 1.1.1.5 and 1.1.1.6 that has "name0" needs to be remained in the data.   What I did now is to filter out statically, but another IP could show up with the same pattern.   Thank you again for your help

Data:

ipnamelocation
1.1.1.1name0location-1
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.2name0location-20
1.1.1.3name0location-3
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4
1.1.1.5name0location-0
1.1.1.6name0location-0

 

Expected output:

ipnamelocation
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4
1.1.1.5name0location-0
1.1.1.6name0location-0

 

| makeresults format=csv data="ip, name, location
1.1.1.1,	name0,	location-1
1.1.1.1,	name1,	location-1
1.1.1.2,	name2,	location-2
1.1.1.2,	name0,	location-20
1.1.1.3,	name0,	location-3
1.1.1.3,	name3,	location-3
1.1.1.4,	name4,  location-4
1.1.1.4,	name4b, location-4
1.1.1.5,	name0,	location-0
1.1.1.6,	name0,	location-0"

 


 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

You can help yourself and check how many distinct values are there in the name field.

| eventstats dc(name) as dc
| where name!="name0" OR (name=="name0" AND dc=1)

Then you can

| dedup name

f needed

LearningGuy
Builder

Hi @PickleRick 
When I ran the following command, the dc returned 6 for each row

 

 

| eventstats dc(name) as dc

 

 

dciplocationname
61.1.1.1location-1name0
61.1.1.1location-1name1
61.1.1.2location-2name2
61.1.1.2location-20name0
61.1.1.3location-3name0
61.1.1.3location-3name3
61.1.1.4location-4name4
61.1.1.4location-4name4b
61.1.1.5location-0name0
61.1.1.6location-0name0


So, the output are still missing1.1.1.5 and 1.1.1.6  
Only name0 that exists on multiple rows should be removed.     Thanks for your help

 

 

| where name!="name0" OR (name=="name0" AND dc=1)

 

 

output

dciplocationname
61.1.1.1location-1name1
61.1.1.2location-2name2
61.1.1.3location-3name3
61.1.1.4location-4name4
61.1.1.4location-4name4b


Expected output:

iplocationname
1.1.1.1location-1name1
1.1.1.2location-2name2
1.1.1.3location-3name3
1.1.1.4location-4name4
1.1.1.4location-4name4b
1.1.1.5location-0name0
1.1.1.6location-0name0



0 Karma

PickleRick
SplunkTrust
SplunkTrust

Of course. That's what you get when you're writing faster than you're thinking 😉

That eventstats should have "BY ip clause" so you get count of distinct values per each separate ip. So

| eventstats dc(name) AS dc BY ip

The rest stays the same.

LearningGuy
Builder

Hi @PickleRick ,
I tested your suggestion and it worked. Thank you for your help.
1) I added one more case where the IP has an empty name.   I added condition in the where clause (dc=0) and it worked.  I am afraid if I used isnull(name), sometimes it contains " " (empty string). Please let me know if this is doable

2)  Is it possible to do this without using eventstat? 
     I have already used eventstats in the search, but for a different field
     Will that cause any delays or issues?   Did you ever use multiple eventstats in your search?
   Thank you so much for your help

ipnamelocation
1.1.1.1name0location-1
1.1.1.1name1location-1
1.1.1.2name2location-2
1.1.1.2name0location-20
1.1.1.3name0location-3
1.1.1.3name3location-3
1.1.1.4name4location-4
1.1.1.4name4blocation-4
1.1.1.5name0location-0
1.1.1.6name0location-0
1.1.1.7 location-7
| makeresults format=csv data="ip, name, location
1.1.1.1,	name0,	location-1
1.1.1.1,	name1,	location-1
1.1.1.2,	name2,	location-2
1.1.1.2,	name0,	location-20
1.1.1.3,	name0,	location-3
1.1.1.3,	name3,	location-3
1.1.1.4,	name4,  location-4
1.1.1.4,	name4b, location-4
1.1.1.5,	name0,	location-0
1.1.1.6,	name0,	location-0
1.1.1.7,,location-7"
| eventstats dc(name) AS dc BY ip
| where name!="name0" OR dc=0 OR (name=="name0" AND dc=1)

 

0 Karma

PickleRick
SplunkTrust
SplunkTrust

1) I'm honestly not sure what will Splunk do if there is a whitespace-filled value. Try and see 🙂 You're not changing your indexed data after all - you can search it every way you want (if you have too much data to search, just narrow your time range for quick tests).

2) You could try using some streamstats tricks as long as you can make some assumptions on your data (like it being sorted the right way) but that will be bending over backwards. True, eventstats has some limitations when it comes to data set size. You can have however as many eventstats within a single search as you want 🙂

See

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Eventstats#Usage

and

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Eventstats#Functions_and_memory_...

 

LearningGuy
Builder

Hi @PickleRick 

I accepted your previous suggestion using eventstats as solution
Thank you for your help.   Thanks for bringing me into attention that eventstats also have memory limitation, another limits.conf that only admin can modify 🙂     
The default setting is max_mem_usage_mb=200MB. 

How do I know if my dataset will ever hit 200 or not?

Does streamstats have limitation?
How do I use streamstats in my case?    When I changed eventstats to streamstats, it will show incremental data, instead of total, so I need to figure out on how to filter out if it's possible.

Thanks
Streamstats result

dc ip location name

dc   iplocationname
11.1.1.1location-1name0
21.1.1.1location-1name1
11.1.1.2location-2name2
11.1.1.3location-3name0
21.1.1.3location-3name3
11.1.1.4location-4name4
21.1.1.4location-4name4b
11.1.1.5location-0name0
11.1.1.6location-0name0
01.1.1.7location-7 
0 Karma

PickleRick
SplunkTrust
SplunkTrust

All them commands have some limitations. Some more explicitly configured, some implicitly derived from operating system limits.

Anyway.

Simply replacing eventstats with streamstats obviously won't do. With streamstats you'd need some copying over values and I don't have a ready solution. I have a hunch it's possible but will be ugly and possibly even less efficient than eventstats based one.

Get Updates on the Splunk Community!

How to Monitor Google Kubernetes Engine (GKE)

We’ve looked at how to integrate Kubernetes environments with Splunk Observability Cloud, but what about ...

Index This | How can you make 45 using only 4?

October 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with this ...

Splunk Education Goes to Washington | Splunk GovSummit 2024

If you’re in the Washington, D.C. area, this is your opportunity to take your career and Splunk skills to the ...