How do I dedup or filter out data with a condition?
For example:
Below, I want to filter out rows that contain name="name0".
The condition should handle any IPs in the ip field because the IPs can change; the real data contains many more IPs. The name0 rows are not in any particular order. The dedup/filter should not be applied to IPs that don't contain "name0", AND it should not be applied to a unique IP that only has "name0".
Thank you for your help.
Data:
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
Expected output:
ip | name | location |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4
1.1.1.5, name0, location-0
1.1.1.6, name0, location-0"
This looks like a simple
| dedup ip
If there is some other logic involved, please explain.
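For readers who want to trace the behavior outside Splunk, here is a rough Python sketch of what `| dedup ip` does: keep only the first row seen for each ip value. This is only an illustration using a few of the sample rows, not Splunk's implementation.

```python
# Illustrative sketch of "| dedup ip": keep the first row seen for each ip.
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
]

seen_ips = set()
deduped = []
for ip, name, location in rows:
    if ip not in seen_ips:
        seen_ips.add(ip)
        deduped.append((ip, name, location))
```

Note that this keeps whichever row happens to come first for each ip (here the name0 rows), which is why plain `dedup ip` alone does not produce the expected output above.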
Hi @PickleRick
Thank you for your help. I also updated the original post.
The name0 rows are not in any particular order. The dedup/filter should not be applied to IPs that don't contain "name0".
Data:
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
Expected output:
ip | name | location |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4"
Then just filter out all events with name="name0"
| where name!="name0"
or even
| search name!="name0"
Then you can dedup if needed.
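The same filter expressed as a plain Python sketch over the eight sample rows (illustrative only):

```python
# Illustrative sketch of `| where name!="name0"` on the sample rows.
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
    ("1.1.1.3", "name0", "location-3"),
    ("1.1.1.3", "name3", "location-3"),
    ("1.1.1.4", "name4", "location-4"),
    ("1.1.1.4", "name4b", "location-4"),
]

filtered = [(ip, name, loc) for ip, name, loc in rows if name != "name0"]
```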
Hi @PickleRick
Sorry I missed another condition. I also updated the initial post.
The name0 rows are not in any particular order. The dedup/filter should not be applied to IPs that don't contain "name0", AND it should not be applied to a unique IP that only has "name0".
So, unique IPs like 1.1.1.5 and 1.1.1.6 that only have "name0" need to remain in the data. What I do now is filter them out statically, but another IP could show up with the same pattern. Thank you again for your help.
Data:
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
Expected output:
ip | name | location |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4
1.1.1.5, name0, location-0
1.1.1.6, name0, location-0"
You can help yourself by checking how many distinct values there are in the name field.
| eventstats dc(name) as dc
| where name!="name0" OR (name=="name0" AND dc=1)
Then you can
| dedup name
if needed.
Hi @PickleRick
When I ran the following command, the dc returned 6 for each row
| eventstats dc(name) as dc
dc | ip | location | name |
6 | 1.1.1.1 | location-1 | name0 |
6 | 1.1.1.1 | location-1 | name1 |
6 | 1.1.1.2 | location-2 | name2 |
6 | 1.1.1.2 | location-20 | name0 |
6 | 1.1.1.3 | location-3 | name0 |
6 | 1.1.1.3 | location-3 | name3 |
6 | 1.1.1.4 | location-4 | name4 |
6 | 1.1.1.4 | location-4 | name4b |
6 | 1.1.1.5 | location-0 | name0 |
6 | 1.1.1.6 | location-0 | name0 |
So, the output is still missing 1.1.1.5 and 1.1.1.6.
Only name0 rows on IPs that also have other names should be removed. Thanks for your help.
| where name!="name0" OR (name=="name0" AND dc=1)
output
dc | ip | location | name |
6 | 1.1.1.1 | location-1 | name1 |
6 | 1.1.1.2 | location-2 | name2 |
6 | 1.1.1.3 | location-3 | name3 |
6 | 1.1.1.4 | location-4 | name4 |
6 | 1.1.1.4 | location-4 | name4b |
Expected output:
ip | location | name |
1.1.1.1 | location-1 | name1 |
1.1.1.2 | location-2 | name2 |
1.1.1.3 | location-3 | name3 |
1.1.1.4 | location-4 | name4 |
1.1.1.4 | location-4 | name4b |
1.1.1.5 | location-0 | name0 |
1.1.1.6 | location-0 | name0 |
Of course. That's what you get when you're writing faster than you're thinking 😉
That eventstats should have a "BY ip" clause so you get the count of distinct values per each separate ip. So
| eventstats dc(name) AS dc BY ip
The rest stays the same.
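To make the per-ip logic concrete, here is a Python sketch of what `eventstats dc(name) AS dc BY ip` followed by the `where` clause computes on the sample rows (illustrative only, not how Splunk implements it):

```python
from collections import defaultdict

# Sample rows: (ip, name, location)
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
    ("1.1.1.3", "name0", "location-3"),
    ("1.1.1.3", "name3", "location-3"),
    ("1.1.1.4", "name4", "location-4"),
    ("1.1.1.4", "name4b", "location-4"),
    ("1.1.1.5", "name0", "location-0"),
    ("1.1.1.6", "name0", "location-0"),
]

# eventstats dc(name) AS dc BY ip: distinct name count per ip
names_per_ip = defaultdict(set)
for ip, name, location in rows:
    names_per_ip[ip].add(name)

# where name!="name0" OR (name=="name0" AND dc=1)
result = [
    (ip, name, location)
    for ip, name, location in rows
    if name != "name0" or len(names_per_ip[ip]) == 1
]
```

This keeps the name0 rows only for IPs whose distinct name count is 1 (1.1.1.5 and 1.1.1.6), matching the expected output.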
Hi @PickleRick ,
I tested your suggestion and it worked. Thank you for your help.
1) I added one more case where the IP has an empty name. I added a condition in the where clause (dc=0) and it worked. I am afraid that if I used isnull(name), it might miss cases where the field contains " " (an empty/whitespace string). Please let me know if this is doable.
2) Is it possible to do this without using eventstats?
I have already used eventstats in the search, but for a different field
Will that cause any delays or issues? Have you ever used multiple eventstats in one search?
Thank you so much for your help
ip | name | location |
1.1.1.1 | name0 | location-1 |
1.1.1.1 | name1 | location-1 |
1.1.1.2 | name2 | location-2 |
1.1.1.2 | name0 | location-20 |
1.1.1.3 | name0 | location-3 |
1.1.1.3 | name3 | location-3 |
1.1.1.4 | name4 | location-4 |
1.1.1.4 | name4b | location-4 |
1.1.1.5 | name0 | location-0 |
1.1.1.6 | name0 | location-0 |
1.1.1.7 | | location-7 |
| makeresults format=csv data="ip, name, location
1.1.1.1, name0, location-1
1.1.1.1, name1, location-1
1.1.1.2, name2, location-2
1.1.1.2, name0, location-20
1.1.1.3, name0, location-3
1.1.1.3, name3, location-3
1.1.1.4, name4, location-4
1.1.1.4, name4b, location-4
1.1.1.5, name0, location-0
1.1.1.6, name0, location-0
1.1.1.7,,location-7"
| eventstats dc(name) AS dc BY ip
| where name!="name0" OR dc=0 OR (name=="name0" AND dc=1)
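As a Python sketch of the same per-ip count with the empty-name case included (illustrative only; note one SPL-specific wrinkle called out in the comments):

```python
from collections import defaultdict

# Sample rows, including 1.1.1.7 with an empty name: (ip, name, location)
rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
    ("1.1.1.2", "name0", "location-20"),
    ("1.1.1.3", "name0", "location-3"),
    ("1.1.1.3", "name3", "location-3"),
    ("1.1.1.4", "name4", "location-4"),
    ("1.1.1.4", "name4b", "location-4"),
    ("1.1.1.5", "name0", "location-0"),
    ("1.1.1.6", "name0", "location-0"),
    ("1.1.1.7", "", "location-7"),
]

# dc() ignores null values; here we also skip whitespace-only names,
# which is exactly the case an isnull() check would miss.
names_per_ip = defaultdict(set)
for ip, name, location in rows:
    if name.strip():
        names_per_ip[ip].add(name)

# Keep the row unless it is a name0 on an ip that also has other names.
# (In SPL, a null name makes name!="name0" not evaluate to true, which is
# why the extra dc=0 clause is needed there; in Python, "" != "name0" is
# already True, so one condition covers both cases.)
result = [
    (ip, name, location)
    for ip, name, location in rows
    if name != "name0" or len(names_per_ip[ip]) <= 1
]
```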
1) I'm honestly not sure what Splunk will do if there is a whitespace-filled value. Try and see 🙂 You're not changing your indexed data, after all - you can search it any way you want (if you have too much data to search, just narrow your time range for quick tests).
2) You could try some streamstats tricks as long as you can make some assumptions about your data (like it being sorted the right way), but that would be bending over backwards. True, eventstats has some limitations when it comes to data set size. You can, however, have as many eventstats within a single search as you want 🙂
See
https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Eventstats#Usage
Hi @PickleRick
I accepted your previous suggestion using eventstats as solution
Thank you for your help. Thanks for bringing to my attention that eventstats also has a memory limitation, another limits.conf setting that only an admin can modify 🙂
The default setting is max_mem_usage_mb = 200 (MB).
How do I know if my dataset will ever hit 200 or not?
Does streamstats have limitation?
How do I use streamstats in my case? When I changed eventstats to streamstats, it showed incremental counts instead of totals, so I need to figure out how to filter on that, if it's possible.
Thanks
Streamstats result
dc | ip | location | name |
1 | 1.1.1.1 | location-1 | name0 |
2 | 1.1.1.1 | location-1 | name1 |
1 | 1.1.1.2 | location-2 | name2 |
1 | 1.1.1.3 | location-3 | name0 |
2 | 1.1.1.3 | location-3 | name3 |
1 | 1.1.1.4 | location-4 | name4 |
2 | 1.1.1.4 | location-4 | name4b |
1 | 1.1.1.5 | location-0 | name0 |
1 | 1.1.1.6 | location-0 | name0 |
0 | 1.1.1.7 | location-7 | |
All those commands have some limitations. Some are more explicitly configured, some implicitly derived from operating system limits.
Anyway.
Simply replacing eventstats with streamstats obviously won't do. With streamstats you'd need to do some copying over of values, and I don't have a ready solution. I have a hunch it's possible, but it would be ugly and possibly even less efficient than the eventstats-based one.
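A minimal Python sketch of why a straight swap fails (assuming the rows arrive in the order shown): a streamstats-style running distinct count only reflects the rows seen so far, matching the dc=1 shown on the first 1.1.1.1 name0 row in the streamstats result above.

```python
from collections import defaultdict

rows = [
    ("1.1.1.1", "name0", "location-1"),
    ("1.1.1.1", "name1", "location-1"),
    ("1.1.1.2", "name2", "location-2"),
]

# streamstats-style running distinct name count per ip
seen = defaultdict(set)
running_dc = []
for ip, name, location in rows:
    seen[ip].add(name)
    running_dc.append(len(seen[ip]))

# The first 1.1.1.1 row (a name0 that must be dropped) gets a running
# count of 1, so a "keep name0 only when dc=1" filter would wrongly keep
# it: the final count is only known after all rows for that ip have
# streamed by, which is what makes the second pass (eventstats) natural.
```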