Solved: Re: How to search for matches in two different sea...

h3llocomputer · ‎04-26-2017

I am trying to use Splunk to produce a list of AWS VMs that are both internet facing AND allowing any traffic on any port. I have two searches and need to compare a column from each to get what I'm looking for, but I can not figure out how to accomplish this!

Search 1 - Gives me a list of VMs (just their system IDs) that are allowing traffic on any port:

* host="ourhost.com" source="ec2_security_groups" sourcetype="aws:description" | dedup id | search "rules{}.grants{}.cidr_ip"="0.0.0.0/0" | search "rules{}.from_port"="null" | search "instances{}.id"!="null" | table instances{}.id

Search 2 - Gives me a list of VMs (system IDs in one column, friendly names in another) that have Public IP addresses:

* host="ourhost.com" source="ec2_instances" sourcetype="aws:description"  | search "ip_address"!="null" | dedup id | table id, tags.Name

I need to know if any of the system IDs from Search 1 appear in the list from Search 2. If so, I want the system ID and the friendly name returned.

Any help or insight would be much appreciated. Thanks!

DalJeanis · ‎04-26-2017

There are roughly a dozen ways to do it. Here's a couple. The efficiency depends on the data, so you can try any of them and see what happens.

FORMAT SUBSEARCH

This way uses the implicit "format" command of a subsearch. Everything in square braces ends up being formatted into a string that looks to the system like (id=value1 OR id=value2 OR id=value3...)

index=foo host="ourhost.com" source="ec2_instances" sourcetype="aws:description" 
    [ search index=foo host="ourhost.com" source="ec2_security_groups" sourcetype="aws:description" 
    | dedup id | search "rules{}.grants{}.cidr_ip"="0.0.0.0/0" | search "rules{}.from_port"="null" 
    | search "instances{}.id"!="null" | table instances{}.id | rename instances{}.id as id
    | format ]
 | search "ip_address"!="null" | dedup id | table id, tags.Name

That will work as long as you don't have a crazy high number of ids returned from the first search. If you have a small number of relevant ids and a large cardinality on the ids --- ie you have 1000+ ids and only 10 of them have null ports--- this one might be the best way to go.

BASIC JOIN

This way turns around the results and uses a join. This will work for up to 50K results from the subsearch.

index=foo host="ourhost.com" source="ec2_instances" sourcetype="aws:description"
| search "ip_address"!="null" | dedup id | table id, tags.Name
| join id 
    [ search index=foo host="ourhost.com" source="ec2_security_groups" sourcetype="aws:description" 
    | dedup id | search "rules{}.grants{}.cidr_ip"="0.0.0.0/0" | search "rules{}.from_port"="null" 
    | search "instances{}.id"!="null" | table instances{}.id | rename instances{}.id as id
    ]

It's a pretty usual and useful method. I often will use a join type=left even when I'm discarding the unmatched records, because I like to see under the covers. (I'm a belt AND suspenders AND safety pins AND duct tape kind of guy.)

SPLUNK STEW

Now, THIS is the generally preferred way that you should do it if you can-- and in this case, it's perfect. I call it the "splunk stew" method. You put all the data together in a pot and stir until it melts into whatever you want.

index=foo host="ourhost.com" sourcetype="aws:description"  
(source="ec2_security_groups" OR source="ec2_instances") 
| search ((source="ec2_instances") AND ("ip_address"!="null")) 
    OR   ((source="ec2_security_groups") AND ("rules{}.grants{}.cidr_ip"="0.0.0.0/0") AND
            ("rules{}.from_port"="null") AND ("instances{}.id"!="null")) 
| eval id=coalesce(id,instances{}.id)
| stats values(tags.Name) as Name dc(source) as foundboth by id
| where (isnotnull(Name) AND (foundboth==2))

The reason why this method is generally preferred is because there are no limits on it. Subsearch limits are not implicated because there is no subsearch involved; system is just collecting ALL the data from whatever indexers it is on and then grouping it as requested. You'll notice I didn't dedup the results, because the stats command values() will take care of that automatically and the two different sources need to both be retained (if they exist at all).

LOOKUPS, CSVS, ETC

You can configure one of the searches to write to a CSV or lookup and then pull it back in to the other search. In some instances, that can be really effective. You could use collect to write a big chunk of data to a summary index if you wanted, and use that as the basis for your query. You can go hog wild and bang that data together a whole bunch more ways.

In most cases, however, you want to try for splunk stew unless it's just too damn ugly. That OR/AND search is just about the edge of what I consider readable, but the overall solution is pretty simple so we're good enough here.

View solution in original post

DalJeanis · ‎04-26-2017

There are roughly a dozen ways to do it. Here's a couple. The efficiency depends on the data, so you can try any of them and see what happens.

FORMAT SUBSEARCH

This way uses the implicit "format" command of a subsearch. Everything in square braces ends up being formatted into a string that looks to the system like (id=value1 OR id=value2 OR id=value3...)

index=foo host="ourhost.com" source="ec2_instances" sourcetype="aws:description" 
    [ search index=foo host="ourhost.com" source="ec2_security_groups" sourcetype="aws:description" 
    | dedup id | search "rules{}.grants{}.cidr_ip"="0.0.0.0/0" | search "rules{}.from_port"="null" 
    | search "instances{}.id"!="null" | table instances{}.id | rename instances{}.id as id
    | format ]
 | search "ip_address"!="null" | dedup id | table id, tags.Name

That will work as long as you don't have a crazy high number of ids returned from the first search. If you have a small number of relevant ids and a large cardinality on the ids --- ie you have 1000+ ids and only 10 of them have null ports--- this one might be the best way to go.

BASIC JOIN

This way turns around the results and uses a join. This will work for up to 50K results from the subsearch.

index=foo host="ourhost.com" source="ec2_instances" sourcetype="aws:description"
| search "ip_address"!="null" | dedup id | table id, tags.Name
| join id 
    [ search index=foo host="ourhost.com" source="ec2_security_groups" sourcetype="aws:description" 
    | dedup id | search "rules{}.grants{}.cidr_ip"="0.0.0.0/0" | search "rules{}.from_port"="null" 
    | search "instances{}.id"!="null" | table instances{}.id | rename instances{}.id as id
    ]

It's a pretty usual and useful method. I often will use a join type=left even when I'm discarding the unmatched records, because I like to see under the covers. (I'm a belt AND suspenders AND safety pins AND duct tape kind of guy.)

SPLUNK STEW

Now, THIS is the generally preferred way that you should do it if you can-- and in this case, it's perfect. I call it the "splunk stew" method. You put all the data together in a pot and stir until it melts into whatever you want.

index=foo host="ourhost.com" sourcetype="aws:description"  
(source="ec2_security_groups" OR source="ec2_instances") 
| search ((source="ec2_instances") AND ("ip_address"!="null")) 
    OR   ((source="ec2_security_groups") AND ("rules{}.grants{}.cidr_ip"="0.0.0.0/0") AND
            ("rules{}.from_port"="null") AND ("instances{}.id"!="null")) 
| eval id=coalesce(id,instances{}.id)
| stats values(tags.Name) as Name dc(source) as foundboth by id
| where (isnotnull(Name) AND (foundboth==2))

The reason why this method is generally preferred is because there are no limits on it. Subsearch limits are not implicated because there is no subsearch involved; system is just collecting ALL the data from whatever indexers it is on and then grouping it as requested. You'll notice I didn't dedup the results, because the stats command values() will take care of that automatically and the two different sources need to both be retained (if they exist at all).

LOOKUPS, CSVS, ETC

You can configure one of the searches to write to a CSV or lookup and then pull it back in to the other search. In some instances, that can be really effective. You could use collect to write a big chunk of data to a summary index if you wanted, and use that as the basis for your query. You can go hog wild and bang that data together a whole bunch more ways.

In most cases, however, you want to try for splunk stew unless it's just too damn ugly. That OR/AND search is just about the edge of what I consider readable, but the overall solution is pretty simple so we're good enough here.

h3llocomputer · ‎04-27-2017

This is awesome! I tried the first method since we don't have a huge number of instances right now (it worked!), though I think I'll try out the "Splunk Stew" method to see if I can expand the search to VMs allowing traffic on specific ports, and output which port. Thanks again!

prsshini · ‎05-18-2018

Hi DalJeanis,

I have been trying to resolve almost the same problem but I am with no luck in any of the methods above.

I have 2 queries
1. REsults all available hosts
index=Powershell sourcetype=mysourcetype source="powershell://mysouce" host="USP1*Admin" | dedup host | stats list(host)

REsults hosts only my software is installed.

index=Powershell sourcetype=mysourcetype source="powershell://mysouce" Name= mysoftware | dedup host, Name | stats list(host)

I want to show the list of hosts where my software is installed and the hosts where they are not installed.

can you pls help me.

DalJeanis · ‎04-27-2017

You are quite welcome! Yep, that's the one to practice until it feels natural. The key to making it work is to look at each format of record you might receive, in isolation, and code the evals or other commands so that they work independently on each format of record, as if none of the other formats existed. That way, when you smoosh them all together with the stats command, everything comes out as desired.

One other trick I learned from somesoni2 - if you have multiple different kinds of linked records, but they don't all contain a single key, you can use eventstats to copy a secondary key to the records that are missing it. This example is too simple, just copying a data field, but the process works the same. I'll post a link here if I find a better example of the technique.

Example -
Record type A has userid, username
Record type B has userid, sessionid
Record type C has sessionid, sourceip, destip

| eventstats values(username) by userid
| stats values(username) as username values(sourceip) as sourceip values(destip) as destip by sessionid

How to search for matches in two different searches?

Updated Team Landing Page in Splunk Observability

New! Splunk Observability Search Enhancements for Splunk APM Services/Traces and ...

Webinar Recap | Revolutionizing IT Operations: The Transformative Power of AI and ML ...