Splunk Search

How to pass multiple values from inner search to outer search?

Allene139
Explorer

Hi Folks, 

I could use some help with this query.

 

index=address_index earliest=-30m address
[ search index=registration_index earliest=-30m 
| `get_ip_location(src_ip)`
| rename user as email
| dedup email
| table email src_ip ip_location
| return 15 $email]
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=lower(trim(EMAIL_ADDRESS))
| eval city=lower(trim(CITY))
| eval address=lower(trim(ADDRESS1))
| eval state=lower(trim(STATE))
| stats values(city) as city values(state) as state values(address) as address by email

 

 

The inner search looks for all the registrations for the past 30 mins. Then, the return command passes the email to the outer search, which then queries the address index for an address on file according to the email.
my goal, right now, is to pass 2 parameters to the outer search, an email and the src_ip/ip_location.

problem: when I attempt to add a second parameter to the return command, in addition to email, the query no longer works.

The ultimate goal is to build a search that queries registrations from met online, use the get_ip_location on the originating IP address, then compare that ip_location with their address on file (which is usually in the address index).

However, when I try to following query, I get no results:

 

index=address_index earliest=-30m address
[ search index=registration_index earliest=-30m 
| `get_ip_location(src_ip)`
| rename user as email
| dedup email
| table email src_ip ip_location
| return 15 $email $ip_location]
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=lower(trim(EMAIL_ADDRESS))
| eval city=lower(trim(CITY))
| eval address=lower(trim(ADDRESS1))
| eval state=lower(trim(STATE))
| stats values(city) as city values(state) as state values(address) as address by email ip_location

 

How can I pass these 2 values, $email and $ip_location, to the outer search?

Labels (2)
0 Karma
1 Solution

johnhuang
Motivator

You can try combining the 2 data sources in the base search and  "join"  (group) them together using stats.

I didn't test the query below. It could look something like this:

(index=address_index address) OR (index=registration_index) earliest=-30m 
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email

 

View solution in original post

0 Karma

johnhuang
Motivator

You can try combining the 2 data sources in the base search and  "join"  (group) them together using stats.

I didn't test the query below. It could look something like this:

(index=address_index address) OR (index=registration_index) earliest=-30m 
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email

 

0 Karma

yuanliu
SplunkTrust
SplunkTrust

I agree with @johnhuang that subsearch return value is not the best method to attain your goal.  Returned values will be used as literal strings in the main search, which do not always get what you expect.  In your case, ip_location returned from `get_ip_location(src_ip)` is likely a combination of city, address, state in a format that is not the same string as laid out as in the address index.  You need to reorganize city, address, state in that particular format for comparison; alternatively, break ip_location down to city, address, state in the format used in the address index.

0 Karma

Allene139
Explorer

Im not so sure reorganizing city, address, state in that particular format will be helpful. I could easily combine those values. also, the get_ip_location outputs the whole address or just the partial.

I must combine two queries: 1 that gets the ip address and geoloaction at the time of registration from the registration index, and a second separate query that gets the address on file from a separate. Then I have to compare the 2: the ip GeoLocation and address on file.

0 Karma

johnhuang
Motivator

Ideally you should put the list of known addresses in a lookup, either csv or kvstore. Could you give me an idea of how many unique users are in the address index?

e.g. 

<base search> -12mon@mon
| dedup user
| stats min(_time) AS _time dc(user) AS user_ct
| eval oldest_event=strftime(_time, "%Y-%m-%d")
| table oldest_event user_ct

 

0 Karma

Allene139
Explorer

John, there are over one million users. to put them all in a lookup, imo, wouldn't be feasible.

0 Karma

johnhuang
Motivator

A million users with these few fields should total around 100 MB which is doable for csv lookups and a KVStore could definitely handle it. If this is something that you need to do often, you should consider putting this into a lookup.

0 Karma

Allene139
Explorer

John, thank you so much for your help. In the end, i've decided to put all this info into a summary index. Once again, you folks have been deeply helpful!

0 Karma

yuanliu
SplunkTrust
SplunkTrust

I understand what you want to compare, but cannot visualize your data without illustration.  I can imagine that `get_ip_location()` returns a string like "123 Main St, Sommerville, Sommstate".  According to your sample code, address index returns an XML document, which probably contains something like

<address>123 Main St</address> <city>Sommerville</city> <state>Sommstate</state>

That is why the subsearch contradicts the main search even upon an absolute match.  The combined search cannot return anything.

Given that johnhua is on the right track, can you illustrate what result you get from that code and explain why the output doesn't meet your requirement? (When I say "reorganizing city, address, state in that particular format," I meant to reorganize after the stats command, not in the original search.)  "Didn't work" conveys no information here.

I also have a curious question: Is address_index completely rewritten every 30 minutes or less?  In other words, does 

index=address_index earliest=-30m address

 return EVERY address?  If not, I suggest to modify the first line to 

(index=address_index address) OR (index=registration_index earliest=-30m)

and adjust the main search window so (index=address_index address) can contain every address of interest.

Allene139
Explorer

ah okay. Apologies for the confusion. Let me see if I can visualize it for you.

so the first first search, the one of the registration index, will return all users who are registering in the last 30m. the results will look like this:

emailsrc_ipip_location
John_smith@fake.com123.123.123.123Wakefield, Massachusetts, United States

 

Now, I would like to do a secondary search that queries the address index using the email address above, thus returning the address on file:

usercity_on_filestate_on_fileaddress_on_file
John_smith@fake.comBostonMassachusetts123 fake street, Boston Massachusetts 

you are correct in assuming that this ^ info is in XML format.

Ideally I want the end result to look like this:

email/usersrc_ipip_locationcity_on_filestate_on_fileaddress_on_file
John_smith@fake.com123.123.123.123Wakefield, Massachusetts, United StatesBostonMassachusetts123 fake street, Boston Massachusetts

 

In this particular example, john smith has registered from Wakefield, yet his address on file is boston. Given that IP address GeoLocations are always 100% accurate, id say this may be a false alarm. However, my goal is capture registrations that are very far away from the address on file. I will likely compare states, rather than cities. So, if John Smith were to register in California, yet his address on file states he lives in Massachusetts, then an alert will be raised. 

address_Index is massive. I am using only the past 30 mins for testing purposes. once I get the code down, I will likely do the past 24h. 

`get_ip_location()` does return strings like "123 Main St, Sommerville, Sommstate" but it also returns weather or not the IP is from a anonomizer VPN, it can reutrn just the city or or even just the state.

0 Karma

yuanliu
SplunkTrust
SplunkTrust

@Allene139 Thank you for illustrating data and explain the requirements.  The logic you want to apply on data is sound, and @johnhuang's code is leading to that direction.  Can you explain what you get from that code?

Allene139
Explorer

Yuanliu, I ran John's query and I am getting a rough idea of what I want. 

emailcity_on_filestate_on_fileaddress_on_fileip_locationsrc_ip
john smith   Big Rapis Michigan123.123.123.123
jane doeatlantaga123 fake street  
0 Karma

yuanliu
SplunkTrust
SplunkTrust

This result is the exact format that you indicated earlier.  The reason why one row shows no data from address index, the other row shows no data from registration index is because the search hasn't found a match.

The most likely reason why no match is found is as I speculated earlier: the search window of 30m in address index cannot return enough users to make a match.  @johnhuang's suggestion about dumping extracted address index into a lookup file is a feasible method to make sure any user that has a match can be found.  I understand that CSV lookup in a million rows can be painful. (I am running some million+ lookups myself.) But it is better than missing matches, or running time-series search in address index that crosses perhaps hundreds of millions of records.  An alternative to CSV is to dump that info into a KV store.

For now, you can run an extreme search just to verify that johnhua's method meet your requirements:

(index=address_index address earliest=0) OR (index=registration_index earliest=-30m)
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email
| where isnotnull(ip_location))

The only difference here is that

  1. Run a complete search in address index (because I don't know how to retrieve every user in that index). You can change earliest to, say, -1w if you are confident that every user can be retrieved in that fashion.
  2. Only display matched records.
0 Karma

johnhuang
Motivator

What's your current query? Let's optimize it.

0 Karma

Allene139
Explorer

hi john

this didnt work, but I appreciate the effort 

Tags (1)
0 Karma
Get Updates on the Splunk Community!

App Building 101 - Build Your First App!

WATCH RECORDING NOW   Tech Talk: App Dev Edition Splunk has tons of out-of-the-box functionality, and you’ve ...

Introducing support for Amazon Data Firehose in Splunk Edge Processor

We’re excited to announce a powerful update to Splunk Data Management with added support for Amazon Data ...

The Observability Round-Up: September 2024

What’s up Splunk Community! Welcome to the latest edition of the Observability Round-Up, a monthly series in ...