Solved: How to pass multiple values from inner search to o...

Allene139 · ‎09-30-2022

Hi Folks,

I could use some help with this query.

index=address_index earliest=-30m address
[ search index=registration_index earliest=-30m 
| `get_ip_location(src_ip)`
| rename user as email
| dedup email
| table email src_ip ip_location
| return 15 $email]
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=lower(trim(EMAIL_ADDRESS))
| eval city=lower(trim(CITY))
| eval address=lower(trim(ADDRESS1))
| eval state=lower(trim(STATE))
| stats values(city) as city values(state) as state values(address) as address by email

The inner search looks for all the registrations for the past 30 mins. Then, the return command passes the email to the outer search, which then queries the address index for an address on file according to the email.
my goal, right now, is to pass 2 parameters to the outer search, an email and the src_ip/ip_location.

problem: when I attempt to add a second parameter to the return command, in addition to email, the query no longer works.

The ultimate goal is to build a search that queries registrations from met online, use the get_ip_location on the originating IP address, then compare that ip_location with their address on file (which is usually in the address index).

However, when I try to following query, I get no results:

index=address_index earliest=-30m address
[ search index=registration_index earliest=-30m 
| `get_ip_location(src_ip)`
| rename user as email
| dedup email
| table email src_ip ip_location
| return 15 $email $ip_location]
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=lower(trim(EMAIL_ADDRESS))
| eval city=lower(trim(CITY))
| eval address=lower(trim(ADDRESS1))
| eval state=lower(trim(STATE))
| stats values(city) as city values(state) as state values(address) as address by email ip_location

How can I pass these 2 values, $email and $ip_location, to the outer search?

johnhuang · ‎09-30-2022

You can try combining the 2 data sources in the base search and "join" (group) them together using stats.

I didn't test the query below. It could look something like this:

(index=address_index address) OR (index=registration_index) earliest=-30m 
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email

View solution in original post

johnhuang · ‎09-30-2022

You can try combining the 2 data sources in the base search and "join" (group) them together using stats.

I didn't test the query below. It could look something like this:

(index=address_index address) OR (index=registration_index) earliest=-30m 
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email

yuanliu · ‎09-30-2022

I agree with @johnhuang that subsearch return value is not the best method to attain your goal. Returned values will be used as literal strings in the main search, which do not always get what you expect. In your case, ip_location returned from `get_ip_location(src_ip)` is likely a combination of city, address, state in a format that is not the same string as laid out as in the address index. You need to reorganize city, address, state in that particular format for comparison; alternatively, break ip_location down to city, address, state in the format used in the address index.

Allene139 · ‎09-30-2022

Im not so sure reorganizing city, address, state in that particular format will be helpful. I could easily combine those values. also, the get_ip_location outputs the whole address or just the partial.

I must combine two queries: 1 that gets the ip address and geoloaction at the time of registration from the registration index, and a second separate query that gets the address on file from a separate. Then I have to compare the 2: the ip GeoLocation and address on file.

johnhuang · ‎09-30-2022

Ideally you should put the list of known addresses in a lookup, either csv or kvstore. Could you give me an idea of how many unique users are in the address index?

e.g.

<base search> -12mon@mon
| dedup user
| stats min(_time) AS _time dc(user) AS user_ct
| eval oldest_event=strftime(_time, "%Y-%m-%d")
| table oldest_event user_ct

Allene139 · ‎09-30-2022

John, there are over one million users. to put them all in a lookup, imo, wouldn't be feasible.

johnhuang · ‎09-30-2022

A million users with these few fields should total around 100 MB which is doable for csv lookups and a KVStore could definitely handle it. If this is something that you need to do often, you should consider putting this into a lookup.

Allene139 · ‎10-03-2022

John, thank you so much for your help. In the end, i've decided to put all this info into a summary index. Once again, you folks have been deeply helpful!

yuanliu · ‎09-30-2022

I understand what you want to compare, but cannot visualize your data without illustration. I can imagine that `get_ip_location()` returns a string like "123 Main St, Sommerville, Sommstate". According to your sample code, address index returns an XML document, which probably contains something like

<address>123 Main St</address> <city>Sommerville</city> <state>Sommstate</state>

That is why the subsearch contradicts the main search even upon an absolute match. The combined search cannot return anything.

Given that johnhua is on the right track, can you illustrate what result you get from that code and explain why the output doesn't meet your requirement? (When I say "reorganizing city, address, state in that particular format," I meant to reorganize after the stats command, not in the original search.) "Didn't work" conveys no information here.

I also have a curious question: Is address_index completely rewritten every 30 minutes or less? In other words, does

index=address_index earliest=-30m address

return EVERY address? If not, I suggest to modify the first line to

(index=address_index address) OR (index=registration_index earliest=-30m)

and adjust the main search window so (index=address_index address) can contain every address of interest.

Allene139 · ‎09-30-2022

ah okay. Apologies for the confusion. Let me see if I can visualize it for you.

so the first first search, the one of the registration index, will return all users who are registering in the last 30m. the results will look like this:

email	src_ip	ip_location
John_smith@fake.com	123.123.123.123	Wakefield, Massachusetts, United States

Now, I would like to do a secondary search that queries the address index using the email address above, thus returning the address on file:

user	city_on_file	state_on_file	address_on_file
John_smith@fake.com	Boston	Massachusetts	123 fake street, Boston Massachusetts

you are correct in assuming that this ^ info is in XML format.

Ideally I want the end result to look like this:

email/user	src_ip	ip_location	city_on_file	state_on_file	address_on_file
John_smith@fake.com	123.123.123.123	Wakefield, Massachusetts, United States	Boston	Massachusetts	123 fake street, Boston Massachusetts

In this particular example, john smith has registered from Wakefield, yet his address on file is boston. Given that IP address GeoLocations are always 100% accurate, id say this may be a false alarm. However, my goal is capture registrations that are very far away from the address on file. I will likely compare states, rather than cities. So, if John Smith were to register in California, yet his address on file states he lives in Massachusetts, then an alert will be raised.

address_Index is massive. I am using only the past 30 mins for testing purposes. once I get the code down, I will likely do the past 24h.

`get_ip_location()` does return strings like "123 Main St, Sommerville, Sommstate" but it also returns weather or not the IP is from a anonomizer VPN, it can reutrn just the city or or even just the state.

yuanliu · ‎09-30-2022

@Allene139 Thank you for illustrating data and explain the requirements. The logic you want to apply on data is sound, and @johnhuang's code is leading to that direction. Can you explain what you get from that code?

Allene139 · ‎09-30-2022

Yuanliu, I ran John's query and I am getting a rough idea of what I want.

email	city_on_file	state_on_file	address_on_file	ip_location	src_ip
john smith				Big Rapis Michigan	123.123.123.123
jane doe	atlanta	ga	123 fake street

yuanliu · ‎09-30-2022

This result is the exact format that you indicated earlier. The reason why one row shows no data from address index, the other row shows no data from registration index is because the search hasn't found a match.

The most likely reason why no match is found is as I speculated earlier: the search window of 30m in address index cannot return enough users to make a match. @johnhuang's suggestion about dumping extracted address index into a lookup file is a feasible method to make sure any user that has a match can be found. I understand that CSV lookup in a million rows can be painful. (I am running some million+ lookups myself.) But it is better than missing matches, or running time-series search in address index that crosses perhaps hundreds of millions of records. An alternative to CSV is to dump that info into a KV store.

For now, you can run an extreme search just to verify that johnhua's method meet your requirements:

(index=address_index address earliest=0) OR (index=registration_index earliest=-30m)
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email
| where isnotnull(ip_location))

The only difference here is that

Run a complete search in address index (because I don't know how to retrieve every user in that index). You can change earliest to, say, -1w if you are confident that every user can be retrieved in that fashion.
Only display matched records.

johnhuang · ‎09-30-2022

What's your current query? Let's optimize it.

Allene139 · ‎09-30-2022

hi john

this didnt work, but I appreciate the effort

How to pass multiple values from inner search to outer search?

stats

subsearch

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

[Puzzles] Solve, Learn, Repeat: Character substitutions with Regular Expressions

Splunk Community Badges!

[Puzzles] Solve, Learn, Repeat: Matching cron expressions

Join the Conversation