Hi Folks,
I could use some help with this query.
index=address_index earliest=-30m address
[ search index=registration_index earliest=-30m
| `get_ip_location(src_ip)`
| rename user as email
| dedup email
| table email src_ip ip_location
| return 15 $email]
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=lower(trim(EMAIL_ADDRESS))
| eval city=lower(trim(CITY))
| eval address=lower(trim(ADDRESS1))
| eval state=lower(trim(STATE))
| stats values(city) as city values(state) as state values(address) as address by email
The inner search looks for all the registrations for the past 30 mins. Then, the return command passes the email to the outer search, which then queries the address index for an address on file according to the email.
my goal, right now, is to pass 2 parameters to the outer search, an email and the src_ip/ip_location.
problem: when I attempt to add a second parameter to the return command, in addition to email, the query no longer works.
The ultimate goal is to build a search that queries registrations from met online, use the get_ip_location on the originating IP address, then compare that ip_location with their address on file (which is usually in the address index).
However, when I try to following query, I get no results:
index=address_index earliest=-30m address
[ search index=registration_index earliest=-30m
| `get_ip_location(src_ip)`
| rename user as email
| dedup email
| table email src_ip ip_location
| return 15 $email $ip_location]
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=lower(trim(EMAIL_ADDRESS))
| eval city=lower(trim(CITY))
| eval address=lower(trim(ADDRESS1))
| eval state=lower(trim(STATE))
| stats values(city) as city values(state) as state values(address) as address by email ip_location
How can I pass these 2 values, $email and $ip_location, to the outer search?
You can try combining the 2 data sources in the base search and "join" (group) them together using stats.
I didn't test the query below. It could look something like this:
(index=address_index address) OR (index=registration_index) earliest=-30m
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email
You can try combining the 2 data sources in the base search and "join" (group) them together using stats.
I didn't test the query below. It could look something like this:
(index=address_index address) OR (index=registration_index) earliest=-30m
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email
I agree with @johnhuang that subsearch return value is not the best method to attain your goal. Returned values will be used as literal strings in the main search, which do not always get what you expect. In your case, ip_location returned from `get_ip_location(src_ip)` is likely a combination of city, address, state in a format that is not the same string as laid out as in the address index. You need to reorganize city, address, state in that particular format for comparison; alternatively, break ip_location down to city, address, state in the format used in the address index.
Im not so sure reorganizing city, address, state in that particular format will be helpful. I could easily combine those values. also, the get_ip_location outputs the whole address or just the partial.
I must combine two queries: 1 that gets the ip address and geoloaction at the time of registration from the registration index, and a second separate query that gets the address on file from a separate. Then I have to compare the 2: the ip GeoLocation and address on file.
Ideally you should put the list of known addresses in a lookup, either csv or kvstore. Could you give me an idea of how many unique users are in the address index?
e.g.
<base search> -12mon@mon
| dedup user
| stats min(_time) AS _time dc(user) AS user_ct
| eval oldest_event=strftime(_time, "%Y-%m-%d")
| table oldest_event user_ct
John, there are over one million users. to put them all in a lookup, imo, wouldn't be feasible.
A million users with these few fields should total around 100 MB which is doable for csv lookups and a KVStore could definitely handle it. If this is something that you need to do often, you should consider putting this into a lookup.
John, thank you so much for your help. In the end, i've decided to put all this info into a summary index. Once again, you folks have been deeply helpful!
I understand what you want to compare, but cannot visualize your data without illustration. I can imagine that `get_ip_location()` returns a string like "123 Main St, Sommerville, Sommstate". According to your sample code, address index returns an XML document, which probably contains something like
<address>123 Main St</address> <city>Sommerville</city> <state>Sommstate</state>
That is why the subsearch contradicts the main search even upon an absolute match. The combined search cannot return anything.
Given that johnhua is on the right track, can you illustrate what result you get from that code and explain why the output doesn't meet your requirement? (When I say "reorganizing city, address, state in that particular format," I meant to reorganize after the stats command, not in the original search.) "Didn't work" conveys no information here.
I also have a curious question: Is address_index completely rewritten every 30 minutes or less? In other words, does
index=address_index earliest=-30m address
return EVERY address? If not, I suggest to modify the first line to
(index=address_index address) OR (index=registration_index earliest=-30m)
and adjust the main search window so (index=address_index address) can contain every address of interest.
ah okay. Apologies for the confusion. Let me see if I can visualize it for you.
so the first first search, the one of the registration index, will return all users who are registering in the last 30m. the results will look like this:
src_ip | ip_location | |
John_smith@fake.com | 123.123.123.123 | Wakefield, Massachusetts, United States |
Now, I would like to do a secondary search that queries the address index using the email address above, thus returning the address on file:
user | city_on_file | state_on_file | address_on_file |
John_smith@fake.com | Boston | Massachusetts | 123 fake street, Boston Massachusetts |
you are correct in assuming that this ^ info is in XML format.
Ideally I want the end result to look like this:
email/user | src_ip | ip_location | city_on_file | state_on_file | address_on_file |
John_smith@fake.com | 123.123.123.123 | Wakefield, Massachusetts, United States | Boston | Massachusetts | 123 fake street, Boston Massachusetts |
In this particular example, john smith has registered from Wakefield, yet his address on file is boston. Given that IP address GeoLocations are always 100% accurate, id say this may be a false alarm. However, my goal is capture registrations that are very far away from the address on file. I will likely compare states, rather than cities. So, if John Smith were to register in California, yet his address on file states he lives in Massachusetts, then an alert will be raised.
address_Index is massive. I am using only the past 30 mins for testing purposes. once I get the code down, I will likely do the past 24h.
`get_ip_location()` does return strings like "123 Main St, Sommerville, Sommstate" but it also returns weather or not the IP is from a anonomizer VPN, it can reutrn just the city or or even just the state.
@Allene139 Thank you for illustrating data and explain the requirements. The logic you want to apply on data is sound, and @johnhuang's code is leading to that direction. Can you explain what you get from that code?
Yuanliu, I ran John's query and I am getting a rough idea of what I want.
city_on_file | state_on_file | address_on_file | ip_location | src_ip | |
john smith | Big Rapis Michigan | 123.123.123.123 | |||
jane doe | atlanta | ga | 123 fake street |
This result is the exact format that you indicated earlier. The reason why one row shows no data from address index, the other row shows no data from registration index is because the search hasn't found a match.
The most likely reason why no match is found is as I speculated earlier: the search window of 30m in address index cannot return enough users to make a match. @johnhuang's suggestion about dumping extracted address index into a lookup file is a feasible method to make sure any user that has a match can be found. I understand that CSV lookup in a million rows can be painful. (I am running some million+ lookups myself.) But it is better than missing matches, or running time-series search in address index that crosses perhaps hundreds of millions of records. An alternative to CSV is to dump that info into a KV store.
For now, you can run an extreme search just to verify that johnhua's method meet your requirements:
(index=address_index address earliest=0) OR (index=registration_index earliest=-30m)
| `get_ip_location(src_ip)`
| rex field=_raw "REGEX xmlfield"
| xmlkv xmlfield
| eval email=LOWER(TRIM(COALESCE(EMAIL_ADDRESS, user)))
| eval city=LOWER(TRIM(CITY))
| eval address=LOWER(TRIM(ADDRESS1))
| eval state=LOWER(TRIM(STATE))
| stats values(city) as city values(state) as state values(address) as address values(ip_location) AS ip_location by email
| where isnotnull(ip_location))
The only difference here is that
What's your current query? Let's optimize it.