Solved: Extract domain from full url

jwhughes58 · ‎10-10-2023

I'm working with data from this search

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]
| rename url AS my_url
| stats count by my_url
| table my_url

The events look like this

02ef65dc96524dabba54a950da7cb0d8.fp.measure.office.com/
0434c399ca884247875a286a10c969f4.fp.measure.office.com/
14f8c4d0e9b7be86933be5d3c9fb91d7.fp.measure.office.com/
3d8e055913534ff7b3c23101fd1f3ca6.fp.measure.office.com/
4334ede7832f44c5badcfd5a6459a1a2.fp.measure.office.com/
5d44dec60c9b4788fb26426c1e151f46.fp.measure.office.com/
5f021e1b8d3646398fab8ce59f8a6bbd.fp.measure.office.com/
6f6c23c1671f72c36d6179fdeabd1f56.fp.measure.office.com/
7106ea87c1e2ed0aebc9baca86f9af34.fp.measure.office.com/
88c88084fe454cbc8629332c6422e8a4.fp.measure.office.com/
982db5012df7494a88c242d426e07be6.fp.measure.office.com/
a478076af2deaf28abcbe5ceb8bdb648.fp.measure.office.com/
aad.cs.dds.microsoft.com/

In the my_list_of_urls.csv there are these entries

*.microsoft.com/
microsoft.com/
*.office.com/
office.com/

What I'm trying to do is get the microsoft.com and office.com from the results instead of the full url. I'm stumped on how to do it. Any help is appreciated.

TIA,

Joe

yuanliu · ‎10-11-2023

list of the URLs the contractors have access to which is the csv file. The firewall team wants to remove any URLs that aren't used in a period of time. Thus, I have to compare the firewall URLs to the csv

So, the firewall team wants to update that CSV file so it will not contain entries that haven't had matching events for a given time period. Is this correct? This seems to be the opposite of what the Splunk search is doing.

Some more points you need to clarify.

What are field name(s) the index search and the lookup file use to indicate URLs? Based on your code snippet, I assume that they both use url.
Does the CSV file contain additional fields? Based on your code snippet, I will assume none.
Is there some significance of trailing slash (/)? Do all url values end with one trailing slash? This may not be relevant, but some SPL manipulations may ruin your convention. So, I'd like to be cautious.

A more important question is the use of asterisk (*). Are the last two domains (root and second level) the only parts of interest? Given all the illustrations, I have to assume yes. In other words, no differentiation is needed between *.microsoft.com/ and microsoft.com/. Additionally, I will assume that every url in the CSV needs to be paired with a wildcard entry.

Using the above assumptions, the following can show you second level domains that have not been used.

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url

The output contains a list of second level domains affixed with a trailing slash, and these same strings prefixed with "*.". These would be the ones to be removed.

If you have lots of events with URLs that have no match in the CSV, you can also use the subsearch as a filter to improve efficiency. Like

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url

Hope this helps.

View solution in original post

richgalloway · ‎10-10-2023

See if this helps. It replaces the my_url field with a string depending on which regular expression matches the data.

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]
| rename url AS my_url
| eval my_url=case(match(my_url,".*/microsoft\.com$"), "microsoft.com",
                   match(my_url,".*/office\.com$"), "office.com",
                   1==1, my_url)
| stats count by my_url
| table my_url

---
If this reply helps you, Karma would be appreciated.

jwhughes58 · ‎10-10-2023

Hi @richgalloway

Unfortunately the csv file has 1156 entries so the use case would be huge. The other issue is that the csv file gets updated daily and urls are added and removed. I was hoping to use the csv file to say get only this part out of the result urls. Alternatively some way of joining the results url with the csv file urls.

Regards,

Joe

ITWhisperer · ‎10-10-2023

| eval my_url=mvjoin(mvindex(split(url,"."),-2,-1),".")

jwhughes58 · ‎10-10-2023

Hi @ITWhisperer ,

If the urls are consistent, that is a great idea. Unfortunately, the urls have between 1 and 8 parts between . and I don't know where to start. Alternatively, if there is a way of adding what is in the csv to the results, that would work.

Regards,

Joe

yuanliu · ‎10-10-2023

If the urls are consistent, that is a great idea. Unfortunately, the urls have between 1 and 8 parts

This is rather confusing. @ITWhisperer's command concatenates from the end, and it shouldn't matter whether there is 1 part in between or 8. Have you tested in full search?

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]
| eval my_url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| stats count by my_url

between . and I don't know where to start. Alternatively, if there is a way of adding what is in the csv to the results, that would work.

Another piece of confusing comes from the original search:

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]
| rename url AS my_url
| stats count by my_url
| table my_url

If you table my_url in the end, the result is no different from using inputlookup alone. Why bother with index search? If you only want mircrosoft.com and office.com from the CSV, you could do

| inputlookup my_list_of_urls.csv
| eval my_url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| stats values(my_url) as my_url
| mvexpand my_url

Maybe you can describe more variety of data (anonymize as needed) and print output from proposed searches, then illustrate what desired outcome is and explain why the actual output is not desired (in case it is not obvious enough)?

jwhughes58 · ‎10-11-2023

@yuanliuYeah, it is a pain of a search. Here is the issue. A firewall device generates an event with URL when certain policies are triggered by contractors. That is the initial search. The firewall team has a list of the URLs the contractors have access to which is the csv file. The firewall team wants to remove any URLs that aren't used in a period of time. Thus, I have to compare the firewall URLs to the csv URLs and output any csv URLs that aren't used in the time frame. The search finds the firewall events.

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]

My issue is the firewall events use the long URL and not the short one. From the firewall

a478076af2deaf28abcbe5ceb8bdb648.fp.measure.office.com/
aad.cs.dds.microsoft.com/

From the csv file

*.microsoft.com/
microsoft.com/
*.office.com/
office.com/

The two events from the firewall mean that the two listed in the csv file are still good and don't need to be removed. I try to think of this as two sets, one the firewall results and the other the csv file, but I can't figure out how to search the firewall results with what is in the csv file.

This make sense?

TIA,

Joe

yuanliu · ‎10-11-2023

list of the URLs the contractors have access to which is the csv file. The firewall team wants to remove any URLs that aren't used in a period of time. Thus, I have to compare the firewall URLs to the csv

So, the firewall team wants to update that CSV file so it will not contain entries that haven't had matching events for a given time period. Is this correct? This seems to be the opposite of what the Splunk search is doing.

Some more points you need to clarify.

What are field name(s) the index search and the lookup file use to indicate URLs? Based on your code snippet, I assume that they both use url.
Does the CSV file contain additional fields? Based on your code snippet, I will assume none.
Is there some significance of trailing slash (/)? Do all url values end with one trailing slash? This may not be relevant, but some SPL manipulations may ruin your convention. So, I'd like to be cautious.

A more important question is the use of asterisk (*). Are the last two domains (root and second level) the only parts of interest? Given all the illustrations, I have to assume yes. In other words, no differentiation is needed between *.microsoft.com/ and microsoft.com/. Additionally, I will assume that every url in the CSV needs to be paired with a wildcard entry.

Using the above assumptions, the following can show you second level domains that have not been used.

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url

The output contains a list of second level domains affixed with a trailing slash, and these same strings prefixed with "*.". These would be the ones to be removed.

If you have lots of events with URLs that have no match in the CSV, you can also use the subsearch as a filter to improve efficiency. Like

index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3) 
    [ | inputlookup my_list_of_urls.csv ]
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url

Hope this helps.

jwhughes58 · ‎10-11-2023

@yuanliuThanks. I would have never figured out the mvjoin(mvindex. That is something I don't use. You gave me enough help that I was able to work out something I can give to another team. Karma point awarded.

Extract domain from full url

field extraction

Join Us for Splunk University and Get Your Bootcamp Game On!

.conf24 | Learning Tracks for Security, Observability, Platform, and Developers!

Announcing Scheduled Export GA for Dashboard Studio