I'm working with data from this search
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
| rename url AS my_url
| stats count by my_url
| table my_url
The events look like this
02ef65dc96524dabba54a950da7cb0d8.fp.measure.office.com/
0434c399ca884247875a286a10c969f4.fp.measure.office.com/
14f8c4d0e9b7be86933be5d3c9fb91d7.fp.measure.office.com/
3d8e055913534ff7b3c23101fd1f3ca6.fp.measure.office.com/
4334ede7832f44c5badcfd5a6459a1a2.fp.measure.office.com/
5d44dec60c9b4788fb26426c1e151f46.fp.measure.office.com/
5f021e1b8d3646398fab8ce59f8a6bbd.fp.measure.office.com/
6f6c23c1671f72c36d6179fdeabd1f56.fp.measure.office.com/
7106ea87c1e2ed0aebc9baca86f9af34.fp.measure.office.com/
88c88084fe454cbc8629332c6422e8a4.fp.measure.office.com/
982db5012df7494a88c242d426e07be6.fp.measure.office.com/
a478076af2deaf28abcbe5ceb8bdb648.fp.measure.office.com/
aad.cs.dds.microsoft.com/
In the my_list_of_urls.csv there are these entries
*.microsoft.com/
microsoft.com/
*.office.com/
office.com/
What I'm trying to do is get the microsoft.com and office.com from the results instead of the full url. I'm stumped on how to do it. Any help is appreciated.
TIA,
Joe
list of the URLs the contractors have access to which is the csv file. The firewall team wants to remove any URLs that aren't used in a period of time. Thus, I have to compare the firewall URLs to the csv
So, the firewall team wants to update that CSV file so it will not contain entries that haven't had matching events for a given time period. Is this correct? This seems to be the opposite of what the Splunk search is doing.
Some more points you need to clarify.
A more important question is the use of asterisk (*). Are the last two domains (root and second level) the only parts of interest? Given all the illustrations, I have to assume yes. In other words, no differentiation is needed between *.microsoft.com/ and microsoft.com/. Additionally, I will assume that every url in the CSV needs to be paired with a wildcard entry.
Using the above assumptions, the following can show you second level domains that have not been used.
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url
The output contains a list of second level domains affixed with a trailing slash, and these same strings prefixed with "*.". These would be the ones to be removed.
If you have lots of events with URLs that have no match in the CSV, you can also use the subsearch as a filter to improve efficiency. Like
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url
Hope this helps.
See if this helps. It replaces the my_url field with a string depending on which regular expression matches the data.
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
| rename url AS my_url
| eval my_url=case(match(my_url,".*/microsoft\.com$"), "microsoft.com",
match(my_url,".*/office\.com$"), "office.com",
1==1, my_url)
| stats count by my_url
| table my_url
Unfortunately the csv file has 1156 entries so the use case would be huge. The other issue is that the csv file gets updated daily and urls are added and removed. I was hoping to use the csv file to say get only this part out of the result urls. Alternatively some way of joining the results url with the csv file urls.
Regards,
Joe
| eval my_url=mvjoin(mvindex(split(url,"."),-2,-1),".")
Hi @ITWhisperer ,
If the urls are consistent, that is a great idea. Unfortunately, the urls have between 1 and 8 parts between . and I don't know where to start. Alternatively, if there is a way of adding what is in the csv to the results, that would work.
Regards,
Joe
If the urls are consistent, that is a great idea. Unfortunately, the urls have between 1 and 8 parts
This is rather confusing. @ITWhisperer's command concatenates from the end, and it shouldn't matter whether there is 1 part in between or 8. Have you tested in full search?
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
| eval my_url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| stats count by my_url
between . and I don't know where to start. Alternatively, if there is a way of adding what is in the csv to the results, that would work.
Another piece of confusing comes from the original search:
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
| rename url AS my_url
| stats count by my_url
| table my_url
If you table my_url in the end, the result is no different from using inputlookup alone. Why bother with index search? If you only want mircrosoft.com and office.com from the CSV, you could do
| inputlookup my_list_of_urls.csv
| eval my_url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| stats values(my_url) as my_url
| mvexpand my_url
Maybe you can describe more variety of data (anonymize as needed) and print output from proposed searches, then illustrate what desired outcome is and explain why the actual output is not desired (in case it is not obvious enough)?
@yuanliuYeah, it is a pain of a search. Here is the issue. A firewall device generates an event with URL when certain policies are triggered by contractors. That is the initial search. The firewall team has a list of the URLs the contractors have access to which is the csv file. The firewall team wants to remove any URLs that aren't used in a period of time. Thus, I have to compare the firewall URLs to the csv URLs and output any csv URLs that aren't used in the time frame. The search finds the firewall events.
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
My issue is the firewall events use the long URL and not the short one. From the firewall
a478076af2deaf28abcbe5ceb8bdb648.fp.measure.office.com/ aad.cs.dds.microsoft.com/
From the csv file
*.microsoft.com/ microsoft.com/ *.office.com/ office.com/
The two events from the firewall mean that the two listed in the csv file are still good and don't need to be removed. I try to think of this as two sets, one the firewall results and the other the csv file, but I can't figure out how to search the firewall results with what is in the csv file.
This make sense?
TIA,
Joe
list of the URLs the contractors have access to which is the csv file. The firewall team wants to remove any URLs that aren't used in a period of time. Thus, I have to compare the firewall URLs to the csv
So, the firewall team wants to update that CSV file so it will not contain entries that haven't had matching events for a given time period. Is this correct? This seems to be the opposite of what the Splunk search is doing.
Some more points you need to clarify.
A more important question is the use of asterisk (*). Are the last two domains (root and second level) the only parts of interest? Given all the illustrations, I have to assume yes. In other words, no differentiation is needed between *.microsoft.com/ and microsoft.com/. Additionally, I will assume that every url in the CSV needs to be paired with a wildcard entry.
Using the above assumptions, the following can show you second level domains that have not been used.
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url
The output contains a list of second level domains affixed with a trailing slash, and these same strings prefixed with "*.". These would be the ones to be removed.
If you have lots of events with URLs that have no match in the CSV, you can also use the subsearch as a filter to improve efficiency. Like
index=my_index sourcetype=my_sourcetype (rule=policy_1 OR rule=policy_2 OR rule=policy_3)
[ | inputlookup my_list_of_urls.csv ]
| eval url = mvjoin(mvindex(split(url, "."), -2,-1), ".")
| dedup url
| inputlookup append=true my_list_of_urls.csv
| fillnull sourcetype value=CSV
| stats values(sourcetype) as sourcetype by url
| where mvcount(sourcetype) == 1 AND sourcetype == "CSV"
| eval url = mvappend(url, "*." . url)
| mvexpand url
Hope this helps.
@yuanliuThanks. I would have never figured out the mvjoin(mvindex. That is something I don't use. You gave me enough help that I was able to work out something I can give to another team. Karma point awarded.