I have a CSV of URLs I need to search against my proxy index (the url field), I want to be able to do a count or match of the URLs.
my csv looks like this (with the header of the column called kurl)
kurl
splunk.com
youtube.com
google.com
So far, I have this SPL but it's only counting the matches, i need the URLs that don't exist to count 0
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
| stats count by url
Actually, I made a mistake when I emulate data. (I had a spurious "match" field in emulation which your index subsearch will not give.) Here is the correct solution:
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
| append
[inputlookup URLs.csv
| fields kurl]
| eval match = if(isnull(url), 0, 1)
| eval url = coalesce(url, kurl)
| stats sum(match) as count by url
Here is the corrected emulation
| makeresults
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| eval index = "web_index"
| mvexpand url
| search *
[ inputlookup URLs.csv
| fields kurl
| rename kurl as url]
``` the above emulates
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
```
This gives
_time | index | url |
2023-09-20 16:20:05 | web_index | google.com |
2023-09-20 16:20:05 | web_index | google.com |
(If your results do not meet the requirement, it would be useful to compare real data with emulation.) Combined simulation is then
| makeresults
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| mvexpand url
| lookup URLs.csv kurl as url output kurl as match
| where isnotnull(match)
| fields - match
``` the above emulates
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
```
| append
[inputlookup URLs.csv
| fields kurl]
| eval match = if(isnull(url), 0, 1)
| eval url = coalesce(url, kurl)
| stats sum(match) as count by url
I get
url | count |
google.com | 2 |
splunk.com | 0 |
youtube.com | 0 |
However, I still don't understand why the first option (stats before append) would not give you the correct output. Here's my full emulation:
| makeresults
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| mvexpand url
| lookup URLs.csv kurl as url output kurl as match
| where isnotnull(match)
| fields - match
``` the above emulates
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
```
| stats count by url
| append
[inputlookup URLs.csv
| fields kurl
| rename kurl AS url]
| stats sum(count) as count by url
| fillnull count ``` or you can omit this and leave nonexistent to show null ```
Again It would help if you can run data part of emulation and let me know the difference with real data.
You selected lookup as label, but are using inputlookup😃. You would have the answer if you stick to lookup.
index="web_index"
| lookup URLs.csv kurl as url output kurl as match
| eval match = if(isnull(match), 0, 1)
| stats sum(match) as count by url
That also returned every url not on my lookup. I guess I could just take the hits and do a duplicate compare in excel but it'd be nice to see it all in splunk.
Yeah, sorry about the tag I wasn't paying attention.
Maybe you can explain the requirement further? You said of your original SPL "it's only counting the matches, i need the URLs that don't exist to count 0." So, I thought you would want every url not with a match. Could you mock up some data in web_index and explain why the output doesn't meet the requirements?
Here is my emulation
| makeresults
| fields - _time
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| mvexpand url
``` the above emulates index="web_index" ```
url |
google.com |
foo.com |
bar.com |
google.com |
Using the exact mock lookup you give, my search will give
url | count |
bar.com | 0 |
foo.com | 0 |
google.com | 2 |
Is this not what you want, bar.com and foo.com count to 0?
Below is the full example
| makeresults
| fields - _time
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| mvexpand url
``` the above emulates index="web_index" ```
| lookup URLs.csv kurl as url output kurl as match
| eval match = if(isnull(match), 0, 1)
| stats sum(match) as count by url
So my first SPL, it gets me the URLs I'm looking for but doesn't list the URLs (in the lookup) that don't get any results.
So to break it down, say my csv has 120 URLs and I would like to know what my users are hitting and not hitting, when i run my first example search I'm getting 43 results back, but I also want to know what are the URLs (out of the 120) not hitting? Or should I create a search only telling me the URLs that don't match in the index?
Whilst that example does make sense, in the stats table I get all these other URLs that aren't even in my lookup list that I care about. So I go from expecting 120 results to (18,000).
Does that make more sense?
So my first SPL, it gets me the URLs I'm looking for but doesn't list the URLs (in the lookup) that don't get any results.
The parenthesis "(in the lookup)" is key here. My first reading of "only counting the matches, i need the URLs that don't exist to count 0" took it to mean URLs in the index that didn't have a match. This new description makes perfect sense. And your search is more efficient.
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
| stats count by url
| append
[inputlookup URLs.csv
| fields kurl
| rename kurl AS url]
| stats sum(count) as count by url
| fillnull count ``` or you can omit this and leave nonexistent to show null ```
Alternatively, you can do
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
| append
[inputlookup URLs.csv
| fields kurl
| rename kurl AS url]
| eval match = if(isnull(match), 0, 1)
| stats sum(match) as count by url
Yeah it still doesn't mark the URLs in my list that don't exist in the index as 0.
I'll just accept your solution though, thanks for your help 😃
Actually, I made a mistake when I emulate data. (I had a spurious "match" field in emulation which your index subsearch will not give.) Here is the correct solution:
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
| append
[inputlookup URLs.csv
| fields kurl]
| eval match = if(isnull(url), 0, 1)
| eval url = coalesce(url, kurl)
| stats sum(match) as count by url
Here is the corrected emulation
| makeresults
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| eval index = "web_index"
| mvexpand url
| search *
[ inputlookup URLs.csv
| fields kurl
| rename kurl as url]
``` the above emulates
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
```
This gives
_time | index | url |
2023-09-20 16:20:05 | web_index | google.com |
2023-09-20 16:20:05 | web_index | google.com |
(If your results do not meet the requirement, it would be useful to compare real data with emulation.) Combined simulation is then
| makeresults
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| mvexpand url
| lookup URLs.csv kurl as url output kurl as match
| where isnotnull(match)
| fields - match
``` the above emulates
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
```
| append
[inputlookup URLs.csv
| fields kurl]
| eval match = if(isnull(url), 0, 1)
| eval url = coalesce(url, kurl)
| stats sum(match) as count by url
I get
url | count |
google.com | 2 |
splunk.com | 0 |
youtube.com | 0 |
However, I still don't understand why the first option (stats before append) would not give you the correct output. Here's my full emulation:
| makeresults
| eval url = mvappend("google.com", "foo.com", "bar.com", "google.com")
| mvexpand url
| lookup URLs.csv kurl as url output kurl as match
| where isnotnull(match)
| fields - match
``` the above emulates
index="web_index"
[| inputlookup URLs.csv
| fields kurl
| rename kurl as url]
```
| stats count by url
| append
[inputlookup URLs.csv
| fields kurl
| rename kurl AS url]
| stats sum(count) as count by url
| fillnull count ``` or you can omit this and leave nonexistent to show null ```
Again It would help if you can run data part of emulation and let me know the difference with real data.
THIS WORKS!!!!!!!!!!
Thanks so much!!!!