OK my first question, i'm a relative SPLUNK newbie- I thought I was good at SQL syntax but I'm lost here ;-(
I have a source feed from a Bluecoat proxy, and a simple text file (ofinterest.csv) in which we hold a list of URLs of interest, one per line.
What I need to generate is a table listing which of the ofinterest URLs have been visited, and how many times.
One of my problems is that the url in ofinterest.csv will only be something like facebook.com,
whereas the url in the bluecoat log will be something like
http://facebook.com/something%20or%20other.asp?login
so an exact match on field values won't work.
Anyway I've been playing around with a search like
sourcetype=OfInt | eval int=url_item | eval src1="Y" | table int src1 | join type=outer [search sourcetype=BCLogs | dedup cs_host | eval url=cs_host | eval src2="Y" | table url src2 ] | stats count by url
Pretty basic, I know - I can't get the count to work (how many bluecoat entries contain an ofinterest.csv entry), let alone match on partial fields.
Any help appreciated.
First, I would set up the CSV file as a lookup table if this is a relatively static list of URLs that you want to search. Splunk is best at indexing time-based data - your CSV file doesn't sound like time-based data.
Here is the tutorial for using field lookups. Follow the tutorial, but don't make the lookup automatic. Just load the CSV file and define the lookup. In my example below, I will assume that you have defined a lookup called url_lookup.
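For reference, a minimal lookup definition in transforms.conf might look something like this (the filename ofinterest.csv is an assumption based on your description):

```
[url_lookup]
filename = ofinterest.csv
```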
Now, if you have fewer than 100 items in the list, the easiest way to search for what you want may be this:
sourcetype=BCLogs [ | inputlookup url_lookup ] | stats count by url
This selects the items in the lookup table, strings them together with 'OR's and then inserts the whole thing into the initial search of the bluecoat logs.
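For illustration, suppose the lookup contained the two hypothetical entries facebook.com and twitter.com; after the subsearch is expanded in place, the outer search effectively becomes something like this (the exact parenthesization Splunk generates may differ):

```
sourcetype=BCLogs ( url="facebook.com" OR url="twitter.com" ) | stats count by url
```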
This may not be what you want, but I am pretty sure that you don't need all the eval and table commands. Splunk is very different from SQL. I suspect that your "SQL thinking" is actually getting in your way. BTW, why did you dedup cs_host if you wanted a count of the number of times the url appeared? I am confused.
Another way to go about this would be: (1) collect all the bluecoat entries (2) use a wild-card lookup to match the URLs (3) drop any bluecoat entries that don't match and (4) count them up. This is a bit more complicated, but will work for larger data sets.
For the second alternative, set up the lookup table as before, but then do the following:
(a) In the CSV file, put a * for any part of the URL that you wish to wild-card. It will also be easier if you include a second field, which I have called match. Here is an example of the first two lines in the CSV file:
url,match
*facebook.com*,found
(b) Add the following lines to the transforms.conf file that contains your lookup definition:
[url_lookup]
min_matches = 1
default_match = No_Match
case_sensitive_match = false
match_type = WILDCARD
(c) Run the following search - here I am assuming that cs_host is the field that contains the url in the bluecoat logs, and that your CSV file has the header that I showed in (a):
sourcetype=BCLogs | lookup url_lookup url as cs_host | where match="found" | stats count by cs_host
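To make the matching behavior concrete, here is a rough Python sketch of what the wildcard lookup plus `stats count by cs_host` pipeline does. This is an illustration only, not Splunk's implementation; the sample cs_host values and patterns below are hypothetical.

```python
# Sketch of the wildcard-lookup pipeline: match each event's cs_host
# against the CSV patterns (case-insensitively, per
# case_sensitive_match = false), drop non-matches, then count by host.
from fnmatch import fnmatch
from collections import Counter

# Wildcard patterns from the hypothetical CSV, all mapped to match=found
patterns = ["*facebook.com*", "*twitter.com*"]

# Hypothetical cs_host values pulled from bluecoat-style events
events = [
    "http://facebook.com/something%20or%20other.asp?login",
    "http://www.Facebook.com/home",
    "http://example.org/index.html",
    "http://twitter.com/status/123",
]

def matches(host: str) -> bool:
    # Lower-case both sides to mimic case_sensitive_match = false
    return any(fnmatch(host.lower(), p.lower()) for p in patterns)

# Equivalent of: | where match="found" | stats count by cs_host
counts = Counter(host for host in events if matches(host))
for host, n in counts.items():
    print(host, n)
```

The key point the sketch shows is that the leading and trailing * let the short ofinterest.csv entries match the full, messy URLs in the proxy logs.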
Finally, whenever you want to change the list of URLs to search, you can just upload a fresh copy of the CSV file - any saved searches or lookups will immediately start to use the new data.
Hope this isn't a tl;dr !!