Okay, I've done this once in Plone, but we've moved to Drupal, and things don't look the same.
Basically, I want to grab the top search terms from a given timeframe. Drupal search URLs look like:
http://site.example.com/search/site/
A log entry looks something like this (in this case I searched for "splunk"; the server is Apache):
111.222.333.444 - - [06/Feb/2012:14:38:07 -0800] "GET /search/site/splunk HTTP/1.1" 200 9289 "http://site.example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
Previously, in Plone, I was using something like:
host="hostname" file="search" SearchableText="*" | eval SearchableText=lower(SearchableText) | top limit=10 SearchableText
But Drupal doesn't set a query variable like that; the search term is part of the URL path.
Thoughts? Help?
What is the sourcetype for your Drupal data? It looks like a standard access log. What if you run the following search?
host="hostname" file="search" | kv access-extractions | eval SearchableText=lower(uri) | top limit=10 SearchableText
UPDATE: the final answer from comments below:
The best thing to do would be to make the 'rex' field extraction a permanent one using props.conf (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf):
[source::.../access_log*]
EXTRACT-access = "/(?<last_part>[^/]+)$" in uri_path
Then you can do:
source="/var/log/apache2/access_log" uri_path="/search/site/*" NOT last_part=*comment* NOT last_part="favicon.ico" | top limit=10 last_part
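To sanity-check the extraction regex outside Splunk, here is a minimal Python sketch. It is only an illustration, not part of the Splunk configuration: the function name last_part and the sample paths are made up for the example, and Python's re module spells the named group as (?P<name>...) where Splunk accepts (?<name>...).

```python
import re

# Same pattern as the EXTRACT above, with Python's named-group spelling.
LAST_PART = re.compile(r"/(?P<last_part>[^/]+)$")


def last_part(uri_path):
    """Return the final path segment, or None if the path ends in a slash."""
    match = LAST_PART.search(uri_path)
    return match.group("last_part") if match else None


# Hypothetical sample paths, modelled on the log entry in the question.
for path in ["/search/site/splunk", "/search/site/favicon.ico", "/search/site/"]:
    print(path, "->", last_part(path))
```

A path that ends in a slash yields no last_part at all, which is why a bare visit to /search/site/ would not show up in the top list.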
Take a look at the Web Intelligence app. These use cases and a lot more are built in, and the app is free and supported: http://splunk-base.splunk.com/apps/28994/splunk-app-for-web-intelligence
Cool, that worked! Thanks! I've added the stuff to props.conf, but I have to wait for the Web Intelligence backfill to finish before restarting Splunk.
Thanks again!
Otherwise, if you do the extraction inline with rex, the search will be far less efficient. As a rule of thumb, as much filtering as possible should be done to the left of the first pipe:
source="/var/log/apache2/access_log" uri_path="/search/site/*" | rex field=uri_path "/(?<last_part>[^/]+)$" | eval last_part=lower(last_part) | search NOT last_part=*comment* | eval last_part = mvfilter(last_part != "favicon.ico" ) | top limit=10 last_part
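For anyone who wants to see what that pipeline computes, here is a rough Python sketch of the same steps: keep only paths under /search/site/, take the last segment, lowercase it, drop anything containing "comment" or equal to favicon.ico, then count. The function name top_terms and the sample data are hypothetical, and the prefix check is only an approximation of Splunk's wildcard match.

```python
from collections import Counter


def top_terms(uri_paths, limit=10):
    """Count search terms the way the inline search above does, roughly."""
    counts = Counter()
    for path in uri_paths:
        # Equivalent of uri_path="/search/site/*" before the first pipe.
        if not path.startswith("/search/site/"):
            continue
        # Last path segment, lowercased so "Splunk" and "splunk" merge.
        term = path.rsplit("/", 1)[-1].lower()
        # Skip empty segments, comment URLs, and the stray favicon request.
        if not term or "comment" in term or term == "favicon.ico":
            continue
        counts[term] += 1
    return counts.most_common(limit)


# Made-up sample paths for illustration only.
sample = [
    "/search/site/Splunk",
    "/search/site/splunk",
    "/search/site/scholarship",
    "/search/site/favicon.ico",
    "/search/site/comment-form",
    "/about",
]
print(top_terms(sample))
```

Lowercasing before counting is what removes the case duplicates mentioned further down the thread.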
The best thing to do would be to make the 'rex' field extraction a permanent one using props.conf (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf):
[source::.../access_log*]
EXTRACT-access = "/(?<last_part>[^/]+)$" in uri_path
Then you can do:
source="/var/log/apache2/access_log" uri_path="/search/site/*" NOT last_part=*comment* NOT last_part="favicon.ico" | top limit=10 last_part
Okay, the last one seems to work (with the rex field). I'm very close; the only issue is that I want to ignore any results that contain the word "comment".
Here's what I have:
source="/var/log/apache2/access_log" uri_path="/search/site/*" | rex field=uri_path "/(?
The mvfilter is obviously removing "favicon" from the results. And I needed to run the results through "lower" to remove the case duplicates.
Almost....There....
So if uri_path is already an extracted field, you don't need the '| kv access-extractions'. You can try this to get query strings:
source="/var/log/apache2/access_log" uri_path="/search/site/*" uri_query=* | top limit=10 uri_query
To get the last part before the query string:
sourcetype="access_combined_wcookie" | rex field=uri_path "\/(?
I think I'm close. The above didn't work quite right, but this seems close...
source="/var/log/apache2/access_log" uri_path="/search/site/*" | kv access-extractions | eval SearchableText=lower(uri) | top limit=10 SearchableText
Problem is, I'm getting results like:
"/search/site/scholarship". Is there a way to just remove the "/search/site/" part of that result, so I just get the actual search term?
Also, how does one remove certain results? For example, favicon.ico shows up in the results because it happens to get loaded from a location with "/search/site" in the URL for some reason...
Thoughts?
And thanks. I've got backfilling going with the Web Intelligence app... will have to see how that works.