Top search results from Drupal

Path Finder

Okay, I've done this once in Plone, but we've moved to Drupal, and things don't look the same.

Basically, I want to grab the top search terms from a given timeframe. Drupal search URLs look like:

http://site.example.com/search/site/<term>, where <term> is something like "splunk" or "foobar" or whatever.

A log entry looks something like this (here I searched for "splunk"; the server is Apache):

111.222.333.444 - - [06/Feb/2012:14:38:07 -0800] "GET /search/site/splunk HTTP/1.1" 200 9289 "http://site.example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"
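
For checking outside Splunk which request path a line like this carries, here is a minimal Python sketch. The pattern and the name `request_path` are illustrative assumptions, not anything Splunk uses internally, and the user-agent in the sample is shortened.

```python
import re

# Assumed pattern for the request portion of an Apache combined log line.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')

def request_path(line):
    """Return the requested path from one access-log line, or None if no request is found."""
    match = REQUEST_RE.search(line)
    return match.group("path") if match else None

sample = (
    '111.222.333.444 - - [06/Feb/2012:14:38:07 -0800] '
    '"GET /search/site/splunk HTTP/1.1" 200 9289 "http://site.example.com/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3)"'
)
print(request_path(sample))  # /search/site/splunk
```

The search term is simply the last segment of that path, which is what the field extraction discussed in the answers pulls out.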

Previously, in Plone, I was using something like:

host="hostname" file="search" SearchableText="*" | eval SearchableText=lower(SearchableText) | top limit=10 SearchableText

But there's no query variable being set like that.

Thoughts? Help?

1 Solution

Splunk Employee

What is the sourcetype for your Drupal data? It looks like a standard access log. What if you run the following search?

 host="hostname" file="search" | kv access-extractions | eval SearchableText=lower(uri) | top limit=10 SearchableText

UPDATE: the final answer from comments below:

The best thing to do would be to make the 'rex' field extraction a permanent one using props.conf (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf):

[source::.../access_log*]
EXTRACT-access = "/(?<last_part>[^/]+)$" in uri_path

Then you can do:

source="/var/log/apache2/access_log" uri_path="/search/site/*" NOT last_part=*comment*  NOT last_part="favicon.ico" | top limit=10 last_part
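
To sanity-check what the regular expression captures before committing it to props.conf, a small Python sketch like the following may help. It assumes Python's `(?P<name>...)` spelling for the named group, where Splunk's PCRE accepts `(?<name>...)`; the helper name `last_part` and the sample paths are made up for illustration.

```python
import re

# Same pattern as in the props.conf stanza, with Python's named-group syntax.
LAST_PART_RE = re.compile(r"/(?P<last_part>[^/]+)$")

def last_part(uri_path):
    """Return the final path segment, or None when the path ends in a slash."""
    match = LAST_PART_RE.search(uri_path)
    return match.group("last_part") if match else None

for path in ["/search/site/splunk", "/search/site/Foo%20Bar", "/search/site/"]:
    print(path, "->", last_part(path))
```

A path that ends in a slash yields no match, so such events would simply have no last_part value.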

0 Karma

Splunk Employee

Take a look at the Web Intelligence app. These use cases and a lot more are built in, and the app is free and supported: http://splunk-base.splunk.com/apps/28994/splunk-app-for-web-intelligence

0 Karma

Path Finder

Cool, that worked! Thanks! I've added the stanza to props.conf, but I have to wait for the Web Intelligence backfill to finish before restarting Splunk.

Thanks again!

0 Karma

Splunk Employee

Otherwise, with the extraction done inline in the search, it will be far less efficient. As a rule of thumb, as much filtering as possible should be done to the left of the first pipe:

source="/var/log/apache2/access_log" uri_path="/search/site/*" | rex field=uri_path "/(?<last_part>[^/]+)$" | eval last_part=lower(last_part) | search NOT last_part=*comment* | eval last_part = mvfilter(last_part != "favicon.ico" ) | top limit=10 last_part
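
As a rough illustration of what this pipeline does step by step (extract the last path segment, lowercase it, drop "comment" and favicon.ico entries, then count), here is a Python sketch. It is only an offline approximation under assumed sample data; `top_terms` and the paths are hypothetical, and it does not reproduce Splunk's exact matching rules.

```python
from collections import Counter
import re

# Python spelling of the rex pattern used in the search above.
LAST_PART_RE = re.compile(r"/(?P<last_part>[^/]+)$")

def top_terms(uri_paths, limit=10):
    """Count lowercased search terms, skipping comment and favicon entries."""
    counts = Counter()
    for path in uri_paths:
        if not path.startswith("/search/site/"):
            continue  # mirrors uri_path="/search/site/*"
        match = LAST_PART_RE.search(path)
        if not match:
            continue
        term = match.group("last_part").lower()  # mirrors eval lower(...)
        if "comment" in term or term == "favicon.ico":
            continue  # mirrors the two exclusions
        counts[term] += 1
    return counts.most_common(limit)  # mirrors top limit=10

paths = [
    "/search/site/Splunk",
    "/search/site/splunk",
    "/search/site/foobar",
    "/search/site/favicon.ico",
    "/search/site/comment-form",
    "/node/12",
]
print(top_terms(paths))  # [('splunk', 2), ('foobar', 1)]
```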
0 Karma

Splunk Employee

The best thing to do would be to make the 'rex' field extraction a permanent one using props.conf (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf):

[source::.../access_log*]
EXTRACT-access = "/(?<last_part>[^/]+)$" in uri_path

Then you can do:

source="/var/log/apache2/access_log" uri_path="/search/site/*" NOT last_part=*comment*  NOT last_part="favicon.ico" | top limit=10 last_part
0 Karma

Path Finder

Okay, the last one seems to work (with the rex field). I'm very close; the only issue is that I want to ignore any results that contain the word "comment".

Here's what I have:
source="/var/log/apache2/access_log" uri_path="/search/site/*" | rex field=uri_path "/(?<last_part>[^/]+)$" | eval last_part=lower(last_part) | eval last_part = mvfilter(last_part != "favicon.ico" ) | top limit=10 last_part

The mvfilter is obviously removing "favicon" from the results. And I needed to run the results through "lower" to remove the case duplicates.

Almost....There....

0 Karma

Splunk Employee

So if uri_path is already an extracted field, you don't need the '| kv access-extractions'. You can try this to get query strings:

source="/var/log/apache2/access_log" uri_path="/search/site/*" uri_query=* | top limit=10 uri_query

To get the last part before the query string:

sourcetype="access_combined_wcookie" | rex field=uri_path "\/(?<last_part>[^\/]+)$" | top limit=10 last_part

0 Karma

Path Finder

I think I'm close. The above didn't work quite right, but this seems close...

source="/var/log/apache2/access_log" uri_path="/search/site/*" | kv access-extractions | eval SearchableText=lower(uri) | top limit=10 SearchableText

Problem is, I'm getting results like:

"/search/site/scholarship". Is there a way to just remove the "/search/site/" part of that result, so I just get the actual search term?

Also, how does one remove certain results? Like, getting a favicon.ico in the results because it happens to get loaded from a location with "/search/site" in the url for some reason...

Thoughts?

And thanks. I've got backfilling going with the webintelligence app... will have to see how that works.

0 Karma