access_combined hide certain useragents

Path Finder

Hello, I'm looking to do some stats on the traffic to my company's web server (Apache). I'm using Splunk as a light forwarder and monitoring it with the Unix app. I want to hide hits to the server from the various bots and fetchers from Google and the like. I have done it manually with useragent!=, but there always seem to be new kinds of user agents to block. How do you write a search that matches special words in the useragent (like bot, feed and spider)?

Explorer

I created an eventtype called BOTS that matches the bots I know of (e.g. http_user_agent="crawler" OR ...etc..). When I want to filter out events created by bots, I add it to my search query:

something=something NOT eventtype=BOTS etc

This works very well for me. Periodically, I see a new bot show up and add it to my list.
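
To make the eventtype idea concrete, here is a minimal sketch of a search string that could be saved as the BOTS eventtype. The wildcards and the keyword list are assumptions for illustration, not the poster's actual definition, and the field name follows the post above (the access_combined sourcetype normally calls this field useragent, so substitute whatever your events extract). Field-value matching in the search command is case-insensitive, so *bot* would also cover Googlebot.

```spl
http_user_agent="*bot*" OR http_user_agent="*spider*" OR http_user_agent="*crawler*" OR http_user_agent="*feed*"
```

Once saved, the list is maintained in one place and every search that uses NOT eventtype=BOTS picks up new keywords automatically.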

Communicator

I may or may not understand your question...

You are looking at a web server access log and you want to report stats, but need to filter out bots. You think that the useragent string is the way to tell the bots from the people. Unfortunately, useragent strings are wild creatures and very hard to process consistently. There is too much manual intervention required.

Could you start by gathering everything that requests robots.txt? Perhaps get a list of IPs or useragents that requested robots.txt and then use those as your filter. It won't catch the virus probes and other black hat hits, but it should align your stats closer to reality than counting up everything.
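
A rough sketch of the robots.txt idea, using a subsearch to collect the client IPs that asked for robots.txt and excluding them from the main search. The index and source are taken from the asker's own command elsewhere in this thread; clientip and useragent are assumed to be the field names extracted by access_combined, and the final stats clause is only a placeholder report.

```spl
index="os" source="/var/log/httpd/access_log"
    NOT [ search index="os" source="/var/log/httpd/access_log" "robots.txt"
          | dedup clientip
          | fields clientip ]
| stats count by useragent
```

The subsearch expands into a list of clientip=... terms joined by OR, which the outer NOT then excludes. Subsearches are subject to result-count and runtime limits, so on a busy server with many distinct crawler IPs it may be worth writing the list to a lookup on a schedule instead.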

Path Finder

Anyone have any idea?

Path Finder

Sounds like a good idea, but I run into a problem when trying it. I created a text file named useragent.csv, pasted in what you wrote, and added it under Lookup table files in Manager -> Lookups. I then get this error:

"Error in 'lookup' command: Could not find all of the specified destination fields in the lookup table."

when running this command:

index="os" source="/var/log/httpd/access_log" | lookup useragent.csv useragent OUTPUT boolean_include | where isnull(boolean_exclude)

Splunk Employee

You could create an eventtype to group useragent values and filter against that eventtype in several searches. Then you could maintain the one eventtype centrally.


You could also achieve this by creating a lookup for useragent, where useragent.csv is:

useragent,boolean_exclude
bot,true
feed,true
spider,true

and your search is:

... | lookup useragent.csv useragent OUTPUT boolean_exclude | where isnull(boolean_exclude)
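
Two caveats apply to the lookup approach. The field named after OUTPUT has to be a column in the CSV header (boolean_exclude here); naming a field that is not in the file is a likely cause of a "Could not find all of the specified destination fields" error. Also, a plain CSV lookup matches values exactly by default, so a row containing bot only matches a useragent that is exactly "bot"; substring matching through a lookup needs wildcard matching configured in the lookup definition. As an alternative that needs no lookup at all, here is a minimal sketch using the regex command with the keywords from the question. The index and source come from the asker's command, and the final stats clause is only a placeholder report.

```spl
index="os" source="/var/log/httpd/access_log"
| regex useragent!="(?i)(bot|feed|spider)"
| stats count by useragent
```

The (?i) flag makes the pattern case-insensitive, and new keywords can be added to the alternation as new bots show up.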