Hello, I'm looking to do some stats on the traffic to my company's web server (Apache). I'm using Splunk as a light forwarder and monitoring it with the Unix app. I want to hide the hits to the server from different kinds of bots and fetchers, from Google and the like. I have done it manually with useragent!=, but there always seem to be new kinds of user agents to block. How do you write a search for special words in the useragent (like bot, feed and spider)?
I created an eventtype called BOTS that will match bots that I know of (i.e. http_user_agent="crawler" OR ...etc..). When I want to filter out events created by BOTS, I add it to my search query:
something=something NOT eventtype=BOTS etc
This works very well for me. Periodically, I see a new bot show up and add it to my list.
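A sketch of what such an eventtype could look like with wildcards, so that one definition catches any user agent containing a keyword instead of a list of exact names. The field name useragent and the keyword list are assumptions taken from the question; substitute whatever field your sourcetype extracts (http_user_agent in the example above).

```ini
# eventtypes.conf (sketch; the field name and keywords are assumptions)
[BOTS]
search = useragent="*bot*" OR useragent="*feed*" OR useragent="*spider*" OR useragent="*crawler*"
```

Field-value comparisons in the search command are case-insensitive, so *bot* should cover both Googlebot and bingbot. Broad keywords can also hide legitimate clients (a feed reader used by a real person, for example), so it is worth reviewing what the eventtype matches before trusting the filtered stats.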
I may or may not understand your question...
You are looking at a webserver access log and you want to report stats, but need to filter out bots. You think that useragent string is the way to identify the bots from the people. Unfortunately useragent strings are wild creatures and very hard to process consistently. There is too much manual intervention required.
Could you start by gathering everything that requests robots.txt? Perhaps get a list of IPs or useragents that requested robots.txt and then use those as your filter. It won't catch the virus probes and other black-hat hits, but it should bring your stats closer to reality than counting up everything.
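The robots.txt idea can be sketched as a subsearch: the inner search collects the client IPs that requested robots.txt, and the outer search drops every event from those IPs. The index and source are copied from the question; the field names clientip and uri_path are assumptions (they are what an access_combined style extraction usually provides) and may differ in your setup.

```spl
index="os" source="/var/log/httpd/access_log"
    NOT [ search index="os" source="/var/log/httpd/access_log" uri_path="/robots.txt"
          | dedup clientip
          | fields clientip ]
```

The subsearch expands into a list of clientip=... conditions joined with OR, which the NOT then excludes. Subsearches are subject to result-count and run-time limits, so for a busy site or a long time range it may be better to write the IP list to a lookup on a schedule and filter against that.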
Sounds like a good idea, but I run into a problem trying this. I created a text file named useragent.csv and pasted in what you wrote. Then I added it under Lookup table files in Manager -> Lookups. Then I get this error:
"Error in 'lookup' command: Could not find all of the specified destination fields in the lookup table."
This happens when I run this command: index="os" source="/var/log/httpd/access_log" | lookup useragent.csv useragent OUTPUT boolean_include | where isnull(boolean_exclude)
You could create an eventtype to group useragent values and filter against that eventtype in several searches. Then you could maintain the one eventtype centrally.
You could also achieve this by creating a lookup for useragent
where useragent.csv is:
useragent,boolean_exclude
bot,true
feed,true
spider,true
and your search is:
... | lookup useragent.csv useragent OUTPUT boolean_exclude | where isnull(boolean_exclude)
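One caveat with a plain CSV lookup: by default it matches the whole field value exactly, so a row like bot only matches a useragent that is literally "bot". A possible way around this is a lookup definition with wildcard matching. The stanza name useragent_lookup below is a hypothetical name chosen for illustration; match_type is a transforms.conf setting.

```ini
# transforms.conf (sketch; the stanza name useragent_lookup is hypothetical)
[useragent_lookup]
filename = useragent.csv
# Treat values in the CSV's useragent column as wildcard patterns
match_type = WILDCARD(useragent)
```

With this definition the CSV rows would need wildcards, for example *bot*,true and *spider*,true, and the search would reference the definition instead of the file: ... | lookup useragent_lookup useragent OUTPUT boolean_exclude | where isnull(boolean_exclude). Lookup matching is case-sensitive by default; transforms.conf has a case_sensitive_match setting for that, though how it interacts with wildcard matching may depend on your Splunk version.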