What's the best search method to remove web crawlers or bots from download logs?

Path Finder

A few years ago, I was given a search string to filter web crawlers/bots out of our download reports. I'm curious what other people use to make sure bots are not counted in their downloads. Are there better methods?

This is the string I inherited:

eval agentType=if(match(http_user_agent,"(?i).*(bot|crawler|spider).*"),"Bot",if(match(http_user_agent,"^.*Mozilla/.*"),"Browser","Unknown")) | search agentType!="Bot"|search agentType!="Unknown"|
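For anyone who wants to see what this string keeps and drops, here is a minimal Python sketch that mirrors the same two regex checks outside Splunk. The function name `agent_type` and the sample user agents are illustrative assumptions, and Python's `re` engine is not identical to Splunk's regex engine, so treat this as an approximation of the eval rather than an exact replica.

```python
import re

# Same patterns as the eval above. re.search looks for a match anywhere in
# the string, so the leading/trailing ".*" in the original are not needed.
BOT_RE = re.compile(r"bot|crawler|spider", re.IGNORECASE)
BROWSER_RE = re.compile(r"Mozilla/")


def agent_type(http_user_agent: str) -> str:
    """Classify a user agent string as Bot, Browser, or Unknown."""
    if BOT_RE.search(http_user_agent):
        return "Bot"
    if BROWSER_RE.search(http_user_agent):
        return "Browser"
    return "Unknown"


# Hypothetical sample user agents, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "curl/7.68.0",
]

# Only "Browser" survives the two search filters in the original string.
kept = [ua for ua in samples if agent_type(ua) == "Browser"]
```

With this logic, a non-browser client such as curl lands in "Unknown" and is dropped as well, while a bot that sends an ordinary browser user agent is still counted as "Browser".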

Does anyone know of a more exact or better method to filter out crawlers?

Re: What's the best search method to remove web crawlers or bots from download logs?

Champion

I don't use eval statements to figure out whether something is a bot. I have a collection of 74 transforms applied against the useragent field. The regex patterns are highly tuned to match in the fewest steps. The reason I have so many is that our SEO team uses this data; however, this does not account for any bot doing a good job of impersonating a browser. We also do a CIDR match against the cip field and assume any address coming from AWS, Google Cloud, DigitalOcean, or Azure address blocks is a bot.
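As a rough illustration of the CIDR part (this is not the contents of the gist), the check can be sketched in Python with the standard `ipaddress` module. The `CLOUD_BLOCKS` list uses reserved documentation ranges as placeholders; real AWS, Google Cloud, DigitalOcean, and Azure ranges would have to come from each provider's published lists, and the function name `is_cloud_address` is hypothetical.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT real cloud blocks.
CLOUD_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_cloud_address(cip: str) -> bool:
    """Return True if the client IP falls inside any listed block."""
    try:
        addr = ipaddress.ip_address(cip)
    except ValueError:
        # Malformed or missing client IP: treat it as not matching.
        return False
    return any(addr in block for block in CLOUD_BLOCKS)
```

Inside Splunk, the same test can be written with the eval function cidrmatch, for example cidrmatch("192.0.2.0/24", cip), repeated or combined for each block.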

Here is a link to a gist I created - https://gist.github.com/httpstergeek/5fd08b9bc750e2d1954de78b063a092a

Hope this helps, and if it does, don't forget to accept and vote up. Cheers.

Re: What's the best search method to remove web crawlers or bots from download logs?

Path Finder

Wow, so that's a totally different method 🙂 Is it safe to say there isn't a definitive way to 100% accurately define bots?

Re: What's the best search method to remove web crawlers or bots from download logs?

Champion

You can get fairly close, but definitely not 100%. We also use Google Analytics, and our numbers match up fairly closely. Our SEO team uses Splunk for quick analysis and granularity since GA, I think, reports hourly.

Re: What's the best search method to remove web crawlers or bots from download logs?

Path Finder

Are you able to share any hints on how you created your set of 74 transforms? I can't find anything anywhere on making sure what I'm using is giving accurate results. When I compare Splunk and GA, the numbers vary greatly, and I'm trying to figure out whether my eval is the problem or GA is misbehaving somehow.

Re: What's the best search method to remove web crawlers or bots from download logs?

Champion

I've converted my post to an answer with a link to my transforms as a gist.

Re: What's the best search method to remove web crawlers or bots from download logs?

Path Finder

You are AMAZING! Thank you so much! 🙂

Re: What's the best search method to remove web crawlers or bots from download logs?

New Member

Hi!

Could you explain how to correctly implement this configuration in Splunk? I've copied transforms.conf, but nothing has changed.
I also want to exclude all bots from my analysis.

Thanks in advance!
