Splunk Search

What's the best search method to remove web crawlers or bots from download logs?

Communicator

A few years ago, I was given a search string to filter web crawlers/bots from showing up in our download reports. I'm curious what other people use to make sure bots are not counted in their downloads. Are there better methods?

This is the string I inherited:

eval agentType=if(match(http_user_agent,"(?i)(bot|crawler|spider)"),"Bot",if(match(http_user_agent,"Mozilla/"),"Browser","Unknown")) | search agentType!="Bot" agentType!="Unknown"

Does anyone know of a more exact or better method to filter out crawlers?

Champion

I don't use eval statements to figure out whether something is a bot. I have a collection of 74 transforms applied against the useragent field. The regex patterns are highly tuned to match in the fewest steps. The reason I have so many is that our SEO team uses this data; however, this does not account for any bot doing a good job of impersonating a browser. We also do a CIDR match against the cip and assume that any address in the AWS, Google Cloud, DigitalOcean, or Azure address blocks is a bot.

Here is a link to a gist I created - https://gist.github.com/httpstergeek/5fd08b9bc750e2d1954de78b063a092a
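
To make the two checks described above concrete, here is a rough sketch in Python rather than Splunk configuration. It is not the contents of the gist: the single regex stands in for the full set of transforms, the function name is made up for illustration, and the networks are RFC 5737 documentation ranges used as placeholders for the real cloud provider address blocks.

```python
import ipaddress
import re

# Hypothetical stand-in for the 74 tuned patterns: one case-insensitive
# regex over the user agent string.
BOT_UA = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# Placeholder networks (RFC 5737 documentation ranges). In practice these
# would be the published AWS, Google Cloud, DigitalOcean, and Azure blocks.
CLOUD_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_probable_bot(useragent: str, cip: str) -> bool:
    """Return True if the user agent looks like a bot or the client IP
    falls inside one of the assumed cloud address blocks."""
    if BOT_UA.search(useragent):
        return True
    # ip_address raises ValueError if cip is not a valid IP address.
    address = ipaddress.ip_address(cip)
    return any(address in network for network in CLOUD_NETWORKS)


if __name__ == "__main__":
    print(is_probable_bot("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.5"))
    print(is_probable_bot("Mozilla/5.0 (Windows NT 10.0)", "192.0.2.17"))
    print(is_probable_bot("Mozilla/5.0 (Windows NT 10.0)", "203.0.113.5"))
```

The first two calls should be flagged (bot-like user agent, then an address in a placeholder block) and the third should not. A bot that sends a browser-like user agent from a residential address would still slip through, which is the limitation mentioned above.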

Hope this helps, and if it does, don't forget to accept and vote up. Cheers.

New Member

Hi!

Could you explain how to correctly implement this configuration in Splunk? I've copied transforms.conf, but nothing has changed.
I also want to exclude all bots from my analysis.

Thanks in advance!

Communicator

Wow, so that's a totally different method 🙂 Is it safe to say there isn't a definitive way to 100% accurately define bots?

Champion

You can get fairly close, but definitely not 100%. We also use Google Analytics, and our numbers match up fairly closely. Our SEO team uses Splunk for quick analysis and granularity since GA, I think, reports hourly.

Communicator

Are you able to share any hints on how you created your set of 74 transforms? I can't find anything anywhere on making sure what I'm using is giving accurate results. When I compare Splunk and GA, the numbers vary greatly, and I'm trying to figure out whether my eval is the problem or GA is misbehaving somehow.

Champion

I've converted my post to an answer with a link to my transforms as a gist.

Communicator

You are AMAZING! Thank you so much! 🙂