What's the best search method to remove web crawlers or bots from download logs?

mistydennis
Communicator

A few years ago, I was given a search string to filter web crawlers/bots out of our download reports. I'm curious what other people use to make sure bots aren't counted in their downloads... are there better methods?

This is the string I inherited:

eval agentType=if(match(http_user_agent,"(?i).*(bot|crawler|spider).*"),"Bot",if(match(http_user_agent,"^.*Mozilla/.*"),"Browser","Unknown")) | search agentType!="Bot" | search agentType!="Unknown"
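
For readability, here's the same logic rewritten with case() (a sketch, assuming the same base search precedes it; match() is unanchored, so the surrounding .* isn't needed, and keeping only "Browser" is equivalent to dropping both "Bot" and "Unknown"):

eval agentType=case(
    match(http_user_agent, "(?i)bot|crawler|spider"), "Bot",
    match(http_user_agent, "Mozilla/"), "Browser",
    true(), "Unknown")
| search agentType="Browser"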

Does anyone know of a more exact or better method to filter out crawlers?

1 Solution

bmacias84
Champion

I don't use eval statements to figure out if something is a bot. I have a collection of 74 transforms applied against the useragent field. The regex patterns are highly tuned to match in the fewest steps. The reason I have so many is that our SEO team uses this data; however, this doesn't account for any bot that does a good job of impersonating a browser. We also do a CIDR match against the cip field and assume any address coming from AWS, Google Cloud, DigitalOcean, or Azure address blocks is a bot.
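
As a rough sketch of the transform approach (the stanza name, field names, and patterns below are illustrative placeholders, not what's in my actual set), a search-time transform that tags known crawlers could look like this:

transforms.conf:

[detect_bot_agent]
SOURCE_KEY = useragent
REGEX = (?i)\b(googlebot|bingbot|yandexbot|slurp)\b
FORMAT = bot_name::$1

props.conf:

[access_combined]
REPORT-bot_agents = detect_bot_agent

With that in place, excluding bots from a report is just a matter of adding NOT bot_name=* to the search. The CIDR side can be sketched at search time with cidrmatch() (the address blocks below are placeholders; the real lists come from each provider's published ranges):

sourcetype=access_combined NOT bot_name=*
| eval src_class=case(
    cidrmatch("52.0.0.0/10", cip), "aws",
    cidrmatch("35.190.0.0/17", cip), "gcp",
    true(), null())
| where isnull(src_class)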

Here is a link to a gist I created - https://gist.github.com/httpstergeek/5fd08b9bc750e2d1954de78b063a092a

Hope this helps, and if it does, don't forget to accept and vote up. Cheers.

rwesolowski
New Member

Hi!

Could you explain how to correctly implement this configuration in Splunk? I've copied transforms.conf, but nothing has changed.
I also want to exclude all bots from my analysis.

Thanks in advance!

mistydennis
Communicator

Wow, so that's a totally different method 🙂 Is it safe to say there isn't a definitive way to 100% accurately define bots?

bmacias84
Champion

You can get fairly close, but definitely not 100%. We also use Google Analytics, and our numbers match up fairly closely. Our SEO team uses Splunk for quick analysis and granularity, since GA, I believe, only reports hourly.

mistydennis
Communicator

Are you able to share any hints on how you created your set of 74 transforms? I can't find anything anywhere on making sure what I'm using is giving accurate results. When I compare Splunk and GA, the numbers vary greatly, and I'm trying to figure out whether it's my eval that's the problem or whether GA is misbehaving somehow.

bmacias84
Champion

I've converted my post to an answer with a link to my transforms as a gist.

mistydennis
Communicator

You are AMAZING! Thank you so much! 🙂
