Splunk Search

How to write a search to extract URLs crawled by Googlebot

guru1
New Member

Which field should be extracted for this use case?

index={wxxx} googlebot | fields URIs | stats count by URIs | addcoltotals count 

Is this search correct?

1 Solution

burwell
SplunkTrust

Look in the User-Agent field of your web server access logs and search it for Googlebot.

For us we see entries like

"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - -"
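As a rough illustration of the check described above, here is a minimal Python sketch that tests whether a user-agent string mentions Googlebot. The function name `is_googlebot` and the second sample string are made up for this example; only the first user-agent string comes from the thread.

```python
def is_googlebot(user_agent: str) -> bool:
    """Return True if the user-agent string mentions Googlebot (case-insensitive)."""
    return "googlebot" in user_agent.lower()


# First entry is the user agent quoted in the thread; the second is a
# hypothetical non-crawler value added for contrast.
user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

googlebot_agents = [ua for ua in user_agents if is_googlebot(ua)]
```

This is only a substring match on the user-agent value, which is the same idea as searching that field for Googlebot.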


koshyk
Super Champion

If you are using CIM, the Web data model already has a lot of this done, and it can be accelerated if you need faster analysis.
https://docs.splunk.com/Documentation/CIM/4.9.1/User/Web

guru1
New Member

Which field from these logs should be extracted to get the crawled URLs?
Example log: {mainsite} "66.249.66.131, 184.28.127.88, 165.254.1.201" - - [22/Oct/2017:02:45:03 -0400] "GET /somelink.html?promoid=P3KMQYMW&mv=other HTTP/1.1" 200 158382 10371 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Which query should be used?


burwell
SplunkTrust

Hi. The answer at https://answers.splunk.com/answers/584114/how-to-identify-pages-with-404-page-not-found-stat.html shows how to get at the fields. Using the same approach, you would find the field associated with

"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

So search around the time of the event above and then add

<your initial search> | table *

Scroll through the results and you will find the field that contains "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

If the fields are not extracted yet, you will need to add field extractions in props.conf to identify them. There is plenty of Splunk documentation on how to do that.
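To make the positional extraction concrete, here is a hedged Python sketch that pulls the request URI and user agent out of the example event posted earlier in the thread and counts URIs requested by Googlebot. The regular expression, the group names (`method`, `uri`, `protocol`, `status`, `user_agent`) and the function `crawled_uri_counts` are assumptions based on that single sample line, not the actual log format definition.

```python
import re
from collections import Counter

# Sample event copied from the thread.
SAMPLE = (
    '{mainsite} "66.249.66.131, 184.28.127.88, 165.254.1.201" - - '
    '[22/Oct/2017:02:45:03 -0400] '
    '"GET /somelink.html?promoid=P3KMQYMW&mv=other HTTP/1.1" '
    '200 158382 10371 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# Assumed layout: "METHOD URI PROTOCOL" STATUS ... "USER AGENT" at end of line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) .*"(?P<user_agent>[^"]*)"$'
)


def crawled_uri_counts(lines):
    """Count requested URIs for events whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line.strip())
        if match and "googlebot" in match.group("user_agent").lower():
            counts[match.group("uri")] += 1
    return counts


if __name__ == "__main__":
    for uri, count in crawled_uri_counts([SAMPLE]).most_common():
        print(count, uri)
```

The URI is kept with its query string here; whether to strip it depends on what you count as one crawled URL. The same pattern idea could be carried into a Splunk field extraction, though that is not tested here.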

If you have control over what is creating the web logs, I highly recommend using field/value pairs instead of positional fields. Life is much easier when your logs contain status=200 bytes=10371 and so on, because Splunk pulls these fields out for you.
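To show why field/value pairs are easier to work with, here is a small Python sketch that parses such a line without knowing field positions. The `status` and `bytes` pairs come from the post above; the `uri` and `useragent` field names, the sample line and the helper `parse_kv` are hypothetical.

```python
import re

# Matches key=value, where the value is either double-quoted or a run of non-spaces.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_kv(line: str) -> dict:
    """Parse key=value pairs from a log line, removing quotes around values."""
    return {key: value.strip('"') for key, value in KV_PATTERN.findall(line)}


# Hypothetical log line in field/value form.
line = (
    'status=200 bytes=10371 uri=/somelink.html '
    'useragent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
fields = parse_kv(line)
```

With this layout, adding or reordering fields in the log does not break the parsing, which is the main advantage over positional formats.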
