All Apps and Add-ons

Any apps for web site crawling?

melonman
Motivator

Hi,

I am looking for an add-on or app that gathers the HTML content of a website.
E.g. I want to collect the HTML pages found under www.somesite.com, and do further text analysis based on the information gathered from the site.

The main use case is to find the description of a specific product on a partner company's site and check whether that description is really up to date and correct. The characters used on the sites may include non-English ones.

I know the crawl command that comes with the Splunk installation is not meant for this use case...

I would appreciate any comment from anyone who has tried this kind of thing...

Thank you,

Tags (3)
0 Karma

MuS
Legend

Python's urllib2 is wonderful for this kind of stuff. It will take you some time to write the script, but in the end you will have exactly the one script that fits your needs 🙂
You can find an example here: http://ryanmerl.com/2009/02/14/python-web-crawler-in-less-than-50-lines/
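For example, a minimal fetch could start out something like this (just a sketch; www.somesite.com is a placeholder, and the UTF-8 encoding is an assumption, so check the real site's Content-Type header). Note it is Python 2, since urllib2 was removed in Python 3:

    # Minimal page fetch with urllib2 (Python 2). A sketch, not a finished input.
    import urllib2

    url = 'http://www.somesite.com/index.html'  # placeholder URL
    response = urllib2.urlopen(url)

    # The pages may contain non-English characters, so decode explicitly.
    # UTF-8 is an assumption; check the site's Content-Type header.
    html = response.read().decode('utf-8')
    print html.encode('utf-8')  # a Splunk scripted input emits events on stdout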

sit down, relax, take a deep breath and start scripting...

melonman
Motivator

I think I need to add more info to my question, but please see the comment under jstockamp's answer.

Regarding your idea: I was actually gathering weather information from a website that showed daily weather data in an HTML table. There was also a site that posted daily nuclear emission figures as a table in a PDF. There are many kinds of data that would be very useful but are not really machine friendly.

If you could make this kind of thing easier, that would be great.

0 Karma

jstockamp
Communicator

Just use a Splunk scripted input and use "curl" to hit the web page. Once you've got the HTML you can extract anything you want. For HTML tables I've used a couple of "sed" commands to convert the whole table to CSV before bringing it into Splunk. Take a look at http://www.mylinuxplace.com/tips_and_tricks/convert-html-to-csv/ for a few hints.
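If the sed approach gets fragile, the same table-to-CSV step can also be done in Python with just the standard library. A rough sketch (it assumes simple, well-formed <table>/<tr>/<td> markup, and the URL is a placeholder):

    # Convert the HTML tables on a page to CSV rows on stdout (Python 2).
    import urllib2
    import csv
    import sys
    from HTMLParser import HTMLParser

    class TableToCsv(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.in_cell = False
            self.row = []
            self.rows = []

        def handle_starttag(self, tag, attrs):
            if tag == 'tr':
                self.row = []
            elif tag in ('td', 'th'):
                self.in_cell = True
                self.row.append('')

        def handle_endtag(self, tag):
            if tag in ('td', 'th'):
                self.in_cell = False
            elif tag == 'tr' and self.row:
                self.rows.append(self.row)

        def handle_data(self, data):
            if self.in_cell:
                self.row[-1] += data.strip()

    html = urllib2.urlopen('http://www.somesite.com/table.html').read()  # placeholder
    parser = TableToCsv()
    parser.feed(html)
    csv.writer(sys.stdout).writerows(parser.rows)

Run as a scripted input, each CSV row comes out on stdout and gets indexed as an event.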

0 Karma

melonman
Motivator

If I could specify the exact URL on a site, then I could simply use curl to get the HTML. But I want a crawler type of thing that gets the HTML of the top page (e.g. www.somesite.com/index.html), follows the links found in that HTML to get the HTML of the next pages (e.g. www.somesite.com/nextdir/nextpage.html), and so on, traversing the site and gathering the HTML.
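In the absence of a ready-made app, the follow-the-links part is scriptable too. A rough Python 2 sketch (the same-host check and page limit are my own additions to keep the crawl from wandering off-site; there is no politeness delay or robots.txt handling here):

    # Tiny breadth-first crawler: fetch a start page, follow links,
    # and collect the HTML of every page found under the same host.
    import urllib2
    import urlparse
    from HTMLParser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=50):
        host = urlparse.urlparse(start_url).netloc
        queue, seen, pages = [start_url], set([start_url]), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            try:
                html = urllib2.urlopen(url).read()
                extractor = LinkExtractor()
                extractor.feed(html)
            except Exception:
                continue  # skip pages that fail to fetch or parse
            pages[url] = html
            for link in extractor.links:
                absolute = urlparse.urljoin(url, link)
                # stay on the same host and skip pages already visited
                if urlparse.urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

    if __name__ == '__main__':
        for url, html in crawl('http://www.somesite.com/index.html').items():
            print url, len(html)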

0 Karma

LukeMurphey
Champion

I haven't tried this yet, but I have been considering writing an app to do this sort of thing. I was thinking of writing something that would do things like:

  • Grab information using a jQuery-like selector
  • Convert a table of information into an event (useful because some webpages have tables where the columns or rows contain the field name and the cells contain the values)

Let me know if you have other ideas (a rough sketch of the selector idea follows below). If others want this, then I might just start working on it.
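For what it's worth, the jQuery-like selector idea could be prototyped with an HTML parsing library. A hypothetical sketch using BeautifulSoup (bs4 is not bundled with Splunk, and both the URL and the .product-description selector are made up for illustration):

    # Grab elements with a CSS selector, much like jQuery's $() (Python 2).
    import urllib2
    from bs4 import BeautifulSoup  # assumes bs4 is installed

    html = urllib2.urlopen('http://www.somesite.com/product.html').read()  # placeholder
    soup = BeautifulSoup(html, 'html.parser')

    # '.product-description' is a made-up selector for whatever element
    # holds the product text you want to compare.
    for element in soup.select('.product-description'):
        print element.get_text(strip=True)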

0 Karma