Any apps for web site crawling?

Motivator

Hi,

I am looking for an add-on or app that gathers the HTML of a website.
e.g. I want to collect the HTML pages found under www.somesite.com and do further text analysis based on the information gathered from the site.

The main use case is to find a specific description of a product on a partner company's site and check whether that description is really up to date and correct. The characters used on the sites may include non-English ones.

I know the crawl command that comes with the Splunk installation is not for this use case...

I would appreciate any comments from anyone who has tried this kind of thing...

Thank you,

1 Solution

Communicator

Just use a Splunk scripted input and use "curl" to hit the web page. Once you've got the HTML you can extract anything you want. For HTML tables I've used a couple of "sed" commands to convert the whole table to CSV before bringing it into Splunk. Take a look at http://www.mylinuxplace.com/tips_and_tricks/convert-html-to-csv/ for a few hints.
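If sed isn't a good fit (or the table markup is messy), the same table-to-CSV conversion can be sketched with Python's standard library instead. This is an illustrative alternative to the curl | sed pipeline, not the answerer's exact method, and the sample HTML is made up:

```python
# Sketch: convert an HTML table to CSV rows using only the standard
# library, as an alternative to sed. Fetch the page (e.g. with curl or
# urllib) and feed the HTML string to table_to_csv().
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect <td>/<th> cell text and emit one CSV row per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")        # start a new cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(",".join(self.row))
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

def table_to_csv(html):
    parser = TableToCSV()
    parser.feed(html)
    return "\n".join(parser.rows)

sample = ("<table><tr><th>City</th><th>Temp</th></tr>"
          "<tr><td>Tokyo</td><td>21</td></tr></table>")
print(table_to_csv(sample))
# City,Temp
# Tokyo,21
```

The resulting CSV lines can be printed on stdout from a scripted input so Splunk indexes them directly.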


SplunkTrust

Python urllib2 is wonderful for doing this stuff. It will take you some time to write the script, but in the end you will have exactly the one script that fits your needs 🙂
You can find an example here: http://ryanmerl.com/2009/02/14/python-web-crawler-in-less-than-50-lines/

sit down, relax, take a deep breath and start scripting...
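As a starting point, here is a minimal sketch of the fetch-and-extract-links step at the heart of such a crawler. Note that urllib2 was the Python 2 module; this sketch uses its Python 3 successor urllib.request, and the URLs are placeholders:

```python
# Sketch: fetch a page and collect the links it contains, the core
# building block of the crawler described above. Stdlib only.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def fetch_links(url):
    # Network call; wrap in try/except in a real scripted input.
    with urlopen(url) as resp:
        return extract_links(resp.read().decode("utf-8", "replace"), url)

# Offline demo (no network needed):
page = '<a href="/nextdir/nextpage.html">next</a>'
print(extract_links(page, "http://www.somesite.com/index.html"))
# ['http://www.somesite.com/nextdir/nextpage.html']
```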

Motivator

I think I need to add more info to my question, but please see the comment under jstockamp's answer.

Well, as for your idea: I was actually gathering weather information from a weather website that showed daily weather data in an HTML table. There was also a website that posted daily nuclear emissions in a table inside a PDF. There are many kinds of data that would be very useful but are not really machine friendly.

If you could make this kind of thing easier, that would be great.



Motivator

If I could specify the exact URL on a site, then I could simply use curl to get the HTML. But I want a crawler-type tool that gets the HTML of a top page (e.g. www.somesite.com/index.html), follows the links found in that HTML to get the HTML of the next page (e.g. www.somesite.com/nextdir/nextpage.html), and so on: traverse the site and gather the HTML.
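The traversal described here is essentially a breadth-first walk restricted to one host. A hedged sketch, with the page fetcher injected as a callable so the walk itself needs no network (in practice it would wrap curl or urllib; the URLs below are illustrative):

```python
# Sketch of the site traversal described above: start at the top page
# and follow same-host links breadth-first. `fetch(url)` must return
# (html, outgoing_links) for that URL.
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, fetch, max_pages=100):
    """Return {url: html} for pages reachable from start_url on the same host."""
    host = urlparse(start_url).netloc
    seen, pages = {start_url}, {}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html, links = fetch(url)
        pages[url] = html
        for link in links:
            # Stay on the same site; skip anything already queued.
            if link not in seen and urlparse(link).netloc == host:
                seen.add(link)
                queue.append(link)
    return pages

# Demo with a fake two-page site:
site = {
    "http://www.somesite.com/index.html":
        ("<html>top</html>", ["http://www.somesite.com/nextdir/nextpage.html",
                              "http://othersite.example/skip.html"]),
    "http://www.somesite.com/nextdir/nextpage.html":
        ("<html>next</html>", []),
}
pages = crawl("http://www.somesite.com/index.html", lambda u: site[u])
print(sorted(pages))
# ['http://www.somesite.com/index.html',
#  'http://www.somesite.com/nextdir/nextpage.html']
```

The `max_pages` cap and the same-host check keep the crawl from wandering off the partner site.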

0 Karma

Champion

I haven't tried this yet but I have been considering writing an app to do this sort of thing. I was thinking of writing something that would do things like:

  • Grab information using a jQuery-like selector
  • Convert a table of information into an event (useful because some webpages have tables where the columns or rows contain the field name and the cells contain the values)

Let me know if you have other ideas. If others want this too, then I might just start working on it.
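The second idea above (a table where a header row carries the field names and the cells carry the values) might look like this hedged sketch, with made-up field names and data, emitting one field="value" event per data row as a Splunk scripted input could print it:

```python
# Sketch of the "table to event" idea: pair a header row with each data
# row and emit one field="value" event line per row.
def rows_to_events(header, rows):
    return [" ".join(f'{h}="{v}"' for h, v in zip(header, row))
            for row in rows]

header = ["city", "temp"]
rows = [["Tokyo", "21"], ["Osaka", "23"]]
for event in rows_to_events(header, rows):
    print(event)
# city="Tokyo" temp="21"
# city="Osaka" temp="23"
```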
