Solved: Any apps for web site crawling?

melonman · ‎03-14-2013

Hi,

I am looking for an add-on or app that gathers the HTML of a website.
For example, I want to collect the HTML pages found under www.somesite.com and run further text analysis on the information gathered from the site.

The main use case is to find the description of a specific product on the site of a company's partner, and check whether that description is up to date and correct. The sites may contain non-English characters.

I know the crawl command that comes with the Splunk installation is not meant for this use case...

I would appreciate comments from anyone who has tried this kind of thing...

Thank you,

MuS · ‎03-15-2013

Python's urllib2 is wonderful for this kind of thing. It will take you some time to write the script, but in the end you will have exactly the one script that fits your needs 🙂
You can find an example here: http://ryanmerl.com/2009/02/14/python-web-crawler-in-less-than-50-lines/

Sit down, relax, take a deep breath and start scripting...
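
A note for readers on Python 3: urllib2 is a Python 2 module, and its functionality now lives in urllib.request. Since the question mentions non-English characters, here is a minimal sketch of fetching a page and decoding it. The helper names `decode_html` and `fetch` are made up for illustration, and the charset detection (HTTP header first, then a `charset=` hint near the top of the page, then UTF-8) is a simplifying assumption, not a complete implementation of how browsers sniff encodings.

```python
import re
import urllib.request
from typing import Optional


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode page bytes: HTTP header charset first, then a charset= hint, then UTF-8."""
    charset = header_charset
    if charset is None:
        # Look for charset=... (e.g. in a <meta> tag) near the top of the page.
        match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
        if match:
            charset = match.group(1).decode("ascii")
    try:
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The page declared a charset name Python does not know.
        return raw.decode("utf-8", errors="replace")


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return decode_html(resp.read(), resp.headers.get_content_charset())
```

Printing the result of `fetch(...)` from a scripted input would hand the decoded HTML to Splunk; undecodable bytes are replaced rather than raising an error.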

melonman · ‎03-15-2013

I think I need to add more info to my question, but please see the comment under jstockamp's answer.

Well, as for your idea: I was actually gathering weather information from a website that showed daily weather data as an HTML table. There was also a website that posted daily nuclear emission data as a table in a PDF. There are many kinds of data like this that would be very useful but are not really machine-friendly.

If you could make this kind of thing easier, that would be great.

jstockamp · ‎03-14-2013

Just use a Splunk scripted input and use "curl" to fetch the web page. Once you've got the HTML you can extract anything you want. For HTML tables I've used a couple of "sed" commands to convert the whole table to CSV before bringing it into Splunk. Take a look at http://www.mylinuxplace.com/tips_and_tricks/convert-html-to-csv/ for a few hints.
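
For anyone who would rather not maintain "sed" expressions, the same table-to-CSV step can be sketched with Python's standard library. This is not the approach from the linked page. `TableParser` and `table_to_csv` are hypothetical names, and the sketch assumes a simple table without nested tables or rowspan/colspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def table_to_csv(html: str) -> str:
    """Return all table rows found in the HTML as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The csv module quotes cells that contain commas, which is easy to get wrong with hand-written "sed" rules.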

melonman · ‎03-15-2013

If I could specify the exact URL within a site, I could simply use curl to get the HTML. But I want a crawler-type tool that gets the HTML of the top page (e.g. www.somesite.com/index.html), follows the links found in it to get the HTML of the next page (e.g. www.somesite.com/nextdir/nextpage.html), and so on, traversing the site and gathering the HTML.
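
The traversal described here (start at a top page, follow its links, stay on the same site) can be sketched as a breadth-first crawl in Python using only the standard library. This is an illustration under assumptions rather than a finished crawler: `LinkParser`, `fetch`, `crawl`, and the `max_pages` limit are hypothetical names and choices, only `<a href>` links are followed, and "same site" is taken to mean "same host name".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page as text (charset from the HTTP header, else UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, fetch=fetch, max_pages=50):
    """Breadth-first crawl of one host; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError, LookupError):
            continue  # skip pages that fail to download or decode
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(url, href)).url
            parts = urlparse(link)
            if parts.scheme in ("http", "https") and parts.netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://www.somesite.com/index.html")` would return a dictionary of page URL to HTML. Before pointing something like this at a partner's site, it would be polite to add a delay between requests and to check robots.txt (the standard library has urllib.robotparser for that).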

LukeMurphey · ‎03-14-2013

I haven't tried this yet but I have been considering writing an app to do this sort of thing. I was thinking of writing something that would do things like:

Grab information using a jQuery-like selector
Convert a table of information into an event (useful because some webpages have tables where the columns or rows contain the field name and the cells contain the values)
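
To make the second idea concrete, here is a rough sketch of turning already-parsed table rows into key="value" event text, covering both layouts mentioned (field names in the first row, or field names in the first column). The function name `rows_to_events` and the `fields_in` parameter are made up for illustration; the sketch assumes field names contain no spaces and values contain no double quotes.

```python
def rows_to_events(rows, fields_in="columns"):
    """Turn table rows (lists of cell strings) into key="value" event strings.

    fields_in="columns": the first row holds the field names and
                         every later row is one event.
    fields_in="rows":    the first cell of each row is the field name and
                         every later column is one event.
    """
    if not rows:
        return []
    if fields_in == "rows":
        # Transpose so the field names end up in the first row.
        rows = [list(column) for column in zip(*rows)]
    header, *records = rows
    return [
        " ".join(f'{name}="{value}"' for name, value in zip(header, record))
        for record in records
    ]
```

For example, the rows `[["date", "temp"], ["2013-03-14", "12"]]` become the single event `date="2013-03-14" temp="12"`.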

Let me know if you have other ideas. If others want this, then I might just start working on it.
