Hi,
I am looking for an add-on or app that gathers the HTML content of a website.
e.g. I want to collect the HTML pages found under www.somesite.com and do further text analysis based on the information gathered from the site.
The main use case is to find the description of a specific product on a partner company's website and check whether that description is really up to date and correct. The characters used on the sites may include non-English ones.
I know the crawl command that comes with the Splunk installation is not meant for this use case...
I would appreciate any comments from anyone who has tried this kind of thing...
Thank you,
Just use a Splunk scripted input and use "curl" to hit the web page. Once you've got the HTML you can extract anything you want. For HTML tables I've used a couple of "sed" commands to convert the whole table to CSV before bringing it into Splunk. Take a look at http://www.mylinuxplace.com/tips_and_tricks/convert-html-to-csv/ for a few hints.
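If you would rather keep the whole scripted input in one script, here is a rough Python equivalent of the curl + sed idea, just as a sketch: fetch the page and stream its table rows to stdout as CSV. It only uses the standard library (urllib2, HTMLParser, csv); the URL and the UTF-8 fallback are placeholders you would adjust for the real site.

import csv
import sys
import urllib2
from HTMLParser import HTMLParser

class TableToCsv(HTMLParser):
    """Emit one CSV row per <tr>, one column per <td>/<th>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.writer = csv.writer(sys.stdout)
        self.in_cell = False
        self.cell = []
        self.row = []

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self.in_cell = True
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False
            self.row.append(''.join(self.cell).strip().encode('utf-8'))
        elif tag == 'tr' and self.row:
            self.writer.writerow(self.row)   # one CSV line per table row
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

# Placeholder URL; decode with 'replace' so odd characters don't kill the script.
html = urllib2.urlopen('http://www.somesite.com/products.html').read()
TableToCsv().feed(html.decode('utf-8', 'replace'))

Each CSV line printed to stdout would then be indexed by the scripted input the same way the curl + sed output would be.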
Python urllib2 is wonderful for doing this kind of thing. It will take you some time to write the script, but in the end you will have exactly the one script that fits your needs 🙂
You can find an example here: http://ryanmerl.com/2009/02/14/python-web-crawler-in-less-than-50-lines/
sit down, relax, take a deep breath and start scripting...
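One detail worth getting right early, since you mention non-English characters: decode the response using the charset the server declares before doing any text analysis. A minimal urllib2 sketch (the URL is only a placeholder):

import urllib2

response = urllib2.urlopen('http://www.somesite.com/index.html')

# Use the charset declared in the HTTP headers, falling back to UTF-8,
# so non-English characters survive the round trip.
charset = response.info().getparam('charset') or 'utf-8'
text = response.read().decode(charset, 'replace')

print text.encode('utf-8')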
I think I need to add more info to my question, but please see the comment under jstockamp's answer.
Well, to your point: I was actually gathering weather information from a weather website that showed daily weather data in an HTML table. There was also a website that posted daily nuclear emissions in a table inside a PDF. There are many kinds of data that would be very useful but are not really machine friendly.
If you could make this kind of thing easier, that would be great.
If I could specify the exact URL on a site, then I could simply use curl to get the HTML. But I want a crawler type of thing that gets the HTML of the top page (e.g. www.somesite.com/index.html), follows the links found in that HTML to get the HTML of the next page (e.g. www.somesite.com/nextdir/nextpage.html), and so on... traversing the site and gathering the HTML.
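Something like this quick sketch is what I have in mind: start at the top page, pull out the links, and fetch every page on the same host. It only uses the standard library (urllib2, urlparse, HTMLParser); the start URL and the page limit are placeholders.

import urllib2
import urlparse
from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    host = urlparse.urlparse(start_url).netloc
    to_visit, seen, pages = [start_url], set(), {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            continue                      # skip pages that fail to load
        pages[url] = html                 # keep the raw HTML for later analysis
        parser = LinkExtractor()
        parser.feed(html.decode('utf-8', 'replace'))
        for link in parser.links:
            absolute = urlparse.urljoin(url, link)
            if urlparse.urlparse(absolute).netloc == host:
                to_visit.append(absolute)  # stay on the same site
    return pages

if __name__ == '__main__':
    for url, html in crawl('http://www.somesite.com/index.html').items():
        print url, len(html)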
I haven't tried this yet but I have been considering writing an app to do this sort of thing. I was thinking of writing something that would do things like:
Let me know if you have other ideas. If others want this, then I might just start working on it.