I'm struggling to monitor a webpage with Splunk. I just need to get the HTML from a certain URL (like google.com) and get the contents of an element into Splunk. According to the following link, a script using wget would be the best solution:
Hi, thanks for your response. Have you actually been able to get this to work? I have been trying to set it up and no matter what I do, it returns a table with timed_out as True. Here are examples of searches after setup:
Can you let me know how you set it up?
It is working fine for a simple HTML page I created that doesn't have any CSS:
<!DOCTYPE html> <html> <body> hello this is a sample page content </body> </html>
In this case, in the selector settings of the Website Input you should put *
Then, to extract only the body of the HTML, I used the rex command:
index=web* sourcetype="web test" | rex field=content "(?msi)<Body>(?<testdata>.*)<\/Body>\s+"
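To check the pattern outside Splunk, here is a rough Python sketch of what that rex does. It assumes the sample page above ends up as one string in the content field; the names content, BODY_RE and extract_body are made up for this illustration and are not part of the Website Input app. Python writes named groups as (?P<name>...) instead of (?<name>...).

```python
import re

# Sample page from this thread, assumed to arrive as a single string
# in the "content" field (hypothetical variable for illustration).
content = (
    "<!DOCTYPE html> <html> <body> hello this is a sample page content "
    "</body> </html>"
)

# Same pattern as the rex command: case-insensitive, dot matches newlines.
# Python spells the named capture group (?P<testdata>...).
BODY_RE = re.compile(r"(?msi)<body>(?P<testdata>.*)</body>\s+")


def extract_body(html):
    """Return the text between <body> and </body>, or None if no match."""
    match = BODY_RE.search(html)
    return match.group("testdata").strip() if match else None


print(extract_body(content))
```

Note that the trailing \s+ requires at least one whitespace character after the closing body tag, so a page where </html> follows it directly will not match with this pattern.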
I have used it for a while in a test environment. It works OK, but it takes some time to find the right selector.
Please note that the selector is based on the page's CSS classes. In the search in your comment the selector is body, which seems to be the HTML part. Try finding the CSS class that defines the text you are trying to get from the site.
Got it. I was thinking CSS and HTML classes were the same. The page I am using is very bare-bones. There is no CSS at all. You're saying you can only monitor classes that have CSS applied to them?