Splunk Search

Parse one section of an HTML page (e.g., a specific table) to output.txt hourly (24 times per day). I need the regex syntax to capture that section only, NOT the whole source.



I need to parse a specific web page's table and write the result to output.txt. I'm using PowerShell ($wc.DownloadString, i.e. System.Net.WebClient) to download the source code.

If I pull the entire source code, every download repeats content that was already captured, so I get duplicate events, which inflates my event counts.

I need to pull only that exact section of the page 24 times a day (once per hour) and write it to a file.

What I need:

The regex syntax to match one specific section/table of the HTML source. Should I use a named capture to mark the beginning and the end of the table, so that I can output or index everything in between?
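For illustration, a minimal sketch of that begin/end idea. Python is used only for the demonstration; in .NET (PowerShell) the named group would be written `(?<table>...)` instead of `(?P<table>...)`. The `id="posts"` marker and the sample HTML are made up, so the real page's markup has to be substituted.

```python
import re
from typing import Optional

# Hypothetical start marker: a <table> tag carrying id="posts".
# The end marker is the first closing </table> after it.
TABLE_RE = re.compile(
    r'<table[^>]*id="posts"[^>]*>(?P<table>.*?)</table>',
    re.DOTALL | re.IGNORECASE,  # DOTALL lets "." cross line breaks
)


def extract_table(html: str) -> Optional[str]:
    """Return the inner HTML of the first matching table, or None."""
    match = TABLE_RE.search(html)
    return match.group("table") if match else None


# Made-up page used only to exercise the function.
sample = """<html><body>
<div>header noise</div>
<table id="posts">
<tr><td>post 1</td></tr>
<tr><td>post 2</td></tr>
</table>
<div>footer noise</div>
</body></html>"""

print(extract_table(sample))
```

The non-greedy `.*?` stops at the first `</table>`, so this only works as written if the target table contains no nested tables.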

Thanks in advance for your help!


Hi agoktas,
you can read the page with a forwarder installed on the web server and index only the part you're interested in.

It's not possible to write the regex without seeing the page, because a regex is specific to its source.

Anyway the method I suggest is:

  • take your page source,
  • paste it into regex101.com,
  • build and test the regex there,
  • configure your filtering (props.conf and transforms.conf) using the regex you found.




In the transforms.conf stanza that holds your regex, send the matching events to the index queue:

DEST_KEY = queue
FORMAT = indexQueue

Remember that your regex has to match across multiple lines, so put (?ms) at the beginning of it.
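A quick way to see why the flag matters, sketched in Python with a made-up three-line page (the `s` and `m` inline flags have the same meaning in PCRE, which Splunk's transforms use):

```python
import re

# Made-up page fragment spread over three lines.
page = "<table>\n<tr><td>row</td></tr>\n</table>"

# Without the s flag, "." stops at each newline, so the match cannot span lines.
no_flag = re.search(r"<table>.*</table>", page)

# With (?ms): s lets "." match newlines, m makes ^ and $ match at every line.
with_flag = re.search(r"(?ms)<table>.*?</table>", page)

print(no_flag is None)        # the single-line attempt finds nothing
print(with_flag is not None)  # the (?ms) version spans all three lines
```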




Hi cusello,

Thanks for your reply, but this is a web server that I don't own, and I don't have admin access to it. Please see my reply to niketnilay's response, which should clarify my use case a bit more.

Let me know if we need further clarification.




@agoktas, the regex will be specific to the data, so we need sample data from the web page. Please mock up the events, if not the entire page, so that the community can help you.

Please also describe the exact section of the page you want to pull 24 times a day (once per hour): which HTML tag it belongs to and what pattern the data follows.

| makeresults | eval message= "Happy Splunking!!!"


What I'm trying to do is parse a table on a particular web site (e.g., forum posts) every minute or every hour (still deciding) and extract a specific named field with a regex. Then I'm going to run reports on this field (e.g., number of occurrences).

I am already using PowerShell to download the source code, and I am indexing the resulting output.txt.

$wc = New-Object System.Net.WebClient
$wc.DownloadString("https://website.com/forum123/") > C:\PS_Output\Output.txt

The problem is that when I overwrite output.txt at each interval, I get a lot of duplicates for this named field. I need a way to write to output.txt as if it were a traditional log file, appending only new entries so that no event is duplicated.
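One possible way to get that log-file behaviour, sketched in Python under stated assumptions: each table row (`<tr>...</tr>`) counts as one event, a row is a duplicate if its text is identical to one seen before, and a side file of hashes (the hypothetical `seen.txt`) remembers what was already written. The function name, file names, and sample rows are all made up for the example.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Assumption: one <tr>...</tr> block is one event.
ROW_RE = re.compile(r"<tr\b.*?</tr>", re.DOTALL | re.IGNORECASE)


def append_new_rows(table_html: str, log_path: Path, seen_path: Path) -> int:
    """Append rows not written before to log_path; return how many were added."""
    seen = set(seen_path.read_text().split()) if seen_path.exists() else set()
    added = 0
    with log_path.open("a", encoding="utf-8") as log, \
            seen_path.open("a", encoding="utf-8") as seen_out:
        for row in ROW_RE.findall(table_html):
            line = " ".join(row.split())  # collapse each row to a single line
            digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already written on an earlier run
            seen.add(digest)
            log.write(line + "\n")
            seen_out.write(digest + "\n")
            added += 1
    return added


# Demo with two overlapping "downloads" (made-up rows).
workdir = Path(tempfile.mkdtemp())
log_file = workdir / "Output.txt"
seen_file = workdir / "seen.txt"
first = "<tr><td>post 1</td></tr><tr><td>post 2</td></tr>"
second = "<tr><td>post 2</td></tr><tr><td>post 3</td></tr>"
first_added = append_new_rows(first, log_file, seen_file)
second_added = append_new_rows(second, log_file, seen_file)
print(first_added, second_added)
```

Because the file is only ever appended to, a file monitor should generally pick up just the new lines. If a row can legitimately change over time (for example a reply counter inside it), hashing the whole row would treat each change as a new event, so a stable key such as a post id may be a better choice.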

Hope this clarifies a bit. 🙂
