<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348421#M103141</link>
    <description>&lt;P&gt;Hello!  &lt;/P&gt;

&lt;P&gt;Here is what I'm trying to do:&lt;BR /&gt;&lt;BR /&gt;
Index a particular section of a web page.  This particular section is a forum that is updated constantly, and there is only 1 main column that I'm interested in, which is titled "Subject".  &lt;/P&gt;

&lt;P&gt;How do I accomplish this without indexing duplicate entries? Duplicates are what I get when I do the following.  &lt;/P&gt;

&lt;P&gt;Currently I run the following using PowerShell: &lt;BR /&gt;
$wc.downloadstring("&lt;A href="https://website.com/forum123/%22"&gt;https://website.com/forum123/"&lt;/A&gt;) &amp;gt;C:\PS_Output\Output.txt&lt;/P&gt;

&lt;P&gt;Then I index Output.txt and use Splunk with a regex named capture group to find occurrences of a particular string (e.g., 4 consecutive capital letters).&lt;BR /&gt;&lt;BR /&gt;
But each time Output.txt is overwritten (when I run $wc.downloadstring twice, seconds apart), I get a lot of duplicates.  &lt;/P&gt;

&lt;P&gt;I believe I have 2 problems:&lt;BR /&gt;
1) I need to clean up Output.txt so it contains only relevant events (no need for all the surrounding garbage HTML source).  Perhaps I need to apply some regex to the output of the $wc.downloadstring method?&lt;BR /&gt;&lt;BR /&gt;
2) The tricky part is how quickly the webpage's table is flushed out by new posts.  If I run this every minute, but all 50 posts are replaced by 50 new posts within 30 seconds, I lose about half the content that I need.  &lt;/P&gt;

&lt;P&gt;Anyone out there ever tried grabbing content from an external site (not having admin access to the server of course) and keeping historical data?  &lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Tue, 26 Sep 2017 05:00:50 GMT</pubDate>
    <dc:creator>agoktas</dc:creator>
    <dc:date>2017-09-26T05:00:50Z</dc:date>
    <item>
      <title>How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348421#M103141</link>
      <description>&lt;P&gt;Hello!  &lt;/P&gt;

&lt;P&gt;Here is what I'm trying to do:&lt;BR /&gt;&lt;BR /&gt;
Index a particular section of a web page.  This particular section is a forum that is updated constantly, and there is only 1 main column that I'm interested in, which is titled "Subject".  &lt;/P&gt;

&lt;P&gt;How do I accomplish this without indexing duplicate entries? Duplicates are what I get when I do the following.  &lt;/P&gt;

&lt;P&gt;Currently I run the following using PowerShell: &lt;BR /&gt;
$wc.downloadstring("&lt;A href="https://website.com/forum123/%22"&gt;https://website.com/forum123/"&lt;/A&gt;) &amp;gt;C:\PS_Output\Output.txt&lt;/P&gt;

&lt;P&gt;Then I index Output.txt and use Splunk with a regex named capture group to find occurrences of a particular string (e.g., 4 consecutive capital letters).&lt;BR /&gt;&lt;BR /&gt;
But each time Output.txt is overwritten (when I run $wc.downloadstring twice, seconds apart), I get a lot of duplicates.  &lt;/P&gt;

&lt;P&gt;I believe I have 2 problems:&lt;BR /&gt;
1) I need to clean up Output.txt so it contains only relevant events (no need for all the surrounding garbage HTML source).  Perhaps I need to apply some regex to the output of the $wc.downloadstring method?&lt;BR /&gt;&lt;BR /&gt;
2) The tricky part is how quickly the webpage's table is flushed out by new posts.  If I run this every minute, but all 50 posts are replaced by 50 new posts within 30 seconds, I lose about half the content that I need.  &lt;/P&gt;

&lt;P&gt;Anyone out there ever tried grabbing content from an external site (not having admin access to the server of course) and keeping historical data?  &lt;/P&gt;
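&lt;P&gt;To make the idea concrete, here is a minimal sketch (Python, purely for illustration; the sample HTML, the helper names, and the "4 consecutive capital letters" pattern are assumptions based on the description above) of extracting subjects and keeping only ones not already seen in an earlier poll:&lt;/P&gt;

```python
import re

# Pattern from the post: 4 consecutive capital letters.
SUBJECT_RE = re.compile(r"\b[A-Z]{4}\b")

def extract_subjects(html_text):
    """Pull candidate subject strings out of raw page source."""
    return SUBJECT_RE.findall(html_text)

def append_new(subjects, seen):
    """Return only subjects not seen before, updating the seen set.

    Persisting `seen` to disk between runs (one subject per line)
    is what prevents re-indexing duplicates across polls.
    """
    fresh = []
    for s in subjects:
        if s not in seen:
            seen.add(s)
            fresh.append(s)
    return fresh

# Example: two polls seconds apart with overlapping page content.
seen = set()
poll1 = extract_subjects("<td>ABCD</td><td>EFGH</td>")
poll2 = extract_subjects("<td>EFGH</td><td>IJKL</td>")
print(append_new(poll1, seen))  # ['ABCD', 'EFGH']
print(append_new(poll2, seen))  # ['IJKL']
```

&lt;P&gt;With something like this in the collection script, only genuinely new subjects ever reach Output.txt, so Splunk never sees duplicates in the first place.&lt;/P&gt;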

&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 26 Sep 2017 05:00:50 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348421#M103141</guid>
      <dc:creator>agoktas</dc:creator>
      <dc:date>2017-09-26T05:00:50Z</dc:date>
    </item>
    <item>
      <title>Re: How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348422#M103142</link>
      <description>&lt;P&gt;I'm not sure I understand your use case.  For example, I'm not sure what the issue with duplicates is, because you can &lt;CODE&gt;dedup&lt;/CODE&gt; before, during or after ingestion.  For example, you could start by ingesting into a temporary index, then use &lt;CODE&gt;collect&lt;/CODE&gt; to copy the nondups to a permanent summary index.  Alternately, you could append the output to a file, and run a script periodically to clean the file up and copy it over for ingestion.&lt;/P&gt;

&lt;P&gt;It sounds like your major issue is that the flow of events through the webpage is faster than you are able to scrape it.  I would probably have two, three, or four separate systems pulling the data on a rotating schedule every 15-30 seconds, and then worry about cleaning up the dups on the back end.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Sep 2017 21:05:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348422#M103142</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2017-09-26T21:05:05Z</dc:date>
    </item>
  </channel>
</rss>

