How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates?

agoktas
Communicator

Hello!

Here is what I'm trying to do:

Index a particular section of a web page. This particular section is a forum that is updated constantly, and there is only 1 main column that I'm interested in, which is titled "Subject".

How do I accomplish this without running into duplicate entries? Duplicates are what I get when I do the following.

Currently I run the following using PowerShell:
$wc = New-Object System.Net.WebClient
$wc.DownloadString("https://website.com/forum123/") > C:\PS_Output\Output.txt

Then I index Output.txt and use a regex with a named capture group in Splunk to find occurrences of a particular string (e.g., 4 consecutive capital letters).

But each time Output.txt is overwritten (for example, when I run $wc.DownloadString twice, seconds apart), I get a lot of duplicates.

I believe I have 2 problems:
1) I need to clean up Output.txt so that it contains only the relevant events (there is no need for all the surrounding garbage HTML source). Perhaps I need to apply some regex to the result of the $wc.DownloadString call?

2) The tricky part is how quickly the webpage's table is flushed out with new posts. If I run this every minute, but all 50 posts are replaced by 50 new posts within 30 seconds, I lose about half of the content that I need.
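
For problem 1, one option is to parse the HTML and keep only the "Subject" cells before anything is written to disk. The sketch below is in Python rather than PowerShell, and it assumes that each subject sits in a <td class="subject"> cell. That class name is hypothetical (the real forum markup is not shown here), so the tag and class check would need to be adapted to the actual page.

```python
from html.parser import HTMLParser


class SubjectParser(HTMLParser):
    """Collect the text of every <td class="subject"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.subjects = []
        self._buffer = None  # text chunks collected while inside a subject cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "subject" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # Collapse runs of whitespace so each subject is one clean line.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.subjects.append(text)
            self._buffer = None


def extract_subjects(html):
    """Return the subject strings found in one downloaded page."""
    parser = SubjectParser()
    parser.feed(html)
    parser.close()
    return parser.subjects
```

Writing one subject per line means each line can become one event, with none of the surrounding HTML reaching the index.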

Anyone out there ever tried grabbing content from an external site (not having admin access to the server of course) and keeping historical data?

Thanks!

DalJeanis
Legend

I'm not sure I understand your use case. In particular, I'm not sure why duplicates are a problem, because you can dedup before, during, or after ingestion. For example, you could start by ingesting into a temporary index, then use the collect command to copy the non-duplicates to a permanent summary index. Alternatively, you could append the output to a file, and run a script periodically to clean the file up and copy it over for ingestion.
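
As a rough illustration of the append-to-a-file approach, here is a minimal Python sketch that appends only subjects not already present in the output file. The function name append_new_subjects and the one-subject-per-line file layout are assumptions made for this example, not something from the original post.

```python
import os


def append_new_subjects(subjects, path):
    """Append subjects not already present in `path`; return the ones added."""
    seen = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            seen = {line.rstrip("\n") for line in f}

    added = []
    for subject in subjects:
        if subject not in seen:
            seen.add(subject)
            added.append(subject)

    if added:
        # Append rather than overwrite, so earlier lines are never rewritten.
        with open(path, "a", encoding="utf-8") as f:
            for subject in added:
                f.write(subject + "\n")
    return added
```

One caveat: this keys on the subject text alone, so two distinct posts with identical subjects would be collapsed into one. If the page exposes a post ID or timestamp, keying on that instead would likely be safer.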

It sounds like your major issue is that events flow through the webpage faster than you are able to scrape it. I would probably have two, three, or four separate systems pulling the data in rotation, every 15 to 30 seconds, and then worry about cleaning up the dups on the back end.
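
A single poller running at a short interval can be sketched along the same lines. The Python example below is only an illustration under stated assumptions: fetch is a hypothetical caller-supplied function that returns the subjects currently on the page, and the in-memory seen set stands in for whatever back-end dedup is eventually used.

```python
import time


def poll_new_items(fetch, interval_seconds, rounds, sleep=time.sleep):
    """Call fetch() `rounds` times, yielding each item the first time it is seen."""
    seen = set()
    for i in range(rounds):
        for item in fetch():
            if item not in seen:
                seen.add(item)
                yield item
        # No need to wait after the final round.
        if i < rounds - 1:
            sleep(interval_seconds)
```

For example, poll_new_items(fetch, 15, 240) would scrape every 15 seconds for roughly an hour. Consecutive snapshots overlap heavily at that rate, which is what makes it unlikely that a post is missed, and the seen set drops the repeats.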
