<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348421#M103141</link>
    <description>&lt;P&gt;Hello!  &lt;/P&gt;

&lt;P&gt;Here is what I'm trying to do:&lt;BR /&gt;&lt;BR /&gt;
Index a particular section of a web page.  This particular section is a forum that is updated constantly, and there is only 1 main column that I'm interested in, which is titled "Subject".  &lt;/P&gt;

&lt;P&gt;How do I accomplish this without indexing duplicate entries? Duplicates are what I get when I do the following.  &lt;/P&gt;

&lt;P&gt;Currently I run the following using PowerShell: &lt;BR /&gt;
$wc.downloadstring("&lt;A href="https://website.com/forum123/%22"&gt;https://website.com/forum123/"&lt;/A&gt;) &amp;gt;C:\PS_Output\Output.txt&lt;/P&gt;

&lt;P&gt;Then I index Output.txt and use Splunk with a regex named capture group to find occurrences of a particular string (e.g., 4 consecutive capital letters).&lt;BR /&gt;&lt;BR /&gt;
But each time Output.txt is overwritten (when I run $wc.downloadstring twice, seconds apart), I get a lot of duplicates.  &lt;/P&gt;

&lt;P&gt;I believe I have 2 problems:&lt;BR /&gt;
1) I need to clean up Output.txt so it contains only relevant events (no need for all the surrounding garbage HTML source).  Perhaps I need to apply some regex to the output of the $wc.downloadstring method?&lt;BR /&gt;&lt;BR /&gt;
2) The tricky part is how quickly the webpage's table is flushed out by new posts.  If I run this every minute, but all 50 posts are replaced by 50 new posts within 30 seconds, I lose about half the content that I need.  &lt;/P&gt;

&lt;P&gt;Anyone out there ever tried grabbing content from an external site (not having admin access to the server of course) and keeping historical data?  &lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Tue, 26 Sep 2017 05:00:50 GMT</pubDate>
    <dc:creator>agoktas</dc:creator>
    <dc:date>2017-09-26T05:00:50Z</dc:date>
    <item>
      <title>How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348421#M103141</link>
      <description>&lt;P&gt;Hello!  &lt;/P&gt;

&lt;P&gt;Here is what I'm trying to do:&lt;BR /&gt;&lt;BR /&gt;
Index a particular section of a web page.  This particular section is a forum that is updated constantly, and there is only 1 main column that I'm interested in, which is titled "Subject".  &lt;/P&gt;

&lt;P&gt;How do I accomplish this without indexing duplicate entries? Duplicates are what I get when I do the following.  &lt;/P&gt;

&lt;P&gt;Currently I run the following using PowerShell: &lt;BR /&gt;
$wc.downloadstring("&lt;A href="https://website.com/forum123/%22"&gt;https://website.com/forum123/"&lt;/A&gt;) &amp;gt;C:\PS_Output\Output.txt&lt;/P&gt;

&lt;P&gt;Then I index Output.txt and use Splunk with a regex named capture group to find occurrences of a particular string (e.g., 4 consecutive capital letters).&lt;BR /&gt;&lt;BR /&gt;
But each time Output.txt is overwritten (when I run $wc.downloadstring twice, seconds apart), I get a lot of duplicates.  &lt;/P&gt;

&lt;P&gt;I believe I have 2 problems:&lt;BR /&gt;
1) I need to clean up Output.txt so it contains only relevant events (no need for all the surrounding garbage HTML source).  Perhaps I need to apply some regex to the output of the $wc.downloadstring method?&lt;BR /&gt;&lt;BR /&gt;
2) The tricky part is how quickly the webpage's table is flushed out by new posts.  If I run this every minute, but all 50 posts are replaced by 50 new posts within 30 seconds, I lose about half the content that I need.  &lt;/P&gt;

&lt;P&gt;Anyone out there ever tried grabbing content from an external site (not having admin access to the server of course) and keeping historical data?  &lt;/P&gt;
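&lt;P&gt;To make the idea concrete, here is a minimal sketch (Python, purely for illustration; the sample HTML, the helper names, and the "4 consecutive capital letters" pattern are assumptions based on the description above) of extracting subjects and keeping only ones not already seen in an earlier poll:&lt;/P&gt;

```python
import re

# Pattern from the post: 4 consecutive capital letters.
SUBJECT_RE = re.compile(r"\b[A-Z]{4}\b")

def extract_subjects(html_text):
    """Pull candidate subject strings out of raw page source."""
    return SUBJECT_RE.findall(html_text)

def append_new(subjects, seen):
    """Return only subjects not seen before, updating the seen set.

    Persisting `seen` to disk between runs (one subject per line)
    is what prevents re-indexing duplicates across polls.
    """
    fresh = []
    for s in subjects:
        if s not in seen:
            seen.add(s)
            fresh.append(s)
    return fresh

# Example: two polls seconds apart with overlapping page content.
seen = set()
poll1 = extract_subjects("<td>ABCD</td><td>EFGH</td>")
poll2 = extract_subjects("<td>EFGH</td><td>IJKL</td>")
print(append_new(poll1, seen))  # ['ABCD', 'EFGH']
print(append_new(poll2, seen))  # ['IJKL']
```

&lt;P&gt;With something like this in the collection script, only genuinely new subjects ever reach Output.txt, so Splunk never sees duplicates in the first place.&lt;/P&gt;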

&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 26 Sep 2017 05:00:50 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348421#M103141</guid>
      <dc:creator>agoktas</dc:creator>
      <dc:date>2017-09-26T05:00:50Z</dc:date>
    </item>
    <item>
      <title>Re: How can I index the subjects of posts on a forum that is updated constantly without indexing duplicates?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348422#M103142</link>
      <description>&lt;P&gt;I'm not sure I understand your use case.  For example, I'm not sure what the issue with duplicates is, because you can &lt;CODE&gt;dedup&lt;/CODE&gt; before, during or after ingestion.  For example, you could start by ingesting into a temporary index, then use &lt;CODE&gt;collect&lt;/CODE&gt; to copy the nondups to a permanent summary index.  Alternately, you could append the output to a file, and run a script periodically to clean the file up and copy it over for ingestion.&lt;/P&gt;

&lt;P&gt;It sounds like your major issue is that the flow of events through the webpage is faster than you are able to scrape it.  I would probably have two, three, or four separate systems pulling the data on a rotating schedule every 15-30 seconds, and then worry about cleaning up the dups on the back end.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Sep 2017 21:05:05 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-can-I-index-the-subjects-of-posts-on-a-forum-that-is-updated/m-p/348422#M103142</guid>
      <dc:creator>DalJeanis</dc:creator>
      <dc:date>2017-09-26T21:05:05Z</dc:date>
    </item>
  </channel>
</rss>

