<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Feeding data from script into splunk while avoiding data duplicates in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/Feeding-data-from-script-into-splunk-while-avoiding-data/m-p/130849#M26903</link>
    <description>&lt;P&gt;You would be better off implementing &lt;A href="http://docs.splunk.com/Documentation/Splunk/6.0.1/AdvancedDev/ModInputsIntro" target="_blank"&gt;a modular input&lt;/A&gt; that keeps track of its position in the data source and, using the Splunk REST API, persists this positional data (your timestamp) back to inputs.conf, so that after a Splunk restart it remembers where it left off and won't index duplicate data.&lt;/P&gt;

&lt;P&gt;Have a look at my &lt;A href="http://apps.splunk.com/app/1546/" target="_blank"&gt;REST API Modular input&lt;/A&gt; as an example of how to do this.&lt;/P&gt;

&lt;P&gt;It comes with an example Twitter handler that keeps track of the tweet stream's "since_id" so that it doesn't index duplicate data. This "since_id" is persisted back to Splunk via the REST API, which is the same paradigm you are trying to achieve with a positional timestamp.&lt;/P&gt;</description>
    <pubDate>Mon, 28 Sep 2020 15:46:48 GMT</pubDate>
    <dc:creator>Damien_Dallimor</dc:creator>
    <dc:date>2020-09-28T15:46:48Z</dc:date>
    <item>
      <title>Feeding data from script into splunk while avoiding data duplicates</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Feeding-data-from-script-into-splunk-while-avoiding-data/m-p/130848#M26902</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;

&lt;P&gt;Up front: first-time Splunk user here, so please be patient with me &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;I have a scenario I would like to describe, and I would appreciate comments on how it can be achieved with Splunk.&lt;/P&gt;

&lt;P&gt;Scenario:&lt;BR /&gt;
- I have a Perl script that pulls data from a target API (to stdout or a file) on a daily basis. The script must be run with a parameter and only retrieves data with a timestamp newer than the supplied value&lt;BR /&gt;
- I now want to forward this CSV-formatted output into Splunk&lt;BR /&gt;
- data integrity should be handled by relying on the information stored in Splunk (the highest timestamp value stored)&lt;/P&gt;

&lt;P&gt;In general, I'm not sure how to ensure that Splunk does not create duplicates for this data.&lt;/P&gt;

&lt;P&gt;Current approach/idea: &lt;BR /&gt;
1) a daily routine in Splunk is triggered (I assume that would be the job of the forwarder)&lt;BR /&gt;
2) this input routine checks for the highest timestamp value currently stored in the Splunk index, passes it to the Perl script, and executes it&lt;BR /&gt;
3) Splunk takes the output from the Perl script (stdout or file) and feeds it into the index&lt;/P&gt;

&lt;P&gt;Does the approach sound reasonable? I'm uncertain how to achieve the logic described in 2). I was thinking about launching the script as a script:// input, but I'm unsure how to pass it the timestamp value stored in the index. As an alternative, I was thinking about dumping all the data from the API each time and then somehow filtering out data that was already indexed. How can I implement logic that checks for duplicate data?&lt;/P&gt;

&lt;P&gt;Any recommendations or pointers in the right direction would be appreciated.&lt;/P&gt;

&lt;P&gt;Cheers! &lt;/P&gt;</description>
      <pubDate>Thu, 30 Jan 2014 10:49:12 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Feeding-data-from-script-into-splunk-while-avoiding-data/m-p/130848#M26902</guid>
      <dc:creator>skrskr</dc:creator>
      <dc:date>2014-01-30T10:49:12Z</dc:date>
    </item>
    <item>
      <title>Re: Feeding data from script into splunk while avoiding data duplicates</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/Feeding-data-from-script-into-splunk-while-avoiding-data/m-p/130849#M26903</link>
      <description>&lt;P&gt;You would be better off implementing &lt;A href="http://docs.splunk.com/Documentation/Splunk/6.0.1/AdvancedDev/ModInputsIntro" target="_blank"&gt;a modular input&lt;/A&gt; that keeps track of its position in the data source and, using the Splunk REST API, persists this positional data (your timestamp) back to inputs.conf, so that after a Splunk restart it remembers where it left off and won't index duplicate data.&lt;/P&gt;
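
&lt;P&gt;To make the "remember where you left off" idea concrete, here is a minimal sketch in Python. It is not the modular input itself and it does not use the Splunk REST API or inputs.conf: as a stand-in, it persists the position in a local JSON checkpoint file. The function names, the checkpoint file layout, and the "timestamp" CSV column are all hypothetical, and it assumes the script's output has a CSV header row and ISO 8601 UTC timestamps in one fixed format, so that plain string comparison orders them correctly.&lt;/P&gt;

```python
import csv
import io
import json
import os
import subprocess
import sys
import tempfile


def load_checkpoint(path, default="1970-01-01T00:00:00Z"):
    """Return the last persisted timestamp, or a default on the first run."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)["last_timestamp"]
    except FileNotFoundError:
        return default


def save_checkpoint(path, timestamp):
    """Write to a temp file and rename it, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"last_timestamp": timestamp}, fh)
    os.replace(tmp_path, path)


def newest_timestamp(csv_text, previous, column="timestamp"):
    """Return the highest timestamp in the CSV text, or previous if it has no rows."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return max([row[column] for row in rows] + [previous])


def run_once(command, checkpoint_path):
    """Run the data script with the last position, emit its output, then advance."""
    last = load_checkpoint(checkpoint_path)
    # The stored timestamp is appended as the script's final argument (assumption).
    result = subprocess.run(command + [last], capture_output=True, text=True, check=True)
    sys.stdout.write(result.stdout)
    # The checkpoint is saved only after the output is written, so a failure
    # part-way through repeats the last batch instead of skipping it.
    save_checkpoint(checkpoint_path, newest_timestamp(result.stdout, last))
```

&lt;P&gt;A call such as run_once(["perl", "fetch_api.pl"], "checkpoint.json"), where both names are placeholders, would pass the stored timestamp to the script and then record the newest timestamp it returned. A modular input follows the same load, run, save cycle, but keeps the position in Splunk instead of a local file.&lt;/P&gt;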

&lt;P&gt;Have a look at my &lt;A href="http://apps.splunk.com/app/1546/" target="_blank"&gt;REST API Modular input&lt;/A&gt; as an example of how to do this.&lt;/P&gt;

&lt;P&gt;It comes with an example Twitter handler that keeps track of the tweet stream's "since_id" so that it doesn't index duplicate data. This "since_id" is persisted back to Splunk via the REST API, which is the same paradigm you are trying to achieve with a positional timestamp.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Sep 2020 15:46:48 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/Feeding-data-from-script-into-splunk-while-avoiding-data/m-p/130849#M26903</guid>
      <dc:creator>Damien_Dallimor</dc:creator>
      <dc:date>2020-09-28T15:46:48Z</dc:date>
    </item>
  </channel>
</rss>

