<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search? in Knowledge Management</title>
    <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203516#M1803</link>
    <description>&lt;P&gt;Where does the &lt;CODE&gt;fill_summary_index.py&lt;/CODE&gt; Python script come from?&lt;/P&gt;</description>
    <pubDate>Sun, 12 Jun 2016 01:00:28 GMT</pubDate>
    <dc:creator>ddrillic</dc:creator>
    <dc:date>2016-06-12T01:00:28Z</dc:date>
    <item>
      <title>Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search?</title>
      <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203515#M1802</link>
      <description>&lt;P&gt;I have created a bash script to automate backfilling missing data without overloading the server. However, when I increase the number of threads and the time span of each search, some backfills are skipped due to an error. Given the settings below (within the script), is there a best practice for how much data can be backfilled per thread/search?&lt;/P&gt;</description>
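&lt;P&gt;For context, here is a standalone sketch of the chunking arithmetic the script uses; the variable names and epoch values mirror the settings below:&lt;/P&gt;

```shell
# Worked example of the chunking arithmetic used by the script below
# (values match the settings in the script).
seconds=300                      # how often the search runs
maxq=10                          # max backfill queries per invocation
et=1463247600                    # earliest epoch time
lt=1463605200                    # latest epoch time
queries=$((seconds*maxq))        # seconds of data per invocation: 3000
runs=$(((lt-et)/queries))        # number of full invocations: 119
remaintime=$(((lt-et)%queries))  # leftover seconds for the final run: 600
echo "$queries $runs $remaintime"
```

&lt;P&gt;So with these settings each thread/search covers 3000 seconds (50 minutes) of data, i.e. 10 scheduled runs per invocation.&lt;/P&gt;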

&lt;PRE&gt;&lt;CODE&gt;#!/bin/bash
#This script backfills summary data for searches that failed (due to license issues or other problems) even though the raw data was already ingested into Splunk.

#Timestamp used for logs
_now=$(date +"%Y-%b-%d_%Hh_%Mm_%Ss")

############################
#Required Information Needed
############################
#Splunk path
splunk_dir=/opt/splunk/bin
#Log Path
log_dir=/opt/scripts/logs
#Splunk Username (not linux username) to run backfill script under
username=Powers64
#Name of Application the search resides in
app="SmartyPants"
#This needs to be typed again manually below; if the search has - in its name, passing it through this variable makes the script fail.
search_name="'Summary - SmartyPants - 5 minutes'"
#Search Earliest EPOCH time
et=1463247600
#Search Latest EPOCH time
lt=1463605200
#How often does the search run? [In Seconds]
seconds=300
#Max Backfill Queries in every Search
maxq=10
#When this option is set to true, the script does not run saved searches for a scheduled timespan if data already exists in the summary index for that timespan.
dedup="true"
#Specifies that the summary indexes are not on the search head but are on the indexers instead. To be used in conjunction with -dedup
nolocal="true"
#Maximum number of concurrent searches to run
concurrent=2
####
#For more info on managing backfill visit &lt;A href="http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Managesummaryindexgapsandoverlaps" target="_blank"&gt;http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Managesummaryindexgapsandoverlaps&lt;/A&gt;
####
############################
#End of required Information
############################

echo "Please enter $username's Splunk password (Note: input is hidden; just press [Enter] after typing): "
read -s password

cd "$splunk_dir"
#Seconds of data covered by each backfill invocation (search interval * max queries)
queries=$(($seconds*$maxq))
#Length of the final run when the window is not evenly divisible by the invocation size
remaintime=$((($lt-$et)%$queries))

#Runs a recurring backfill search based on parameters above
for ((current=$et; current&amp;lt;$lt; current=current+$queries))
do

#Calculates remaining seconds to run. Identifies when to run last backfill search
lastrun=$(($lt-$current))

        if [ $lastrun -ne $remaintime ]
            then
                qrun=$(($current+$queries))
                completed=$(((($current-$et)*100)/($lt-$et)))
                echo "Running backfill from" $current "to" $qrun
                ./splunk cmd python fill_summary_index.py -app $app -name 'Summary - SmartyPants - 5 minutes' -et $current -lt $qrun -dedup $dedup -nolocal $nolocal -showprogress true -j $concurrent -auth $username:$password 2&amp;gt;&amp;amp;1 | tee $log_dir/$_now.output
                echo $_now "-" $app $search_name $current $qrun &amp;gt;&amp;gt; $log_dir/backfill_history.log
                echo $completed"% Complete - Pausing script for 15 seconds to avoid overloading the server"
                sleep 15
            else
                echo "Running last backfill from" $current "to" $lt
                ./splunk cmd python fill_summary_index.py -app $app -name 'Summary - SmartyPants - 5 minutes' -et $current -lt $lt -dedup $dedup -nolocal $nolocal -showprogress true -j $concurrent -auth $username:$password 2&amp;gt;&amp;amp;1 | tee $log_dir/$_now.output
                echo $_now "-" $app $search_name $current $lt &amp;gt;&amp;gt; $log_dir/backfill_history.log
                echo "100% Complete - Backfill completed! Yippee"
        fi

done
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 10 Jun 2016 18:26:27 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203515#M1802</guid>
      <dc:creator>Powers64</dc:creator>
      <dc:date>2016-06-10T18:26:27Z</dc:date>
    </item>
    <item>
      <title>Re: Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search?</title>
      <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203516#M1803</link>
      <description>&lt;P&gt;Where does the &lt;CODE&gt;fill_summary_index.py&lt;/CODE&gt; Python script come from?&lt;/P&gt;</description>
      <pubDate>Sun, 12 Jun 2016 01:00:28 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203516#M1803</guid>
      <dc:creator>ddrillic</dc:creator>
      <dc:date>2016-06-12T01:00:28Z</dc:date>
    </item>
    <item>
      <title>Re: Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search?</title>
      <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203517#M1804</link>
      <description>&lt;P&gt;First, the script looks very good; it gives a lot of options to pick from. I do a lot of backfilling myself, based on search names, app, etc.&lt;/P&gt;</description>

&lt;P&gt;I have modified fill_summary_index.py to best suit a search head clustering environment, and I pick a time when scheduled activity is at a minimum. The -j option cannot usefully exceed the number of cores on the search head: I can put 1000 in there, but if my machine has only 16 cores, 16 searches is all it can run concurrently at any given time.&lt;/P&gt;
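&lt;P&gt;A minimal sketch of that core-count cap, assuming GNU coreutils' nproc is available (the requested value is illustrative):&lt;/P&gt;

```shell
# Clamp the requested -j concurrency to the machine's core count,
# since values above it cannot actually run concurrently anyway.
requested=1000        # illustrative: the -j value you might ask for
cores=$(nproc)        # logical core count (GNU coreutils)
if [ "$requested" -gt "$cores" ]; then
  concurrent=$cores
else
  concurrent=$requested
fi
echo "Using -j $concurrent"
```

&lt;P&gt;Note that nproc reports logical cores, so on a hyper-threaded box this may still overcommit physical cores.&lt;/P&gt;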

&lt;P&gt;I typically rely heavily on -dedup true, as it does not harm performance (it simply does not execute if the job has already run).&lt;/P&gt;

&lt;P&gt;That being said, there is no backfilling best practice per se. However, I pick a list of searches that have a lot in common (for example, schedule time ranges). If I have 10 searches that run 15 minutes apart, I will pick an -et/-lt that covers the search window for all 10 and use -dedup true to skip the ones that already ran.&lt;/P&gt;
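&lt;P&gt;That windowing trick can be sketched as follows; the epoch values are illustrative, and -dedup true then skips the timespans that already ran:&lt;/P&gt;

```shell
# Pick one -et/-lt pair that covers several related searches:
# the smallest earliest time and the largest latest time.
ets="1463248500 1463247600 1463249400"   # illustrative earliest times
lts="1463605200 1463607000 1463606100"   # illustrative latest times
et=$(echo $ets | tr ' ' '\n' | sort -n | head -n 1)
lt=$(echo $lts | tr ' ' '\n' | sort -n | tail -n 1)
echo "-et $et -lt $lt"
```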

&lt;P&gt;&lt;CODE&gt;./splunk cmd python fill_summary_index.py -app search -name "All the crazy summaries" -dedup true -showprogress true -j 16 -owner admin -auth admin:admin&lt;/CODE&gt; (16 concurrent searches is all my search head can handle).&lt;/P&gt;

&lt;P&gt;Since the script you wrote covers everything, the only way to improve performance is to run a few of the summary backfills from a different SHC member (if you have search head clustering), or even pooling. The reason I had to edit fill_summary_index.py is that I do not store any summary data on search heads; I forward everything from the SHC to the indexers.&lt;/P&gt;

&lt;P&gt;Hope this helps!&lt;/P&gt;

&lt;P&gt;Thanks,&lt;BR /&gt;
Raghav&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 09:56:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203517#M1804</guid>
      <dc:creator>Raghav2384</dc:creator>
      <dc:date>2020-09-29T09:56:31Z</dc:date>
    </item>
    <item>
      <title>Re: Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search?</title>
      <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203518#M1805</link>
      <description>&lt;P&gt;Raghav2384, thanks for the reply. I noticed that when I try to backfill a search job that runs every 5 minutes with over 100k events per search, it errors out if I use a wide backfill time window. On the other hand, when I backfill a search job that runs every hour with 9k events per search, even a very large backfill window causes no issue.&lt;/P&gt;</description>

&lt;P&gt;I figured there is a limitation on how many events per search job can be backfilled.&lt;/P&gt;

&lt;P&gt;As for your change to fill_summary_index.py, there is a -nolocal argument that "Specifies that the summary indexes are not on the search head but are on the indexers instead. To be used in conjunction with -dedup".&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 09:56:39 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203518#M1805</guid>
      <dc:creator>Powers64</dc:creator>
      <dc:date>2020-09-29T09:56:39Z</dc:date>
    </item>
    <item>
      <title>Re: Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search?</title>
      <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203519#M1806</link>
      <description>&lt;P&gt;It is a Splunk script to backfill data generated by running search jobs. &lt;BR /&gt;
&lt;A href="http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Managesummaryindexgapsandoverlaps"&gt;http://docs.splunk.com/Documentation/Splunk/6.4.1/Knowledge/Managesummaryindexgapsandoverlaps&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jun 2016 12:53:58 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203519#M1806</guid>
      <dc:creator>Powers64</dc:creator>
      <dc:date>2016-06-13T12:53:58Z</dc:date>
    </item>
    <item>
      <title>Re: Backfill automated bash script timeout: Is there a best practice on how much data can be backfilled per thread/search?</title>
      <link>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203520#M1807</link>
      <description>&lt;P&gt;This looks pretty good, so if you are still looking for performance/safety improvements, I suggest that you convert from summary indexing (SI) to accelerated data models + tstats.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jun 2016 20:10:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Knowledge-Management/Backfill-automated-bash-script-timeout-Is-there-a-best-practice/m-p/203520#M1807</guid>
      <dc:creator>woodcock</dc:creator>
      <dc:date>2016-06-13T20:10:56Z</dc:date>
    </item>
  </channel>
</rss>

