Ensuring that no duplicate events will be indexed

dondky — Wed, 06 Mar 2013 21:34:19 GMT

Hi guys, I'm stumped on a task I've been working on for the last few weeks. We are extracting about 1.5 million lines of events from Nessus vulnerability info into a file and then having Splunk index it via a monitor.

From there we are going to add some logic to our query script so that we only pull events from Nessus that are newer than the last query, and have Splunk index only those new events.

Something we have been discussing internally is what happens if something blows up in the script and we end up indexing data twice. How do we guarantee that, if something goes wrong with our query script, we don't re-index data?

One idea is to tag each event with an MD5 sum prior to indexing, then call the Splunk API to compare what is in the index against what we have locally before creating a new file to index. Is this even possible, and how have others in the same boat approached this?

Thanks
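The MD5-tagging idea in the question can be sketched roughly as follows. This is only an illustration under assumptions: the field name event_md5, the tab-separated layout, and the sample events are all hypothetical, and the step that compares hashes against what Splunk has already indexed is not shown.

```python
import hashlib


def tag_event(line):
    """Return the event line with an MD5 of its content appended as a field.

    The field name "event_md5" is a made-up example, not a Splunk convention.
    """
    text = line.rstrip("\n")
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return text + "\tevent_md5=" + digest


# Hypothetical sample events; real Nessus output will look different.
events = [
    "2013-03-06 21:00:01\thost=10.0.0.5\tplugin=12345\tseverity=high",
    "2013-03-06 21:00:02\thost=10.0.0.6\tplugin=67890\tseverity=low",
]

tagged = [tag_event(e) for e in events]
for t in tagged:
    print(t)
```

Because the hash depends only on the event text, the same event always gets the same tag, so a repeated event can be recognized by its hash alone.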

Re: Ensuring that no duplicate events will be indexed

bmacias84 — Mon, 28 Sep 2020 13:28:59 GMT

Splunk runs an MD5 check on the file, but it does not check the events or lines in the file for duplicates. If something is added to your file, Splunk assumes you wanted it. You will have to build a more robust script. I've had similar issues with data. My approach was to have the script run the query and create a new file if one does not exist. If the file does exist, have the script run the query and store each event/line in a list or array. Then read each existing event/line from the file into a hash or dictionary; if an element of the array already exists there, remove or pop it. The final array should contain only new events, so just append it to the existing file. Below is some basic logic.

Logic:

- Query data and store it in an array.

- Does the output file exist (yes/no)?

    No: Create the file, output the data, and exit.

    Yes: Read the file and store its lines in a hash.

- Iterate through the data in the array. Does the element exist in the hash (yes/no)?

    Yes: Remove the element from the array.

    No: Move on to the next element.

- Append the remaining data to the existing output file.

- Exit.

The sample below was one of my first Python programs, so it's not the prettiest code.
UPDATE SAMPLE:



import sys

# ############################

# User defined variables

# ############################

isdebug = 0 

loc = ''

locations = ['PERF','DEV'] #Option 'DEV' 'PROD' 'PERF'

username = ""  #F5 user

password = ""  #F5 user password

#datadrop = 'E:/Data/perfdata/'+loc+'/f5/'

# Additional graphobjs and descriptions can be found in the readme file.  Use the graph_name column for the graphobjs list

graphobjs = [ 

            'CPU',

            'memory',

            'throughput',

            'detailthroughput1',

            'detailthroughput2',

            'detailactcons1',

            'detailnewcons4',

            'httprequests',

            'SSLTPSGraph',

            'detailnewcons3',

            'detailactcons3'

            ] 

# ############################

# Custom functions

# ############################

# ############################

# Debug

# ############################

def debug (strvalue, bolvalue):

    if bolvalue:

        print(strvalue)

        print('')

# ############################

# Pull f5 performance csv stats

# ############################

def f5csvstats(objname,fileloc,location):

    # ############################

    # Additional modules

    # ############################

    import get_interface as F5, os, re, time, binascii

    from time import localtime, strftime

    # ############################

    # Date and time stamps

    # ############################

    localdate = strftime("%Y-%m-%d", localtime())

    timestamp = strftime("%Y-%m-%d\t%H:%M\t",localtime())

    # ############################

    # File names

    # ############################ 

    f5perfdata = fileloc + 'f5' +objname + '_' + localdate +'.tsv'

    f5error =  fileloc + 'error'

    outputdef = fileloc + 'readme.txt'

    # ############################

    # Creating F5 object/interface

    # ############################

    interface = F5.f5_interface(location, username, password)

    if not interface:

        debug(interface, isdebug)

        errfile = open(f5error, 'w')

        errfile.write('statheader and statvalue did not match, output not written to perfstats file.\n')

        errfile.close()

        sys.exit(51)

    # ############################

    # Pulling binary data for get_performance_graph_csv_statistics api object

    # ############################

    csv = interface.System.Statistics.get_performance_graph_csv_statistics(objects = [{'object_name': objname, 'start_time': 0, 'end_time': 0, 'interval': 0, 'maximum_rows': 0}])

    stat = binascii.a2b_base64(csv[0].statistic_data)

    statline = []

    statline = stat.split('\n')

    header = 'date\ttime\t'

    for head in statline[0].split(',')[1:]:

        header += head.strip('\"') + '\t'

    # ############################

    # Creating tsv file, sorting, and adding unique data

    # ############################

    d = {}

    ftout = ''

    if(os.path.exists(f5perfdata)):

        statfile = open(f5perfdata, 'r')

        timestamp = float(statfile.readlines()[-1].split('\t')[-1])

        statfile.close()

        statline.pop()

        for x in statline[1:-1]:

            element =  x.split(',')

            if len(element) > 1 and float(element[0]) > timestamp and not '             nan' in element:

                ftout += strftime("\n%Y-%m-%d\t%H:%M", localtime(float(element[0]))) + '\t'

                for ele in element[1:]:

                    ftout += ele+ '\t'

                ftout += element[0].strip('\n')

        statfile = open(f5perfdata, 'a')

        statfile.write(ftout)

        statfile.close()

    else:

        ftout = header + 'epoch'

        for x in statline[1:]:

            element =  x.split(',')

            if len(element) > 1 and not 'timestamp' in element[0]:

                if  strftime("%Y%m%d",localtime()) == strftime("%Y%m%d", localtime(float(element[0]))):

                    ftout += strftime("\n%Y-%m-%d\t%H:%M", localtime(float(element[0]))) + '\t'

                    for ele in element[1:]:

                        ftout += ele+ '\t'

                    ftout += element[0] 

        statfile = open(f5perfdata, 'w')

        statfile.write(ftout)

        statfile.close()

    if not (os.path.exists(outputdef)):

        readtxt = "Below are additional objects that can be used to generate data.\n\ngraph_name                              graph_title                             graph_description\n----------                              -----------                             -----------------\nmemory                                  Memory Used                             Memory Used\nactivecons                              Active Connections                      Active Connections\nnewcons                                 New Connections                         New Connections\nthroughput                              Throughput                              Throughput\nhttprequests                            HTTP Requests                           HTTP Requests\nramcache                                RAM Cache Utilization                   RAM Cache Utilization\ndetailactcons1                          Active Connections                      Active Connections\ndetailactcons2                          Active PVA Connections                  Active PVA Connections\ndetailactcons3                          Active SSL Connections                  Active SSL Connections\ndetailnewcons1                          Total New Connections                   Total New Connections\ndetailnewcons2                          New PVA Connections                     New PVA Connections\ndetailnewcons3                          New ClientSSL Profile Connections       New ClientSSL Profile Connections\ndetailnewcons4                          New Accepts/Connects                    New Accepts/Connects\ndetailthroughput1                       Client-side Throughput                  Client-side Throughput\ndetailthroughput2                       Server-side Throughput                  Server-side Throughput\ndetailthroughput3                       HTTP Compression Rate                   HTTP Compression Rate\nSSLTPSGraph                             SSL Transactions/Sec                    SSL Transactions/Sec\nGTMGraph                                GTM Performance                         GTM Requests and Resolutions\nGTMrequests                             GTM Requests                            GTM Requests\nGTMresolutions                          GTM Resolutions                         GTM Resolutions\nGTMpersisted                            GTM Resolutions Persisted               GTM Resolutions Persisted\nGTMret2dns                              GTM Resolutions Returned to DNS         GTM Resolutions Returned to DNS\ndetailcpu0                              CPU Utilization                         CPU Usage\ndetailcpu1                              CPU Utilization                         CPU Usage\nCPU                                     CPU Utilization                         CPU Usage\ndetailtmm0                              TMM Utilization                         TMM Usage\nTMM                                     TMM Utilization                         TMM CPU Utilization"

        readme = open(outputdef, 'w')

        readme.write(readtxt)

        readme.close()

# ############################

# Main Body

# ############################

for loc in locations:

    for x in graphobjs:

        debug(x, isdebug)

        f5csvstats(x,'E:/Data/perfdata/'+loc+'/f5/', loc)

sys.exit(0)

Hope this helps or gets you started. If this does help please Vote up and/or accept it.

Re: Ensuring that no duplicate events will be indexed

dondky — Mon, 11 Mar 2013 14:36:08 GMT

Thanks for the reply. Would you happen to have some sample code for this process that you can share? The goal is not to reinvent the wheel but to build on something others have already done.

Re: Ensuring that no duplicate events will be indexed

bmacias84 — Mon, 11 Mar 2013 15:05:42 GMT

@dondky, I don't know how helpful it will be since it is data-specific and uses a custom imported module, but sure. This is for an F5 load balancer, pulling CSV stats.

Re: Ensuring that no duplicate events will be indexed

bmacias84 — Mon, 11 Mar 2013 18:32:07 GMT

That script only looks for data with time stamps newer than what was previously written to the data file. I chose the time stamp over the hash method because my data file was getting too large to store in memory and the script became very slow. The hash and dictionary portions are really easy to do.
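A minimal sketch of the hash/dictionary variant from the logic outlined earlier in this thread, assuming each event is a single line of text. The function name append_new_events and the demo file are made up for illustration; this is not the F5 script itself.

```python
import os
import tempfile


def append_new_events(events, path):
    """Append only the events that are not already lines in the file at path.

    Returns the list of events that were actually written.
    """
    seen = set()
    if os.path.exists(path):
        # Load every existing line into a set for fast membership checks.
        with open(path, "r") as existing:
            seen = {line.rstrip("\n") for line in existing}

    new_events = []
    for event in events:
        event = event.rstrip("\n")
        if event not in seen:
            seen.add(event)  # also drops repeats within the same batch
            new_events.append(event)

    if new_events:
        # Mode "a" creates the file if it does not exist yet.
        with open(path, "a") as out:
            for event in new_events:
                out.write(event + "\n")
    return new_events


# Demo: the second call simulates a re-run that overlaps the first one.
demo_path = os.path.join(tempfile.mkdtemp(), "events.log")
first = append_new_events(["a", "b"], demo_path)
second = append_new_events(["b", "c"], demo_path)
print(first, second)
```

This keeps the whole file in memory as a set, which is the cost mentioned above; for very large files the time stamp comparison avoids that.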
