Getting Data In

Ensuring that no duplicate events will be indexed

dondky
Path Finder

Hi guys, I'm stumped on a task I've been working on for the last few weeks. We are extracting about 1.5 million lines of events from Nessus vulnerability info into a file and then having Splunk index it via a monitor.

From there we are going to do some logic within our query script to make sure we only pull the latest events from Nessus since the last query and have Splunk index those new events.

Something we have been discussing internally is what happens if the script blows up and we end up indexing data twice. How do we guarantee that, if something goes wrong with our query script, we don't index the same data twice?

There have been thoughts of tagging an MD5 sum onto the events prior to indexing, then using a call to the Splunk API to compare what is in the index with what we have locally before creating a new file to index. Is this even possible, and how have others in the same boat approached it?
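To make that concrete, here is a rough sketch of a local-only variant of the idea: keep a ledger of MD5 sums for events already written out, instead of calling the Splunk API. The file names and the ledger are made up purely for illustration.

import hashlib
import os

LEDGER = 'indexed_hashes.txt'   # made-up ledger of MD5s already written
OUTPUT = 'nessus_events.log'    # made-up file that Splunk monitors

def already_indexed():
    # Load the MD5s of events that were already written out on previous runs.
    if not os.path.exists(LEDGER):
        return set()
    with open(LEDGER) as f:
        return set(line.strip() for line in f)

def write_new_events(events):
    # Append only events whose MD5 is not already in the ledger.
    seen = already_indexed()
    with open(OUTPUT, 'a') as out, open(LEDGER, 'a') as ledger:
        for event in events:
            digest = hashlib.md5(event.encode('utf-8')).hexdigest()
            if digest in seen:
                continue  # duplicate from a previous (or failed) run
            out.write(event + '\n')
            ledger.write(digest + '\n')
            seen.add(digest)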

Thanks


bmacias84
Champion

Splunk computes an MD5 on the file, but it does not check the events or lines in the file for duplicates. If it's added to your file, Splunk assumes you wanted it. You will have to build a more robust script. I've had similar issues with data. My approach was to have the script query and create a new file if one does not exist. If the file does exist, have the script query and store each event/line in a list or array, then read each existing event/line from the file into a hash or dictionary; if an element already exists, remove or pop it. The final array should only contain new events, so just append the output to the existing file. Below is some basic logic, with a rough sketch after the list.

Logic:


  • Query the data and store it in an array.
  • Does the output file exist (yes/no)?
    • No: create the file, output the data, and exit.
    • Yes: read the file and store it in a hash.
  • Iterate through the data in the array. Does the element exist in the hash (yes/no)?
    • Yes: remove the element from the array.
    • No: next element.
  • At this point only new data is in the array (assuming the data in the array is already sorted), so append it to the existing output file.
  • Exit.
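In rough Python, that logic would look something like this (the output file name is just a placeholder, and the query results are passed in as a list of event strings):

import os

OUTFILE = 'nessus_output.log'  # placeholder path

def dedupe_and_append(new_events):
    # new_events: list of event strings returned by the latest query
    if not os.path.exists(OUTFILE):
        # No existing file: create it, write the data, and we're done.
        with open(OUTFILE, 'w') as f:
            f.write('\n'.join(new_events) + '\n')
        return
    # File exists: read each existing event/line into a hash (a set here).
    with open(OUTFILE) as f:
        existing = set(line.rstrip('\n') for line in f)
    # Drop every element that is already in the hash.
    fresh = [event for event in new_events if event not in existing]
    # Only new data is left; append it to the existing output file.
    if fresh:
        with open(OUTFILE, 'a') as f:
            f.write('\n'.join(fresh) + '\n')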

This was one of my first Python programs, so it's not the prettiest code.
UPDATE SAMPLE:


import sys

# ############################
# User defined variables
# ############################
isdebug = 0
loc = ''
locations = ['PERF', 'DEV']  # Options: 'DEV', 'PROD', 'PERF'
username = ""  # F5 user
password = ""  # F5 user password
#datadrop = 'E:/Data/perfdata/'+loc+'/f5/'
# Additional graphobjs and descriptions can be found in the readme file. Use the graph_name column for the graphobjs list.
graphobjs = [
    'CPU',
    'memory',
    'throughput',
    'detailthroughput1',
    'detailthroughput2',
    'detailactcons1',
    'detailnewcons4',
    'httprequests',
    'SSLTPSGraph',
    'detailnewcons3',
    'detailactcons3'
]

# ############################
# Custom functions
# ############################

# ############################
# Debug
# ############################
def debug(strvalue, bolvalue):
    if bolvalue:
        print(strvalue)
        print('')

# ############################
# Pull F5 performance csv stats
# ############################
def f5csvstats(objname, fileloc, location):
    # ############################
    # Additional modules
    # ############################
    import get_interface as F5, os, re, time, binascii
    from time import localtime, strftime
    # ############################
    # Date and time stamps
    # ############################
    localdate = strftime("%Y-%m%d", localtime())
    timestamp = strftime("%Y-%m-%d\t%H:%M\t", localtime())
    # ############################
    # File names
    # ############################
    f5perfdata = fileloc + 'f5' + objname + '_' + localdate + '.tsv'
    f5error = fileloc + 'error'
    outputdef = fileloc + 'readme.txt'
    # ############################
    # Creating F5 object/interface
    # ############################
    interface = F5.f5_interface(location, username, password)
    if not interface:
        debug(interface, isdebug)
        errfile = open(f5error, 'w')
        errfile.write('statheader and statvalue did not match, output not written to perfstats file.\n')
        errfile.close()
        sys.exit(51)
    # ############################
    # Pulling binary data for the get_performance_graph_csv_statistics api object
    # ############################
    csv = interface.System.Statistics.get_performance_graph_csv_statistics(objects=[{'object_name': objname, 'start_time': 0, 'end_time': 0, 'interval': 0, 'maximum_rows': 0}])
    stat = binascii.a2b_base64(csv[0].statistic_data)
    statline = stat.split('\n')
    header = 'date\ttime\t'
    for head in statline[0].split(',')[1:]:
        header += head.strip('\"') + '\t'
    # ############################
    # Creating tsv file, sorting, and adding unique data
    # ############################
    ftout = ''
    if os.path.exists(f5perfdata):
        # Output file already exists: read the newest epoch from its last line and append only newer rows.
        statfile = open(f5perfdata, 'r')
        timestamp = float(statfile.readlines()[-1].split('\t')[-1])
        statfile.close()
        statline.pop()
        for x in statline[1:-1]:
            element = x.split(',')
            if len(element) > 1 and float(element[0]) > timestamp and ' nan' not in element:
                ftout += strftime("\n%Y-%m-%d\t%H:%M", localtime(float(element[0]))) + '\t'
                for ele in element[1:]:
                    ftout += ele + '\t'
                ftout += element[0].strip('\n')
        statfile = open(f5perfdata, 'a')
        statfile.write(ftout)
        statfile.close()
    else:
        # Output file does not exist yet: create it with a header and today's rows.
        ftout = header + 'epoch'
        for x in statline[1:]:
            element = x.split(',')
            if len(element) > 1 and 'timestamp' not in element[0]:
                if strftime("%Y%m%d", localtime()) == strftime("%Y%m%d", localtime(float(element[0]))):
                    ftout += strftime("\n%Y-%m-%d\t%H:%M", localtime(float(element[0]))) + '\t'
                    for ele in element[1:]:
                        ftout += ele + '\t'
                    ftout += element[0]
        statfile = open(f5perfdata, 'w')
        statfile.write(ftout)
        statfile.close()
    if not os.path.exists(outputdef):
        readtxt = "Below are additional objects that can be used to generate data.\n\ngraph_name graph_title graph_description\n---------- ----------- -----------------\nmemory Memory Used Memory Used\nactivecons Active Connections Active Connections\nnewcons New Connections New Connections\nthroughput Throughput Throughput\nhttprequests HTTP Requests HTTP Requests\nramcache RAM Cache Utilization RAM Cache Utilization\ndetailactcons1 Active Connections Active Connections\ndetailactcons2 Active PVA Connections Active PVA Connections\ndetailactcons3 Active SSL Connections Active SSL Connections\ndetailnewcons1 Total New Connections Total New Connections\ndetailnewcons2 New PVA Connections New PVA Connections\ndetailnewcons3 New ClientSSL Profile Connections New ClientSSL Profile Connections\ndetailnewcons4 New Accepts/Connects New Accepts/Connects\ndetailthroughput1 Client-side Throughput Client-side Throughput\ndetailthroughput2 Server-side Throughput Server-side Throughput\ndetailthroughput3 HTTP Compression Rate HTTP Compression Rate\nSSLTPSGraph SSL Transactions/Sec SSL Transactions/Sec\nGTMGraph GTM Performance GTM Requests and Resolutions\nGTMrequests GTM Requests GTM Requests\nGTMresolutions GTM Resolutions GTM Resolutions\nGTMpersisted GTM Resolutions Persisted GTM Resolutions Persisted\nGTMret2dns GTM Resolutions Returned to DNS GTM Resolutions Returned to DNS\ndetailcpu0 CPU Utilization CPU Usage\ndetailcpu1 CPU Utilization CPU Usage\nCPU CPU Utilization CPU Usage\ndetailtmm0 TMM Utilization TMM Usage\nTMM TMM Utilization TMM CPU Utilization"
        readme = open(outputdef, 'w')
        readme.write(readtxt)
        readme.close()

# ############################
# Main Body
# ############################
for loc in locations:
    for x in graphobjs:
        debug(x, isdebug)
        f5csvstats(x, 'E:/Data/perfdata/' + loc + '/f5/', loc)
sys.exit(0)

Hope this helps or gets you started. If this does help please Vote up and/or accept it.

bmacias84
Champion

That script only looks for data with timestamps newer than what was previously written to the data file. I chose the timestamp over the hash method because my data file was getting too large to store in memory and it became very slow. The hash and dictionary portion is really easy to do.
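For comparison, the timestamp route only has to find the newest epoch already in the file instead of holding a hash of every event. A stripped-down sketch (made-up file name, and it assumes the epoch is the first tab-separated field on each line):

import os

OUTFILE = 'perfdata.tsv'  # made-up path

def last_epoch():
    # Stream the existing file and keep only the last non-empty line,
    # so the whole file never has to sit in memory.
    if not os.path.exists(OUTFILE):
        return 0.0
    last = None
    with open(OUTFILE) as f:
        for line in f:
            if line.strip():
                last = line
    return float(last.split('\t')[0]) if last else 0.0

def append_newer(rows):
    # rows: list of (epoch, line) tuples from the latest query
    watermark = last_epoch()
    with open(OUTFILE, 'a') as f:
        for epoch, line in rows:
            if epoch > watermark:  # anything at or below the watermark was already written
                f.write(line + '\n')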


bmacias84
Champion

@dondky, I don't know how helpful it will be since it is data specific and uses a custom imported module, but sure. This is for an F5 load balancer pulling CSV stats.


dondky
Path Finder

Thanks for the reply. Would you happen to have some sample code you can share for the process? The goal is not to reinvent the wheel, but to build on what others have already done.
