Reimporting Event Data / Fixing Data Holes

jroedel
Path Finder

Because of network problems between my heavy forwarders (HFs) and my indexing tier, I have some "holes" in my data, i.e. ranges of missing events that need to be fixed. My idea for this is as follows:

  1. reindex all logs (and rotated logs) within the time range, but into a new index
  2. search for index=newindex OR index=originalindex | eventstats count by _raw | where count=1 | eval count=null (an event present in both indexes gets count=2, so count=1 flags the events missing from the original index)
  3. export these events to a file
  4. reimport these events into the original index

Generally this seems to work, but there is still one problem: my data comes from many different sources with many different source types. How can I export the events together with their source and sourcetype and keep those fields when reimporting? I am also open to better solutions to my general problem.

Thanks in advance.

1 Solution

jroedel
Path Finder

I found a solution myself in the meantime, in particular for steps three and four:

3) export the search results as a JSON file (a scripted way to do this is sketched right below)
4.1) run the JSON file through my Python script (see further below): ./script.py > /tmp/missingdata.txt
4.2) one-shot its output into the index: ./splunk add oneshot /tmp/missingdata.txt -index foo -sourcetype logimport
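Step 3 does not have to go through the UI. Here is a minimal sketch of scripting the export against the REST search API; the hostname, credentials, index names, and time range are assumptions, so adjust them to your environment:

#!/usr/bin/python3
# Hypothetical export helper: streams the dedup search to export.json,
# one JSON object per line. Host, credentials, and times are assumptions.
import requests

SEARCH = (
    'search index=newindex OR index=originalindex '
    '| eventstats count by _raw | where count=1 '
    '| fields _raw host source sourcetype'
)

resp = requests.post(
    'https://splunk.example.com:8089/services/search/jobs/export',
    auth=('admin', 'changeme'),
    data={
        'search': SEARCH,
        'output_mode': 'json',   # newline-delimited JSON results
        'earliest_time': '-30d', # the affected time range
        'latest_time': 'now',
    },
    stream=True,
    verify=False,                # only if the REST port has a self-signed cert
)
resp.raise_for_status()

with open('./export.json', 'w') as out:
    for line in resp.iter_lines(decode_unicode=True):
        if line:                 # skip keep-alive blank lines
            out.write(line + '\n')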


my python3 script:

#!/usr/bin/python3
import json

# The export is one JSON object per line; each object wraps the event
# in a "result" dict carrying _raw, host, source and sourcetype.
with open('./export.json', 'r') as fp:
    for line in fp:
        result = json.loads(line)["result"]
        # Emit the raw event followed by its metadata and a ### separator,
        # which props.conf/transforms.conf below pick apart at index time.
        print(result["_raw"])
        print("HOST = " + result["host"])
        print("SOURCE = " + result["source"])
        print("SOURCETYPE = " + result["sourcetype"])
        print("###")
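For reference, each line of export.json looks roughly like this (the event and field values here are made up):

{"preview": false, "offset": 0, "result": {"_raw": "Jan 01 12:00:00 web01 sshd[123]: Accepted publickey for root", "host": "web01", "source": "/var/log/secure", "sourcetype": "linux_secure"}}

and the script turns it into one ###-terminated chunk:

Jan 01 12:00:00 web01 sshd[123]: Accepted publickey for root
HOST = web01
SOURCE = /var/log/secure
SOURCETYPE = linux_secure
###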

props.conf:

[logimport]
# Break the one-shot file into individual events at the ### separator.
LINE_BREAKER = (###\n)
# Index-time transforms; importraw must come last because it rewrites _raw
# once the metadata lines have been extracted.
TRANSFORMS = importsource, importsourcetype, importhost, importraw

transforms.conf:

[importhost]
# Pull the host from the "HOST = ..." line and set it as the event's host.
REGEX = \nHOST = (.*)
FORMAT = host::$1
DEST_KEY = MetaData:Host
WRITE_META = true

[importsource]
# Same for the "SOURCE = ..." line.
REGEX = \nSOURCE = (.*)
FORMAT = source::$1
DEST_KEY = MetaData:Source

[importsourcetype]
# Same for the "SOURCETYPE = ..." line.
REGEX = \nSOURCETYPE = (.*)
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

[importraw]
# Keep only the original event text: everything before the first HOST line.
REGEX = ^(.*)\nHOST
DEST_KEY = _raw
FORMAT = $1
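To sanity-check these regexes outside Splunk, a small Python script will do; Python's re is close enough to Splunk's PCRE for patterns this simple, and the sample chunk below is made up:

#!/usr/bin/python3
# Rough offline check of the transforms.conf regexes against one
# ###-delimited chunk. The sample event is made up.
import re

chunk = (
    "Jan 01 12:00:00 web01 sshd[123]: Accepted publickey for root\n"
    "HOST = web01\n"
    "SOURCE = /var/log/secure\n"
    "SOURCETYPE = linux_secure\n"
)

print("host       =", re.search(r"\nHOST = (.*)", chunk).group(1))
print("source     =", re.search(r"\nSOURCE = (.*)", chunk).group(1))
print("sourcetype =", re.search(r"\nSOURCETYPE = (.*)", chunk).group(1))
# importraw keeps everything before the first HOST line as the new _raw
print("_raw       =", re.search(r"^(.*)\nHOST", chunk).group(1))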

woodcock
Esteemed Legend

Way to go sharing your code. Come back and click Accept on your answer to close the question and let other people know there is a good answer.


woodcock
Esteemed Legend

Over the time range in question, run a search like this:

index="foo" AND sourcetype="bar" | stats count BY source

Then in the shell on your HF, do this:

for FILE in /path/to/files/in/question/*
do
   wc -l ${FILE}
done

Then cross-reference these two lists (a scripted version of this step is sketched after the commands below). For whichever sources are wrong, do this in Splunk:

index="foo" AND sourcetype="bar" source="bad" | delete

Then in the shell on your HF, for each bad file, do this:

/opt/splunk/bin/splunk add oneshot /path/to/files/in/question/bad.csv -index foo -sourcetype bar -auth admin:changeme
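If there are many sources, the cross-reference step can be scripted. A minimal sketch, assuming the search results above were exported to counts.csv with columns source and count, and that source holds the full file path:

#!/usr/bin/python3
# Hypothetical cross-reference of Splunk's per-source event counts against
# local line counts. counts.csv and the glob pattern are assumptions.
import csv
import glob

indexed = {}
with open('counts.csv', newline='') as fp:
    for row in csv.DictReader(fp):
        indexed[row['source']] = int(row['count'])

for path in glob.glob('/path/to/files/in/question/*'):
    with open(path, errors='replace') as fp:
        on_disk = sum(1 for _ in fp)
    if indexed.get(path, 0) != on_disk:
        print(f"MISMATCH {path}: indexed={indexed.get(path, 0)} on_disk={on_disk}")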

woodcock
Esteemed Legend

Keep in mind that if the gaps are very large, doing this will erode your overall retention of data because delete does not really delete anything, it just hides it.


jroedel
Path Finder

With this solution I would have to repeat the procedure manually for every sourcetype. Furthermore, the HF mostly just forwards data coming from UFs, to which I have no shell access.

I am sorry, but this solution does not work for me.


somesoni2
Revered Legend

How many events are we talking about here? Do remember that both step 1 (reindexing) and step 4 (reimporting) will be counted against your license.


jroedel
Path Finder

One note on the SPL query: the eventstats must group by _raw (the full raw event text), i.e. eventstats count by _raw.
