Getting Data In

Reimporting Event Data / Fixing Data Holes

Explorer

Because of network problems between my HFs and my indexing tier I have some "holes" in my data, i.e. missing events. These holes need to be fixed. My idea for this goes as follows:

  1. reindex all logs (and rotated logs) within the timerange, but to a new index
  2. search for index=newindex OR index=originalindex | eventstats count by raw | where count=1 | eval count=null
  3. exporting these events in a file
  4. reimporting these events into the original index
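The dedup logic in step 2 can be sketched outside of Splunk as well. A minimal Python sketch (hypothetical event lists, not a Splunk API) of finding events that exist in the reindexed copy but are missing from the original index, keyed by raw event text:

```python
# Equivalent in spirit to searching both indexes and keeping events
# whose raw text appears only once (i.e. only in the fresh reindex).
def find_missing(original_events, reindexed_events):
    original = set(original_events)
    # An event present in the reindexed copy but absent from the
    # original index is a "hole" that needs reimporting.
    return [e for e in reindexed_events if e not in original]

original_index = ["evt A", "evt C"]
new_index = ["evt A", "evt B", "evt C"]
print(find_missing(original_index, new_index))  # → ['evt B']
```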

Generally this seems to work, but there is still one problem: my data comes from many different sources with many different sourcetypes. How can I export data together with its source and sourcetype, and keep those fields when reimporting? I am also open to better solutions to my general problem.

Thanks in advance.

0 Karma
1 Solution

Explorer

I found a solution myself in the meantime. In particular, for steps three and four:

3) export the events as a JSON file
4.1) run the JSON file through my Python script (see below): ./script.py > /tmp/missingdata.txt
4.2) one-shot its output into the index: ./splunk add oneshot /tmp/missingdata.txt -index foo -sourcetype logimport


My Python 3 script:

#!/usr/bin/python3
import json

# Read the Splunk JSON export line by line and emit one text record
# per event: the raw event, its metadata lines, and a "###" separator
# that LINE_BREAKER in props.conf splits on.
with open('./export.json', 'r') as fp:
    for line in fp:
        result = json.loads(line)["result"]
        print(result["_raw"])
        print("HOST = " + result["host"])
        print("SOURCE = " + result["source"])
        print("SOURCETYPE = " + result["sourcetype"])
        print("###")
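For reference, this is the shape of each record the script writes to /tmp/missingdata.txt (sample values are illustrative, not from the original thread). A small Python sketch that builds one record from a JSON export line:

```python
import json

# One line as produced by Splunk's JSON export (illustrative values).
export_line = json.dumps({"result": {
    "_raw": "Jan 01 00:00:00 myhost sshd[123]: session opened",
    "host": "myhost",
    "source": "/var/log/auth.log",
    "sourcetype": "linux_secure",
}})

# Build the four-part text record terminated by "###", exactly as the
# script above prints it for each event.
r = json.loads(export_line)["result"]
record = "\n".join([r["_raw"],
                    "HOST = " + r["host"],
                    "SOURCE = " + r["source"],
                    "SOURCETYPE = " + r["sourcetype"],
                    "###"])
print(record)
```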

props.conf:

[logimport]
LINE_BREAKER=(###\n)
TRANSFORMS = importsource, importsourcetype, importhost, importraw

transforms.conf:

[importhost]
REGEX = \nHOST = (.*)
FORMAT = host::$1
DEST_KEY = MetaData:Host
WRITE_META = true

[importsource]
REGEX = \nSOURCE = (.*)
FORMAT = source::$1
DEST_KEY = MetaData:Source

[importsourcetype]
REGEX = \nSOURCETYPE = (.*)
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

[importraw]
REGEX = ^(.*)\nHOST
DEST_KEY = _raw
FORMAT = $1
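The transforms can be sanity-checked outside Splunk. A rough Python emulation (sample record is an assumption; Splunk uses PCRE, so minor engine differences are possible) of the four REGEX/FORMAT pairs against one record:

```python
import re

# One record in the format the script emits (illustrative values).
record = ("Jan 01 00:00:00 myhost sshd[123]: session opened\n"
          "HOST = myhost\n"
          "SOURCE = /var/log/auth.log\n"
          "SOURCETYPE = linux_secure")

# Mirror the transforms.conf stanzas: each metadata regex anchors on
# its literal label line; importraw keeps everything before "\nHOST"
# (for a single-line raw event, "." not matching newlines suffices).
host = re.search(r"\nHOST = (.*)", record).group(1)
source = re.search(r"\nSOURCE = (.*)", record).group(1)
sourcetype = re.search(r"\nSOURCETYPE = (.*)", record).group(1)
raw = re.search(r"^(.*)\nHOST", record).group(1)

print(host, source, sourcetype)
print(raw)
```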

View solution in original post


Esteemed Legend

Way to go sharing your code. Come back and click Accept on your answer to close the question and let other people know there is a good answer.

0 Karma

Esteemed Legend

Over the time range in question, run a search like this:

index="foo" AND sourcetype="bar" | stats count BY source

Then in the shell on your HF, do this:

for FILE in /path/to/files/in/question/*
do
   wc -l "${FILE}"
done

Then cross-reference these two lists. For whichever sources are wrong, do this in Splunk:

index="foo" AND sourcetype="bar" source="bad" | delete

Then in the shell on your HF, for each bad file, do this:

/opt/splunk/bin/splunk add oneshot /path/to/files/in/question/bad.csv -index foo -sourcetype bar -auth admin:changeme
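The cross-referencing of the two lists can be scripted rather than done by eye. A minimal sketch (hypothetical counts; the Splunk side would come from the `stats count BY source` output, the file side from `wc -l`):

```python
# Events per source as reported by Splunk (from `stats count BY source`).
splunk_counts = {"/var/log/a.csv": 100, "/var/log/b.csv": 80}

# Line counts of the same files on the forwarder (from `wc -l`).
file_counts = {"/var/log/a.csv": 100, "/var/log/b.csv": 95}

# Any source whose indexed count differs from its file line count is
# a candidate for `| delete` plus re-oneshot.
bad = [s for s in file_counts
       if splunk_counts.get(s, 0) != file_counts[s]]
print(bad)  # → ['/var/log/b.csv']
```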
0 Karma

Esteemed Legend

Keep in mind that if the gaps are very large, doing this will erode your overall retention of data because delete does not really delete anything, it just hides it.

0 Karma

Explorer

With this solution I would have to repeat the procedure manually for every sourcetype. Furthermore, the HF mostly just forwards data coming from UFs to which I have no shell access.

I am sorry, but this solution does not work for me.

0 Karma

SplunkTrust
SplunkTrust

How many events are we talking about here? Remember that both step 1 and step 2 will count against your license.

0 Karma

Explorer

There is a mistake in the SPL query: it should be eventstats count by _raw.

0 Karma