Reimporting Event Data / Fixing Data Holes

jroedel
Path Finder

Because of network problems between my heavy forwarders (HFs) and my indexing tier, I have some "holes" in my data, i.e. ranges of missing events that need to be fixed. My idea for this is as follows:

  1. reindex all logs (and rotated logs) within the time range, but into a new index
  2. search for index=newindex OR index=originalindex | eventstats count by _raw | where count=1 | eval count=null (an event present in both indexes gets count=2, so count=1 flags the events missing from the original index)
  3. export these events to a file
  4. reimport these events into the original index

Generally this seems to work, but there is still one problem: my data comes from many different sources with many different source types. How can I export the events together with their source and sourcetype and keep those fields when reimporting? I am also open to better solutions to my general problem.

Thanks in advance.

1 Solution

jroedel
Path Finder

I found a solution myself in the meantime, in particular for steps three and four:

3) export the search results as a JSON file (a scripted way to do this is sketched right below)
4.1) run the JSON file through my Python script (see further below): ./script.py > /tmp/missingdata.txt
4.2) one-shot its output into the index: ./splunk add oneshot /tmp/missingdata.txt -index foo -sourcetype logimport
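Step 3 does not have to go through the UI. Here is a minimal sketch of scripting the export against the REST search API; the hostname, credentials, index names, and time range are assumptions, so adjust them to your environment:

#!/usr/bin/python3
# Hypothetical export helper: streams the dedup search to export.json,
# one JSON object per line. Host, credentials, and times are assumptions.
import requests

SEARCH = (
    'search index=newindex OR index=originalindex '
    '| eventstats count by _raw | where count=1 '
    '| fields _raw host source sourcetype'
)

resp = requests.post(
    'https://splunk.example.com:8089/services/search/jobs/export',
    auth=('admin', 'changeme'),
    data={
        'search': SEARCH,
        'output_mode': 'json',   # newline-delimited JSON results
        'earliest_time': '-30d', # the affected time range
        'latest_time': 'now',
    },
    stream=True,
    verify=False,                # only if the REST port has a self-signed cert
)
resp.raise_for_status()

with open('./export.json', 'w') as out:
    for line in resp.iter_lines(decode_unicode=True):
        if line:                 # skip keep-alive blank lines
            out.write(line + '\n')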


my python3 script:

#!/usr/bin/python3
import json

# The export is one JSON object per line; each object wraps the event
# in a "result" dict carrying _raw, host, source and sourcetype.
with open('./export.json', 'r') as fp:
    for line in fp:
        result = json.loads(line)["result"]
        # Emit the raw event followed by its metadata and a ### separator,
        # which props.conf/transforms.conf below pick apart at index time.
        print(result["_raw"])
        print("HOST = " + result["host"])
        print("SOURCE = " + result["source"])
        print("SOURCETYPE = " + result["sourcetype"])
        print("###")
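For reference, each line of export.json looks roughly like this (the event and field values here are made up):

{"preview": false, "offset": 0, "result": {"_raw": "Jan 01 12:00:00 web01 sshd[123]: Accepted publickey for root", "host": "web01", "source": "/var/log/secure", "sourcetype": "linux_secure"}}

and the script turns it into one ###-terminated chunk:

Jan 01 12:00:00 web01 sshd[123]: Accepted publickey for root
HOST = web01
SOURCE = /var/log/secure
SOURCETYPE = linux_secure
###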

props.conf:

[logimport]
# Break the one-shot file into individual events at the ### separator.
LINE_BREAKER = (###\n)
# Index-time transforms; importraw must come last because it rewrites _raw
# once the metadata lines have been extracted.
TRANSFORMS = importsource, importsourcetype, importhost, importraw

transforms.conf:

[importhost]
# Pull the host from the "HOST = ..." line and set it as the event's host.
REGEX = \nHOST = (.*)
FORMAT = host::$1
DEST_KEY = MetaData:Host
WRITE_META = true

[importsource]
# Same for the "SOURCE = ..." line.
REGEX = \nSOURCE = (.*)
FORMAT = source::$1
DEST_KEY = MetaData:Source

[importsourcetype]
# Same for the "SOURCETYPE = ..." line.
REGEX = \nSOURCETYPE = (.*)
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

[importraw]
# Keep only the original event text: everything before the first HOST line.
REGEX = ^(.*)\nHOST
DEST_KEY = _raw
FORMAT = $1
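To sanity-check these regexes outside Splunk, a small Python script will do; Python's re is close enough to Splunk's PCRE for patterns this simple, and the sample chunk below is made up:

#!/usr/bin/python3
# Rough offline check of the transforms.conf regexes against one
# ###-delimited chunk. The sample event is made up.
import re

chunk = (
    "Jan 01 12:00:00 web01 sshd[123]: Accepted publickey for root\n"
    "HOST = web01\n"
    "SOURCE = /var/log/secure\n"
    "SOURCETYPE = linux_secure\n"
)

print("host       =", re.search(r"\nHOST = (.*)", chunk).group(1))
print("source     =", re.search(r"\nSOURCE = (.*)", chunk).group(1))
print("sourcetype =", re.search(r"\nSOURCETYPE = (.*)", chunk).group(1))
# importraw keeps everything before the first HOST line as the new _raw
print("_raw       =", re.search(r"^(.*)\nHOST", chunk).group(1))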

woodcock
Esteemed Legend

Way to go sharing your code. Come back and click Accept on your answer to close the question and let other people know there is a good answer.


woodcock
Esteemed Legend

Over the time range in question, run a search like this:

index="foo" AND sourcetype="bar" | stats count BY source

Then in the shell on your HF, do this:

for FILE in /path/to/files/in/question/*
do
   wc -l ${FILE}
done

Then cross-reference these two lists (a scripted version of this step is sketched after the commands below). For whichever sources are wrong, do this in Splunk:

index="foo" AND sourcetype="bar" source="bad" | delete

Then in the shell on your HF, for each bad file, do this:

/opt/splunk/bin/splunk add oneshot /path/to/files/in/question/bad.csv -index foo -sourcetype bar -auth admin:changeme
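If there are many sources, the cross-reference step can be scripted. A minimal sketch, assuming the search results above were exported to counts.csv with columns source and count, and that source holds the full file path:

#!/usr/bin/python3
# Hypothetical cross-reference of Splunk's per-source event counts against
# local line counts. counts.csv and the glob pattern are assumptions.
import csv
import glob

indexed = {}
with open('counts.csv', newline='') as fp:
    for row in csv.DictReader(fp):
        indexed[row['source']] = int(row['count'])

for path in glob.glob('/path/to/files/in/question/*'):
    with open(path, errors='replace') as fp:
        on_disk = sum(1 for _ in fp)
    if indexed.get(path, 0) != on_disk:
        print(f"MISMATCH {path}: indexed={indexed.get(path, 0)} on_disk={on_disk}")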

woodcock
Esteemed Legend

Keep in mind that if the gaps are very large, doing this will erode your overall retention of data because delete does not really delete anything, it just hides it.


jroedel
Path Finder

With this solution I would have to repeat the procedure manually for every sourcetype. Furthermore, the HF mostly just forwards data coming from UFs, to which I have no shell access.

I am sorry, but this solution does not work for me.


somesoni2
Revered Legend

How many events are we talking about here? Do remember that both step 1 (reindexing) and step 4 (reimporting) will be counted against your license.


jroedel
Path Finder

One note on the SPL query: the eventstats must group by _raw (the full raw event text), i.e. eventstats count by _raw.
