Splunk Search

How does Splunk extract search-time fields in "interesting fields"?

goelt2000
Explorer

Which props.conf setting does Splunk use to extract interesting fields from the _raw field?

I am trying to use the collect command to copy _raw data from one index into another. However, it does not extract interesting fields. If I specify sourcetype=splunkd, it does extract interesting fields. I understand that using a different sourcetype (other than stash) counts against license usage, so I should be able to create a custom search-time field extraction for the stash source file paths without incurring any license cost.
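Something like this is what I have in mind for a local props.conf on the search head (just an untested sketch; the extraction name is a placeholder and the regex is the one from the splunkd stanza shown below):

[stash]
# untested sketch: reuse the splunkd search-time regex for events collected as sourcetype=stash
# "EXTRACT-splunkd_fields" is only a placeholder name
EXTRACT-splunkd_fields = (?i)^(?:[^ ]* ){2}(?:[+\-]\d+ )?(?P<log_level>[^ ]*)\s+(?P<component>[^ ]+) - (?P<event_message>.+)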

I ran ./splunk btool props list splunkd and this is what it shows.

 

[splunkd]
ADD_EXTRA_TIME_FIELDS = True
ANNOTATE_PUNCT = True
AUTO_KV_JSON = true
BREAK_ONLY_BEFORE = 
BREAK_ONLY_BEFORE_DATE = True
CHARSET = UTF-8
DATETIME_CONFIG = /etc/datetime.xml
DEPTH_LIMIT = 1000
DETERMINE_TIMESTAMP_DATE_WITH_SYSTEM_TIME = false
EXTRACT-fields = (?i)^(?:[^ ]* ){2}(?:[+\-]\d+ )?(?P<log_level>[^ ]*)\s+(?P<component>[^ ]+) - (?P<event_message>.+)
HEADER_MODE = 
LB_CHUNK_BREAKER_TRUNCATE = 2000000
LEARN_MODEL = true
LEARN_SOURCETYPE = true
LINE_BREAKER_LOOKBEHIND = 100
MATCH_LIMIT = 100000
MAX_DAYS_AGO = 2000
MAX_DAYS_HENCE = 2
MAX_DIFF_SECS_AGO = 3600
MAX_DIFF_SECS_HENCE = 604800
MAX_EVENTS = 256
MAX_TIMESTAMP_LOOKAHEAD = 40
MUST_BREAK_AFTER = 
MUST_NOT_BREAK_AFTER = 
MUST_NOT_BREAK_BEFORE = 
SEGMENTATION = indexing
SEGMENTATION-all = full
SEGMENTATION-inner = inner
SEGMENTATION-outer = outer
SEGMENTATION-raw = none
SEGMENTATION-standard = standard
SHOULD_LINEMERGE = false
TIME_FORMAT = %m-%d-%Y %H:%M:%S.%l %z
TRANSFORMS = 
TRUNCATE = 20000
detect_trailing_nulls = false
maxDist = 100
priority = 
sourcetype = 
termFrequencyWeightedDist = false

 

For the default stanza, it shows:

 

[default]
ADD_EXTRA_TIME_FIELDS = True
ANNOTATE_PUNCT = True
AUTO_KV_JSON = true
BREAK_ONLY_BEFORE = 
BREAK_ONLY_BEFORE_DATE = True
CHARSET = UTF-8
DATETIME_CONFIG = /etc/datetime.xml
DEPTH_LIMIT = 1000
DETERMINE_TIMESTAMP_DATE_WITH_SYSTEM_TIME = false
HEADER_MODE = 
LB_CHUNK_BREAKER_TRUNCATE = 2000000
LEARN_MODEL = true
LEARN_SOURCETYPE = true
LINE_BREAKER_LOOKBEHIND = 100
MATCH_LIMIT = 100000
MAX_DAYS_AGO = 2000
MAX_DAYS_HENCE = 2
MAX_DIFF_SECS_AGO = 3600
MAX_DIFF_SECS_HENCE = 604800
MAX_EVENTS = 256
MAX_TIMESTAMP_LOOKAHEAD = 128
MUST_BREAK_AFTER = 
MUST_NOT_BREAK_AFTER = 
MUST_NOT_BREAK_BEFORE = 
SEGMENTATION = indexing
SEGMENTATION-all = full
SEGMENTATION-inner = inner
SEGMENTATION-outer = outer
SEGMENTATION-raw = none
SEGMENTATION-standard = standard
SHOULD_LINEMERGE = True
TRANSFORMS = 
TRUNCATE = 10000
detect_trailing_nulls = false
maxDist = 100
priority = 
sourcetype = 
termFrequencyWeightedDist = false

 

I verified the data and it is not in JSON format, so AUTO_KV_JSON would not apply to it.

The only thing I could find in transforms.conf and props.conf that separates fields based on "=" is

 

[ad-kv]
CAN_OPTIMIZE = True
CLEAN_KEYS = True
DEFAULT_VALUE = 
DEPTH_LIMIT = 1000
DEST_KEY = 
FORMAT = 
KEEP_EMPTY_VALS = False
LOOKAHEAD = 4096
MATCH_LIMIT = 100000
MV_ADD = true
REGEX = (?<_KEY_1>[\w-]+)=(?<_VAL_1>[^\r\n]*)
SOURCE_KEY = _raw
WRITE_META = False

 

which is being called by 

 

[ActiveDirectory]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+---splunk-admon-end-of-event---\r\n[\r\n]*)
EXTRACT-GUID = (?i)(?!=\w)(?:objectguid|guid)\s*=\s*(?<guid_lookup>[\w\-]+)
EXTRACT-SID = objectSid\s*=\s*(?<sid_lookup>\S+)
REPORT-MESSAGE = ad-kv
# some schema AD events may be very long
MAX_EVENTS = 10000
TRUNCATE = 100000
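So presumably a similar REPORT/transforms pair could be pointed at the stash sourcetype as well. Something like this (untested sketch; the stanza and report names are placeholders, and the regex is copied from ad-kv above), in transforms.conf:

[stash-kv]
# untested sketch: same key=value regex as ad-kv, applied to _raw at search time
REGEX = (?<_KEY_1>[\w-]+)=(?<_VAL_1>[^\r\n]*)
SOURCE_KEY = _raw
MV_ADD = true

and in props.conf:

[stash]
REPORT-stash_kv = stash-kv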

 

 


richgalloway
SplunkTrust

That's a lot of work to create a backup of an index.  Splunk has a document describing how to back up indexed data.  See https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/Backupindexeddata

Another way to protect your data is via replication done by an indexer cluster.  See https://docs.splunk.com/Documentation/Splunk/8.2.1/Indexer/Aboutclusters

---
If this reply helps you, Karma would be appreciated.

richgalloway
SplunkTrust

What regex did you use in your rex command?  I would use the expression from the EXTRACT-fields attribute in props.conf and then add more rex commands to extract additional fields.
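For example, something along these lines (untested sketch; the index name is a placeholder and the regex is the one from your EXTRACT-fields attribute):

index=destination_index sourcetype=stash
| rex field=_raw "(?i)^(?:[^ ]* ){2}(?:[+\-]\d+ )?(?P<log_level>[^ ]*)\s+(?P<component>[^ ]+) - (?P<event_message>.+)"
| table _time log_level component event_message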

Stepping back, what problem are you trying to solve by copying data between indexes?

---
If this reply helps you, Karma would be appreciated.

goelt2000
Explorer

It was a simple regex; the regex was not the issue, since the same regex worked with other sourcetypes but not with stash. This command worked for me, though I am still figuring out how to retain the original host, source, and sourcetype:

| extract auto=t

 

I wanted to merge data from one index into another for a use case. My understanding is that the collect command does the work. It is also documented here:

 

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Collect

 

 

Copying events to a different index
You can use the collect command to copy search results to another index. Construct a search that returns the data you want to copy, and pipe the results to the collect command. For example:

index=foo | ... | collect index=bar

This search writes the results into the bar index. The sourcetype is changed to stash.

You can specify a sourcetype with the collect command. However, specifying a sourcetype counts against your license, as if you indexed the data again.

 

 

We could probably keep the original sourcetype, host, and source values too, but license usage will become an issue since the amount of data is in TBs. I think I saw a thread about how you can append the source, sourcetype, and host to _raw.

I am still looking for it.

 

| eval _raw=_raw." orig_host=".host." orig_source=".source

Once that is done, I can query the destination index like this:

index=destinationindex | extract auto=t | eval host=orig_host ...

 

and I will have a backup of the index data without consuming more license usage.
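Putting the pieces together, the rough plan looks something like this (untested sketch; orig_host, orig_source, and orig_sourcetype are just field names I made up):

index=sourceindex
| eval _raw=_raw." orig_host=".host." orig_source=".source." orig_sourcetype=".sourcetype
| collect index=destinationindex

and then, when reading it back from the destination index:

index=destinationindex
| extract auto=t
| eval host=orig_host, source=orig_source, sourcetype=orig_sourcetype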

The documentation also says where the collected data is stored:

 

The file that is written to the var/spool/splunk path ends in .stash_hec instead of .stash.

 

while the saved results from normal searches are stored under 

var/run/splunk/dispatch.

So Splunk should not replicate the artifacts from spool/splunk to other search head cluster members; I can test that out, though. That should rule out the results getting replicated across search peers and creating duplicate events, correct?

If I schedule a search with the collect command for this use case, should it run in fast mode or verbose mode, or does it not matter (do scheduled searches always run in fast mode)? And where will the results of a scheduled search with a collect command be stored, under dispatch or spool?

What fields does the collect command collect from the source index?

thanks

 


goelt2000
Explorer

I tried using rex on sourcetype=stash and it is not working; even a basic regex is not working. It seems like I will have to change the sourcetype in order to get the interesting fields?

@splunk @richgalloway - would you have any idea? - thanks
