I've heard that using Splunk's default source type detection is flexible, but can be hard on performance. What is the best way to define source types that keeps performance speedy?
To start with, it helps to know how Splunk software parses data. See Configure event processing in the Getting Data In manual for a good background.
From a performance perspective, event line breaking and event timestamps are two areas where you can define source types that streamline how Splunk software parses data.
The Splunk Getting Data In manual does a great job going into detail in the event line breaking topic. Within that topic, you'll learn about these specific attributes that consistently correlate with performance:
SHOULD_LINEMERGE combines several lines of data into a single multi-line event. Defaults to true.
LINE_BREAKER determines how the raw text stream is broken into initial events, before line merging takes place. Default is the regex for new lines, ([\r\n]+).
This means that as data comes in, Splunk software organizes the raw text into events based on new lines while also trying to keep multiple lines of the same event together as one larger event. You can streamline this to be more efficient if you know the data's patterns. For example, a typical syslog source only produces single line events (in contrast to stack traces, which could span multiple lines), so I can set:
SHOULD_LINEMERGE = false
This spares Splunk software the processing it would otherwise spend on merging lines, while keeping the default LINE_BREAKER.
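For sources that do contain multi-line events, such as the stack traces mentioned above, one common approach is to keep line merging off and let LINE_BREAKER alone decide where each event starts. This is a sketch under assumptions rather than a recommendation from the manual: the stanza name is hypothetical, and the regex assumes every event begins with a date such as 2019-01-01.

```ini
# props.conf: hypothetical source type name, for illustration only
[my_app:stacktrace]
# Skip the line-merging pass entirely
SHOULD_LINEMERGE = false
# Break only where newlines are followed by a date such as 2019-01-01.
# Only the first capture group (the newlines) is discarded;
# the date itself stays at the start of the next event.
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}\s
```

Continuation lines of a stack trace do not start with a date, so they stay attached to the event above them without any merging step.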
The Splunk Getting Data In manual provides detail on event timestamps in the topic How timestamp assignment works.
Some key attributes of the timestamp assignment that affect performance are:
MAX_TIMESTAMP_LOOKAHEAD - Specifies how many characters after the TIME_PREFIX pattern Splunk software should look for a timestamp. Defaults to 128 characters.
TIME_FORMAT - Specifies a strptime format string to extract the date. No default set.
TIME_PREFIX - If set, Splunk software scans the event text for a match for this regex before attempting to extract a timestamp. No default set.
This means that as data comes in, the Splunk software indexing process evaluates the first 128 characters of every event for anything that could resemble a time pattern. That's a lot of processing! To really appreciate it, explore the wide range of variability the automatic detection covers within the topic Configure timestamp recognition in the Getting Data In manual. If you already know the timestamp will occur, for example, at the start of the event, and will be similar in pattern to 2019-01-01 19:08:01, then you can use a configuration like this:
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %F %T
TIME_PREFIX = ^
Now Splunk software looks only at the 32 characters after the start of the event (^) for a timestamp, and only for the pattern %Y-%m-%d %H:%M:%S, which is what %F %T expands to. That's a lot more specific, and a lot more performant!
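The same idea applies when the timestamp is not at the start of the event. As a sketch, assume hypothetical key-value events in which the timestamp follows a ts= key; the stanza name, the ts= key, and the sample event in the comments are all illustrative assumptions.

```ini
# props.conf: hypothetical source type for events such as:
#   host=web01 ts=2019-01-01 19:08:01 msg=login ok
[my_app:keyvalue]
# Start looking for the timestamp right after the literal "ts="
TIME_PREFIX = \bts=
# 2019-01-01 19:08:01 is 19 characters long, so look no further than that
MAX_TIMESTAMP_LOOKAHEAD = 19
TIME_FORMAT = %Y-%m-%d %H:%M:%S
```

Because MAX_TIMESTAMP_LOOKAHEAD is counted from the end of the TIME_PREFIX match, the lookahead can be as short as the timestamp itself.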
If you are truly a pro with source type definition, then you'll find you never use the punct field. In such a scenario, you may find performance benefits by turning off the generation of this field. The regex processing for each event that produces the punct field has been shown to have performance implications in high volume environments.
ANNOTATE_PUNCT = false
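Taken together, the settings discussed so far might sit in a single props.conf stanza like the sketch below. The stanza name is hypothetical, and the values assume single-line events that start with a timestamp such as 2019-01-01 19:08:01.

```ini
# props.conf: hypothetical source type name, for illustration only
[my_syslog_like]
# Line breaking: one event per line, no merging pass
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Timestamp: anchored at the start of the event, fixed format
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %F %T
# Skip generation of the punct field
ANNOTATE_PUNCT = false
```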
If you find you need the punct field after you've turned it off, then explore rebuilding it at search time with nested calls to the eval command's replace function:
| eval punct = replace( replace( _raw , "\w" , "" ) , "\s" , "_" )
Remember that this is resource intensive and may make for a slow search while it processes.
As you go forth strengthening your source types, remember to consider best practices for Source type naming conventions in the Splunk Supported Add-ons manual.
Also, depending on the data, other settings could produce more dramatic results. For example, the CHARSET attribute would be critical for performance for non-ASCII UTF-8 encoded machine data, such as the data widely available outside English-speaking countries.