Getting Data In

What are the best practices for defining source types?

sloshburch
Splunk Employee

I've heard that using Splunk's default source type detection is flexible, but can be hard on performance. What is the best way to define source types that keeps performance speedy?

1 Solution

sloshburch
Splunk Employee

The Splunk Product Best Practices team provided this response. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.

Overview

To start with, it helps to know how Splunk software parses data. See Configure event processing in the Getting Data In manual for a good background.

From a performance perspective, event line breaking and event timestamps are two areas where you can define source types that streamline how Splunk software parses data.

Event line breaking

The Event line breaking topic in the Splunk Getting Data In manual covers this in detail. Within that topic, you'll learn about two attributes that consistently affect performance:

  • SHOULD_LINEMERGE combines several lines of data into a single multi-line event. Defaults to true.
  • LINE_BREAKER determines how the raw text stream is broken into initial events, before line merging takes place. Default is the regex for new lines, ([\r\n]+).

This means that as data comes in, Splunk software organizes the raw text into events based on new lines, while also trying to keep multiple lines of the same event together as one larger event. You can streamline this to be more efficient if you know the data's patterns. For example, a typical syslog source only produces single-line events (in contrast to stack traces, which can span multiple lines), so you can set:

SHOULD_LINEMERGE = false

This saves the processing that Splunk software would otherwise spend merging lines, while keeping the default LINE_BREAKER.
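
For reference, here is what that could look like as a complete props.conf stanza. This is just a minimal sketch: the [my_syslog] source type name is made up, and you would adjust the values to match your own data.

[my_syslog]
# Single-line events, so skip the line-merging step entirely
SHOULD_LINEMERGE = false
# Keep the default line breaker: one event per newline-delimited chunk
LINE_BREAKER = ([\r\n]+)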

Event timestamps

The Splunk Getting Data In manual provides detail on event timestamps in the topic How timestamp assignment works.

Some key attributes of the timestamp assignment that affect performance are:

  • MAX_TIMESTAMP_LOOKAHEAD - Specifies how many characters after the TIME_PREFIX pattern Splunk software should look for a timestamp. Defaults to 128 characters.
  • TIME_FORMAT - Specifies a strptime format string to extract the date. No default set.
  • TIME_PREFIX - If set, Splunk software scans the event text for a match to this regex before attempting to extract a timestamp. No default set.

This means that as data comes in, the Splunk software indexing process evaluates the first 128 characters of every event for anything that resembles a time pattern. That's a lot of processing! To really appreciate how much, explore the wide range of variability that automatic detection covers in the topic Configure timestamp recognition in the Getting Data In manual. If you already know that the timestamp occurs, for example, at the start of the event and follows a pattern like 2019-01-01 19:08:01, then you can use a configuration like this:

MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %F %T
TIME_PREFIX = ^

Now Splunk software looks only 32 characters past the start of the event (^) for a timestamp specifically matching the pattern %Y-%m-%d %H:%M:%S, which is what %F %T expands to. That's a lot more specific, and a lot more performant!
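
As a complete props.conf stanza, that might look like the following sketch. The [my_app_log] source type name is hypothetical, and TIME_FORMAT = %F %T works equally well as shorthand for the same pattern.

[my_app_log]
# Timestamp sits at the very start of each event, e.g. 2019-01-01 19:08:01
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
# Only scan the first 32 characters after TIME_PREFIX for the timestamp
MAX_TIMESTAMP_LOOKAHEAD = 32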

Punctuation

If you are truly a pro with source type definition, you may find that you never use the punct field. In that case, you can gain a performance benefit by turning off the generation of this field, because the regex processing that produces punct for every event has been shown to have performance implications in high-volume environments.

ANNOTATE_PUNCT = false

If you find you need punct after you've turned it off, you can approximate it at search time with the eval command's replace function:

| eval punct = replace(replace(_raw, "\w", ""), "\s", "_")

Remember that this workaround is resource intensive and can make the search slow.
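
For example, a common use of punct is grouping events by their shape. A search like the following sketch (the index and source type names are hypothetical) rebuilds an approximate punct on the fly and then groups on it:

index=main sourcetype=my_syslog
| eval punct = replace(replace(_raw, "\w", ""), "\s", "_")
| stats count by punct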

Don't Forget!

As you go forth strengthening your source types, remember to consider best practices for Source type naming conventions in the Splunk Supported Add-ons manual.
Also, depending on the data, other settings can produce even more dramatic results. For example, the CHARSET attribute can be critical for performance with non-ASCII, UTF-8 encoded machine data, such as the data widely produced outside English-speaking countries.
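
Putting it all together, a full source type definition that applies everything above might look like this props.conf sketch. The stanza name and values are illustrative only; adjust them to your data.

[my_custom_sourcetype]
# Line breaking: single-line events, default breaker
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Timestamps: anchored at the start of the event, fixed format, short lookahead
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 32
# Skip punct generation if you never use that field
ANNOTATE_PUNCT = false
# Declare the encoding explicitly for non-ASCII data
CHARSET = UTF-8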

