What are the best practices for defining source types?

sloshburch
Splunk Employee

I've heard that using Splunk's default source type detection is flexible, but can be hard on performance. What is the best way to define source types that keeps performance speedy?

1 Solution

sloshburch
Splunk Employee

The Splunk Product Best Practices team provided this response. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.

Overview

To start with, it helps to know how Splunk software parses data. See Configure event processing in the Getting Data In manual for a good background.

From a performance perspective, event line breaking and event timestamps are two areas where you can define source types that streamline how Splunk software parses data.

Event line breaking

The Splunk Getting Data In manual does a great job going into detail in the event line breaking topic. Within that topic, you'll learn about these specific attributes that consistently correlate with performance:

  • SHOULD_LINEMERGE combines several lines of data into a single multi-line event. Defaults to true.
  • LINE_BREAKER determines how the raw text stream is broken into initial events, before line merging takes place. Default is the regex for new lines, ([\r\n]+).

This means that as data comes in, Splunk software organizes the raw text into events based on new lines while also trying to keep multiple lines of the same event together as one larger event. You can streamline this to be more efficient if you know the data's patterns. For example, a typical syslog source only produces single line events (in contrast to stack traces, which could span multiple lines), so I can set:

SHOULD_LINEMERGE = false

This reduces the processing capacity Splunk software spends on line merging, while keeping the default LINE_BREAKER.
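
As a minimal props.conf sketch of that single-line setup (the sourcetype name my_syslog is made up for illustration):

# props.conf -- hypothetical sourcetype for single-line syslog data
[my_syslog]
# Events never span multiple lines, so skip the line-merging pass
SHOULD_LINEMERGE = false
# Keep the default LINE_BREAKER: break the stream on runs of newlines
LINE_BREAKER = ([\r\n]+)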

Event timestamps

The Splunk Getting Data In manual provides detail on event timestamps in the topic How timestamp assignment works.

Some key attributes of the timestamp assignment that affect performance are:

  • MAX_TIMESTAMP_LOOKAHEAD - Specifies how many characters past the TIME_PREFIX pattern Splunk software looks for a timestamp. Defaults to 128 characters.
  • TIME_FORMAT - Specifies a strptime format string to extract the date. No default set.
  • TIME_PREFIX - If set, Splunk software scans the event text for a match to this regex before attempting to extract a timestamp. No default set.

This means that as data comes in, the Splunk software indexing process evaluates the first 128 characters of every event for anything that could resemble a timestamp. That's a lot of processing! To really appreciate it, explore the wide range of variability the automatic detection covers within the topic Configure timestamp recognition in the Getting Data In manual. If you already know the timestamp will occur, for example, at the start of the event, and will be similar in pattern to 2019-01-01 19:08:01, then you can use a configuration like this:

MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %F %T
TIME_PREFIX = ^

Now Splunk software only looks 32 characters past the start of the event (^) for a timestamp specifically matching the pattern %Y-%m-%d %H:%M:%S. That's a lot more specific, and a lot more performant!
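
For instance, in a hypothetical event like the one below, the 19-character timestamp sits at the very start, well inside the 32-character lookahead:

2019-01-01 19:08:01 action=login user=alice src=10.0.0.1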

Punctuation

If you are truly a pro with source type definition, then you'll find you never use the punct field. In such a scenario, you may find performance benefits by turning off the generation of this field. The regex processing for each event that produces the punct has been shown to have performance implications in high volume environments.

ANNOTATE_PUNCT = false

If you find you need punct after you've removed it, then explore recreating it at search time with the eval command's replace function:

| eval punct = replace( replace( _raw , "\w" , "" ) , "\s" , "_" )

Remember that this is resource intensive and may make for a slow search while it processes.
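
Putting the sections above together, a complete source type definition might look like this sketch (again, the my_syslog stanza name is hypothetical):

# props.conf -- single-line events, known timestamp format, punct disabled
[my_syslog]
SHOULD_LINEMERGE = false
TIME_PREFIX = ^
TIME_FORMAT = %F %T
MAX_TIMESTAMP_LOOKAHEAD = 32
ANNOTATE_PUNCT = false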

Don't Forget!

As you go forth strengthening your source types, remember to consider best practices for Source type naming conventions in the Splunk Supported Add-ons manual.
Also, depending on the data, other settings could produce more dramatic results. For example, the CHARSET attribute can be critical for performance with non-ASCII, UTF-8-encoded machine data, such as the data widely produced outside English-speaking countries.
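
If you know the encoding up front, you can declare it explicitly rather than leaving it to detection (a one-line sketch, assuming the feed really is UTF-8):

[my_syslog]
CHARSET = UTF-8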

Sources



skoelpin
SplunkTrust

Sourcetype naming won't really give you better performance, but rather makes it easier to maintain when scaling your environment. It will also cut back on the tech debt if you do it right the first time. To be clear, you should define a new sourcetype when you encounter a new log format that is not already in Splunk.

You should use as few sourcetypes as possible, as it gets more difficult to scale when you have to manage a ton of sourcetypes. So if you're onboarding logs and trying to determine a good sourcetype name, use the punct field to identify existing logs that match the format of your log sample to see if it's already onboarded. The reason to do this is that when you add a new sourcetype, you need to write base configs for each sourcetype to ensure proper line breaking and timestamping.

Also, do not use environment variables in your sourcetypes. A better approach would be to use the same sourcetype and tag events that belong to an environment. You should also follow the CIM and use : in place of underscores.
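
As a sketch of that tagging approach (the eventtype name, tag name, and host pattern are all made up):

# eventtypes.conf -- classify events coming from production hosts
[env_production]
search = host=prod-*

# tags.conf -- tag the eventtype instead of baking "prod" into the sourcetype name
[eventtype=env_production]
production = enabled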

sloshburch
Splunk Employee

Yes! I love it! Great point about checking for existing sourcetypes with that punct. Do you have a search you use for that investigation you'd be willing to share?

I've seen success putting the env name as part of the index naming convention as well as a tag (like you mentioned). But keeping it out of the sourcetype will mean fewer redundant sourcetype definitions FTW.


skoelpin
SplunkTrust

Feel free to upvote if this was helpful

I have a dashboard I call "Sourcetype Ledger" which accepts text input and compares against what's in the system. Here's a view under the hood

| tstats summariesonly=t count where index=* sourcetype!=*_* punct="*$compare|n$*" by sourcetype punct 
| stats dc(punct) as unique_patterns sum(count) as popularity by sourcetype
| sort - popularity
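
To try something similar outside a dashboard, you could swap the $compare|n$ token for a literal punct fragment taken from your sample event (the fragment below is hypothetical):

| tstats count where index=* sourcetype!=*_* punct="*=_=_*" by sourcetype punct
| stats dc(punct) as unique_patterns sum(count) as popularity by sourcetype
| sort - popularity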

sloshburch
Splunk Employee

HOT DANG! Very cool dude! I'd love to strengthen a prescription around getting data in that embraces this very artwork! I'll point back here and give you a shout out on that post when I get around to it. You rock!


skoelpin
SplunkTrust

Sure thing. Feel free to add me on LinkedIn too if you want more details on building the dashboard to drive that comparison


itsmevic
Communicator

Hi Skoelpin, I'd like to add you on LinkedIn too if you don't mind.

skoelpin
SplunkTrust

Yeah sure. You can find me at www.linkedin.com/in/skoelpin


jeffland
SplunkTrust

Unfortunately, the links at the end of this post point to "depublished" conf talks, as Splunk for some reason does not keep old talks searchable from the .conf pages. There is an idea to fix this (https://ideas.splunk.com/ideas/PORTALSID-I-141) - feel free to upvote if you'd like, and @sloshburch maybe you can link to the PDF directly instead?


jkat54
SplunkTrust

This simply is not true and should be dispelled:

"If you are truly a pro with source type definition, then you'll find you never use the punct field. In such a scenario, you may find performance benefits by turning off the generation of this field. The regex processing for each event that produces the punct has been shown to have performance implications in high volume environments.

ANNOTATE_PUNCT = false"

I've heard people tout as much as 4% performance impact here... but at what cost?

If someone thinks the punct field is useless, then why is it enabled by default in the product?

Someone obviously thought it served a purpose at some point.

I use it occasionally to find outliers in my data. Those outliers could be due to me not being a "pro", or perhaps the data is onboarded fine but I still want to use the punct field from time to time.

The truth of the matter is, punct consumes disk because it is an indexed extraction. It's also consuming CPU at index time. How much disk does it use? That depends on how big your events are to begin with. The punct field typically has varying lengths, so to find out how much punct is costing you in disk space, you can evaluate it directly:

| tstats count where index=* OR index=_* by punct index 
| eval bytes=len(punct)*count 
| stats sum(eval(bytes/1024/1024/1024)) as GB_used count by index

The above doesn't take into account disk compression, search factor, and replication factor; the below does:

| tstats count where index=* OR index=_* by punct index 
| eval bytes=len(punct)*count 
| eval replicationFactor=2
| eval searchFactor=3
| stats sum(eval(bytes/1024/1024/1024)) as GB count by index replicationFactor searchFactor
| eval Estimated_GB_used = (0.15*replicationFactor*GB) + (0.35*searchFactor*GB)

How much CPU does it use?

This I don't know, but I imagine it costs as much as any other index-time extraction... a CPU lock for each event while the regex is computed. Modern CPU architectures are capable of millions if not billions of locks per second, so hopefully "one more indexed extraction" won't break the camel's back.

But to say it's best practice to disable this field has grated my nerves since I first learned it was being taught.

How about: it's best practice not to ship software with settings that go against best practices? Once you think of it from this standpoint, you begin to wonder... what are the values of this punct field, and why would I want them in my tsidx files? Should you just extract it at search time if you need it?

sloshburch
Splunk Employee

Heads up that len returns the character length of a string, which is not always the same as the bytes. It's certainly a fine proxy, but I wanted to toss that disclaimer out there since I know we all eventually overlook that when comparing license_usage.log against len(_raw) and being confused by the difference.
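
A quick way to see the difference (a hypothetical one-liner; "café" is 4 characters but 5 bytes when encoded as UTF-8):

| makeresults
| eval s="café"
| eval chars=len(s)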

OK, back to your regularly scheduled splunking!

Source: Text functions


jkat54
SplunkTrust

You're right about len, so then we should be able to stipulate when it's 8 bytes per char versus more or less and further discuss. At least add a note that this assumes ASCII, etc.


woodcock
Esteemed Legend

s/bytes/bits/


sloshburch
Splunk Employee

Yeah, but my gut is telling me this is starting to get a bit messy. When I take a step back, I realize I'm more interested in the CPU part than the disk usage part anyway, because the rest of the sourcetype tuning primarily drives CPU-bound improvements. So I'm inclined to pivot back to research on the CPU usage and trust that the disk usage is a subtle cost of any indexed field.

That's a lot of gut and feeling. Keep me honest - am I oversimplifying the disk vs. CPU pivot here?


sloshburch
Splunk Employee

@jkat54 - I could hug you - that was such a great response!

Before I go into praising more on the rich content you added, I do want to add nuance to your point about why it's a default setting.

Specifically, I think it's fair to conclude that, like other default settings, this one is configured in the way that is safest for automatic detection. Let me try saying that differently: we know that some of the other default settings are very forgiving (time detection, event breaks) but at a cost of CPU. When we have confidence about those aspects of the data, we can tune the sourcetype, thereby producing efficiencies.
To that end, I think the example you gave is perfect! Consider trying to find outliers: if I were new to Splunk, I could send my data in all as one sourcetype, and then use punct to find the different patterns that I want to mature into their own sourcetypes.
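
As a sketch of that workflow (the index and sourcetype names are made up):

index=main sourcetype=onboarding_catchall
| stats count by punct
| sort - count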

Now more praise! I totally agree about the outlier use case. That is definitely a great use of the punct field. But, like you said, at what cost? Thank you for adding the specific searches that can be used to measure that. I'll take your contribution a step further and do some research on my end to get some data on the CPU impact...because I KNOW there's been charts of that shown at .confs.

Anyway, clearly I have some homework based on what you shared. As I play more with the searches you shared and adjust the post to reflect your analysis, expect to see some karma head your way!

Thanks again!


tmoser
Splunk Employee

Perhaps you meant this presentation?

https://conf.splunk.com/files/2016/slides/observations-and-recommendations-on-splunk-performance.pdf

I am pretty sure there was a presentation that said that manually defining 6-8 configuration options under the sourcetype stanza could save up to 60% of CPU resources during indexing.
