To start with, it helps to know how Splunk software parses data. See Configure event processing in the Getting Data In manual for a good background.
From a performance perspective, event line breaking and event timestamps are two areas where you can define source types that streamline how Splunk software parses data.
The Splunk Getting Data In manual goes into detail in the event line breaking topic. Within that topic, you'll learn about two attributes that consistently correlate with performance:
SHOULD_LINEMERGE - Combines several lines of data into a single multi-line event. Defaults to true.
LINE_BREAKER - Determines how the raw text stream is broken into initial events, before line merging takes place. Default is the regex for new lines, ([\r\n]+).
This means that as data comes in, Splunk software organizes the raw text into events based on new lines while also trying to keep multiple lines of the same event together as one larger event. You can streamline this to be more efficient if you know the data's patterns. For example, a typical syslog source only produces single line events (in contrast to stack traces, which could span multiple lines), so I can set:
SHOULD_LINEMERGE = false
This reduces the capacity Splunk software uses for line merging, while keeping the default LINE_BREAKER behavior of breaking on new lines.
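Putting the two attributes together, a minimal props.conf stanza for a single-line syslog feed might look like this (the [my_syslog] stanza name is a placeholder for your own sourcetype):

[my_syslog]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)

With SHOULD_LINEMERGE off, the LINE_BREAKER regex alone decides event boundaries, so the line-merging pass is skipped entirely.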
The Splunk Getting Data In manual provides detail on event timestamps in the topic How timestamp assignment works.
Some key attributes of the timestamp assignment that affect performance are:
MAX_TIMESTAMP_LOOKAHEAD - Specifies how many characters after the TIME_PREFIX pattern Splunk software should look for a timestamp. Defaults to 128 characters.
TIME_FORMAT - Specifies a strptime format string to extract the date. No default set.
TIME_PREFIX - If set, Splunk software scans the event text for a match for this regex before attempting to extract a timestamp. No default set.
This means that as data comes in, the Splunk software indexing process evaluates the first 128 characters of every event for anything resembling a time pattern. That's a lot of processing! To really appreciate it, explore the wide range of variability the automatic detection covers in the topic Configure timestamp recognition in the Getting Data In manual. If you already know the timestamp will occur, for example, at the start of the event, and will be similar in pattern to 2019-01-01 19:08:01, then you can use a configuration like this:
MAX_TIMESTAMP_LOOKAHEAD = 32
TIME_FORMAT = %F %T
TIME_PREFIX = ^
Now Splunk software will only look 32 characters after the start of the event (^) for a timestamp matching the pattern %Y-%m-%d %H:%M:%S. That's a lot more specific, and a lot more performant!
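It's worth sanity-checking a TIME_FORMAT string against a sample timestamp before deploying it. Here's a quick Python sketch; note that Python's strptime does not accept the %F/%T shorthands, so the expanded equivalent %Y-%m-%d %H:%M:%S is used:

```python
from datetime import datetime

# In strftime/strptime terms, %F %T is shorthand for %Y-%m-%d %H:%M:%S.
# Python's datetime.strptime wants the expanded form.
fmt = "%Y-%m-%d %H:%M:%S"
sample = "2019-01-01 19:08:01"  # timestamp from the example event

parsed = datetime.strptime(sample, fmt)
print(parsed.isoformat())  # 2019-01-01T19:08:01
```

If the format string doesn't match the sample, strptime raises a ValueError, which is exactly the kind of mismatch you want to catch before events start landing with wrong timestamps.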
If you are truly a pro with source type definition, then you'll find you never use the punct field. In that case, you may see performance benefits from turning off generation of this field: the regex processing that produces punct for each event has been shown to have performance implications in high-volume environments.
ANNOTATE_PUNCT = false
If you need punct later, you can approximate it at search time:
| eval punct = replace( replace( _raw , "\w" , "" ) , "\s" , "_" )
Remember that this is resource intensive and may make for a slow search while it processes.
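To see what that search-time eval produces, here's a Python sketch of the same two regex substitutions. The sample log line is made up, and the real punct field Splunk generates can differ in details such as truncation, so treat this as an approximation:

```python
import re

def synth_punct(raw: str) -> str:
    # Mirror the SPL: strip word characters, then map whitespace to "_".
    no_words = re.sub(r"\w", "", raw)
    return re.sub(r"\s", "_", no_words)

# Hypothetical access-log-style event.
print(synth_punct('127.0.0.1 - admin [01/Jan/2019:19:08:01] "GET /index"'))
# ..._-__[//:::]_"_/"
```

The punctuation skeleton that remains is what makes punct handy for spotting events whose shape deviates from the rest of a source.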
As you go forth strengthening your source types, remember to consider best practices for Source type naming conventions in the Splunk Supported Add-ons manual.
Also, depending on the data, other settings could produce more dramatic results. For example, the CHARSET attribute is critical for performance with non-ASCII, UTF-8 encoded machine data, such as the data widely produced outside English-speaking countries.
This simply is not true and should be dispelled:
"If you are truly a pro with source type definition, then you'll find you never use the punct field. In such a scenario, you may find performance benefits by turning off the generation of this field. The regex processing for each event that produces the punct has been shown to have performance implications in high volume environments.
ANNOTATE_PUNCT = false"
I've heard people tout as much as 4% performance impact here... but at what cost?
If someone thinks the punct field is useless, then why is it enabled by default in the product?
Someone obviously thought it served a purpose at some point.
I use it occasionally to find outliers in my data. Those outliers could be due to me not being a "pro", or perhaps the data is onboarded fine but I still want to use the punct field from time to time.
The truth of the matter is, punct consumes disk because it is an indexed extraction. It's also consuming CPU at index time. How much disk does it use? That depends on how big your events are to begin with. Typically the punct field has varying lengths, so to find out how much punct is costing you in disk space, you can measure it with a search.
| tstats count where index=* OR index=_* by punct index | eval bytes=len(punct)*count | stats sum(eval(bytes/1024/1024/1024)) as GB_used count by index
The search above doesn't take into account disk compression, search factor, or replication factor; the one below does:
| tstats count where index=* OR index=_* by punct index | eval bytes=len(punct)*count | eval replicationFactor=2 | eval searchFactor=3 | stats sum(eval(bytes/1024/1024/1024)) as GB count by index replicationFactor searchFactor | eval Estimated_GB_used = (0.15*replicationFactor*GB) + (0.35*searchFactor*GB)
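The arithmetic in those searches is easy to sanity-check outside Splunk. Here's a Python sketch using made-up (punct, count, index) rows in place of real tstats output, with the same one-byte-per-character assumption and the same 0.15*RF + 0.35*SF scaling:

```python
GIB = 1024 ** 3

# Hypothetical tstats-style rows: (punct value, event count, index name).
rows = [
    ('..._-__[//:::]_"_/"', 1_000_000, "web"),
    ("...__[]", 250_000, "web"),
]

def punct_gib(rows, replication_factor=2, search_factor=3):
    """Estimate GiB attributed to punct per index, mirroring the SPL:
    bytes = len(punct) * count, then scale by 0.15*RF + 0.35*SF."""
    raw_gib = {}
    for punct, count, idx in rows:
        raw_gib[idx] = raw_gib.get(idx, 0.0) + len(punct) * count / GIB
    factor = 0.15 * replication_factor + 0.35 * search_factor
    return {idx: factor * gib for idx, gib in raw_gib.items()}

print(punct_gib(rows))
```

The 0.15 and 0.35 multipliers come straight from the search above; they're rough compression estimates, so the result is an order-of-magnitude figure, not an exact bill.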
How much CPU does it use?
This I don't know, but I imagine it costs as much as any other index-time extraction: a CPU lock for each event while the regex is computed. Modern CPU architectures are capable of millions if not billions of locks per second, so hopefully "one more indexed extraction" won't break the camel's back.
But to say it's best practice to disable this field has grated on my nerves since I first learned it was being taught.
How about this: it's best practice not to ship software with settings that go against best practices. Once you think of it from this standpoint, you begin to wonder: what is the value of this punct field, and why would I want it in my tsidx files? Should you just extract it at search time if you need it?
@jkat54 - I could hug you - that was such a great response!
Before I go into praising more on the rich content you added, I do want to add nuance to your point about why it's a default setting.
Specifically, I think it's fair to conclude that, like other default settings, this one is configured in the way that is safest for automatic detection. Let me say that differently: we know that some of the other default settings are very forgiving (time detection, event breaks), but at a CPU cost. When we have confidence about those aspects of the data, we can tune the sourcetype and produce efficiencies.
To that end, I think the example you gave is perfect! If I were new to Splunk, I could send my data in all as one sourcetype, and then use punct to find the different patterns that I want to mature into their own sourcetypes.
Now more praise! I totally agree about the outlier use case. That is definitely a great use of the punct field. But, like you said, at what cost? Thank you for adding the specific searches that can be used to measure that. I'll take your contribution a step further and do some research on my end to get some data on the CPU impact...because I KNOW there's been charts of that shown at .confs.
Anyway, clearly I have some homework based on what you shared. As I play more with the searches you shared and adjust the post to reflect your analysis, expect to see some karma head your way!
Heads up that len returns the character length of a string, which is not always the same as the byte count. It's certainly a fine proxy, but I wanted to toss that disclaimer out there, since we all eventually overlook it when comparing license_usage.log against `len(_raw)` and being confused by the difference.
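A quick Python illustration of the character-versus-byte distinction (the example string is arbitrary):

```python
s = "café latte"  # contains one non-ASCII character

print(len(s))                  # 10 characters
print(len(s.encode("utf-8")))  # 11 bytes: "é" encodes to 2 bytes in UTF-8
```

For pure-ASCII punct values the two numbers match, which is why len is a reasonable proxy in the disk-estimate searches above.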
OK, back to your regularly scheduled splunking!
Source: Text functions
You’re right about len, so then we should be able to stipulate when it’s one byte per char versus more, and discuss further. At least add a note that this assumes ASCII, etc.
Yeah, but my gut is telling me this is starting to get a bit messy. When I take a step back, I realize I'm more interested in the CPU part than the disk usage part anyway, since the rest of the sourcetype tuning primarily drives CPU-bound improvements. So I'm inclined to pivot back to research on the CPU usage and trust that the disk usage is a subtle cost of any indexed field.
That's a lot of gut and feeling. Keep me honest - am I oversimplifying the disk vs. CPU pivot here?
Sourcetype naming won't really give you better performance, but rather make it easier to maintain when scaling your environment. It will also cut back on the tech debt if you do it right the first time. To be clear, you should define a new sourcetype when you encounter a new log format which is not already in Splunk.
You should use as few sourcetypes as possible, since it gets more difficult to scale when you have to manage a ton of them. So if you're onboarding logs and trying to determine a good sourcetype name, use the punct field to identify existing logs that match the format of your log sample and see if it's already onboarded. The reason is that when you add a new sourcetype, you need to write base configs for each one to ensure proper line breaking and timestamping.
Also, do not use environment variables in your sourcetypes. A better approach is to use the same sourcetype and tag events that belong to an environment. You should also follow the CIM and use : in place of underscores.
Yes! I love it! Great point about checking for existing sourcetypes with that punct. Do you have a search you use for that investigation you'd be willing to share?
I've seen success putting the env name in the index naming convention as well as in a tag (like you mentioned). But keeping it out of the sourcetype means fewer redundant sourcetype definitions, FTW.