Even if you are new to Splunk, you should be somewhat familiar with source types. The idea of source types is one of the first things you learn about on your Splunk journey. That being said, the concept and its applications can be hard to fully grasp. In this blog post I'll try to answer some common questions about source types.
Technically, a source type is just a default indexed field. Normally the source type is determined at data ingest, but it can be rewritten later in the pipeline as well. "Default" means that every log event in Splunk has this field, and "indexed" means that the field is stored in the index (as opposed to being extracted at search time). The picture below shows examples of source types as shown in Splunk Web (note the other default indexed fields as well, listed on the left).
Run a search in Splunk to see source types (the actual field name is "sourcetype"). They are pre-selected as an interesting field in the GUI.
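If you want a quick overview of the source types in your environment and their event counts, a simple search along these lines will do (here run against the built-in `_internal` index so it returns data in any Splunk installation; swap in your own index):

```spl
index=_internal earliest=-15m
| stats count by sourcetype
| sort - count
```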
Conceptually, the source type is supposed to classify the data structure of a group of log events, or in other words, specify the log format. So, for example, you might have a group of sources (e.g. log files) that are sending events to Splunk in JSON format. Then, the common source type for all these events should be JSON. Said in a different way, the source type tells you what the events are. The other default fields “source” and “host” tells you where the events are.
Rules for parsing and field extraction are defined per source type. This means that you can use source types to decide how different log streams should be treated. Spending some time defining logical, well-structured source types gives a better end-user experience in Splunk, and better resource utilization for the Splunk platform as a whole.
All source types are defined in the configuration file props.conf. The file can be edited directly, or you can add and edit source types through the Splunk Web UI. What works best depends on your environment and situation. Each source type has its own stanza in props.conf, defining the source type name, parsing rules and field extractions.
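As a sketch of what such a stanza looks like, here is a hypothetical source type for a custom application log (the stanza name, timestamp format and extraction are all made up for illustration):

```ini
# props.conf -- hypothetical stanza; the name and patterns are examples only
[acme:app:log]
# How to find and parse the event timestamp
TIME_PREFIX = ^\[
TIME_FORMAT = %Y-%m-%d %H:%M:%S
# A search-time field extraction, giving users a "status" field
EXTRACT-status = status=(?<status>\d{3})
```

The stanza name is what appears as the sourcetype value on the ingested events.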
A source type that specifies efficient parsing rules and user-friendly field extractions is a good start. If the field extractions also follow the naming standards defined in the Common Information Model (CIM), used for data normalization and data models, that makes for a quite well-defined source type. Make sure to look into the Splunk "Great Eight" props.conf settings for super-efficient parsing. Note that defining source types requires some manual work, but the payoff makes it worth it.
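The "Great Eight" are the eight props.conf settings that let Splunk skip expensive guesswork at ingest: explicit timestamp and line-breaking rules. A hypothetical stanza with all eight set (the stanza name and timestamp format are illustrative, not from a real system) might look like this:

```ini
# props.conf -- the "Great Eight" parsing settings; values here are examples
[acme:app:log]
TIME_PREFIX = ^\[
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)
```

Setting these explicitly avoids Splunk falling back to automatic timestamp and line-merge detection, which is one of the more common causes of slow or incorrect ingestion.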
There is no single consensus on how source types should be named, but having a consistent naming convention within your environment is beneficial. Commonly you'll see source type names using a colon-separated hierarchy, where you "build" the name from general to specific. See the screenshot above for examples, like "aws:cloudwatchlogs:vpcflow". One benefit of this system, besides being easy to read, is that you can add a wildcard after any colon to search that subset of source types, for example "aws:cloudwatchlogs:*" or "aws:*".
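As a small illustration, a search like this would match every CloudWatch Logs source type at once, however specific their full names are:

```spl
sourcetype="aws:cloudwatchlogs:*"
| stats count by sourcetype
```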
When should you split a source type? That's a good question. To build on the JSON example mentioned earlier, you might have two different systems logging JSON events. Even though the log format is the same (JSON), the fields used by the systems might be totally different. This means that rules for field extraction and normalization need to be customised for each system, and thus each system needs its own source type. You could define the source type names as something like "json:this" and "json:that", each with its own field extraction rules.
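In props.conf that split could be sketched like this, where both stanzas auto-extract the JSON fields but normalize different system-specific field names onto a common CIM-style "user" field (the field names here are invented for the example):

```ini
# props.conf -- two JSON-emitting systems, split into separate source types
[json:this]
KV_MODE = json
# This system logs "username"; alias it to the normalized field "user"
FIELDALIAS-normalize_user = username AS user

[json:that]
KV_MODE = json
# This system logs "login_name" for the same concept
FIELDALIAS-normalize_user = login_name AS user
```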
If you have different systems logging essentially the same type of events, but with minor variances, it might not be smart to split all these log streams into different source types. Say you want to change a field extraction rule for all of them: you would have to update each of the source types, one by one, every time. In that case a common source type might be the better option. You'll have to decide what is best case by case.
Users can specify the source type field in a query to easily find and filter events. For example, by searching for source type "WinEventLog", they'll find events from the Windows Event Log. Since source type is an indexed field, specifying it makes searches efficient. Also, note that indexed fields can be used with tstats, which enables super-efficient searches (see example below).
tstats is a highly optimized way to search your data.
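A minimal tstats search counting events per source type across all indexes you can see might look like this:

```spl
| tstats count where index=* by sourcetype
| sort - count
```

Because tstats reads only the index-level data and never the raw events, this completes in a fraction of the time an equivalent `index=* | stats count by sourcetype` search would take.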
You can use tstats on indexed fields, but also on other "terms" that exist in your data (see my other blog post if you are interested). Understanding exactly how this works takes a bit of practice and learning, but combine this knowledge with a set of well-defined source types, and you can create all sorts of Splunk tstats magic.
Also, since source type is a special default field, not just any indexed field, you can use the metadata command as well. This command searches bucket metadata for source type information (without even touching the logs on disk). Its uses are limited, but when it works, it grants huge efficiency benefits. One common use of metadata is to efficiently catch stopped data streams, e.g. to see when logs of a certain log format (source type) suddenly stop being ingested into Splunk. This could indicate a failure somewhere in the data pipeline that needs to be investigated.
The metadata command can be used to find source types that haven’t sent data to Splunk in more than a day.
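A sketch of such a search, listing source types whose most recent event is older than one day:

```spl
| metadata type=sourcetypes index=*
| where recentTime < relative_time(now(), "-1d")
| convert ctime(recentTime) AS lastSeen
| table sourcetype lastSeen totalCount
```

The `recentTime`, `totalCount` and related fields are returned by the metadata command itself, so no raw events are read at any point.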
Another special use of the source type field is to analyse license usage. Splunk keeps a license usage log, which also has some prebuilt views in the Monitoring Console. This log can be split by source type, meaning you can use the field to identify which "classes" of sources consume the most of your license. The better and more consistently you've defined your source types, the more value you get from this license analysis.
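A common way to run this analysis yourself is to search the license usage log in the `_internal` index, where `st` holds the source type and `b` the bytes counted against your license:

```spl
index=_internal source=*license_usage.log type="Usage"
| stats sum(b) AS bytes by st
| eval GB = round(bytes / 1024 / 1024 / 1024, 2)
| sort - GB
```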
Martin Hettervik, Senior Consultant and Team Leader at Accelerate Oslo, Splunk MVP
LinkedIn: https://www.linkedin.com/in/martinhettervik/