Good Sourcetype Naming

mmccul
SplunkTrust

When it comes to getting data in, one of the earliest decisions made is what to use as a sourcetype. Often, this choice gets less thought, less consideration than any other detail in log onboarding. That's a mistake. Picking the right sourcetype often drives the right behavior throughout the log onboarding process.

Part of the problem with picking a good sourcetype is that people often don't have a good understanding of what a sourcetype is. The best comparison I've seen is that a sourcetype is the "document type definition" (DTD) of the data. For those who don't live in the world of XML, that means the list of all the fields that are required or optional, and what kind of data can live in each field.

Two bad habits are immediately eliminated by this concept. The first is the habit of bringing in CSV files with different columns under the same sourcetype. It's a lot easier to specify the fields if each column layout has its own sourcetype. The second is use of the _json sourcetype. Stay away from it. Don't use it, ever. It doesn't do what you want.
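To make the "one column layout, one sourcetype" idea concrete, here is a minimal Python sketch that picks a sourcetype from a CSV header row. The sourcetype names, the column sets, and the `sourcetype_for_csv` helper are all hypothetical, invented for illustration. This is not how Splunk assigns sourcetypes; it only shows that each distinct set of columns maps to its own name.

```python
import csv
import io
from typing import Optional

# Hypothetical mapping: each distinct column layout gets its own sourcetype.
SOURCETYPE_BY_COLUMNS = {
    ("timestamp", "user", "action"): "acme:audit:csv",
    ("timestamp", "host", "cpu_pct", "mem_pct"): "acme:metrics:csv",
}


def sourcetype_for_csv(text: str) -> Optional[str]:
    """Return the sourcetype whose column layout matches the CSV header.

    Returns None when the input is empty or the layout is not known,
    which is the cue to define a new sourcetype rather than reuse one.
    """
    header = next(csv.reader(io.StringIO(text)), None)
    if header is None:
        return None
    key = tuple(col.strip() for col in header)
    return SOURCETYPE_BY_COLUMNS.get(key)
```

Two files with different headers resolve to different names, and an unrecognized header resolves to nothing instead of silently sharing an existing sourcetype.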

You may be thinking, why not use _json? Well, there are two problems. The first problem is that _json indexes all fields. Contrary to popular myth, indexed fields are not inherently faster, and may actually be slower if you index every field. The second problem is that it makes adding additional props or transforms a lot harder. You now have to link to the source rather than the sourcetype in order to target your transformations or index time adjustments.

Rather than only tell you the wrong way, what's a good way to do sourcetypes? A good example of sourcetypes done well is zeek. Each unique sourcetype in the family of zeek sourcetypes has a unique set of fields that are expected in the data. Example sourcetypes are "zeek:conn:json", "zeek:notice:json", "zeek:dns:json" or "zeek:quic:json". One can look at any of the various sourcetypes and know what fields will be present.

The second thing to note is that the naming structure is hierarchical. The tradition is to use a single colon to denote the levels from least specific to most specific. In the above examples, the software product is listed first, then the specific component of the product. The last section is a type notation to distinguish the format. While "json" is the most commonly seen format specifier using this notation, you may have other formats as well, such as "csv" or even "kv" (key/value pairs).
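The three-level convention can be sketched in a few lines of Python. This assumes exactly three colon-separated levels (product, component, format), as in the Zeek examples. The `SourcetypeName` type and both helper functions are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class SourcetypeName(NamedTuple):
    product: str    # least specific level, e.g. "zeek"
    component: str  # component of the product, e.g. "conn"
    fmt: str        # format specifier, e.g. "json", "csv", "kv"


def parse_sourcetype(name: str) -> SourcetypeName:
    """Split a product:component:format sourcetype name into its levels."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected product:component:format, got {name!r}")
    return SourcetypeName(*parts)


def build_sourcetype(product: str, component: str, fmt: str) -> str:
    """Join the levels from least specific to most specific."""
    return ":".join((product, component, fmt))
```

For example, `parse_sourcetype("zeek:dns:json")` yields the product `zeek`, the component `dns`, and the format `json`, while a flat name such as `access_combined` is rejected because it does not follow the hierarchy.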

At this point, you're probably thinking that this issue seems too complex and nitpicky. There are a number of reasons why good sourcetype naming matters. First, a good sourcetype name makes it easy to know what kind of data is in the sourcetype, which improves the ease of referencing the sourcetype explicitly in searches. Second, by using a hierarchical structure for sourcetype naming, when one needs to reference multiple sourcetypes, it's easy to do a `sourcetype=zeek:*` or similar query, even if those sourcetypes span multiple indexes.
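As a rough illustration of why the hierarchy pays off, the Python sketch below approximates a `sourcetype=zeek:*` style wildcard with the standard library's `fnmatchcase`. It is only an analogy: Splunk's own wildcard matching is not implemented this way, `fnmatchcase` also treats `?` and `[...]` as special characters, and the list of sourcetypes is made up for the example.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def matching_sourcetypes(pattern: str, sourcetypes: Iterable[str]) -> List[str]:
    """Return the sourcetypes matching a shell-style wildcard pattern."""
    return [s for s in sourcetypes if fnmatchcase(s, pattern)]


# Hypothetical inventory of sourcetypes spread across several indexes.
known = [
    "zeek:conn:json",
    "zeek:dns:json",
    "zeek:notice:json",
    "access_combined",
]

all_zeek = matching_sourcetypes("zeek:*", known)
all_json = matching_sourcetypes("*:json", known)
```

Because the product comes first and the format last, one prefix pattern selects a whole product family and one suffix pattern selects everything in a given format.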

What about more common sourcetypes, like access_combined? That sourcetype only emphasizes the point. The access_combined sourcetype is very strict and defines exactly what information is stored in the log and how it is stored. Try sending data that isn't perfectly fitting the access_combined definition and all your field extractions break. Some incoming JSON data requires special handling, such as extra large JSON objects sent as a single event.

Good sourcetype naming doesn't have to be hard, but it is one of those choices that is hard to change later on. Taking the time to define your sourcetypes well also makes it much easier to understand how to organize your data in indexes and how to arrange your field extractions. Because in the end, what is useful is how you are searching your data.
