Getting Data In

What is the best analogy for explaining 'sourcetypes'?

Ultra Champion

When I talk to folks who are new to Splunk, I often struggle to explain the concept of a sourcetype to them. Other basic fields, like host, source and _time, are more easily understood because they exist outside of Splunk.

Analogies tend to be a great way to convey new concepts. So I'm curious what analogies for sourcetype have worked for you?

0 Karma
1 Solution

Ultra Champion

The Splunk Product Best Practices team provided this response. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.

I find Humans to be a great analogy. Here's how I explain it:

Splunk headquarters is in downtown San Francisco, California, adjacent to the Embarcadero, along the city's shoreline where thousands of people pass by every day: pedestrians, tourists, runners, families, workers on break, and so on.

All the people on the Embarcadero have their own names and addresses. Some are named John, some have black hair, some even come from the same address, like a family or coworkers from the same company. This is similar to machine data! While the data may not have a name, it likely has a host name, or identifier for the unique asset that created the data. Likewise, the data may not rest its head, but it did originate from an address, or source location.

When you compare and contrast people on the Embarcadero, it makes more sense to compare them by attributes they have in common, such as bikers, runners, or tourists, rather than dealing with each individual's name or address. By organizing the people this way, you can compare and contrast their common attributes effectively.

In Splunk software, you do the same thing with sourcetype. Consider Apache web logs. By referring to all Apache web logs with the same source type name, we can calculate average web request time without having to list every Apache host or source path. It is best to create source types for data that has a similar structure. For example, bicyclists and unicyclists are similar, but they are structured differently. Likewise, Apache web logs and IIS web logs are both web logs, but they are structurally different and have different values worth comparing, so they should each have their own source type.
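To make the grouping idea concrete, here is a minimal Python sketch rather than anything Splunk actually runs. The events, the field name `response_ms`, and the sourcetype labels `apache_access` and `iis` are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical events: hosts, sources, sourcetype labels, and values are illustrative only.
events = [
    {"host": "web01", "source": "/var/log/apache2/access.log", "sourcetype": "apache_access", "response_ms": 120},
    {"host": "web02", "source": "/var/log/httpd/access_log", "sourcetype": "apache_access", "response_ms": 80},
    {"host": "win01", "source": "u_ex240101.log", "sourcetype": "iis", "response_ms": 200},
]


def average_by_sourcetype(events, field):
    """Group events by their sourcetype label and average one numeric field per group."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["sourcetype"]].append(event[field])
    return {sourcetype: mean(values) for sourcetype, values in grouped.items()}


averages = average_by_sourcetype(events, "response_ms")
# One label covers both Apache hosts and both source paths.
print(averages["apache_access"])
```

The point of the sketch is that the caller never names web01, web02, or either log path; the shared label does the selecting.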

To learn more about source types, check out "Why source types matter" within the Getting Data In manual. For those ready to define their own custom source types, discover naming conventions within the "Source types for add-ons" of the Splunk Add-ons manual.


Explorer

I like to keep things simple so everyone understands what I'm trying to explain.

Consider Splunk as a database:

Index = database name
sourcetype = tables
_time = when these events were recorded in Splunk
host, source, sourcetype, and _time are the key identifiers used to create various views on the database.
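Taking the analogy literally for a moment, a small sketch with Python's built-in `sqlite3` module shows the mapping. This is only the analogy spelled out, not how Splunk stores data; the table names, columns, and rows are assumptions made up for the example.

```python
import sqlite3

# In the analogy, one database stands in for one index.
conn = sqlite3.connect(":memory:")

# Each table stands in for a sourcetype: every row in it shares the same columns.
# Table names, columns, and rows are hypothetical.
conn.execute("CREATE TABLE access_combined (_time TEXT, host TEXT, source TEXT, status INTEGER)")
conn.execute("CREATE TABLE syslog (_time TEXT, host TEXT, source TEXT, message TEXT)")

conn.executemany(
    "INSERT INTO access_combined VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T00:00:00", "web01", "/var/log/apache2/access.log", 200),
        ("2024-01-01T00:00:01", "web01", "/var/log/apache2/access.log", 500),
        ("2024-01-01T00:00:02", "web02", "/var/log/apache2/access.log", 503),
    ],
)

# A "view" built from the key identifiers: server errors per host for one sourcetype.
rows = conn.execute(
    "SELECT host, COUNT(*) FROM access_combined "
    "WHERE status >= 500 GROUP BY host ORDER BY host"
).fetchall()
print(rows)
```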

0 Karma

Ultra Champion

That's solid for folks who have an understanding of databases. Has it worked for non-technical folks?

I still have it in my bones to avoid referring to indexes as databases. Back in the day, everyone was so concerned indexes couldn't scale because they were used to database limitations and not familiar with map reduce. So that could be part of my hesitation about introducing the concept of a database here.

Nonetheless, the sourcetype as a table is compelling because it aligns with the idea that the table has fixed fields (columns) and everything in the table (or of that sourcetype) would have those same columns or attributes.

Very cool idea and definitely effective!

0 Karma

Contributor

I think of indexes like bookshelves and sources like content. In that analogy, sourcetypes are like types of media. You could read an article in a newspaper or magazine, but that article could also be excerpted in a book. A lot of classic books were once serialized in magazines. You can even listen to a book or magazine article on a service like Audible.

Different audio media like cassettes and CDs can also contain the same content (source), and the medium determines how you interact with that content. If the media content is on a CD, you can easily skip back and forth on tracks, but not on a cassette.

All of these media (sourcetypes) determine how you get your content (source). You have to use different methods and have certain capabilities to handle each type of medium. Print books are no good to those without sight, and audio is useless to those without hearing. So it's important to choose the right medium (sourcetype) in order to get the content in a way that's useful.

Maybe in this analogy it's the human that's the indexer? I guess that's where it breaks down.

0 Karma

Ultra Champion

Ha ha. I love the humility that you end it on.

It's an interesting concept. I honestly had to read it a few times to follow. I'm curious whether it's too involved OR I'm just getting too close to the end of the day on a Friday.

Anyway, thanks for the contribution and the concept!

0 Karma

Splunk Employee

As the child of a librarian, this analogy makes great sense to me! The librarian, who fills out the catalog entry for a given item, is the indexer. This analogy also accounts for cross-references. The patron is the search head.

0 Karma

SplunkTrust

Your explanation sounds very much like the description of objects in object oriented programming. The sourcetype allows you to group data in similar object types. This makes it easier to write configurations such as field extractions, tagging, etc on each sourcetype. Also, when you add new data, if it already fits one of the molds that you already have, you don't need to rewrite the configs all over again.
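As a rough illustration of "write the configuration once per sourcetype", the Python sketch below keys field-extraction patterns by sourcetype. It is not Splunk configuration syntax; the sourcetype names `web_access` and `app_keyvalue`, the patterns, and the sample events are hypothetical.

```python
import re

# Hypothetical sourcetype names and patterns: each rule is written once
# and applies to every event carrying that sourcetype label.
EXTRACTIONS = {
    "web_access": re.compile(r'"(?P<method>[A-Z]+) (?P<uri>\S+)[^"]*" (?P<status>\d{3})'),
    "app_keyvalue": re.compile(r"user=(?P<user>\S+) action=(?P<action>\S+)"),
}


def extract_fields(sourcetype, raw):
    """Apply the extraction rule registered for this sourcetype, if any."""
    pattern = EXTRACTIONS.get(sourcetype)
    match = pattern.search(raw) if pattern else None
    return match.groupdict() if match else {}


web_fields = extract_fields(
    "web_access",
    '10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
)
# Data from a brand-new host needs no new rule as long as it fits an existing mold.
app_fields = extract_fields("app_keyvalue", "ts=1 user=alice action=login")
print(web_fields, app_fields)
```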

0 Karma

Ultra Champion

Great point, and I think this is a great nuance of how to tailor an explanation to the right audience. While a very tech-agnostic suggestion might work for business users, an object-oriented one will resonate with techies! Kudos!

0 Karma

SplunkTrust

I liken it to a map. It defines the steps needed to figure out what kind of creature is behind the fourth comma. That number behind the seventh colon, that's obviously packed size. This approach has a bonus in that it lets me sing the map song from Dora.

Ultra Champion

Ha ha. That sounds like a solid way to explain what the sourcetype does. I imagine it might still get questions about why sourcetypes are used rather than just the sources, which then opens the door to landing the concept perfectly.

0 Karma

Ultra Champion

It's just a label used to categorize data that has similar structure and content.

Maybe you can compare it to the concept of file extensions.

Splunk Employee

I prefer this description, as it applies the K.I.S.S. principle. The reason we're all discussing sourcetype is to facilitate understanding of the term as Splunk uses it, because sourcetype settings are the cornerstone of Splunk App functionality. It'd be tough or impossible to provide pre-made, pre-vetted, domain-specific knowledge (App dashboards, searches, and other content) by tagging data using Splunk's other metadata fields. If a customer is creating all of their own content in Splunk, they don't need sourcetype. But to provide that content in a normalized, redistributable format, you need a consistent label/flag/tag/marker to filter the data. Good conversation!

0 Karma

Ultra Champion

My only hesitation is that file extensions would also paint a picture of binary code and complexity. As a concept it works well but I bet there's a nuance we can add to dissuade anyone from drawing further conclusions about the complexity.

0 Karma

Splunk Employee

Oh, heya Burch. I was referring to this post by "FrankVl". >> It's just a label used to categorize data that has similar structure and content.

0 Karma

Ultra Champion

Ah. Ok, cool.

0 Karma

Ultra Champion

Ooh. File extensions. I like that!

0 Karma

SplunkTrust

It's just a label used to categorize data that has similar structure and content. <-- 100%

I would go on to add...

The sourcetypes are defined (and possibly manipulated) at index time and are found in tsidx files, which means using sourcetypes in your searches can return results quicker.
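A toy Python sketch of why an indexed label can speed things up: the lookup table is built once when events arrive, so a later search jumps straight to the matching events instead of scanning everything. This only illustrates the general idea of an index-time lookup; it makes no claim about how tsidx files are actually laid out, and the events and sourcetype labels are invented.

```python
from collections import defaultdict

# Hypothetical events; only the sourcetype label matters for this sketch.
events = [
    {"sourcetype": "syslog", "raw": "sshd: accepted connection"},
    {"sourcetype": "web_access", "raw": "GET /index.html 200"},
    {"sourcetype": "syslog", "raw": "cron: job started"},
]

# Built once at "index time": sourcetype -> positions of the events carrying it.
positions_by_sourcetype = defaultdict(list)
for position, event in enumerate(events):
    positions_by_sourcetype[event["sourcetype"]].append(position)


def search(sourcetype):
    """Return matching events via the lookup table, without scanning every event."""
    return [events[i] for i in positions_by_sourcetype.get(sourcetype, [])]


syslog_events = search("syslog")
print(len(syslog_events))
```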

0 Karma

Ultra Champion

Oh, I like "It's just a label used to categorize data that has similar structure and content". I find that I lose my audience if I share the rest, since they don't know about tsidx or the concept of index time. Although, I think maybe you were just sharing that with us and it's not necessarily something you specifically tell n00bs.

Thanks again for the great content!

0 Karma

SplunkTrust

Yes, know your audience!

0 Karma

SplunkTrust

I used to be able to say tsidx is the fastest Splunk storage... but I would certainly include something about sourcetypes typically being "configured" in inputs.conf.

0 Karma
