Getting Data In

Why shouldn't I use the "_json" or "syslog" sourcetypes?

Explorer

I'm coming to understand that "json" and "syslog" aren't sourcetypes, but formats.

Why are they provided as sourcetypes out of the box with Splunk if they are not meant to be used?

What is the recommended way to properly sourcetype new log data?


Re: Why shouldn't I use the "_json" or "syslog" sourcetypes?

Esteemed Legend

Well, JSON is a data format, not a sourcetype, just as syslog is a data transfer protocol (and a data format), not a sourcetype. When you use those predefined sourcetypes, you get a few things "for free" (e.g. KV_MODE = json for JSON, and line-breaking and timestamp settings for syslog), but those are easy enough to copy to your own props.conf for a proper sourcetype. The general approach for sourcetype naming is vendor:product:type:technology/format, where the slash ( / ) is not literal; formats like json have a place in the name, but should not be the entire value. If you head down this terrible road, you will end up with 50 different actual sourcetypes all jumbled together as sourcetype=_json, and it will be a disaster to untangle and fix.
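As a rough sketch of what "copy to your own props.conf" might look like for JSON data: the stanza name acme:widgetd:api:json, the timestamp field name and the timestamp layout are all made up for illustration, and the settings you actually want should be checked against the default stanzas in your own Splunk version.

```ini
# props.conf
# Hypothetical sourcetype name, following the vendor:product:type:format idea.
[acme:widgetd:api:json]
# Search-time JSON field extraction (the "freebie" mentioned above).
KV_MODE = json
# Assumes one JSON object per line.
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Assumes each event carries "timestamp": "2020-01-31T12:34:56.789+0000".
TIME_PREFIX = "timestamp"\s*:\s*"
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N%z
MAX_TIMESTAMP_LOOKAHEAD = 30
```

Note that comments sit on their own lines; Splunk conf files do not treat a trailing # on a setting line as a comment.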

A sourcetype is whatever you say it is. Some of the docs and training talk about "creating a sourcetype", but there is no such process. Simply using any string with sourcetype=AnyString or putting any [sourcetypeStringHere] in props.conf just works. It is really simple: every different type and source of data should have its own distinct sourcetype value. These can share an index value, but should NEVER, EVER share a sourcetype value, unless it is EXACTLY THE SAME type of thing. If you like some automatic freebie from sourcetype=_json or sourcetype=syslog or whatever, then go look at what is in $SPLUNK_HOME/etc/system/default/{fields,props,transforms}.conf and copy it for your new sourcetype. These "generic" sourcetypes were never intended to be used in production; they are for training and PoC only, even though Splunk never actually says this anywhere. Having said that, though, most of the pretrained sourcetypes are legit; chief among them are access_combined and access_combined_wcookie.
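To illustrate that there is nothing to "create", here is a minimal sketch of assigning a sourcetype on a file monitor input. The path, index and sourcetype name are hypothetical.

```ini
# inputs.conf
# Hypothetical path and names; the sourcetype is just a string set on the input.
[monitor:///var/log/acme/widgetd/api.log]
sourcetype = acme:widgetd:api:json
index = acme_app
```

A matching [acme:widgetd:api:json] stanza in props.conf is then where any parsing settings copied from the defaults would go.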


Re: Why shouldn't I use the "_json" or "syslog" sourcetypes?

New Member

Is the Add-Ons format the same? According to this link: https://docs.splunk.com/Documentation/AddOns/released/Overview/Sourcetypes

Source type names use the following format: vendor:product:technology:format


Re: Why shouldn't I use the "_json" or "syslog" sourcetypes?

Esteemed Legend

I updated my answer.


Re: Why shouldn't I use the "_json" or "syslog" sourcetypes?

Splunk Employee

Yes, the getting-data-in or GDI tasks are terminology-heavy and a bit convoluted until you've done it a couple times.

A sourcetype is a flag/marker/key/tag/reference used to identify a data source so that Splunk Enterprise can do $things to it. The common things you'll do for a sourcetype are defining timestamp extraction, configuring line breaking, setting a time zone, and so on. A sourcetype can be named anything (sourcetype=orangered6000), but if you're making many of them, as Splunk (the company) does, it helps to define a standard such as "vendor:product:technology:format".
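A sketch of those common settings for a simple syslog-style feed, assuming events start with a classic "Jan 31 12:34:56" timestamp that carries no time zone. The sourcetype name and the time zone are invented for the example.

```ini
# props.conf
# Hypothetical sourcetype name in vendor:product:technology:format style.
[acme:homerouter:firewall:syslog]
# Line breaking: one event per line, no merging.
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# Timestamp extraction: the timestamp sits at the start of each event.
TIME_PREFIX = ^
TIME_FORMAT = %b %d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 16
# Time zone: assumed here, because this timestamp format does not include one.
TZ = America/Chicago
```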

Splunk Enterprise has a few sourcetypes defined in the default conf files (props.conf and transforms.conf) that are leftovers. The "json" and "syslog" sourcetypes were left in for various reasons, but it's always assumed that they're too simplistic for most production data, and therefore of limited use. That said, I have home router data sourcetyped as 'syslog' and that was sufficient due to the lack of complexity in those log events. YMMV.

Finally, the GDI manual has a number of topics on sourcetyping.

Most people will set up a single-instance Splunk Enterprise install and start ingesting the data to see what Splunk Enterprise will do to the source events out of the box.
Once you can see the events in Splunk Enterprise:
1. Review the Answers post "What are the best practices for defining source types?"
2. Fix what's broken. You have the original source, and you can see the results in Splunk Enterprise. Try to reconcile those.
3. If you can, tightly define the rules for your sourcetype (details are in the post linked in step 1). This is the tweaking that minimizes the time Splunk Enterprise spends on processing and guessing details about your sourcetype. Why let it guess where your event breaking is when you can define it instead?
4. You should also look for oddball or unique events from the source. There might be events that are drastically different from the others but are part of the same source. Try punct and rare like: | rare limit=20 punct. If you find any unique events, make sure your custom sourcetype can handle them properly. If not, you might need another sourcetype for those.

And good luck!


Re: Why shouldn't I use the "_json" or "syslog" sourcetypes?

Super Champion

Just to add my 5 cents. Sourcetype is the MOST important type of metadata if you want to support your stakeholders, because most queries are driven by "type of data" rather than by "indexes or other indexed fields". This means I would:
1. Spend more effort on defining sourcetypes precisely (and not dump everything in as syslog)
2. Align sourcetypes in hierarchical names using colons as much as possible (e.g. myorg:some_application:some_sub_application:syslog)
3. Write props/transforms/eventtypes/tags for each sourcetype (which might be copy-paste); this will yield correct fields for Common Information Model mapping. (Of course, you can easily let Splunk know the data is JSON with KV_MODE = json and so on, but do it under your own sourcetype.)
4. Ensure your sourcetype is well named so that it can be identified and aggregated (e.g. you can match all sourcetypes for your application with myorg:some_application*)
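Points 3 and 4 might look roughly like this in practice. The event type name, the search (including the assumption that an action field exists) and the choice of tag are hypothetical; which tags apply depends on the CIM data model your data maps to.

```ini
# eventtypes.conf
# Hypothetical event type; the wildcard matches every sourcetype under the application prefix.
[myorg_some_application_auth]
search = sourcetype=myorg:some_application:* action=*

# tags.conf
# Tags the event type above so that tag-based searches can find it.
[eventtype=myorg_some_application_auth]
authentication = enabled
```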

All in all, my advice and experience say it is worth spending time and effort on building quality data rather than quantity. (Quantity will come automatically once stakeholders like the data!)


Re: Why shouldn't I use the "_json" or "syslog" sourcetypes?

Splunk Employee

They are leftover example sourcetypes from simpler times.

They can work on very generic datasets, but if you intend to ingest more specific logs, it's better to create new, specific sourcetypes (or use the ones that vendors provide in their apps, which are likely compatible with the Common Information Model).

The reason is that, if many of your events all have the same sourcetype, then the day you want to set up custom filters, field parsing, or index-time or search-time processing, it will apply to a large set of data. If you apply it to a specific sourcetype instead, you can better control the scope.
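For example, here is a sketch of an index-time filter scoped to one specific sourcetype. The sourcetype name, transform name and regex are hypothetical; the same TRANSFORMS line under a shared sourcetype such as syslog would affect every source using that name.

```ini
# props.conf
# Only events with this (hypothetical) sourcetype go through the transform.
[acme:widgetd:api:json]
TRANSFORMS-drop_debug = acme_drop_debug

# transforms.conf
# Sends matching events to the nullQueue so they are not indexed.
[acme_drop_debug]
REGEX = "level"\s*:\s*"DEBUG"
DEST_KEY = queue
FORMAT = nullQueue
```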