Getting Data In

When to create an additional sourcetype vs new indexed fields when events are in JSON format?


We have an application that reads events from Kafka using Kafka consumers and persists them into a database (MySQL/Oracle). Each event has a table-name field that tells the consumer which table to persist the event into. The event itself can be deserialized into a JSON string.

I am working on a process to ingest these events into Splunk via HEC.

Because the events are in JSON format, I understand that when they are ingested into Splunk, Splunk tags the sourcetype as _json.

Because we have millions/billions of these events, the sourcetype field, which is a default field in Splunk, goes underutilized when the data is indexed.

Reading through the documentation -

I understand there are at least two ways to deal with this scenario.

1 - Create new sourcetypes, where the name of each sourcetype is the table name and the underlying definition is the same as the _json sourcetype. Looks easy, but we may end up with hundreds of such sourcetypes.

2 - Create an additional indexed field, let's say "tablename". Looks easy, but we would probably need additional index space because this field is extracted at index time.
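For context on what option 2 involves, an index-time extraction is typically configured across props.conf, transforms.conf, and fields.conf. The stanza, regex, and field names below are illustrative only, assuming the table name appears as a "table" key in the raw JSON:

```
# transforms.conf -- pull the table name out of the raw JSON event
[extract_tablename]
REGEX = "table"\s*:\s*"([^"]+)"
FORMAT = tablename::$1
WRITE_META = true

# props.conf -- apply the transform to the relevant sourcetype
[my_json_events]
TRANSFORMS-tablename = extract_tablename

# fields.conf -- mark the field as indexed so searches use it efficiently
[tablename]
INDEXED = true
```

As noted above, indexed fields consume extra space in the index, but they can then be filtered with `tablename::orders`-style terms without any search-time extraction.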

Any suggestion on what is the better approach?



Be aware that in 6.4 there are two different HEC endpoints you can write to.

The /services/collector endpoint does not pass events through the event processing pipeline, which means index-time processing of sourcetypes won't work there. So you actually don't want to use _json as the sourcetype, because the _json sourcetype extracts JSON events at index time. You'll notice in the _json definition that INDEXED_EXTRACTIONS = json and KV_MODE = none: that tells Splunk to create your JSON fields at index time and skip auto-extracting them at search time. Otherwise, you'd end up with two entries for each field (Splunk would show the index-time and the search-time field).

The new /services/collector/raw endpoint, however, will pass data through the event processing pipeline. So you can post json data as _json and use index-time field extractions, transforms, and so on. Hopefully this difference makes sense.
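The difference between the two endpoints can be sketched as follows. This is a minimal illustration only: the host, port, token handling, and event contents are placeholders, not from the original post. The event endpoint takes a JSON envelope with metadata fields; the raw endpoint takes the raw event body, with metadata passed as query parameters:

```python
import json

HEC_HOST = "https://splunk.example.com:8088"  # placeholder host and default HEC port

# /services/collector expects a JSON envelope wrapping the event;
# events posted here skip the index-time processing pipeline.
event_endpoint = f"{HEC_HOST}/services/collector"
event_body = json.dumps({
    "sourcetype": "_json",
    "event": {"table": "orders", "status": "ok"},
})

# /services/collector/raw takes the raw event as the request body;
# metadata such as sourcetype goes in query parameters, and the event
# passes through the index-time pipeline (so INDEXED_EXTRACTIONS apply).
raw_endpoint = f"{HEC_HOST}/services/collector/raw?sourcetype=_json"
raw_body = json.dumps({"table": "orders", "status": "ok"})
```

Either request would also need the `Authorization: Splunk <token>` header, omitted here for brevity.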

As far as whether to use one sourcetype or multiple, or a new field: are you putting something in the sourcetype field just because you feel you need to utilize it? If so, you may want to hold off on that field until you've used Splunk for a while and can see how best to use it for your data. You'll find differing opinions, but I think the sourcetype field should describe the format of your data. Remember you can also use the source field to carry information about where the data originated (i.e., the table name). And you can create props/extractions that apply to sources as well.



@jeremiah - Thanks for the details on the REST endpoints. I want to use the sourcetype field because it is one of the default fields in Splunk, and some of the best-practice guides I've read advise using this field in queries for better performance. If I set the sourcetype to my table name, it would be a better filter for me. But which of the two options above is the better choice?



Also, you mentioned: "Remember you can also use the source field to include information that might better describe where the data originated from (ie, the tablename)." How can we include the table name in the source field?



There are two ways to set the source field, depending on which HEC endpoint you use. If you choose the /services/collector endpoint, you can set the field when you send the event:

    {
        "time": 1426279439,
        "host": "localhost",
        "source": "datasource",
        "sourcetype": "txt",
        "index": "main",
        "event": "Hello world!"
    }
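The same idea in code: a small helper (hypothetical, not part of any Splunk SDK) that wraps an event in the HEC envelope and sets source per event, e.g. to the originating table name:

```python
import json

# Hypothetical helper: wrap an event in the HEC JSON envelope,
# setting the source field per event (e.g. the originating table name).
def build_hec_event(event, source, sourcetype="_json", index="main", host="localhost"):
    return json.dumps({
        "host": host,
        "source": source,          # per-event source, e.g. the table name
        "sourcetype": sourcetype,
        "index": index,
        "event": event,            # the deserialized JSON event itself
    })

payload = build_hec_event({"id": 1, "status": "ok"}, source="orders")
```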

If you'd rather use the raw endpoint, then you can use an index-time field extraction to rewrite the value of source from your data. Something like what you see in the link below, but with source instead:
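A sketch of such an index-time rewrite of the source metadata, assuming the table name appears as a "table" key in the raw JSON (stanza and regex are illustrative):

```
# transforms.conf -- overwrite the source metadata from the event body
[set_source_from_tablename]
REGEX = "table"\s*:\s*"([^"]+)"
DEST_KEY = MetaData:Source
FORMAT = source::$1

# props.conf -- apply the transform to the relevant sourcetype
[my_json_events]
TRANSFORMS-set_source = set_source_from_tablename
```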
