What field name should I use to contain the value for overriding sourcetype in a TCP input?

Graham_Hanningt
Builder

Background

Some background to this question:

  • I'm working on a platform that does not have a Splunk Universal Forwarder. I don't mean to be coy about which platform I'm working on, but I'd prefer to leave it at that.
  • To get data into Splunk, I've developed a small Java application that massages JSON lines-format data (from a log extraction tool on that platform) into the similar JSON lines format required by the Splunk HTTP Event Collector (EC). The Java app sends the data from the remote platform to Splunk EC on my PC (for small-scale testing, I'm running Splunk Enterprise 6.4 with a free license on my Windows 7 PC). This works fine.
  • EC was an easy first choice for me because I'm familiar with HTTP-based tools. For example, I'm comfortable using cURL and writing Ajax requests in JavaScript.
  • I'm now looking at using a TCP input to Splunk instead of EC. I understand that many experienced Splunk users will have started with TCP first.
  • The EC event protocol separates the contents of a "packet" into event metadata and data. The metadata can include, among other items, the event time (timestamp) and sourcetype.
  • For TCP inputs, there is, to my knowledge, no such formalized protocol for separating metadata and data.
  • I want the timestamp of the events ingested via TCP to match the value from the original log data, not the time it is ingested into Splunk. With EC, I achieved this by setting the "time" key in the event metadata. For TCP, I believe I'll have to configure timestamp recognition in props.conf as described in Splunk docs.
  • Why I'm asking this question: I'm sending a wide variety of sourcetypes to Splunk via EC, using the "sourcetype" key in the event metadata. For TCP, I believe I'll have to override source types on a per-event basis as described in Splunk docs. I do not want to use a different TCP port for each sourcetype.
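
For the timestamp requirement described above, a minimal props.conf sketch might look like the following. It assumes the JSON carries a Unix-time value under a key named time (mirroring the EC metadata key) and that the events are parsed as raw text; the stanza name my_tcp_json is a hypothetical sourcetype used only for illustration. If structured extraction (INDEXED_EXTRACTIONS) is enabled for the source, the TIMESTAMP_FIELDS setting is probably the one to look at instead.

```ini
# props.conf (sketch; the stanza name and the "time" key are assumptions)
[my_tcp_json]
# Regex that must match immediately before the timestamp: "time": or "time":"
TIME_PREFIX = \x22time\x22\s*:\s*\x22?
# %s = seconds since the Unix epoch
TIME_FORMAT = %s
# Limit how far past TIME_PREFIX Splunk looks for the timestamp
MAX_TIMESTAMP_LOOKAHEAD = 20
```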

So, I plan to create a stanza in transforms.conf that gets a field value from the JSON-format data received via TCP, and uses it to set the sourcetype, like this:

[set_source_type_my_log_type]
REGEX = \"somefieldname\"\:\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

(I've tested this regular expression using the rex command, but not yet in the context of overriding sourcetype; I don't yet know for sure whether I'll have to escape the double quotes, as done here.)

where the JSON received via TCP contains a field like this:

"somefieldname": "xyz_123"

where "xyz_123" is the sourcetype I want the event to have.
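
One thing to watch in the example above: the sample field has a space after the colon, while the regex expects the colon to be followed immediately by a double quote. A hedged variant of the stanza that tolerates optional whitespace (assuming the sender may or may not emit it) could look like this. Here \x22 is the hexadecimal escape for a double quote, which sidesteps the question of whether literal quotes need escaping.

```ini
# transforms.conf (sketch; "somefieldname" remains a placeholder)
[set_source_type_my_log_type]
# Capture the value of "somefieldname", allowing whitespace around the colon
REGEX = \x22somefieldname\x22\s*:\s*\x22([^\x22]+)\x22
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
```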

The question

All of the above (thanks for reading this far) boils down to one simple question: what field name should I use in place of somefieldname (as per the example above)?

Thoughts on possible answers

It occurs to me that I probably shouldn't use the default field name sourcetype.

"Anything you like, except for one of the default fields (so, not sourcetype)" might be a valid answer, but I'd prefer a more specific answer: an actual field name that other users in the same situation might also choose to use, as an informal convention (rather than the formalized EC protocol).

Er, event_sourcetype ?

It also occurs to me that, given that I can supply both the sourcetype and the _time (expressed as Unix time) as fields in the JSON data, is there some better, more direct way than using regexes to configure the timestamp recognition and override the sourcetype? Specifying a regex to extract a JSON key value seems a bit like... inserting a key into a car door and turning the key, when you've got a remote unlock button on the same keychain. (Someone is going to lecture me on the data pipeline, and parsing versus search-time field extraction, and I probably deserve it.)

I'm not thrilled by having to pass through, as a field that will appear in the _raw field of each event, a value that will also be represented in the sourcetype field. That strikes me as inelegant. My EC-ingested events don't have such a field, and I'm hoping for my EC-ingested and TCP-ingested events to cohabit in the same indexes, so I'd prefer them to be as similar as possible. I'd appreciate advice on that, too.

1 Solution

woodcock
Esteemed Legend

I would first try sourcetype; you may find that this does exactly what you would like it to do. If not, it will create a field called something like something_sourcetype, named according to whatever the current Splunk code (i.e. its developers) decides. Please do report back here which way it works and what the renamed/new field name is (if it does override the sourcetype field, it should rename the original field to something like orig_sourcetype).

Graham_Hanningt
Builder

What I'd really like is to use the same JSON for TCP input as I use for the HTTP Event Collector. That is, to specify time and sourcetype as metadata keys, rather than having to write stanzas to configure timestamp recognition and override the source type per-event.

Graham_Hanningt
Builder

Thanks, I'll try that first and report back.

Graham_Hanningt
Builder

Well, that was interesting. I defined a TCP input in inputs.conf:

[tcp://:6666]
index = test
sourcetype = xyz

with a corresponding stanza in props.conf:

[source::tcp:6666]
INDEXED_EXTRACTIONS = JSON

As an initial test, I used a Windows PowerShell script to send a few events in JSON format, and confirmed in Splunk Web that the following search:

sourcetype=xyz

displayed the events, with the field names and values extracted from the JSON. So far so good.

Then I added:

"sourcetype":"xyz_123"

to the JSON, and sent that.

The event appears in the Splunk Web Events tab with two values for the sourcetype field: xyz and xyz_123. There's no new or renamed field: just the one sourcetype field with two values.

That event appears if I use the search cited above, but if I change the search to:

sourcetype=xyz_123

I get no results.

I'm now about to try overriding the sourcetype as described in Splunk docs. Just for fun, I might try doing that using a sourcetype field, and see what happens: I wonder whether Splunk will "collapse" the two (now identical) values into one, or show them as separate values. Probably safer, though (since I don't understand the underlying code), to use a different field name.

Graham_Hanningt
Builder

I'm now overriding the sourcetype. Here's my working transforms.conf stanza:

[set_sourcetype_xyz]
REGEX = \x22event_sourcetype\x22:\x22([^\x22]+)\x22
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

(\x22 is the hexadecimal escape for a double quote.)
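
For completeness, a transforms.conf stanza only takes effect when a props.conf stanza refers to it. A minimal sketch, assuming the same [source::tcp:6666] stanza shown earlier in this thread and a hypothetical class name set_sourcetype (the part after "TRANSFORMS-" is arbitrary), might be:

```ini
# props.conf (sketch; the class name after "TRANSFORMS-" is an assumption)
[source::tcp:6666]
INDEXED_EXTRACTIONS = JSON
# Run the index-time transform defined in transforms.conf
TRANSFORMS-set_sourcetype = set_sourcetype_xyz
```

Index-time settings like these generally need a Splunk restart before they apply to newly received events.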

Using sourcetype instead of event_sourcetype as a field name in the JSON input data also works, but you end up with an ingested event with a sourcetype field that has two identical values (for example, xyz_123 and xyz_123).

I'm torn, and would appreciate advice on this. On the one hand, I'd prefer not to coin my own field name; on the other, I'm not comfortable with sourcetype having two values. That just looks weird to me, and I don't have enough experience with Splunk to know whether this will bite me in the a...

woodcock
Esteemed Legend

There is also sourcetype renaming that you might exploit. When you rename a sourcetype, the original value is moved to _sourcetype.

Graham_Hanningt
Builder

Yeah, I read the Splunk docs topic on that ("Rename source types at search time") before asking this question. Problem is, that functionality is too limited to be useful in this situation: it only offers a one-to-one renaming, from the original sourcetype value to a different literal string value. Thanks for the suggestion, though.

Thanks also for prodding me to try sourcetype and see what happens. I think that your answer, combined with this trail of comments, will prove useful to users with the same question, so I'm going to accept it.

I'm considering asking a new question, spawned by the testing I've done here, to ask about (re)using the EC protocol for TCP inputs.

Graham_Hanningt
Builder

I realize that I could save myself a heap of trouble here by using a single sourcetype value for all of the different types of log records - all of which have different record structures - that are extracted by the platform-specific log extraction tool I referred to in my original question. And I could coin some new field with the unique values that would have been in sourcetype.

But I think that would be a "cop out"; an un-Splunk-y thing to do; in neither the spirit nor the letter of the Splexicon definition of source type.
