Some background to this question:
"time" key in the event metadata. For TCP, I believe I'll have to configure timestamp recognition in props.conf as described in Splunk docs.
"sourcetype" key in the event metadata. For TCP, I believe I'll have to override source types on a per-event basis as described in Splunk docs. I do not want to use a different TCP port for each sourcetype.
So, I plan to create a stanza in transforms.conf that gets a field value from the JSON-format data received via TCP, and uses it to set the sourcetype, like this:
[set_source_type_my_log_type]
REGEX = \"somefieldname\"\:\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
(I've tested this regular expression using the rex command, but not yet in the context of overriding sourcetype; I don't yet know for sure whether I'll have to escape the double quotes, as done here.)
where the JSON received via TCP contains a field like this:
"somefieldname": "xyz_123"
where "xyz_123" is the sourcetype I want the event to have.
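One detail worth checking before relying on such a pattern is whitespace: the sample field above has a space after the colon, while the draft pattern allows none. The sketch below uses Python's re module as a stand-in for Splunk's regex engine (an assumption; the two are not identical, though these simple constructs behave the same way). The TOLERANT pattern and the sample events are hypothetical illustrations, not part of the original stanza.

```python
import re

# Draft pattern from the question: no whitespace allowed around the colon.
STRICT = re.compile(r'"somefieldname":"([^"]+)"')

# Hypothetical variant that tolerates optional whitespace around the colon.
TOLERANT = re.compile(r'"somefieldname"\s*:\s*"([^"]+)"')


def extract(pattern, raw):
    """Return the first captured group, or None when the pattern does not match."""
    match = pattern.search(raw)
    return match.group(1) if match else None


# Hypothetical sample events: one compact, one with spaces after separators.
compact = '{"somefieldname":"xyz_123","msg":"hello"}'
spaced = '{"somefieldname": "xyz_123", "msg": "hello"}'

print(extract(STRICT, compact))   # xyz_123
print(extract(STRICT, spaced))    # None
print(extract(TOLERANT, spaced))  # xyz_123
```

If the sender always emits compact JSON, the strict form is enough; otherwise something like the tolerant form may be safer.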
All of the above (thanks for reading this far) boils down to one simple question: what field name should I use in place of somefieldname (as per the example above)?
It occurs to me that I probably shouldn't use the default field name sourcetype.
"Anything you like, except for one of the default fields (so, not sourcetype)" might be a valid answer, but I'd prefer a more specific answer: an actual field name that other users in the same situation might also choose to use, as an informal convention (rather than the formalized EC protocol).
Er, event_sourcetype?
It also occurs to me that, given that I can supply both the sourcetype and the _time (expressed as Unix time) as fields in the JSON data, is there some better, more direct way than using regexes to configure the timestamp recognition and override the sourcetype? Specifying a regex to extract a JSON key value seems a bit like... inserting a key into a car door and turning the key, when you've got a remote unlock button on the same keychain. (Someone is going to lecture me on the data pipeline, and parsing versus search-time field extraction, and I probably deserve it.)
I'm not thrilled by having to pass through, as a field that will appear in the _raw field of each event, a value that will also be represented in the sourcetype field. That strikes me as inelegant. My EC-ingested events don't have such a field, and I'm hoping for my EC-ingested and TCP-ingested events to cohabit in the same indexes, so I'd prefer them to be as similar as possible. I'd appreciate advice on that, too.
I would first try sourcetype; you may find that this does exactly what you would like it to do. If not, it will create a field called something like something_sourcetype, with whatever name the current Splunk code (i.e. its developers) thinks it should have. Please do report back here which way it works and what the renamed/new field name is (if it does override the sourcetype field, it should rename the original field as something like orig_sourcetype).
What I'd really like is to use the same JSON for TCP input as I use for the HTTP Event Collector. That is, to specify time and sourcetype as metadata keys, rather than having to write stanzas to configure timestamp recognition and override the source type per-event.
Thanks, I'll try that first and report back.
Well, that was interesting. I defined a TCP input in inputs.conf:
[tcp://:6666]
index = test
sourcetype = xyz
with a corresponding stanza in props.conf:
[source::tcp:6666]
INDEXED_EXTRACTIONS = JSON
As an initial test, I used a Windows PowerShell script to send a few events in JSON format, and confirmed in Splunk Web that the following search:
sourcetype=xyz
displayed the events, with the field names and values extracted from the JSON. So far so good.
Then I added:
"sourcetype":"xyz_123"
to the JSON, and sent that.
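For anyone reproducing this test without PowerShell, a rough Python equivalent might look like the sketch below. It assumes the TCP input from the inputs.conf stanza above is listening on port 6666 on the local machine; the function names and the "message" field are hypothetical, chosen only for illustration.

```python
import json
import socket


def build_event(fields):
    """Serialize one event as compact, newline-terminated JSON bytes."""
    return (json.dumps(fields, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(host, port, payload):
    """Open a TCP connection, send one payload, then close the connection."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Example call (requires a listener on the assumed host and port):
# send_event("localhost", 6666,
#            build_event({"sourcetype": "xyz_123", "message": "test event"}))
```

The compact separators matter if a parsing-time regex later expects no whitespace between the key, the colon, and the value.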
The event appears in the Splunk Web Events tab with two values for the sourcetype field: xyz and xyz_123. There's no new or renamed field: just the one sourcetype field with two values.
That event appears if I use the search cited above, but if I change the search to:
sourcetype=xyz_123
I get no results.
I'm now about to try overriding the sourcetype as described in Splunk docs. Just for fun, I might try doing that using a sourcetype field, and see what happens: I wonder whether Splunk will "collapse" the two (now identical) values into one, or show them as separate values. Probably safer, though (since I don't understand the underlying code), to use a different field name.
I'm now overriding the sourcetype. Here's my working transforms.conf stanza:
[set_sourcetype_xyz]
REGEX = \x22event_sourcetype\x22:\x22([^\x22]+)\x22
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
(\x22 is the hexadecimal escape for a double quote.)
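To see what the REGEX and FORMAT lines do together, here is a small Python emulation. It is only an approximation: Python's re module stands in for Splunk's regex engine, apply_transform is a hypothetical helper, and the "$1" substitution is done with a plain string replace rather than by Splunk itself.

```python
import re

# Same pattern as the REGEX line; \x22 matches a literal double quote.
REGEX = re.compile(r'\x22event_sourcetype\x22:\x22([^\x22]+)\x22')
FORMAT = "sourcetype::$1"


def apply_transform(raw):
    """Return the value the transform would write, or None if REGEX does not match."""
    match = REGEX.search(raw)
    if match is None:
        return None
    # Substitute the first capture group for $1, as FORMAT describes.
    return FORMAT.replace("$1", match.group(1))


# Hypothetical raw event, as compact JSON.
print(apply_transform('{"event_sourcetype":"xyz_123","message":"hi"}'))
# sourcetype::xyz_123
```

An event without the event_sourcetype key yields None here, which corresponds to the transform not firing and the input's default sourcetype being kept.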
Using sourcetype instead of event_sourcetype as a field name in the JSON input data also works, but you end up with an ingested event with a sourcetype field that has two identical values (for example, xyz_123 and xyz_123).
I'm torn, and would appreciate advice on this. On the one hand, I'd prefer not to coin my own field name; on the other, I'm not comfortable with sourcetype having two values. That just looks weird to me, and I don't have enough experience with Splunk to know whether this will bite me in the a...
There is also sourcetype renaming that you might exploit. When you rename a sourcetype, the original value is moved to _sourcetype.
Yeah, I read the Splunk docs topic on that ("Rename source types at search time") before asking this question. Problem is, that functionality is too limited to be useful in this situation: it only offers a one-to-one renaming, from the original sourcetype value to a different literal string value. Thanks for the suggestion, though.
Thanks also for prodding me to try sourcetype and see what happens. I think that your answer, combined with this trail of comments, will prove useful to users with the same question, so I'm going to accept it.
I'm considering asking a new question, spawned by the testing I've done here, to ask about (re)using the EC protocol for TCP inputs.
I realize that I could save myself a heap of trouble here by using a single sourcetype value for all of the different types of log records - all of which have different record structures - that are extracted by the platform-specific log extraction tool I referred to in my original question. And I could coin some new field with the unique values that would have been in sourcetype.
But I think that would be a "cop out"; an un-Splunk-y thing to do; in neither the spirit nor the letter of the Splexicon definition of source type.