I know how to use Splunk 7.3.0 to override the source type per event using a backreference. For example, given this snippet of incoming JSON Lines:
"code":"red"
I can do this in transforms.conf:
REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
Code "red" in the incoming JSON Lines event data sets the event source type to "red".
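For completeness, here is a minimal sketch of how that transform might be wired up. The stanza names my_json_lines (standing in for the sourcetype assigned at the input) and set_sourcetype_from_code are hypothetical; the snippet above doesn't show them.

```ini
# props.conf
# Hypothetical stanza: the sourcetype originally assigned by the input.
[my_json_lines]
TRANSFORMS-changesourcetype = set_sourcetype_from_code

# transforms.conf
# Hypothetical stanza name; the three settings are the ones shown above.
[set_sourcetype_from_code]
REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
```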
But suppose I don't want to use the value of code as the sourcetype? Suppose I want to map each code value to a completely different sourcetype value? Perhaps each incoming code value uniquely identifies a different source type, but the actual code value is not Splunk-y enough to be a sourcetype value? Although, I don't want to get into sourcetype naming conventions here.
The only way I have thought of doing this so far is to create a stanza for each code value. For example, in transforms.conf (these code and sourcetype values are fictitious):
[set_sourcetype_test_red]
REGEX = \"code\":\"red\"
FORMAT = sourcetype::scarlet
DEST_KEY = MetaData:Sourcetype
[set_sourcetype_test_green]
REGEX = \"code\":\"green\"
FORMAT = sourcetype::emerald
DEST_KEY = MetaData:Sourcetype
[set_sourcetype_test_blue]
REGEX = \"code\":\"blue\"
FORMAT = sourcetype::aqua
DEST_KEY = MetaData:Sourcetype
and in props.conf:
TRANSFORMS-changesourcetype = set_sourcetype_test_red, set_sourcetype_test_green, set_sourcetype_test_blue
Codes "red", "green", and "blue" become source types "scarlet", "emerald", and "aqua".
I don't like this multi-stanza technique. I currently have only half a dozen or so source types in this context, but I might end up with many more.
Can anyone suggest a more concise, more performant technique; say, a single stanza with a single regex? I can't see how to do it.
For the purposes of this question:
- code values are all arriving at the same Splunk input (for example, TCP port)
- code values are known in advance (although, a fallback transform that uses a backreference for unexpected code values would be useful)

I notice that the Splunk docs contain the PCRE2 license, but the transforms.conf docs don't appear to mention any PCRE2-specific functionality, and anyway, I'm not even sure whether PCRE2-level substitution features would be of help here.
I've just submitted the following feedback on the Splunk 7.3.0 docs page for transforms.conf:
I've seen that Splunk docs cite the PCRE2 license, so I'd hoped that regex replacement in transforms.conf would support PCRE2 replacements. Apparently not :-(, hence this feedback.
The following settings:
[set_sourcetype_test_pcre2]
REGEX = \"code\":\"(?<red>red)|(?<green>green)|(?<blue>blue)|(?<other>[^\"]+)\"
FORMAT = sourcetype::${red:+scarlet:}${green:+emerald:}${blue:+aqua:}
with input JSON Lines snippet such as:
"code":"red"
results in a sourcetype value of, literally:
${red:+scarlet:}${green:+emerald:}${blue:+aqua:}
That is, regex processing in Splunk appears not to recognize the PCRE2 replacement syntax.
Or perhaps I'm doing something wrong.
Here's what I want to happen: if the code property value is "red", then set sourcetype to "scarlet"; if code "green", set sourcetype "emerald"; if code "blue", set sourcetype "aqua".
For more details, see my related question in Splunk Answers, "Performantly overriding sourcetype per event with new replacement string, not backreference?".
By "doing something wrong", I mean, for example: if the named capture group "red" is unset, then I want the replacement value to be an empty string, hence the lack of a string after the second colon; however, I'm unsure whether PCRE2 allows this; whether I need to specify "something" as the replacement string.
A Splunk docs contact has responded to my feedback (thank you!), and confirmed that, as of Splunk 8.0.0, Splunk doesn't support functions specific to PCRE2, such as these substitution functions.
You could use INGEST_EVAL with a case statement to facilitate this.
Yes!
This works:
[set_sourcetype]
INGEST_EVAL = sourcetype:=case(match(_raw, "\"code\":\"red\""), "scarlet", match(_raw, "\"code\":\"green\""), "emerald", match(_raw, "\"code\":\"blue\""), "aqua", true(), "other")
Thank you for your answer. My apologies for this belated comment.
I don't like the repetition of match(_raw, ... ) in my case function, though.
Here's a variation that extracts the code value into sourcetype in one transform, and then refers to that "temporary" sourcetype in the INGEST_EVAL in a second transform:
[get_sourcetype_from_code]
REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype
[set_sourcetype]
INGEST_EVAL = sourcetype:=case(sourcetype=="red", "scarlet", sourcetype=="green", "emerald", sourcetype=="blue", "aqua", true(), sourcetype)
(Requires props.conf to refer to the two transforms in sequence. For example: TRANSFORMS-changesourcetype = get_sourcetype_from_code,set_sourcetype.)
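Sketched as a props.conf stanza, assuming the events arrive with a hypothetical input sourcetype named my_json_lines (that name is an assumption for illustration, not something from this thread):

```ini
# props.conf
# Hypothetical stanza name: the sourcetype assigned at the input.
[my_json_lines]
# Transforms in a single TRANSFORMS- class are applied in the order listed,
# so the extraction runs before the INGEST_EVAL that remaps its result.
TRANSFORMS-changesourcetype = get_sourcetype_from_code, set_sourcetype
```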
VERY nicely done! I like it.
Incidental observation: the example set_sourcetype stanza in my previous comment (deliberately) doesn't specify a REGEX setting. splunkd reports this omission as an error:
ERROR regexExtractionProcessor - REGEX field must be specified tranform_name=set_sourcetype
My opinion: this error is a bug. In practice, a REGEX is not required for this stanza.
Nit: Splunk, please correct the typo tranform_name (sic; note the missing "s") in the error text.
Perhaps I'm trying too hard to be Splunk-y by attempting to map each of these incoming code values to a different sourcetype value. I could simply forget about overriding the source type per event, set a fixed sourcetype, and, in my searches, where I currently refer only to sourcetype, refer instead to both sourcetype and code. (I didn't mention this in the question, but I typically use a transform to remove code after using it to override sourcetype.) I typically place such search snippets in macros, anyway, to isolate my dashboard Simple XML from such issues.
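A sketch of what such a macro could look like in macros.conf. The fixed sourcetype name my_json_lines and the macro name scarlet_events are both hypothetical, and the sketch assumes that code is left in the event and is available as a search-time field:

```ini
# macros.conf
# Hypothetical macro: selects the events that would otherwise have been
# given the sourcetype "scarlet" by the index-time override.
[scarlet_events]
definition = sourcetype=my_json_lines code=red
```

A dashboard search would then begin with `scarlet_events` (in backticks) rather than sourcetype=scarlet, so a later change of approach touches only the macro definition.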
Not overriding the source type would mean that, if the data is ingested by uploading from a file on my computer, the search that Splunk Web offers for the newly uploaded data will actually find results!
not sure if your comment is an answer ...
can you elaborate on the problem you are trying to solve? what is it that you would like to achieve?
Hi adonio,
My question includes an answer, but, as I wrote, I don't like the technique it uses. My first comment after the question describes a workaround, rather than an answer: abandoning the idea of a granular sourcetype field, and instead relying on a fixed, generic sourcetype field in combination with a separate code field.
can you elaborate on the problem you are trying to solve?
I want to use a value in incoming JSON Lines data to set sourcetype per event. The value in the incoming data and the sourcetype are completely different.
what is it that you would like to achieve?
A more performant solution than the one I have now. Suppose I have 20 source types. Using my current technique, that means 20 separate stanzas in transforms.conf. I'm hoping for something more elegant and concise; and I'm hoping that this also means "more performant" (faster; less index-time processing for the transform).
I was hoping that PCRE2 replacement syntax might work; see my recent related comment on this question.