Solved: Performantly overriding sourcetype per event with ...

Graham_Hanningt · 11-08-2019

I know how to use Splunk 7.3.0 to override source type per event using a backreference. For example, given this snippet of incoming JSON Lines:

"code":"red"

I can do this in transforms.conf:

REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

Code "red" in the incoming JSON Lines event data sets the event source type to "red".

But suppose I don't want to use the value of code as the sourcetype? Suppose I want to map each code value to a completely different sourcetype value? Perhaps each incoming code value uniquely identifies a different source type, but the actual code value is not Splunk-y enough to be a sourcetype value? Although, I don't want to get into sourcetype naming conventions here.

The only way I have thought of doing this so far is to create a stanza for each code value. For example, in transforms.conf (these code and sourcetype values are fictitious):

[set_sourcetype_test_red]
REGEX = \"code\":\"red\"
FORMAT = sourcetype::scarlet
DEST_KEY = MetaData:Sourcetype

[set_sourcetype_test_green]
REGEX = \"code\":\"green\"
FORMAT = sourcetype::emerald
DEST_KEY = MetaData:Sourcetype

[set_sourcetype_test_blue]
REGEX = \"code\":\"blue\"
FORMAT = sourcetype::aqua
DEST_KEY = MetaData:Sourcetype

and in props.conf:

TRANSFORMS-changesourcetype = set_sourcetype_test_red, set_sourcetype_test_green, set_sourcetype_test_blue

Codes "red", "green", and "blue" become source types "scarlet", "emerald", and "aqua".

I don't like this multi-stanza technique. I currently have only half a dozen or so source types in this context, but I might end up with many more.

Can anyone suggest a more concise, more performant technique; say, a single stanza with a single regex? I can't see how to do it.

For the purposes of this question:

The different code values are all arriving at the same Splunk input (for example, TCP port)
I know what all the code values are (although, a fallback transform that uses a backreference for unexpected code values would be useful)
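
One way to get such a fallback while keeping the per-code stanzas is to list a backreference transform first. This is only a sketch, not a tested configuration: it assumes that the transforms in one TRANSFORMS- class run in the order listed and that a later matching transform overwrites the sourcetype written by an earlier one. The stanza names set_sourcetype_test_fallback and my_json_lines are hypothetical; the red, green, and blue stanzas are the ones shown above.

```ini
# transforms.conf
# Hypothetical fallback stanza: copy whatever code value arrives into sourcetype.
[set_sourcetype_test_fallback]
REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

# props.conf
# Hypothetical stanza name: stands for the sourcetype assigned at the input.
[my_json_lines]
# Fallback first; the specific transforms that follow replace its result
# when they match (assumed last-match-wins behavior).
TRANSFORMS-changesourcetype = set_sourcetype_test_fallback, set_sourcetype_test_red, set_sourcetype_test_green, set_sourcetype_test_blue
```

With this ordering, an unexpected code such as "purple" would keep the sourcetype "purple", while known codes end up with their mapped values.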

I notice that the Splunk docs contain the PCRE2 license, but the transforms.conf docs don't appear to mention any PCRE2-specific functionality, and anyway, I'm not even sure whether PCRE2-level substitution features would be of help here.

Graham_Hanningt · 11-14-2019

I've just submitted the following feedback on the Splunk 7.3.0 docs page for transforms.conf:

I've seen that Splunk docs cite the PCRE2 license, so I'd hoped that regex replacement in transforms.conf would support PCRE2 replacements. Apparently not :-(, hence this feedback.

The following settings:

[set_sourcetype_test_pcre2]
REGEX = \"code\":\"(?<red>red)|(?<green>green)|(?<blue>blue)|(?<other>[^\"]+)\"
FORMAT = sourcetype::${red:+scarlet:}${green:+emerald:}${blue:+aqua:}

with input JSON Lines snippet such as:

"code":"red"

results in a sourcetype value of, literally:

${red:+scarlet:}${green:+emerald:}${blue:+aqua:}

That is, regex processing in Splunk appears not to recognize the PCRE2 replacement syntax.

Or perhaps I'm doing something wrong.

Here's what I want to happen: if the code property value is "red", then set sourcetype to "scarlet"; if code "green", set sourcetype "emerald"; if code "blue", set sourcetype "aqua".

For more details, see my related question in Splunk Answers, "Performantly overriding sourcetype per event with new replacement string, not backreference?".

By "doing something wrong", I mean, for example: if the named capture group "red" is unset, then I want the replacement value to be an empty string, hence the lack of a string after the second colon. However, I'm unsure whether PCRE2 allows this, or whether I need to specify "something" as the replacement string.

Graham_Hanningt · 11-20-2019

A Splunk docs contact has responded to my feedback (thank you!), and confirmed that, as of Splunk 8.0.0, Splunk doesn't support functions specific to PCRE2, such as these substitution functions.

woodcock · 11-09-2019

You could use INGEST_EVAL with a case statement to facilitate this.

Graham_Hanningt · 11-14-2019

Yes!

This works:

[set_sourcetype]
INGEST_EVAL = sourcetype:=case(match(_raw, "\"code\":\"red\""), "scarlet", match(_raw, "\"code\":\"green\""), "emerald", match(_raw, "\"code\":\"blue\""), "aqua", true(), "other")

Thank you for your answer. My apologies for this belated comment.

I don't like the repetition of match(_raw, ... ) in my case function, though.

Here's a variation that extracts the code value into sourcetype in one transform, and then refers to that "temporary" sourcetype in the INGEST_EVAL in a second transform:

[get_sourcetype_from_code]
REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

[set_sourcetype]
INGEST_EVAL = sourcetype:=case(sourcetype=="red", "scarlet", sourcetype=="green", "emerald", sourcetype=="blue", "aqua", true(), sourcetype)

(Requires props.conf to refer to the two transforms in sequence. For example: TRANSFORMS-changesourcetype = get_sourcetype_from_code,set_sourcetype.)
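
Put together, the two files might look like the sketch below. The props.conf stanza name my_json_lines is hypothetical and stands for whatever sourcetype the input assigns; the transforms are the two stanzas above, unchanged.

```ini
# props.conf
# Hypothetical stanza name: the sourcetype assigned at the input.
[my_json_lines]
# Order matters: extract the code into sourcetype first, then map it.
TRANSFORMS-changesourcetype = get_sourcetype_from_code, set_sourcetype

# transforms.conf
# Step 1: copy the raw code value into sourcetype as a temporary value.
[get_sourcetype_from_code]
REGEX = \"code\":\"([^\"]+)\"
FORMAT = sourcetype::$1
DEST_KEY = MetaData:Sourcetype

# Step 2: map the temporary value to the final sourcetype.
[set_sourcetype]
INGEST_EVAL = sourcetype:=case(sourcetype=="red", "scarlet", sourcetype=="green", "emerald", sourcetype=="blue", "aqua", true(), sourcetype)
```

Because the last case() branch is true(), sourcetype, an unexpected code value keeps the value copied in step 1, so this variation also acts as the backreference fallback. An event with no code property is not matched in step 1 and keeps its original sourcetype.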

woodcock · 11-15-2019

VERY nicely done! I like it.

Graham_Hanningt · 11-14-2019

Incidental observation: the example set_sourcetype stanza in my previous comment (deliberately) doesn't specify a REGEX setting. splunkd reports this omission as an error:

ERROR regexExtractionProcessor - REGEX field must be specified tranform_name=set_sourcetype

My opinion: this error is a bug. In practice, a REGEX is not required for this stanza.

Nit: Splunk, please correct the typo tranform_name (sic; note the missing "s") in the error text.

Graham_Hanningt · 11-08-2019

Perhaps I'm trying too hard to be Splunk-y by attempting to map each of these incoming code values to a different sourcetype value. I could simply forget about overriding the source type per event, set a fixed sourcetype, and, in my searches, where I currently refer only to sourcetype, refer instead to both sourcetype and code. (I didn't mention this in the question, but I typically use a transform to remove code after using it to override sourcetype.) I typically place such search snippets in macros, anyway, to isolate my dashboard Simple XML from such issues.

Not overriding the source type would mean that, if the data is ingested by uploading from a file on my computer, the search that Splunk Web offers for the newly uploaded data will actually find results!
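
As a rough illustration of that workaround, the macros might look like the sketch below. All names here are hypothetical (the macro names and the fixed sourcetype my_json_lines), and it assumes that code is left in the event and is available as a field at search time.

```ini
# macros.conf
# Hypothetical macros: each one hides the sourcetype-plus-code pair
# behind a single name, so dashboards never mention code directly.
[scarlet_events]
definition = sourcetype=my_json_lines code=red

[emerald_events]
definition = sourcetype=my_json_lines code=green

[aqua_events]
definition = sourcetype=my_json_lines code=blue
```

A dashboard search would then call a macro by wrapping its name in backticks, so a later change to the mapping touches only macros.conf.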

adonio · 11-08-2019

not sure if your comment is an answer ...
can you elaborate on the problem you are trying to solve? what is it that you would like to achieve?

Graham_Hanningt · 11-14-2019

Hi adonio,

My question includes an answer, but, as I wrote, I don't like the technique it uses. My first comment after the question describes a workaround, rather than an answer: abandoning the idea of a granular sourcetype field, and instead relying on a fixed, generic sourcetype field in combination with a separate code field.

can you elaborate on the problem you are trying to solve?

I want to use a value in incoming JSON Lines data to set sourcetype per event. The value in the incoming data and the sourcetype are completely different.

what is it that you would like to achieve?

A more performant solution than the one I have now. Suppose I have 20 source types. Using my current technique, that means 20 separate stanzas in transforms.conf. I'm hoping for something more elegant and concise; and I'm hoping that this also means "more performant" (faster; less index-time processing for the transform).

I was hoping that PCRE2 replacement syntax might work; see my recent related comment on this question.
