Expanding a bit on my question from last year, "categorize or classify dissimilar field values at search time?":
How does one simplify, streamline and maintain event classification on verbose and non-uniform logs - e.g. if someone decided to dump 50 different application logs into one, where each application produces different events that need classification?
Something like "linux_messages_syslog", which aggregates a number of different application and service logs into one? In my case it's a content management and orchestration platform, with a variety of different modules each producing a variety of different event types.
I assume the primary mechanism is to use (1) rex field extraction then (2) case-match groups like this:
| rex field=event_message "<regex_01>"
| rex field=event_message "<regex_03>"
| rex field=event_message "<regex_05>"
| eval event_category = case(
    match(event_message, "<regex_01>"), "category_01: ".extracted_field_01,
    match(event_message, "<regex_02>"), "category_02",
    match(event_message, "<regex_03>"), "category_03: ".extracted_field_03,
    match(event_message, "<regex_04>"), "category_04",
    match(event_message, "<regex_05>"), "category_05: ".extracted_field_05,
    true(), "<uncategorized_yet>"
)
| stats count dc(extracted_field_01) dc(extracted_field_03) by event_category, log_level
| sort -count
I also assume that fields can't be extracted via match() clauses, and thus I have to use the same regex statements twice - first in the field extraction, then again in the case-match groups where I classify the events?
Here is a small sample of the actual classification SPL:
index=main sourcetype="custom_application"
| rex field=component "\.(?P<component_shortname>\w+)$"
| rex field=event_message "^<(?P<event_msg>.*)>$"
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"
| rename COMMENT AS "the above is just a sample - there are about 20-30 more rex statements"
| eval event_category = case (
match( event_message, "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"), action." ".object." sample",
match( event_message, "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"), "GENERATED TEST OUTPUT: ".filetype." <filename> ingested successfully",
match( event_message, "^<GENERATED TEST OUTPUT: .*>"), event_msg,
true(), "<other>"
)
| eval event_category_long = component_shortname." (".log_level."): ".event_category
(In reality it's already a few pages long, and I am far from done.)
Event samples:
2020-09-16 00:04:29,253 INFO [com.custom_app.plugin.workflow.execution.processor.TestStepExecutionProcessor] (partnerPackageWorkflow-log result [ex 34588028]) - <GENERATED TEST OUTPUT: deliveries complete for partner some_partner>
2020-09-16 00:03:20,462 INFO [com.custom_app.plugin.workflow.execution.processor.TestStepExecutionProcessor] (packageDelivery-log result [ex 34588139]) - <GENERATED TEST OUTPUT: package_name_anonymized delivered successfully>
2020-09-16 00:03:41,183 TRACE [com.custom_app.workflow.services.ReportService] (pool-8-thread-68) - <Created step execution sample in 57 ms.>
2020-09-16 00:03:41,126 TRACE [com.custom_app.workflow.services.ReportService] (pool-8-thread-68) - <Creating step execution sample...>
2020-09-15 23:58:24,896 INFO [com.custom_app.plugin.workflow.execution.processor.ThrottledSubflowStepExecutionProcessor] (partnerPackageWorkflow-deliver package items [ex 34588027]) - <Executing as the **THROTTLED** subflow step processor.>
Perhaps you can use offset_field and then match by the values in the field
| rex field=event_message offset_field=catindex "^<(?P<event_msg>.*)>$"
| rex field=event_message offset_field=catindex "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message offset_field=catindex "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"
From your sample data you would get the following catindex values (one per sample event, in order):
event_msg=1-68
event_msg=1-70
event_msg=1-39&action=1-7&object=9-22&sample_creation_time=34-35
event_msg=1-33&action=1-8&object=10-23
event_msg=1-54
Using unique field names in your different rex statements, you can categorise by the presence of these fields and/or their offsets.
Thank you, I'll play with this and see if it gives additional insights - yet the defining factors for the categories are as follows, in this (descending) order:
Field offsets (their positioning) are unlikely to play a significant role.
Other than that, am I following Splunk's best practices in terms of readability and performance of my SPL? I.e. is there really no way to reduce the redundant regex processing (once in field extraction, then again in the case-match statements)?
Thanks!
Hi @mitag
The point about offset_field is that it gives you additional information from the original set of rex field extractions. To remove the additional rex matches in the case statement, the case statement can be built around which fields were extracted. For example, your first alert category requires the fields action, object and sample_creation_time. Rather than repeating the rex in the match, you could check that all three of these fields are not null, which would indicate that the corresponding rex found them. With offset_field, the names of these fields also appear in catindex, so your case condition for this alert could be like(catindex, "%action%object%sample_creation_time%"). This might be quicker/easier than checking for null on multiple fields. Also, the index ranges may be useful for determining which type of match occurred, e.g. action=1-7 means Created and action=1-8 means Creating was extracted - although simply looking at the action field makes more sense in this instance. Without knowing the full range of your alert categorisation, it is hard to know whether the indexes will prove useful or not.
In either case, you do not need to rerun the rex in the match statements, you already have all the information you need.
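For illustration, a rough sketch of what that could look like, reusing two of the rex statements from the question. The like() patterns are an assumption, based on the field names appearing in catindex in capture-group order as in the sample output above:
| rex field=event_message offset_field=catindex "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message offset_field=catindex "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"
``` the regexes above run once; the case() below only checks which fields they produced, via catindex ```
| eval event_category = case(
    like(catindex, "%action%object%"), action." ".object." sample",
    like(catindex, "%filetype%filename%"), "GENERATED TEST OUTPUT: ".filetype." <filename> ingested successfully",
    true(), "<other>")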
The same set of fields (with similar offsets) can be found in multiple types of events, and thus in multiple categories. I.e. checking for certain fields not being null and determining the category based on that will lead to miscategorization.
If you use different field names in your initial rex you can identify which fields came from which rex e.g. <cat1_class>, <cat2_class>. You can rename the fields later if necessary e.g. rename *_class as class
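A minimal sketch of that approach, with made-up cat1_*/cat2_* prefixes standing in for per-category field names:
| rex field=event_message "^<(?P<cat1_action>Created|Creating) (?P<cat1_object>capacity|workflow execution|step execution) sample(\.+| in (?P<cat1_sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<GENERATED TEST OUTPUT: (?P<cat2_filetype>(?P<cat2_asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<cat2_filename>.*) ingested successfully>$"
``` each field name can only come from one rex, so one isnotnull() per category is enough ```
| eval event_category = case(
    isnotnull(cat1_action), cat1_action." ".cat1_object." sample",
    isnotnull(cat2_filetype), "GENERATED TEST OUTPUT: ".cat2_filetype." <filename> ingested successfully",
    true(), "<other>")
| rename cat1_* as *, cat2_* as *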
...which will make the SPL even more complex and less readable. I'll have to work with redundant regex statements where needed until someone suggests a more elegant and efficient approach - but this is not it, regretfully. I appreciate you sharing your knowledge.
You have a complex requirement - lots of different "event types" in the same log. What I have suggested does remove the redundant rexes (one of your (main?) asks). You could add an extra field extraction to each rex purely to identify the rex: e.g. if "ingested successfully" only appears in events of one type, extract that text to a unique field name, or to the same field for all categories, e.g. (?<cat>ingested successfully); then your case statement can be based on the value of this field. Ideally, these field extractions would be done at the indexing phase, which would hide them from your SPL. However, without a complete picture of all your different categorisations, it is difficult to define an "elegant" or optimised solution. This is something you will have to determine yourself.
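As a sketch of that marker-field idea - the marker texts are taken from the sample events, and whether they uniquely identify their event types is an assumption:
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) (?P<cat>sample)(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) (?P<cat>ingested successfully)>$"
``` every rex writes its marker text into the same cat field, so the case() keys off a single field value ```
| eval event_category = case(
    cat=="sample", action." ".object." sample",
    cat=="ingested successfully", "GENERATED TEST OUTPUT: ".filetype." <filename> ingested successfully",
    true(), "<other>")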
Perhaps I could have been clearer in my question / request (I am still unsure if "ask" is the right term - feels weird to me) - yet it's one of those cases where asking the right question is a pursuit of its own.
Maybe this:
Context:
Hence The Question:
Given the context above, what are the best practices to design, maintain and document a classification and field extraction SPL? The SPL must be readable and (relatively) easily adaptable to any changes in the event format.
My understanding - very limited due to my very limited Splunk knowledge - is:
I.e. (1) the general SPL structure in my example is as efficient and performant as it gets given the context, and (2) field-based (rather than regex-based) event classification improves performance, but at the expense of the SPL's readability and maintainability.
If any of this is off - I am open to suggestions.
As you say, doing lots of rex can be expensive, so halving this by using the results to classify the event types would surely help. You could add extra matches to your expressions to help classify your events. For example
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<(?P<action>Creating) (?P<object>capacity|workflow execution|step execution) aborted\.)>$"
you would not be able to distinguish these event types purely by the presence of the action and object fields. However
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<(?P<action>Creating) (?P<object>capacity|workflow execution|step execution) (?P<abort>aborted)\.)>$"
might give you an extra field which you may not need for anything other than classifying the events.
In terms of readability, when evaluating the category, you could repeat the rex but comment it out (using SPL's triple-backtick comment syntax) and actually use alternative methods:
| eval event_category = case (
``` match( event_message, "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$") ```
isnotnull(action) AND isnotnull(object) AND isnull(abort), action." ".object." sample",
You say that you can't ask the developers to use different logs, but can you ask them to include unique patterns in the messages to assist in the classification of the events?
Given that you have lots of different event types to categorise, can you put together a presentation for your developers or management to explain why the fast turnaround of changing logs is making your job more difficult? If the developers could consider what you need to do with the log messages when they create new messages, this would save you time and help you keep pace with their changes. After all, your dashboards are presumably important (otherwise, why would you be doing this?), and the quicker you can get them updated and maintain their accuracy, the more useful they remain.