Expanding a bit on my question from last year, "categorize or classify dissimilar field values at search time?":
How does one simplify, streamline and maintain event classification on verbose and non-uniform logs - e.g. if someone decided to dump 50 different application logs into one, where each application produces different events that need classification?
Something like "linux_messages_syslog", which aggregates a number of different application and service logs into one? In my case it's a content management and orchestration platform, with a variety of different modules each producing a variety of different event types.
I assume the primary mechanism is to use (1) rex field extraction then (2) case-match groups like this:
| rex field=event_message "<regex_01>"
| rex field=event_message "<regex_03>"
| rex field=event_message "<regex_05>"
| eval event_category = case(
    match(event_message, "<regex_01>"), "category_01: ".extracted_field_01,
    match(event_message, "<regex_02>"), "category_02",
    match(event_message, "<regex_03>"), "category_03: ".extracted_field_03,
    match(event_message, "<regex_04>"), "category_04",
    match(event_message, "<regex_05>"), "category_05: ".extracted_field_05,
    true(), "<uncategorized_yet>"
)
| stats count dc(extracted_field_01) dc(extracted_field_03) by event_category, log_level
| sort -count
I also assume that fields can't be extracted via match() clauses, and thus I have to use the same regex statements twice - first in the field extraction, then again in the case-match groups where I classify the events?
Here is a small sample of the actual classification SPL:
index=main sourcetype="custom_application"
| rex field=component "\.(?P<component_shortname>\w+)$"
| rex field=event_message "^<(?P<event_msg>.*)>$"
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"
| rename COMMENT AS "the above is just a sample - there are about 20-30 more rex statements"
| eval event_category = case (
match( event_message, "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"), action." ".object." sample",
match( event_message, "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"), "GENERATED TEST OUTPUT: ".filetype." <filename> ingested successfully",
match( event_message, "^<GENERATED TEST OUTPUT: .*>"), event_msg,
true(), "<other>"
)
| eval event_category_long = component_shortname." (".log_level."): ".event_category
(In reality it's already a few pages long, and I am far from done.)
Event samples:
2020-09-16 00:04:29,253 INFO [com.custom_app.plugin.workflow.execution.processor.TestStepExecutionProcessor] (partnerPackageWorkflow-log result [ex 34588028]) - <GENERATED TEST OUTPUT: deliveries complete for partner some_partner>
2020-09-16 00:03:20,462 INFO [com.custom_app.plugin.workflow.execution.processor.TestStepExecutionProcessor] (packageDelivery-log result [ex 34588139]) - <GENERATED TEST OUTPUT: package_name_anonymized delivered successfully>
2020-09-16 00:03:41,183 TRACE [com.custom_app.workflow.services.ReportService] (pool-8-thread-68) - <Created step execution sample in 57 ms.>
2020-09-16 00:03:41,126 TRACE [com.custom_app.workflow.services.ReportService] (pool-8-thread-68) - <Creating step execution sample...>
2020-09-15 23:58:24,896 INFO [com.custom_app.plugin.workflow.execution.processor.ThrottledSubflowStepExecutionProcessor] (partnerPackageWorkflow-deliver package items [ex 34588027]) - <Executing as the **THROTTLED** subflow step processor.>
Perhaps you can use offset_field and then match by the values in the field
| rex field=event_message offset_field=catindex "^<(?P<event_msg>.*)>$"
| rex field=event_message offset_field=catindex "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message offset_field=catindex "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"
From your sample data you would get the following catindex values (one per sample event, in order):
event_msg=1-68
event_msg=1-70
event_msg=1-39&action=1-7&object=9-22&sample_creation_time=34-35
event_msg=1-33&action=1-8&object=10-23
event_msg=1-54
Using unique field names in your different rex statements, you can categorise by the presence of these fields and/or their offsets.
Thank you, I'll play with this and see if it gives additional insights - yet the defining factors for the categories are as follows, in this (descending) order:
Field offsets (their positioning) are unlikely to play a significant role.
Other than that, am I following Splunk's best practices in terms of readability and performance of my SPL? I.e. is there really no way to reduce the redundant regex processing (once in field extraction, then again in the case-match statements)?
Thanks!
Hi @mitag
The point about offset_field is that it gives you additional information from the original set of rex field extractions. To remove the additional rex matches in the case statement, the case statement can be built around which fields were extracted. For example, your first alert category requires the fields action, object and sample_creation_time. Rather than repeating the rex in the match, you could check that all three of these fields are not null, which would indicate that the corresponding rex found them. With offset_field, the names of these fields also appear in catindex, so your case condition for this alert could be like(catindex, "%action%object%sample_creation_time%"). This might be quicker/easier than checking for null on multiple fields. Also, the index ranges may be useful for determining which type of match occurred, e.g. action=1-7 means Created and action=1-8 means Creating was extracted - although simply looking at the action field makes more sense in this instance. Without knowing the full range of your alert categorisation, it is hard to know whether the indexes will prove useful or not.
In either case, you do not need to rerun the rex in the match statements, you already have all the information you need.
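For illustration, a rough sketch of what that could look like, reusing two of the rex statements from the question. The like() patterns are an assumption, based on the field names appearing in catindex in capture-group order as in the sample output above:
| rex field=event_message offset_field=catindex "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message offset_field=catindex "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) ingested successfully>$"
``` the regexes above run once; the case() below only checks which fields they produced, via catindex ```
| eval event_category = case(
    like(catindex, "%action%object%"), action." ".object." sample",
    like(catindex, "%filetype%filename%"), "GENERATED TEST OUTPUT: ".filetype." <filename> ingested successfully",
    true(), "<other>")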
The same set of fields (with similar offsets) can be found in multiple types of events, and thus in multiple categories. I.e. checking for certain fields not being null and determining the category based on that will lead to miscategorization.
If you use different field names in your initial rex you can identify which fields came from which rex e.g. <cat1_class>, <cat2_class>. You can rename the fields later if necessary e.g. rename *_class as class
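A minimal sketch of that approach, with made-up cat1_*/cat2_* prefixes standing in for per-category field names:
| rex field=event_message "^<(?P<cat1_action>Created|Creating) (?P<cat1_object>capacity|workflow execution|step execution) sample(\.+| in (?P<cat1_sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<GENERATED TEST OUTPUT: (?P<cat2_filetype>(?P<cat2_asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<cat2_filename>.*) ingested successfully>$"
``` each field name can only come from one rex, so one isnotnull() per category is enough ```
| eval event_category = case(
    isnotnull(cat1_action), cat1_action." ".cat1_object." sample",
    isnotnull(cat2_filetype), "GENERATED TEST OUTPUT: ".cat2_filetype." <filename> ingested successfully",
    true(), "<other>")
| rename cat1_* as *, cat2_* as *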
...which will make the SPL even more complex and less readable. I'll have to work with redundant regex statements where needed until someone suggests a more elegant and efficient approach - but this is not it, regretfully. I appreciate you sharing your knowledge.
You have a complex requirement - lots of different "event types" in the same log. What I have suggested does remove the redundant rexes (one of your (main?) asks). You could add an extra field extraction to each rex purely to identify the rex: e.g. if "ingested successfully" only appears in events of one type, extract that text to a unique field name, or to the same field for all categories, e.g. (?<cat>ingested successfully); then your case statement can be based on the value of this field. Ideally, these field extractions would be done at the indexing phase, which would hide them from your SPL. However, without a complete picture of all your different categorisations, it is difficult to define an "elegant" or optimised solution. This is something you will have to determine yourself.
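As a sketch of that marker-field idea - the marker texts are taken from the sample events, and whether they uniquely identify their event types is an assumption:
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) (?P<cat>sample)(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<GENERATED TEST OUTPUT: (?P<filetype>(?P<asset_type>\w+) asset|AdPod file|closed captioning / subtitle assets) (?P<filename>.*) (?P<cat>ingested successfully)>$"
``` every rex writes its marker text into the same cat field, so the case() keys off a single field value ```
| eval event_category = case(
    cat=="sample", action." ".object." sample",
    cat=="ingested successfully", "GENERATED TEST OUTPUT: ".filetype." <filename> ingested successfully",
    true(), "<other>")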
Perhaps I could have been clearer in my question / request (I am still unsure if "ask" is the right term - feels weird to me) - yet it's one of those cases where asking the right question is a pursuit of its own.
Maybe this:
Context:
Hence The Question:
Given the context above, what are the best practices to design, maintain and document a classification and field extraction SPL? The SPL must be readable and (relatively) easily adaptable to any changes in the event format.
My understanding - very limited due to my very limited Splunk knowledge - is:
I.e. (1) the general SPL structure in my example is as efficient and performant as it gets given the context, and (2) field-based (rather than regex-based) event classification improves performance, but at the expense of the SPL's readability and maintainability.
If any of this is off - I am open to suggestions.
As you say, doing lots of rex can be expensive, so halving this by using the results to classify the event types would surely help. You could add extra matches to your expressions to help classify your events. For example
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<(?P<action>Creating) (?P<object>capacity|workflow execution|step execution) aborted\.)>$"
you would not be able to distinguish these event types purely by the presence of the action and object fields. However
| rex field=event_message "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$"
| rex field=event_message "^<(?P<action>Creating) (?P<object>capacity|workflow execution|step execution) (?P<abort>aborted)\.)>$"
might give you an extra field which you may not need for anything other than classifying the events.
In terms of readability, when evaluating the category, you could repeat the rex but comment it out (using SPL's triple-backtick comment syntax) and actually use alternative methods:
| eval event_category = case (
``` match( event_message, "^<(?P<action>Created|Creating) (?P<object>capacity|workflow execution|step execution) sample(\.+| in (?P<sample_creation_time>\d+) ms\.)>$") ```
isnotnull(action) AND isnotnull(object) AND isnull(abort), action." ".object." sample",
You say that you can't ask the developers to use different logs, but can you ask them to include unique patterns in the messages to assist in the classification of the events?
Given that you have lots of different event types to categorise, can you put together a presentation for your developers or management to explain why the fast turnaround of changing logs is making your job more difficult? If the developers could consider what you need to do with the log messages when they create new messages, this would save you time and help you keep pace with their changes. After all, your dashboards are presumably important (otherwise, why would you be doing this?), and the quicker you can get them updated and maintain their accuracy, the more useful they remain.