
Message Parsing in SOCK

sbylica
Splunk Employee

Introduction

This blog post is part of an ongoing series on SOCK enablement.

In this blog post, I will write about parsing messages to extract valuable information and then processing it consistently across entries. SOCK (Splunk OpenTelemetry Collector for Kubernetes) can be used to process many different kinds of data, one of the most common being logs read from log files. We use operators to extract and parse information - operators being the most basic units of log processing.

As an example, by default, the filelog receiver in the SOCK pipeline uses various operators to extract information from the incoming logs and log file path. This information includes, but is not limited to:

  • namespace, pod, uid, and container name from the log file’s path
  • time, log level, logtag, and log message from the actual log body

In later stages of the pipeline, this information is used to enrich the attributes of the log. For example:

  • com.splunk.sourcetype field is set from the container name
  • com.splunk.source field is set from the log file’s path
  • So, if the full path of the container’s log file is /var/log/pods/kube-system_etcd/etcd/0.log, then the com.splunk.source value will be set to that path - we treat the file’s path as its source

There might be scenarios where you would like to set a source other than the default one (i.e., the log file’s path), or where you need to extract some extra attributes from the log message.

This article explains how to do it.

Operators

The OpenTelemetry Collector comes with a set of operators. From the README:
An operator is the most basic unit of log processing. Each operator fulfills a single responsibility, such as reading lines from a file, or parsing JSON from a field. Operators are then chained together in a pipeline to achieve a desired result.

For instance, a user may read lines from a file using the file_input operator. From there, the results of this operation may be sent to a regex_parser operator that creates fields based on a regex pattern. And then finally, these results may be sent to a file_output operator that writes each line to a file on disk.
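
To make that description concrete, here is a minimal, illustrative chain of two operators; the regex, id, and field names are made up for this example and are not part of SOCK's default pipeline:

# Illustrative only: parse a hypothetical "level=info msg=..." line,
# then rename the extracted msg field.
- type: regex_parser
  id: example-parser
  regex: '^level=(?P<level>[^ ]+) msg=(?P<msg>.*)$'
- type: move
  from: attributes.msg
  to: attributes.message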

Under the hood, SOCK uses a pipeline of several operators to extract the information from the log.

We will look at an example of logs produced by containerd, one of the runtimes commonly used to run containers (another common runtime is Docker). Let’s look at a snippet of an operator from SOCK used to extract data from containerd runtime logs:

 

 

- type: regex_parser
  id: parser-containerd
  regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
  timestamp:
    parse_from: attributes.time
    layout_type: gotime
    layout: '2006-01-02T15:04:05.999999999Z07:00'

 

The actual log being read from the file and going into the operator might look like this:

 

2023-12-27T12:14:05.227298808+00:00 stderr F Hello World

 

The above operator does a simple thing. It extracts the following data from the log message based on the regular expression:

  1. time: set to “2023-12-27T12:14:05.227298808+00:00”
  2. stream: “stderr”, which matches one of the two possible stream types (stdout or stderr)
  3. logtag: set to “F”
  4. log: “Hello World” - this is the actual log message

The timestamp section additionally tells the parser how to interpret the extracted time value, using a Go-style reference layout. Our regex operator extracts these values and inserts them into the entry’s attributes - not into the event body, as we will see in the next section.
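
As an aside, not every container runtime formats its log lines this way. Docker's json-file logging driver, for example, writes each line as a JSON object, so a json_parser (covered again later) would be the natural choice there. Below is a minimal, illustrative sketch, not SOCK's actual configuration:

# Illustrative only: parse a Docker-style JSON log line such as
# {"log":"Hello World\n","stream":"stderr","time":"2023-12-27T12:14:05.227298808Z"}
- type: json_parser
  id: example-docker-parser
  timestamp:
    parse_from: attributes.time
    layout_type: gotime
    layout: '2006-01-02T15:04:05.999999999Z07:00'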

 

Structure of the log message in the operator pipeline

Before we continue, we should learn about the log message format inside the pipeline. Knowing this will help us to apply our own custom operators later.

Suppose we have a slightly different message from containerd:

 

2023-12-27T12:14:05.227298808+00:00 stderr F Hello World source=xyz

 

The entry for the above log will look like this in the operator pipeline:

 

{
  "timestamp": "2024-05-27T12:21:03.769505512Z",
  "body": "2023-12-27T12:14:05.227298808+00:00 stderr F Hello World source=xyz",
  "attributes": {
    "log": "Hello World source=xyz",
    "log.iostream": "stderr",
    "logtag": "F",
    "time": "2023-12-27T12:14:05.227298808+00:00"
  },
  "resource": {
    "com.splunk.source": "/var/log/pods/(path to my file)",
    "com.splunk.sourcetype": "kube:container:(container name)",
    "k8s.container.name": "(container name)",
    "k8s.container.restart_count": "0",
    "k8s.namespace.name": "default",
    "k8s.pod.name": "(pod name)",
    "k8s.pod.uid": "(pod uid)"
  },
  "severity": 0,
  "scope_name": ""
}

 

Notice how every piece of information extracted from the log message has a corresponding key in the entry’s attributes field. The regex_parser operator inserts values into attributes by default, but this behavior can be changed with the parse_to option.
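
For example, here is a minimal sketch of redirecting a parser's output with parse_to; the id, regex, and target field are made up for illustration:

- type: regex_parser
  id: example-parse-to
  regex: '^(?P<key>[^=]+)=(?P<value>.*)$'
  # write the extracted fields under attributes.parsed instead of directly into attributes
  parse_to: attributes.parsed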

As we can also see, there is a log.iostream key in our entry, even though we expected stream. This is because another operator later in the pipeline renames it; it looks like this:

 

- from: attributes.stream
  to: attributes["log.iostream"]
  type: move

 

This operator is used for simple move (or rename) operations; as we can see, it moves the stream field into log.iostream.

How do you use custom operators?

As an example, let’s consider the same log we saw earlier:

 

2023-12-27T12:14:05.227298808+00:00 stderr F Hello World source=xyz

 

What if we want to extract the source from the above message and set it as the com.splunk.source resource attribute? Doing that would allow us to assign custom source values based on the log message instead of the file path, which is the default behavior.

For such a use case, we may create the following operators:

 

- type: regex_parser
  id: my-custom-parser
  regex: '^.*source=(?P<source>[^ ]*).*$'
- type: copy
  from: attributes["source"]
  to: resource["com.splunk.source"]

 

If we then use them, the entry for our message will look like this:

 

{
  "timestamp": "2024-05-27T12:21:03.769505512Z",
  "body": "2023-12-27T12:14:05.227298808+00:00 stderr F Hello World source=xyz",
  "attributes": {
    "log": "Hello World source=xyz",
    "log.iostream": "stderr",
    "logtag": "F",
    "time": "2023-12-27T12:14:05.227298808+00:00",
    "source": "xyz"
  },
  "resource": {
    "com.splunk.source": "xyz",
    "com.splunk.sourcetype": "kube:container:(container name)",
    "k8s.container.name": "(container name)",
    "k8s.container.restart_count": "0",
    "k8s.namespace.name": "default",
    "k8s.pod.name": "(pod name)",
    "k8s.pod.uid": "(pod uid)"
  },
  "severity": 0,
  "scope_name": ""
}

 

Notice the source attribute, which is parsed out by the regex_parser we just created. Its value is then copied into resource["com.splunk.source"] by the copy operator.
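
If you would rather not keep the intermediate source attribute on the event after it has been copied, you could optionally append a remove operator to clean it up - a minimal sketch:

- type: remove
  # drop the temporary attribute once its value has been copied to the resource
  field: attributes["source"]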

Using custom operators with values.yaml

So, we learned how to create custom operators. But where do we specify them in my_values.yaml to actually use them? Enter extraOperators!

For the example discussed above, we will now update our configuration file with the following settings:

 

logsCollection:
  containers:
    extraOperators:
      - type: regex_parser
        id: my-custom-parser
        regex: '^.*source=(?P<source>[^ ]*).*$'
      - type: copy
        from: attributes["source"]
        to: resource["com.splunk.source"]

 

Now upgrade the Helm deployment and you’re good to go!

 

helm upgrade --install my-splunk-otel-collector --values my_values.yaml splunk-otel-collector-chart/splunk-otel-collector

 

Some operators that you might find useful

  • add - can be used to insert either a static value or an expression (see the sketch below)
  • remove - removes a field, useful for cleaning up unnecessary data after other operations
  • move - moves (or renames) a field
  • json_parser - useful when you want to parse data saved in JSON format
  • recombine - combines multi-line logs into one, a topic that we covered extensively in previous blog posts

And a lot more can be found in the operator documentation!
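
As promised above, here is a minimal sketch of the add operator inside a SOCK configuration; the attribute name and value are made up purely for illustration:

logsCollection:
  containers:
    extraOperators:
      - type: add
        # hypothetical static attribute, purely for illustration
        field: attributes.environment
        value: staging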

And some troubleshooting tips

So what if I’m not sure what my log entry looks like? I can’t possibly experiment with operators without that knowledge, right?

Correct! Before experimenting with operators, you should know the structure of your log entry, or else you might end up with faulty data or lots of annoying guesswork. And how do you find out that structure? You can use the stdout operator:

 

logsCollection:
  containers:
    extraOperators:
      - type: stdout
        id: my-custom-stdout

 

Use the above config and upgrade the Helm deployment. Now run the kubectl logs pod_name command (using the name of a collector agent pod) and you’ll notice a bunch of logs containing JSON entries.

That’s what your entries look like, and that’s how you can debug your operators. Remember to remove the stdout operator once you’re done debugging.

Conclusion

In this article, we’ve explored some ways of using operators to extract information from logs. This very powerful feature can be used to parse logs in more complex ways than the basic configuration provides.

On the other hand, it is important not to overcomplicate things - if you can extract data using built-in functionality, then do so. SOCK provides built-in parsing for many commonly used data formats, and using it is much simpler.
