Splunk Search

How do you use SED on a heavy forwarder to drop log data after the specified character count?

Explorer

We are going to be pushing our logs through a heavy forwarder, so we have the ability to truncate a certain part of our logs to a reasonable size. Right now this log event can have up to 500k characters, but we want to limit this one piece without truncating other parts of the log event data. I was told we could use SEDCMD with regex to strip off the rest of the specific event.

Is it possible to use a character count with SEDCMD so we can just keep a specific amount of characters?

SplunkTrust

Splunk event processing provides an attribute called TRUNCATE (see http://docs.splunk.com/Documentation/Splunk/7.2.0/Admin/Propsconf#Line_breaking) which defines, at the sourcetype/host/source level, the maximum length of an event. I believe you could use that to truncate your long events, keeping only a certain number of bytes from the start of the event.

SEDCMD can be used, but it will be very expensive in terms of resources.

Ultra Champion

I converted this to an answer because @somesoni2's guidance would solve this.

While SED may be able to do this, the approach may not be successful, for a few reasons: the SED feature may not be as performant as other options; introducing the HF will cause larger payloads to be sent to the indexer (and more data to decrypt); and every hop in the data flow (here, the HF) is another point of failure.

The TRUNCATE setting would serve you well. Alternatively, you could likely use a transform to rewrite _raw for this truncation, or even go with SED, but do it all on the indexer rather than introducing a HF earlier in the flow.

When it comes to props.conf, remember that you can scope a stanza by sourcetype OR source OR host. But I don't think the scope of the props stanza definitions would get in your way here.

Of course, feel free to provide more specifics and we can help craft a specific solution for ya.

Splunk Employee

@FIS1

As you shared with me, the longer log entries you want to truncate contain XML documents that contain elements with unlimited cardinality. We discussed a "smarter" way to truncate those events: limit the unlimited repeating elements to a set of five, removing 6 through N and leaving the rest of the message intact.

SEDCMD-remove-6-N-Transactions = s/(^.*?)((?:.*?<\/ma1:Transaction>){5})(.*<\/ma1:Transaction>)(?=<\/ma1:TransactionInfo>)(.*])/\1\2\4/g

The first capture group grabs the header of the log entry and start of the XML message through the tag.

The second capture group grabs the first five occurrences of the ma1:Transaction elements.

The third capture group grabs the remaining ma1:Transaction elements.

The fourth capture group grabs the remainder of the XML message and the trailing characters of the log entry.

Then, I have the SEDCMD simply keep the first, second and fourth capture groups before indexing in Splunk.
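
One way to sanity-check this capture-group logic outside Splunk is a small Python sketch. The pattern below is written with explicit `.*?` and `.*` quantifiers, and the sample event layout and the `make_event` helper are invented purely for illustration. Python's `re` module is not identical to the PCRE engine Splunk uses, so treat this as an approximation rather than a definitive test:

```python
import re

# Same four capture groups as described above; groups 1, 2 and 4 are kept.
PATTERN = re.compile(
    r"(^.*?)"                          # 1: header / start of the XML message
    r"((?:.*?</ma1:Transaction>){5})"  # 2: first five Transaction elements
    r"(.*</ma1:Transaction>)"          # 3: Transaction elements 6..N (dropped)
    r"(?=</ma1:TransactionInfo>)"      # lookahead: closing container tag
    r"(.*\])"                          # 4: rest of the message and trailer
)


def keep_first_five(event):
    """Drop Transaction elements 6..N; events with five or fewer are unchanged."""
    return PATTERN.sub(r"\1\2\4", event)


def make_event(n):
    """Hypothetical single-line sample event with n Transaction elements."""
    txns = "".join(
        f"<ma1:Transaction>{i}</ma1:Transaction>" for i in range(1, n + 1)
    )
    return (
        "01/02/2019 10:11:12,123 INFO "
        f"[<ma1:TransactionInfo>{txns}</ma1:TransactionInfo>]"
    )


long_event = make_event(8)
print(len(long_event), "->", len(keep_first_five(long_event)))
```

Because the third group requires at least one more closing tag after the first five, an event with five or fewer elements simply fails to match and passes through untouched.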

After testing, you confirmed it works as desired, and you are now reliably achieving an average message size of 5-6K.

Here is the props.conf I used:

[fis:log]
DATETIME_CONFIG =
MAX_EVENTS = 10000
TRUNCATE = 0
TIME_PREFIX = ^
TIME_FORMAT = %m/%d/%Y %k:%M:%S,%3N
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n])(?=\d{4}-\d{2}-\d{2}\s)
SEDCMD-remove-6-N-Transactions = s/(^.*?)((?:.*?<\/ma1:Transaction>){5})(.*<\/ma1:Transaction>)(?=<\/ma1:TransactionInfo>)(.*])/\1\2\4/g

Where TRUNCATE = 0 instructs Splunk to capture ALL original characters of the inbound event and MAX_EVENTS = 10000 limits that capture to 10000 lines.

I've done lots of creative things with SEDCMD and TRANSFORMS in my time with Splunk. I hope this helps someone else employ SEDCMD for similar purposes.

Regards.

Explorer

Yes, we are using TRUNCATE now, but we want to keep only a certain number of characters of this one log line and let the rest of the log lines be as long as they like. From what I understand, TRUNCATE truncates everything in a source down to that value; it doesn't look for a specific line in that source and truncate just that one.

Splunk Employee

Is this the only reason you are switching to a HF?
Does "pushing through a HF" mean "using a HF as an intermediary forwarder"?
What is the goal: (A) Save network bandwidth or (B) minimize license consumption?
How many of these "special cases" are there?

An alternative to forcing the switch to a HF would be to consider INVALID_CAUSE / UNARCHIVE_CMD in props.conf to invoke a custom script that identifies and truncates the relevant events before forwarding, while copying everything else verbatim. This works on a UF and will allow you to continue to use a UF, which has a much smaller network footprint than the HF.
Yes, it involves creating a script, but that is rather simple, unless you have lots of unique events that need to be subjected to this treatment.
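
As a rough idea of what such a script could look like, here is a minimal Python sketch. It assumes the script receives the raw data on stdin and writes the result to stdout, that each event sits on one line, and that the events to shorten can be recognized by a marker string. `MARKER`, `MAX_CHARS` and the function names are hypothetical placeholders, not anything defined by Splunk:

```python
import sys

MARKER = "<ma1:TransactionInfo>"  # hypothetical: identifies the oversized events
MAX_CHARS = 6000                  # hypothetical: characters to keep per event


def trim_line(line, marker=MARKER, max_chars=MAX_CHARS):
    """Truncate only lines containing the marker; copy everything else verbatim."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # preserve the original line ending
    if marker in body and len(body) > max_chars:
        body = body[:max_chars]
    return body + ending


def filter_stream(lines, marker=MARKER, max_chars=MAX_CHARS):
    """Lazily apply trim_line to every line of an iterable."""
    for line in lines:
        yield trim_line(line, marker, max_chars)


# To run as a stdin -> stdout filter:
#     sys.stdout.writelines(filter_stream(sys.stdin))
```

A plain character cut like this can leave the XML unbalanced, so if the events must stay well-formed the trimming logic would need to be smarter than this sketch.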

SplunkTrust

SEDCMD is not designed for this kind of task, but it may be usable here. To reiterate: depending on your regex and the amount of data you index (the regex is evaluated against all events of that sourcetype/source/host), it could be resource intensive. There could also be a limit on how much data your regex can process. Having said that, you'd want to test something like the following. Your regex should be able to identify the events it should apply to, say, based on some unique key in the event.

props.conf on your Heavy forwarder

[yoursourcetypeHere]
---other line breaking and timestamp recognition stuffs---
# Assuming you want to truncate events that have, say, X characters before your "UniquePattern", keeping Y characters after it.
# Replace X and Y with actual numbers, and replace UniquePattern with your own.
SEDCMD-keepspecificparts = s/^(.{X})(UniquePattern)(.{Y})(.+)/\1\2\3/
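
To see how that substitution behaves before putting it in props.conf, the same idea can be tried in Python. This is only a sketch: the function names and sample values are made up, and Python's `re` is not the exact regex engine Splunk uses. Note that `^(.{X})` matches only when the unique pattern starts exactly X characters into the event, and `(.+)` requires at least one character beyond the Y that are kept:

```python
import re


def build_keep_pattern(x, marker, y):
    """Regex equivalent of ^(.{X})(UniquePattern)(.{Y})(.+) for concrete values."""
    return re.compile(rf"^(.{{{x}}})({re.escape(marker)})(.{{{y}}})(.+)")


def truncate_around(event, x, marker, y):
    """Keep x chars, the marker, and y chars after it; drop the rest.

    Events that do not fit the pattern are returned unchanged.
    """
    return build_keep_pattern(x, marker, y).sub(r"\1\2\3", event)


# Hypothetical event: 10 header chars, a marker, then a long payload.
event = "A" * 10 + "KEY=" + "B" * 50
print(truncate_around(event, 10, "KEY=", 5))
```

If the marker can appear at varying offsets, the fixed `.{X}` prefix would need to be relaxed (for example to a lazy `.*?`), at the cost of more regex work per event.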

Path Finder

As you can see, this type of ingestion time modification of data is tough in Splunk. This SEDCMD approach could work, but getting your configuration right will be a major pain. For what it's worth, not sure if you're amenable to a 3rd party solution, but this is something Cribl (https://cribl.io/) makes easy that's very hard to get right in props/transforms.
