Dear Members,
We are developing a Splunk add-on, and we need to find the octet (byte) count of a data field in an INGEST_EVAL statement in transforms.conf.
Currently we are using len(_raw), but this is problematic: if the data to be transferred contains multi-byte UTF-8 characters, the character count returned by len() will not match the actual number of bytes used to store the message.
Is there a function that returns the byte-count in a single step, or do we have to employ some kind of magic to get this information?
Thanks in advance!
Hi @wowbaggerHU
Could you use something like this?
# transforms.conf
[YourTransform]
INGEST_EVAL = fieldByteLen=len(replace(_raw, "[^\x00-\x7F]", "XX"))
Are you sure about this?
| windbag
| eval bytes=len(replace(sample, "[^\x00-\x7F]", "XX"))
| eval len=len(sample)
| table sample len bytes _raw
If I check, for example, the Tamil one, I get len=58 (which seems wrong; I remember a thread about eastern scripts, so I picked this one deliberately 😉) and bytes=108, but if I copy-paste the example into a text file on my disk, the file size doesn't match.
An additional question to the OP: what is the goal you want to achieve? Because if it has something to do with measuring the actual input, it's already too late; transforms kick in after the initial UTF normalization.
We are doing some magic in transforms.conf, and we want to send out log data with a forged syslog header and framing. For the framing we need the byte count, not the character count.
Your example illustrates the problem very well.
I guess most of the characters were two-byte ones, and perhaps the spaces were single-byte, which is why the result is not exactly double the length.
However, one needs to handle three- and four-byte UTF-8 characters too.
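For reference, the standard UTF-8 widths by code point range (nothing Splunk-specific here):
U+0000  - U+007F   -> 1 byte
U+0080  - U+07FF   -> 2 bytes
U+0800  - U+FFFF   -> 3 bytes
U+10000 - U+10FFFF -> 4 bytes
Tamil falls in the three-byte range, which is why the "XX" estimate in your example undershoots the real on-disk size.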
Then again, at the transforms stage you will no longer have the raw contents of the original data stream.
For example, on Windows the default CHARSET is AUTO, which means that Splunk will try to guess based on whatever you throw at it. So it will happily convert CP-1250 or UTF-16 to UTF-8, and your "corrected" count will still not match the original data size.
On Linux it's UTF-8 by default, but it can be overridden.
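If you want the decoding to be deterministic, you can pin it per sourcetype in props.conf; a minimal sketch (the sourcetype name is a placeholder):
# props.conf -- pin the input encoding instead of relying on AUTO detection
[your_sourcetype]
CHARSET = UTF-8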
OK. Scratch all that. You actually want to _send_ data. You want to use octet-count framing? That's kinda unusual. I know it's RFC-compliant and all but hardly anyone uses it in the wild. The only advantage over normal trailer-based framing is if you have multiline events.
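For reference, octet-counted framing (RFC 6587) prefixes each syslog message with its byte length and a space, so a 9-byte message would look like this on the wire:
9 <13>hello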
Well, we work with our own syslog-ng based solution on the receiving side, and we have the know-how to make use of all of its features.
The thing is, we don't want to venture into the legally gray area surrounding the reverse engineering of S2S, and SC4S already has some pre-made open source configs in place to achieve the goal of sending logs (with specially crafted headers and framing) from a Splunk HWF to syslog-ng. We are just taking it a few steps further.
We just noticed this problem, and their approach has the same flaw as ours, so I notified them about it.
Yeah. But if your events are small enough, you can send them over UDP and have natural event breaking. If you don't have multiline events, you can break on line endings. If you have long events, you could try doing some magic with the original line endings (substitute them with something?) and decode them on the receiving side.
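Something like this might work as a starting point (an untested sketch; the stanza name and the #NL# token are made up, so pick a token that cannot occur in your data):
# transforms.conf -- encode embedded newlines before forwarding
[encode_newlines]
INGEST_EVAL = _raw=replace(_raw, "\n", "#NL#")
You would then reverse the substitution in the syslog-ng config on the receiving side.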
Thanks for the suggestion, we are looking at this possibility.
Unfortunately there are four-, three- and two-byte UTF-8 characters, so we will likely have to do separate passes for each of those.
I hope there'll be a better approach, because this likely has a less-than-ideal performance impact.
But apart from that, this way may prove to be viable.
Unfortunately, that seems to be the only viable solution. After the "entry point" to the pipeline, where Splunk decodes the data stream into UTF-8, it works on code points, so any text operation works on whole characters regardless of their byte width. So you have to explicitly replace known ranges of code points with two-, three- and four-character sequences.
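A rough sketch of what that could look like (untested; it assumes replace() accepts PCRE \x{...} code point escapes at ingest time, and the stanza and field names are placeholders):
# transforms.conf -- estimate the UTF-8 byte count of _raw
# Each pass widens one code point range to its UTF-8 byte width using ASCII filler;
# the filler itself is ASCII, so the later passes leave it untouched.
[utf8_byte_count]
INGEST_EVAL = fieldByteLen=len(replace(replace(replace(_raw, "[\x{10000}-\x{10FFFF}]", "AAAA"), "[\x{0800}-\x{FFFF}]", "AAA"), "[\x{0080}-\x{07FF}]", "AA"))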