Dear Members,
We are developing a Splunk add-on, and we need to find the octet (byte) count of a data field in an INGEST_EVAL statement in transforms.conf.
Currently we are using len(_raw), but this is problematic: if the data to be transferred contains multi-byte UTF-8 characters, the character count returned by len() will not match the actual number of bytes used to store the message.
Is there a function that returns the byte-count in a single step, or do we have to employ some kind of magic to get this information?
Thanks in advance!
Hi @wowbaggerHU
Could you use something like this?
# transforms.conf
[YourTransform]
INGEST_EVAL = fieldByteLen=len(replace(_raw, "[^\x00-\x7F]", "XX"))
Are you sure about this?
| windbag
| eval bytes=len(replace(sample, "[^\x00-\x7F]", "XX"))
| eval len=len(sample)
| table sample len bytes _raw
If I check, for example, the Tamil one, I get len=58 (which seems wrong; I remember a thread about eastern scripts, so I picked this one deliberately 😉) and bytes=108, but if I copy-paste the example into a text file on my disk, the file size doesn't match.
An additional question to the OP: what is the goal you want to achieve? Because if it has something to do with measuring the actual input, it's already too late; transforms kick in after the initial UTF normalization.
We are doing some magic in transforms.conf, and we want to send out log data with a forged syslog header and framing. For the framing we need the byte count, not the character count.
Your example illustrates the problem very well.
I guess most of the characters were two-byte ones, and perhaps the spaces were single-byte, which is why the result is not exactly double the length.
However, one needs to handle three- and four-byte UTF-8 characters too.
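For reference, the standard UTF-8 widths by code point range (nothing Splunk-specific here):
U+0000  - U+007F   -> 1 byte
U+0080  - U+07FF   -> 2 bytes
U+0800  - U+FFFF   -> 3 bytes
U+10000 - U+10FFFF -> 4 bytes
Tamil falls in the three-byte range, which is why the "XX" estimate in your example undershoots the real on-disk size.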
Then again, at the transforms stage you will no longer have the raw contents of the original data stream.
For example, on Windows the default CHARSET is AUTO, which means that Splunk will try to guess based on whatever you throw at it. So it will happily convert CP-1250 or UTF-16 to UTF-8, and your "corrected" count will still not match the original data size.
On Linux it's UTF-8 by default, but it can be overridden.
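If you want the decoding to be deterministic, you can pin it per sourcetype in props.conf; a minimal sketch (the sourcetype name is a placeholder):
# props.conf -- pin the input encoding instead of relying on AUTO detection
[your_sourcetype]
CHARSET = UTF-8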
OK. Scratch all that. You actually want to _send_ data. You want to use octet-count framing? That's kinda unusual. I know it's RFC-compliant and all but hardly anyone uses it in the wild. The only advantage over normal trailer-based framing is if you have multiline events.
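For reference, octet-counted framing (RFC 6587) prefixes each syslog message with its byte length and a space, so a 9-byte message would look like this on the wire:
9 <13>hello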
Well, we work with our own syslog-ng based solution on the receiving side, and we have the know-how to make use of all of its features.
The thing is, we don't want to venture into the legally gray area surrounding the reverse engineering of S2S, and SC4S already has some pre-made open source configs in place to achieve the goal of sending logs (with specially crafted headers and framing) from a Splunk HWF to syslog-ng. We are just taking it a few steps further.
We just noticed this problem, and their approach has the same flaw as ours, so I notified them about it.
Yeah. But if your events are small enough, you can send them over UDP and have natural event breaking. If you don't have multiline events, you can break on line endings. If you have long events, you could try doing some magic with the original line endings (substitute them with something?) and decode them on the receiving side.
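Something like this might work as a starting point (an untested sketch; the stanza name and the #NL# token are made up, so pick a token that cannot occur in your data):
# transforms.conf -- encode embedded newlines before forwarding
[encode_newlines]
INGEST_EVAL = _raw=replace(_raw, "\n", "#NL#")
You would then reverse the substitution in the syslog-ng config on the receiving side.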
Thanks for the suggestion, we are looking at this possibility.
Unfortunately there are four-, three- and two-byte UTF-8 characters, so we will likely have to do separate passes for each of those.
I hope there'll be a better approach, because this likely has a less-than-ideal performance impact.
But apart from that, this way may prove to be viable.
Unfortunately, that seems to be the only viable solution. After the "entry point" to the pipeline, where Splunk decodes the data stream into UTF-8, it works on code points, so any text operation works on whole characters regardless of their byte width. So you have to explicitly replace known ranges of code points with two-, three- and four-character sequences.
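A rough sketch of what that could look like (untested; it assumes replace() accepts PCRE \x{...} code point escapes at ingest time, and the stanza and field names are placeholders):
# transforms.conf -- estimate the UTF-8 byte count of _raw
# Each pass widens one code point range to its UTF-8 byte width using ASCII filler;
# the filler itself is ASCII, so the later passes leave it untouched.
[utf8_byte_count]
INGEST_EVAL = fieldByteLen=len(replace(replace(replace(_raw, "[\x{10000}-\x{10FFFF}]", "AAAA"), "[\x{0800}-\x{FFFF}]", "AAA"), "[\x{0080}-\x{07FF}]", "AA"))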