How to transform data prior to indexing

Conradj
Path Finder

Hi,

I am using Splunk to collect perfmon data from my servers as well. However, the data I am indexing is currently very raw, and I believe it's consuming much more space in the index than it really should. I am compensating for this by reducing the collection frequency, but this is less than ideal because I then lose resolution over time. I'd rather have a slightly less accurate value, more often.

What I really want to do is transform this data as part of the collection process so that it consumes less space in the indexes, e.g.:

Currently:
% Processor Time = 25.1536939345647248
Memory MBytes free = 10182
% disk Space free = 49.753237234302468
NIC RX bytes/in = 690.58168126768078
NIC TX bytes/in = 949.90804335349833

What I would like to do is transform all of these values so that I get, say, 6 significant figures only.

Transformed to consume less space:
% Processor Time = 25.1536
Memory MBytes free = 10182.0
% disk Space free = 49.7532
NIC RX bytes/in = 690.581
NIC TX bytes/in = 949.908

I use deployment server, so if this transformation could be done as part of the collection, even better.

Has anyone done anything like this?

dwaddle
SplunkTrust

The traditional approach to transforming just-prior-to-indexing is to use SEDCMD. With the right regular expression, this should work. However, it may not be entirely pretty and/or mathematically correct.

(props.conf)

[mysourcetype]
SEDCMD-foo = s/=(\s+)([0-9.]{7})(\d+)/=\1\2/

Note that I've not tested this regular expression; it may not work properly at all ...
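
To see what this substitution would do to the sample values in the question, here is a minimal sketch that replays the same pattern in Python. Python's `re` module is only a stand-in for Splunk's regex engine (an assumption; the two should agree for this simple pattern), and `truncate_value` is a hypothetical helper name, not a Splunk function.

```python
import re

# Python stand-in for the SEDCMD expression s/=(\s+)([0-9.]{7})(\d+)/=\1\2/
PATTERN = re.compile(r"=(\s+)([0-9.]{7})(\d+)")

def truncate_value(line: str) -> str:
    """Keep the first 7 characters of the value and drop the digits after them."""
    # count=1 mirrors a sed substitution without the g flag (first match only).
    return PATTERN.sub(r"=\1\2", line, count=1)

samples = [
    "% Processor Time = 25.1536939345647248",
    "Memory MBytes free = 10182",           # shorter than 7 characters: left alone
    "NIC RX bytes/in = 690.58168126768078",
    "NIC TX bytes/s = 3949787710",          # long integer: magnitude is lost
    "value = 0.000001234567",               # small value: collapses to 0.00000
]

for line in samples:
    print(truncate_value(line))
```

The last two samples show why the expression is not mathematically correct: it cuts characters rather than rounding, so a 10-digit integer loses its last three digits and a very small value loses all of its significant digits.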

cvajs
Contributor

OK, well, you don't need math to achieve this space-saving endeavor. Using scientific notation simply modifies the units via # notation, and you don't save byte space simply by converting to scientific notation; you only save byte space if you round off/truncate. You can SEDCMD the input data and normalize it to, say, E+3 or E+6 (whatever, etc.), then round/truncate. Then perhaps do a custom field extraction and name it with units attached, e.g. "kB" or "MB" or whatever. You can round off/truncate/normalize with SEDCMD, etc.

Conradj
Path Finder

OK,

Thank you to all involved on this question. I think the answer is clear.

dwaddle hit the nail on the head. Splunk doesn't really see a number; it just sees a piece of information that takes up a number of characters. Regex is perfect for finding patterns in strings, but it doesn't do math for you.

This is fundamentally a math problem. There doesn't seem to be a way to perform math on incoming data; one day there might be, but for this particular problem Splunk isn't actually the right tool for the job anyway.

I will stop collecting this perfmon data in Splunk and collect it in Nagios. This way I can use more of our Splunk license on ingesting application logs that will give us the best value. We can still correlate application events against high CPU, high network IO, etc.; we just won't be able to do it within the same tool!

Cheers!

C.

Conradj
Path Finder

It's just an example, but I could use a really big number to demonstrate as well, e.g. my NIC reports that the current Tx rate is 3949787710 bytes/s (10 char). Would you rather see that 10-digit number eating away at your index, or would you rather see it storing 3.950E9 bytes/s (7 char) instead?

The latter consumes 30% less space in your index with only a minimal loss of precision over the original value.
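
The arithmetic behind that figure can be checked with a short sketch. It assumes the value is available as a number before it is written out; `to_compact_sci` is a hypothetical helper, and the compact `E9` style (no plus sign, no zero-padded exponent) is chosen only to match the example above.

```python
def to_compact_sci(value: float, sig_figs: int = 4) -> str:
    """Format value in scientific notation with sig_figs significant figures."""
    decimals = sig_figs - 1
    # Python produces e.g. "3.950e+09"; split it so the exponent can be compacted.
    mantissa, exponent = f"{value:.{decimals}e}".split("e")
    return f"{mantissa}E{int(exponent)}"

raw = "3949787710"
compact = to_compact_sci(int(raw))

# Fraction of characters saved by storing the compact form instead of the raw one.
saving = 1 - len(compact) / len(raw)

print(raw, len(raw))          # 3949787710 10
print(compact, len(compact))  # 3.950E9 7
print(f"{saving:.0%}")        # 30%
```

Note that the saving comes from rounding to four significant figures, not from the notation itself; keeping all ten digits in scientific notation would be longer than the original.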

cvajs
Contributor

What is the significance of 1.23457x10^-6? That is a rather small number. If 10E-6 with hundredths at that unit scale is the resolution you need, then regex out 8 digits, so 0.00000123.

dwaddle
SplunkTrust

Yeah, you'll need something way more powerful than this SEDCMD-based approach. Part of your problem is that at this point in the indexing process, Splunk really doesn't know what is a "number" and what isn't -- it only sees strings of characters.

Conradj
Path Finder

I wondered if a regex might work (and it does for the cases I have shown above, apart from memory).

But I think I need it to be a bit smarter and reformat a number into scientific notation.

Otherwise an input of, say, 0.000001234567 will become 0.00000 if I use regex, so it really needs to be shown as 1.23457x10^-6.
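
For comparison, here is a minimal sketch of the numeric rounding being asked for, written in Python on the assumption that the values could be reformatted by a script before they reach Splunk (nothing in this thread confirms such a step exists in the collection path). `to_sig_figs` is a hypothetical helper built on Python's standard `g` format type, which rounds to a number of significant figures and switches to scientific notation for very small or very large values.

```python
def to_sig_figs(value: float, sig_figs: int = 6) -> str:
    """Round value to sig_figs significant figures, as a string."""
    # The "g" format type rounds (it does not truncate) and picks
    # fixed or scientific notation depending on the exponent.
    return f"{value:.{sig_figs}g}"

for value in (0.000001234567, 25.1536939345647248, 10182, 690.58168126768078):
    print(to_sig_figs(value))
```

Unlike character truncation, this keeps the significant digits of the small value (`1.23457e-06`) and leaves the short integer `10182` unchanged. Because it rounds rather than truncates, some results differ in the last digit from the truncated examples in the question, for instance `25.1537` instead of `25.1536`.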
