Solved: Re: How to transform data prior to indexing

Conradj · ‎04-01-2012

Hi,

I am using Splunk to collect perfmon data from my servers as well. However, the data I am indexing is currently very raw, and I believe it is consuming much more space in the index than it really should. I am compensating for this by reducing the collection frequency, but this is less than ideal because I then lose resolution over time. I'd rather have a slightly less accurate value, more often.

What I really want to do is transform this data as part of the collection process so that it consumes less space in the indexes, e.g.:

Currently:
% Processor Time = 25.1536939345647248
Memory MBytes free = 10182
% disk Space free = 49.753237234302468
NIC RX bytes/in = 690.58168126768078
NIC TX bytes/in = 949.90804335349833

What I would like to do is transform all of these values so that I get say, 6 significant figures only.

Transformed to consume less space:
% Processor Time = 25.1536
Memory MBytes free = 10182.0
% disk Space free = 49.7532
NIC RX bytes/in = 690.581
NIC TX bytes/in = 949.908

I use deployment server, so if this transformation could be done as part of the collection, even better.

Has anyone done anything like this?

dwaddle · ‎04-01-2012

The traditional approach to transforming just-prior-to-indexing is to use SEDCMD. With the right regular expression, this should work. However, it may not be entirely pretty and/or mathematically correct.

(props.conf)

[mysourcetype]
SEDCMD-foo = s/=(\s+)([0-9.]{7})(\d+)/=\1\2/

Note I've not tested this regular expression, it may not work properly at all ...
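
For anyone who wants to sanity-check that expression before deploying it, here is a minimal Python sketch that applies the same pattern to the sample lines from the question. It assumes Python's `re` module treats this pattern the same way Splunk's regex engine does, which is an assumption and not something confirmed in this thread. The helper name `truncate_value` is hypothetical.

```python
import re

# Same pattern as the SEDCMD above: "=", whitespace, exactly 7 characters
# of digits/dots, then one or more trailing digits that get dropped.
PATTERN = re.compile(r"=(\s+)([0-9.]{7})(\d+)")


def truncate_value(line: str) -> str:
    """Hypothetical helper: keep the first 7 value characters (first match only)."""
    return PATTERN.sub(r"=\1\2", line, count=1)


samples = [
    "% Processor Time = 25.1536939345647248",
    "Memory MBytes free = 10182",
    "NIC RX bytes/in = 690.58168126768078",
    "Tiny value = 0.000001234567",
]

for sample in samples:
    print(truncate_value(sample))
```

Under that assumption, the long values are cut to 7 characters, the short memory value is left untouched because it never reaches 7 characters, and a very small value collapses to `0.00000`. This is plain truncation, not rounding.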

cvajs · ‎04-04-2012

OK, well, you don't need math to achieve this space-saving endeavor. Using scientific notation simply changes the units via the notation, and you don't save bytes simply by converting to scientific notation; you only save bytes if you round off or truncate. You can SEDCMD the input data and normalize it to, say, E+3 or E+6 (or whatever), then round or truncate. Then perhaps do a custom field extraction and name it with the units attached, e.g. "kB" or "MB". You can round off, truncate, and normalize with SEDCMD.

Conradj · ‎04-03-2012

ok,

Thank you to all involved in this question. I think the answer is clear.

dwaddle hit the nail on the head. Splunk doesn't really see a number; it just sees a piece of information that takes up a number of characters. Regex is perfect for finding patterns in strings, but it doesn't do math for you.

This is fundamentally a math problem. There doesn't seem to be a way to perform math on incoming data. One day there might be, but for this particular problem Splunk isn't actually the right tool for the job anyway.

I will stop collecting this perfmon data in Splunk and collect it in Nagios. This way I can use more of our Splunk license on ingesting application logs, which will give us the best value. We can still correlate application events against high CPU, high network IO, etc.; we just won't be able to do it within the same tool!

Cheers!

C.

Conradj · ‎04-03-2012

It's just an example, but I could use a really big number to demonstrate as well, e.g. my NIC reports that the current Tx rate is 3949787710 bytes/s (10 characters). Would you rather see that 10-digit number eating away at your index, or would you rather see 3.950E9 bytes/s (7 characters) instead?

The latter consumes 30% less space in your index with only a minimal loss of precision over the original value.
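
The arithmetic behind that 30% figure can be checked with a short Python sketch. The function name `compact_sci` is hypothetical, and it only illustrates the notation; it is not something Splunk provides at index time.

```python
def compact_sci(value: float, decimals: int = 3) -> str:
    """Hypothetical helper: format as mantissa + 'E' + exponent without sign or zero padding."""
    mantissa, exponent = f"{value:.{decimals}E}".split("E")
    return f"{mantissa}E{int(exponent)}"


raw = "3949787710"                  # NIC Tx rate as reported, 10 characters
short = compact_sci(float(raw))     # compact scientific notation
saving = 1 - len(short) / len(raw)  # fraction of characters saved

print(raw, len(raw))
print(short, len(short))
print(f"{saving:.0%} fewer characters")
```

Python's default `E` format writes the exponent as `E+09`, so the sketch strips the sign and padding to match the 7-character form used in the post.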

cvajs · ‎04-03-2012

What is the significance of 1.23457x10^-6? That is a rather small number. If 10E-6 with hundredths at that unit scale is the resolution you need, then regex out 8 digits, giving 0.00000123.

dwaddle · ‎04-02-2012

Yeah, you'll need something way more powerful than this SEDCMD-based approach. Part of your problem is that at this point in the indexing process, Splunk really doesn't know what is a "number" and what isn't -- it only sees strings of characters.

Conradj · ‎04-01-2012

I wondered if a regex might work (and it does for the cases I have shown above, apart from memory).

But I think I need it to be a bit smarter and reformat a number into scientific notation.

Otherwise an input of say 0.000001234567 will become 0.00000 if I use regex, so it really needs to be shown as 1.23457x10^-6
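
To illustrate the difference between truncating characters and keeping significant figures, here is a minimal Python sketch. It assumes the values could be reformatted by a script before they ever reach Splunk; that placement is an assumption, not something confirmed in this thread, and the function name `to_sig_figs` is hypothetical.

```python
def to_sig_figs(value: str, digits: int = 6) -> str:
    """Hypothetical helper: round a numeric string to `digits` significant figures."""
    # The 'g' format rounds to significant figures and switches to
    # exponent notation automatically for very small or very large values.
    return f"{float(value):.{digits}g}"


for raw in ["25.1536939345647248", "10182", "690.58168126768078", "0.000001234567"]:
    print(raw, "->", to_sig_figs(raw))
```

Unlike a fixed-width regex, this rounds the value and keeps the small number as `1.23457e-06` instead of collapsing it to `0.00000`.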
