Hi,
I am using Splunk to collect perfmon data from my servers as well. However, the data I am currently indexing is very raw, and I believe it consumes much more space in the index than it really should. I am compensating for this by reducing the collection frequency, but this is less than ideal because I then lose resolution over time. I'd rather have a slightly less accurate value, more often.
What I really want to do is transform this data as part of the collection process so that it consumes less space in the indexes, e.g.
Currently:
% Processor Time = 25.1536939345647248
Memory MBytes free = 10182
% disk Space free = 49.753237234302468
NIC RX bytes/in = 690.58168126768078
NIC TX bytes/in = 949.90804335349833
What I would like to do is transform all of these values so that I keep, say, only six significant figures.
Transformed to consume less space:
% Processor Time = 25.1536
Memory MBytes free = 10182.0
% disk Space free = 49.7532
NIC RX bytes/in = 690.581
NIC TX bytes/in = 949.908
I use the deployment server, so if this transformation could be done as part of the collection, even better.
Has anyone done anything like this?
The traditional approach to transforming data just prior to indexing is to use SEDCMD. With the right regular expression, this should work. However, it may not be entirely pretty and/or mathematically correct.
(props.conf)
[mysourcetype]
SEDCMD-foo = s/=(\s+)([0-9.]{7})(\d+)/=\1\2/
Note that I've not tested this regular expression; it may not work properly at all ...
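As a quick sanity check, here is what that substitution does when replayed with Python's `re` module (an illustrative sketch only, not verified Splunk config — SEDCMD uses sed-style syntax, but the pattern semantics are the same). Note it is a character-level truncation, not rounding, and a value shorter than seven characters is left untouched:

```python
import re

# Keep the first 7 value characters after "=" and drop the remaining digits.
pattern = re.compile(r"=(\s+)([0-9.]{7})(\d+)")

samples = [
    "% Processor Time = 25.1536939345647248",
    "Memory MBytes free = 10182",            # too short to match: unchanged
    "% disk Space free = 49.753237234302468",
]

for line in samples:
    print(pattern.sub(r"=\1\2", line))
# % Processor Time = 25.1536
# Memory MBytes free = 10182
# % disk Space free = 49.7532
```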
OK, well, you don't need math to achieve this space-saving endeavour. Scientific notation simply changes the units via exponent notation, and you don't save byte space just by converting to scientific notation; you only save byte space if you round off or truncate. You can SEDCMD the input data and normalize it to, say, E+3 or E+6 (whatever), then round/truncate. Then perhaps do a custom field extraction and name it with the units attached, e.g. "kB" or "MB". You can round off, truncate, and normalize with SEDCMD.
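For contrast, here is a sketch of what proper round-off to six significant figures would look like if the math happened outside Splunk — for instance in a hypothetical scripted-input wrapper before the data is forwarded (SEDCMD itself cannot do arithmetic). Python's `.6g` format rounds rather than truncates:

```python
# The perfmon values from the question, as raw strings.
values = {
    "% Processor Time": "25.1536939345647248",
    "Memory MBytes free": "10182",
    "% disk Space free": "49.753237234302468",
}

for name, raw in values.items():
    rounded = format(float(raw), ".6g")  # 6 significant figures, rounded
    print(f"{name} = {rounded}  ({len(raw)} -> {len(rounded)} chars)")
# % Processor Time = 25.1537  (19 -> 7 chars)
# Memory MBytes free = 10182  (5 -> 5 chars)
# % disk Space free = 49.7532  (18 -> 7 chars)
```

Unlike the regex approach, this gives 25.1537 (correctly rounded) instead of 25.1536 (truncated).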
OK, thank you to all involved on this question. I think the answer is clear.
dwaddle hit the nail on the head. Splunk doesn't really see a number; it just sees a piece of information that takes up a number of characters. Regex is perfect for finding patterns in strings, but it doesn't do math for you.
This is fundamentally a math problem. There doesn't seem to be a way to perform math on incoming data (one day there might be), but for this particular problem Splunk isn't actually the right tool for the job anyway.
I will stop collecting this perfmon data in Splunk and collect it in Nagios instead. This way I can use more of our Splunk license on ingesting the application logs that will give us the best value. We can still correlate application events against high CPU, high network I/O, etc.; we just won't be able to do it within the same tool!
Cheers!
C.
It's just an example, but I could use a really big number to demonstrate as well. E.g. my NIC reports that the current Tx rate is 3949787710 bytes/s (10 characters). Would you rather see that 10-digit number eating away at your index, or would you rather see 3.950E9 bytes/s (7 characters) instead?
The latter consumes 30% less space in your index, with only a minimal loss of precision over the original value.
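The character arithmetic behind that 30% figure can be checked with a few lines of Python (a sketch; the `.replace("E+0", "E")` step compacts Python's default `E+09` exponent into the short `E9` form used above, and only handles single-digit positive exponents):

```python
raw = "3949787710"                                    # 10 characters as indexed
sci = format(float(raw), ".3E").replace("E+0", "E")   # compact scientific form
saving = 1 - len(sci) / len(raw)

print(sci, f"{saving:.0%}")
# 3.950E9 30%
```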
What is the significance of 1.23457x10^-6? That is a rather small number. If 10E-6 with hundredths at that unit scale is the resolution you need, then regex out 8 digits, so 0.00000123.
Yeah, you'll need something way more powerful than this SEDCMD-based approach. Part of your problem is that at this point in the indexing process, Splunk really doesn't know what is a "number" and what isn't -- it only sees strings of characters.
I wondered if a regex might work (and it does for the cases I have shown above, apart from memory), but I think it needs to be a bit smarter and reformat the number into scientific notation. Otherwise an input of, say, 0.000001234567 will become 0.00000 if I use a regex, so it really needs to be shown as 1.23457x10^-6.
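This failure mode is easy to demonstrate: a character-level cut (which is all a SEDCMD-style substitution can do) destroys a small value entirely, while a scientific-notation rendering keeps the significant digits. A minimal Python sketch of the comparison:

```python
small = "0.000001234567"

# Character-level truncation to 7 chars, as a regex would do:
truncated = small[:7]                     # all significant digits lost
# Scientific notation keeps 6 significant figures:
sci = format(float(small), ".5E")

print(truncated)   # 0.00000
print(sci)         # 1.23457E-06
```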