Splunk Cloud Platform

How to find the size of fields in a particular index?

hmasciave
Observer

We are looking to see the size of all the fields in a particular index. We have come up with the search below to see the size of a particular field, but we would like to see the size of all the fields in the index in order to understand where the bulk of the data is sitting.

index=index_name
| eval raw_len=(len(_raw)/1024/1024/1024)
| stats sum(raw_len) as GB by field_name
| sort -GB


hmasciave
Observer

Thank you both for the feedback. I am not sure what the value in the search results represents. Is it the number of events, the size in GB, etc.?

The search below works. Given the numerous fields, I just want to find the amount of data in each of these fields so we can see where most of our data is sitting. Seeing the amount of data in an index and sourcetype is great, but we need to dig deeper into the data to see where most of it is being logged.

index=index_name
| eval raw_len=(len(_raw)/1024/1024/1024)
| stats sum(raw_len) as GB by field_name
| sort -GB


PickleRick
SplunkTrust

Oh, and BTW, neither your search nor @richgalloway's gives the size properly. It returns a cumulative length measured in characters, not in bytes! Splunk supports Unicode (I'm not sure how it's encoded internally), so "a" is the same length as "ą", but it will most probably be a different size in memory/on disk.
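The character-vs-byte distinction is easy to demonstrate outside Splunk. A small Python sketch (Python only as an illustration; Splunk's internal encoding may differ):

```python
# len() counts characters; the encoded size depends on the encoding.
ascii_char = "a"
polish_char = "ą"

print(len(ascii_char), len(polish_char))        # both are 1 character long
print(len(ascii_char.encode("utf-8")))          # 1 byte in UTF-8
print(len(polish_char.encode("utf-8")))         # 2 bytes in UTF-8
```

So a search summing `len(...)` counts characters, and the on-disk byte count can be larger for non-ASCII data.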

You have to remember that raw data is one thing and index disk usage is a completely different cup of tea. Raw data is stored in compressed form, so disk usage depends on the entropy of your data, and the length distribution will not correspond 1:1 to the compressed data size. For example, relatively short but highly variable fields like timestamps can compress much worse under many compression schemes than even relatively long strings of repeating sequences (I don't remember at the moment what algorithm the raw data is compressed with; gzip?). That's one thing.
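The entropy effect is easy to see with any deflate-style compressor. A Python sketch using zlib (just to illustrate the principle; whatever Splunk uses internally may behave differently in detail):

```python
import os
import zlib

# Two inputs of identical length but very different entropy.
repetitive = b"abcabcabc" * 1000   # 9000 bytes of a repeating sequence
random_ish = os.urandom(9000)      # 9000 bytes of (near-)random data

# The repetitive input compresses to a tiny fraction of its size;
# the random input barely compresses at all.
print(len(zlib.compress(repetitive)))
print(len(zlib.compress(random_ish)))
```

Equal raw lengths, wildly different compressed sizes: length alone doesn't predict disk usage.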

Another thing is that around 70 percent of the actual disk use is not the raw data itself but the index files, and their size depends on many factors, including the cardinality of fields, not just their lengths.

So it's not that straightforward.


hmasciave
Observer

@PickleRick Thank you for that. Sounds like I won't be able to get this kind of data then. Appreciate the response.


PickleRick
SplunkTrust

You might get _some_ data but, as I said, it won't be 1:1 transferable to disk usage. It's not the case that "field A accounts for 30% of your event's size, so if we don't send this field in our logs the disk usage will drop by 30%". It's more complicated than that.

Oh, and I forgot: there's of course some space "unaccounted for" if you're summing up the fields, because the event also contains whitespace and other delimiters.
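A tiny Python sketch with a made-up key=value event shows the gap: the field values alone never add up to the raw event length, because the keys, equals signs, and spaces take room too.

```python
# Hypothetical raw event: key=value pairs separated by spaces.
raw = "src=10.0.0.1 dst=10.0.0.2 action=blocked"
fields = dict(pair.split("=", 1) for pair in raw.split())

# Sum the lengths of the extracted field values only.
field_len_sum = sum(len(v) for v in fields.values())

# The raw event is longer: keys, '=' signs, and spaces are unaccounted for.
print(len(raw), field_len_sum)
```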


richgalloway
SplunkTrust

Is this what you're looking for?

index=foo
| foreach * 
    ```Get the length of every field```
    [ eval <<FIELD>>_len = len('<<FIELD>>') ] 
```Add up the lengths of the fields```
| stats sum(*_len) as *_len 
| rename *_len as *
| table *
| transpose column_name="field" 
| rename "row 1" as "value" 
| sort - value
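For reference, what this search computes is roughly the following, sketched in Python with made-up events (names and values are invented for illustration): accumulate the character length of every field across all events, then sort descending.

```python
from collections import defaultdict

# Hypothetical events, each a dict of field name -> value.
events = [
    {"host": "web01", "msg": "GET /index.html 200"},
    {"host": "web02", "msg": "POST /login 302", "user": "alice"},
]

# Like `foreach *` + `stats sum(*_len)`: cumulative length per field.
totals = defaultdict(int)
for event in events:
    for field, value in event.items():
        totals[field] += len(str(value))

# Like `transpose` + `sort - value`: one row per field, biggest first.
for field, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(field, total)
```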
---
If this reply helps you, Karma would be appreciated.

PickleRick
SplunkTrust

While this solution is technically correct, it doesn't answer the question.

It shows the cumulative lengths of _all_ fields (except maybe the "hidden" ones) associated with the events - including default fields like host or sourcetype, indexed fields (which are stored completely differently), tags, aliased fields, calculated fields, automatic lookups and so on.

BTW, you can't calculate your disk usage based on search-time extractions, because fields extracted that way may overlap.


richgalloway
SplunkTrust

I tried incorporating walklex into the query, but it didn't give the results I was looking for.

Your point about this query including more than it should is valid, but I'd rather the example provide more information that the OP can ignore or filter out.

---
If this reply helps you, Karma would be appreciated.

PickleRick
SplunkTrust

I meant my comment more as a warning for the OP that (s)he's trying to do his/her analysis "backwards".
