Splunk Cloud Platform

How to find the size of fields in a particular index?

hmasciave
Observer

We are looking to see the size of all the fields in a particular index. We have come up with the search below to see the size of a particular field, but we would like to see the size of all the fields in the index in order to understand where the bulk of the data is sitting.

index=index_name
| eval raw_len=(len(_raw)/1024/1024/1024)
| stats sum(raw_len) as GB by field_name
| sort -GB


hmasciave
Observer

Thank you both for the feedback. I am not sure what the value in the search results represents. Is it the number of events, the size in GB, etc.?

The search below works. Given the numerous fields, I just want to find the amount of data in each of these fields so we can see where most of our data is sitting. Seeing the amount of data in an index and sourcetype is great, but we need to dig deeper into the data to see where most of it is being logged.

index=index_name
| eval raw_len=(len(_raw)/1024/1024/1024)
| stats sum(raw_len) as GB by field_name
| sort -GB


PickleRick
SplunkTrust

Oh, and BTW, neither your search nor @richgalloway's gives the size properly. It returns a cumulative length measured in characters, not in bytes! Splunk supports Unicode (I'm not sure how it's encoded internally), so "a" is the same length as "ą", but it will most probably be a different size in memory/on disk.
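The character-vs-byte distinction is easy to demonstrate outside Splunk. A small Python sketch (Python only as an illustration; Splunk's internal encoding may differ):

```python
# len() counts characters; the encoded size depends on the encoding.
ascii_char = "a"
polish_char = "ą"

print(len(ascii_char), len(polish_char))        # both are 1 character long
print(len(ascii_char.encode("utf-8")))          # 1 byte in UTF-8
print(len(polish_char.encode("utf-8")))         # 2 bytes in UTF-8
```

So a search summing `len(...)` counts characters, and the on-disk byte count can be larger for non-ASCII data.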

You have to remember that raw data is one thing and index disk usage is a completely different cup of tea. Raw data is stored in compressed form, so disk usage depends on the entropy of your data, and the length distribution will not correspond 1:1 to the compressed data size. For example, relatively short but highly variable fields like timestamps can compress much worse under many compression schemes than even relatively long strings of repeating sequences (I don't remember at the moment what algorithm the raw data is compressed with; gzip?). That's one thing.
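The entropy effect is easy to see with any deflate-style compressor. A Python sketch using zlib (just to illustrate the principle; whatever Splunk uses internally may behave differently in detail):

```python
import os
import zlib

# Two inputs of identical length but very different entropy.
repetitive = b"abcabcabc" * 1000   # 9000 bytes of a repeating sequence
random_ish = os.urandom(9000)      # 9000 bytes of (near-)random data

# The repetitive input compresses to a tiny fraction of its size;
# the random input barely compresses at all.
print(len(zlib.compress(repetitive)))
print(len(zlib.compress(random_ish)))
```

Equal raw lengths, wildly different compressed sizes: length alone doesn't predict disk usage.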

Another thing is that around 70 percent of the actual disk use is not the raw data itself but the index files, and their size depends on many factors, including the cardinality of fields, not just their lengths.

So it's not that straightforward.


hmasciave
Observer

@PickleRick Thank you for that. Sounds like I won't be able to get this kind of data then. Appreciate the response.


PickleRick
SplunkTrust

You might get _some_ data but, as I said, it won't be 1:1 transferable to disk usage. It's not the case that "field A accounts for 30% of your event's size, so if we don't send this field in our logs the disk usage will drop by 30%". It's more complicated than that.

Oh, and I forgot: there's of course some space "unaccounted for" if you're summing up the fields, because the event also contains whitespace and other delimiters.
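A tiny Python sketch with a made-up key=value event shows the gap: the field values alone never add up to the raw event length, because the keys, equals signs, and spaces take room too.

```python
# Hypothetical raw event: key=value pairs separated by spaces.
raw = "src=10.0.0.1 dst=10.0.0.2 action=blocked"
fields = dict(pair.split("=", 1) for pair in raw.split())

# Sum the lengths of the extracted field values only.
field_len_sum = sum(len(v) for v in fields.values())

# The raw event is longer: keys, '=' signs, and spaces are unaccounted for.
print(len(raw), field_len_sum)
```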


richgalloway
SplunkTrust

Is this what you're looking for?

index=foo
| foreach * 
    ```Get the length of every field```
    [ eval <<FIELD>>_len = len('<<FIELD>>') ] 
```Add up the lengths of the fields```
| stats sum(*_len) as *_len 
| rename *_len as *
| table *
| transpose column_name="field" 
| rename "row 1" as "value" 
| sort - value
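For reference, what this search computes is roughly the following, sketched in Python with made-up events (names and values are invented for illustration): accumulate the character length of every field across all events, then sort descending.

```python
from collections import defaultdict

# Hypothetical events, each a dict of field name -> value.
events = [
    {"host": "web01", "msg": "GET /index.html 200"},
    {"host": "web02", "msg": "POST /login 302", "user": "alice"},
]

# Like `foreach *` + `stats sum(*_len)`: cumulative length per field.
totals = defaultdict(int)
for event in events:
    for field, value in event.items():
        totals[field] += len(str(value))

# Like `transpose` + `sort - value`: one row per field, biggest first.
for field, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(field, total)
```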
---
If this reply helps you, Karma would be appreciated.

PickleRick
SplunkTrust

While this solution is technically correct, it doesn't answer the question.

It shows the cumulative lengths of _all_ fields (except maybe the "hidden" ones) associated with the events - including default fields like host or sourcetype, indexed fields (which are stored completely differently), tags, aliased fields, calculated fields, automatic lookups and so on.

BTW, you can't calculate your disk usage based on search-time extractions, because fields extracted that way may overlap.


richgalloway
SplunkTrust

I tried incorporating walklex into the query, but it didn't give the results I was looking for.

Your point about this query including more than it should is valid, but I'd rather the example provide more information that the OP can ignore or filter out.

---
If this reply helps you, Karma would be appreciated.

PickleRick
SplunkTrust

I meant my comment more as a warning for the OP that (s)he's trying to do his/her analysis "backwards".
