We have the Bro TA installed and it is putting all the Bro logs into a dedicated index. We are logging roughly 5 GB per day. The index size on disk is about 2.5 times larger than the raw data size according to the Fire Brigade app, and inspecting the index folder size confirms this. For comparison, we log a similar daily volume of Windows security events and their on-disk size is less than 30% of the raw data size.
Why does our Bro data inflate instead of compressing?
The answer most likely comes down to the Bro add-on using the INDEXED_EXTRACTIONS setting to ingest the events as headered data. That setting writes every field into the index, which is why the index ends up huge.
I recently discovered that Bro can log in JSON format, and I am working on porting the add-on over to KV_MODE = json instead of INDEXED_EXTRACTIONS. The difficulty is that I will have to store the new format in a different index, and potentially use a different sourcetype (bro_json_http versus bro_http), since the field extractions are driven by props.conf settings.
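For illustration, here is a rough sketch of how the two approaches might look in props.conf; the stanza names and settings are assumptions for the example, not the actual TA-bro configuration:
# Sketch of the current approach: fields are extracted at index time and
# written into the tsidx files, which is what inflates the index.
[bro_http]
INDEXED_EXTRACTIONS = tsv
TIMESTAMP_FIELDS = ts

# Sketch of the JSON approach: fields are extracted at search time only,
# so only the raw events and the usual keyword terms are indexed.
[bro_json_http]
KV_MODE = json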
Hope this helps.
I actually noticed exactly the same thing; in my case it's more like 3x.
|dbinspect index=bro |eval rawMB=(rawSize / 1024 / 1024 ) | stats values(index), sum(rawMB) AS rawTotal, sum(sizeOnDiskMB) AS diskTotalinMB
values(index)   rawTotal    diskTotalinMB
bro 36052.553794    104741.765628
I just created a test index called bro_test and loaded a 147 MB bro_conn file with 1,055,473 lines into it, and I see the same results in Splunk. Fresh data only, nothing else.
rawsize=145MB
diskSize=257MB
[user@host bro1]$ wc -l conn.11:00:00-12:00:00.log
1055473 conn.11:00:00-12:00:00.log
[user@host bro1]$ du -hs conn.11:00:00-12:00:00.log 
147M    conn.11:00:00-12:00:00.log
|dbinspect index=bro_test |eval rawMB=(rawSize / 1024 / 1024 ) | stats values(index), sum(rawMB) AS rawTotal, sum(sizeOnDiskMB) AS diskTotalinMB, values(eventCount)
values(index)   rawTotal    diskTotalinMB   values(eventCount)
bro_test    145.171837  257.875000  1055473
Sample of log file:
1421668703.224169   Ci2Oik4vP9Wc8n6XDk  10.10.101.238   54350   23.63.99.88 80  tcp http    30.904114   678 171648  SF  T   1   ShADadfF    76  4642    131 178560  (empty) -   US  so-eth3
1421668703.074892   CBIaMfbZYpIOO77x1   10.10.101.238   54347   23.61.254.251   80  tcp http    31.053642   340 84757   SF  T   0   ShADadFf    43  2588    69  88353   (empty) -   US  so-eth3
1421668693.147515   Cx3H5H1v12kDTYuJYa  10.10.101.238   54340   23.61.254.58    80  tcp http    40.981791   1017    308532  SF  T   0   ShADadFf    135 8049    231 320552  (empty) -   US  so-eth3
1421668703.002721   CZqQ5646gAQBpjK3Ma  10.10.101.238   54346   23.61.254.200   80  tcp http    31.126685   338 106877  SF  T   0   ShADadFf    47  2794    82  111149  (empty) -   US  so-eth3
1421668693.313653   C7ZQp24SOGM1hoW2Yl  10.10.101.238   54343   23.61.254.16    443 tcp ssl 40.815798   1606    18566   SF  T   0   hSADadFfR   27  2998    22  19718   (empty) -   US  so-eth3
I suspect that this isn't actually the case. Your raw data is almost definitely smaller. However, you may be subject to data models, search accelerations, etc. that increase the size of the index on disk because of the tsidx files used by accelerations.
Do you have any searches accelerated?
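If you want to see where the space actually goes, something along these lines breaks the buckets down into tsidx versus compressed raw data (the index path below is an assumption; adjust it for your environment):
du -ch /opt/splunk/var/lib/splunk/bro/db/db_*/*.tsidx | tail -1    # total size of the tsidx (index) files
du -ch /opt/splunk/var/lib/splunk/bro/db/db_*/rawdata | tail -1    # total size of the compressed raw data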
You are right: the raw data is compressed, I think at about 4:1. I tried to check this across all the compressed raw data files, but Splunk's gzip files run into the known gzip bug where the uncompressed size field overflows (it is only 32 bits), so the compression ratio came out negative for a lot of files.
I still don't know why the indexes are so large. I am pretty sure it is not accelerated searches, though. I will keep looking, but I suspect it is part of TA-bro that I don't want to break.
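One way around the gzip size-field overflow is to decompress and count bytes instead of trusting gzip -l; a sketch, assuming the rawdata journals sit under the default index path:
for f in /opt/splunk/var/lib/splunk/bro/db/db_*/rawdata/journal.gz; do
    comp=$(du -b "$f" | cut -f1)      # compressed size in bytes
    raw=$(zcat "$f" | wc -c)          # uncompressed size in bytes (avoids the 32-bit overflow)
    awk -v r="$raw" -v c="$comp" -v f="$f" 'BEGIN { printf "%s ratio %.2f\n", f, r/c }'
done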
What else do you have installed on the system? What other apps or add-ons?
