Why is the Bro index so big?

puce_flume
Engager

We have the Bro TA installed and are putting all the Bro logs into a dedicated index. We are logging ~5 GB per day. The index size on disk is about 2.5 times larger than the raw data size according to the Fire Brigade app. This is supported by inspection of the index folder size. For comparison, we are logging a similar daily volume of Windows security events, and their on-disk size is <30% of the raw data size.

Why does our Bro data inflate instead of compressing?

joshua_hart1
Path Finder

The answer most likely has to do with the fact that the Bro Add-On uses the INDEXED_EXTRACTIONS setting to index the events as structured data with headers. This indexes each of those fields at index time, and thus the index is HUGE.

I recently discovered that Bro can log in JSON format and am working to port the Add-On over to use KV_MODE = json instead of INDEXED_EXTRACTIONS. The difficulty is that I'll have to use a different index to store the new format and potentially a different sourcetype (bro_json_http versus bro_http), since the field extractions are based on props.conf settings.

Hope this helps.

balmeida
Explorer

I actually noticed exactly the same thing; in my case it's more like 3x.

| dbinspect index=bro | eval rawMB=(rawSize / 1024 / 1024) | stats values(index), sum(rawMB) AS rawTotal, sum(sizeOnDiskMB) AS diskTotalinMB

values(index)  rawTotal      diskTotalinMB
bro            36052.553794  104741.765628

balmeida
Explorer

I just created a test index called bro_test and loaded a 147 MB bro_conn file with 1055473 lines into it, and I see the same results in Splunk. Fresh data only, nothing else.

rawsize=145MB
diskSize=257MB

[user@host bro1]$ wc -l conn.11:00:00-12:00:00.log
1055473 conn.11:00:00-12:00:00.log
[user@host bro1]$ du -hs conn.11:00:00-12:00:00.log
147M conn.11:00:00-12:00:00.log

| dbinspect index=bro_test | eval rawMB=(rawSize / 1024 / 1024) | stats values(index), sum(rawMB) AS rawTotal, sum(sizeOnDiskMB) AS diskTotalinMB, values(eventCount)

values(index)  rawTotal    diskTotalinMB  values(eventCount)
bro_test       145.171837  257.875000     1055473
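
The inflation implied by the two dbinspect outputs in this thread can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the figures are copied from the posts above, and it assumes sizeOnDiskMB is reported in units of 1024 * 1024 bytes, the same unit the search uses when converting rawSize.

```python
# Figures are copied from the dbinspect output posted in this thread.
# Assumption: both columns are in units of 1024 * 1024 bytes.
MIB = 1024 * 1024


def disk_to_raw_ratio(raw_mb, disk_mb):
    """On-disk size divided by raw size; above 1.0 means the index inflated."""
    return disk_mb / raw_mb


def bytes_per_event(size_mb, event_count):
    """Average number of bytes attributed to each event."""
    return size_mb * MIB / event_count


# Production index (bro) and the single-file test index (bro_test).
print("bro ratio:      %.2f" % disk_to_raw_ratio(36052.553794, 104741.765628))
print("bro_test ratio: %.2f" % disk_to_raw_ratio(145.171837, 257.875))

# Per-event cost in the test index, which holds 1055473 events.
print("raw bytes/event:  %.0f" % bytes_per_event(145.171837, 1055473))
print("disk bytes/event: %.0f" % bytes_per_event(257.875, 1055473))
```

On the numbers posted here this works out to roughly 2.9x for the bro index and about 1.8x for bro_test, or about 144 bytes of raw data against about 256 bytes on disk per event.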

Sample of log file:
1421668703.224169 Ci2Oik4vP9Wc8n6XDk 10.10.101.238 54350 23.63.99.88 80 tcp http 30.904114 678 171648 SF T 1 ShADadfF 76 4642 131 178560 (empty) - US so-eth3
1421668703.074892 CBIaMfbZYpIOO77x1 10.10.101.238 54347 23.61.254.251 80 tcp http 31.053642 340 84757 SF T 0 ShADadFf 43 2588 69 88353 (empty) - US so-eth3
1421668693.147515 Cx3H5H1v12kDTYuJYa 10.10.101.238 54340 23.61.254.58 80 tcp http 40.981791 1017 308532 SF T 0 ShADadFf 135 8049 231 320552 (empty) - US so-eth3
1421668703.002721 CZqQ5646gAQBpjK3Ma 10.10.101.238 54346 23.61.254.200 80 tcp http 31.126685 338 106877 SF T 0 ShADadFf 47 2794 82 111149 (empty) - US so-eth3
1421668693.313653 C7ZQp24SOGM1hoW2Yl 10.10.101.238 54343 23.61.254.16 443 tcp ssl 40.815798 1606 18566 SF T 0 hSADadFfR 27 2998 22 19718 (empty) - US so-eth3

alacercogitatus
SplunkTrust

I suspect that this isn't actually the case. Your raw data is almost definitely smaller on disk. However, you may be subject to data models, search accelerations, etc. that increase the size of the index on disk because of the tsidx files used in accelerations.

Do you have any searches accelerated?

puce_flume
Engager

You are right. The raw data are compressed, I think about 4:1. I tried to check this across all the compressed raw data files, but it seems Splunk uses a gzip format whose stored uncompressed size overflows on large files (the size field in the gzip trailer is a 32-bit integer), so the compression ratio came out negative for a lot of files.
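
To illustrate the negative ratios: the gzip format stores the uncompressed length in a 4-byte trailer field (ISIZE), modulo 2^32, so any tool that trusts that field under-reports files larger than 4 GiB. The sketch below is hypothetical (the helper names are invented here, and it assumes the rawdata journal is an ordinary single-member gzip file, which was not verified in this thread). It reads the field from a small test file and then shows how a 5 GiB file compressed 4:1 would appear to have negative savings.

```python
import gzip
import os
import struct
import tempfile


def gzip_trailer_isize(path):
    """Read ISIZE: the last 4 bytes of a gzip member, little-endian,
    holding the uncompressed length modulo 2**32."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (isize,) = struct.unpack("<I", f.read(4))
    return isize


def naive_savings(compressed_bytes, true_uncompressed_bytes):
    """Space savings as computed from the wrapped ISIZE value.
    Hypothetical helper; it does not handle a wrapped size of exactly 0."""
    reported = true_uncompressed_bytes % 2**32
    return 1 - compressed_bytes / reported


# For a small file the trailer matches the real uncompressed length.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")
    payload = b"sample conn log line\n" * 1000
    with gzip.open(path, "wb") as f:
        f.write(payload)
    print(gzip_trailer_isize(path) == len(payload))

# A 5 GiB file compressed 4:1 occupies 1.25 GiB, but ISIZE wraps to 1 GiB,
# so the apparent savings are negative.
GIB = 2**30
print(naive_savings(5 * GIB // 4, 5 * GIB))
```

The true savings in that example are 75%, so a negative figure from this kind of calculation says nothing about how well the data actually compressed.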

I still don't know why the indexes are so large. I am pretty sure it is not accelerated searches, though. I will keep looking, but I suspect it may be part of TA-bro that I don't want to break.

mreynov_splunk
Splunk Employee

What else do you have installed on the system? What other apps or add-ons?
