I just repaired corrupt buckets for a partner index on one of our enterprise indexers.
The issue only became apparent after the customer saw the warnings on their reports.
My question is: are there easy, proactive warnings that administrators can receive highlighting index bucket corruption, rather than leaving it to our customers to find the problems?
If you are using the Monitoring Console, that would be a good starting point: it has visibility into indexer clustering activity. The link below might get you started; these are all dashboards/searches, so maybe you can set up alerts on them. The cluster master's Settings -> Indexer clustering page might give you some insight too.
https://docs.splunk.com/Documentation/Splunk/6.6.2/Indexer/Viewindexerclusteringstatus
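If you want something lighter than the full clustering dashboards, one option is to save a keyword search as a scheduled alert under Settings > Searches, reports, and alerts, triggered when the result count is greater than zero. This is only a rough sketch, not an official health check; the keyword match and grouping fields are my assumptions:

index=_internal sourcetype=splunkd (log_level=WARN OR log_level=ERROR) (corrupt OR corruption)
| stats count BY host, component, message

The keyword match is crude and will likely need tuning against your splunkd.log messages, but it gives admins a heads-up before users notice broken reports.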
Hi
You can find corrupted buckets in a multisite cluster with the following search/alert:
index=_internal component=CMMaster state=Discard incoming_bucket_size=* earliest=-30d@d
| dedup bid
| table _time,bid,peer_name,existing_bucket_size,incoming_bucket_size
| sort bid,_time
This shows the bucket ID and the source peer.
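If you want to schedule it as an alert, a per-peer summary built on the same fields could look like this (a sketch; the 24-hour window and the summary field names are my assumptions):

index=_internal component=CMMaster state=Discard incoming_bucket_size=* earliest=-24h
| stats dc(bid) AS corrupted_buckets, values(bid) AS bucket_ids BY peer_name

Trigger when the number of results is greater than zero, and you get one row per peer that offered a bad bucket copy.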
r. Ismo
Even though this is an old case, I would like to add what one can do with current versions.
Just run this:
| dbinspect index=* OR index=_* corruptonly=true
| search state!=hot
Select a long enough time period to find all corrupted buckets.
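To make this proactive rather than ad hoc, a scheduled variant could summarise per indexer and index (a sketch; the output field names are the dbinspect fields as I recall them, so verify them in your environment):

| dbinspect index=* OR index=_* corruptonly=true
| search state!=hot
| stats count AS corrupt_buckets, values(path) AS bucket_paths BY splunk_server, index

Save it as an alert that fires whenever results are returned, and the admins hear about corruption before the customers do.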
r. Ismo
A peer of mine shared this search. Does it jibe with your environment? I want to see if we can add these things into the Monitoring Console as well, so I'm curious to hear how you make out.
index=_internal sourcetype=splunkd component=ProcessTracker (BucketBuilder OR JournalSlice) (NOT "rawdata was truncated")
| eval message=replace(message, "^\(child.*?\)\s+", "")
| bin _time span=1m
| stats c by _time, host, splunk_server, message
| fields - c
| rename splunk_server as Indexer, host as Host, message as Issue
Thank you Mr. Burch. I tried running this but didn't get any results.
This could either mean that we don't have any bucket issues, or your search isn't worth the paper it is written on -- not sure which.
I'm not sure where the truth lies yet, but I am guessing we must have some bucket issues somewhere given the amount of data we pump each day.
More testing required I think.
thank you!
We couldn't see any issues with the previous search either, but there are still a couple of corrupted buckets (e.g. a journal.gz that was only a couple of bytes).
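For that particular failure mode, one heuristic check is to look for warm/cold buckets whose raw data size looks implausibly small for the events they claim to hold. This is purely an assumption on my part (that a truncated journal.gz shows up as a near-zero rawSize), and the threshold is a guess:

| dbinspect index=* OR index=_*
| where state!="hot" AND eventCount > 0 AND rawSize < 1024
| table splunk_server, index, path, eventCount, rawSize

It will not catch every kind of corruption, but it can flag tiny-journal buckets that corruptonly=true misses.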
Would you provide more detail on how you identified that the buckets were corrupted? That might add color to an existing way to be notified.
There was an exclamation mark / warning on the dashboard with a cryptic message saying there was an error related to the indexer in question: "[indexer_] Streamed search execute failed because: JournalSliceDirectory: Cannot seek to rawdata offset 0 ..."
This type of error scares the crap out of users, and they freak out to the admin...
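For exactly that case, a narrow keyword alert would let the admin see it before the users do (a sketch; it assumes the same JournalSliceDirectory error is also written to splunkd.log on the search peers):

index=_internal sourcetype=splunkd "JournalSliceDirectory" "Cannot seek to rawdata offset"
| stats count BY host, component

Even if the component name differs in your version, matching on the error strings from the dashboard message should be enough to trigger on.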