We keep failing these days whenever we have major spikes in ingestion, primarily with HEC. What would be a good and efficient way to detect major up/down spikes in data ingestion?
What do you mean by "We fail again and again"?
What kind of environment do you have? Distributed, with separate HEC nodes behind a LB?
Basically, you could create e.g. a dashboard that shows status information from the _internal and _introspection logs. Once you know what your normal and abnormal behaviour looks like, you could also create alerts based on it.
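As a rough illustration of what "normal vs. abnormal" could mean for such an alert, here is a minimal Python sketch. It assumes you have already exported ingestion volume per fixed time interval (e.g. KB per 5 minutes) from a search into a plain list of numbers. The function name `detect_spikes`, the window size, and the threshold are hypothetical choices for illustration, not anything Splunk provides. It compares each value against the median of the preceding window and flags large deviations in either direction.

```python
from statistics import median


def detect_spikes(volumes, window=12, threshold=3.5):
    """Flag values that deviate strongly from the trailing window's median.

    volumes:   ingestion volume per interval (assumed to be exported beforehand)
    window:    number of preceding intervals used as the baseline (assumption)
    threshold: robust z-score above which a point counts as a spike (assumption)

    Returns a list of (index, value, direction) with direction "up" or "down".
    """
    spikes = []
    for i in range(window, len(volumes)):
        baseline = volumes[i - window:i]
        med = median(baseline)
        # Median absolute deviation: a spread estimate that earlier spikes
        # inside the window do not distort much.
        mad = median(abs(v - med) for v in baseline)
        # 1.4826 scales MAD to be comparable to a standard deviation.
        # On a perfectly flat baseline fall back to 5% of the median.
        scale = 1.4826 * mad if mad > 0 else max(abs(med) * 0.05, 1e-9)
        score = (volumes[i] - med) / scale
        if abs(score) >= threshold:
            spikes.append((i, volumes[i], "up" if score > 0 else "down"))
    return spikes


if __name__ == "__main__":
    sample = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100,
              500, 100, 5, 101]
    print(detect_spikes(sample))
    # prints [(12, 500, 'up'), (14, 5, 'down')]
```

The median-based baseline is used here so that one spike does not immediately inflate the "normal" level for the following intervals. The window and threshold would need tuning against your own traffic pattern before using anything like this for alerting.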
r. Ismo