I've worked with Splunk in financial-services environments for a long time and built observability tooling on top of it, so here's how I'd approach this: Splunk-first, but architected so you never regret it.

Get your ingestion architecture right first. The mistake I see most is apps writing straight to HEC (HTTP Event Collector) with hardcoded endpoints. It works on day one, but you've welded your application code to a specific pipeline. Put an OpenTelemetry Collector in front instead: apps emit structured telemetry, the Collector handles routing, enrichment, and redaction, and HEC becomes one exporter behind it. This is squarely where Splunk itself is heading; Splunk is one of the biggest backers of OTel and it's the recommended path into Splunk Observability Cloud. You get clean ingest into Splunk and the flexibility to add or change destinations later through config, not code.

Run the Collector as agent + gateway. Put a lightweight agent near each service for local pickup, and a central gateway tier for sampling, PII scrubbing, and routing to your indexers. Drop Kafka in front of the gateway if your volume is spiky. Scrub PII and secrets at the gateway before they hit an index; that's non-negotiable on a platform handling resumes and payment data.

Structure your logs for SPL. Use structured JSON with consistent fields, and follow OTel semantic conventions for naming. The thing that pays off most: propagate a trace_id/request_id across every service boundary. Being able to run transaction or stats by trace_id across services is the difference between reconstructing a failing request in two minutes versus two hours, and it makes your dashboards and Risk-Based Alerting (RBA) far cleaner.

Dashboards and alerting in Splunk. Use Dashboard Studio for the visuals, and lean into RBA rather than firing on raw error counts. Alert on symptoms (error rate, p95/p99 latency, saturation) so the team doesn't learn to ignore pages. RBA aggregating weaker signals into high-confidence notables is exactly the pattern for this.
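
To make the structured-logging and trace_id advice concrete, here is a minimal Python sketch using only the standard library. The header name `x-trace-id`, the service name `checkout`, and the helper names are all assumptions for illustration; the field names are loosely modeled on the OTel log data model, and a real setup would more likely use the OpenTelemetry SDK and W3C trace context than hand-rolled propagation.

```python
import contextvars
import json
import logging
import uuid

# Holds the trace id of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a consistent field set."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "severity_text": record.levelname,
            "service.name": "checkout",  # assumed service name
            "trace_id": current_trace_id.get(),
            "body": record.getMessage(),
        }
        return json.dumps(payload)


def start_request(incoming_headers):
    """Reuse the caller's trace id if present, otherwise mint a new one.

    Returns the headers to forward on downstream calls so the same id
    crosses every service boundary.
    """
    trace_id = incoming_headers.get("x-trace-id") or uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return {"x-trace-id": trace_id}


def build_logger(stream=None):
    """Create a logger that writes one JSON event per line to `stream`."""
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    outgoing_headers = start_request({"x-trace-id": "abc123"})
    log.info("payment authorized")
```

With every service emitting the same `trace_id` field, a search along the lines of `trace_id=abc123 | stats count by service.name` (index and field extraction depend on your setup) pulls one request's events together across services.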
On the broader question of Splunk vs. open-source: Splunk gives you the strongest ad-hoc search and the lowest operational burden on Cloud, which is usually the right call for a single product where you'd rather not run an Elasticsearch cluster yourself. If ingest cost ever becomes your hard constraint at scale, the OTel layer above means you can route a portion of low-value, high-volume data elsewhere without touching application code. That's the beauty of decoupling early: you keep Splunk for what it's best at and stay flexible on the rest.

If you share your rough daily volume, whether you're on K8s or VMs, and whether this is ops-only or also security/compliance, I can get a lot more specific. Happy to share a sample OTel Collector -> HEC config too.
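
On the scrubbing point earlier: in practice that logic belongs in the Collector gateway's processors, not in application code, but the idea is easy to sketch in Python. The patterns and field names below are illustrative assumptions, not a vetted rule set, and the card pattern in particular will miss some formats and over-match others.

```python
import re

# Illustrative patterns only; a real deployment needs a reviewed, tested rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators
SECRET_KEYS = {"password", "authorization", "api_key"}  # assumed field names


def scrub(event):
    """Return a copy of a flat log event with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SECRET_KEYS:
            # Drop the whole value for fields that should never be indexed.
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Mask sensitive substrings inside free-text fields.
            value = EMAIL.sub("[EMAIL]", value)
            clean[key] = CARD.sub("[CARD]", value)
        else:
            clean[key] = value
    return clean


if __name__ == "__main__":
    print(scrub({
        "body": "card 4111 1111 1111 1111 declined for jane@example.com",
        "password": "hunter2",
        "status": 402,
    }))
```

Doing this once at the gateway means every service gets the same rules, and tightening a pattern is a config change there instead of a redeploy of each app.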