Can you clarify Splunk Data Lake support around schema (schema-on-read vs enforced), available APIs for ingest/query, and how the LDP handles large-scale time-series data? Also interested in integration options and any FedRAMP considerations.
Hi @ARC1,
If you're not a federal employee or contractor, you'll need to contact Splunk or a Splunk partner for more information re: FedRAMP.
If you are a federal employee or contractor, you can work with your agency's FedRAMP approver to request access to FedRAMP security packages from the public FedRAMP marketplace:
Splunk Cloud Platform for FedRAMP Moderate
Splunk Cloud Platform for FedRAMP High
For on-premises deployments, you can review FIPS and Common Criteria compliance at https://help.splunk.com/en/splunk-enterprise/administer/manage-users-and-security/10.0/establish-and....
Splunk doesn’t behave like a traditional data lake, so it helps to frame the answers in Splunk terms.
Splunk is fundamentally schema-on-read. Data lands as raw or semi-structured events, and structure is applied at search time using field extractions, knowledge objects, data models, and CIM. If you need enforcement, that happens upstream or during ingest through disciplined sourcetypes, DSP pipelines, validation, and standardized TAs. Splunk itself does not enforce schemas at rest.
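To make that concrete, here is a minimal sketch using the Splunk Enterprise SDK for Python (`pip install splunk-sdk`): no schema exists at rest, and fields are extracted from the raw event at search time with `rex`. The host, credentials, and the `my_app_logs` sourcetype are placeholders.

```python
# Schema-on-read sketch: structure is applied at search time via an inline
# rex extraction, not enforced at rest. Host, credentials, and sourcetype
# below are placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(
    host="localhost", port=8089,            # management port, placeholder
    username="admin", password="changeme",  # placeholder credentials
)

# The schema is defined here, in the search itself.
query = (
    'search index=main sourcetype=my_app_logs '
    '| rex field=_raw "user=(?<user>\\S+)\\s+latency_ms=(?<latency_ms>\\d+)" '
    '| stats avg(latency_ms) AS avg_latency BY user'
)
stream = service.jobs.oneshot(query, output_mode="json")
for row in results.JSONResultsReader(stream):
    print(row)
```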
From a schema perspective, native Splunk indexes can store essentially any event format. Structure is implicit and defined by extractions, not storage. When using Federated Search for Amazon S3, schemas are defined externally in AWS Glue Data Catalog tables (for formats like JSON, CSV, or Parquet). In that case, the schema lives in Glue, not in Splunk.
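As a hedged illustration of "the schema lives in Glue, not in Splunk," the boto3 sketch below reads column definitions from a Glue Data Catalog table. The database and table names are hypothetical, and AWS credentials are assumed to be configured.

```python
# Where the schema lives for Federated Search for Amazon S3: an AWS Glue
# Data Catalog table, not Splunk. Database/table names are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region
resp = glue.get_table(DatabaseName="s3_archive_db", Name="app_events")

# Columns describe the JSON/CSV/Parquet layout of the objects in S3.
for col in resp["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
```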
Splunk does not automatically normalize time-series data across sources. Normalization is achieved by mapping fields to CIM (or OCSF when working with Security Lake–aligned data), either at ingest or at search time. Data models, macros, and calculated fields are what provide a consistent analytical view across different producers.
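As an illustrative (not authoritative) sketch of search-time normalization, the SPL below renames vendor-specific fields to CIM names. In practice this mapping usually lives in a TA's props/transforms or a data model rather than ad hoc SPL; the sourcetype and vendor field names here are hypothetical.

```python
# Search-time normalization sketch: vendor fields are mapped to CIM field
# names (src, dest, action) in the search. Sourcetype and vendor field
# names are hypothetical; connection details are placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")  # placeholders

query = (
    'search index=main sourcetype=vendor_fw '
    '| rename src_address AS src, dst_address AS dest '
    '| eval action=if(disposition="blocked", "blocked", "allowed") '
    '| table _time src dest action'
)
for row in results.JSONResultsReader(
        service.jobs.oneshot(query, output_mode="json")):
    print(row)
```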
For ingest, you use the standard Splunk mechanisms: forwarders, syslog, HEC, DSP, and app-based collectors. Queries are executed with SPL via the Search REST API or SDKs. These APIs are designed for analytics and automation, not bulk lake-style exports. Practical limits are driven by search concurrency, quotas, job lifetimes, and result pagination rather than a single hard rate limit.
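For instance, here is a hedged HEC ingest sketch in Python. The `/services/collector/event` endpoint and the `{"time", "sourcetype", "event"}` payload shape follow the documented HEC event format; the host, port, and token are placeholders.

```python
# Ingest sketch via the HTTP Event Collector (HEC). Host, port, and token
# are placeholders; the payload shape is the documented HEC event format.
import json
import time
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder token

payload = {
    "time": time.time(),                           # event timestamp (epoch)
    "sourcetype": "my_app_logs",                   # hypothetical sourcetype
    "event": {"user": "alice", "latency_ms": 42},  # the event body itself
}
resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    data=json.dumps(payload),
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # {"text": "Success", "code": 0} on acceptance
```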
External tools do not directly query Splunk-managed storage. They query Splunk through its APIs. If data physically resides in S3, external analytics tools query S3 directly using Glue schemas, independent of Splunk. Splunk acts as an analytics and correlation layer, not a general-purpose lake endpoint.
Splunk can federate to external object storage. Federated Search for Amazon S3 allows SPL queries over data stored in S3 without ingesting it into Splunk indexes. This is commonly used for long-term retention, compliance data, or very large volumes where ingesting everything isn’t cost-effective.
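As a rough sketch (worth verifying against the current Federated Search for Amazon S3 documentation), S3-backed data is queried with the `sdselect` command against a federated index. The `fs_s3_archive` index name below is hypothetical, and connection details are placeholders.

```python
# Federated Search for Amazon S3 sketch: the data stays in S3, the Glue
# table supplies the schema, and sdselect queries it through a federated
# index. Verify exact sdselect syntax against current docs.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")  # placeholders

query = '| sdselect count FROM federated:fs_s3_archive'
for row in results.JSONResultsReader(
        service.jobs.oneshot(query, output_mode="json")):
    print(row)
```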
At large scales (for example, 10 TB/day), Splunk requires deliberate architecture and licensing. The typical pattern is to ingest a curated, high-value subset for fast detection and correlation, while leaving the bulk of the data in S3 and using federated search for longer lookbacks. Retention is managed through index policies, and cost optimization is achieved through federation and archiving rather than treating Splunk as the primary data lake. I can assure you, though, that once built out right, Splunk has no issues scaling well beyond 10 TB per day.
Integration with external data lakes and analytics platforms usually follows two models: federate when you need access without ingest, and curate-and-ingest when you need speed, correlation, and operational context. Most mature environments use both.
Access control uses standard Splunk RBAC. Roles are explicitly granted access to native indexes and federated indexes. The same permission model applies across on-platform and federated data.
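A hedged sketch of that model, creating a role and granting it searchable indexes through the authorization endpoints via the Python SDK; the role and index names are hypothetical, and a federated index would be granted the same way.

```python
# RBAC sketch: a role is explicitly granted search access to specific
# indexes via the authorization/roles endpoint. Role and index names
# are hypothetical; connection details are placeholders.
import splunklib.client as client

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")  # placeholders

service.roles.create(
    "soc_analyst",
    srchIndexesAllowed=["main", "security"],  # indexes this role may search
    srchIndexesDefault=["security"],          # searched when none specified
)
```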
From a compliance standpoint, Splunk Cloud Platform is FedRAMP High authorized. DoD and federal customers should validate the exact authorization boundary and region, and remember that external S3 buckets fall under their own authorization and controls.
Auditability is handled through Splunk’s built-in _audit index and the Audit Trail app. This covers user activity, searches, and configuration and knowledge-object changes. In Splunk terms, “schema changes” are changes to extractions, data models, or configurations, all of which are auditable.
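For example, a small sketch pulling recent search activity from _audit; the `action`/`info` fields shown are standard audit-trail fields, and the connection details are placeholders.

```python
# Audit sketch: who ran what searches in the last 24 hours, from the
# built-in _audit index. Requires a role allowed to search _audit
# (admin by default); connection details are placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")  # placeholders

query = ('search index=_audit action=search info=granted earliest=-24h '
         '| table _time user search')
for row in results.JSONResultsReader(
        service.jobs.oneshot(query, output_mode="json")):
    print(row)
```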
What data schemas are supported in the Splunk Data Lake, and is the schema enforced, semi-structured, or schema-on-read?
How are time-series fields normalized across different data sources?
What APIs are available for data ingestion and extraction (REST, streaming, batch), and are there any API rate limits or throughput constraints?
Can external tools query the data lake directly via API, and can Splunk federate queries to external object storage such as S3?
How does Splunk LDP handle large-scale time-series data (e.g., 10 TB/day), including retention and tiering options, and how does it integrate with external data lakes or analytics platforms?
What role-based access controls are supported for data lake access?
Is the Splunk Data Lake deployment FedRAMP authorized, and at what level?
How is audit logging handled for data access and schema changes?
1. What is the goal of these questions? Surely you didn't just wake up one day and decide you wanted to know "everything about Splunk" without actually learning Splunk.
2. Have you actually tried searching for any of these things on your own? Where? What were your results? Do you have any doubts, or is there something you don't understand in what you've already found?
Also - I'm merging your two separate threads about the same subject.
Thank you for the question.
Below is the consolidated list of User Experience Monitoring (UXM) requirements that drive our evaluation, questions, and deliverables. These requirements are derived from mission needs and are intended to support requirements-based validation, not general product familiarization.
The UXM solution must be capable of:
Measuring user experience “at the glass” with quantitative, actionable metrics
Detecting and diagnosing UX and performance degradations in near real time
Correlating IT operator changes to service improvements or degradations regardless of change source
Monitoring endpoint, network, application, and cloud service performance across the AFIN
Supporting time-series analysis, anomaly detection, alerting, and capacity planning
Providing centralized, standardized, and interoperable UX metrics
Supporting decision-making, SLA validation, and vendor accountability
Monitoring either 100% of the environment or statistically significant samples
Utilizing built-in vendor metrics while maintaining transparency in metric definitions
Supporting ETL, data ingestion, analytics, and data lake scalability
Enabling wide data sharing without additional licensing or access costs
Enforcing privacy protections (no user profiling)
Supporting configurable data retention (short-term detail and long-term trends)
Providing collector-level APIs for near–real-time access
Meeting DoD security, RMF, logging, authentication, least-privilege, and compliance requirements
Supporting patching, vulnerability management, intrusion detection, and auditability
These requirements are formally captured and traceable through UXM deliverables such as the RTVM, Tool Set Evaluation Report, MVP Architecture, RMF Package, and Operations & Maintenance artifacts.
Accordingly, all clarification questions are intended to determine capability alignment, architectural fit, security posture, scalability, and operational suitability of proposed solutions against these defined requirements.
Please let us know if additional clarification is needed on any specific requirement area.
Respectfully,
Abrahameen
You're mixing requirements for different products and services (Splunk Enterprise/Cloud, o11y, RUM, ITSI, you name it...). This is not a topic for the community; it's something you need to engage your local Splunk Partner for.
Thank you for the feedback on our UXM requirements. To dive deeper into topics like architecture, scale, and enterprise deployment, could you please point me to the appropriate account team or solutions architect we should engage with? We want to ensure our discussions are with the right experts who can provide guidance on supported designs and operational considerations.
This is a really solid set of requirements, and honestly, you’re asking the right questions. You’re well past “does the product have feature X” and deep into “does this actually work at scale, under real constraints, without breaking policy or budgets.”
At a high level, yes, the kind of UXM you’re describing is achievable. But a lot of what you’re asking about lives in the gray space between product capability, architecture, deployment patterns, and licensing. Things like true “at the glass” experience, meaningful change correlation across domains, privacy-safe data collection, and enterprise-wide data sharing aren’t things the community can answer with a simple yes or no.
That’s also where you’re probably starting to hit the ceiling of community help. These are the kinds of questions that really need to be walked through with your Splunk account team and solutions architects who can talk specifics around supported designs, scale limits, and how this plays out in real environments.
If you want to go deeper or compare notes (or if you're flat-out hitting a wall over who to contact), feel free to message me directly. I'm working with other groups tackling very similar UXM problems, and I'm always happy to nerd out on architectures, tradeoffs, and what actually works in practice.
I'm not able to send a message to you directly: "You have reached the limit for number of private messages that you can send for now. Please try again later." Yes, we're focused on the federal side.
Hi @ARC1
It seems like your questions would be better answered by Splunk's internal teams rather than the community due to the specific nature of the request.
Do you already have a contact at Splunk that you can speak to? If not, you might want to reach out via https://www.splunk.com/en_us/about-splunk/contact-us.html or speak to one of your local Splunk Partners (https://www.splunk.com/en_us/partners.html).