Solved: Splunk Data Model Data, To combine or to segregate

yh · ‎03-18-2024

Hello,

I have been working with Splunk for a few months now, and we use it mainly for cyber security monitoring. With regard to the CIM data models, I am wondering whether I should create separate data models for different zones or combine everything in a single data model.

For example, I have 3 different independent network zones, DMZ, Zone A and Zone B. Each of those zones will have multiple indexes linked to it.

Should I use the default CIM data model, e.g. datamodel=Authentication, with all the indexes from DMZ, Zone A and Zone B, or should I make copies of the data model?

Scenario 1:
If I use a common data model, I will use where source=xxx for example to try to split things out for my queries and dashboarding.

Scenario 2:
If I use separate data models, I will have datamodel=DMZ_Authentication and datamodel=ZoneA_Authentication, and perhaps use append when I need to see the overall picture?
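For comparison, the two scenarios might look roughly like this with tstats. This is only a sketch: the ZoneA* source pattern and the cloned model names are placeholders taken from the post, and it assumes the cloned models keep the root dataset name Authentication.

Scenario 1, one shared model filtered by source:

```spl
| tstats count from datamodel=Authentication where source="ZoneA*" by Authentication.action
```

Scenario 2, one model per zone stitched together with append:

```spl
| tstats count from datamodel=DMZ_Authentication by Authentication.action
| eval zone="DMZ"
| append
    [| tstats count from datamodel=ZoneA_Authentication by Authentication.action
     | eval zone="ZoneA"]
```

The second form needs one more subsearch for every zone that is added later, which is the maintenance cost to weigh.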

Still confused about which approach is best.

PickleRick · ‎03-19-2024

OK. There are additional things to consider here.

1. Datamodel is not the same as datamodel accelerated summary. If you just search from a non-accelerated datamodel, the search is "underneath" translated by Splunk to a normal search according to the definition of the dataset you're searching from. So all role-based restrictions apply.

2. As far as I remember (but you'd have to double-check it), even if you search from accelerated summaries, the index-based restrictions should still be in force because the accelerated summaries are stored along with normal event buckets in the index directory and are tied to the indexes themselves.

3. And because of that, exactly the same goes for retention periods. You can't have an accelerated summary retention period longer than the events' retention period, since the accelerated summaries would get rolled to frozen with the bucket the events come from.

So there's more to it than meets the eye.
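One way to check points 1 and 2 is to run the same search under an unrestricted role and under a restricted role, and compare which indexes come back. A sketch, assuming the CIM Authentication model is accelerated.

Reading only from the accelerated summaries:

```spl
| tstats summariesonly=true count from datamodel=Authentication by index
```

Searching the same dataset without the summaries, where the data model is expanded into a normal event search:

```spl
| datamodel Authentication Authentication search
| stats count by index
```

If the index-based restrictions hold in both cases, a restricted role should only see rows for the indexes it is allowed to search.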

yh · ‎03-20-2024

Thanks @gcusello @PickleRick for all the replies and tips and hints.

It has been very helpful. In the end, I went with one data model, with segregation done using source filtering.
I am still fiddling with adding fields to the data model, but I am sure it will be a nice addition to have extra info such as the index available, either as a data model field or as a field added during indexing.

I wish I could mark both as solutions, but since I can only accept one, I will select Rick's, as that reply gave me the Eureka moment: a single model doesn't impact the security roles (index selection), which subsequently made me switch to the single data model. All in all, I have learnt a lot from all the replies. A big thank you.

gcusello · ‎03-18-2024

Hi @yh ,

as @PickleRick said, it's difficult to advise you without a defined use case.

For this reason the best approach is to define your requirements, and then design your data structure to meet those requirements.

One additional hint:

whether to have more indexes and more Data Models depends on two factors:

access grants,
data retention.

In other words, if all your users can access all data and the data must be retained for the same time, there's no reason to have different indexes or Data Models, not least because you have to manage them and use them in searches. So I'd avoid duplicating DMs unless the requirements make it mandatory.

Ciao.

Giuseppe

yh · ‎03-19-2024

Thanks for the hints.

In terms of data retention all the sections will have similar policy.

However, access grants can be an issue. In my use case, the dashboards will be monitored by section personnel and also by the SOC. In terms of access, the SOC will be able to see DMZ, ZoneA and ZoneB, while the members of each section should only be able to see their own zone (need-to-know policy).

At the moment I am using different indexes so I can perform some transforms specific to each zone, as the syslog formats differ due to the different log aggregators used by each zone. By using the different indexes on the heavy forwarder, I am able to apply some SEDCMD rewrites for particular log sources, and override host and source on the HF. I remember that I can limit access based on indexes, but I guess this is not possible with data models. Will this be a concern?

If I put them all in a data model, is it still possible to restrict access? For example, if the user can only manipulate views from dashboard and not be able to run searches themselves, that will still be OK.

Pros and Cons in my mind:
Separate data model:
- Pros: I can easily segregate the tstats queries
- Cons: It might be difficult to get overview stats, as I would need to use appends and maintain a model for each additional new zone. Each new data model will need to run periodically and increase the number of scheduled accelerations?

Integrated data model:
- Cons: Might be harder to filter, e.g. between ZoneA, ZoneB and DMZ. It seems I can filter only on the few parameters in the model, e.g. source, host
- Pros: Easier to maintain, as I just need to add new indexes to the data model whitelist. Limits the number of scheduled runs.
- And, as mentioned, the point on data access: will it still be possible to restrict?

I am still quite new to Splunk so some of my thoughts might be wrong. Open to any advice, still in a conundrum.

PickleRick · ‎03-18-2024

Unless you have a very specific use case you don't want to touch the CIM datamodels. They are the "common wisdom" and many existing searches and apps rely on the CIM being properly implemented and data being CIM-compliant.

Question is what would be the goal of modifying the datamodels?

yh · ‎03-18-2024

The data will still be CIM compliant though. I am simply replicating the data model, so I have two different sets of data models (all the settings are the same, but the whitelisted indexes in each data model are different).

By cloning the original Data Model from the CIM app, I have a
DMZ Network data model = Only Index for DMZ
Zone A Network data model = Only Index for Zone A

At that time, my goal was to make it easy for users to display the dashboards by simply swapping the data model in use, because DMZ and Zone A are very different from one another.

So would the best practice be to just put everything in one common data model, e.g.
Default network data model = all indexes, and then try to separate out the zones by using filters like "where" in the search queries?

yh · ‎03-19-2024

Hey @PickleRick

2. You are absolutely right. I just tried with different users on the same accelerated model, using the same query but different roles, and the restricted users get far fewer results.

So, can I say the way forward seems to be one common data model then?
Is there any recommended or easy way to perform filtering between Zones in a summary search for example?
Is using Where source=ZoneA* alright then?

PickleRick · ‎03-19-2024

Well, look into the CIM definition and check which fields might be relevant to your use case.

"zone" is a relatively vague term and can have different meanings depending on context.

For example, the Network Traffic data model has three different "zone" fields:

src_zone, dest_zone and dvc_zone

Of course filtering by source field is OK but it might not contain the thing you need.
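As an illustration, filtering on one of those CIM zone fields could look something like the search below. It is a sketch only: it assumes dvc_zone is actually populated for the events in the model, and "ZoneA" is a placeholder value.

```spl
| tstats count from datamodel=Network_Traffic where All_Traffic.dvc_zone="ZoneA" by All_Traffic.src_zone, All_Traffic.dest_zone
```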

yh · ‎03-19-2024

Yes, for certain sources it's a bit hard for me to override the source name. I will try to see what can be done.
I was looking at source because it's one of the few fields that seem to be common across multiple models, e.g. network, authentication, change etc.

PickleRick · ‎03-19-2024

There are some fields which are always present - source, sourcetype, host, _raw, _time (along with some of Splunk's internal fields). But they each have their own meaning and you should be aware of the consequences if you want to fiddle with them.

In your case you could most probably add a field matching the appropriate CIM field (for example the dvc_zone). It could be a calculated field (evaluated by some static lookup listing your devices and associating them with zones) or (and that's one of the cases where indexed fields are useful) an indexed field, possibly added at the initial forwarder level.
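A rough sketch of the search-time variant, written as an explicit lookup so it can be tried ad hoc before turning it into an automatic lookup. The index name dmz_firewall and the lookup device_zones (with columns host and dvc_zone) are hypothetical names used only for illustration.

```spl
index=dmz_firewall
| lookup device_zones host OUTPUT dvc_zone
| stats count by dvc_zone
```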

yh · ‎03-19-2024

Thanks!

I did not know about indexed fields, that would be something interesting. Is there a way to add another field that is always present for all models? For example, in addition to source, sourcetype, host, _raw and _time, is it possible to add something like source_zone that works for all models? I saw that source, sourcetype, host, etc. are inherited, but I am unsure where they are inherited from.

PickleRick · ‎03-20-2024

OK. Because I think you might be misunderstanding something.

CIM is just a definition of fields which should be either present directly in your events or defined as calculated fields or automatic lookups.

So the way to go would be not to fiddle with the definition of the datamodel to fit the data but rather the other way around - modify the data to fit the datamodel.

There is already a good candidate for the "location" field, which I showed earlier - the dvc_zone field. You can fill it in either at search time or at index time, or even set it "statically" at the input level by using the _meta option.
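If dvc_zone does end up as an indexed field, it can be queried with tstats directly, without going through a data model. A minimal sketch, assuming the field has already been written at index time for the relevant inputs:

```spl
| tstats count where index=* dvc_zone=* by index, dvc_zone
```

In an ordinary event search, an indexed field can also be matched with the field::value form, for example dvc_zone::DMZ, where "DMZ" is a placeholder value.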

gcusello · ‎03-19-2024

Hi @yh ,

you can customize your Data Model by adding some fields according to your requirements (e.g. I usually also add the index), but don't duplicate the Data Models!

Ciao.

Giuseppe

yh · ‎03-19-2024

hi @gcusello I think that would be useful.

I tried to add the index field to the data model but don't seem to be able to. I don't see that field in the auto-extracted option. I can see fields like host and sourcetype being inherited from BaseEvent in the JSON.

I am wondering, should I modify the JSON then? Not sure if that is the right way. I can't seem to figure out how to add the index using the data model editor.

Thanks again

gcusello · ‎03-20-2024

Hi @yh ,

manually add it and you'll find it.

Remember that to see the index field in | tstats searches, you have to use the dataset prefix (e.g. Authentication.index).

Ciao.

Giuseppe
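For example, once an index field has been added to the Authentication dataset as described, a zone split could be sketched like this. The dmz* pattern is a placeholder for whatever naming the DMZ indexes follow.

```spl
| tstats count from datamodel=Authentication where Authentication.index="dmz*" by Authentication.index, Authentication.action
```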
