Splunk data models: to combine or to segregate?

yh
Explorer

Hello,

I have been working with Splunk for a few months now, and we use it mainly for cyber security monitoring. With regard to CIM data models, I am wondering whether I should create a separate data model for each zone or combine everything in a single data model.

For example, I have three independent network zones: DMZ, Zone A and Zone B. Each of those zones has multiple indexes linked to it.

Should I use the default CIM data model, e.g. datamodel=Authentication, with all the indexes from DMZ, ZoneA and ZoneB, or should I make copies of the data model?

Scenario 1:
If I use a common data model, I will use a filter such as where source=xxx to split things out in my queries and dashboards.

Scenario 2:
If I use separate data models, I will have datamodel=DMZ_Authentication, datamodel=ZoneA_Authentication and so on, and perhaps use append when I need to see the overall picture?
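
To make the two scenarios concrete, here is a rough sketch of how the searches might look. The source pattern ZoneA*, the breakdown by action, and the assumption that the cloned models keep the root dataset name Authentication (so fields are still prefixed with Authentication.) are illustrative only and would need checking in a real deployment.

Scenario 1, one common data model filtered by source:

```
| tstats count from datamodel=Authentication where source="ZoneA*" by Authentication.action
```

Scenario 2, one data model per zone, stitched together with append for the overall picture:

```
| tstats count from datamodel=DMZ_Authentication by Authentication.action
| eval zone="DMZ"
| append
    [| tstats count from datamodel=ZoneA_Authentication by Authentication.action
     | eval zone="ZoneA"]
| rename Authentication.action AS action
| stats sum(count) AS count by zone, action
```

In the second form, every new zone means another cloned model and another appended subsearch, which is the maintenance cost I am worried about.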

I am still unsure which is the best approach.





yh
Explorer

Thanks @gcusello and @PickleRick for all the replies, tips and hints.

They have been very helpful. In the end, I went with one data model, with segregation done using source filtering.
I am still fiddling with adding fields to the data model, but I am sure it will be a nice addition to have extra information such as the index available, either as a data model field or added at index time.

I wish I could mark both as solutions, but since I can only accept one, I will select Rick's, as that reply gave me the eureka moment: a single data model does not affect the security roles (index selection). That is what made me switch to the single data model. All in all, I have learnt a lot from all the replies. A big thank you.

gcusello
SplunkTrust

Hi @yh ,

as @PickleRick said, it's difficult to advise you without a defined use case.

For this reason, the best approach is to define your requirements and then design your data structure to meet them.

One additional hint:

whether you need more indexes and more Data Models depends on two factors:

  • access grants,
  • data retention.

In other words, if all your users can access all the data and the data must be retained for the same period, there's no reason to have different indexes or Data Models. You would also have to manage the extra ones and use them in searches, so I'd avoid duplicating DMs unless the requirements make it mandatory.

Ciao.

Giuseppe

yh
Explorer

Thanks for the hints.

In terms of data retention, all the sections will have a similar policy.

However, access grants can be an issue. In my use case, the dashboards will be monitored by section personnel and also by the SOC. Therefore, in terms of access, the SOC will be able to see DMZ, ZoneA and ZoneB, while the members of each section should only be able to see their own zone (need-to-know policy).

At the moment I am using different indexes so that I can perform some transforms specific to each zone, as the syslog formats differ because each zone uses a different log aggregator. By using the different indexes on the heavy forwarder (HF), I am able to apply SED rewrites to particular log sources, as well as host and source overrides. I remember that I can limit access based on indexes, but I guess this is not possible with data models. Will this be a concern?

If I put them all in one data model, is it still possible to restrict access? For example, if users can only manipulate views from a dashboard and cannot run searches themselves, that would still be OK.

Pros and cons in my mind:
Separate data models:
- Pros: I can easily segregate the tstats queries.
- Cons: It might be difficult to get overview stats, as I would need to use appends and maintain a data model for each new zone. Each new data model will also need to run periodically, which increases the number of scheduled accelerations?

Integrated data model:
- Cons: It might be harder to filter, e.g. between ZoneA, ZoneB and DMZ. It seems I can filter only on the few fields in the model, e.g. source and host.
- Pros: Easier to maintain, as I just need to add new indexes to the data model whitelist. It also limits the number of scheduled runs.
- And, as mentioned, the point on data access: will it still be possible to restrict it?

I am still quite new to Splunk so some of my thoughts might be wrong. Open to any advice, still in a conundrum. 

PickleRick
SplunkTrust

Unless you have a very specific use case you don't want to touch the CIM datamodels. They are the "common wisdom" and many existing searches and apps rely on the CIM being properly implemented and data being CIM-compliant.

The question is: what would be the goal of modifying the datamodels?

yh
Explorer

The data will still be CIM compliant though. I am simply replicating the data model, so I have two different sets of data models (all the settings are similar, but the whitelisted indexes in each data model are different).

By cloning the original Data Model from the CIM app, I have:
DMZ Network data model = only the indexes for DMZ
Zone A Network data model = only the indexes for Zone A

At the time, my goal was to make it easy for users to display the dashboard by simply swapping the data model in use, because DMZ and Zone A are very different from one another.

So would the best practice be to put everything in one common data model, e.g.
Default Network data model = all indexes
and then separate the zones by using filters like "where" in the search queries?

PickleRick
SplunkTrust

OK. There are additional things to consider here.

1. A datamodel is not the same as a datamodel's accelerated summary. If you just search from a non-accelerated datamodel, Splunk translates the search "underneath" into a normal search according to the definition of the dataset you're searching from. So all role-based restrictions apply.

2. As far as I remember (but you'd have to double-check it), even if you search from accelerated summaries, the index-based restrictions should still be in force because the accelerated summaries are stored along with normal event buckets in the index directory and are tied to the indexes themselves.

3. For exactly the same reason, the same goes for retention periods. You can't have an accelerated summary retention period longer than the events' retention period, since the accelerated summaries would get rolled to frozen with the bucket the events come from.

So there's more to it than meets the eye.
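
One way to double-check point 2 is to run the same summaries-only search under two different roles and compare which indexes show up. A minimal sketch, assuming the Authentication datamodel is accelerated in your environment:

```
| tstats summariesonly=true count from datamodel=Authentication by index
```

If the index-based restrictions do apply to the summaries, a restricted role should only get rows for the indexes it is allowed to search, while an unrestricted role should see all of them.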

yh
Explorer

Hey @PickleRick 

Regarding point 2, you are absolutely right. I just tried the same query on the same accelerated model with users in different roles, and the restricted users get far fewer results.

So, can I say the way forward seems to be one common data model then?
Is there any recommended or easy way to filter between zones in a summary search, for example?
Is using where source=ZoneA* alright then?

PickleRick
SplunkTrust

Well, look into the CIM definition and check which fields might be relevant to your use case.

"Zone" is a relatively vague term and can have different meanings depending on context.

For example, the Network Traffic datamodel has three different "zone" fields: src_zone, dest_zone and dvc_zone.

Of course filtering by the source field is OK, but it might not contain the thing you need.
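
As an illustration, the two filtering styles might look like this against Network Traffic. This is only a sketch: it assumes dvc_zone is actually populated in your events, and the values ZoneA and ZoneA* are placeholders for whatever your data really contains.

Filtering on a CIM zone field:

```
| tstats count from datamodel=Network_Traffic where All_Traffic.dvc_zone="ZoneA" by All_Traffic.action
```

Filtering on source:

```
| tstats count from datamodel=Network_Traffic where source="ZoneA*" by All_Traffic.action
```

The first form says what you mean and keeps working if the source naming changes; the second depends entirely on how source is set for each input.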

yh
Explorer

Yes, for certain sources it's a bit hard for me to override the source name; I will see what can be done.
I was looking at source because it's one of the few fields that seem to be common across multiple models, e.g. network, authentication, change, etc.

PickleRick
SplunkTrust

There are some fields which are always present - source, sourcetype, host, _raw, _time (along with some of Splunk's internal fields). But they each have their own meaning, and you should be aware of the consequences if you want to fiddle with them.

In your case you could most probably add a field matching the appropriate CIM field (for example the dvc_zone). It could be a calculated field (evaluated by some static lookup listing your devices and associating them with zones) or (and that's one of the cases where indexed fields are useful) an indexed field, possibly added at the initial forwarder level.
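
Before making anything permanent, the lookup approach can be prototyped in an ad hoc search. In the sketch below, the lookup name device_zones, its columns host and dvc_zone, and the index name dmz_firewall are all hypothetical placeholders:

```
index=dmz_firewall
| lookup device_zones host OUTPUT dvc_zone
| stats count by host, dvc_zone
```

Hosts that come back without a dvc_zone value are missing from the lookup. Once the mapping looks right, the same lookup could be configured as an automatic lookup so that dvc_zone is available to the datamodel.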

yh
Explorer

Thanks! 

I did not know about indexed fields; that sounds interesting. Is there a way to add another field that is always present for all models? For example, in addition to source, sourcetype, host, _raw and _time, is it possible to add something like source_zone that works for all models? I saw that source, sourcetype, host, etc. are inherited, but I am unsure where they are inherited from.

PickleRick
SplunkTrust

OK, I think you might be misunderstanding something.

CIM is just a definition of fields which should be either present directly in your events or defined as calculated fields or automatic lookups.

So the way to go would be not to fiddle with the definition of the datamodel to fit the data, but rather the other way around (modify the data to fit the datamodel).

There is already a good candidate for the "location" field, which I showed earlier - the dvc_zone field. You can fill it either at search time or at index time, or even set it "statically" at the input level by using the _meta option.
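
If you go the index-time route, one quick sanity check is to run tstats directly against the index, since that only works with indexed fields. A minimal sketch, where the index name dmz_firewall is a placeholder and dvc_zone is assumed to have been written as an indexed field:

```
| tstats count where index=dmz_firewall by dvc_zone, sourcetype
```

If the field was not written at index time, this should return no rows, which tells you to revisit the input or forwarder configuration before touching anything on the datamodel side.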

gcusello
SplunkTrust

Hi @yh ,

you can customize your Data Model by adding some fields according to your requirements (e.g. I usually also add the index), but don't duplicate the Data Models!

Ciao.

Giuseppe

yh
Explorer

Hi @gcusello, I think that would be useful.

I tried to add the index field to the data model but do not seem to be able to. I don't see that field in the auto-extracted option. I can see fields like host and sourcetype being inherited from BaseEvent in the JSON.

I am wondering, should I modify the JSON then? I am not sure if that is the right way. I can't seem to figure out how to add the index using the data model editor.

Thanks again

gcusello
SplunkTrust

Hi @yh ,

add it manually and you'll find it.

Remember that to use the index field in | tstats searches, you have to use the dataset prefix (e.g. Authentication.index).
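
For example, a search along these lines should work once the field exists in the Data Model. This is only a sketch: it assumes index has been added as a field of the Authentication dataset, and the pattern dmz_* is a placeholder for your own index names.

```
| tstats count from datamodel=Authentication where Authentication.index="dmz_*" by Authentication.index, Authentication.action
```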

Ciao.

Giuseppe
