Hi,
Let's say I have the sample data below, all being ingested into index="characters". How do I create two separate sub-indexes, "superheroes" and "villains", such that "superheroes" contains only events where archetype="superhero" (id="superman" and id="batman") and "villains" contains only events where archetype="villain" (id="joker")? The reason is that I want to set permissions on the sub-indexes so only specific users can see each index (e.g. only people with the role "good guys" can see superhero data).
I have tried summary indexing with the following query, scheduled the search, and enabled summary indexing but it doesn't capture the original fields in the data.
index=characters
| fields id, strengths, archetype
| where archetype="superhero"
| eventstats count as total_superheroes
| table id, strengths, archetype
Sample Json Data:
[
{
"id": "superman",
"strengths": "super strength, flight, and heat vision",
"archetype": "superhero"
},
{
"id": "batman",
"strengths": "exceptional martial arts skills, detective abilities, and psychic abilities",
"archetype": "superhero"
},
{
"id": "joker",
"strengths": "cunning and unpredictable personality",
"archetype": "villain"
}
]
You can put your summary indexes in different apps and only allow certain roles access to the different apps, or you could restrict access to the indexes by role.
For populating the summary index, how are you doing this? What do you mean by "original fields"?
Yes, I've thought about creating different apps but I wanted to avoid this since the only difference between the apps will be one line in the filter (e.g. archetype="superhero"). Ideally I just want to create separate indexes based on a search filter and be able to restrict access to these filters based on roles without the need to create different apps.
I tried populating the summary index using the query above.
By original fields not populating, I mean that once you run the query and check the index that was created (e.g. index=superheroes), the fields present in the search don't include id, strengths, and archetype, only default fields like date_hour, index, timeendpos, etc.
1. There is no such thing as "subindex". Indexes are separate entities and do not form any kind of hierarchy.
2. Unless you have a Very Good Reason (tm) there's not much sense in splitting data into multiple indexes - you use search-time filters to return just a subset of your events when needed.
3. Summary indexing is usually used for - as the name says - storing pre-aggregated summaries of your data so you can later use those aggregates to speed up your searches. Using collect to simply copy events from one index to another _usually_ doesn't make much sense (see also 2.)
So, what's the use case?
1. The use case is that, ideally, I want to create separate indexes based on a search filter and restrict access to them based on roles, without needing to create different apps. The data I'm ingesting is aggregate, so only admins can see it. From there, I want to create two separate dashboards (not Splunk apps if possible) showing data for superheroes or villains. I need only users with the role "good guys" to be able to access the superhero dashboard and not the villain dashboard, and vice versa. The solution I've thought about is creating indexes whose populating searches differ by only one line (e.g. archetype=superhero/villain) and then restricting access to these indexes based on user roles.
2. I can't restrict access when using search-time filters though.
3. Okay, I understand summary indexing is not the best approach. Do you have a better solution?
Yes. Restricting access is one of the valid points for creating separate indexes.
Your data though seems a bit strange - I didn't notice that before.
You have a json array with separate structures within that array which you want as separate events. That makes it a more complicated task. I'd probably try to use an external tool to read/receive the source "events", then parse the json, split the array into separate entities and push each of them separately to its proper index (either by writing to separate files for pickup by UF or pushing to HEC endpoint).
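To make the "external tool" idea concrete, here is a minimal Python sketch of the splitting and routing step. It is only an illustration under assumptions: the index names come from the question, while the archetype-to-index mapping, the sourcetype name characters:json and the function names are hypothetical. It treats archetype as either a single string or a list, and it only builds HEC-style payloads; it does not send anything.

```python
import json
from typing import Iterator

# Hypothetical mapping from archetype value to target index (names from the question).
ARCHETYPE_TO_INDEX = {"superhero": "superheroes", "villain": "villains"}

SAMPLE = """[
  {"id": "superman", "strengths": "super strength, flight, and heat vision", "archetype": "superhero"},
  {"id": "batman", "strengths": "exceptional martial arts skills", "archetype": "superhero"},
  {"id": "joker", "strengths": "cunning and unpredictable personality", "archetype": "villain"}
]"""


def split_and_route(raw: str, sourcetype: str = "characters:json") -> Iterator[dict]:
    """Yield one HEC-style payload per (character, target index) pair.

    The sourcetype default is an assumed placeholder, not a built-in one.
    """
    records = json.loads(raw)
    if isinstance(records, dict):
        # Tolerate a single object that is not wrapped in an array.
        records = [records]
    for record in records:
        archetypes = record.get("archetype", [])
        if isinstance(archetypes, str):
            archetypes = [archetypes]
        # Collect each target index once, preserving order.
        targets = []
        for archetype in archetypes:
            index = ARCHETYPE_TO_INDEX.get(archetype)
            if index is not None and index not in targets:
                targets.append(index)
        # Records with no known archetype are silently skipped in this sketch.
        for index in targets:
            yield {"index": index, "sourcetype": sourcetype, "event": record}


def to_hec_body(payloads) -> str:
    """Serialize payloads as one JSON object per line for a batched request."""
    return "\n".join(json.dumps(p) for p in payloads)


if __name__ == "__main__":
    print(to_hec_body(split_and_route(SAMPLE)))
```

The resulting body could then be POSTed to the HEC event endpoint (/services/collector/event) with an "Authorization: Splunk <token>" header, or the per-index events could be written to separate files for a UF to pick up. A record whose archetype lists both values ends up in both indexes, which also means it is indexed twice.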
Hey PickleRick,
Yeah, I was thinking this. The data is coming in through a modular input, so if I adjust the script I'd be able to route the events into their respective indexes. But if I'm doing that, I may as well create separate applications altogether for each one, which is what I'm trying to avoid with this exercise.
Regarding the data, yes this is a much simpler example of more complicated data I'm working with. Essentially each event is JSON data with values that are either string or [array]. archetype is [array] and can be both superhero and villain so this event should appear in both indexes (but I've simplified it for this example).
So is there no way to use summary indexing, or work around its rules, to meet my use case? I'm still trying to summarise my data by separating superheroes and villains to speed up searches. It seems like a lot of work just to create separate indexes based on a search.
Thanks,
Of course you _can_ do search & collect. It's just not something that's typically done since you'd have to first ingest the data "normally" and then split it using a search into another two indexes (since you don't want group A to see index B and vice versa). And if you wanted to use original sourcetype (or any other sourcetype than stash or stash_hec), you'd get double your license usage. If there is not much data, that might be acceptable but typically it's a waste of perfectly good license 😉 And a waste of resources to search, split and collect. And additional lag on ingest. So that's why you don't typically do it this way.
And I don't get why you would want to do separate apps?
Anyway, now you're saying that you want to speed up searches whereas before you said that it's due to access restrictions. And there is definitely something to work on with your data format if you indeed have a mix of various formats within one json structure which might be an array or might not be an array... That seems to be calling for some sanitization process on ingest.
Hey PickleRick,
I see. I was not aware that using a sourcetype other than stash would double licence usage, so thank you for making me aware of that. So the only solutions available for restricting search access based on filters are to create separate apps or to process the data prior to ingestion.
I didn't want to do separate apps because of congestion, especially since they would only differ by one line in the search filter. Please correct me if I'm wrong, but I thought this would increase costs.
The comment about speeding up searches was in reference to summary indexing; it is not my main concern. I was wondering why summary indexing wouldn't work, since filtering the search for only superheroes/villains will speed up the search, which is what summary indexing is meant to help with. The main purpose was always access restrictions.
Thanks,
I still don't understand what you mean by those apps. App in Splunk terminology is just a collection of files. The same set of settings can usually be equally well provided by a single app as well as multiple ones (with the possible difference of access to config elements provided by different apps if you differentiate permissions on a role/app basis).
Performance improvement when doing summary indexing happens not just because you use the collect command but because summary indexing assumes just that - doing some _summary_ on your data before indexing it. So - for example - you calculate some aggregated sums for every 15 minutes and store that value using collect command in an index so that later you don't have to summarize your raw data each time but just use that already calculated sum. That's what summary indexing is. Simply copying events from index to index is not summary indexing.
Not totally clear what the eventstats is doing here. It would help if you could illustrate the desired results from mock data. Do you mean to produce two tables like these?
1. superhero
archetype | id | strengths |
superhero | superman | super strength, flight, and heat vision |
superhero | batman | exceptional martial arts skills, detective abilities, and psychic abilities |
2. villain
archetype | id | strengths |
villain | joker | cunning and unpredictable personality |
To do these, you can use
index=characters
| spath path={}
| mvexpand {}
| spath input={}
| fields id, strengths, archetype
| where archetype="superhero"
| stats values(*) as * by id
for superhero; for villain, use
index=characters
| spath path={}
| mvexpand {}
| spath input={}
| fields id, strengths, archetype
| where archetype="villain"
| stats values(*) as * by id
Here is an emulation for you to play with and compare with real data
| makeresults
| eval _raw="[
{
\"id\": \"superman\",
\"strengths\": \"super strength, flight, and heat vision\",
\"archetype\": \"superhero\"
},
{
\"id\": \"batman\",
\"strengths\": \"exceptional martial arts skills, detective abilities, and psychic abilities\",
\"archetype\": \"superhero\"
},
{
\"id\": \"joker\",
\"strengths\": \"cunning and unpredictable personality\",
\"archetype\": \"villain\"
}
]"
| spath
``` the above emulates
index=characters
```
Yes, thank you. On top of creating those two separate tables, I then want to store the table data in separate indexes, keeping all field headers and values. From there I want to restrict access to the indexes. Ideally, I want to avoid creating separate apps.