I've been wanting to build some integrity checking and other functionality based on knowing the fields in a sourcetype for a while now.
At my company we've built a data dictionary of indexes and sourcetype of interest to the SOC. They can search the dictionary to help them remember the important data sources. I'd like to augment/use this info in a couple of new ways:
1) give them a field list for all of these sourcetypes so they could search for which sourcetypes have a relevant field (like src_ip)
2) I'd like to note the fields that appear in 100% of records for a sourcetype and then every day find out if is missing any of those fields. This would quickly clue me into data issues related to the event sent, parsing, or knowledge objects.
I know how to get a list of fields for 1 sourcetype and store that info. And I know how to compare a sourcetype to a past set of fields to a current set.
My challenge now is how do I get the list of fields for the 100 sourcetypes of interest
so far my best idea is to create 100 jobs to handle each sourcetype. Something like
```1-get the sourcetypes of interest and pull back data for them```
[| inputlookup dataDictionary.csv where imf_critical=true | eval yesterday=relative_time(now(),"-1d@d") | where evalTS>yesterday
| dedup sourcetype | sort sourcetype | head 5 | tail 1 | table sourcetype]
earliest=-2d@d latest=-1@d
```2-get samples for all indexes in which the sourcetype appears```
| dedup 10 index sourcetype
| fieldsummary
```3-determine field coverage so we can pick the hallmark fields```
| eventstats max(count) as maxCount
| eval pctCov=round(count/maxCount,2)*100
| table field pctCov
```4-add back in the sourcetype name```
| append
[| inputlookup dataDictionary.csv where imf_critical=true | eval yesterday=relative_time(now(),"-1d@d") | where evalTS>yesterday
| dedup sourcetype | sort sourcetype | head 5 | tail 1 | table sourcetype]
| eventstats first(sourcetype) as sourcetype
| eval evalTS=now()
| table sourcetype evalTS field pctCov
```5-collect the fields to a summary index daily```
| collect index=soc_summary marker="sumType=dataInfo, sumSubtype=stFields"
If I ran 100 jobs like this, the number after head would increment to give me the next sourcetype.
But I feel like there has to be a better way to do fieldsummary on a lot of sourcetypes.
Any ideas?
I got it. And it definitely starts with PickleRick's answer
<base search>
|bin span=3h _time
|dedup 10 index sourcetype _time host ```get a good cross section of the sourcetype in all of it's use cases```
| stats count count(*) as * by sourcetype
|untable sourcetype field fieldCount
|eval stCount=if(field=="count", fieldCount,null())
|eventstats max(stCount) as stCount by sourcetype
|eval pctCov=fieldCount/stCount
|search pctCov>=.5
|table sourcetype field pctCov
I was about to start writing a post trying to figure out what you seem to have figured out. I think I get the gist of your original search, though it seems like you have to repeat it per index? If that's the case, it wouldn't work, as you've got quite a few indexes to search over I'd imagine. I guess my question is, what is <base search> that you have referenced up there?
Fair question.
When I want to focus on a particular part of my SPL, I often start with
"<base search> | "
so that I keep the focus on what I am interested in.
In this case "<base search>" is how I restrict the results to the sourcetypes of interest to my SOC. I actually pull that from a list of critical data sources which is maintained by the SOC
For my purposes, I am only trying to understand fields by sourcetype
I am using index in the dedup command
|bin span=3h _time
|dedup 10 index sourcetype _time host
This allows me to consider 10 records from every index in every 3 hour period. I do that because I worry that different workloads (which get ingested into different indexes), may decide to change the format of what they send. This way I only consider common fields in a sourcetype across all indexes.
If I didn't account for different indexes, it might be possible that I evaluate different indexes each time and therefore would , may conclude that there is a change when the common fields actually did not change
If I wanted the field listing per index, I'd just modify the "stats" and "eventstats" to include index in the "by" clause.
@MonkeyK
All the possible fields for all the selected sourcetypes? With an assumption that the sourcetypes all have the same fields every time so you can create a list of "supposed_to_be_there_fields" and then reference that list every time, to find when a field is missing. Is that right?
That's a few questions rolled into one. They probably all won't be answered here. Solve for the first part and then create another question for the second part (referencing the first part, for those who come along later).
* Create a list. Use what you have (then determine how often it's going to be updated). An alternative could be something like
| fieldsummary
I suspect you already knew this command. Here's the link to the docs for those finding this later: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Fieldsummary
* Export as csv, and use as a lookup. Comparing the lookup list against what's available to find what's missing. In the Splunk community, we often kick this link around (credit: Duane Waddle): https://www.duanewaddle.com/proving-a-negative/
As a one-off, this might be okay. As an ongoing solution against 100's of sourcetypes, it sounds a little fragile. YMMV.
Best of luck! Maybe this helped some.
I'm reading this question on my tablet while walking my dog so the circumstances are not very good for analysing your search 😉
But let me ask you a different question - especially that you're talking about SOC team which is usually interested with a relatively small set of "types of data" (not sourcetypes!). Instead of making them look through a dictionary of fields for multiple sourcetypes ahd indexes why not use datamodels to normalize data and let the analysts query the data in a unified way regardless of where the specific events come from?
CIM is a great starting point for that. You can also create your own datamodels if needed.
The possible additional upside is that datamodels can be accelerated (but it introduces some problems with managing access, I know).
I don't really want to debate the value of the CIM. I think it's great, but not enough me or my SOC on it's own.
CIM is great for quickly finding data. And actually you reminded me that I should include the CIM and it's fields in my data model. Those records are just as pivotable as the other sources.
That said, access to underlying logs is still critical
OK. You do have a valid point. I just wanted to point out the existence of CIM because some people go to great lengths to reinvent the wheel while not noticing thigs already done by others.
I must say that I don't understand a thing or two about your search.
Firstly, I wouldn't search throughout all my data. How much are you ingesting? That must be a performance hit on your environment. I'd stick to sampling. And I mean heavy sampling.
Secondly, I don't understand those "head 5 | tail 1" - why would you want a fifth result?
Thirdly, instead of doing fieldsummary, I think I'd simply do something like
| stats count count(*) AS *
to find percent coverage. (With this you can easily add a BY sourcetype clause).
right, probably could have better explained.
| inputlookup dataDictionary.csv
is my lookup of all indexes and sourcetypes in the environment. Of the 1400 we have there are about 180 that the SOC is specifically interested in and of that there 70 sourcetypes that the SOC considers critical to their work.
I used the first clause as a subsearch to get one sourcetype to act on. In this case the 5th sourcetype when critical sourcetypes are sorted ascending
I do this because I need to pick a sourcetype to do fieldsummary on. I don't know how to do fieldsummary on more than one sourcetype and have the result tie back to the sourcetype of interest, hence the request for help.
I limit the amount of data to evaluate with dedup
| dedup 10 index sourcetype
says to get 10 records of the sourcetype for each index that uses that sourcetype.
-------------------------------------------
I like your stats count(*) as count by sourcetype technique . That would certainly allow me to summarize by sourcetype without needing to do them one at a time. Feels like it should be workable, but I don't know how to use the results
(would have done the next bit in a table, but the markdown is failing me in the most frustrating way)
With it I can get a table with a row for each sourcetype where I'd wind up with fields for all sourcetypes where some would be the same as the total, some less than the total, and some zero (because they are part of a different sourcetype.
Given I don't know the field names, I don't know how I'd determine pctCoverage by sourcetype
So how would I go from that to fields with 100% coverage by sourcetype?
traffic_logs: src_ip, dest_ip
dns: src_ip,request_type
dhcp: src_ip,dest_ip,mac,lease
or even fields by coverage?
sourcetype, field, pctCov
traffic, src_ip, 100
traffic, dest_ip, 100
etc