Splunk Enterprise

Is there an efficient way to do field summary on a large set of source types?

MonkeyK
Builder

For a while now I've been wanting to build some integrity checking and other functionality based on knowing the fields in a sourcetype.

At my company we've built a data dictionary of indexes and sourcetypes of interest to the SOC. They can search the dictionary to help them remember the important data sources. I'd like to augment/use this info in a couple of new ways:

1) give them a field list for all of these sourcetypes so they could search for which sourcetypes have a relevant field (like src_ip)

2) I'd like to note the fields that appear in 100% of records for a sourcetype and then every day find out if it is missing any of those fields. This would quickly clue me in to data issues related to the events sent, parsing, or knowledge objects.

I know how to get a list of fields for 1 sourcetype and store that info. And I know how to compare a sourcetype's past set of fields to its current set.
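(For reference, a minimal sketch of that single-sourcetype version, with placeholder index, sourcetype, and lookup names - not the exact search I run:)

index=main sourcetype=my_sourcetype earliest=-1d@d latest=@d
| fieldsummary
| eventstats max(count) as maxCount
| eval pctCov=round(count/maxCount,2)*100
| where pctCov=100 ```keep only the fields seen in every sampled event```
| table field
| outputlookup my_sourcetype_baseline_fields.csv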

My challenge now is how to get the list of fields for the 100 sourcetypes of interest.

So far my best idea is to create 100 jobs, one per sourcetype. Something like:

```1-get the sourcetypes of interest and pull back data for them```
[| inputlookup dataDictionary.csv where imf_critical=true | eval yesterday=relative_time(now(),"-1d@d") | where evalTS>yesterday
| dedup sourcetype | sort sourcetype | head 5 | tail 1 | table sourcetype]
earliest=-2d@d latest=-1d@d
```2-get samples for all indexes in which the sourcetype appears```
| dedup 10 index sourcetype
| fieldsummary
```3-determine field coverage so we can pick the hallmark fields```
| eventstats max(count) as maxCount
| eval pctCov=round(count/maxCount,2)*100
| table field pctCov
```4-add back in the sourcetype name```
| append
[| inputlookup dataDictionary.csv where imf_critical=true | eval yesterday=relative_time(now(),"-1d@d") | where evalTS>yesterday
| dedup sourcetype | sort sourcetype | head 5 | tail 1 | table sourcetype]
| eventstats first(sourcetype) as sourcetype
| eval evalTS=now()
| table sourcetype evalTS field pctCov
```5-collect the fields to a summary index daily```
| collect index=soc_summary marker="sumType=dataInfo, sumSubtype=stFields"

If I ran 100 jobs like this, the number after head would increment to give me the next sourcetype.

But I feel like there has to be a better way to do fieldsummary on a lot of sourcetypes.

Any ideas?


MonkeyK
Builder

I got it. And it definitely starts with PickleRick's answer:

<base search>
| bin span=3h _time
| dedup 10 index sourcetype _time host ```get a good cross section of the sourcetype in all of its use cases```
| stats count count(*) as * by sourcetype
| untable sourcetype field fieldCount
| eval stCount=if(field=="count", fieldCount, null())
| eventstats max(stCount) as stCount by sourcetype
| eval pctCov=fieldCount/stCount
| search pctCov>=.5
| table sourcetype field pctCov
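
To keep the daily summary-index piece, the step-5 collect from my original search can bolt onto the end of this, e.g. (same index and marker as in the original post):

| eval evalTS=now()
| collect index=soc_summary marker="sumType=dataInfo, sumSubtype=stFields"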


manderson7
Contributor

I was about to start writing a post trying to figure out what you seem to have figured out. I think I get the gist of your original search, though it seems like you have to repeat it per index? If that's the case, it wouldn't work, as you've got quite a few indexes to search over, I'd imagine. I guess my question is: what is the <base search> that you have referenced up there?


MonkeyK
Builder

Fair question.

When I want to focus on a particular part of my SPL, I often start with 
"<base search> | "
so that I keep the focus on what I am interested in.

In this case "<base search>" is how I restrict the results to the sourcetypes of interest to my SOC. I actually pull that from a list of critical data sources which is maintained by the SOC.

For my purposes, I am only trying to understand fields by sourcetype.
I am using index in the dedup command:
|bin span=3h _time
|dedup 10 index sourcetype _time host
This allows me to consider 10 records from every index in every 3-hour period. I do that because I worry that different workloads (which get ingested into different indexes) may decide to change the format of what they send. This way I only consider common fields in a sourcetype across all indexes.
If I didn't account for different indexes, it might be possible that I evaluate different indexes each time and therefore might conclude that there is a change when the common fields actually did not change.

If I wanted the field listing per index, I'd just modify the "stats" and "eventstats" to include index in the "by" clause.  
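Roughly like this (one wrinkle: untable only keeps a single identifier column, so index and sourcetype get combined into one key and split apart again afterwards - the ":" separator is an assumption and would break if a key value could contain it):

<base search>
| bin span=3h _time
| dedup 10 index sourcetype _time host
| stats count count(*) as * by index sourcetype
| eval key=index.":".sourcetype
| fields - index sourcetype
| untable key field fieldCount
| eval stCount=if(field=="count", fieldCount, null())
| eventstats max(stCount) as stCount by key
| eval pctCov=fieldCount/stCount
| search pctCov>=.5
| eval index=mvindex(split(key,":"),0), sourcetype=mvindex(split(key,":"),1)
| table index sourcetype field pctCov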


efavreau
Motivator

@MonkeyK
All the possible fields for all the selected sourcetypes? With an assumption that the sourcetypes all have the same fields every time so you can create a list of "supposed_to_be_there_fields" and then reference that list every time, to find when a field is missing. Is that right?
That's a few questions rolled into one. They probably all won't be answered here. Solve for the first part and then create another question for the second part (referencing the first part, for those who come along later).

* Create a list. Use what you have (then determine how often it's going to be updated). An alternative could be something like

| fieldsummary

I suspect you already knew this command. Here's the link to the docs for those finding this later: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Fieldsummary


* Export as csv, and use as a lookup. Compare the lookup list against what's available to find what's missing (a rough sketch follows below). In the Splunk community, we often kick this link around (credit: Duane Waddle): https://www.duanewaddle.com/proving-a-negative/
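
A rough sketch of what that comparison could look like for a single sourcetype, assuming the exported baseline lookup has one column named "field" (the index, sourcetype, and lookup names here are placeholders):

index=main sourcetype=my_sourcetype earliest=-1d@d latest=@d
| fieldsummary
| fields field
| eval found=1
| inputlookup append=true my_sourcetype_baseline_fields.csv
| stats max(found) as found by field
| where isnull(found) ```baseline fields that did not show up in today's data```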

As a one-off, this might be okay. As an ongoing solution against hundreds of sourcetypes, it sounds a little fragile. YMMV.

Best of luck! Maybe this helped some.

###

If this reply helps you, an upvote would be appreciated.

PickleRick
SplunkTrust

I'm reading this question on my tablet while walking my dog so the circumstances are not very good for analysing your search 😉

But let me ask you a different question - especially since you're talking about a SOC team, which is usually interested in a relatively small set of "types of data" (not sourcetypes!). Instead of making them look through a dictionary of fields for multiple sourcetypes and indexes, why not use datamodels to normalize data and let the analysts query the data in a unified way regardless of where the specific events come from?

CIM is a great starting point for that. You can also create your own datamodels if needed.

The possible additional upside is that datamodels can be accelerated (but it introduces some problems with managing access, I know).

MonkeyK
Builder

I don't really want to debate the value of the CIM. I think it's great, but not enough for me or my SOC on its own.
CIM is great for quickly finding data. And actually you reminded me that I should include the CIM and its fields in my data dictionary. Those records are just as pivotable as the other sources.

That said, access to underlying logs is still critical

  • For the integrity checking portion that I mentioned. An altered log format or an updated TA can often result in data that doesn't get included in the data model, or may not be included in a search of the logs that assumes the data would exist.
    • For example, if src_ip didn't get parsed out of a network traffic sourcetype (firewall logs, flow logs), the higher-level datamodel search would simply not find that record and the SOC analyst might be none the wiser.
    • The integrity checking is what will clue me in to a problem with the source.
  • Also, CIM data models often do not contain enough information to fully understand an alert. I find that they are great for alerting, but we often have to drill down to the sources.
    • For example, the Authentication datamodel does not account for why authentication failed. A locked account or an expired password are going to result in lots of authentication failures.
  • CIM does not automatically account for all newly onboarded sources. An automated data dictionary does.

    In my experience, robust IR will still require that we work with the logs.

PickleRick
SplunkTrust

OK. You do have a valid point. I just wanted to point out the existence of CIM because some people go to great lengths to reinvent the wheel while not noticing things already done by others.

I must say that I don't understand a thing or two about your search.

Firstly, I wouldn't search through all my data. How much are you ingesting? That must be a performance hit on your environment. I'd stick to sampling. And I mean heavy sampling.

Secondly, I don't understand those "head 5 | tail 1" - why would you want a fifth result?

Thirdly, instead of doing fieldsummary, I think I'd simply do something like

| stats count count(*) AS *

to find percent coverage. (With this you can easily add a BY sourcetype clause).
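
For example, on one sampled sourcetype (names are placeholders), something like

index=main sourcetype=my_sourcetype
| head 1000 ```limit how many events are scanned```
| stats count count(*) AS *

returns a single row where count is the number of events scanned and every other column is how many of those events contained that field, so dividing each column by count gives the percent coverage.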


MonkeyK
Builder

Right, I probably could have explained that better.

| inputlookup dataDictionary.csv

is my lookup of all indexes and sourcetypes in the environment. Of the 1400 we have, there are about 180 that the SOC is specifically interested in, and of those there are 70 sourcetypes that the SOC considers critical to their work.

I used the first clause as a subsearch to get one sourcetype to act on. In this case, the 5th sourcetype when critical sourcetypes are sorted ascending.

I do this because I need to pick a sourcetype to do fieldsummary on. I don't know how to do fieldsummary on more than one sourcetype and have the result tie back to the sourcetype of interest, hence the request for help.

I limit the amount of data to evaluate with dedup

| dedup 10 index sourcetype
says to get 10 records of the sourcetype for each index that uses that sourcetype.

-------------------------------------------


I like your stats count count(*) as * by sourcetype technique. That would certainly allow me to summarize by sourcetype without needing to do them one at a time. Feels like it should be workable, but I don't know how to use the results.

(would have done the next bit in a table, but the markdown is failing me in the most frustrating way)

With it I can get a table with a row for each sourcetype, where I'd wind up with columns for the fields of all sourcetypes: some counts would be the same as the total, some less than the total, and some zero (because they are part of a different sourcetype).

Given I don't know the field names, I don't know how I'd determine pctCoverage by sourcetype

So how would I go from that to fields with 100% coverage by sourcetype?

traffic_logs: src_ip, dest_ip

dns: src_ip,request_type

dhcp: src_ip,dest_ip,mac,lease

or even fields by coverage?

sourcetype, field, pctCov

traffic, src_ip, 100

traffic, dest_ip, 100

etc
