Splunk Search

Search Strategies for Complex Data Sets

cmwhite
Explorer

Hello All,

I've searched Answers here and have not really found an answer to my problem; my apologies if I missed one or two.

As the title states, I'm trying to find generic search strategies that will allow me to collect the necessary events given a particular scenario or goal. Here is an image that will hopefully help to describe the data I'm working with.

http://www.freeimagehosting.net/dl8bb

There will be nearly 40 different data sets all to the same source type. The data is not inserted in real time but in bulk, there is a time stamp in a "start" field that is the same for all events in one insert, denoted by the red dashes. One insert has upwards of 200,000 events. I have control over most of the source data before it reaches Splunk.

Strategies I have working so far:

1) Single data set most recent insert (single green dash in image):

sourcetype="dataset" dataset_name="dataset 1" | streamstats dc(start) as distinct_times | head (distinct_times == 1) | ...

2) Single data set all inserts (light green oval on single row in image):

sourcetype="dataset" dataset_name="dataset 1" | ...

3) All data sets, all inserts (light blue square in image):

sourcetype="dataset" | ...

Strategies I don't have down:

4) All data sets, only the most recent inserts each (purple ovals in image).

I've considered a strategy based on the distinct count on data set names as compared to pre-calculated total number of data set names:

sourcetype="dataset" | streamstats dc(dataset_name) as distinct_names | head (distinct_names == total_dataset_names) | ...

There is at least one problem with this strategy, illustrated by Data Set 5 in the image: because it has a higher insert frequency, the search would include all of its inserts while waiting for the distinct count to reach the total data set name count.

I've also tried mapping the following across all data set names, but due to the size of the inserts I was hitting subsearch limits:

sourcetype="dataset" dataset_name="$dataset_name$" | streamstats dc(start) as distinct_times | head (distinct_times == 1) | ...

5) All dataset month to month comparison.

Live search strategies are preferred, but I'm also open to strategies that utilize saved searches, as long as they work dynamically; i.e., if a new data set name appears, the reports, alerts, and views should adjust without intervention.

I'd most love to hear what you all think could solve 4), and if there is a better approach I should be taking to the others, I'm all ears. I realize this was a long post, and I appreciate it if you've made it through.

Best Regards,
Chris

1 Solution

cmwhite
Explorer

Sorry for the delay, but it took a while to think through the solution. The more obvious imperative approaches using iteration and subsearches were rather futile, since they were subsearching the wrong end of the problem.

Instead of running a large complex search iteratively across each of my large data sets, I opted for a large boolean term that defines the (data set, most recent insert) pairings relevant to what I'm calling "Global Current".

In other words, I run a summarized saved search every minute that collects all data set "name" and "start" field pairs, which only incurs a cost when new inserts are indexed. By sorting this table by "start", I can dedup it to get the most recent pairings.
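A minimal sketch of that scheduled saved search, assuming the field names from the example below (the actual summarization details in my search differ somewhat):

```
sourcetype="dataset"
| stats count by name, start
| fields name, start
| sort - start
```

Any aggregation that reduces the events to one row per name/start pair would do; stats is just the cheapest way to get there.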

Using loadjob on this saved search inside a subsearch to define the "Global Current" data set terms for the base search solves the problem.

If someone is trying to solve a similar problem here is a more concrete example of the problem and my solution.

Example Events:

name="Data Set B" start="2013-05-28 02:00:00" ... log data ...
name="Data Set B" start="2013-04-28 02:00:00" ... log data ...
name="Data Set D" start="2013-04-21 10:00:00" ... log data ...
name="Data Set C" start="2013-04-10 07:00:00" ... log data ...
name="Data Set A" start="2013-04-01 17:00:00" ... log data ...
name="Data Set C" start="2013-03-05 07:00:00" ... log data ...
name="Data Set A" start="2013-03-01 17:00:00" ... log data ...
name="Data Set A" start="2013-02-01 17:00:00" ... log data ...
name="Data Set D" start="2013-01-21 10:00:00" ... log data ...

Summarized saved search just includes name/start pairs sorted by start:

name            start
"Data Set B"    "2013-05-28 02:00:00"
"Data Set B"    "2013-04-28 02:00:00"
"Data Set D"    "2013-04-21 10:00:00"
"Data Set C"    "2013-04-10 07:00:00"
"Data Set A"    "2013-04-01 17:00:00"
"Data Set C"    "2013-03-05 07:00:00"
"Data Set A"    "2013-03-01 17:00:00"
"Data Set A"    "2013-02-01 17:00:00"
"Data Set D"    "2013-01-21 10:00:00"

dedup name provides most recent name/start pairs:

name            start
"Data Set B"    "2013-05-28 02:00:00"
"Data Set D"    "2013-04-21 10:00:00"
"Data Set C"    "2013-04-10 07:00:00"
"Data Set A"    "2013-04-01 17:00:00"

loadjob in subsearch converts the table to:

( ("Data Set B" AND "2013-05-28 02:00:00") OR ("Data Set D" AND "2013-04-21 10:00:00") OR ("Data Set C" AND "2013-04-10 07:00:00") OR ("Data Set A" AND "2013-04-01 17:00:00") )

This results in matching only the events for the "Global Current" set, without incurring the large subsearch/map/join issues of the earlier approaches.

I was concerned about the size of the term string once I hit 40+ data sets (80 key=value pairs in the OR'd AND term), not to mention the other filtering terms needed for the search. The concerns were the search string character limit and the browser URI length limit. After reading the documentation, there shouldn't be a limit on search string length, and the browser URI limit can be avoided by using a macro.

The final search looks something like this (pseudo code):

index=data termA termB termC [ | loadjob bla:bla:bla | sort - start | dedup name ] | ...

After macro:

index=data termA termB termC `current_global` | ...
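For completeness, the macro just wraps the subsearch. A sketch of the definition in macros.conf, with the saved search reference left as a placeholder (as it is in the pseudo code above):

```
# macros.conf -- sketch; the savedsearch reference is a placeholder
[current_global]
definition = [ | loadjob savedsearch="<user>:<app>:<summary_search>" | sort - start | dedup name ]
```

The macro keeps the dashboard and report URIs short, since the enormous OR'd AND term is expanded server-side at search time.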

A note on efficiency: though it has worked and scaled well so far, I've made sure to run the simpler terms before the enormous OR'd AND logic, and I've had excellent results.



alacercogitatus
SplunkTrust
SplunkTrust

For #4, I would place a csv in your lookups folder, with your dataset names in it.

datasets.csv

dataset_name
dataset1
dataset2
dataset3

Then you can use the map command based on these dataset names

|append [ |inputlookup datasets.csv | map [ search sourcetype=dataset dataset_name="$dataset_name$" | head 1]]

This returns the raw event for the first event in each dataset. You may need to set maxsearches=100 or a number bigger than the number of datasets. It may not be most efficient, but I'm not sure how else to approach that one. It shouldn't be too large of an execution, because the map search will stop once it hits the head of the data set.
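As a sketch of that suggestion with the limit set inline (untested; the lookup and field names are the illustrative ones above):

```
| inputlookup datasets.csv
| map maxsearches=100 search="search sourcetype=dataset dataset_name=\"$dataset_name$\" | head 1"
```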


cmwhite
Explorer

This wouldn't quite work for the problem. First, a lookup would require manual management of a CSV, but one of the requirements was a dynamic solution. Second, append, map, and subsearches all suffer from low limit ceilings that I was readily hitting when trying to iterate over that side of the problem. After much thought I ended up with a solution; see my posted answer if you're curious.
