Splunk Search

How can I escape the 50K subsearch limit while linking commands (joining datasets) together?

woodcock
Esteemed Legend

Does anybody have any creative ways to join search outputs together and avoid subsearch limits?

1 Solution

woodcock
Esteemed Legend

I started out with a goal of appending 5 CSV files with 1M events each; the non-numbered *.csv's events all have TestField=0, the *1.csv's files all are 1, and so on.
Don't read anything into the filenames or fieldnames; this was simply what was handy to me.

I used this search every time to see what ended up in the final file:

|inputlookup 1MeventsFile5.csv | stats count by TestField

The correct/desired result of both the search and the contents of the file should be this:

TestField    count
0    1000000
1    1000000
2    1000000
3    1000000
4    1000000

I started with the totally obvious:

|inputlookup 1MeventsFile.csv | outputlookup 1MeventsFile5.csv
| append [|inputlookup 1MeventsFile1.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile2.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile3.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile4.csv | outputlookup append=true 1MeventsFile5.csv]
| stats count by TestField

This resulted in this totally predictable output on the "Statistics" tab:

TestField   count
0   1000000
1   50000
2   50000
3   50000
4   50000

But somewhat surprisingly, the file contained this:

TestField   count
0   1000000

I surmised that the problem is that everything runs and finishes at roughly the same time and so the first search got the file handle to write first so the others could not write to the busy file. I needed someway to serialize the actions on each file and eventually tried this:

|inputlookup 1MeventsFile.csv | outputlookup 1MeventsFile5.csv
| appendpipe [|inputlookup 1MeventsFile1.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile2.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile3.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile4.csv | outputlookup append=true 1MeventsFile5.csv]
| stats count by TestField

Disappointingly, although the "Statistics" tab always had the correct results, occasionally a few events would be dropped from some of the files which is why I finally settled on this:

|inputlookup max=0 1MeventsFile.csv
| appendpipe [|inputlookup max=0 1MeventsFile1.csv]
| appendpipe [|inputlookup max=0 1MeventsFile2.csv]
| appendpipe [|inputlookup max=0 1MeventsFile3.csv]
| appendpipe [|inputlookup max=0 1MeventsFile4.csv]
| outputlookup max=0 1MeventsFile5.csv
| stats count by TestField

Although I freely admit that this is an abusive hack of the appendpipe command, this approach always yields the correct results both in the "Statistics" tab and in the file and was completely exempt from subsearch limits!

View solution in original post

woodcock
Esteemed Legend

I started out with a goal of appending 5 CSV files with 1M events each; the non-numbered *.csv's events all have TestField=0, the *1.csv's files all are 1, and so on.
Don't read anything into the filenames or fieldnames; this was simply what was handy to me.

I used this search every time to see what ended up in the final file:

|inputlookup 1MeventsFile5.csv | stats count by TestField

The correct/desired result of both the search and the contents of the file should be this:

TestField    count
0    1000000
1    1000000
2    1000000
3    1000000
4    1000000

I started with the totally obvious:

|inputlookup 1MeventsFile.csv | outputlookup 1MeventsFile5.csv
| append [|inputlookup 1MeventsFile1.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile2.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile3.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile4.csv | outputlookup append=true 1MeventsFile5.csv]
| stats count by TestField

This resulted in this totally predictable output on the "Statistics" tab:

TestField   count
0   1000000
1   50000
2   50000
3   50000
4   50000

But somewhat surprisingly, the file contained this:

TestField   count
0   1000000

I surmised that the problem is that everything runs and finishes at roughly the same time and so the first search got the file handle to write first so the others could not write to the busy file. I needed someway to serialize the actions on each file and eventually tried this:

|inputlookup 1MeventsFile.csv | outputlookup 1MeventsFile5.csv
| appendpipe [|inputlookup 1MeventsFile1.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile2.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile3.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile4.csv | outputlookup append=true 1MeventsFile5.csv]
| stats count by TestField

Disappointingly, although the "Statistics" tab always had the correct results, occasionally a few events would be dropped from some of the files which is why I finally settled on this:

|inputlookup max=0 1MeventsFile.csv
| appendpipe [|inputlookup max=0 1MeventsFile1.csv]
| appendpipe [|inputlookup max=0 1MeventsFile2.csv]
| appendpipe [|inputlookup max=0 1MeventsFile3.csv]
| appendpipe [|inputlookup max=0 1MeventsFile4.csv]
| outputlookup max=0 1MeventsFile5.csv
| stats count by TestField

Although I freely admit that this is an abusive hack of the appendpipe command, this approach always yields the correct results both in the "Statistics" tab and in the file and was completely exempt from subsearch limits!

sideview
SplunkTrust
SplunkTrust

Sort of an odd answer, because the question is more general, about search results in general, and the answer is just about merging lots of lookups/csv's together, which behave quite differently. Is it more appropriate for you to amend the question, making it to be more narrowly about working with multiple large csv files? Or for me to give you a general answer on how to skirt subsearch limits when trying to effectively merge/join/mash up different and somewhat large results from normal searches?

woodcock
Esteemed Legend

This approach could be applied to export "regular" search's output by using outputlookup to send the ("too many") events a lookup file. Then it (and other results) could be appended/merged with this method so that there is no limit applied.

sideview
SplunkTrust
SplunkTrust

Sure, but in the general case it's not a very good solution. Generally you want to reorganize the searches where possible, into a disjunction and then group/tweak and normalize as necessary with eval/streamstats/eventstats/stats/chart/xyseries/untable/omgHeadExplodes.Putting out somewhat hacky searches ( I don't think you disagree here) with appendcols as a general answer is less than ideal.

woodcock
Esteemed Legend

Agreed; there is nobody more vocal/active that I am on Answers working with people to avoid subsearches and reduce dataset sizes. Even so, there are circumstances where merging large results is absolutely necessary and nothing else will work. I had to create solution for my own purposes to solve just such an intractable problem. I figured a "fake" question was a decent way to share the option with others.

Get Updates on the Splunk Community!

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...

State of Splunk Careers 2024: Maximizing Career Outcomes and the Continued Value of ...

For the past four years, Splunk has partnered with Enterprise Strategy Group to conduct a survey that gauges ...

Data-Driven Success: Splunk & Financial Services

Splunk streamlines the process of extracting insights from large volumes of data. In this fast-paced world, ...