Splunk Search

How can I escape the 50K subsearch limit while linking commands (joining datasets) together?

woodcock
Esteemed Legend

Does anybody have any creative ways to join search outputs together and avoid subsearch limits?

1 Solution

woodcock
Esteemed Legend

I started out with the goal of appending 5 CSV files of 1M events each; the events in the non-numbered *.csv all have TestField=0, those in *1.csv all have TestField=1, and so on.
Don't read anything into the filenames or field names; they were simply what was handy to me.
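
For anyone who wants to reproduce the setup, a sketch like the following should generate one of the test files (assuming your environment lets makeresults emit that many rows; the filename and TestField value are just examples):

|makeresults count=1000000
| eval TestField=0
| fields TestField
| outputlookup 1MeventsFile.csv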

I used this search every time to see what ended up in the final file:

|inputlookup 1MeventsFile5.csv | stats count by TestField

The correct/desired result of both the search and the contents of the file should be this:

TestField    count
0    1000000
1    1000000
2    1000000
3    1000000
4    1000000

I started with the totally obvious:

|inputlookup 1MeventsFile.csv | outputlookup 1MeventsFile5.csv
| append [|inputlookup 1MeventsFile1.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile2.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile3.csv | outputlookup append=true 1MeventsFile5.csv]
| append [|inputlookup 1MeventsFile4.csv | outputlookup append=true 1MeventsFile5.csv]
| stats count by TestField

This resulted in this totally predictable output on the "Statistics" tab:

TestField   count
0   1000000
1   50000
2   50000
3   50000
4   50000

But somewhat surprisingly, the file contained this:

TestField   count
0   1000000

I surmised that the problem was that everything runs and finishes at roughly the same time, so whichever search grabbed the file handle first locked out the others, which could not write to the busy file. I needed some way to serialize the writes to the file and eventually tried this:

|inputlookup 1MeventsFile.csv | outputlookup 1MeventsFile5.csv
| appendpipe [|inputlookup 1MeventsFile1.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile2.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile3.csv | outputlookup append=true 1MeventsFile5.csv]
| appendpipe [|inputlookup 1MeventsFile4.csv | outputlookup append=true 1MeventsFile5.csv]
| stats count by TestField

Disappointingly, although the "Statistics" tab always had the correct results, occasionally a few events would be dropped from some of the files, which is why I finally settled on this:

|inputlookup max=0 1MeventsFile.csv
| appendpipe [|inputlookup max=0 1MeventsFile1.csv]
| appendpipe [|inputlookup max=0 1MeventsFile2.csv]
| appendpipe [|inputlookup max=0 1MeventsFile3.csv]
| appendpipe [|inputlookup max=0 1MeventsFile4.csv]
| outputlookup max=0 1MeventsFile5.csv
| stats count by TestField

Although I freely admit that this is an abusive hack of the appendpipe command, this approach always yields the correct results, both in the "Statistics" tab and in the file, and is completely exempt from subsearch limits!

sideview
SplunkTrust

Sort of an odd answer, because the question is more general, about search results in general, and the answer is just about merging lots of lookups/csv's together, which behave quite differently. Is it more appropriate for you to amend the question, making it to be more narrowly about working with multiple large csv files? Or for me to give you a general answer on how to skirt subsearch limits when trying to effectively merge/join/mash up different and somewhat large results from normal searches?

woodcock
Esteemed Legend

This approach could be applied to a "regular" search's output by using outputlookup to send the ("too many") events to a lookup file. Then it (and other results) could be appended/merged with this method so that no limit is applied.
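
For example (the index, sourcetype, and field names below are just placeholders), you would first stage each oversized result set into its own lookup file:

index=web sourcetype=access_combined status=500
| fields _time, clientip, uri
| outputlookup stagedErrors1.csv

and then merge the staged files with the same appendpipe trick:

|inputlookup max=0 stagedErrors1.csv
| appendpipe [|inputlookup max=0 stagedErrors2.csv]
| stats count by clientip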

sideview
SplunkTrust

Sure, but in the general case it's not a very good solution. Generally you want to reorganize the searches, where possible, into a disjunction and then group/tweak and normalize as necessary with eval/streamstats/eventstats/stats/chart/xyseries/untable/omgHeadExplodes. Putting out somewhat hacky searches (I don't think you disagree here) with appendcols as a general answer is less than ideal.
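
To sketch what I mean (every index, sourcetype, and field name here is invented for illustration), rather than joining two searches you run a single disjunction and normalize afterward:

(index=web sourcetype=access_combined) OR (index=app sourcetype=app_logs)
| eval user=coalesce(web_user, app_user)
| stats values(uri) AS pages, count BY user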

woodcock
Esteemed Legend

Agreed; there is nobody more vocal/active than I am on Answers working with people to avoid subsearches and reduce dataset sizes. Even so, there are circumstances where merging large results is absolutely necessary and nothing else will work. I had to create this solution for my own purposes to solve just such an intractable problem. I figured a "fake" question was a decent way to share the option with others.
