Solved: What causes unioned data sets to be truncated?

jsinnott_ · ‎11-08-2017

Hi Splunk Experts--

I'm confused about the union command and am hoping you can
help. Specifically, I'm struggling to understand what causes the
"things that get unioned" to be truncated-- in my case to 50,000
records.

Here's an example of what confuses me:

Imagine three sets of data-- I've put them in three separate indexes
called union_1, union_2 and union_3. The data sets are very similar:
each has 60,000 records, each consisting of a timestamp, a color and a
hash. Each data set has exactly one event per second and each covers
the same 60,000 seconds (from 2017-01-01 00:00:01 to 2017-01-01
16:40:00). The color is random and the hash is unique across all
180,000 events (60,000 * three data sets).

Here's union_1:

time                       color   hash
-------------------------  ------  --------------------------------
2017-01-01 00:00:01 -0800  blue    08decd051408e648b941b5dbb9b1578c
2017-01-01 00:00:02 -0800  yellow  39d98f7f9a98920ee08631c9e6a4e867
2017-01-01 00:00:03 -0800  green   2b34449aae3a941c64dd76d33a6cfc04
...
2017-01-01 16:39:58 -0800  blue    b2cc43ab839bf57711a00f8f7a622e97
2017-01-01 16:39:59 -0800  blue    e26f577b10d0fa172c122deca813d38f
2017-01-01 16:40:00 -0800  blue    c9b0b55e7513963f7b04cf3c424686f2

...and union_2:

time                       color   hash
-------------------------  ------  --------------------------------
2017-01-01 00:00:01 -0800  violet  c8e68d6c154fc0ca88220a299dba7c55
2017-01-01 00:00:02 -0800  blue    3e18602a1d137ea4bf9157e67c4386ed
2017-01-01 00:00:03 -0800  violet  ecdf61cd34cda950bd782e3a6ba51fd6
...
2017-01-01 16:39:58 -0800  violet  5c00f68da1aa343ec0944fbcd42775fc
2017-01-01 16:39:59 -0800  green   2c3a626ff26a05f9895dc1c9ae1d074e
2017-01-01 16:40:00 -0800  red     9b796de25b072d8a48d3e9a7a716c4e9

...and union_3:

time                       color   hash
-------------------------  ------  --------------------------------
2017-01-01 00:00:01 -0800  orange  772468eb812735bfa984b91477afe967
2017-01-01 00:00:02 -0800  violet  6d9ebc2ce8b1c79d42793d624daeb402
2017-01-01 00:00:03 -0800  red     a31d8811b95b4597f943f268f4068fb0
...
2017-01-01 16:39:58 -0800  yellow  17b43d58e4920f1d2044552acdad5507
2017-01-01 16:39:59 -0800  violet  12425e908448371c38a1f0fe12aedf73
2017-01-01 16:40:00 -0800  indigo  ea1fb54c5c2b5fd2161856ea6937226e

You get the idea... 🙂
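For anyone who wants to reproduce data shaped like this, here is a minimal Python sketch. It is not how the original data was made: the color list, the md5-based hash, and the make_events name are all assumptions chosen to match the description above (one event per second, 60,000 events per index, hashes unique across all three).

```python
import hashlib
import random
from datetime import datetime, timedelta, timezone

# Assumed palette; the thread only shows these seven color names.
COLORS = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

# Fixed -0800 offset, matching the sample timestamps.
PST = timezone(timedelta(hours=-8))
START = datetime(2017, 1, 1, 0, 0, 1, tzinfo=PST)


def make_events(index_name, count=60_000):
    """Return (time, color, hash) tuples, one per second from START."""
    rng = random.Random(index_name)  # seeded so each index is repeatable
    events = []
    for i in range(count):
        ts = START + timedelta(seconds=i)
        # Hashing "index:offset" gives a distinct input for every event in
        # every index, so the digests differ across all 180,000 events.
        digest = hashlib.md5(f"{index_name}:{i}".encode()).hexdigest()
        events.append(
            (ts.strftime("%Y-%m-%d %H:%M:%S %z"), rng.choice(COLORS), digest)
        )
    return events


if __name__ == "__main__":
    for name in ("union_1", "union_2", "union_3"):
        rows = make_events(name)
        print(name, len(rows), rows[0][0], rows[-1][0])
```

Getting the rows into three Splunk indexes is left out on purpose, since that depends on your own setup.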

Now let's run some SPL:

| union maxout=10000000
  [ search index=union_1 ]
  [ search index=union_2 ]
  [ search index=union_3 ]
| stats count by index

This produces what I'd expect-- 60,000 records per "thing that got
unioned":

index    count
-------  -----
union_1  60000
union_2  60000
union_3  60000

But let's make things a bit more complicated:

| union maxout=10000000
  [ search index=union_1 | head 60000 ]
  [ search index=union_2 ]
  [ search index=union_3 ]
| stats count by index

Wait, what? Adding a head command to the first search causes the
second and third to be truncated to 50000?

index    count
-------  -----
union_1  60000
union_2  50000
union_3  50000

How about this one?

| union maxout=10000000
  [ search index=union_1 ]
  [ search index=union_2 | head 60000 ]
  [ search index=union_3 ]
| stats count by index

Hmmm... same result:

index    count
-------  -----
union_1  60000
union_2  50000
union_3  50000

What if we move the head command to the final search?

| union maxout=10000000
  [ search index=union_1 ]
  [ search index=union_2 ]
  [ search index=union_3 | head 60000 ]
| stats count by index

Wow... now only the final search gets truncated:

index    count
-------  -----
union_1  60000
union_2  60000
union_3  50000

Notes that may or may not be relevant:

- Many commands have a similar effect (i.e. cause the same
  truncations) as head-- in particular dedup and sort seem to cause
  the same problems.
- I suspect that these commands (and presumably many others) cause
  the subsearch to no longer qualify as a "streaming subsearch"
  (although honestly I can't imagine why head would do this) and
  that this fact makes union behave much more like append.
- I believe (but am not sure) that the 50000 truncation limit is due
  to maxresultrows in limits.conf-- that value is currently 50000
  for me.

For context, here's what I want to do:

- In general, get a better understanding of how union works and how
  it's different from append.
- Specifically, union a set of three searches that each produce
  substantially more than 50000 records and not experience truncation.

Anybody willing to help me out with this? Would totally appreciate the
benefit of your wisdom 🙂

Thanks!

mattness · ‎11-09-2017

Hi jsinnott_

At this time, union behaves alternately like multisearch (for distributable streaming subsearches) or append (for subsearches that are not distributable streaming). This is not adequately explained in the doc topic for the union command at present and I'll see what I can do to fix that.

(For more information about the types of streaming search commands, see Command types in the Splunk Enterprise Search Manual.)

Let's take your first search:

| union maxout=10000000
  [ search index=union_1 ]
  [ search index=union_2 ]
  [ search index=union_3 ]
| stats count by index

In this case, all of the searches are distributable streaming, so they are all unioned with multisearch. This is why you see 60k in each.

Your second search uses the head command for one of the subsearches. Because head is centralized streaming rather than distributable streaming, it causes the subsearches that follow it to use the append command. "Under the hood," the search is converted to:

| search index=union_1
| head 60000
| append 
 [ search index=union_2 ]
| append
 [ search index=union_3 ]
| stats count by index

When union is used in conjunction with a search that is not distributable streaming, the default for the maxout argument applies: 50k events. This is mentioned in the doc topic for the union command.

Your third search also ends up being an append search, because the second subsearch is not distributable streaming due to the head command. Here's how it looks "under the hood":

| search index=union_1
| append 
 [ search index=union_2 | head 60000 ]
| append
 [ search index=union_3 ]
| stats count by index

Again, the maxout argument default applies here, limiting the results of the appended searches to 50k events.

In your last example, the first two subsearches are distributable streaming, so they are unioned with multisearch. But the final subsearch has the head command, so it gets unioned with append at the end.

| multisearch 
 [ search index=union_1 ]
 [ search index=union_2 ]
| append
 [ search index=union_3 | head 60000 ]
| stats count by index

The maxout argument applies to that last subsearch because it is not distributable streaming due to the head command. So it returns 50k events rather than 60k events.

Note that multisearch has to be the first command. If your union search unpacks in a way that puts append first, you won't get multisearch to follow it.
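As a rough illustration only, the rewrite rule described above can be modeled in a few lines of Python. This is a sketch of the behavior seen in this thread, not anything Splunk actually runs: the function names, the APPEND_CAP constant, and the per-subsearch "distributable streaming" flag are assumptions made for the example, and it has only been checked against the four searches discussed here.

```python
APPEND_CAP = 50_000  # cap observed in this thread for appended subsearches


def plan_union(subsearches):
    """Decide how each subsearch is combined.

    subsearches is a list of (name, is_distributable_streaming) pairs in
    the order they appear in the union command.
    """
    # Count the leading run of distributable streaming subsearches.
    lead = 0
    for _, streaming in subsearches:
        if not streaming:
            break
        lead += 1

    # At least one subsearch always becomes the base of the rewritten
    # search; two or more leading streaming ones are merged by multisearch.
    base = max(lead, 1)
    plan = []
    for i, (name, _) in enumerate(subsearches):
        if i >= base:
            how = "append"
        elif base > 1:
            how = "multisearch"
        else:
            how = "base search"
        plan.append((name, how))
    return plan


def expected_counts(subsearches, rows_per_search):
    """Rows expected per subsearch when only appended ones are capped."""
    return {
        name: min(rows_per_search, APPEND_CAP) if how == "append" else rows_per_search
        for name, how in plan_union(subsearches)
    }


if __name__ == "__main__":
    # The last example: head on the third subsearch only.
    searches = [("union_1", True), ("union_2", True), ("union_3", False)]
    print(plan_union(searches))
    print(expected_counts(searches, 60_000))
```

Passing False for a subsearch stands in for a head, dedup or sort inside it. Everything from the first append onward stays an append, which is why a head in the first or second subsearch caps every subsearch after it.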

Kindest regards,
Matt (Splunk Docs Team)

jsinnott_ · ‎11-10-2017

Hi Matt--

Thanks so much for taking time to write this clear and detailed explanation. It's exactly what I needed-- you're my new best friend!

..j

MuS · ‎11-08-2017

Hi jsinnott_,

Since union is just another subsearch, you will hit many limits with it. Some are mentioned here: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Union#Optional_arguments

In most cases you can just use stats to do the same and will not hit any limits. Read some examples here https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... or in the March 2016 Virtual .conf session here http://wiki.splunk.com/Virtual_.conf

Why union is truncating events from a second search after using more commands sounds weird and might be worth opening a bug report.

Hope this helps ...

cheers, MuS

jsinnott_ · ‎11-10-2017

Hello and thanks for this. I really appreciate you taking the time to answer.

..j
