Splunk Search

Join outer fails to return valid result even when the value exists in both searches

leonheart78
Explorer

Below is the search which I'm trying:

index=p_data sourcetype="p_sourcetype" | xmlkv | where EventId!="" | table source,EventId | join type=outer max=0 source [search index=p_data | xmlkv | where ChannelId!="" | table source,ChannelId] | table source,EventId,ChannelId | where EventId = "PG0019283"

The result return with the source and the EventId, but ChannelId is empty, which it should contain some value.

When I conduct the searches separately, they both contain the identical source value for joining.

Pls assist.
Thank you.

Tags (2)
0 Karma
1 Solution

acharlieh
Influencer

What if we did one search like so:

index=p_data (EventId OR ChannelId) | xmlkv | where (sourcetype="p_sourcetype" AND EventId!="") OR ChannelId!="" | stats values(EventId) as EventId values(ChannelId) as ChannelId by source

As xmlkv is your first step in both queries, I'm guessing that the text EventId or ChannelId will appear in your event, we then parse and apply a where to get all events and use stats instead of join to combine results together. As source is usually the name of the monitored file in Splunk, it seems strange to me that it would be a unique identifier for event and channel events, but it could be.

View solution in original post

acharlieh
Influencer

What if we did one search like so:

index=p_data (EventId OR ChannelId) | xmlkv | where (sourcetype="p_sourcetype" AND EventId!="") OR ChannelId!="" | stats values(EventId) as EventId values(ChannelId) as ChannelId by source

As xmlkv is your first step in both queries, I'm guessing that the text EventId or ChannelId will appear in your event, we then parse and apply a where to get all events and use stats instead of join to combine results together. As source is usually the name of the monitored file in Splunk, it seems strange to me that it would be a unique identifier for event and channel events, but it could be.

leonheart78
Explorer

index=p_data | rex field=_raw "language=\"eng\">\s+(?.*?)" | search index=p_data (EventId OR ChannelId OR Name) | xmlkv | where (sourcetype="p_sourcetype" AND EventId!="") OR ChannelId!="" | stats values(EventId) as EventId values(Name) as Name values(ChannelId) as ChannelId by source

The result came out as below

SOURCE                   EventId                 Name                     ChannelId
source1                    PG0019283          Winter                    174
                           PG0018294          Summer
                           PG0037926          Spring
source2                    PG0073947         Spiderman             186
                           PG0089362         Superman
source3                    PG0045281         Apple                       902
source4                    PG0019283         Winter                     174
                           PG0037926         Spring

I need to flatten them to look like below:

SOURCE                   EventId                 Name                     ChannelId
source1                    PG0019283          Winter                    174
source1                    PG0018294          Summer                 174
source1                    PG0037926          Spring                     174
source2                    PG0073947         Spiderman             186
source2                    PG0089362         Superman              186
source3                    PG0045281         Apple                       902
source4                    PG0019283         Winter                     174
source4                    PG0037926         Spring                      174

Thanks

0 Karma

acharlieh
Influencer

In the original events, what is the relationship between EventId and Name and ChannelId? Are they all part of a single event or are they separate events? The problem with stats values(field) is that order is not preserved, just uniqueness of values, so there is not necessarily a correlation between the first EventId and the first Name. It seems like ChannelId correlates loosely to source, but that could also be a problem of sample size. There are other statistical functions that could be used like list, but without knowing the makeup of your source data it's hard to make recommendations. If all are part of the same event, then stats could go away and we could just use table.

Other side note, the contents of your search just before the xmlkv should likely be part of your base search as it seems like you're dealing only with tokens that would exist in the raw event, so why extract things from disk only to filter them 2 steps later? Additionally I'm not sure if it's the forum stripping out important characters or not but it's possible that your rex should actually be a regex ?

0 Karma

landen99
Motivator

The search makes it clear that EventId is only in sourcetype="p_sourcetype", whereas ChannelId may exist in any sourcetype of the same index. This implies that these two fields exist in separate events from separate sourcetypes. "table" will not work, else join would have been unnecessary. "name" may exist in the same events as "ChannelId" and a technique of combining and separating may keep them together, if that is the case.

index=p_data (EventId OR ChannelId) | xmlkv | where (sourcetype="p_sourcetype" AND EventId!="") OR ChannelId!="" | eval channel_name=ChannelId." ".Name | stats values(EventId) as EventId values(channel_name) as channel_name by source | rex field=channel_name "(?<ChannelId>\S+)\s+(?<Name>.+)"
0 Karma

landen99
Motivator

leonheart78, the forum maybe really messing up your regex so that it looks really wierd like it should not work. Try a blank line before it and four spaces.

Second, to split up the EventId-ChannelId pairs, you should try:

| mvexpand ChannelId

after the stats command.

0 Karma

leonheart78
Explorer

Hi,

Thanks for your suggestion, it seems to have worked, but the issues is that within a ChannelId, it contains multiple EventId, how can I expand it so that each EventId is tagged with a ChannelId so to output to csv?

0 Karma

acharlieh
Influencer

Could you explain more and/or provide some sample (anonymized) events? Do you mean you have a single event that has a ChannelId and multiple EventIds in it? Or is it just how my search does things?

The simple answer might be you want to just pipe to an mvexpand EventId command, but there might be an even more optimal search out there depending on what your data actually looks like.

0 Karma

landen99
Motivator

Excellent approach. This will not only increase the speed and efficiency but also improve your ability to see variances in the data formatting, if any.

0 Karma

leonheart78
Explorer

Hi, your suggestion works! However, the each ChannelId now is tied with multiple EventId. How can I tag the ChannelId to each EventId?

0 Karma

landen99
Motivator

Join uses a subsearch, which comes with time and event limitations. 90s and 50k events max. The time limit is configurable. 50k events is an absolute max. You may also want to look at multi-value and null values. Check value formatting, including extra spaces, non-printable characters, etc. I never use max=0, so that might be causing problems. If you are expecting max=0 to remove the 50k limitation, it will not. .. I'll have to look into "max" more later, but it seems to allow search line customization of any max event limitations.

Instead, try:

index=p_data (EventId OR ChannelId) | xmlkv | where (sourcetype="p_sourcetype" AND EventId!="") OR ChannelId!="" | stats values(EventId) as EventId values(ChannelId) as ChannelId by source | mvexpand ChannelId
0 Karma
Get Updates on the Splunk Community!

Developer Spotlight with Paul Stout

Welcome to our very first developer spotlight release series where we'll feature some awesome Splunk ...

State of Splunk Careers 2024: Maximizing Career Outcomes and the Continued Value of ...

For the past four years, Splunk has partnered with Enterprise Strategy Group to conduct a survey that gauges ...

Data-Driven Success: Splunk & Financial Services

Splunk streamlines the process of extracting insights from large volumes of data. In this fast-paced world, ...