Splunk Search

Another Join Question - Searches With Matching sourcetype but Different index

ArchieCrozier
Path Finder

I have read through almost every Join label topic on the Splunk Community page and I don't seem to see one that fits my problem.  If there is one that works for this issue, please simply direct me to the correct discussion.

The closest discussion that looks like what I am shooting for is: How to join two searches on a common field where the value of the left search matches all values of ...  But this discussion doesn't have a solution.  And I've been through the docs.splunk.com pages reviewing the subsearch, append, appendcols, join and selfjoin.

The two searches I would like to join are:

Search 1:
index="_internal" source="*metrics.log" per_index_thruput series=autoshell host=lelsplunkix* | eval GB=kb/(1024*1024) | timechart span=12h sum(GB) as GB by series
Results: (example - 500k+ rows returned)
_time                                               _raw   sourcetype      GB
2020-08-18 07:04:33.307     ABC     ship                     0.0000264551490559
2020-08-18 07:04:31.168     LMN    rum                      0.0000000828877091
2020-08-18 07:04:24.174     XYZ     jacksparrow     0.0000000940635800
IMPORTANT: The index of all of these is "_internal", not the actual index that the source data comes from.

Search 2:
| tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table sourcetype, index
Results: (example - roughly 86 rows returned)
sourcetype     index
ship                    BlackPearl
crew                  BlackPearl
rum                    Tortuga
wench              Tortuga
willturner        Swachbuckler
jacksparrow  Swashbuckler

I want to join these results to make a single table of:
_time                                               _raw   sourcetype      index                        GB
2020-08-18 07:04:33.307     ABC     ship                     BlackPearl             0.0000264551490559
2020-08-18 07:04:31.168     LMN    rum                      Tortuga                   0.0000000828877091
2020-08-18 07:04:24.174     XYZ     jacksparrow     Swachbuckler     0.0000000940635800

I tried to use append and it just adds the additional sourcetype/index rows below the actual results (not as a new column).  I tried to use appendcols and the number of rows between the first search and the second search don't match, so only the first handful of rows get an index and the index doesn't match up with the sourcetype.  I tried to use join with the max=0 and type=inner and it only returned a handful of rows (less than 1000) and only for a few of the index/sourcetype combinations.  I even just tried to use the second search as a subsearch of the first search to limit the sourcetypes to ONLY the ones returned in the tstats search... which I think worked, but still didn't tell me which index applied to each sourcetype.  I can run the two separately, extract the data into excel and do a vlookup to get the results I want, but I need this to be in the report/search.  Help me!  I'm drowning.

Be gentle, this is my first discussion topic.  Hope this is enough information to clearly understand the problem.

Labels (1)
Tags (3)
0 Karma

ArchieCrozier
Path Finder

OK, so I figured out what was going on and I'd like to explain.  Then someone can correct me or tell me that I have a defective Splunk or something like that:

I was trying to make an ingestion metric by sourcetype. I know some of you will tell me that the metric is already provided by Splunk, but you do not know my environment. I work in Lehi, Utah and those metrics are collected and stored in Boise, Idaho (where I do not have the access to query).  However, the local indexers and heavy forwarders still have the sourcetype=splunkd, group=per_sourcetype_thruput, and source=/opt/splunk/var/log/splunk/metrics.log .  These logs track ingestion by sourcetype and they track it on a fairly frequent nature, so I could actually perform timecharts and other reporting visualizations over time. So, that was the background of what I was trying to accomplish.

In the past 15 minutes, for each sourcetype this could give me possible events/records of anywhere from 1 to 500. So, I took this data and broke it out to give me columns of sourcetype as well as the other metrics (kbps, eps, kb, etc.). Keep in mind that a single sourcetype would appear in my table thousands of times depending on the time frame I pull.

Then I ran a simple query to return a unique table of sourcetype/index combinations (dedup).  Each index/sourcetype combo appeared in the table only once. On average, I got between 75 and 150 results in this table.

When I joined these two tables, the first 75-150 rows in the sourcetype data table gained an index value. The join NEVER went farther in the sourcetype table than the number of records I had in the sourcetype/index table. In my outrageously optomistic mind, I believed that each row in the sourcetype data table would be evaluated against the sourcetype/index table (the way a normal SQL join works - the sane way a join works).  The only way I could get this to work was to create a MASSIVE sourcetype/index table that had excessively more rows in it than the sourcetype/index table, so that each row could be evaluated.  But this added too much overhead to the search and I quickly gave up on the effort.

I have tested this many times in our environment and I get the same result over-and-over again, so no one can tell me this is not the way it works using the type of syntax I outlined in my original post. The site I work at is running the 7.3.3 version of Splunk Enterprise.

I appreciate responses from @thambisetty and @to4kawa.  Both of you provided new ideas I had to go try and I did learn more from each experiment. Just wanted to send both of you a solid thank you for supporting me.

0 Karma

to4kawa
Ultra Champion

Once, use outputcsv or outputlookup to create a table of sourcetypes,

and It is better to join them together with lookup, I guess.

I think metadata is the fastest way to find out which sourcetype is in the index.

0 Karma

thambisetty
SplunkTrust
SplunkTrust

I don’t really understand how you are getting sourcetype field from Your first search.
I am giving you spl to join based on the results you posted.

sourcetype field should be there in both search results.

index="_internal" source="*metrics.log" per_index_thruput series=autoshell host=lelsplunkix* | eval GB=kb/(1024*1024) | timechart span=12h sum(GB) as GB by series

| join type=left sourcetype 

[| tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table sourcetype, index]

————————————
If this helps, give a like below.
0 Karma

ArchieCrozier
Path Finder

@thambisetty , You are 100% correct.  I screwed up (even after reading it so carefully).  The first search should have been this (need to watch what I copy/paste more carefully):

index=_internal source=*metrics.log group=per_sourcetype_thruput host=lelsplunkix* | rename series as sourcetype | eval GB=kb/(1024*1024) | table _time, _raw, sourcetype, GB

0 Karma

thambisetty
SplunkTrust
SplunkTrust

Alright. Then you can replace first search in my previous answer with your new search and it should work.

if it solves your problem, thumbs up is really appreciated😊

————————————
If this helps, give a like below.

ArchieCrozier
Path Finder

@thambisetty & @to4kawa, I appreciate your responses, but neither of those methods work.

@thambisetty, Your method gets me the closest, but I tried this same method before I submitted this help message ("I tried to use join with the max=0 and type=inner and it only returned a handful of rows and only for a few of the index/sourcetype combinations.").  Using your "join type=left" search suggestion resulted in hundreds of sourcetypes I did not want because they were not part of the type=inner join that I was trying to achieve.

@to4kawa, Your solution returned only 5 rows (when it should have returned a minimum of 45,000 rows.  And it was also missing most of the index/sourcetype combinations.

Both options presented here give me at most 5 combinations of index/sourcetype in the results.  Neither one provides the full list I was trying to achieve.

I've simplified the search to try and make more sense out of this and drastically reduce the number of rows I am returning into the result set:

Search 1: (returns less than 500 rows)
index=_internal source=*metrics.log group=per_sourcetype_thruput earliest=-5d@d latest=-0d@d host=lelsplunki* | rename series as sourcetype | eval GB=kb/(1024*1024) | bucket _time span=1d | stats count by _time, sourcetype

This returns 67 of sourcetypes I do not want.  I want only a set of sourcetypes that match my index types (~40 sourcetypes).

_timesourcetypecount
2020-08-14T00:00:00.000-0600ship3
2020-08-14T00:00:00.000-0600rum583
2020-08-14T00:00:00.000-0600jacksparrow623
2020-08-14T00:00:00.000-0600crew38
2020-08-14T00:00:00.000-0600splunkd11136
2020-08-14T00:00:00.000-0600splunkd_access1278

 

Charting this data with " | chart sum(GB) as TotGB by _time, sourcetype useother=false usenull=false limit=100" instead of the stats command gives results like this:

_timeshiprumjacksparrowcrewsplunkdsplunkd_access
2020-08-14T00:00:00.000-06003.01609E-050.1051905520.0079764880.02893366130.111213830.577985538
2020-08-15T00:00:00.000-06002.77823E-050.081466029.0844E-050.03256497329.982005070.536031151
2020-08-16T00:00:00.000-06004.0628E-050.0857742858.4443E-060.04478031929.957578680.540789116
2020-08-17T00:00:00.000-06003.3265E-050.105401807 0.04280171129.971040580.515818447
2020-08-18T00:00:00.000-06003.45539E-050.099920090.0001685510.03091429932.728555970.816412885


The index for all of these results is _internal, so I want to tie the sourcetype back to its actual index.  In my previous example I was using:

Search 2: (returns less than 100 rows)
| tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table sourcetype, index
Results: (example - roughly 86 rows returned)
sourcetype     index
ship                    BlackPearl
crew                  BlackPearl
rum                    Tortuga
wench              Tortuga
willturner        Swachbuckler
jacksparrow  Swashbuckler

Now that the data has been reduced from 500,000 events to the summary of the stats, I'm hoping that there is a way to join the two small result sets to make a single set of data that has the _time bucket, sourcetype, index, and Sum of the Total GB for each sourcetype over the time bucket window.

I'm going to try your solutions again with this new data set, but I don't know who flagged the reply provided by @thambisetty as a solution... it simply does not work.  Sorry.

0 Karma

ArchieCrozier
Path Finder

OK, here is my last update, I just don't think Splunk is capable of doing a join they way I expect a join to work:

Search 1: (designed to get back all of the index/sourcetype combinations I care about, takes less than a second to run, returned back 85 rows, columns of index and sourcetype)
| tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table sourcetype, index

Search 2: (designed to pull back the license usage by sourcetype from the Splunk internal metrics logs)
index=_internal source=*metrics.log group=per_sourcetype_thruput earliest=-5d@d latest=-0d@d host=lelsplunki* | rename series as sourcetype | eval GB=kb/(1024*1024) | bucket _time span=1d | stats sum(GB) as TotGB by _time, sourcetype, index

Ran search number 2 once for everything and got back 450 results.  Added "(series=crew OR series=jacksparrow)" to the base search to see how many should be returned and got back 169.  This is the 169 results I want back from this search.

Used both of the following searches to try the join option:

Search 3:
index=_internal source=*metrics.log group=per_sourcetype_thruput earliest=-5d@d latest=-0d@d host=lelsplunki* | rename series as sourcetype | eval GB=kb/(1024*1024) | bucket _time span=1d | join max=0 type=inner sourcetype [ | tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table index, sourcetype ] | stats sum(GB) as TotGB by _time, sourcetype, index

Search 4:
index=_internal source=*metrics.log group=per_sourcetype_thruput earliest=-5d@d latest=-0d@d host=lelsplunki* | rename series as sourcetype | eval GB=kb/(1024*1024) | bucket _time span=1d | stats sum(GB) as TotGB by _time, sourcetype | join max=0 type=inner sourcetype [ | tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table index, sourcetype ]

Both of those return only 16 rows and only ~5 combinations of index/sourcetype.  Append, appendcols has also been tried, but I am willing to try anything else you may suggest (even unique variations of append).

0 Karma

to4kawa
Ultra Champion

index=_internal source=*metrics.log group=per_sourcetype_thruput host=lelsplunkix* | rename series as sourcetype | eval GB=kb/(1024*1024) | table _time, _raw, sourcetype, GB
| append [| tstats count where (index=BlackPearl OR index=Tortuga OR index=Swashbuckler) by index, sourcetype | table sourcetype, index]
| selfjoin sourcetype

Get Updates on the Splunk Community!

Index This | I am a number, but when you add ‘G’ to me, I go away. What number am I?

March 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...

What’s New in Splunk App for PCI Compliance 5.3.1?

The Splunk App for PCI Compliance allows customers to extend the power of their existing Splunk solution with ...

Extending Observability Content to Splunk Cloud

Register to join us !   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to ...