Splunk Search

Multiple sourcetypes combine datasets similar to concept Index-to-Match

TangSauce
Engager

Hello All,

I have been scouring the community and other boards but for the life of me cannot create a SPL query to get the results I need. Below is what I am trying to accomplish, any direction and help would be greatly appreciated.

I have 2 sourcetypes and 1 lookup table.

  • SourceA has fields (ID_a, name, date, trimmed_name, env, etc..)
  • SourceB has fields (ID, title_id, title, description solution, etc...)
  • Lookup has fields (trimmed_name, department, etc...)
  • SourceA gets updated weekly, therefore in the existing query I have earliest=-7d to exclude previous data
  • Characteristics of SourceA: name is the only unique field, ID_a will be duplicated across name depending on date field, env is duplicated across the dataset, trimmed_name is a field created from trimming name
  • SourceA ID_a and SourceB ID are the common field between the two datasets
  • Characteristics of SourceB: there are duplicates of ID, title_id, title, description, solution. When deduping title_id, values of title, description, solution, etc... are dedup'd and become unique. This is why I called SourceB a knowledge base. We can dedup SourceB down to the fields above to get a finite list.
  • SourceB will have multiple values of ID. But we only need to return fields\values where the dedup'd SourceA ID_a exists. This means from 100 events in SourceA, dedup'd by ID_a, results in 2 ID_a. These 2 ID_a values is what we to find in Source B and return title_id, title, description solution values for.
  • At a minimum, final results needed per source are: 
    • SourceA: name, date, env
    • SourceB: title_id, title, description, solution (other fields can be omitted, assuming we can add specific fields back as needed)
    • Lookup: department (other fields can be omitted, assuming we can add specific fields back as needed)

SourceA:

ID_anametrimmed_namedate
ABCABC_nAB01/01/2023
ABCABC_n1AB01/02/2023
ABCABC_n2AB01/03/2023
XYZXYZ_nXY02/01/2023
XYZXYZ_n1XY02/02/2023

 

SourceB:

IDtitle_idtitledescriptionsolution
ABC12345ABC_tABC_dABC_s
ABC23456ABC_t1ABC_d1ABC_s1
ABC93648ABC_t2ABC_d2ABC_s2
XYZ23456XYZ_tXYZ_dXYZ_s
XYZ38840XYZ_t1XYZ_d1XYZ_s1
MNO43245MNO_tMNO_dMNO_s
MNO36485MNO_t1MNO_d1MNO_s1
RST84678RST_tRST_dRST_s

 

Lookup Table:

trimmed_namedepartment
ABABC_dep
XYXYZ_dep
MNMNO_dep

 

Intended Result:

sourceA.namesourceB.title_idsourceB.titlesourceB.descriptionsourceB.solutionsourceA.datelookup.department
ABC_n12345ABC_tABC_dABC_s01/01/2023ABC_dep
ABC_n23456ABC_t1ABC_d1ABC_s101/01/2023ABC_dep
ABC_n93648ABC_t2ABC_d2ABC_s201/01/2023ABC_dep
ABC_n112345ABC_tABC_dABC_s01/02/2023ABC_dep
ABC_n123456ABC_t1ABC_d1ABC_s101/02/2023ABC_dep
ABC_n193648ABC_t2ABC_d2ABC_s201/02/2023ABC_dep
ABC_n212345ABC_tABC_dABC_s01/03/2023ABC_dep
ABC_n223456ABC_t1ABC_d1ABC_s101/03/2023ABC_dep
ABC_n293648ABC_t2ABC_d2ABC_s201/03/2023ABC_dep
XYZ_n23456XYZ_tXYZ_dXYZ_s02/01/2023XYZ_dep
XYZ_n38840XYZ_t1XYZ_d1XYZ_s102/01/2023XYZ_dep
XYZ_n23456XYZ_tXYZ_dXYZ_s02/02/2023XYZ_dep
XYZ_n38840XYZ_t1XYZ_d1XYZ_s102/02/2023XYZ_dep

Query, I have been playing with... From what I gather, I have events for the query but nothing displays when I try to get data from SourceB and SourceA at the same time.

 

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| rename ID_a AS ID
| fields ID, name, title, description, solution, date, env
| stats count values(*) AS * values(date) values(title) values(env) BY name
| where count=1

 

Labels (3)
0 Karma

ITWhisperer
SplunkTrust
SplunkTrust
| eval ID=coalesce(ID, ID_a)
| fields ID, name, title_id, title, description, solution, date, env
| stats list(*) AS * BY ID
| where env="envA"
| eval repA=mvrange(0,mvcount(name))
| mvexpand repA
| foreach name date env
    [| eval <<FIELD>>=mvindex(<<FIELD>>, repA)]
| eval repB=mvrange(0,mvcount(title_id))
| mvexpand repB
| foreach title_id title description solution
    [| eval <<FIELD>>=mvindex(<<FIELD>>, repB)]
0 Karma

yuanliu
SplunkTrust
SplunkTrust

Problems in your first search:

  1. When you rename ID_a as ID, you also erase any value of ID that comes from sourceB.  You need to pull value discretely.
  2. Field name only exists in SourceA.  The stats command's groupby will exclude any data from SourceB.
  3. The last filter will further exclude any "name" that appear in more than one event in SourceA, which is very unlikely.

Based on your description, you want to

  • pull all fields of interest in SourceA;
  • add fields of interest in SourceB if ID_a in the former matches ID in the latter; and
  • pull additional fields from lookup if trimmed_name in SourceA is found in the lookup.

Combining these, the search should be something like

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| eval ID = if(sourcetype == "SourceA", ID_a, ID) ``` use ID as universal field name ```
| fields ID, name, title, description, solution, date, env
| stats count values(*) AS * BY ID ``` not by name ```
| lookup lookuptable.csv trimmed_name ``` match trimmed_name ```

Here, I assume that you didn't intend to find singular ID.  But if you need any real restriction on count, you can add it.

Tags (1)
0 Karma

TangSauce
Engager

Based on your description, you want to

  • pull all fields of interest in SourceA;
  • add fields of interest in SourceB if ID_a in the former matches ID in the latter; and
  • pull additional fields from lookup if trimmed_name in SourceA is found in the lookup.

The above is accurate. However, the sample query you provide is still returning data from SourceB only.

Apologizes to all, but I may not be describing the problem and expected results inaccurately. Appreciate all the help so far nonetheless.

I updated my original post to provide more clarity.

0 Karma

yuanliu
SplunkTrust
SplunkTrust

I forgot to filter out data contributed from SourceB that had no match in SourceA.

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| eval ID = if(sourcetype == "SourceA", ID_a, ID) ``` use ID as universal field name ```
| fields ID, name, title, description, solution, date, env, sourcetype
| stats count values(*) AS * BY ID ``` not by name ```
| where sourcetype == "SourceA" ``` side effect of SPL's multivalue equality ```
| lookup lookuptable.csv trimmed_name ``` match trimmed_name ```

The where filter should be mvcount(sourcetype) > 1 OR sourcetype == "SourceA"; but SPL allows equality operator to match any one of multiple values if one side of the operator is a single value.

0 Karma

Tom_Lundie
Contributor

It looks like you're thinking along the right track.

The common property between sourceA and sourceB is "ID" so that is what we should group by.

Try the following:

 

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| eval drop=if(sourcetype="SourceA", 1, null())
| eval ID = coalesce(ID, ID_a)
| table ID, name, title, description, solution, date, env
| eventstats latest(name) as name, latest(trimmed_name) as trimmed_name, latest(date) as date by ID
| where isnull(drop)
| fields - drop
| lookup lookuptable.csv trimmed_name

 

 
Essentially, we're using sourceB as our main dataset, extracting what we need from sourceA and then dropping the sourceA events from the search results. Then finally it looks like we need to lookup using trimmed_name from your example data.

0 Karma

TangSauce
Engager

Thanks for the quick reply! Unfortunately, looks like the query is still only returning data from one source.  I played with the query to simplify to just returning 2 data points from SourceA and SourceB, but was unable get it to work.

  • The original query shows no results
  • Removing the following starts to show results 
    | where isnotnull(drop)
    | fields - drop

Here is an example of what is being returned.

sourceA.namesourceB.title_idsourceB.titlesourceB.descriptionsourceB.solutionsourceA.datelookup.department
 12345ABC_tABC_dABC_s  
 23456ABC_t1ABC_d1ABC_s1  
 93648ABC_t2ABC_d2ABC_s2  
 23456XYZ_tXYZ_dXYZ_s  
Tags (1)
0 Karma

Tom_Lundie
Contributor

Hi TangSauce,

You're welcome but apologies, I had a typo in my example search, that should have been:

| where isnull(drop)
| fields - drop

I have updated my original post to reflect that.

That should be working now, but your search results are not looking promising...

I've just noticed that you have a date parameter for sourceA are you trying to search different timeranges between the two sources?

 

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)

 

If date field is being extracted as the _time field for SourceA, Splunk can do the heavy lifting for your filtering:

 

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" latest="02/14/2023:00:00:00" env=envA)

 

Otherwise you'll need to manually filter the dates out yourself:

 

index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" env=envA)
| where sourcetype="SourceB" OR (strptime(date, "%Y-%m-%d") > strptime("2023-02-14", "%Y-%m-%d"))

 

Either way, focus on writing a search that gets the correct data that you require from SourceA and SourceB together. Make sure that data from both SourceA and SourceB are present with the first section of the search.

Then you can use the rest of the search from my original answer to achieve the overall goal.

 

| eval drop=if(sourcetype="SourceA", 1, null())
| eval ID = coalesce(ID, ID_a)
| table ID, name, title, description, solution, date, env
| eventstats latest(name) as name, latest(trimmed_name) as trimmed_name, latest(date) as date by ID
| where isnull(drop)
| fields - drop
| lookup lookuptable.csv trimmed_name

 

TangSauce
Engager

Thanks again for the help. Still struggling and trying different ways to get the search to return or at least include the dataset and fields I need. Again, running into only being able to show values from SourceB but not SourceA. The date field in SourceA does not equal _time, but is specific to when the name field was changed (1-1 relationship).

I assume I would need to confirm the returned results are from both SourceA and SourceB via Events (~7.9k events). I have been able to play with python to achieve the goal but, not everyone in my org will be able access this data via a spreadsheet.

0 Karma
Get Updates on the Splunk Community!

Exporting Splunk Apps

Join us on Monday, October 21 at 11 am PT | 2 pm ET!With the app export functionality, app developers and ...

[Coming Soon] Splunk Observability Cloud - Enhanced navigation with a modern look and ...

We are excited to introduce our enhanced UI that brings together AppDynamics and Splunk Observability. This is ...

Splunk Smartness with Patrick Tatro | Episode 4

Welcome to another episode of "Splunk Smartness," where we explore how Splunk Education can revolutionize your ...