Hello All,
I have been scouring the community and other boards but for the life of me cannot create a SPL query to get the results I need. Below is what I am trying to accomplish, any direction and help would be greatly appreciated.
I have 2 sourcetypes and 1 lookup table.
SourceA:
ID_a | name | trimmed_name | date |
ABC | ABC_n | AB | 01/01/2023 |
ABC | ABC_n1 | AB | 01/02/2023 |
ABC | ABC_n2 | AB | 01/03/2023 |
XYZ | XYZ_n | XY | 02/01/2023 |
XYZ | XYZ_n1 | XY | 02/02/2023 |
SourceB:
ID | title_id | title | description | solution |
ABC | 12345 | ABC_t | ABC_d | ABC_s |
ABC | 23456 | ABC_t1 | ABC_d1 | ABC_s1 |
ABC | 93648 | ABC_t2 | ABC_d2 | ABC_s2 |
XYZ | 23456 | XYZ_t | XYZ_d | XYZ_s |
XYZ | 38840 | XYZ_t1 | XYZ_d1 | XYZ_s1 |
MNO | 43245 | MNO_t | MNO_d | MNO_s |
MNO | 36485 | MNO_t1 | MNO_d1 | MNO_s1 |
RST | 84678 | RST_t | RST_d | RST_s |
Lookup Table:
trimmed_name | department |
AB | ABC_dep |
XY | XYZ_dep |
MN | MNO_dep |
Intended Result:
sourceA.name | sourceB.title_id | sourceB.title | sourceB.description | sourceB.solution | sourceA.date | lookup.department |
ABC_n | 12345 | ABC_t | ABC_d | ABC_s | 01/01/2023 | ABC_dep |
ABC_n | 23456 | ABC_t1 | ABC_d1 | ABC_s1 | 01/01/2023 | ABC_dep |
ABC_n | 93648 | ABC_t2 | ABC_d2 | ABC_s2 | 01/01/2023 | ABC_dep |
ABC_n1 | 12345 | ABC_t | ABC_d | ABC_s | 01/02/2023 | ABC_dep |
ABC_n1 | 23456 | ABC_t1 | ABC_d1 | ABC_s1 | 01/02/2023 | ABC_dep |
ABC_n1 | 93648 | ABC_t2 | ABC_d2 | ABC_s2 | 01/02/2023 | ABC_dep |
ABC_n2 | 12345 | ABC_t | ABC_d | ABC_s | 01/03/2023 | ABC_dep |
ABC_n2 | 23456 | ABC_t1 | ABC_d1 | ABC_s1 | 01/03/2023 | ABC_dep |
ABC_n2 | 93648 | ABC_t2 | ABC_d2 | ABC_s2 | 01/03/2023 | ABC_dep |
XYZ_n | 23456 | XYZ_t | XYZ_d | XYZ_s | 02/01/2023 | XYZ_dep |
XYZ_n | 38840 | XYZ_t1 | XYZ_d1 | XYZ_s1 | 02/01/2023 | XYZ_dep |
XYZ_n | 23456 | XYZ_t | XYZ_d | XYZ_s | 02/02/2023 | XYZ_dep |
XYZ_n | 38840 | XYZ_t1 | XYZ_d1 | XYZ_s1 | 02/02/2023 | XYZ_dep |
Query, I have been playing with... From what I gather, I have events for the query but nothing displays when I try to get data from SourceB and SourceA at the same time.
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| rename ID_a AS ID
| fields ID, name, title, description, solution, date, env
| stats count values(*) AS * values(date) values(title) values(env) BY name
| where count=1
| eval ID=coalesce(ID, ID_a)
| fields ID, name, title_id, title, description, solution, date, env
| stats list(*) AS * BY ID
| where env="envA"
| eval repA=mvrange(0,mvcount(name))
| mvexpand repA
| foreach name date env
[| eval <<FIELD>>=mvindex(<<FIELD>>, repA)]
| eval repB=mvrange(0,mvcount(title_id))
| mvexpand repB
| foreach title_id title description solution
[| eval <<FIELD>>=mvindex(<<FIELD>>, repB)]
Problems in your first search:
Based on your description, you want to
Combining these, the search should be something like
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| eval ID = if(sourcetype == "SourceA", ID_a, ID) ``` use ID as universal field name ```
| fields ID, name, title, description, solution, date, env
| stats count values(*) AS * BY ID ``` not by name ```
| lookup lookuptable.csv trimmed_name ``` match trimmed_name ```
Here, I assume that you didn't intend to find singular ID. But if you need any real restriction on count, you can add it.
Based on your description, you want to
- pull all fields of interest in SourceA;
- add fields of interest in SourceB if ID_a in the former matches ID in the latter; and
- pull additional fields from lookup if trimmed_name in SourceA is found in the lookup.
The above is accurate. However, the sample query you provide is still returning data from SourceB only.
Apologizes to all, but I may not be describing the problem and expected results inaccurately. Appreciate all the help so far nonetheless.
I updated my original post to provide more clarity.
I forgot to filter out data contributed from SourceB that had no match in SourceA.
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| eval ID = if(sourcetype == "SourceA", ID_a, ID) ``` use ID as universal field name ```
| fields ID, name, title, description, solution, date, env, sourcetype
| stats count values(*) AS * BY ID ``` not by name ```
| where sourcetype == "SourceA" ``` side effect of SPL's multivalue equality ```
| lookup lookuptable.csv trimmed_name ``` match trimmed_name ```
The where filter should be mvcount(sourcetype) > 1 OR sourcetype == "SourceA"; but SPL allows equality operator to match any one of multiple values if one side of the operator is a single value.
It looks like you're thinking along the right track.
The common property between sourceA and sourceB is "ID" so that is what we should group by.
Try the following:
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
| eval drop=if(sourcetype="SourceA", 1, null())
| eval ID = coalesce(ID, ID_a)
| table ID, name, title, description, solution, date, env
| eventstats latest(name) as name, latest(trimmed_name) as trimmed_name, latest(date) as date by ID
| where isnull(drop)
| fields - drop
| lookup lookuptable.csv trimmed_name
Essentially, we're using sourceB as our main dataset, extracting what we need from sourceA and then dropping the sourceA events from the search results. Then finally it looks like we need to lookup using trimmed_name from your example data.
Thanks for the quick reply! Unfortunately, looks like the query is still only returning data from one source. I played with the query to simplify to just returning 2 data points from SourceA and SourceB, but was unable get it to work.
| where isnotnull(drop) | fields - drop
Here is an example of what is being returned.
sourceA.name | sourceB.title_id | sourceB.title | sourceB.description | sourceB.solution | sourceA.date | lookup.department |
12345 | ABC_t | ABC_d | ABC_s | |||
23456 | ABC_t1 | ABC_d1 | ABC_s1 | |||
93648 | ABC_t2 | ABC_d2 | ABC_s2 | |||
23456 | XYZ_t | XYZ_d | XYZ_s |
Hi TangSauce,
You're welcome but apologies, I had a typo in my example search, that should have been:
| where isnull(drop) | fields - drop
I have updated my original post to reflect that.
That should be working now, but your search results are not looking promising...
I've just noticed that you have a date parameter for sourceA are you trying to search different timeranges between the two sources?
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" date<2023-02-14 env=envA)
If date field is being extracted as the _time field for SourceA, Splunk can do the heavy lifting for your filtering:
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" latest="02/14/2023:00:00:00" env=envA)
Otherwise you'll need to manually filter the dates out yourself:
index="myindex" (sourcetype="SourceB" type=INFO) OR (sourcetype="SourceA" env=envA)
| where sourcetype="SourceB" OR (strptime(date, "%Y-%m-%d") > strptime("2023-02-14", "%Y-%m-%d"))
Either way, focus on writing a search that gets the correct data that you require from SourceA and SourceB together. Make sure that data from both SourceA and SourceB are present with the first section of the search.
Then you can use the rest of the search from my original answer to achieve the overall goal.
| eval drop=if(sourcetype="SourceA", 1, null())
| eval ID = coalesce(ID, ID_a)
| table ID, name, title, description, solution, date, env
| eventstats latest(name) as name, latest(trimmed_name) as trimmed_name, latest(date) as date by ID
| where isnull(drop)
| fields - drop
| lookup lookuptable.csv trimmed_name
Thanks again for the help. Still struggling and trying different ways to get the search to return or at least include the dataset and fields I need. Again, running into only being able to show values from SourceB but not SourceA. The date field in SourceA does not equal _time, but is specific to when the name field was changed (1-1 relationship).
I assume I would need to confirm the returned results are from both SourceA and SourceB via Events (~7.9k events). I have been able to play with python to achieve the goal but, not everyone in my org will be able access this data via a spreadsheet.