Matching with a lookup

DaveBunn
Path Finder

I'm trying to set up a regular search to check all our GitHub packages against the latest Shai Hulud npm packages.

Within "SBOM.Packages{}" I'm trying to validate each pair of fields, SBOM.Packages{}.name and SBOM.Packages{}.versionInfo, against a lookup table containing all the Shai Hulud compromised packages.

I started with
"index=github [|inputlookup shai-hulud.csv | table SBOM.Packages{}.name,  SBOM.Packages{}.versionInfo]"

This code works for package names within the first few thousand characters of the event (probably 10,000 characters, knowing Splunk), but it does not reliably find package names that sit a hundred or so packages in.

I've been trying to get an spath command running through a foreach loop but just can't get the loop to work.

So, the question:

Does anyone already have a piece of SPL that checks npm packages against a lookup list?

OR

Does anyone have an inkling of how to iterate through a few hundred SBOM.Packages{} entries and compare them to the current list of 1500 compromised npm name/version variants?


tscroggins
Influencer

Hi @DaveBunn,

Let's start with a publicly available list of compromised packages: https://github.com/wiz-sec-public/wiz-research-iocs/blob/main/reports/shai-hulud-2-packages.csv. The CSV file contains Package and Version fields that we'll correlate to SBOM.Packages objects:

Package,Version
02-echo,= 0.0.7
@accordproject/concerto-analysis,= 3.24.1
@accordproject/concerto-linter,= 3.24.1
@accordproject/concerto-linter-default-ruleset,= 3.24.1
@accordproject/concerto-metamodel,= 3.12.5
...

Note: I'm not affiliated with Wiz, Inc. We're all about Splunk here, but I don't see anything on https://research.splunk.com/ except for an attack range dataset.

Let's also start with three small test cases, two positive and one negative:

{"SBOM":{"Packages":[{"name":"@accordproject/concerto-linter","versionInfo":"3.24.1"}]}}
{"SBOM":{"Packages":[{"name":"lodash","versionInfo":"4.17.21"},{"name":"@accordproject/concerto-linter","versionInfo":"3.24.1"}]}}
{"SBOM":{"Packages":[{"name":"lodash","versionInfo":"4.17.21"}]}}

I'll assume by your question that you're starting with fields extracted with either KV_MODE = json or spath and not indexed extractions:

Event  SBOM.Packages{}.name             SBOM.Packages{}.versionInfo
1      @accordproject/concerto-linter   3.24.1
2      lodash                           4.17.21
       @accordproject/concerto-linter   3.24.1
3      lodash                           4.17.21


As you've found, KV_MODE = json scans only the first 10240 characters of _raw by default. See the limits.conf.spec [kv] stanza maxchars setting for more information.

The main challenge is correlating a value in SBOM.Packages{}.name at index i with a value at the same index in SBOM.Packages{}.versionInfo.
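To make the index problem concrete, here is a minimal sketch using the second test case. It assumes automatic extraction with spath and zips the two flattened fields by position with mvzip; the pair field name is purely illustrative. Positional pairing only holds when every package object carries both name and versionInfo and when extraction has not been cut short, so treat this as an illustration of the problem rather than a recommended fix.

````spl
| makeresults
| eval _raw="{\"SBOM\":{\"Packages\":[{\"name\":\"lodash\",\"versionInfo\":\"4.17.21\"},{\"name\":\"@accordproject/concerto-linter\",\"versionInfo\":\"3.24.1\"}]}}"
``` extract everything; name and versionInfo become two separate multivalue fields ```
| spath
``` zip by position (pair is an illustrative field name); a missing versionInfo would shift every later pair ```
| eval pair=mvzip('SBOM.Packages{}.name', 'SBOM.Packages{}.versionInfo', ",")
| table SBOM.Packages{}.name SBOM.Packages{}.versionInfo pair
````

If one object in the array had no versionInfo, the flattened versionInfo field would simply be one value shorter and the zip would silently pair names with the wrong versions.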

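We can extract and concatenate those values into a single multi-valued field using JSON eval functions. The Version values in the CSV carry a leading "= ", so the delimiter here is ",= " to produce strings in the same form as Package,Version:

| eval ioc=mvmap(json_array_to_mv(json_extract(_raw, "SBOM.Packages")), spath(_raw, "name").",= ".spath(_raw, "versionInfo"))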

We can do the same with shai-hulud-2-packages.csv and use the result as a search filter:

| search [| inputlookup shai-hulud-2-packages.csv | eval ioc=Package.",".Version | fields ioc ]

Combining them together in a complete example, only the positive test cases are returned:

| makeresults format=csv data="_raw
\"{\"\"SBOM\"\":{\"\"Packages\"\":[{\"\"name\"\":\"\"@accordproject/concerto-linter\"\",\"\"versionInfo\"\":\"\"3.24.1\"\"}]}}\"
\"{\"\"SBOM\"\":{\"\"Packages\"\":[{\"\"name\"\":\"\"lodash\"\",\"\"versionInfo\"\":\"\"4.17.21\"\"},{\"\"name\"\":\"\"@accordproject/concerto-linter\"\",\"\"versionInfo\"\":\"\"3.24.1\"\"}]}}\"
\"{\"\"SBOM\"\":{\"\"Packages\"\":[{\"\"name\"\":\"\"lodash\"\",\"\"versionInfo\"\":\"\"4.17.21\"\"}]}}\"
"
| eval ioc=mvmap(json_array_to_mv(json_extract(_raw, "SBOM.Packages")), spath(_raw, "name").",= ".spath(_raw, "versionInfo"))
| search [| inputlookup shai-hulud-2-packages.csv | eval ioc=Package.",".Version | fields ioc ]
_raw
{"SBOM":{"Packages":[{"name":"@accordproject/concerto-linter","versionInfo":"3.24.1"}]}}
{"SBOM":{"Packages":[{"name":"lodash","versionInfo":"4.17.21"},{"name":"@accordproject/concerto-linter","versionInfo":"3.24.1"}]}}

 

yuanliu
SplunkTrust

What exactly is in that file? Who or what produces this lookup? What is the goal (desired output) you are trying to achieve with this file? What exactly is in the source data? Like @PickleRick says, this is not a GitHub forum nor a Shai Hulud forum. Your question needs to focus on data and processing.

Based on the hint you dropped, I get the feeling that you are trying to find events containing certain field values that match a list of values in the lookup. The fields of interest are each package's name and versionInfo as a pair.

There are several problems with the approach shown. The biggest is the content of the lookup file. The root cause is Splunk's flattening of JSON arrays. If you examine your raw data closely, you'll notice that SBOM.Packages{}.name and SBOM.Packages{}.versionInfo are not independent keys. They are keys in elements of an array (which Splunk denotes as SBOM.Packages{}). You cannot arbitrarily pair them together.

Now, I assume either you (or your employer's organization) have control over the format and content of the lookup.  So, I strongly recommend that you organize your lookup around two essential keys, name and versionInfo.  Make sure that the two fields are not mismatched for your real purpose.
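As a rough sketch of that reorganization, the search below reshapes a Package,Version CSV like the one quoted earlier in this thread into a lookup keyed on name and versionInfo. The file names are assumptions (shai-hulud-2-packages.csv as the uploaded source, shai-hulud.csv as the output), and it also assumes each Version cell holds a single value of the form "= 3.24.1"; rows in any other shape would need their own handling.

````spl
| inputlookup shai-hulud-2-packages.csv
``` assumption: Version looks like = 3.24.1, so strip the leading equals sign and spaces ```
| eval name=Package, versionInfo=replace(Version, "^= *", "")
| fields name versionInfo
``` assumption: shai-hulud.csv is the lookup file you want to match against; adjust to your own ```
| outputlookup shai-hulud.csv
````

Keeping the column names identical to the JSON keys (name, versionInfo, same capitalization) means the lookup command can match them without any renaming.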

The second problem is also caused by Splunk's flattening of JSON arrays. After flattening, SBOM.Packages{}.name and SBOM.Packages{}.versionInfo become unrelated multivalue fields, i.e., independent arrays of their own. Using a subsearch with such data is doomed to be inaccurate. You have to go back to the actual JSON array, SBOM.Packages{}.

Provided that your lookup now has the correct pairs of name and versionInfo, here is one traditional approach to seek out matches.

index=github
| fields - SBOM.Packages{}.* ``` optional but helps performance ```
| spath path=SBOM.Packages{}
| mvexpand SBOM.Packages{}
| spath input=SBOM.Packages{}
| fields - SBOM.Packages{} ``` again, optional ```
| lookup shai-hulud.csv name versionInfo OUTPUT name AS match_name
| where isnotnull(match_name)

Again, the actual solution depends a lot on what you want to do with this match.  There can be more efficient code paths to get to your end game.
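The expansion steps in the search above can be tried without any indexed data. This is a small sketch that feeds one of the sample events from earlier in the thread through the same spath / mvexpand / spath sequence and stops before the lookup, so it does not depend on any lookup file existing.

````spl
| makeresults
| eval _raw="{\"SBOM\":{\"Packages\":[{\"name\":\"lodash\",\"versionInfo\":\"4.17.21\"},{\"name\":\"@accordproject/concerto-linter\",\"versionInfo\":\"3.24.1\"}]}}"
``` pull the array elements out as a multivalue field of JSON objects ```
| spath path=SBOM.Packages{}
``` split into one result row per package object ```
| mvexpand SBOM.Packages{}
``` extract name and versionInfo from each object, so each row carries one matched pair ```
| spath input=SBOM.Packages{}
| table name versionInfo
````

Each package should come out on its own row with its own name and versionInfo, which is the shape the lookup command needs. Keep in mind that mvexpand multiplies the row count by the number of packages per event, so memory limits can come into play on large SBOMs.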

PickleRick
SplunkTrust

If you use "iterate" in a sentence you're probably not thinking about your problem in a splunky way. 😉

Paste a sample of your data (anonymized/sanitized if needed) to visualize your problem and the expected outcome.
