Solved: Why is multi column join not working whereas singl...

pmittal · ‎10-10-2022

single column join is working

index=* source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17* host IN (*.test.com)

| rename columns.pid AS pid, columns.pid_ts as pid_ts, columns.path as path,

| dedup host path pid

| join pid type=left max=1 [search index=* source=process host IN (*.test.com) earliest=-25h latest=now

  | rename columns.pid AS pid, columns.cmdline as cmd, columns.username as user, columns.uid as uid, columns.groupname as group, columns.gid as gid

  | dedup host pid]

| table host, path, pid, user, uid, group, gid, cmd

but multi column join is not working

index=* source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17* host IN (*.test.com)

| rename columns.pid AS pid, columns.pid_ts as pid_ts, columns.path as path,

| dedup host path pid

| join host,pid type=left max=1 [search index=* source=process host IN (*.test.com) earliest=-25h latest=now

  | rename columns.pid AS pid, columns.cmdline as cmd, columns.username as user, columns.uid as uid, columns.groupname as group, columns.gid as gid

  | dedup host pid]

| table host, path, pid, user, uid, group, gid, cmd

Splunk Enterprise

Version:8.2.6Build:a6fe1ee8894b

gcusello · ‎10-10-2022

Hi @pmittal,

as you surely read in many posts on this community, Splunk isn't a DB and the usual approach thet people that worked on a DB (like me eleven years ago!) are usually using is wrong in Splunk:

I'm sure that your search is very slow, because the join command can be used only if there in't any other solution because it's very slow and it has the limit of 50,000 results in the subsearch.

For this reason, I hint to see a different approach based on stats, something like this:

index=* ((source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17*) OR (source=process earliest=-25h latest=now)) host IN (*.test.com)
| rename 
   columns.pid AS pid
   columns.pid_ts as pid_ts
   columns.path as path
   columns.cmdline as cmd
   columns.username as user
   olumns.uid as uid
   columns.groupname as group
   columns.gid as gid
| stats 
   values(host) AS host 
   values(path) AS path 
   values(user) AS user
   values(uid) AS uid
   values(group) AS group
   values(gid) AS gid
   values(cmd) AS cmd
   BY pid
| table host path pid user uid group gid cmd

In few words, you have to put in the main search both the searches and make a stats command by the common keys, adding as values the other fields you want.

About the multi field join, maybe in your search the problem is the comma between the join fields.

I suppose that you already checked that there are values matching the join.

Anyway, also in this case the solution could be:

index=* ((source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17*) OR (source=process earliest=-25h latest=now)) host IN (*.test.com)
| rename 
   columns.pid AS pid
   columns.pid_ts as pid_ts
   columns.path as path
   columns.cmdline as cmd
   columns.username as user
   olumns.uid as uid
   columns.groupname as group
   columns.gid as gid
| stats 
   values(path) AS path 
   values(user) AS user
   values(uid) AS uid
   values(group) AS group
   values(gid) AS gid
   values(cmd) AS cmd
   BY host pid
| table host path pid user uid group gid cmd

Ciao.

Giuseppe

View solution in original post

pmittal · ‎10-10-2022

Hi @gcusello ,

The problem seems to be due to no. of records greater than 50k as you pointed out. However, comma between the join fields is not an issue. There is one problem with your solution, it's not providing left join functionality. It just combine records from both the indexes. E.g. final output should have only 4 records but what I am getting is like 70k records. After stats, I tried filtering data using where but no success. What I want is only path containing vulnerable log4j i.e. the path filter used in first query. Sorry, I am new to Splunk. So, this might be a very basic ask

gcusello · ‎10-10-2022

Hi @pmittal,

please try this:

index=* ((source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17*) OR (source=process earliest=-25h latest=now)) host IN (*.test.com)
| rename 
   columns.pid AS pid
   columns.pid_ts as pid_ts
   columns.path as path
   columns.cmdline as cmd
   columns.username as user
   olumns.uid as uid
   columns.groupname as group
   columns.gid as gid
| stats 
   values(host) AS host 
   values(path) AS path 
   values(user) AS user
   values(uid) AS uid
   values(group) AS group
   values(gid) AS gid
   values(cmd) AS cmd
   dc(source) AS source_count
   BY pid
| where source_count=2
| table host path pid user uid group gid cmd

Ciao.

Giuseppe

pmittal · ‎10-10-2022

Since, I required regex; I used

| search path IN (*/log4j-core*) NOT path=*/log4j*2.17*

insetad of where command. Thank you for your support!!

gcusello · ‎10-10-2022

Hi @pmittal,

good for you, see next time!

Ciao and happy splunking

Giuseppe

P.S.: Karma Points are appreciated 😉

pmittal · ‎10-10-2022

got it working with search command instead of where. However, still not getting expected output but on right track. Will update in sometime

gcusello · ‎10-10-2022

Hi @pmittal,

as you surely read in many posts on this community, Splunk isn't a DB and the usual approach thet people that worked on a DB (like me eleven years ago!) are usually using is wrong in Splunk:

I'm sure that your search is very slow, because the join command can be used only if there in't any other solution because it's very slow and it has the limit of 50,000 results in the subsearch.

For this reason, I hint to see a different approach based on stats, something like this:

index=* ((source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17*) OR (source=process earliest=-25h latest=now)) host IN (*.test.com)
| rename 
   columns.pid AS pid
   columns.pid_ts as pid_ts
   columns.path as path
   columns.cmdline as cmd
   columns.username as user
   olumns.uid as uid
   columns.groupname as group
   columns.gid as gid
| stats 
   values(host) AS host 
   values(path) AS path 
   values(user) AS user
   values(uid) AS uid
   values(group) AS group
   values(gid) AS gid
   values(cmd) AS cmd
   BY pid
| table host path pid user uid group gid cmd

In few words, you have to put in the main search both the searches and make a stats command by the common keys, adding as values the other fields you want.

About the multi field join, maybe in your search the problem is the comma between the join fields.

I suppose that you already checked that there are values matching the join.

Anyway, also in this case the solution could be:

index=* ((source=jar columns.path="*/log4j-core*" NOT columns.path=*/log4j*2.17*) OR (source=process earliest=-25h latest=now)) host IN (*.test.com)
| rename 
   columns.pid AS pid
   columns.pid_ts as pid_ts
   columns.path as path
   columns.cmdline as cmd
   columns.username as user
   olumns.uid as uid
   columns.groupname as group
   columns.gid as gid
| stats 
   values(path) AS path 
   values(user) AS user
   values(uid) AS uid
   values(group) AS group
   values(gid) AS gid
   values(cmd) AS cmd
   BY host pid
| table host path pid user uid group gid cmd

Ciao.

Giuseppe

Why is multi column join not working whereas single column join is working fine?

Splunk Enterprise

join

🌟 From Audit Chaos to Clarity: Welcoming Audit Trail v2

Splunk Enterprise Security 8.x: The Essential Upgrade for Threat Detection, ...

Get Early Access to AI Playbook Authoring: Apply for the Alpha Private Preview ...

Are you a member of the Splunk Community?

Why is multi column join not working whereas single column join is working fine?

Splunk Enterprise

join

🌟 From Audit Chaos to Clarity: Welcoming Audit Trail v2

Splunk Enterprise Security 8.x: The Essential Upgrade for Threat Detection, ...

Get Early Access to AI Playbook Authoring: Apply for the Alpha Private Preview ...