Multi-Source Full Outer Join using Append

mcrawford44 · ‎01-17-2014

All,

As I understand it; The Splunk JOIN command does not have a 'full outer join' option. I was able to look-up an example of using the APPEND command, but the results are not what I expected.

I have 3 data sources. In a full join, the returned records (rows) should total the largest record (row) return total. So if source 1 has 1000 records, and 2-3 only have 50; The total records (rows) returned would be 1000. This is not occuring using the example provided here;

http://answers.splunk.com/answers/81741/full-outer-join

My 3 sources;

Splunk AD index (with deduped cn) - 29,860 events/stats

DB 1 (mcaffe encryption) - 75,178 records

DB 2 (marimba) - 161,791

Matching columns; AD=cn, DB1=ComputerName, DB2=name

If these were all SQL tables, a FULL OUTER JOIN would return 161,791 records as that is the largest table. As said earlier this does not occur with the example linked above.

I'm using this query;

index=ad_test 
| rename cn as ComputerName 
| dedup ComputerName 
| stats count by ComputerName distinguishedName 
| append [ | dbquery mcafee "SELECT * FROM EPOComputerProperties" | stats count by ComputerName CPUType] 
| append [ | dbquery marimba_2014 "SELECT * FROM inv_machine" | rename name AS ComputerName | stats count by ComputerName serial_number] 
| stats values(serial_number) as serial_number values(distinguishedName) as distinguishedName values(CPUType) as CPUType by ComputerName

The returned result count; 129,645

I'm trying to understand the behavior of why this is happening, and what do I need to do to pull a complete recordset from these 3 sources as a full outer join.

Thanks in advance.

mcrawford44 · ‎01-21-2014

Result count;

ad_test = 851 (I rebuilt the index to focus on a specific OU)

mcafee = 162,237 (Additional records have come in.)

marimba_2014 = 74,807 (including serial_number) 75,461 (without serial_number)

Combined append query results; 129,643

mcafee and marimba do contain SOME duplicates but there are only ~15 records of duplicates and the maximum count is 3. So a maximum of 45. Not 40k missing records. 😕

linu1988 · ‎01-20-2014

Activating the question 🙂

martin_mueller · ‎01-17-2014

I see you had counted the tables individually, but I don't see that for the subqueries - can you run these three and look at the numbers,make sure they match the expected numbers:

index=ad_test | rename cn as ComputerName | dedup ComputerName | stats count by ComputerName distinguishedName

| dbquery mcafee "SELECT * FROM EPOComputerProperties" | stats count by ComputerName CPUType

| dbquery marimba_2014 "SELECT * FROM inv_machine" | rename name AS ComputerName | stats count by ComputerName serial_number

Just to make sure those stats don't drop events with identical CNs/SNs...

linu1988 · ‎01-17-2014

Hello,
i don't quite understand the retirement but there are several things noticeable and can be optimized.

There is no point doing a database query operation in splunk as it is an overhead on both the system + network traffic. If you want to get the Computer name it can be renamed later on the returned result. There is the query.

 index=ad_test 
| rename cn as ComputerName 
| dedup ComputerName 
| stats count by ComputerName distinguishedName 
| append [ | dbquery mcafee "SELECT count(1) as CNT,ComputerName,CPUType  FROM EPOComputerProperties group by ComputerName,CPUType" | rename CNT as Count] 
| append [ | dbquery marimba_2014 "SELECT count(1) as CNT,name,serial_number FROM inv_machine group by name,serial_number" | rename name AS ComputerName | rename CNT AS Count1] 
| stats values(serial_number) as serial_number values(distinguishedName) as distinguishedName values(CPUType) as CPUType by ComputerName

Now this should give you a better performance and result. Thanks.

linu1988 · ‎01-17-2014

I do understand the full outer join which you want to do, but i am not quite understanding what is the requirement with a nonuniform dataset of 161919 and how do we handle that in splunk. And knowing the values is very important as that will give you the required data rather than a redundant set. I am curious to know what is the solution to this.

mcrawford44 · ‎01-17-2014

Knowing the values is irrelevant. A full outer join would return ANY value, and match the like values. Any non matched value is still appended as a record.

I apologize but I think you are not understanding the concept fully. I do appreciate the assistance though.

linu1988 · ‎01-17-2014

So you mean you don't have a distinct combination of 161919 computer and cputypes in your table? In your AD you have only 28000 records why do you want to compare with that many records from table!

Correct me with the below details

select distinct ComputerName,CPUType from EPOComputerProperties =161919?

select distinct name,serial_number from inv_machine=75323?

I want to know the values, if they are correct i may be of no help to you.

mcrawford44 · ‎01-17-2014

JOIN actually works, it just doesn't support FULL outer. Only LEFT outer.

Yes the grouping keeps the count the same for the individual queries. But when combined in the APPEND, only 100k rows are returned. It should be 161919.

linu1988 · ‎01-17-2014

if yes then you should use Join to get it done, if you don't have any common field with same values it is not possible to relate them.

linu1988 · ‎01-17-2014

So is it like you want to get the set like

ComputerName,distinguishedName,CPUType,Serial_number?

Secondly after the grouping from sql , do you still get 161919 and 75323 records?

mcrawford44 · ‎01-17-2014

From the documentation on SET; "Important: The set command works on less than 10 thousand results."

They are all flat tables that contain computer names and columns with information on the computer from the perspective of the source.

AD; (29860 records)
ComputerName (renamed cn), distinguishedName

DB1; (161919 records)
ComputerName, CPUType

DB2; (75323 records)
ComputerName (renamed name), serial_number

This concept; http://www.w3schools.com/sql/sql_join_full.asp

linu1988 · ‎01-17-2014

two append queries will return very less no of records, did you check executing them separately? You may need to see if all the 3 queries you have have the same ComputerName returned. Without seeing the data its not quite easy to give the optimum soultion.

Or you may want to look at set opration here to see how the results come. Union will give you everything i guess.

|set [search 1][search 2]

_http://docs.splunk.com/Documentation/Splunk/6.0.1/SearchReference/Set

mcrawford44 · ‎01-17-2014

To explain further this is a query to discover machines in 3 separate sources. So the goal is to return all instances of all machines regardless of the source.

If you do a left join on C, then you exclude records from A and B that are not inside C.

mcrawford44 · ‎01-17-2014

Left join won't work because the data is not static. Some days the other sources will be larger.

I fixed your query to add the group, however the individual queries still don't add up correctly. For example, the first append returns 161k rows. Your entire query is only bringing back 100k There is no filtering taking place on these to my knowledge, I'm lost as to where the missing records are going. None of the query results contain duplicates or nulls.

I used count because of the example link I provided in the question.

linu1988 · ‎01-17-2014

i missed a group by statement, but the last stats statement doesn't use the count, why do we have the count then? And if you want 15 results, why don't you make join type left and start from the table C, it will contain 15 records at last..

mcrawford44 · ‎01-17-2014

Well it doesn't fully work. One of the sub-queries returns the error above. Your query seems to return less. I'm looking for more. Example; 3 tables. A,B,C. A has 10 rows, B has 6 rows, C has 15 rows. The full outer join would return 15 rows including unmatched items.

linu1988 · ‎01-17-2014

So does it give you what you need? The result set is very less compared to your query

mcrawford44 · ‎01-17-2014

Fixed version;

index=ad_test
| rename cn as ComputerName
| dedup ComputerName
| stats count by ComputerName distinguishedName
| append [ | dbquery mcafee "SELECT count(1) as CNT,ComputerName,CPUType FROM EPOComputerProperties group by ComputerName,CPUType" | rename CNT as Count]
| append [ | dbquery marimba_2014 "SELECT count(1) as CNT,name,serial_number FROM inv_machine" | rename name AS ComputerName | rename CNT AS Count1]
| stats values(serial_number) as serial_number values(distinguishedName) as distinguishedName values(CPUType) as CPUType by ComputerName

mcrawford44 · ‎01-17-2014

The requirement is the function of a full outer join. Splunk's outer join is LEFT outer only. A join of this type puts multiple sources together, and reports back both matching and non matching items from all tables. Unlike a LEFT join a FULL outer join displays all results from all tables.

The query you provided needed to be fixed a bit, but when running it I recieve this error; [subsearch]: command="dbquery", A database error occurred: Column 'inv_machine.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Multi-Source Full Outer Join using Append

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Best Practices: Splunk auto adjust pipeline queue

Laser Bananas and Edge Hubs: Exploring Operational Technology (OT) Data Through a ...

Event Series: Mastering AI Tokenomics and Splunk Agent Observability

Join the Conversation

Multi-Source Full Outer Join using Append

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Best Practices: Splunk auto adjust pipeline queue

Laser Bananas and Edge Hubs: Exploring Operational Technology (OT) Data Through a ...

Event Series: Mastering AI Tokenomics and Splunk Agent Observability