As I understand it; The Splunk JOIN command does not have a 'full outer join' option. I was able to look-up an example of using the APPEND command, but the results are not what I expected.
I have 3 data sources. In a full join, the returned records (rows) should total the largest record (row) return total. So if source 1 has 1000 records, and 2-3 only have 50; The total records (rows) returned would be 1000. This is not occuring using the example provided here;
My 3 sources;
Splunk AD index (with deduped cn) - 29,860 events/stats DB 1 (mcaffe encryption) - 75,178 records DB 2 (marimba) - 161,791 Matching columns; AD=cn, DB1=ComputerName, DB2=name
If these were all SQL tables, a FULL OUTER JOIN would return 161,791 records as that is the largest table. As said earlier this does not occur with the example linked above.
I'm using this query;
index=ad_test | rename cn as ComputerName | dedup ComputerName | stats count by ComputerName distinguishedName | append [ | dbquery mcafee "SELECT * FROM EPOComputerProperties" | stats count by ComputerName CPUType] | append [ | dbquery marimba_2014 "SELECT * FROM inv_machine" | rename name AS ComputerName | stats count by ComputerName serial_number] | stats values(serial_number) as serial_number values(distinguishedName) as distinguishedName values(CPUType) as CPUType by ComputerName
The returned result count; 129,645
I'm trying to understand the behavior of why this is happening, and what do I need to do to pull a complete recordset from these 3 sources as a full outer join.
Thanks in advance.
ad_test = 851 (I rebuilt the index to focus on a specific OU)
mcafee = 162,237 (Additional records have come in.)
marimba2014 = 74,807 (including serialnumber) 75,461 (without serial_number)
Combined append query results; 129,643
mcafee and marimba do contain SOME duplicates but there are only ~15 records of duplicates and the maximum count is 3. So a maximum of 45. Not 40k missing records. 😕
I see you had counted the tables individually, but I don't see that for the subqueries - can you run these three and look at the numbers,make sure they match the expected numbers:
index=ad_test | rename cn as ComputerName | dedup ComputerName | stats count by ComputerName distinguishedName | dbquery mcafee "SELECT * FROM EPOComputerProperties" | stats count by ComputerName CPUType | dbquery marimba_2014 "SELECT * FROM inv_machine" | rename name AS ComputerName | stats count by ComputerName serial_number
Just to make sure those
stats don't drop events with identical CNs/SNs...
i don't quite understand the retirement but there are several things noticeable and can be optimized.
There is no point doing a database query operation in splunk as it is an overhead on both the system + network traffic. If you want to get the Computer name it can be renamed later on the returned result. There is the query.
index=ad_test | rename cn as ComputerName | dedup ComputerName | stats count by ComputerName distinguishedName | append [ | dbquery mcafee "SELECT count(1) as CNT,ComputerName,CPUType FROM EPOComputerProperties group by ComputerName,CPUType" | rename CNT as Count] | append [ | dbquery marimba_2014 "SELECT count(1) as CNT,name,serial_number FROM inv_machine group by name,serial_number" | rename name AS ComputerName | rename CNT AS Count1] | stats values(serial_number) as serial_number values(distinguishedName) as distinguishedName values(CPUType) as CPUType by ComputerName
Now this should give you a better performance and result. Thanks.
I do understand the full outer join which you want to do, but i am not quite understanding what is the requirement with a nonuniform dataset of 161919 and how do we handle that in splunk. And knowing the values is very important as that will give you the required data rather than a redundant set. I am curious to know what is the solution to this.
Knowing the values is irrelevant. A full outer join would return ANY value, and match the like values. Any non matched value is still appended as a record.
I apologize but I think you are not understanding the concept fully. I do appreciate the assistance though.
So you mean you don't have a distinct combination of 161919 computer and cputypes in your table? In your AD you have only 28000 records why do you want to compare with that many records from table!
Correct me with the below details
select distinct ComputerName,CPUType from EPOComputerProperties =161919?
select distinct name,serialnumber from invmachine=75323?
I want to know the values, if they are correct i may be of no help to you.
JOIN actually works, it just doesn't support FULL outer. Only LEFT outer.
Yes the grouping keeps the count the same for the individual queries. But when combined in the APPEND, only 100k rows are returned. It should be 161919.
From the documentation on SET; "Important: The set command works on less than 10 thousand results."
They are all flat tables that contain computer names and columns with information on the computer from the perspective of the source.
AD; (29860 records)
ComputerName (renamed cn), distinguishedName
DB1; (161919 records)
DB2; (75323 records)
ComputerName (renamed name), serial_number
This concept; http://www.w3schools.com/sql/sql_join_full.asp
two append queries will return very less no of records, did you check executing them separately? You may need to see if all the 3 queries you have have the same ComputerName returned. Without seeing the data its not quite easy to give the optimum soultion.
Or you may want to look at set opration here to see how the results come. Union will give you everything i guess.
|set [search 1][search 2]
To explain further this is a query to discover machines in 3 separate sources. So the goal is to return all instances of all machines regardless of the source.
If you do a left join on C, then you exclude records from A and B that are not inside C.
Left join won't work because the data is not static. Some days the other sources will be larger.
I fixed your query to add the group, however the individual queries still don't add up correctly. For example, the first append returns 161k rows. Your entire query is only bringing back 100k There is no filtering taking place on these to my knowledge, I'm lost as to where the missing records are going. None of the query results contain duplicates or nulls.
I used count because of the example link I provided in the question.
i missed a group by statement, but the last stats statement doesn't use the count, why do we have the count then? And if you want 15 results, why don't you make join type left and start from the table C, it will contain 15 records at last..
Well it doesn't fully work. One of the sub-queries returns the error above. Your query seems to return less. I'm looking for more. Example; 3 tables. A,B,C. A has 10 rows, B has 6 rows, C has 15 rows. The full outer join would return 15 rows including unmatched items.
| rename cn as ComputerName
| dedup ComputerName
| stats count by ComputerName distinguishedName
| append [ | dbquery mcafee "SELECT count(1) as CNT,ComputerName,CPUType FROM EPOComputerProperties group by ComputerName,CPUType" | rename CNT as Count]
| append [ | dbquery marimba2014 "SELECT count(1) as CNT,name,serialnumber FROM invmachine" | rename name AS ComputerName | rename CNT AS Count1]
| stats values(serialnumber) as serialnumber values(distinguishedName) as distinguishedName values(CPUType) as CPUType by ComputerName
The requirement is the function of a full outer join. Splunk's outer join is LEFT outer only. A join of this type puts multiple sources together, and reports back both matching and non matching items from all tables. Unlike a LEFT join a FULL outer join displays all results from all tables.
The query you provided needed to be fixed a bit, but when running it I recieve this error; [subsearch]: command="dbquery", A database error occurred: Column 'inv_machine.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.