We have a claims table in Hunk and a provider table, both came from an RDBMS to Hadoop via sqoop.
How can I join these two tables on the provider id, which is a foreign key on the claims table and the primary key on the provider table?
I see https://answers.splunk.com/answers/122594/hunk-join-2-virtual-indexes.html, but I'm not clear about it...
I'm trying the join command first ; - )
1) index=claim 362657618
---- It immediately brings prov_tin as a numerical field
2) index=provider 362657618
---- It immediately brings tin as a numerical field
3) index=claim 362657618 | join prov_tin [search provider | rename tin AS prov_tin]
---- It spins and spins
Is the syntax of #3 right?
In #3 you've omitted the long number from the joined search, and the index= has gone missing. In the spirit of today's excellent virtual.conf talk by @sideview (recording here: http://wiki.splunk.com/Virtual_.conf ), try this:
(index=claim OR index=provider) 362657618 | eval common_tin = if(index="claim", prov_tin, tin) | stats values(*) as * by common_tin
values(*) as *
is awesome, provided that you put an explicit fields command in front of it... If (when) you ever have 100 or more fields incoming, search speed and memory usage can blow up unexpectedly.
It's the best I can do with the information given in the question 😛
Thank you Martin!!
The query runs for 20 minutes or so but nothing comes back. Let's keep in mind that the claim index holds around 2 billion claims and each of them has hundreds of fields.
I do see the MapR job being generated and it's visible via the resource manager.
In order to run on it on a subset of the data I changed the first part to be -
((index=claim source="part-m-00078") OR index=provider) 362657618
The intent is to run the query only on one sqoop file. Does it look right?
I do get results now but I'm not sure they are right...
The other angle to solve this is by accessing the database directly using Hunk with the DBConnect App - Lookup command:
http://docs.splunk.com/Documentation/DBX/latest/DeployDBX/Createandmanagedatabaselookups
and
http://docs.splunk.com/Documentation/Splunk/6.3.3/SearchReference/SQLtoSplunk
Interesting thing but, at this point, we probably won't choose to make a round-trip back to the DB.
This isn't Hunk-specific, but should apply regardless: https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo...
Much appreciated - let me please check it...