Splunk Search
Highlighted

How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Motivator

Hi Splunkers,

I have tried stats dc(sourcetype) as count by commonfield | where count > 1. I assume this search is fetching all records to SH and doing statistics as a results its consuming space in Gigs in search head. its very time consuming. if you have done some thing to two compare two big data sets please guys advice me.

Note: These two data sets don't have a timestamp. The size of the two data sets is 300 million records x 1 field and 20 million records x 3 fields.

0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Motivator

@somesoni2,

Sir, I have been told that we can use coalesce to join two big data sets.

I have seen that you have used coalesce in post like below,

index=abc OR index=def | eval commonfield=coalesce(field1,field4) | makemv commonfield delim="," | mvexpand commonfield | stats list() as * by commonfield | where isnotnull(field4) | eval temp=mvzip(field4,field5,"#") | mvexpand temp | rex field=temp "(?.)#(?.*)" | fields - temp commonfield

but, I am not sure how to use that.

please help me.

0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Motivator

sample data:

index a:
field1,field2,field3

index b:
field4

field3 and field4 are same. but there is not timestamp in those fields.

0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

SplunkTrust
SplunkTrust

First of all, no sir.

You can do something like this (for inner join, data available in both indexes). It would be better if you can also provide your expected output.

index=a OR index=b | fields index field1 field2 field3 field4 | eval commonfield=coalesce(field3,field4) | stats dc(index) as indexes values(field1) as field1 values(field2) as field2 by commonfield | where indexes=2

View solution in original post

Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Motivator

first of all thanks a lot for your swift response..
Expected output:

field1,field2,field3

but, my concern is its taking more storage space while calculating. I have not used coalesce.. i used stats after renaming the field4 to field3.

this is what i have done.
index=a OR index=b | rename field4 as field3 | stats dc(index) as count values(field1) values(field2) by field3 | where count>1

its killing my storage. is there any difference using rename and coaleasce.

0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

SplunkTrust
SplunkTrust

I'm afraid it's because of too many rows. Sometime including a fields or table command to limit the number of fields helps.

index=a OR index=b | table field1 field2 field3 field3 | eval field3=coalesce(field3,field4)  | stats dc(index) as count values(field1) values(field2) by field3 | where count>1
0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

SplunkTrust
SplunkTrust

The fields command is the important one for reducing storage. If you don't put that one in, then your | stats list(*) is keeping literally EVERY field in both files.

Try this -

index=a OR index=b | fields field1 field2 field3 field4 
| rename field4 as field3 
| stats dc(index) as indexcount, values(field1) as field1, values(field2) as field2 by field3 
| where indexcount>1
0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Motivator

Thanks for your answers..

I have tried renaming field and i dont have more than 3 fields in one dataset and one field in another dataset.

I am much interested to know if there is any way to achieve this .easily

0 Karma
Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Motivator

Thanks for your answers. fields and coalesce did good job.

Highlighted

Re: How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

SplunkTrust
SplunkTrust

@thambisetty - you are welcome! We've moved the comment to answer. Please accept somesoni2's answer so that the question will show as solved.

0 Karma