
How to perform JOIN on large amounts of data coming from DB Connect App on different indexers?

Explorer

index="xyz_order_line"
| join ORDER_NUMBER_KEY
    [| inputlookup sample_lookup1.csv | where serial_no>0 AND serial_no<50001]
| where some_condition
| table ORDER_NUMBER_KEY, serial_no, field_x, field_y, field_z
| append
    [search index="xyz_order_line"
    | join ORDER_NUMBER_KEY
        [| inputlookup sample_lookup1.csv | where serial_no>50000 AND serial_no<100001]
    | where some_condition
    | table ORDER_NUMBER_KEY, serial_no, field_x, field_y, field_z]
| append
    [search index="xyz_order_line"
    | join ORDER_NUMBER_KEY
        [| inputlookup sample_lookup1.csv | where serial_no>100000 AND serial_no<150001]
    | where some_condition
    | table ORDER_NUMBER_KEY, serial_no, field_x, field_y, field_z]
| append
    [search index="xyz_order_line"
    | join ORDER_NUMBER_KEY
        [| inputlookup sample_lookup1.csv | where serial_no>150000 AND serial_no<200001]
    | where some_condition
    | table ORDER_NUMBER_KEY, serial_no, field_x, field_y, field_z]
| outputlookup sample_lookup2.csv

I need to run this query over the last 7 days, which means a huge inflow of records: index="xyz_order_line" has about 20 lakh (2 million) events, and sample_lookup1 has 12 lakh (1.2 million) records.

When I run the above query for the first 2 lakh records it works fine and returns all results, but when I run it for the 6 to 8 lakh range it returns results only for the first 50k records; the remaining joins don't work.

I also tried reducing the range to 10k records per batch (instead of 50k at a time), but the join still doesn't work in that case.

However, when I run each batch as a separate query (without append), e.g. 600000-650000, 650000-700000, 700000-750000, and 750000-800000, I get correct results, which means the join itself works properly.

The behavior is inconsistent: some days one range works fine, but the next day the same range doesn't.

One possible workaround would be to write 30 different saved searches to cover all 15 lakh records, but that doesn't seem like the best approach, and it takes too long to execute.

Please suggest the best approach to solve this issue.

Basic scenario

  1. Data is pulled from an Oracle database.
  2. The DB Connect app is used.
  3. A join needs to be performed across 4 different indexers.
  4. The last 7 days of data need to be pulled.
  5. Due to the large amount of data, the output is stored in a lookup and processed 50k records at a time, because subsearches are limited to 50,000 results.

SplunkTrust

First, read this: http://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-joi...
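The stats-based approach from that answer could be sketched for this case roughly as follows. This is only an assumption about the data layout: it presumes serial_no comes only from the lookup, while field_x/field_y/field_z come only from the events. Note that `inputlookup append=true` adds the lookup rows to the result set without going through a subsearch, so the 50k subsearch cap does not apply:

```spl
index="xyz_order_line" some_condition
| fields ORDER_NUMBER_KEY, field_x, field_y, field_z
| inputlookup append=true sample_lookup1.csv
| stats values(serial_no) AS serial_no, values(field_x) AS field_x,
        values(field_y) AS field_y, values(field_z) AS field_z
        BY ORDER_NUMBER_KEY
| where isnotnull(serial_no) AND isnotnull(field_x)
| outputlookup sample_lookup2.csv
```

The final `where` keeps only keys that appeared on both sides, mimicking an inner join.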
Second, have you considered using the lookup command instead of join?
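A lookup-based version might look like the sketch below, assuming a lookup definition has been created for sample_lookup1.csv. Unlike join, the lookup command enriches events in a single streaming pass and has no subsearch row limit:

```spl
index="xyz_order_line" some_condition
| lookup sample_lookup1 ORDER_NUMBER_KEY OUTPUT serial_no
| where isnotnull(serial_no)
| table ORDER_NUMBER_KEY, serial_no, field_x, field_y, field_z
| outputlookup sample_lookup2.csv
```

Here `where isnotnull(serial_no)` discards events with no match in the lookup, giving inner-join semantics.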
Third, have you considered performing the join in the DBMS before pulling the data into Splunk?
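If the join is pushed down to Oracle, DB Connect's dbxquery command can run the SQL and return the joined rows directly; the connection name, table names, and date column below are illustrative assumptions, not taken from the thread:

```spl
| dbxquery connection="oracle_prod" query="
    SELECT o.order_number_key, s.serial_no, o.field_x, o.field_y, o.field_z
    FROM xyz_order_line o
    JOIN sample_serial_table s ON s.order_number_key = o.order_number_key
    WHERE o.order_date >= SYSDATE - 7"
| outputlookup sample_lookup2.csv
```

This keeps the heavy set operation in the database, where it is indexed and optimized, and brings only the final result into Splunk.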


Explorer

Thanks for your reply, Martin.

I have gone through all of the solutions you suggested.
First: in our case the data lives on different indexers, not in different sourcetypes, and the use case is too complex to handle with the stats command.

Second: all this data comes from an Oracle database, and it's transactional data that keeps changing, so using a lookup is also a bit difficult here. That said, I am currently using a lookup to store the output of the first query and later using it in the second query to perform the join.
Please let me know if you meant something else by "using lookup".

Third: I was considering this option too, but I'm not sure how to implement it. Per the requirement, I need to pull the last 7 days of data; if I schedule a query to run every day and send the data to Splunk, there will be duplicate records for the last 6 days.
Problems:
1. How do I remove duplicates, or stop them from being indexed at all?
2. Could performing the join on the database side cause any issues?
3. Will it impact database performance?
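On the duplicates question, one option is to de-duplicate at search time rather than fighting it at index time. A minimal sketch (the index name here is a hypothetical destination for the joined data, not something from this thread):

```spl
index="xyz_order_joined" earliest=-7d
| dedup ORDER_NUMBER_KEY serial_no
| table ORDER_NUMBER_KEY, serial_no, field_x, field_y, field_z
```

Alternatively, DB Connect's rising-column input mode checkpoints the last value of a monotonically increasing column and pulls only newer rows on each run, which prevents duplicates from being indexed in the first place.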

Could you please be more specific about how to perform the join on the database side and get the final result into Splunk? Please let me know if any additional information is needed.
Thanks in advance.
