Multiple index join with different formatted data ...

ppanchal · ‎09-22-2016

I have esbetalog in JSON format and etaprd in RAW format and outer joined as with CUSTOMER_ORDER_NUMBER column
both has same CUSTOMER_ORDER_NUMBER data but etaprd data is not coming in result
etaprd can have one or more _raw event data

Sample query below,

index=esbetalog source=PRD (LINE_OF_BUSINESS_CD="R" OR LINE_OF_BUSINESS_CD="C") | rename .CUSTOMER_ORDER_NUMBER as CUSTOEMR_ORDER_NUMBER | join type=left CUSTOMER_ORDER_NUMBER [search index=etaprd source=PRD ] | table CUSTOMER_ORDER_NUMBER, ETA_FROM_TIME, ETA_TO_TIME, ARRIVECUSTOMERSTAMP, LINE_OF_BUSINESS_CD | sort ETA_FROM_TIME

sideview · ‎09-22-2016

just in case it's this - you have a typo - Note the "CUSTOEMR_ORDER_NUMBER".

Also you don't need join for this and you can do it with just stats.

(index=esbetalog source=PRD (LINE_OF_BUSINESS_CD="R" OR LINE_OF_BUSINESS_CD="C")) OR (index=etaprd source=PRD) 
| table CUSTOMER_ORDER_NUMBER, ETA_FROM_TIME, ETA_TO_TIME, ARRIVECUSTOMERSTAMP, LINE_OF_BUSINESS_CD index 
| stats values(*) as * by CUSTOMER_ORDER_NUMBER 
| sort ETA_FROM_TIME

It will run quite a lot faster this way without the join, and not be subject to various truncation limits of the join command. The rename command seemed to be redundant so I removed it as well.

Note: values(*) as * should be used sparingly - note here I am careful to use the table command beforehand, so that it's only working with 5 fields.

There might be a tweak or so needed - but try it out and I can update the answer as necessary.

ppanchal · ‎09-22-2016

I tried your search query as and it did give me some result as shown below.

CUSTOMER_ORDER_NUMBER ARRIVECUSTOMERSTAMP ETA_FROM_TIME ETA_TO_TIME LINE_OF_BUSINESS_CD index
3787312003 2016-09-22 03:45:14.0 2016-09-21 02:01:00.000 2016-09-22 02:46:00.000 C esbetalog
2016-09-21 05:01:00.000 2016-09-22 05:46:00.000 etaprd

However, index esbetalog has multiple event for same CUSTOMER_ORDER_NUMBER. What I need that index 'etaprd' values to be repeated for those many event 'esbetalog' matching index events. Table output should look like as One-To-Many join. Another issue I have is that 'esbetalog' index will have 4k event at the most per day where as 'etaprd' index will have 2 million events per day. Because of that it is downgrading the performance. I dont need to bring unnecessary eventdata from 'etaprd' except only matching CUSTOMER_ORDER_NUMBER data.

sideview · ‎09-22-2016

OK. Great description, and I think I understand. I also think I have to know which fields are coming from which side. So here I've made an arbitrary assumption that it's ETA_TO_TIME and and ETA_FROM_TIME that are the fields only present in the sparser index=etaprd events. Most likely this assumption is of course wrong, but you can see how that assumption is represented in the query below and modify it to match reality.

(index=esbetalog source=PRD (LINE_OF_BUSINESS_CD="R" OR LINE_OF_BUSINESS_CD="C")) OR (index=etaprd source=PRD) CUSTOMER_ORDER_NUMBER=*
| table CUSTOMER_ORDER_NUMBER, ETA_FROM_TIME, ETA_TO_TIME, ARRIVECUSTOMERSTAMP, LINE_OF_BUSINESS_CD index 
| eventstats values(ETA_TO_TIME) as ETA_TO_TIME values(ETA_FROM_TIME) as ETA_FROM_TIME by CUSTOMER_ORDER_NUMBER
| search index=esbetalog
| sort ETA_FROM_TIME

A couple things are changed here from my base answer.

1) I snuck in another search term of CUSTOMER_ORDER_NUMBER=* This may or may not result in fewer events coming off disk, so you might want to add a couple other raw text terms that you know will be present in the right events.
2) Using eventstats instead of stats. This will not transform the rows, ie the output rows will be the same as the input rows, with some extra fields. The specific effect of the eventstats command here will be to copy the ETA_TO_TIME and ETA_FROM_TIME from the index=etaprd events over onto the corresponding index=esbetalog events for each given CUSTOMER_ORDER_NUMBER
3) An extra search command after the eventstats will filter down to just the esbetalog events.

somesoni2 · ‎09-22-2016

For one, the name of the field is misspelled in your rename command CUSTOEMR_ORDER_NUMBER should be CUSTOMER_ORDER_NUMBER, causing your join to fail and not getting any data from etaprd index. Update that and try again.

Further, joins are expensive, so give this alternative a try

index=esbetalog source=PRD (LINE_OF_BUSINESS_CD="R" OR LINE_OF_BUSINESS_CD="C")  OR (index=etaprd source=PRD)
| rename *CUSTOMER_ORDER_NUMBER as CUSTOMER_ORDER_NUMBER1
| eval CUSTOMER_ORDER_NUMBER=coalesce(CUSTOMER_ORDER_NUMBER,CUSTOMER_ORDER_NUMBER1)
| stats values(ETA_FROM_TIME) as ETA_FROM_TIME, values(ETA_TO_TIME) as ETA_TO_TIME , values(ARRIVECUSTOMERSTAMP) as ARRIVECUSTOMERSTAMP values(LINE_OF_BUSINESS_CD) as LINE_OF_BUSINESS_CD dc(index) as index by CUSTOMER_ORDER_NUMBER | where index=2 | fields - index

Multiple index join with different formatted data JSON and RAW is not working

Enterprise Security Content Update (ESCU) | New Releases

Why am I not seeing the finding in Splunk Enterprise Security Analyst Queue?

Index This | What are the 12 Days of Splunk-mas?