We are using many saved searches to perform daily detection queries over huge datasets. The anatomy of our queries is always the same: we have transactions and events related to many IDs (for example, bank account movements, where the ID would be the bank account number). Our search logic wants to find "yesterday's bank accounts that have already performed N or more transactions in the past". This would be an example of our SPL query:
index=transactions [ search index=transactions earliest=-1d@d latest=-0d@d | fields accountID | dedup accountID | table accountID] | ... rest of the query
As you can see, I'm using a first-pipe subsearch filter to speed up queries and only search the past for yesterday's IDs...
This is fine until you have scenarios with >10,000 accountIDs with movements daily. In that case, I've tried to work around the subsearch limits by using the format command:
index=transactions [ search index=transactions earliest=-1d@d latest=-0d@d | fields accountID | dedup accountID | table accountID | format maxresults=100000] | ... rest of the query
But this still seems to be overkill. The limits are no longer hit thanks to the format command, but the subsearch takes far too long and effectively dies. Also, I'm not sure whether exceeding the default limits is good practice...
Then I started thinking about options WITHOUT subsearches, like:
index=transactions accountID=* | eval isFromYesterday = if( _time>=relative_time(now(),"-1d@d") AND _time<relative_time(now(),"-0d@d"), 1, 0) | ...rest of the query
But this is still overkill, as it will scan the whole index for all accountIDs when there is no need to do so...
Can you share your advice on this issue?
Thank you in advance!!
@alvaromari83, since your subsearch
search index=transactions earliest=-1d@d latest=-0d@d | fields accountID | dedup accountID | table accountID
needs to run only once per day, can you save it as a daily scheduled search that runs shortly after midnight, and push the accountIDs to a lookup file using the outputlookup command?
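A minimal sketch of such a scheduled search (scheduled daily, shortly after midnight; the lookup file name here is just illustrative):

index=transactions earliest=-1d@d latest=-0d@d | dedup accountID | table accountID | outputlookup dailyaccountIDs.csv

Since outputlookup overwrites the lookup file by default, each morning the file holds only yesterday's accountIDs.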
This way you would only need to run the main search against your index, and the subsearch in the base query changes to an inputlookup command.
Hello! This would then be something like:
index=transactions [ | inputlookup dailyaccountIDs.csv | format maxresults=100000 ] | ... rest of the query
This approach would certainly save some time in the subsearch filter, but it still has the problem of using a subsearch and hitting the default 10,000-result limit. So, the format command with a much higher limit would still be required... and then, parsing the formatted output in the outer search is much slower.
I've made some tests with a CSV of 10,000 accountIDs, and this:
index=transactions [ | inputlookup dailyaccountIDs.csv | format maxresults=100000 ] | stats dc(accountID)
takes 55 minutes(!), while this full-scan approach:
index=transactions | lookup dailyaccountIDs.csv accountID OUTPUT isdaily | search isdaily=1 | stats dc(accountID)
takes 75 seconds, which is the same time a raw full scan takes:
index=transactions | stats dc(accountID)
So, not a big improvement...
To me, this just calls for the use of Summary Indexing instead of trying to do it via other means:
Have you looked into this, or attempted an approach that uses it?
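A rough sketch of what that could look like (the summary index name and the threshold N are illustrative assumptions, not your actual configuration): a daily scheduled search collects per-account counts into a summary index,

index=transactions earliest=-1d@d latest=-0d@d | stats count AS daily_txn by accountID | collect index=summary_transactions

and the detection query then runs against the much smaller summary index instead of the raw events:

index=summary_transactions | stats sum(daily_txn) AS total_txn by accountID | where total_txn >= N

Since the summary index holds only one row per account per day, the historical "N or more transactions" check becomes cheap and avoids both the subsearch limits and the full index scan.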