@Dennis Dedup is often used where stats could be used instead, and I would normally suggest stats as the tool to solve de-duplication, unless you specifically need to dedup and retain the events, and event order is important in the de-duplication. I wouldn't say dedup is useless for large data sets, but anything that works on _raw will be inefficient.

A note of caution on transaction. Memory constraints are controlled through limits.conf (see the [transactions] stanza), so you will not see massive amounts of memory being consumed, but what will happen is that evictions occur and, as a result, your transactions MAY not reflect reality. I believe the default for maxopentxn (max open transactions) is 5,000 and maxopenevents is 100,000. Over a 24-hour period, your 13 billion events work out to roughly 150,000 events per second, so the chances are reasonably high that you will have more than 5,000 transactions open at once. In that case it is quite possible that you will miss a duplicate because an open transaction was evicted to make way for a new one.

Unless you can prove that the numbers you have reached are in fact correct, which at the size of your data may be challenging, I would treat them with an element of scepticism. I would suggest running a shorter time range search with the sha() technique and with the transaction command, and checking that they come up with the same results. You should be able to get a good feeling of trust by running each of the searches over a 5-10 minute window, as sketched below.
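
To make the stats-based approach concrete, here is a minimal sketch. The index name is a placeholder, and I'm assuming a "duplicate" means an identical _raw; if your definition of a duplicate is only a subset of fields, hash those fields instead:

    index=your_index earliest=-10m@m latest=@m
    | eval raw_hash=sha256(_raw)
    | stats count AS copies, min(_time) AS first_seen BY raw_hash
    | where copies > 1

Each row is one group of identical events, copies tells you how many times it appeared and first_seen when the first copy arrived. The important point is that stats scales with the number of distinct hashes and doesn't care about event order or open "transactions", so there is nothing to evict.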
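
For reference, the eviction behaviour is governed by the [transactions] stanza in limits.conf on the search head. This is only a sketch; the values shown are the defaults as I recall them, so check the limits.conf spec for your Splunk version before changing anything:

    # $SPLUNK_HOME/etc/system/local/limits.conf
    [transactions]
    # max transactions that can be open at once; when exceeded,
    # open transactions are evicted (flushed early)
    maxopentxn = 5000
    # max total events held across all open transactions
    maxopenevents = 100000

Raising these buys you headroom, but at 150,000 events per second you are still likely to hit the ceiling, which is why I'd lean on stats for this volume.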
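
For the sanity check, pin both searches to exactly the same short window and compare the totals. I don't know your exact transaction search, so this sketch only shows the hash/stats side; run your transaction version with the identical earliest/latest and the duplicate counts should line up:

    index=your_index earliest=-10m@m latest=@m
    | eval raw_hash=sha256(_raw)
    | stats count AS copies BY raw_hash
    | where copies > 1
    | stats count AS duplicate_groups, sum(copies) AS duplicated_events

If the transaction run over the same window reports fewer duplicates, eviction is the most likely explanation, and that tells you what to expect when you scale it up to 13 billion events.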