Edit: Maybe this | stats latest(Id) as Id by SystemModstamp is the solution. The records showing up now are ones where my flag field is null. It's possible I'm hitting a race condition between Splunk and the ORM engine in Salesforce. The "life" of these records is basically:
1. Written into me by an external system with no status / tag
2. Salesforce Flow picks up the record and immediately status tags it as "staged"
3. Some magic happens in another system internal to SFDC to process it.
4. Record is status tagged as "processed"
I'm now seeing some that Splunk must have picked up the second they hit me from the external system, before the Flow could put an initial status on them, because that field is null.
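One way I might sanity-check that theory (just a sketch, not my actual search; state stands in for my real flag field, and the base search is elided):
| stats values(state) as observed_states latest(SystemModstamp) as last_mod by Id
| where isnull(observed_states)
Since values() ignores nulls, any Id that comes back with no observed_states was only ever captured before the Flow could stamp it.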
It's a custom object used for data staging for a secondary process, so I'm not sure exactly what to share. Even in a basic search, this object, by its nature, throws duplicates.
Once the record is created in Salesforce, it sits in a "dirty" staging state until some behind-the-scenes code picks it up, does some work through another system, and then updates a status field on the original staging record. I only want records that stay in the "dirty" staged state and never get updated, but Splunk picks up the same record in both the "dirty" staged state and the "clean" processed state; only the state and SystemModstamp change between the two.
I tried changing the query and am still getting odd results, so either that's not the ultimate fix or I'm still doing something wrong. The rest of my search builds other tables and stats off counts by Id, so what I'm trying to express is: "Give me all records where state = 'dirty' and this is the latest version of that record by SystemModstamp."
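Something like this might express that (a sketch; state is a placeholder for my actual status field, and it assumes _time tracks SystemModstamp):
| stats latest(SystemModstamp) as SystemModstamp latest(state) as state by Id
| where state="dirty"
i.e. collapse to the most recent version of each Id, then keep only those whose latest state is still "dirty".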
I tried replacing my dedup Id with | stats latest(Id) as Id by SystemModstamp. It looks like I'm no longer seeing my duplicated "dirty" records, which is good, but now I'm seeing other records I didn't expect. Still going through those.
dedup on two fields will remove all but one of the events sharing the same combination of both fields, e.g.
| makeresults count=1000
| fields - _time
| eval Id=random() % 3,SystemModStamp=random() % 3
| eventstats count by Id,SystemModStamp
| dedup Id, SystemModStamp
| sort Id, SystemModStamp
| addcoltotals count
Since random() % 3 only yields 0, 1, or 2 for each field, there are 3 x 3 = 9 possible combinations, so you will always get 9 events from the 1000 events.
However, in your data, what does the _time field of those events represent?
Generally, dedup can be done more predictably using stats with some kind of latest/max aggregation. If your event _time is the one you want, then
| stats latest(*) as * by Id
may do the job.
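As a self-contained illustration of that pattern against your dirty/processed scenario (all field values here are made up):
| makeresults count=3
| streamstats count as n
| eval _time=_time+n
| eval Id=if(n<=2,"A","B"), state=if(n=2,"processed","dirty")
| stats latest(*) as * by Id
| where state="dirty"
Id A is seen dirty and then processed, so its latest version is filtered out; Id B only ever appears dirty, so it survives.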
dedup simply keeps the first event it sees and removes subsequent duplicates, so it depends on event order, which, unless you sort, may not be predictable.
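If you do want to stick with dedup, sorting first makes it deterministic, e.g.
| sort 0 -_time
| dedup Id
so the first event dedup sees for each Id is the most recent one (sort 0 removes the output limit).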
Can you share your search to show exactly what you are trying to do and what data you want following the dedup?