Edit: Maybe this | stats latest(Id) as Id by SystemModstamp is the solution. The records showing up now are ones where my flag field is null. It's possible I'm hitting a race condition between Splunk and the ORM engine in Salesforce. The "life" of these records is basically:
1. Written into me by an external system with no status / tag
2. Salesforce Flow picks up the record and immediately status tags it as "staged"
3. Some magic happens in another system internal to SFDC to process it.
4. Record is status tagged as "processed"
I'm now seeing some that Splunk must have picked up the second they hit me from the external system, before the Flow could put an initial status on them, because that field is null.
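One way I might sanity-check that theory (just a sketch, not my actual search; state stands in for my real flag field, and the base search is elided):
| stats values(state) as observed_states latest(SystemModstamp) as last_mod by Id
| where isnull(observed_states)
Since values() ignores nulls, any Id that comes back with no observed_states was only ever captured before the Flow could stamp it.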
It's a custom object used for data staging for a secondary process, so I'm not sure exactly what to share. Even in a basic search, this object, by its nature, throws duplicates.
Once the record is created in Salesforce, it sits in a "dirty" staging state until some behind-the-scenes code picks it up, does some work through another system, and then updates a status field on the original staging record. I only want records that stay in the "dirty" staged state and never get updated, but Splunk picks up the same record in both the "dirty" staged state and the "clean" processed state; only the state and SystemModstamp change between the two.
I tried changing the query and am still getting odd results, so either that's not the ultimate fix or I'm still doing something wrong. The rest of my search builds other tables and stats off counts by Id, so what I'm trying to express is: "Give me all records where state = 'dirty' and this is the latest version of that record by SystemModstamp."
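Something like this might express that (a sketch; state is a placeholder for my actual status field, and it assumes _time tracks SystemModstamp):
| stats latest(SystemModstamp) as SystemModstamp latest(state) as state by Id
| where state="dirty"
i.e. collapse to the most recent version of each Id, then keep only those whose latest state is still "dirty".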
I tried replacing my dedup Id with | stats latest(Id) as Id by SystemModstamp. It looks like I'm no longer seeing my duplicated "dirty" records, which is good, but now I'm seeing other records I didn't expect. Still going through those.
dedup on two fields will remove all but one of the events sharing the same combination of both fields, e.g.
| makeresults count=1000
| fields - _time
| eval Id=random() % 3,SystemModStamp=random() % 3
| eventstats count by Id,SystemModStamp
| dedup Id, SystemModStamp
| sort Id, SystemModStamp
| addcoltotals count
Since random() % 3 only yields 0, 1, or 2 for each field, there are 3 x 3 = 9 possible combinations, so you will always get 9 events from the 1000 events.
However, in your data, what does the _time field of those events represent?
Generally, dedup can be done more predictably using stats with some kind of latest/max aggregation. If your event _time is the one you want, then
| stats latest(*) as * by Id
may do the job.
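As a self-contained illustration of that pattern against your dirty/processed scenario (all field values here are made up):
| makeresults count=3
| streamstats count as n
| eval _time=_time+n
| eval Id=if(n<=2,"A","B"), state=if(n=2,"processed","dirty")
| stats latest(*) as * by Id
| where state="dirty"
Id A is seen dirty and then processed, so its latest version is filtered out; Id B only ever appears dirty, so it survives.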
dedup simply keeps the first event it sees and removes subsequent duplicates, so it depends on event order, which, unless you sort, may not be predictable.
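If you do want to stick with dedup, sorting first makes it deterministic, e.g.
| sort 0 -_time
| dedup Id
so the first event dedup sees for each Id is the most recent one (sort 0 removes the output limit).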
Can you share your search to show exactly what you are trying to do and what data you want following the dedup?