Problems with dedup dropping too many records?

arist0telis
Explorer
I am having trouble deduping a Salesforce object, and my "feels like" here is that dedup isn't doing what I understand it to do.

Short version: I have an object where records get updated by the system after insert, and I only want the latest version of each record in my search. Since it's the system updating and not a user, LastModifiedDate doesn't get changed, only SystemModStamp, and sometimes only by a fraction of a second (but there is a noticeable difference in the SystemModStamp in search results where I have duplicated Ids).

If I do | dedup Id, it's not pulling the newest SystemModStamp, it's pulling the first one in the index. If I do | dedup Id,SystemModStamp, I'm getting no records. Like, we're not just dropping the duplicates, we're dropping everything. I'm guessing that's because dedup by two fields isn't the composite key on a single record I thought it was? Is it dropping any records with duplicated SystemModStamps?

What I'm looking for is a composite-key way to dedup a single record by both Id and SystemModStamp.

arist0telis
Explorer

Edit: Maybe | stats latest(Id) as Id by SystemModstamp is the solution. The records showing up now are ones where my flag field is null. It's possible I'm hitting a race condition between Splunk and the ORM engine in Salesforce. The "life" of these records is basically:

1. Written into me by an external system with no status / tag
2. Salesforce Flow picks up the record and immediately status tags it as "staged"
3. Some magic happens in another system internal to SFDC to process it.
4. Record is status tagged as "processed"

I'm now seeing some that Splunk must have picked up the second they hit me from the external system, before the Flow could put an initial status on them, because that field is null.


arist0telis
Explorer

It's a custom object used for data staging for a secondary process, so I'm not sure exactly what to share. Even in a basic search this object, by its nature, throws duplicates.

Once the record is created in Salesforce, it sits in a "dirty" staging state until some behind-the-scenes code picks it up, does some work through another system, and then updates a status field on the original staging record. I only want records that stay in the "dirty" staged state and never get updated, but Splunk manages to pick up the same record in both the "dirty" staged state and the "clean" processed state, with only the state and SystemModstamp changed.

I tried changing the query and am still getting odd results, so either that's not the ultimate fix or I'm still doing something wrong. The rest of my search bases its tabling and stats on counts by Id, so what I'm trying to do is "Give me all records where state = 'dirty' and this is the latest version of that record by SystemModstamp."

I tried replacing my dedup Id with | stats latest(Id) as Id by SystemModstamp. Looks like I'm not seeing my duplicated "dirty" records which is good but now I'm seeing other records I didn't expect to see. Still going through those.
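
One possible reason for the unexpected records: grouping by SystemModstamp returns one row per distinct timestamp rather than one row per Id. Every version of a record that has its own timestamp survives, and two different records that share a timestamp collapse into a single row. A minimal sketch with made-up data (the Id values a1 and b2 and the numeric timestamps are placeholders, not real Salesforce values):

```spl
| makeresults
| eval row=split("a1:10,a1:20,b2:10", ",")
| mvexpand row
| eval Id=mvindex(split(row, ":"), 0), SystemModstamp=mvindex(split(row, ":"), 1)
| stats latest(Id) as Id by SystemModstamp
```

This returns two rows, one for each timestamp. The row for timestamp 20 carries a1, and the row for timestamp 10 carries only one of a1 or b2, so a1 can still appear twice while b2 may disappear entirely.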


bowesmana
SplunkTrust

dedup on two fields will remove all events bar one that have the same combination of both fields, i.e. it keeps one event per unique pair of values, e.g.

| makeresults count=1000
| fields - _time
| eval Id=random() % 3,SystemModStamp=random() % 3
| eventstats count by Id,SystemModStamp
| dedup Id, SystemModStamp
| sort Id, SystemModStamp
| addcoltotals count

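You will get 9 events from the 1000 (one for each combination of the three possible Id values and the three possible SystemModStamp values), plus the totals row that addcoltotals appends.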
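
The same behaviour can be shown with fixed rather than random data. This is only a sketch: the Id values and timestamps are placeholders, and the split/mvexpand steps are just a way to fabricate four test rows.

```spl
| makeresults
| fields - _time
| eval row=split("a1:10,a1:10,a1:20,b2:10", ",")
| mvexpand row
| eval Id=mvindex(split(row, ":"), 0), SystemModStamp=mvindex(split(row, ":"), 1)
| fields - row
| dedup Id, SystemModStamp
```

Three rows remain: a1/10, a1/20 and b2/10. Only the exact repeat of a1/10 is removed, so two versions of the same Id with different timestamps both survive. Separately, dedup discards events in which any of the listed fields is missing (keepempty defaults to false), and field names are case-sensitive. The thread uses both SystemModStamp and SystemModstamp, so a name that does not match the indexed field exactly would produce no results at all. That is a guess at the cause, not something confirmed by the data shown here.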

However, in your data, what does the _time field of those events represent?

Generally, dedup can be done more predictably using stats with some kind of latest/max aggregation. If your event _time is the one you want, then

| stats latest(*) as * by Id

may do the job.
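
If _time is not the timestamp you care about, a sketch along the same lines can rank on SystemModstamp directly. It assumes the field arrives as a string such as 2024-01-01T12:00:00.000+0000 (adjust the strptime format to match your data) and that Status__c is the name of the status field, which is a hypothetical name used only for illustration:

```spl
| eval modstamp_epoch=strptime(SystemModstamp, "%Y-%m-%dT%H:%M:%S.%3N%z")
| eventstats max(modstamp_epoch) as latest_modstamp by Id
| where modstamp_epoch=latest_modstamp
| where Status__c="dirty"
```

The status filter comes last on purpose: filtering before picking the latest version would keep the stale "dirty" copy of a record that has since been processed. If the strptime format does not match, modstamp_epoch is null and the event is dropped by the first where, so an empty result usually points at the format string.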

dedup simply keeps the first event it sees for each value and removes subsequent duplicates, so the result depends on event order, which, unless you sort, may not be predictable.

Can you share a search that shows exactly what you are trying to do and what data you want following the dedup?
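
If you would rather stay with dedup, one option is to make the order explicit first. This sketch assumes every SystemModstamp value uses the same fixed-width ISO 8601 format and offset, so that string order matches time order:

```spl
| sort 0 -str(SystemModstamp)
| dedup Id
```

The 0 removes sort's default result limit, and the descending sort puts the newest version of each Id first, which is the one dedup then keeps.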

 
