Using DBX v3.1.1, I'm seeing many cases of duplicate data being indexed, despite the data coming from a single source (a remote HF located in a different country). I verified that the database only contains a single entry for the rising column checkpoint value as well. The duplication does not happen all the time (~20% of the time), but I am seeing upwards of 110 identical logs in Splunk. I'm not sure what is going on and could use any recommendations or help. I've checked the _internal dbx* logs as well, to no avail.
We continued to see this issue sporadically for a long while, but I had a breakthrough last week while trying to add a new input.
While using ORDER BY ... DESC to find the latest checkpoint value for a query I was troubleshooting, I accidentally kept that ordering when I accepted and created the input. Verifying the new input, I noticed a lot of event duplication and an equal amount of data indexed each polling interval. It seems that if DESC is set in the SQL query for a rising column input, the input is treated similarly to a batch input: each time the polling interval fires, it advances the checkpoint to the newest value and backfills until the maximum number of rows per polling interval is reached. Just a heads up: DESC is an easy way to find the most recent events, but not good if you save the input that way.
tldr
** always use ORDER BY ASC for rising column inputs **
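To make the difference concrete, here is a sketch using the table and rising column that come up later in this thread (table and column names are from that example, not a recommendation for your schema):

```sql
-- Correct for a rising column input: DBX substitutes the stored
-- checkpoint value for the ? placeholder, so only rows newer than
-- the checkpoint are returned, oldest first, and the checkpoint
-- advances one batch at a time.
SELECT *
FROM "customer"."dbo"."tlog"
WHERE Id > ?
ORDER BY Id ASC

-- Problematic: with DESC the newest row comes first, so the
-- checkpoint jumps straight to the maximum Id and each poll
-- re-reads up to the row limit of already-indexed rows,
-- producing duplicates.
-- ORDER BY Id DESC
```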
Do you have indexer acknowledgement turned on?
If this is enabled, the forwarder (DBX) will wait until it has received an ACK from the indexer before moving on to the next event.
If you are on a lossy network (as could be the case, since you mention this is far afield) and the ACK gets lost or delayed, the forwarder will retransmit any events for which it did not receive the ACK.
In such cases this will result in duplication.
Indexer acknowledgement is a tradeoff between duplicates and dropped events.
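For reference, this is controlled in outputs.conf on the forwarder; a minimal sketch, assuming a standard tcpout group (the stanza name and server list are illustrative):

```
# outputs.conf on the heavy forwarder (stanza name is an example)
[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Wait for the indexer to acknowledge events before discarding them;
# a lost/delayed ACK triggers retransmission, which can duplicate
# events but protects against silent data loss.
useACK = true
```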
Thanks for the response. useACK wasn't enabled, but I tried enabling it, since some of our other inputs at another site use it and work just fine. I'll keep my fingers crossed.
I've run my dataset back to when I made the change, and the duplication seems to have cleared up. Thanks for the help!
Over, say, 4 hours:
index=platform host=remote-SQL*.local source=remote-SQL*
| stats count AS duplicates by checkpointValue, index, host, source, sourcetype
| where duplicates > 1
| stats sum(duplicates) BY index, host, source, sourcetype
returns something like:
platform remote-SQL1-SQL1.local remote.tlogs 50
platform remote-SQL1-SQL2.local remote.logins 1200
platform remote-SQL1-SQL10.local remote.registrations 5
The checkpointValue is a field alias I created to capture the different rising column checkpoint field names, which differ between the db inputs. This shouldn't be an issue, since the values won't be duplicated across the different inputs when I run the first stats command against them.
We have multiple HFs at each of our offices, but I verified that the db_inputs are not duplicated between any of these instances, and neither are the db_connections (checked in both the GUI and the .conf files). There is latency on the connection at the remote site, since that country doesn't have the best internet, but from the _internal logs (if I'm reading them correctly) that shouldn't cause an issue, since Splunk knows not to duplicate events if the index, source, sourcetype, and _raw are the same.
Thanks for the help!
To properly capture duplicates do this:
index=platform host=remote-SQL*.local source=remote-SQL*
| stats count by _raw
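To surface only the repeated events, a variant of that search (same filters, just adding a threshold on the count):

```
index=platform host=remote-SQL*.local source=remote-SQL*
| stats count by _raw
| where count > 1
```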
If checkpointValue represents different fields, then those events are not actually duplicates from Splunk's point of view.
Also, if you have the same query running on multiple heavy forwarders, you will almost certainly get duplicates. Your comment "Splunk knows to not duplicate events if the index, source, sourcetype and _raw is the same" is not correct - the indexer is completely agnostic to duplicates; it will index whatever is sent to it. For DB Connect, the only way to prevent duplicates is the rising column value, and that value is unique to each db_input. I.e., if on one heavy forwarder you have two db_inputs looking at the same table, they will not share checkpoint information between them.
So I think overall it was due to useACK not being on. I didn't know you could run a duplication check based on the _raw data - this will definitely be helpful for some of my future searches.
The problem I was seeing was that a large number of db input events were being indexed with the same checkpoint value. Running the check against _raw, I saw the same amount of duplication as without it. I couldn't figure out why I was seeing 100+ events coming in with the exact same data and index time.
Thanks for the help, though! I had spent quite a bit of time trying to figure this out. It's always nice to have some new eyes and a better brain taking a look!
Can you post your query?
I mean your sql query.
Also, you mentioned you have multiple inputs? Bear in mind the checkpoint value is unique to each input. If you have multiple db inputs running against the same data set, the checkpoints are NOT shared between them.
The queries are all in a similar format (with the exception of which fields are selected).
SELECT *
FROM "customer"."dbo"."tlog"
WHERE Id > ?
ORDER BY Id ASC
Rising column on Id, with the timestamp column set to the rising column itself.
Each query has a different rising column, such as LogId, exceptionId, and the like. I made the checkpointValue alias so I could run stats against them all where they had duplicate checkpoint values.
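For anyone wanting to replicate the alias, it can be defined in props.conf with Splunk's FIELDALIAS setting; a sketch, where the sourcetype names match the examples above but the exact column names per sourcetype are assumptions:

```
# props.conf on the search head (sourcetypes from the search results above;
# the column-to-sourcetype mapping here is illustrative)
[remote.tlogs]
FIELDALIAS-checkpoint = Id AS checkpointValue

[remote.logins]
FIELDALIAS-checkpoint = LogId AS checkpointValue
```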