All Apps and Add-ons

Splunk DB Connect 2: DB Input does not collect all rows for new data source

Path Finder

I am using DB Connect 2.3.1. I am successfully importing data from a number of MS SQL databases already, and want to add a new database input.

The data in this new database goes back to early 2008. Unfortunately the primary key is not a bigint; it is a guid (char), so instead of using that as the rising column I have used a datetime column. When I attempted to import all data starting from checkpoint value 2008-01-01 it brought back a load of rows (~55k), running through to the current date, but definitely not as many as there are in the database (~880k).

Other settings for the input were:

  • Max Rows to Retrieve = 10,000
  • Fetch Size = left blank, apparently defaults to 300
  • Timestamp Column is the same datetime column used as the Rising Column
  • Execution Frequency = 120 seconds

Can anyone explain why I don't have as much data as expected, yet the checkpoint value is up-to-date?

Query

Here is an anonymized version of my query, entered in Advanced mode. For operational reasons I need to "buffer" the input by 6 hours to ensure events have been written to the database from distributed sources:

SELECT DateTime, ColA, ColB, ColC, ColD
FROM Table1
WHERE DateTime < DATEADD(hh, -6, GETDATE())
  AND DateTime > CAST(? AS datetime)
0 Karma
1 Solution

Path Finder

I think I have worked out why this is happening, in theory at least:

Because the primary key is a string Guid, the default order in which items are returned from the database are ordered by that column. Effectively this means "in random order" compared to the chronological progression I was hoping for.

Therefore I was getting 10,000 "random" rows from the database in each query and the checkpoint value was being moved along each time. I cannot be sure whether the checkpoint is set to the maximum value found in the Rising Column, or simply to the value in the last row retrieved; I suspect the latter else I'd probably not see more than the first 10,000 rows retrieved.

Either way, the DateTime field was jumping forward each time the query ran; despite the random nature, it can't go backward because the results are restricted by the WHERE clause to be more recent than the last checkpoint. After a few iterations the checkpoint was up-to-date but far fewer rows were returned in total than should have been collected.

Solution

I updated the query to included an ORDER BY clause to ensure that the data is in chronological order. Having deleted the results of the first run, the index seems to be filling up with the correct number of events per month compared to the database.

View solution in original post

0 Karma

Splunk Employee
Splunk Employee

there are a couple of things to consider
- having a rising column will actually add "order by" clause to your query, so not sure why you were experiencing this
- for catching up on historical data, you can do a batch input to ingest everything and then create a new rising column input to stay up to date
- timestamps are not recommended for rising columns: https://answers.splunk.com/answers/400221/why-is-using-a-timestamp-column-for-the-rising-col.html

0 Karma

Path Finder

I think I have worked out why this is happening, in theory at least:

Because the primary key is a string Guid, the default order in which items are returned from the database are ordered by that column. Effectively this means "in random order" compared to the chronological progression I was hoping for.

Therefore I was getting 10,000 "random" rows from the database in each query and the checkpoint value was being moved along each time. I cannot be sure whether the checkpoint is set to the maximum value found in the Rising Column, or simply to the value in the last row retrieved; I suspect the latter else I'd probably not see more than the first 10,000 rows retrieved.

Either way, the DateTime field was jumping forward each time the query ran; despite the random nature, it can't go backward because the results are restricted by the WHERE clause to be more recent than the last checkpoint. After a few iterations the checkpoint was up-to-date but far fewer rows were returned in total than should have been collected.

Solution

I updated the query to included an ORDER BY clause to ensure that the data is in chronological order. Having deleted the results of the first run, the index seems to be filling up with the correct number of events per month compared to the database.

View solution in original post

0 Karma

Path Finder

I guess one possible solution to this is to set Max Rows to Retrieve to be larger than the number of rows in the database, to see if that ensures that everything is ingested in one go. I don't really like this as a solution as it doesn't really help explain why the problem happened.

Update on this answer

Following on from my other answer to this problem, I wouldn't trust this answer to do what I wanted. It may well pull back all the data in one go, but since the query results aren't ordered chronologically then there is no saying what the checkpoint value would update to. It may be set to a very old date so when the query next runs it would then duplicate data. Eventually the checkpoint would become up-to-date but maybe only after it has ingested FAR more data than needed (and blown the license, maybe).

0 Karma