
Is it necessary to order a DB query by the rising column?

Motivator

Is it necessary to include an ORDER BY $rising_column$ in my database tail query? This can be very expensive on a large database that is not indexed on that column (for example, using the row's modified_time rather than the indexed create_time column).

Or is Splunk/DBX smart enough to take the result set, find the maximum of that column, and save it until the next run?

1 Solution

Splunk Employee

Edit by OP: This is the exact answer I was looking for (yes, unfortunately, ORDER BY is required), but it was buried in the comments below:

The ? compared against $rising_column$ comes from the last record of the previous run, not the maximum value seen (which matters when new records are inserted, for example).

Therefore, if the ordering gets mixed up, the input may lose new records.

============================= original post ================================

Use the "Specify SQL query" option. For example:

SQL Query:
SELECT * FROM actor {{WHERE $rising_column$ > ?}} ORDER BY last_update

Tail input settings:
Rising Column:
last_update

The $rising_column$ here is exactly the same field as the ORDER BY field. Even though that is not strictly required, if the two orders differ it will mess up the data.

The idea of dbmon-tail is:

  1. The first run executes the query without the {{...}} part: "SELECT * FROM actor ORDER BY last_update"
  2. Subsequent runs execute the query with the {{...}} part, where $rising_column$ is replaced by the "Rising Column" setting: "SELECT * FROM actor WHERE last_update > ? ORDER BY last_update". The ? is set to the rising-column value of the last record from the previous run.
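The two-run behavior described above can be sketched in a few lines. This is a minimal simulation using Python's sqlite3 with the table and column names from the example; it is not DBX's actual implementation, just an illustration of the checkpoint mechanics.

```python
# Hypothetical simulation of dbmon-tail checkpointing (not DBX's real code).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (name TEXT, last_update INTEGER)")
conn.executemany("INSERT INTO actor VALUES (?, ?)",
                 [("alice", 1), ("bob", 2), ("carol", 3)])

def tail_run(conn, checkpoint):
    """One polling run: the first run has no WHERE clause, later runs filter."""
    if checkpoint is None:
        rows = conn.execute(
            "SELECT * FROM actor ORDER BY last_update").fetchall()
    else:
        rows = conn.execute(
            "SELECT * FROM actor WHERE last_update > ? "
            "ORDER BY last_update", (checkpoint,)).fetchall()
    # The persisted value is taken from the LAST row returned, not the
    # maximum -- which is exactly why the ORDER BY matters.
    new_checkpoint = rows[-1][1] if rows else checkpoint
    return rows, new_checkpoint

rows, ckpt = tail_run(conn, None)        # first run: all 3 rows, ckpt = 3
conn.execute("INSERT INTO actor VALUES ('dave', 4)")
rows, ckpt = tail_run(conn, ckpt)        # second run: only the new row
```

If the result set were unordered, the last row could carry a value larger than some unseen rows, and those rows would be skipped forever.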

The more useful reason to specify SQL is to add constraints, for example:

SELECT * FROM actor WHERE myfield="abc" {{AND $rising_column$ > ?}} ORDER BY last_update

so you can have extra constraints in the SQL.


Motivator

This may be helpful if you have a large number of records in the table you are monitoring.

This is a query I am testing that samples only the most recent 100,000 records using an ORDER BY ... DESC clause (similar to a Unix tail -100000). The second ORDER BY clause then puts them back into chronological order. The number of records can be adjusted depending on the volume of the feed.

SELECT * FROM (
    SELECT TOP 100000 [DateTime],[RouterCallKey],[ANI],[DigitsDialed],
           [Variable1],[DNIS],[Duration],[TalkTime]
    FROM Termination_Call_Detail
    ORDER BY [DateTime] DESC
) AS T1
ORDER BY [DateTime]
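The same "tail -N" pattern can be checked against sqlite3, which uses LIMIT instead of SQL Server's TOP. This sketch uses a simplified one-column stand-in for the table above:

```python
# Inner query grabs the newest 3 rows; the outer query restores
# chronological order so the tail input sees them oldest-first.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Termination_Call_Detail (DateTime INTEGER)")
conn.executemany("INSERT INTO Termination_Call_Detail VALUES (?)",
                 [(t,) for t in range(1, 11)])          # rows 1..10

rows = conn.execute("""
    SELECT * FROM (
        SELECT DateTime FROM Termination_Call_Detail
        ORDER BY DateTime DESC LIMIT 3
    ) AS T1
    ORDER BY DateTime
""").fetchall()
# rows -> [(8,), (9,), (10,)]
```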

Champion

It only monitors the rising field. If you have a field that records time and Splunk polls every hour, what is the point of running a query for the whole day? Just query the last hour's data.

The basic tail functionality is to pick up only the newer records rather than indexing everything.


Splunk Employee

DBX does filter on a timestamp field in the table if you don't specify one. The drawback is that if you have multiple timestamp fields, it will pick the first one it finds.
Using "Specify SQL" allows you to add constraints such as:

SQL Query: SELECT * FROM actor WHERE toyear(last_update) > 2013


SplunkTrust

NEVER convert the underlying data when you don't have to, whether in Splunk or in the underlying relational database. It's expensive and unnecessary.

DON'T DO THIS:

SQL Query: SELECT * FROM actor WHERE to_year(last_update) > 2013

DO THIS INSTEAD:

SQL Query: SELECT * FROM actor WHERE last_update > (whatever the literal format for the date "2013-12-31" is in your version of SQL)
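The reason this matters: a comparison against the raw column can use an index, while wrapping the column in a function forces a full scan. Here is a sketch against sqlite3 (using strftime as a stand-in for to_year; the table name is from the example above):

```python
# Compare query plans for a sargable vs. non-sargable predicate.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (name TEXT, last_update TEXT)")
conn.execute("CREATE INDEX idx_last_update ON actor (last_update)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

good = plan("SELECT * FROM actor WHERE last_update > '2013-12-31'")
bad  = plan("SELECT * FROM actor WHERE strftime('%Y', last_update) > '2013'")
# good reports an index SEARCH; bad reports a full table SCAN
```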

Champion

Hello,
I can confirm that if you monitor a DB and need the largest value from a column, you need to sort with ORDER BY. Records are stored in the order they were inserted; ORDER BY only changes how the data is presented to us. If you don't return the data in order, the tail input can never know which record is the latest, because it only looks at the last record returned.

If that value is greater than the previously checkpointed value, it will index all the records after it; otherwise the results are inconsistent. So it is better to order by a time field rather than a numeric/integer field that doesn't reflect the latest updates. A query against a large database should also be filtered to return only recent records rather than the whole table.

FYI: I tested with an integer column. The rising column always returns all the records if the newest record is greater than the last indexed record, and if it is smaller, nothing is indexed.

Thanks



Motivator

Definitely. Anyone who has a large database that cannot reasonably use an ORDER BY clause, because of the load it places on the database, should support this enhancement! If only databases could do | where ... | sort ... rather than | sort ... | where ... Ah, the flexibility of Splunk's search language.


Motivator

This sounds like a good idea for an enhancement request. Instead of thinking of a SQL table as a list of events ordered by time, think of it as a table of timestamped objects that get modified. Those objects could then be turned into events simply by making the rising column the modified date, without having to worry about what order Splunk receives them in, thus making event logs out of any ordinary application.

Splunk Employee

There is an "interval" field you can set if you want to slow the polling down.


Splunk Employee

The previous run of the SQL provides the last $rising_column$ value, which is persisted in DBX. The next run picks it up and executes the SQL with > ?. So $rising_column$ in fact has to be distinct, and it has to be "rising", i.e. ASC.
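Distinctness matters because the checkpoint comparison is strictly greater-than: a row that arrives later with the same rising value as the checkpoint is never fetched. A minimal sqlite3 sketch with made-up data:

```python
# Why ties on the rising column lose rows under a strict > comparison.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (name TEXT, last_update INTEGER)")
conn.execute("INSERT INTO actor VALUES ('alice', 5)")

# Checkpoint = rising-column value of the last row of the first run.
checkpoint = conn.execute(
    "SELECT last_update FROM actor ORDER BY last_update").fetchall()[-1][0]

# A late arrival shares the checkpointed value...
conn.execute("INSERT INTO actor VALUES ('bob', 5)")

# ...and the strict > comparison skips it on the next run.
missed = conn.execute(
    "SELECT * FROM actor WHERE last_update > ? ORDER BY last_update",
    (checkpoint,)).fetchall()
# missed -> []  (bob is never indexed)
```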


Splunk Employee

Furthermore, Splunk events are ordered by their event timestamp; the order in which they are fed in doesn't matter. All events carry their timestamp, and that is the order of events.
The only case where import order matters is when you don't configure a timestamp in dbmon and the retrieved fields don't contain any timestamp field. Then DBX uses the import time as the timestamp for each event, so you may see the events in a different order than you expected.


Splunk Employee

The first SQL is SELECT * FROM actor ORDER BY last_update ASC.
The continuous SQL is SELECT * FROM actor WHERE last_update > ? ORDER BY last_update ASC, where the ? comes from the last record of the previous run (which matters when new records are inserted, for example).

Therefore, if the ordering gets mixed up, it may lose new records.

The more useful cases are actually for constraints, for example:

SELECT * FROM actor WHERE myfield="abc" {{AND $rising_column$ > ?}} ORDER BY last_update

so you can have extra constraints in the SQL.

Motivator

So the ? is the last record from the previous run and not the biggest value from the previous run? Looks like a bug to me.


Champion

The functionality says it should, but how long will it wait? The thing is, if you have a timestamp field in the table, why not filter on it in the query itself? Why make Splunk do this expensive operation? The DBX app should always be used cleverly rather than dumping everything in; that is a misuse of network and processing resources at both ends.
