Splunk Search

How to leverage a secondary time field at scale?

woodcock
Esteemed Legend

I have a dataset that has both FINISHDATE and CHANGEDATE fields in text.  We use strptime(CHANGEDATE) for _time, but we often need to filter only on FINISHDATE, which means that we have to run an All Time search followed by | where strptime(FINISHDATE) ..., which is super inefficient.  I am thinking that maybe an Accelerated Data Model with FINISHDATE set up as a number could help, but I am not sure.  If I create this data model, how can I use it to bring back raw events quickly, filtered by the FINISHDATE field?
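In other words, today the search has to look something like this (the index name and time format here are placeholders):

```spl
index=changes earliest=0
| where strptime(FINISHDATE, "%Y-%m-%d") > 1607299625
```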

I created a calculated field like this:

FINISHDATE_ = strptime(FINISHDATE, ...)
Then I created a data model named DOUBLE_DATE like this:

Extracted field: FINISHDATE_ = Number
Extracted field: ID = String
Now I have several options:

I can use | from like this:

| from datamodel DOUBLE_DATE
| search FINISHDATE_ > 1607299625
 

I can use | datamodel like this:

| datamodel DOUBLE_DATE flat
| search FINISHDATE_ > 1607299625

Or I can use | tstats like this:

| tstats count
FROM datamodel=DOUBLE_DATE
WHERE DOUBLE_DATE.FINISHDATE_ > 1607299625
BY DOUBLE_DATE.ID
| rename DOUBLE_DATE.* AS *

The problem is that I am not sure that any of these does what I need it to do, which is to leverage the secondary time field DOUBLE_DATE.FINISHDATE_ in the foundational search (before the first pipe) so that it will BOTH return the full raw events AND be very fast.  The only way that I can see to do that is to use the last one above as a subsearch to a regular search, but that would be clunky and have its own problems.  Is there some new schema-acceleration option that I am missing?

Why don't I just use INGEST_EVAL and create an index-time field?  Because we have already indexed millions of events over 10 years and I am trying to avoid having to reindex them.  But if I did this, and I set up fields.conf so that Splunk knows that it is an index-time field, then I would be able to use AND FINISHDATE_ > 1607299625 directly in the base search.
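For the record, that index-time approach would look something like this (the stanza names, the sourcetype, and the assumption that FINISHDATE=YYYY-MM-DD appears literally in _raw are all illustrative):

```ini
# transforms.conf -- INGEST_EVAL runs at index time against _raw
[add_finishdate_epoch]
INGEST_EVAL = FINISHDATE_=strptime(replace(_raw, ".*FINISHDATE=(\d{4}-\d{2}-\d{2}).*", "\1"), "%Y-%m-%d")

# props.conf
[my_change_sourcetype]
TRANSFORMS-finishdate = add_finishdate_epoch

# fields.conf -- tells the search head FINISHDATE_ is an index-time field
[FINISHDATE_]
INDEXED = true
```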

Why don't I just test more?  I have to wait for the admin to accelerate my data model, and that may be a while.  I suspect that the | datamodel .. | search might trigger a schema-accelerated search optimization and "just work".  I will post an update after testing.


tscroggins
Builder

@woodcock 

If your date format is fixed, e.g. %Y-%m-%d, you could use terms to limit the initial result set. If the date is in _raw with a field name and only minor breakers, e.g. FINISHDATE=%Y-%m-%d, you can further reduce the result set.

E.g.: with %Y-%m-%d, suppose the cutoff is FINISHDATE greater than 2020-07-06; your terms would include (assuming latest=now as of this message):

(TERM(2020-07-07) OR TERM(2020-07-08) OR TERM(2020-07-09) OR TERM(2020-07-1*) OR TERM(2020-07-2*) OR TERM(2020-07-3*) OR TERM(2020-08-*) OR TERM(2020-09-*) OR TERM(2020-1*-*) OR (TERM(20*-*-*) NOT TERM(2020-*-*)))

With FINISHDATE=%Y-%m-%d:

(TERM(FINISHDATE=2020-07-07) OR TERM(FINISHDATE=2020-07-08) OR TERM(FINISHDATE=2020-07-09) OR TERM(FINISHDATE=2020-07-1*) OR TERM(FINISHDATE=2020-07-2*) OR TERM(FINISHDATE=2020-07-3*) OR TERM(FINISHDATE=2020-08-*) OR TERM(FINISHDATE=2020-09-*) OR TERM(FINISHDATE=2020-1*-*) OR (TERM(FINISHDATE=20*-*-*) NOT TERM(FINISHDATE=2020-*-*)))

The first set of terms will return more events than you need, but you'll have reduced the number of events piped through a subsequent where or search command.

The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter.

A similar set of reduced terms can be constructed for time values in e.g. %H:%M:%S format.

The only reference to "schema-accelerated search optimization" I can find in documentation is "improved data model drilldown" in the release notes for Splunk Enterprise 7.1. I believe this is a reference to data model search modes: search, flat, and acceleration_search.

As a test, I created an eval field:

EVAL-FINISHDATE_EPOCH = strptime(FINISHDATE, "%Y-%m-%d")

I then created an accelerated data model named change_with_finishdate with a root data set also named change_with_finishdate and a single auto-extracted field: FINISHDATE_EPOCH.

Searching the data model:

| datamodel summariesonly=t change_with_finishdate change_with_finishdate search
| search change_with_finishdate.FINISHDATE_EPOCH>1607299625

Or to strip data set prefixes from field names:

| datamodel summariesonly=t change_with_finishdate change_with_finishdate flat
| search FINISHDATE_EPOCH>1607299625

Splunk adds directives to the search:

| search (FINISHDATE_EPOCH=* index=main sourcetype=change_with_finishdate FINISHDATE_EPOCH>1607299625 (index=* OR index=_*)) DIRECTIVES(READ_SUMMARY(allow_old_summaries="false" dmid="CB3E9141-C5FE-4B9D-A458-16D364E70DC6_DM_search_change_with_finishdate" name="change_with_finishdate.change_with_finishdate" predicate="*" summariesonly="true"),READ_SUMMARY(predicate="\"change_with_finishdate.FINISHDATE_EPOCH\">1607299625")) | fields + _time, host, source, sourcetype, FINISHDATE_EPOCH

Even with hints (DIRECTIVES, READ_SUMMARY, dmid), I can't find documentation, .conf presentations, etc. on this feature.

In any case, raw events are returned, presumably filtered by the accelerated data model search.

In my very small test environment, a direct search using TERM as above is faster than using a data model. I'm very curious about what your results will be in a larger index. Day to day, I do find that effectively using index terms often outperforms data models.
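For reference, such a direct search might look like this (index and sourcetype are from my test environment above; 1593993600 is the epoch value for 2020-07-06 00:00:00 UTC, matching the term set):

```spl
index=main sourcetype=change_with_finishdate
    (TERM(2020-07-07) OR TERM(2020-07-08) OR TERM(2020-07-09)
     OR TERM(2020-07-1*) OR TERM(2020-07-2*) OR TERM(2020-07-3*)
     OR TERM(2020-08-*) OR TERM(2020-09-*) OR TERM(2020-1*-*)
     OR (TERM(20*-*-*) NOT TERM(2020-*-*)))
| where strptime(FINISHDATE, "%Y-%m-%d") > 1593993600
```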


woodcock
Esteemed Legend

Yes, this would work but it would be too hand-tuned and inflexible to handle general timepicker use.


tscroggins
Builder

Agreed. Hence: "The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter." I haven't fiddled with it yet.

What did your data model experiments reveal?
