I have a dataset with both FINISHDATE and CHANGEDATE fields in text. We use strptime(CHANGEDATE) for _time, but we often need to filter only on FINISHDATE, which forces an All Time search followed by | where strptime(FINISHDATE)..., which is super inefficient. I am thinking that maybe an Accelerated Data Model with FINISHDATE set up as a number could help, but I am not sure. If I create this data model, how can I use it to bring back raw events quickly, filtered by the FINISHDATE field?
I created a calculated field like this:
FINISHDATE_ = strptime(FINISHDATE, ...)
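For reference, in props.conf this calculated field would look roughly like the following (the sourcetype name and the time format string here are assumptions; substitute your own):

```ini
# props.conf -- sourcetype name and time format are assumptions
[change_records]
EVAL-FINISHDATE_ = strptime(FINISHDATE, "%Y-%m-%d")
```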
Then I created datamodel DOUBLE_DATE like this:
Extracted field: FINISHDATE_ = Number
Extracted field: ID = String
Now I have several options:
I can use | from like this:
| from datamodel DOUBLE_DATE | search FINISHDATE_ > 1607299625
I can use | datamodel like this:
| datamodel DOUBLE_DATE flat | search FINISHDATE_ > 1607299625
Or I can use | tstats like this:
| tstats count FROM datamodel=DOUBLE_DATE WHERE DOUBLE_DATE.FINISHDATE_ > 1607299625 BY DOUBLE_DATE.ID | rename DOUBLE_DATE.* AS *
The problem is that I am not sure any of these does what I need, which is to leverage the secondary time field in DOUBLE_DATE.FINISHDATE_ in the foundational search (before the first pipe), so that it will BOTH return the full raw events AND be very fast. The only way I can see to do that is to use the last option above as a subsearch to a regular search, but that would be clunky and have its own problems. Is there some other new/schema-acceleration option that I am missing?
Why don't I just use INGEST_EVAL and create an index-time field? Because we have already indexed millions of events over 10 years, and I am trying to avoid reindexing them. But if I did this, and I set up fields.conf so that Splunk knows it is an index-time field, then I would be able to use AND FINISHDATE_ > 1607299625 in the foundational search.
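For the record, the index-time approach would look roughly like this (stanza and field names are hypothetical, it assumes the date can be parsed from _raw at index time, and it would only apply to newly indexed events, hence the reindexing concern):

```ini
# transforms.conf (indexing tier) -- stanza name and regex are assumptions
[add_finishdate_epoch]
INGEST_EVAL = FINISHDATE_ = strptime(replace(_raw, ".*FINISHDATE=(\d{4}-\d{2}-\d{2}).*", "\1"), "%Y-%m-%d")

# props.conf (indexing tier)
[change_records]
TRANSFORMS-finishdate_epoch = add_finishdate_epoch

# fields.conf (search head) -- tells Splunk FINISHDATE_ is an index-time field
[FINISHDATE_]
INDEXED = true
INDEXED_VALUE = false
```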
Why don't I just test more? I have to wait for the admin to accelerate my data model, and that may be a while. I suspect that | datamodel ... | search might trigger a schema-accelerated search optimization and "just work". I will post an update after testing.
If your date format is fixed, e.g. %Y-%m-%d, you could use terms to limit the initial result set. If the date is in _raw with a field name and only minor breakers, e.g. FINISHDATE=%Y-%m-%d, you can further reduce the result set.
E.g.: with %Y-%m-%d, suppose the threshold epoch falls on FINISHDATE 2020-07-06 (roughly 1593993600, i.e. midnight UTC). Your terms for "greater than" would include (assuming latest=now as of this message):
(TERM(2020-07-07) OR TERM(2020-07-08) OR TERM(2020-07-09) OR TERM(2020-07-1*) OR TERM(2020-07-2*) OR TERM(2020-07-3*) OR TERM(2020-08-*) OR TERM(2020-09-*) OR TERM(2020-1*-*) OR (TERM(20*-*-*) NOT TERM(2020-*-*)))
(TERM(FINISHDATE=2020-07-07) OR TERM(FINISHDATE=2020-07-08) OR TERM(FINISHDATE=2020-07-09) OR TERM(FINISHDATE=2020-07-1*) OR TERM(FINISHDATE=2020-07-2*) OR TERM(FINISHDATE=2020-07-3*) OR TERM(FINISHDATE=2020-08-*) OR TERM(FINISHDATE=2020-09-*) OR TERM(FINISHDATE=2020-1*-*) OR (TERM(FINISHDATE=20*-*-*) NOT TERM(FINISHDATE=2020-*-*)))
The first set of terms will return more events than you need, but you'll have reduced the number of events piping through a subsequent where or search command.
The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter.
A similar set of reduced terms can be constructed for time values in e.g. %H:%M:%S format.
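To sketch what such a macro generator could do, here is a minimal Python prototype (not SPL; the function names are mine) that reproduces the greater-than term set above for 2020-07-06. It assumes a fixed %Y-%m-%d format, and that all dates of interest share the threshold's century:

```python
from datetime import date

def gt_date_patterns(d: date) -> list:
    """Wildcard patterns matching %Y-%m-%d dates strictly greater than d,
    within d's century (later years are handled separately)."""
    pats = []
    # Single days up to the end of the current tens-of-days block, e.g. 07, 08, 09.
    for dd in range(d.day + 1, min((d.day // 10 + 1) * 10, 32)):
        pats.append(f"{d:%Y-%m}-{dd:02d}")
    # Whole tens-of-days blocks after the current one: 1*, 2*, 3*.
    for tens in range(d.day // 10 + 1, 4):
        pats.append(f"{d:%Y-%m}-{tens}*")
    # Single later months in the same year (e.g. 08, 09) ...
    for mm in range(d.month + 1, min((d.month // 10 + 1) * 10, 13)):
        pats.append(f"{d:%Y}-{mm:02d}-*")
    # ... then the 1* block covering months 10-12.
    if d.month < 10:
        pats.append(f"{d:%Y}-1*-*")
    return pats

def gt_terms(d: date, field: str = "") -> str:
    """Render the patterns as an SPL TERM() expression; later years are
    covered by 'century NOT current year'."""
    prefix = f"{field}=" if field else ""
    terms = [f"TERM({prefix}{p})" for p in gt_date_patterns(d)]
    century = str(d.year)[:2]
    terms.append(f"(TERM({prefix}{century}*-*-*) NOT TERM({prefix}{d.year}-*-*))")
    return "(" + " OR ".join(terms) + ")"
```

Calling gt_terms(date(2020, 7, 6)) yields the first term set above, and gt_terms(date(2020, 7, 6), "FINISHDATE") the second. The other comparison macros (lt, eq, ne) would follow the same digit-block decomposition in the opposite direction.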
The only reference to "schema-accelerated search optimization" I can find in documentation is "improved data model drilldown" in the release notes for Splunk Enterprise 7.1. I believe this is a reference to data model search modes: search, flat, and acceleration_search.
As a test, I created an eval field:
EVAL-FINISHDATE_EPOCH = strptime(FINISHDATE, "%Y-%m-%d")
I then created an accelerated data model named change_with_finishdate with a root data set also named change_with_finishdate and a single auto-extracted field: FINISHDATE_EPOCH.
Searching the data model:
| datamodel summariesonly=t change_with_finishdate change_with_finishdate search
| search change_with_finishdate.FINISHDATE_EPOCH>1607299625
Or to strip data set prefixes from field names:
| datamodel summariesonly=t change_with_finishdate change_with_finishdate flat
| search FINISHDATE_EPOCH>1607299625
Splunk adds directives to the search:
| search (FINISHDATE_EPOCH=* index=main sourcetype=change_with_finishdate FINISHDATE_EPOCH>1607299625 (index=* OR index=_*)) DIRECTIVES(READ_SUMMARY(allow_old_summaries="false" dmid="CB3E9141-C5FE-4B9D-A458-16D364E70DC6_DM_search_change_with_finishdate" name="change_with_finishdate.change_with_finishdate" predicate="*" summariesonly="true"),READ_SUMMARY(predicate="\"change_with_finishdate.FINISHDATE_EPOCH\">1607299625")) | fields + _time, host, source, sourcetype, FINISHDATE_EPOCH
Even with hints (DIRECTIVES, READ_SUMMARY, dmid), I can't find documentation, .conf presentations, etc. on this feature.
In any case, raw events are returned, presumably filtered by the accelerated data model search.
In my very small test environment, a direct search using TERM as above is faster than using a data model. I'm very curious about what your results will be in a larger index. Day to day, I do find that effectively using index terms often outperforms data models.
Agreed. Hence: "The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter." I haven't fiddled with it yet.
What did your data model experiments reveal?