<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to leverage a secondary time field at scale? in Splunk Search</title>
    <link>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/541562#M153330</link>
    <description>Splunk Community discussion: how to filter raw events efficiently on a secondary time field (FINISHDATE) when _time is derived from CHANGEDATE, using accelerated data models, tstats, index-time fields, or TERM()-based index searches.</description>
    <pubDate>Fri, 26 Feb 2021 18:13:43 GMT</pubDate>
    <dc:creator>woodcock</dc:creator>
    <dc:date>2021-02-26T18:13:43Z</dc:date>
    <item>
      <title>How to leverage a secondary time field at scale?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/541562#M153330</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I have a dataset that has both&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;FINISHDATE&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;and&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;CHANGEDATE&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;fields in text.&amp;nbsp; We use&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;strptime(CHANGEDATE)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;for&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;_time&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;but we often need to filter only on&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;FINISHDATE&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;which means that we have to do an&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;All time&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;search, followed by&lt;STRONG&gt; |&amp;nbsp;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;STRONG&gt;where strptime(FINISHDATE)...&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;which is super inefficient.&amp;nbsp; I am thinking that maybe an&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Accelerated Data Model&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;with&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;FINISHDATE&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;set up as a number could help, but I am not sure.&amp;nbsp; If I create this data model, how can I use it to bring back raw events quickly filtered by the&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;FINISHDATE&lt;/STRONG&gt;&lt;SPAN&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;field?&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I created a calculated field like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;FINISHDATE_ = strptime(FINISHDATE, ...)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I created datamodel &lt;STRONG&gt;DOUBLE_DATE&lt;/STRONG&gt; like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Extracted field: FINISHDATE_ = Number
Extracted field: ID = String&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;Now I have several options:&lt;BR /&gt;&lt;BR /&gt;I can use &lt;STRONG&gt;| from&lt;/STRONG&gt; like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| from datamodel DOUBLE_DATE 
| search FINISHDATE_ &amp;gt; 1607299625&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can use &lt;STRONG&gt;| datamodel&lt;/STRONG&gt; like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| datamodel DOUBLE_DATE flat
| search FINISHDATE_ &amp;gt; 1607299625&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;Or I can use&lt;STRONG&gt; |&amp;nbsp;tstats&lt;/STRONG&gt; like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;| tstats count
FROM datamodel=DOUBLE_DATE 
WHERE DOUBLE_DATE.FINISHDATE_ &amp;gt; 1607299625
BY DOUBLE_DATE.ID
| rename DOUBLE_DATE.* AS *&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is that I am not sure that any of these does what I need it to do, which is to leverage the secondary time field in &lt;STRONG&gt;DOUBLE_DATE.FINISHDATE_&lt;/STRONG&gt; in the foundational search (before the first pipe) so that it will BOTH&amp;nbsp;&lt;STRONG&gt;return the full raw events&lt;/STRONG&gt; AND &lt;STRONG&gt;be very fast&lt;/STRONG&gt;.&amp;nbsp; The only way that I can see to do that is to use the last one above as a subsearch to a regular search, but that would be clunky and have its own problems.&amp;nbsp; Is there some other new/schema-acceleration option that I am missing?&lt;BR /&gt;&lt;BR /&gt;Why don't I just use &lt;STRONG&gt;INGEST_EVAL&lt;/STRONG&gt; and create an &lt;STRONG&gt;index-time&lt;/STRONG&gt; field?&amp;nbsp; Because we have already indexed millions of events over 10 years and I am trying to avoid having to reindex them.&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;But if I did this &lt;/SPAN&gt;&lt;SPAN&gt;and I set up &lt;STRONG&gt;fields.conf&amp;nbsp;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN&gt;so that Splunk knows that it is an &lt;STRONG&gt;index-time&lt;/STRONG&gt; field, then I would be able to use &lt;STRONG&gt;AND FINISHDATE_ &amp;gt; 1607299625&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;Why don't I just test more?&amp;nbsp; I have to wait for the admin to accelerate my datamodel and that may be a while.&amp;nbsp; I suspect that the &lt;STRONG&gt;| datamodel .. | search&lt;/STRONG&gt; might trigger a &lt;STRONG&gt;schema-accelerated search optimization&lt;/STRONG&gt; and "just work".&amp;nbsp; I will post an update after testing.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2021 18:13:43 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/541562#M153330</guid>
      <dc:creator>woodcock</dc:creator>
      <dc:date>2021-02-26T18:13:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to leverage a secondary time field at scale?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/541673#M153369</link>
      <description>&lt;P&gt;&lt;a href="https://community.splunk.com/t5/user/viewprofilepage/user-id/1406"&gt;@woodcock&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If your date format is fixed, e.g. %Y-%m-%d, you could use terms to limit the initial result set. If the date is in _raw with a field name and only minor breakers, e.g. FINISHDATE=%Y-%m-%d, you can further reduce the result set.&lt;/P&gt;&lt;P&gt;E.g.: With %Y-%m-%d and FINISHDATE&amp;gt;1607299625, FINISHDATE is 2020-07-06, and your terms would include (assuming latest=now as of this message):&lt;/P&gt;&lt;P&gt;(TERM(2020-07-07) OR TERM(2020-07-08) OR TERM(2020-07-09) OR TERM(2020-07-1*) OR TERM(2020-07-2*) OR TERM(2020-07-3*) OR TERM(2020-08-*) OR TERM(2020-09-*) OR TERM(2020-1*-*) OR (TERM(20*-*-*) NOT TERM(2020-*-*)))&lt;/P&gt;&lt;P&gt;With FINISHDATE=%Y-%m-%d:&lt;/P&gt;&lt;P&gt;(TERM(FINISHDATE=2020-07-07) OR TERM(FINISHDATE=2020-07-08) OR TERM(FINISHDATE=2020-07-09) OR TERM(FINISHDATE=2020-07-1*) OR TERM(FINISHDATE=2020-07-2*) OR TERM(FINISHDATE=2020-07-3*) OR TERM(FINISHDATE=2020-08-*) OR TERM(FINISHDATE=2020-09-*) OR TERM(FINISHDATE=2020-1*-*) OR (TERM(FINISHDATE=20*-*-*) NOT TERM(FINISHDATE=2020-*-*)))&lt;/P&gt;&lt;P&gt;The first set of terms will return more events than you need, but you'll have reduced the number of events piping through a subsequent where or search command.&lt;/P&gt;&lt;P&gt;The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter.&lt;/P&gt;&lt;P&gt;A similar set of reduced terms can be constructed for time values in e.g. %H:%M:%S format.&lt;/P&gt;&lt;P&gt;The only reference to "schema-accelerated search optimization" I can find in documentation is "improved data model drilldown" in the release notes for Splunk Enterprise 7.1. 
I believe this is a reference to data model search modes: search, flat, and acceleration_search.&lt;/P&gt;&lt;P&gt;As a test, I created an eval field:&lt;/P&gt;&lt;P&gt;EVAL-FINISHDATE_EPOCH = strptime(FINISHDATE, "%Y-%m-%d")&lt;/P&gt;&lt;P&gt;I then created an accelerated data model named change_with_finishdate with a root data set also named change_with_finishdate and a single auto-extracted field: FINISHDATE_EPOCH.&lt;/P&gt;&lt;P&gt;Searching the data model:&lt;/P&gt;&lt;P&gt;| datamodel summariesonly=t change_with_finishdate change_with_finishdate search&lt;BR /&gt;| search change_with_finishdate.FINISHDATE_EPOCH&amp;gt;1607299625&lt;/P&gt;&lt;P&gt;Or to strip data set prefixes from field names:&lt;/P&gt;&lt;P&gt;| datamodel summariesonly=t change_with_finishdate change_with_finishdate flat&lt;BR /&gt;| search FINISHDATE_EPOCH&amp;gt;1607299625&lt;/P&gt;&lt;P&gt;Splunk adds directives to the search:&lt;/P&gt;&lt;P&gt;| search (FINISHDATE_EPOCH=* index=main sourcetype=change_with_finishdate FINISHDATE_EPOCH&amp;gt;1607299625 (index=* OR index=_*)) DIRECTIVES(READ_SUMMARY(allow_old_summaries="false" dmid="CB3E9141-C5FE-4B9D-A458-16D364E70DC6_DM_search_change_with_finishdate" name="change_with_finishdate.change_with_finishdate" predicate="*" summariesonly="true"),READ_SUMMARY(predicate="\"change_with_finishdate.FINISHDATE_EPOCH\"&amp;gt;1607299625")) | fields + _time, host, source, sourcetype, FINISHDATE_EPOCH&lt;/P&gt;&lt;P&gt;Even with hints (DIRECTIVES, READ_SUMMARY, dmid), I can't find documentation, .conf presentations, etc. on this feature.&lt;/P&gt;&lt;P&gt;In any case, raw events are returned, presumably filtered by the accelerated data model search.&lt;/P&gt;&lt;P&gt;In my very small test environment, a direct search using TERM as above is faster than using a data model. I'm very curious about what your results will be in a larger index. Day to day, I do find that effectively using index terms often outperforms data models.&lt;/P&gt;</description>
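The "family of date comparison macros" idea in the reply above can be prototyped outside SPL. The following Python is an illustrative sketch, not from the thread; the function name gt_date_patterns is hypothetical. It expands a strictly-greater-than date comparison into the kind of coarse wildcard patterns shown in the hand-built TERM() list, assuming a fixed %Y-%m-%d format and a search window that ends within the same year; later years would still need a catch-all such as TERM(20*-*-*) NOT TERM(2020-*-*) as in the post.

```python
from datetime import date, timedelta

def gt_date_patterns(d):
    """Wildcard patterns matching %Y-%m-%d dates strictly after d,
    through the end of d's year."""
    pats = []
    # Individual days from d+1 up to the next multiple of ten.
    day = d + timedelta(days=1)
    while day.month == d.month and day.day % 10 != 0:
        pats.append(day.isoformat())
        day = day + timedelta(days=1)
    # Whole "tens" blocks left in the same month, e.g. 1*, 2*, 3*.
    if day.month == d.month:
        for tens in range(day.day // 10, 4):
            pats.append("%04d-%02d-%d*" % (d.year, d.month, tens))
    # One pattern per remaining whole month in the year.
    for m in range(d.month + 1, 13):
        pats.append("%04d-%02d-*" % (d.year, m))
    return pats

# Wrap each pattern in TERM() and join with OR for use in a search:
# " OR ".join("TERM(%s)" % p for p in pats)
```

Coarse patterns keep the generated search short at the cost of matching some extra events, which the follow-on where or search filter then discards, the same trade-off described in the reply.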
      <pubDate>Sun, 28 Feb 2021 17:03:56 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/541673#M153369</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2021-02-28T17:03:56Z</dc:date>
    </item>
    <item>
      <title>Re: How to leverage a secondary time field at scale?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/542710#M153729</link>
      <description>&lt;P&gt;Yes, this would work, but it would be too hand-tuned and inflexible to handle general timepicker use.&lt;/P&gt;</description>
      <pubDate>Sun, 07 Mar 2021 00:56:33 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/542710#M153729</guid>
      <dc:creator>woodcock</dc:creator>
      <dc:date>2021-03-07T00:56:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to leverage a secondary time field at scale?</title>
      <link>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/542712#M153731</link>
      <description>&lt;P&gt;Agreed. Hence: "&lt;SPAN&gt;The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter." I haven't fiddled with it yet.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;What did your data model experiments reveal?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 07 Mar 2021 01:12:23 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Splunk-Search/How-to-leverage-a-secondary-time-field-at-scale/m-p/542712#M153731</guid>
      <dc:creator>tscroggins</dc:creator>
      <dc:date>2021-03-07T01:12:23Z</dc:date>
    </item>
  </channel>
</rss>

