
How to leverage a secondary time field at scale?

woodcock
Esteemed Legend

I have a dataset that has both FINISHDATE and CHANGEDATE fields as text.  We use strptime(CHANGEDATE) for _time, but we often need to filter only on FINISHDATE, which means that we have to do an All time search followed by | where strptime(FINISHDATE)..., which is super inefficient.  I am thinking that maybe an Accelerated Data Model with FINISHDATE set up as a number could help, but I am not sure.  If I create this data model, how can I use it to bring back raw events quickly filtered by the FINISHDATE field?
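
To make the pain concrete, today's slow pattern is something like this (the index, sourcetype, and format string are stand-ins for our real ones):

index=our_index sourcetype=our_sourcetype earliest=0
| where strptime(FINISHDATE, "%Y-%m-%d") > 1607299625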

I created a calculated field like this:

 

FINISHDATE_ = strptime(FINISHDATE, ...)
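
In props.conf form, that is roughly the following (the sourcetype stanza and the format string here are placeholders, not our real ones):

[our_sourcetype]
EVAL-FINISHDATE_ = strptime(FINISHDATE, "%Y-%m-%d")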

 

Then I created a data model named DOUBLE_DATE like this:

 

Extracted field: FINISHDATE_ = Number
Extracted field: ID = String

 

Now I have several options:

I can use | from like this:

 

| from datamodel DOUBLE_DATE 
| search FINISHDATE_ > 1607299625

 

I can use | datamodel like this:

 

| datamodel DOUBLE_DATE flat
| search FINISHDATE_ > 1607299625

 

Or I can use | tstats like this:

 

| tstats count
FROM datamodel=DOUBLE_DATE 
WHERE DOUBLE_DATE.FINISHDATE_ > 1607299625
BY DOUBLE_DATE.ID
| rename DOUBLE_DATE.* AS *

 

The problem is that I am not sure that any of these does what I need, which is to leverage the secondary time field DOUBLE_DATE.FINISHDATE_ in the foundational search (before the first pipe) so that it will BOTH return the full raw events AND be very fast.  The only way that I can see to do that is to use the last option above as a subsearch to a regular search (sketch below), but that would be clunky and have its own problems.  Is there some other new/schema-acceleration option that I am missing?
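
For reference, the subsearch version I mean would be something like this (assuming ID is selective enough and the subsearch stays under its result limits):

index=our_index sourcetype=our_sourcetype
    [| tstats count FROM datamodel=DOUBLE_DATE
       WHERE DOUBLE_DATE.FINISHDATE_ > 1607299625
       BY DOUBLE_DATE.ID
     | rename DOUBLE_DATE.ID AS ID
     | fields ID ]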

Why don't I just use INGEST_EVAL and create an index-time field?  Because we have already indexed millions of events over 10 years and I am trying to avoid having to reindex them.  But if I did do this and set up fields.conf so that Splunk knows that it is an index-time field, then I could add FINISHDATE_ > 1607299625 directly to the base search.
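
If we ever do go that route, I expect the config to look roughly like this (a sketch only: it assumes FINISHDATE appears in _raw as FINISHDATE=<date>, uses a placeholder format string, and would only apply to data indexed after the change):

# transforms.conf
[finishdate_epoch]
INGEST_EVAL = FINISHDATE_=strptime(replace(_raw, "(?s).*FINISHDATE=(\S+).*", "\1"), "%Y-%m-%d")

# props.conf
[our_sourcetype]
TRANSFORMS-finishdate = finishdate_epoch

# fields.conf -- marks FINISHDATE_ as index-time so it can be used before the first pipe
[FINISHDATE_]
INDEXED = true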

Why don't I just test more?  I have to wait for the admin to accelerate my data model, and that may be a while.  I suspect that the | datamodel ... | search approach might trigger a schema-accelerated search optimization and "just work".  I will post an update after testing.


tscroggins
Influencer

@woodcock 

If your date format is fixed, e.g. %Y-%m-%d, you could use terms to limit the initial result set. If the date is in _raw with a field name and only minor breakers, e.g. FINISHDATE=%Y-%m-%d, you can further reduce the result set.

E.g.: With %Y-%m-%d and FINISHDATE>1607299625, the epoch value falls on 2020-12-07, so the first matching date is 2020-12-08, and your terms would include (assuming latest=now as of this message):

(TERM(2020-12-08) OR TERM(2020-12-09) OR TERM(2020-12-1*) OR TERM(2020-12-2*) OR TERM(2020-12-3*) OR (TERM(20*-*-*) NOT TERM(2020-*-*)))

With FINISHDATE=%Y-%m-%d:

(TERM(FINISHDATE=2020-12-08) OR TERM(FINISHDATE=2020-12-09) OR TERM(FINISHDATE=2020-12-1*) OR TERM(FINISHDATE=2020-12-2*) OR TERM(FINISHDATE=2020-12-3*) OR (TERM(FINISHDATE=20*-*-*) NOT TERM(FINISHDATE=2020-*-*)))

The first set of terms will return more events than you need, but you'll have reduced the number of events piping through a subsequent where or search command.
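
For example, combining the field-name terms with a final precise filter (index and sourcetype are placeholders):

index=your_index sourcetype=your_sourcetype
    (TERM(FINISHDATE=2020-12-08) OR TERM(FINISHDATE=2020-12-09) OR TERM(FINISHDATE=2020-12-1*) OR TERM(FINISHDATE=2020-12-2*) OR TERM(FINISHDATE=2020-12-3*) OR (TERM(FINISHDATE=20*-*-*) NOT TERM(FINISHDATE=2020-*-*)))
| where strptime(FINISHDATE, "%Y-%m-%d") > 1607299625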

The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter.

A similar set of reduced terms can be constructed for time values in e.g. %H:%M:%S format.

The only reference to "schema-accelerated search optimization" I can find in documentation is "improved data model drilldown" in the release notes for Splunk Enterprise 7.1. I believe this is a reference to data model search modes: search, flat, and acceleration_search.

As a test, I created an eval field:

EVAL-FINISHDATE_EPOCH = strptime(FINISHDATE, "%Y-%m-%d")

I then created an accelerated data model named change_with_finishdate with a root data set also named change_with_finishdate and a single auto-extracted field: FINISHDATE_EPOCH.

Searching the data model:

| datamodel summariesonly=t change_with_finishdate change_with_finishdate search
| search change_with_finishdate.FINISHDATE_EPOCH>1607299625

Or to strip data set prefixes from field names:

| datamodel summariesonly=t change_with_finishdate change_with_finishdate flat
| search FINISHDATE_EPOCH>1607299625

Splunk adds directives to the search:

| search (FINISHDATE_EPOCH=* index=main sourcetype=change_with_finishdate FINISHDATE_EPOCH>1607299625 (index=* OR index=_*)) DIRECTIVES(READ_SUMMARY(allow_old_summaries="false" dmid="CB3E9141-C5FE-4B9D-A458-16D364E70DC6_DM_search_change_with_finishdate" name="change_with_finishdate.change_with_finishdate" predicate="*" summariesonly="true"),READ_SUMMARY(predicate="\"change_with_finishdate.FINISHDATE_EPOCH\">1607299625")) | fields + _time, host, source, sourcetype, FINISHDATE_EPOCH

Even with hints (DIRECTIVES, READ_SUMMARY, dmid), I can't find documentation, .conf presentations, etc. on this feature.

In any case, raw events are returned, presumably filtered by the accelerated data model search.

In my very small test environment, a direct search using TERM as above is faster than using a data model. I'm very curious about what your results will be in a larger index. Day to day, I do find that effectively using index terms often outperforms data models.


woodcock
Esteemed Legend

Yes, this would work but it would be too hand-tuned and inflexible to handle general timepicker use.


tscroggins
Influencer

Agreed. Hence: "The challenge becomes writing a family of date comparison macros (eq, ne, gt, lt, etc.) that convert an epoch value to a set of reduced terms, perhaps including a time format string parameter." I haven't fiddled with it yet.

What did your data model experiments reveal?
