Getting Data In

Is there a way to create an additional index based on a discovered field?

mohitab
Path Finder

I went through the Exploring Splunk book, which states that data is indexed with respect to _time, host, source, and sourcetype (ref. Table 2-1, Fields Splunk Always Indexes).

I experimented with this in mind and found that on a CSV data set of about 70 MB, it took ~14 sec to find events with non-duplicate _time values, and ~26 sec to perform the same operation on a discovered field.

I am wondering if there is a way, as in most RDBMSs, to create an additional index based on a discovered field. Any suggestions? I went through the book and have played with Splunk but did not find anything.

1 Solution

dwaddle
SplunkTrust

Splunk is not like an RDBMS, and concepts do not always carry over cleanly from one to the other.

A great example of this is your question - within an RDBMS adding an additional index to a table is a trivial operation. You issue the CREATE INDEX SQL statement, referencing a table and some set of columns on that table that you would like to be in the index. The SQL engine then creates a new B-Tree (or similar) file with the values of those columns as keys and pointers to the rows in the table as values.
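For comparison, that RDBMS operation looks something like this (the table and column names here are hypothetical):

-- Hypothetical example: build a secondary B-Tree index on a
-- flights table so lookups by flight_id avoid a full table scan.
CREATE INDEX idx_flights_flight_id ON flights (flight_id);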

Comparatively, Splunk does not (and cannot) do exactly this. First of all, Splunk's lack of a defined schema at write time means that there are no "tables" or "columns" to work with. And once Splunk has indexed data, it becomes largely read-only and the on-disk structure cannot be substantially modified without re-indexing. Further, Splunk has already created a full-text index on all of the "words" in your source data anyway. If we were to give Splunk two sample events:

2014-10-16T22:40:47 myserver sshd[12345]: Accepted password for user buttercup

and

2014-10-16,22:41:00,aaa,bbb,ccc,ddd,12345

Then it would make entries in its inverted index for tokens from these events, like:

10
12345
16
2014
22
40
41
47
aaa
accepted
bbb
buttercup
ccc
for
myserver
password
sshd
user

And possibly others too - event segmentation at index time is a complex topic. This full-text indexing, combined with schema at read (as opposed to at write), is one of the strengths of Splunk's approach to unstructured and semi-structured data. Data that is more structured, like CSV, JSON, or W3C logs, can be additionally processed at index time to produce even better indexing.
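As a minimal sketch of that structured-data processing, assuming a hypothetical sourcetype name and timestamp columns for the CSV flight data, props.conf can tell Splunk to extract the columns as indexed fields:

# props.conf - hypothetical sourcetype for the CSV flight data
[flight_csv]
# Parse the CSV header row and store each column as an indexed field
INDEXED_EXTRACTIONS = csv
# Build _time from these (hypothetical) columns rather than the raw text
TIMESTAMP_FIELDS = Date,Time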

The performance example you provided is a little short on details for us to offer much in the way of serious advice for speeding up your searches. Some more detail there might help folks improve your performance.

But as a rule -- to be successful with Splunk, you need to learn how it differs from your RDBMS technologies and master the "Splunkish ways" of doing things. It is a different technology and is going to require some different approaches for maximum success.


martin_mueller
SplunkTrust

Yeah, you can create your own indexed fields from within transforms.conf.
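A minimal sketch of what that looks like, assuming the hypothetical Flight_ID field from this thread (the sourcetype name and regex are illustrative only):

# props.conf - attach the transform to the (hypothetical) sourcetype
[flight_csv]
TRANSFORMS-flightid = flight_id_indexed

# transforms.conf - pull the value out of the raw event and
# write it into the index as the field Flight_ID
[flight_id_indexed]
REGEX = Flight_ID[=:,]\s*(\w+)
FORMAT = Flight_ID::$1
WRITE_META = true

# fields.conf - tell search time that Flight_ID is an indexed field
[Flight_ID]
INDEXED = true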

There are several good reasons for doing this, and several good reasons for not doing this. What's your underlying use case that makes you want this?


mohitab
Path Finder

@martin_mueller: I found that I do not have good enough reasons for indexing on Flight_ID. Its possible values are quite distinct from the values taken by other fields, so plain term searches already find them efficiently.


martin_mueller
SplunkTrust
SplunkTrust

First off, a Report is a saved search, so saving your search as a Report will do the trick.

As for your searching, are you doing something like this?

index=flight_data sourcetype=flight_csv Flight_ID="LH450"

If so, Splunk will look for the term LH450 in the lexicon, retrieving matching events very efficiently - see @dwaddle's post. Search-time field extraction is applied afterwards to verify that the term actually is in the desired field. To make sure that's working well for your data, check whether the number of matching events is close to the number of scanned events in the job inspector.
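To illustrate the difference, these two searches return the same events, but only the first can use the lexicon; the second retrieves and scans every flight_csv event before filtering, because the where command runs after event retrieval (index and sourcetype names are the hypothetical ones above):

index=flight_data sourcetype=flight_csv Flight_ID="LH450"

index=flight_data sourcetype=flight_csv | where Flight_ID="LH450"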


mohitab
Path Finder

@cpetterborg: I am using the latest version. I could not find a plain "save search" option; there is only the option of saving the search as a Report, Alert, or Dashboard.

@martin_mueller: In my CSV data (which is sensor data), I have a column, let's call it Flight_ID, on which most of the queries are based. I rarely use the _time field in queries. More or less, I am interested in finding events with a particular Flight_ID, finding unique Flight_IDs, and getting a count of the records for each Flight_ID (see the example searches below). Can you point me to a tutorial/cheat sheet that sheds more light on how to create an index on discovered fields?
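For reference, those three operations map to searches like the following (the index name is hypothetical):

index=flight_data Flight_ID="LH450"

index=flight_data | stats dc(Flight_ID)

index=flight_data | stats count by Flight_ID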


cpetterborg
SplunkTrust

Save as Report is probably what you want to do, though you could also do Save as Alert. Then in the Settings -> Searches interface, you can edit the search and use the Accelerate this Search checkbox:

[screenshot: the Accelerate this Search checkbox in the search's edit dialog]
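Under the hood, that checkbox sets acceleration options on the saved search; a minimal sketch in savedsearches.conf, assuming a hypothetical report name, search, and summary range:

# savedsearches.conf - hypothetical accelerated report
[Flight record counts]
search = index=flight_data | stats count by Flight_ID
# Enable report acceleration and maintain a one-month summary
auto_summarize = 1
auto_summarize.dispatch.earliest_time = -1mon@mon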


cpetterborg
SplunkTrust

Have you tried to save the search, then use the "optimize this search" checkbox? It adds additional metadata to make the search go faster.
