Solved: Is there a way to create an additional index based...

mohitab · ‎11-14-2014

I went through the Exploring Splunk book which states that the data is indexed w.r.t. _time, host , source & sourceType (Ref Table 2-1. Fields Splunk Always Indexes).

I experimented with this in mind and found that on a CSV data of about 70MB, it took ~14 sec to find events with non-duplicate _time field. And it took ~26 sec to perform same operation on an discovered field.

I am wondering if there is a way, just like in most RDBMS, to create additional index based on the discovered field. Any suggestions? I went through the book and have played with Splunk but did not find anything.

dwaddle · ‎11-16-2014

Splunk is not like an RDBMS and concepts do not always carry over from one to the other in perfect comparison.

A great example of this is your question - within an RDBMS adding an additional index to a table is a trivial operation. You issue the CREATE INDEX SQL statement, referencing a table and some set of columns on that table that you would like to be in the index. The SQL engine then creates a new B-Tree (or similar) file with the values of those columns as keys and pointers to the rows in the table as values.

Comparatively, Splunk does not (and cannot) do exactly this. First of all, Splunk's lack of defined schema at write means that there are no "tables" or "columns" to work with. And once Splunk has indexed data, it becomes largely read-only and the on-disk structure cannot be substantially modified without re-indexing. Further, Splunk has already created a full-text index on all of "words" in your source data anyway. If we were to give Splunk two sample events:

2014-10-16T22:40:47 myserver sshd[12345]: Accepted password for user buttercup

and

2014-10-16,22:41:00,aaa,bbb,ccc,ddd,12345

Then it would make entries in its inverted index for tokens from these events like

10
12345
16
2014
22
40
41
47
aaa
accepted
bbb
buttercup
ccc
for
myserver
password
sshd
user

And possibly others too - event segmentation at indexing is a complex topic. This full-text indexing combined with schema at read (as opposed to at write) is one of the strengths of Splunk's approach to unstructured and semi-structured data. Data that is more structured, like CSV, JSON, or W3C data, can be additionally processed for context to create even better indexing for this data.

Your performance example you provided is a little short on details for us to do much in the way of serious advice for speeding up your searches. Some more details there might help folks be able to improve your performance some.

But as a rule -- to be successful with Splunk you need to learn how it differs from your RDBMS technologies and master the "Splunkish ways" of doing things. It is different technology and is going to require some different approaches for maximum success.

View solution in original post

dwaddle · ‎11-16-2014

Splunk is not like an RDBMS and concepts do not always carry over from one to the other in perfect comparison.

A great example of this is your question - within an RDBMS adding an additional index to a table is a trivial operation. You issue the CREATE INDEX SQL statement, referencing a table and some set of columns on that table that you would like to be in the index. The SQL engine then creates a new B-Tree (or similar) file with the values of those columns as keys and pointers to the rows in the table as values.

Comparatively, Splunk does not (and cannot) do exactly this. First of all, Splunk's lack of defined schema at write means that there are no "tables" or "columns" to work with. And once Splunk has indexed data, it becomes largely read-only and the on-disk structure cannot be substantially modified without re-indexing. Further, Splunk has already created a full-text index on all of "words" in your source data anyway. If we were to give Splunk two sample events:

2014-10-16T22:40:47 myserver sshd[12345]: Accepted password for user buttercup

and

2014-10-16,22:41:00,aaa,bbb,ccc,ddd,12345

Then it would make entries in its inverted index for tokens from these events like

10
12345
16
2014
22
40
41
47
aaa
accepted
bbb
buttercup
ccc
for
myserver
password
sshd
user

And possibly others too - event segmentation at indexing is a complex topic. This full-text indexing combined with schema at read (as opposed to at write) is one of the strengths of Splunk's approach to unstructured and semi-structured data. Data that is more structured, like CSV, JSON, or W3C data, can be additionally processed for context to create even better indexing for this data.

Your performance example you provided is a little short on details for us to do much in the way of serious advice for speeding up your searches. Some more details there might help folks be able to improve your performance some.

But as a rule -- to be successful with Splunk you need to learn how it differs from your RDBMS technologies and master the "Splunkish ways" of doing things. It is different technology and is going to require some different approaches for maximum success.

martin_mueller · ‎11-15-2014

Yeah, you can create your own indexed fields from within transforms.conf.

There are several good reasons for doing this, and several good reasons for not doing this. What's your underlying use case that makes you want this?

mohitab · ‎11-17-2014

@martin_mueller: I found that I do not have good enough reasons for indexing based on Flight_ID. The possible values of Flight_ID are quite distinct from values taken by other fields.

martin_mueller · ‎11-17-2014

First off, a Report is a saved search so saving your search as a Report will do the trick.

As for your searching, are you doing something like this?

index=flight_data sourcetype=flight_csv Flight_ID="LH450"

If so, Splunk will look for the term LH450 in the lexicon, retrieving matching events very efficiently - see @dwaddle's post. Search-time field extraction is applied afterwards to verify the term actually is in the desired field. To make sure that's working well for your data check if the number of matching events is close to the number of scanned events in the job inspector.

mohitab · ‎11-16-2014

@ cpetterborg: I am using the latest version. Apparently I could not find a way of saving the search. There is an option of saving the search as Report, Alert or Dashboard.

@martin_mueller: In my CSV data (which has sensor data), I have a column, lets call it Flight_ID, on which most of the queries are based. I am rarely using _time field in the queries. More or less I am interested in finding events with particular Flight_ID, finding unique Flight_IDs and getting count of the number of records for each Flight_ID. Can you provide some tutorial/cheatsheet which put more light on how to create index on discovered Fields.

cpetterborg · ‎11-17-2014

Save as Report is probably what you want to do, though you could also do Save as Alert. Then in the Settings -> Searches interface, you can edit the search and use the Accelerate this Search checkbox:

cpetterborg · ‎11-14-2014

Have you tried to save the search, then use the "optimize this search" checkbox? It adds additional metadata to make the search go faster.

Is there a way to create an additional index based on a discovered field?

New This Month in Splunk Observability Cloud - Metrics Usage Analytics, Enhanced K8s ...

Alerting Best Practices: How to Create Good Detectors

Discover Powerful New Features in Splunk Cloud Platform: Enhanced Analytics, ...