I went through the Exploring Splunk book which states that the data is indexed w.r.t. _time, host , source & sourceType (Ref Table 2-1. Fields Splunk Always Indexes).
I experimented with this in mind and found that on a CSV data of about 70MB, it took ~14 sec to find events with non-duplicate _time field. And it took ~26 sec to perform same operation on an discovered field.
I am wondering if there is a way, just like in most RDBMS, to create additional index based on the discovered field. Any suggestions? I went through the book and have played with Splunk but did not find anything.
Splunk is not like an RDBMS and concepts do not always carry over from one to the other in perfect comparison.
A great example of this is your question - within an RDBMS adding an additional index to a table is a trivial operation. You issue the CREATE INDEX
SQL statement, referencing a table and some set of columns on that table that you would like to be in the index. The SQL engine then creates a new B-Tree (or similar) file with the values of those columns as keys and pointers to the rows in the table as values.
Comparatively, Splunk does not (and cannot) do exactly this. First of all, Splunk's lack of defined schema at write means that there are no "tables" or "columns" to work with. And once Splunk has indexed data, it becomes largely read-only and the on-disk structure cannot be substantially modified without re-indexing. Further, Splunk has already created a full-text index on all of "words" in your source data anyway. If we were to give Splunk two sample events:
2014-10-16T22:40:47 myserver sshd[12345]: Accepted password for user buttercup
and
2014-10-16,22:41:00,aaa,bbb,ccc,ddd,12345
Then it would make entries in its inverted index for tokens from these events like
10
12345
16
2014
22
40
41
47
aaa
accepted
bbb
buttercup
ccc
for
myserver
password
sshd
user
And possibly others too - event segmentation at indexing is a complex topic. This full-text indexing combined with schema at read (as opposed to at write) is one of the strengths of Splunk's approach to unstructured and semi-structured data. Data that is more structured, like CSV, JSON, or W3C data, can be additionally processed for context to create even better indexing for this data.
Your performance example you provided is a little short on details for us to do much in the way of serious advice for speeding up your searches. Some more details there might help folks be able to improve your performance some.
But as a rule -- to be successful with Splunk you need to learn how it differs from your RDBMS technologies and master the "Splunkish ways" of doing things. It is different technology and is going to require some different approaches for maximum success.
Splunk is not like an RDBMS and concepts do not always carry over from one to the other in perfect comparison.
A great example of this is your question - within an RDBMS adding an additional index to a table is a trivial operation. You issue the CREATE INDEX
SQL statement, referencing a table and some set of columns on that table that you would like to be in the index. The SQL engine then creates a new B-Tree (or similar) file with the values of those columns as keys and pointers to the rows in the table as values.
Comparatively, Splunk does not (and cannot) do exactly this. First of all, Splunk's lack of defined schema at write means that there are no "tables" or "columns" to work with. And once Splunk has indexed data, it becomes largely read-only and the on-disk structure cannot be substantially modified without re-indexing. Further, Splunk has already created a full-text index on all of "words" in your source data anyway. If we were to give Splunk two sample events:
2014-10-16T22:40:47 myserver sshd[12345]: Accepted password for user buttercup
and
2014-10-16,22:41:00,aaa,bbb,ccc,ddd,12345
Then it would make entries in its inverted index for tokens from these events like
10
12345
16
2014
22
40
41
47
aaa
accepted
bbb
buttercup
ccc
for
myserver
password
sshd
user
And possibly others too - event segmentation at indexing is a complex topic. This full-text indexing combined with schema at read (as opposed to at write) is one of the strengths of Splunk's approach to unstructured and semi-structured data. Data that is more structured, like CSV, JSON, or W3C data, can be additionally processed for context to create even better indexing for this data.
Your performance example you provided is a little short on details for us to do much in the way of serious advice for speeding up your searches. Some more details there might help folks be able to improve your performance some.
But as a rule -- to be successful with Splunk you need to learn how it differs from your RDBMS technologies and master the "Splunkish ways" of doing things. It is different technology and is going to require some different approaches for maximum success.
Yeah, you can create your own indexed fields from within transforms.conf.
There are several good reasons for doing this, and several good reasons for not doing this. What's your underlying use case that makes you want this?
@martin_mueller: I found that I do not have good enough reasons for indexing based on Flight_ID. The possible values of Flight_ID are quite distinct from values taken by other fields.
First off, a Report is a saved search so saving your search as a Report will do the trick.
As for your searching, are you doing something like this?
index=flight_data sourcetype=flight_csv Flight_ID="LH450"
If so, Splunk will look for the term LH450
in the lexicon, retrieving matching events very efficiently - see @dwaddle's post. Search-time field extraction is applied afterwards to verify the term actually is in the desired field. To make sure that's working well for your data check if the number of matching events is close to the number of scanned events in the job inspector.
@ cpetterborg: I am using the latest version. Apparently I could not find a way of saving the search. There is an option of saving the search as Report, Alert or Dashboard.
@martin_mueller: In my CSV data (which has sensor data), I have a column, lets call it Flight_ID, on which most of the queries are based. I am rarely using _time field in the queries. More or less I am interested in finding events with particular Flight_ID, finding unique Flight_IDs and getting count of the number of records for each Flight_ID. Can you provide some tutorial/cheatsheet which put more light on how to create index on discovered Fields.
Save as Report is probably what you want to do, though you could also do Save as Alert. Then in the Settings -> Searches interface, you can edit the search and use the Accelerate this Search checkbox:
Have you tried to save the search, then use the "optimize this search" checkbox? It adds additional metadata to make the search go faster.