Splunk Search

What exactly is a tsidx file?

Path Finder

What exactly is a tsidx file? Can someone explain, please? I don't quite understand the definition:

"A tsidx file associates each unique keyword in your data with location references to events(??), which are stored in a companion rawdata file"

I ask this in relation to tstats command which states "Use the tstats command to perform statistical queries on indexed fields in tsidx files".

Can someone explain this in context to tstats?

1 Solution

Splunk Employee

tsidx (time series index) files are created as part of the indexing pipeline. Incoming data is parsed into terms (think 'words' delimited by certain characters), and each term is stored along with an offset (a number) that represents the location in the rawdata file (journal.gz) where the event data is written.
It works like the index in a book, except that it is a complete index rather than a subset. If every word in a book were listed in its index, the index would be far larger than the book itself, and that is exactly what happens in Splunk. If you look at an index bucket directory on disk, you will find that the index and other metadata files often take up more space than the compressed raw data.
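
To make the book-index analogy concrete, here is a toy sketch in Python. It is not how Splunk implements tsidx files; the event strings, the tokenizing rule and the helper names build_index and lookup are all made up for illustration. It only shows the idea of mapping each term to the offsets of the events that contain it.

```python
import re
from collections import defaultdict

# Hypothetical sample events, stored back to back like a tiny raw data file.
events = [
    "ERROR disk full on host01",
    "INFO login ok on host02",
    "ERROR login failed on host01",
]

def build_index(events):
    """Write events into one raw blob and map each term to event offsets."""
    raw = b""
    index = defaultdict(list)
    for event in events:
        offset = len(raw)  # byte position where this event starts
        raw += event.encode("utf-8") + b"\n"
        # Assumed tokenizing rule: split on anything that is not a word character.
        for term in set(re.split(r"[^A-Za-z0-9_]+", event.lower())):
            if term:
                index[term].append(offset)
    return raw, dict(index)

def lookup(raw, index, term):
    """Fetch matching events by jumping straight to the stored offsets."""
    results = []
    for offset in index.get(term.lower(), []):
        end = raw.index(b"\n", offset)
        results.append(raw[offset:end].decode("utf-8"))
    return results

raw, index = build_index(events)
print(index["error"])               # offsets of events containing "error"
print(lookup(raw, index, "login"))  # events found via the index
```

Note that the index stores every term of every event, which is why it can end up larger than the (compressed) raw data it points into.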

Searches using tstats use only the tsidx files, i.e. Splunk does not have to read, unzip and search the journal.gz files to produce the search results, which is typically orders of magnitude faster.

Try it for yourself! The following two searches are semantically identical and should return the exact same results on your Splunk instance. Pick "Previous week" from the time range picker and then take a look at how long each one takes in the Job Inspector once they are complete.

index=_internal | stats count by sourcetype

Equivalent tstats search:

| tstats count where index=_internal by sourcetype 

In my environment, the first one takes 115s, while the tstats search completes in 4s.

Note that this only works for indexed fields, not for fields extracted at search time. By default, the indexed fields include _time, source, host and sourcetype.

Hope that makes sense.
BTW, you can use the walklex command to take a look at what's in a given tsidx file.


Splunk Employee

@aoliullah - Did one of the answers below help clarify what a tsidx file is? If yes, please click “Accept” below the best answer to resolve this post and upvote anything that was helpful. If no, please leave a comment with more feedback. Thanks.

0 Karma


Influencer

There was a great talk at .conf2016 related to this. The slides are here: https://conf.splunk.com/files/2016/slides/fields-indexed-tokens-and-you.pdf

0 Karma

SplunkTrust

The idx part is for "index". The ts part is "time series", but the whole thing is generally synonymous with "index file".

http://docs.splunk.com/Splexicon:Indexfiles

An index file contains keys, and pointers to data.

If an index file exists for the fields in the data that you are looking for, then you can use the tstats command to gather information that is accessible through that index. If no index file exists for that data, then tstats won't work.

So, for example, suppose that you have your system set up, for a particular index and sourcetype, to index the source IP address into a field called src_ip, and that you want a quick count of all the traffic on a particular day from the IP addresses 123.123.123.1-50. Since that field is indexed, you can use tstats in summary mode instead of stats, which will be MUCH more efficient.
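
As a rough illustration of why that is cheaper, here is a toy Python sketch (not Splunk code). The sample events, the helper names build_lexicon and count_by_indexed_field, and the "field::value" term layout are assumptions made for this example. The point is that the counts come from the index entries alone, and the raw event text is never read.

```python
from collections import defaultdict

# Hypothetical events: (raw text, indexed fields). Only src_ip is indexed here.
events = [
    ("GET /a from 123.123.123.7", {"src_ip": "123.123.123.7"}),
    ("GET /b from 123.123.123.7", {"src_ip": "123.123.123.7"}),
    ("GET /c from 123.123.123.50", {"src_ip": "123.123.123.50"}),
    ("GET /d from 123.123.123.51", {"src_ip": "123.123.123.51"}),
    ("GET /e from 10.0.0.1", {"src_ip": "10.0.0.1"}),
]

def build_lexicon(events):
    """Map each 'field::value' term (assumed layout) to a list of event ids."""
    lexicon = defaultdict(list)
    for event_id, (_raw, fields) in enumerate(events):
        for name, value in fields.items():
            lexicon[f"{name}::{value}"].append(event_id)
    return dict(lexicon)

def count_by_indexed_field(lexicon, field, wanted):
    """Count events per field value using only the lexicon, never the raw text."""
    prefix = field + "::"
    counts = {}
    for term, postings in lexicon.items():
        if term.startswith(prefix):
            value = term[len(prefix):]
            if value in wanted:
                counts[value] = len(postings)
    return counts

# The address range from the example above: 123.123.123.1 through .50.
wanted = {f"123.123.123.{n}" for n in range(1, 51)}
lexicon = build_lexicon(events)
counts = count_by_indexed_field(lexicon, "src_ip", wanted)
print(counts)
```

A field that only exists at search time would have no entry in the lexicon at all, so this kind of lookup would have nothing to count, which matches the point that tstats needs the field to be indexed.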

0 Karma