Solved: What exactly is a tsidx file?

aoliullah · ‎01-30-2017

What exactly is a tsidx file? Can someone explain, please? I don't quite understand the definition:

"A tsidx file associates each unique keyword in your data with location references to events(??), which are stored in a companion rawdata file"

I ask this in relation to the tstats command, whose documentation states: "Use the tstats command to perform statistical queries on indexed fields in tsidx files".

Can someone explain this in the context of tstats?

s2_splunk · ‎01-30-2017

tsidx (time series index) files are created as part of the indexing pipeline. Incoming data is parsed into terms (think 'words' delimited by certain characters), and each term is stored along with an offset (a number) that represents the location in the rawdata file (journal.gz) where the event data is written.
It works like the index in a book, except that it is a complete index rather than a subset. If every word in a book were in the index, the index would be far larger than the book itself, which is exactly what happens in Splunk. If you look at an index bucket directory on disk, you will find that the size of the index and other metadata files often exceeds the size of the compressed raw data.
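
To make the book-index analogy concrete, here is a toy Python sketch of a term-to-offset index. It is only an illustration under simplifying assumptions: the event list, the tokenizing rule, and the helper names `build_index` and `search` are all made up for this example, and it says nothing about Splunk's actual on-disk tsidx format.

```python
import re
from collections import defaultdict

# Toy stand-in for the rawdata journal: one event per entry.
# In this sketch an event's "offset" is simply its position in the list.
raw_events = [
    "ERROR disk full on host1",
    "INFO backup finished on host2",
    "ERROR backup failed on host2",
]

def build_index(events):
    """Map each lowercased term to the sorted offsets of the events containing it."""
    index = defaultdict(set)
    for offset, event in enumerate(events):
        # Split on anything that is not a letter, digit, or underscore (an assumed rule).
        for term in re.split(r"[^A-Za-z0-9_]+", event.lower()):
            if term:
                index[term].add(offset)
    return {term: sorted(offsets) for term, offsets in index.items()}

def search(index, events, *terms):
    """Return the events containing all given terms, located through the index."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    hits = set.intersection(*postings) if postings else set()
    return [events[o] for o in sorted(hits)]

index = build_index(raw_events)
print(index["error"])                                # offsets of events containing "error"
print(search(index, raw_events, "error", "backup"))  # events containing both terms
```

Note that the index holds every term of every event, which is why a complete index can end up larger than the (compressed) data it points into.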

Searches using tstats only use the tsidx files, i.e. Splunk does not have to read, unzip and search the journal.gz files to create the search results, which is typically orders of magnitude faster.

Try it for yourself! The following two searches are semantically identical and should return exactly the same results on your Splunk instance. Pick "Previous week" from the time range picker and then look at how long each one takes in the Job Inspector once it completes.

index=_internal | stats count by sourcetype

Equivalent tstats search:

| tstats count where index=_internal by sourcetype

In my environment, the first one takes 115s; the tstats search completes in 4s.

Note that this only works for indexed fields, not for fields extracted at search time. By default, those are _time, source, host and sourcetype.
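
As a rough illustration of why a count over an indexed field never needs the raw data, here is a toy Python sketch. Indexed fields show up in the tsidx lexicon as field::value terms; everything else here (the sample events, the dictionary layout, and the helper name `count_by_indexed_field`) is an assumption made for the example, not Splunk's real implementation.

```python
# Hypothetical events as (sourcetype, raw text) pairs.
events = [
    ("splunkd", "INFO component started"),
    ("splunkd", "WARN queue blocked"),
    ("splunk_web_access", "GET /en-US/app/search"),
]

# Toy "tsidx": each indexed field value becomes a field::value term
# that points at the offsets of the matching events.
tsidx = {}
for offset, (sourcetype, _raw) in enumerate(events):
    tsidx.setdefault(f"sourcetype::{sourcetype}", []).append(offset)

def count_by_indexed_field(index, field):
    """Count events per value of an indexed field using only the index.

    The raw event text is never consulted: the answer is just the length
    of each term's offset list.
    """
    prefix = field + "::"
    return {
        term[len(prefix):]: len(offsets)
        for term, offsets in index.items()
        if term.startswith(prefix)
    }

counts = count_by_indexed_field(tsidx, "sourcetype")
print(counts)
```

A field that was only extracted at search time has no such terms in the index, so the same lookup returns nothing for it.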

Hope that makes sense.
BTW, you can use the walklex command to take a look at what's in a given tsidx file.


aaraneta_splunk · ‎02-13-2017

@aoliullah - Did one of the answers below help clarify what a tsidx file is? If yes, please click “Accept” below the best answer to resolve this post and upvote anything that was helpful. If no, please leave a comment with more feedback. Thanks.

jplumsdaine22 · ‎01-30-2017

There was a great talk at conf2016 related to this; the slides are here: https://conf.splunk.com/files/2016/slides/fields-indexed-tokens-and-you.pdf

DalJeanis · ‎01-30-2017

The idx part stands for "index" and the ts part for "time series", but the whole thing is generally synonymous with "index file".

http://docs.splunk.com/Splexicon:Indexfiles

An index file contains keys, and pointers to data.

If an index file exists for the fields in the data that you are looking for, then you can use the tstats command to gather information that is accessible through that index. If no index file exists for that data, then tstats won't work.

So, for example, let's suppose that you have your system set up, for a particular index and sourcetype, to index the source IP address into a field called src_ip. Let's suppose you want a quick count of all the traffic on a particular day from a series of IP addresses 123.123.123.1-50. Since you have an index on that field, you can use tstats in summary mode instead of stats, which will be MUCH more efficient.
