Why is
| tstats count where index=* by sourcetype
so much faster than
index=* | stats count by sourcetype
?
tstats is faster than stats because tstats only looks at the indexed metadata (the .tsidx files in the buckets on the indexers), whereas stats works off the data (in this case the raw events) returned by the command before it.
Since tstats can only look at the indexed metadata, it can only search fields that are in that metadata. By default, this only includes index-time fields such as sourcetype, host, source, _time, etc.
You can do this:
| tstats count by index sourcetype source
But you can't do this:
| tstats count where status>200 by username
since status and username are not index-time fields (they are search-time).
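To filter on search-time fields like these, you have to fall back to a normal search over the raw events, for example (assuming status and username are extracted at search time):
index=* status>200 | stats count by username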
tstats can run on index-time fields created by any of the following methods:
- an accelerated data model
- a namespace created by the tscollect search command
- custom index-time fields defined in fields.conf, props.conf, and transforms.conf (see the sketch below)
- INDEXED_EXTRACTIONS in props.conf for structured data like CSV
Generally, I recommend using accelerated data models.
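As a minimal sketch of the fields.conf/props.conf/transforms.conf route (the sourcetype name, field name, and regex here are hypothetical, so adjust them for your data), a custom index-time field looks like this:
# props.conf
[my_sourcetype]
TRANSFORMS-status_indexed = status_indexed
# transforms.conf
[status_indexed]
REGEX = status=(\d+)
FORMAT = status::$1
WRITE_META = true
# fields.conf
[status]
INDEXED = true
With that in place, the earlier | tstats count where status>200 by username example should then work (provided username gets the same treatment), because the fields are now index-time.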
References:
http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/tstats
http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/tscollect
http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Acceleratedatamodels
http://docs.splunk.com/Documentation/Splunk/latest/Indexer/Indextimeversussearchtime
http://docs.splunk.com/Documentation/Splunk/latest/Data/Configureindex-timefieldextraction
http://docs.splunk.com/Splexicon:Tsidxfile
============================ EDIT #2 =============================
For even more in-depth material, see:
2016 talk by David Veuve: http://conf.splunk.com/sessions/2016-sessions.html#search=tstats&
2017 talk again by David Veuve: http://conf.splunk.com/sessions/2017-sessions.html#search=tstats&
2017 talk by me: http://conf.splunk.com/sessions/2017-sessions.html#search=speed%20up&
============================ EDIT #1 =============================
Since this seems to be a popular answer, I'll go into even more detail:
For our example, let's use the out-of-the-box data model called "Splunk's Internal Server Logs - SAMPLE" at http://127.0.0.1:8000/en-US/app/search/data_model_editor?model=%2FservicesNS%2Fnobody%2Fsearch%2Fdat...
First, run a simple tstats on the DM (it doesn't have to be accelerated) to make sure it's working and you get some results:
| tstats count from datamodel=internal_server
Note that I use the DM filename internal_server (i.e., the Object ID), not the "pretty" name. If the DM isn't accelerated, then tstats will translate to a normal search command, so the above command will run:
index=_internal source=*scheduler.log* OR source=*metrics.log* OR source=*splunkd.log* OR source=*license_usage.log* OR source=*splunkd_access.log* | stats count
The translation is defined by the base search of the DM (under "Constraints").
You can verify that you'll get the exact same count from both the tstats and the normal search. Make sure you use the same fixed time range (i.e., from X to Y). Don't use "Last X minutes", since the time range will be different each time you run the search ad hoc.
If the data model is accelerated, then new *.tsidx indexed files are created on the indexers at $SPLUNK_DB$/<index_name>/datamodel_summary/<bucket_id>_<indexer_guid>/<search_head_guid>/DM_<app>_<data_model_name>.
If you wanna "see" what's inside these tsidx files, you can do something like:
./splunk cmd walklex /opt/splunk/var/lib/splunk/_internaldb/datamodel_summary/53_FC88CEC5-A07C-40AB-AC9A-6098C3336C42/FC88CEC5-A07C-40AB-AC9A-6098C3336C42/DM_search_internal_server/1457686104-1457046881-3297202533729992781.tsidx "*"
Anyway, tstats basically accesses and searches these special, DM-created tsidx files. You tell tstats which DM to use with the from datamodel=internal_server clause.
Now accelerate the internal_server DM if you haven't already. Pick a window big enough, like 7 days, and search the last 24 hours for testing. Once the DM has finished accelerating to 100%, try running this:
| tstats count where index=_internal by sourcetype source
And note how much faster this is compared to
index=_internal | stats count by sourcetype source
If you don't see a big difference then either try increasing the time range/acceleration window or create your own DM on a much chattier data source/index.
Remember that everything has a cost. In this case, you're trading disk space for faster searches. That's why the DM status in the UI tells you how much disk space the DM acceleration takes up; the size is how big the .tsidx files are across all indexers. But this is still a great trade. Disks are cheap; CPUs are not.
As mentioned before, another drawback of tstats is that you can only access the default index-time fields or the custom index-time fields created by DM acceleration. And if you forgot to add a field to the DM, then you must stop/delete the DM acceleration, modify the DM, and re-accelerate. On a big DM (like one from a large Splunk Enterprise Security environment), this could take hours or days depending on how much data needs to be accelerated and how busy the servers are.
Note that tstats is like stats but more "SQL-like": it can take a where and a by clause too. For example:
| tstats count from datamodel=internal_server where source=*scheduler.log
Which happens to be the same as
| tstats count from datamodel=internal_server where nodename=server.scheduler
because this DM has a child node under the Root Event. The name, once again, comes from the "Object ID", not the pretty name label (i.e., use summaryindexing, not "Summary Indexing Searches").
Also note that every field you want to reference in the DM must be prefixed with the node's Object ID, server, except for the 4 default fields in the DM: _time, host, source, and sourcetype. So this won't work:
| tstats count from datamodel=internal_server by name current_size_kb
name and current_size_kb aren't among the 4 default DM fields, so they must be written as server.name and server.current_size_kb. This is also the main reason I choose very short (usually one-letter) node names: it can become very annoying to write server. all the time.
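So the working version of the search above is:
| tstats count from datamodel=internal_server by server.name server.current_size_kb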
One last thing worth mentioning is tstats performance. We all know tstats on an accelerated DM is fast, since it's mostly reading from disk and minimizing computation, but tstats isn't very good at returning many, many results. So although you can do this:
| tstats count from datamodel=foo by a.apple a.pear a.orange _time span=1s
you really shouldn't. tstats can't return raw events, and trying to "trick" it into returning raw events by using span=1s goes against its design principle. You can be clever and get raw events if you use tstats inside a subsearch, like this:
index=data [| tstats count from datamodel=foo where a.name="hobbes" by a.id a.user | rename a.* as * | fields - count]
So basically, tstats is really good at aggregating values and reducing rows. tstats will perform as badly as a normal search (or worse) if your search isn't trying to reduce. For example, if you have 10 million rows in a DM and your tstats is grouping everything by _time span=1s and returning 8 million rows, then that's going to be a slow search even if your tstats is searching an accelerated DM. But if your tstats is doing something like avg(all.foo) by all.group and returning only 1,000 rows (while still searching 10 million events), then it'll be blazing fast, since it's reducing.
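Written out in full, that reducing search would look something like this (the data model name mydm is hypothetical; all is its root node's Object ID):
| tstats avg(all.foo) from datamodel=mydm by all.group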
Also note that if you do by _time in tstats, then tstats will automatically group _time based on the search time range, similar to timechart (i.e., if you search the last 24 hours, then the bucket/group size will be 30 minutes). You also can't go more granular than 1 second, so all sub-second events will be grouped together.
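If you want a specific bucket size instead of the automatic one, set span explicitly, e.g. (reusing the hypothetical foo DM from the example above):
| tstats count from datamodel=foo by _time span=1h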
Lastly, tstats has a prestats=t option, but that's another lesson for another day (prestats is like si-commands in Splunk). If you really want to know more, check out my 2017 conf talk, "Speed up your search!".
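As a quick taste (again using the hypothetical foo DM): prestats=t makes tstats emit the intermediate results that downstream reporting commands like timechart expect, instead of final rows:
| tstats prestats=t count from datamodel=foo by _time span=1h | timechart span=1h count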
Hopefully that is enough to get you started. If you get stuck, troubleshoot your tstats by removing one clause at a time until you get results again (like removing the by and where clauses). Eventually you'll end up with the most basic tstats command that will give you results:
| tstats count from datamodel=foo
then work backward again until you spot where you stop getting results. Common pitfalls include:
- forgetting the node-name prefix: it's server.cpu_seconds or all.foo, not just cpu_seconds or foo (but you can use just sourcetype or source)
- the by clause not including events where a field is null (a common pitfall in stats too); one way to remedy that is to create an evaluated field in the DM and do something like foo=coalesce(foo, "NULL")
Good luck and may your searches be fast!
Hi,
these are all great answers. I'd like to add some more explanations and a warning 🙂
As mentioned before, Splunk extracts some metadata (source, sourcetype, time!, host, ...) at index time.
This information is stored as a key::value pair, so it has a meaning, like "sourcetype::access_combined".
In addition, Splunk will extract tokens (or words) from all the data and store them as a full-text index (like the index in a book).
But this time without any context (no information about the meaning of the data)!
Everything is stored in a TSIDX file, as skawasaki mentioned.
Using an accelerated data model, you tell Splunk to extract the information of interest as a KEY::VALUE pair, and now you have added CONTEXT to the value. This information is stored in an additional TSIDX file.
Using tstats, Splunk does not need to parse the events anymore; it just reads the KEY::VALUE pairs.
You can compare TSIDX to a columnar database.
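As a simplified, conceptual illustration (not the actual on-disk format), the lexicon in a TSIDX file maps terms to the events that contain them, with key::value terms carrying context and bare tokens carrying none:
sourcetype::access_combined -> events 1, 2, 3
host::web01 -> events 1, 3
GET -> events 1, 2 (bare token, no context)
404 -> event 2 (bare token, no context)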
In addition, the link between the value and the original event is lost to some extent.
That is why you can't use tstats to find a single event; tstats is simply meant for statistics use cases!
On the other hand, it's lightning fast compared to stats.
One big caveat is that you trade speed for storage space!
There is a rule of thumb that Splunk needs about 50% of the raw data size when the compressed raw events are stored in a bucket together with the indexed data (journal.gz + .TSIDX + some metadata).
Using a data model, this is usually not a big deal, because you normally would not create a data model in which all the information of an event is extracted into fields and indexed again.
And here is why you have to be careful if you use INDEXED_EXTRACTIONS:
because INDEXED_EXTRACTIONS is used with already-structured information (CSV, TSV, PSV, JSON), Splunk will extract ALL the data at index time and create some sort of COPY of your data. Because TSIDX files are not compressed, the storage needed for data ingested using INDEXED_EXTRACTIONS may end up larger than the uncompressed raw data!
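As a minimal sketch (the sourcetype name my_csv_data is hypothetical), enabling this for a CSV source looks like the following in props.conf; every column of every row then becomes an indexed key::value pair, which is exactly why the TSIDX files can end up larger than the uncompressed raw data:
[my_csv_data]
INDEXED_EXTRACTIONS = csv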
There are some interesting articles in the Splunk docs and nice blog posts:
http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/CommandlinetoolsforusewithSupport
http://www.duanewaddle.com/splunk-bucket-lexicons-and-segmentation/
http://blogs.splunk.com/2013/10/18/iis-logs-and-splunk-6/
HTH,
Holger
There's a fourth option: use INDEXED_EXTRACTIONS in props.conf for structured data; those fields will also be tstats-able.
Thanks. I've added INDEXED_EXTRACTIONS.
A few sentences from me on question 2, since question 1 has already been answered.
You can use the tstats command for better performance. To use tstats, you need one of the following:
1. A namespace created with the tscollect command
2. An accelerated data model (from 6.0 onwards; same idea as tscollect)
3. An existing job ID (search artifacts)
When you use tscollect or data model acceleration, there will be a scheduled job continuously updating the new tsidx files in the background (+1 to your concurrent searches), and the new tsidx files add to your disk space usage.
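As a minimal sketch of option 1 (the namespace name mynamespace is arbitrary): first build the namespace with tscollect, then query it with tstats:
index=_internal | fields sourcetype host | tscollect namespace=mynamespace
| tstats count from mynamespace by sourcetype
Note that tscollect has largely been superseded by accelerated data models (see the comment below), so treat this as illustration rather than a recommendation.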
I came across this question in my head the other day, so I tried doing some research through the documentation and found these pages useful:
http://docs.splunk.com/Splexicon:Tsidxfile
http://docs.splunk.com/Documentation/Splunk/6.2.0/SearchReference/Tscollect
From my understanding, tsidx files contain the keywords from the raw data in the indexes they reference, so tstats performs stats queries on the indexed fields in the tsidx files rather than going directly through the raw data, which is less efficient. For your second question, hopefully someone more advanced than me can answer that for ya... like @martin_mueller? 😛 I've seen him recommend the tstats method lately to improve search performance.
Recommend staying away from tscollect - the docs might be a useful reference but the command itself has been (very happily) replaced by data models and acceleration.