Hello,
I would like to know if there is a way to see/check tsidx files size and raw data file size.
I would like to reduce tsidx file size to improve search performance.
Thanks for your support
Nordine
Hello,
Indeed, my concern is about performance for both indexing and search.
Because I can see, from time to time, that the _indextime value gets far from the event timestamp, with a gap of 5 to 100 seconds, while the average is 30 ms (200 ms at worst).
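For reference, a gap like this can be measured with something like the following (the index name is just an example):
index="my_index" earliest=-60m
| eval index_lag = _indextime - _time
| stats avg(index_lag) AS avg_lag_s max(index_lag) AS max_lag_s perc95(index_lag) AS p95_lag_s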
Actually, we are indexing JSON-formatted logs, and yes, using indexed extractions = json (set in the sourcetype).
So, I guess all fields are indexed automatically.
Then, checking with walklex:
| walklex index="my_index" type=field
| search NOT field=" *"
| stats list(distinct_values) by field
It shows, for the last 60 min:
events: 10272
Number of fields: 403
And some of them have unique values, like:
field: sessionid
list(distinct_values): 193249, 220320, 204598, 201715, 214656, 183875, 195165, 196683, 221079, 204274, 215453, 186199, 181808, 198200, 178018, 192400, 184038, 176133, 205139, 205432, 186822, 174164, 196244, 185719, 179251, 197758, 203770, 190584, 178399
"avoiding indexed fields is sound as a general rule of thumb"
If I understand well, the best approach would be to avoid indexing fields with a large number of unique values, and to index only fields with a low number of possible values (success/failed, green/yellow/red, ...).
Then, in my case, what would be a better configuration to reduce the number of indexed fields, and to index only the low-cardinality fields, as you mentioned?
Again, thank you all for your time/support
Regards
Well... There are multiple things here.
1. _time vs _indextime - this doesn't necessarily have anything to do with the performance of the indexing pipeline. There can be multiple reasons for it, from pipelines clogging to drifting clocks on the sources. It would need more detailed troubleshooting to find the reason behind it.
2. As a rule of thumb, indexed extractions are bad. While sometimes they are the "only way" using built-in Splunk mechanisms (for example, ingesting CSVs with a variable order of columns), it is generally better to have the data pre-processed with an external tool to transform it into a format more suitable for normal indexing.
3. Since you are talking about json data with indexed extractions, I suspect you're using the built-in _json sourcetype, which should not be used in production.
The way to go would be to define your own sourcetype using kv_mode=json but not using indexed extractions. You can't selectively not index some fields when using indexed extractions.
And single indexed fields are very tricky with structured data. I wouldn't recommend it.
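For reference, a minimal props.conf sketch of such a sourcetype could look like this (the sourcetype name is only an example; keep your usual line-breaking and timestamp settings for the feed):
[my_custom:json]
INDEXED_EXTRACTIONS = none
KV_MODE = json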
Hello PickleRick,
for point 3, you mean that by using kv_mode=json, unlike indexed extractions, I will be able to "selectively not index some fields". Would you mind giving me some more details, or examples of how I can do this?
On my side, I've checked the source type which is used, and indeed:
indexed extractions = json
and in advanced tab:
kv_mode = none
So, you recommend to set:
indexed extractions = none
and in advanced tab:
kv_mode = json
Can you confirm this is the right way?
Then, how can I exclude some specific fields from automatic extraction?
Thanks a lot
Regards
Nordine
I mean that if you're using indexed extractions you can't selectively choose which fields are getting indexed as indexed fields and which are not. With indexed extractions Splunk extracts and indexes all fields from your json/csv/xml/whatever as indexed fields.
With KV_MODE=json (or KV_MODE=auto, but it's better to be precise here so that Splunk doesn't have to guess), Splunk doesn't index fields as indexed fields unless they are explicitly extracted as indexed fields (which would be difficult/impossible with structured data).
Anyway, the best practice about handling json data, unless you have a very very good reason to do otherwise, is to use search-time extractions, not indexed extractions.
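To illustrate (index and sourcetype names are just examples): with KV_MODE=json the fields are still extracted automatically at search time, without any per-field configuration, so a search like this keeps working:
index="my_index" sourcetype="my_custom:json"
| stats count by sessionid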
Sorry, I'm not sure I get it: "Splunk doesn't index fields as indexed fields, unless they are explicitly extracted as indexed fields". How is this possible with Splunk Cloud?
If I understand well, with kv_mode=json, even if our logs are json formatted, I will have to extract, one by one, all the fields I need, using the field extractions feature.
The fields will be then extracted at search time, and not indexed.
Right?
Then, wouldn't there be a risk to search performance, if all fields are extracted at search time?
Also, the usage of tstats will need to be reviewed for all our saved searches/dashboards, etc.
Am I right?
Thanks
BR
Nordine
Thanks for your replies,
indeed, I was using dbinspect to check bucket sizes.
But I need the tsidx vs raw data file sizes.
Do you know if Splunk Support, as they should have access to the file system, could somehow answer this need?
Thanks
Regards
Splunk Support can get access to the file system if they don't already have it, but they will no doubt ask "Why do you *need* to know?".
I, too, wonder why you *need* to know. Splunk Cloud is a service and how that service is provided shouldn't matter as long as you get what you pay for.
What problem are you trying to solve?
The dbinspect command will give you the sizes of each bucket, but there's nothing I know that will break that down further.
To reduce the size of tsidx files, do fewer index-time field extractions (especially JSON) and only accelerate the datamodels you use.
You cannot directly access or view tsidx and raw data file sizes in Splunk Cloud, as file system access is restricted. However, you can estimate index storage usage (including tsidx and raw data) using the dbinspect command.
| dbinspect index=<your_index>
| stats sum(rawSize) as total_raw_size sum(sizeOnDiskMB) as total_disk_size_MB
| eval total_raw_size_MB=round(total_raw_size/1024/1024,2)
| table total_raw_size_MB total_disk_size_MB
This provides an estimate of the raw data size and the total disk usage (which includes tsidx and other metadata).
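If it helps, the same numbers can also be split by bucket state (hot/warm/cold) to see where the space is going, for example:
| dbinspect index=<your_index>
| stats sum(sizeOnDiskMB) as disk_MB by state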
I don't think there is anything you can change in Splunk Cloud to reduce tsidx size, but I'm also confused as to why you want to. I'd argue that an increased number of indexed fields (which would increase tsidx sizes) would *improve* search performance if used with things like tstats.
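For example (index and field names taken from this thread, assuming sessionid remains an indexed field), an aggregation like this runs against the tsidx files only, without touching the raw events:
| tstats count where index="my_index" by sessionid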
Actually, indexing too many fields can have a negative impact on performance, since you're bloating your tsidx files heavily. Remember that for indexed fields you're creating a separate lexicon entry for each key-value pair. That might not be that heavy for fields whose number of possible values is low (like true/false, accept/reject/drop and so on), but for indexed extractions from json data, where field names can be long and values can be unique long strings, you're growing your lexicon greatly. So even when you're using efficient searching methods (I don't know exactly what Splunk uses internally, but I'd expect something akin to b-trees or radix trees), the amount of data you have to dig through increases due to the sheer size of the set you're searching.
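To illustrate with the sessionid field from this thread: each unique value becomes its own field::value term in the lexicon, roughly like
sessionid::193249
sessionid::220320
sessionid::204598
...
so a high-cardinality field adds one lexicon entry per distinct value in each bucket.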
@PickleRick I absolutely agree, although maybe I kept my justification too short!
Having high cardinality in indexed fields can be a big problem and would not be advised (this annoys me about INDEXED_EXTRACTIONS=json) - but indexed fields used well can drastically improve performance if users update searches to use tstats.
Side note - Have you ever noticed that AWS logs often create high cardinality indexed fields for things like S3 objects because of the double :: in the string? e.g. arn:aws:s3:::my-example-bucket/images/photo.jpg becomes arn:aws:s3 with value ":my-example-bucket/images/photo.jpg" and can end up being pretty high cardinality!
@nordinethales Has something been brought to your attention that leads you to think you have large tsidx files which are causing performance issues? Are you defining indexed fields at ingest time for some of your data (e.g. INDEXED_EXTRACTIONS=json or other approaches, e.g. INGEST_EVAL)?
If you're planning to reduce the number of indexed fields then it's worth ensuring no searches are currently using them.
@livehybrid I knew that you probably knew that but it's worth posting so that the knowledge is spread 😉
Yes, the rule about avoiding indexed fields is sound as a general rule of thumb, but of course in some well-thought-out cases indexed fields can bring tremendous gains in search speed. Especially if you're not searching for particular values but just want to get aggregations, which you can achieve with tstats. That's one of the valid reasons for using indexed fields sometimes.
And yep, due to how they are internally represented, there is no way to distinguish between an indexed field and a simple indexed term with :: in the middle. I haven't worked with AWS logs but I did notice that in other cases.