Getting Data In

CSV data vs key-value data. Which is faster for performance?

koshyk
Super Champion

Hi,
We have an incoming custom dataset which consumes approx 700GB a day and is currently used for CIM. It is currently in key-value format. There is a proposal to change it to CSV, which reduces the dataset by approx 60%, to 280GB a day. The data savings are quite significant. We know the client is fixed, so lack of flexibility is NOT an issue.

Existing format in every line

service="retail" source_port="514" dest_port="22" destination_ip="1.2.3.4" source_ip="7.2.3.4" 

Proposed format in every line

"retail",514,22,"1.2.3.4","7.2.3.4"

The key question is: from a performance point of view, would there be an impact if we use CIM on the CSV format? Also, would it have a bad impact on tsidx creation? The data comes in as syslog, and files are rotated at 100MB (if it matters). I've tried with a smaller subset on my test machine, but I couldn't find any change in performance with a small amount of data, so I would like to hear from others' experience.

0 Karma
1 Solution

Simeon
Splunk Employee

Performance can be measured in different ways, and it covers both indexing and search. You could run your tests again and use the Job Inspector to see the exact differences, but I would first ask why you would want to remove the key-value pair fidelity. We typically encourage people to add field names so the data is easier on the eyes, and the performance difference is not noticeable to the user. If your search is slow, it's still gonna be slow regardless of CSV or key-value pair format.

From an indexing perspective, you would save on size with CSV, and there is optimized/automated field extraction, IIRC. You are essentially saving the extra bytes by removing the field name from each key-value pair. Many years ago, people would switch to CSV to save on licensing, but you lose fidelity and searchable terms.

From a search perspective, it kinda depends. If you have terms (field names) you need to search on, like using service or source_port as a keyword, the CSV format won't be as optimized, as I don't believe the field name exists in the same way in the tsidx file (would have to double check this). I would imagine an apples-to-apples comparison of a "stats count" by one of the fields would return slightly different results, potentially slightly faster in the CSV format, as extracting the actual value you count from rawdata might be faster. If you consider counting by the last field in your first example line, source_ip, I would imagine that extracting that field from the key-value format will take much longer than via the CSV method, since with CSV we just look for the last comma and return that field, compared to running a regex for source_ip and returning its value. I'll reiterate, it really depends what you care about and the type of search.
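As a rough sketch of the apples-to-apples comparison described above (the index and sourcetype names here are hypothetical, not from the thread), you could run the same aggregation against each format over the same time range:

index=test sourcetype=retail_kv | stats count by source_ip
index=test sourcetype=retail_csv | stats count by source_ip

Then open the Job Inspector on each job and compare the execution costs (components such as command.search.kv and command.search.rawdata) to see where the extraction time differs.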


sloshburch
Splunk Employee

@koshyk - you got a lot of brilliant engineers helping on this thread. Let us know what else you need, or, if one of the answers helped, go ahead and accept it so we know you're all set.

0 Karma

koshyk
Super Champion

Done it, mate. Thanks.

0 Karma

ledion
Path Finder

key=value, from a performance point of view, is not the best format - it is the easiest to write logs in, has great readability, and compresses really well, but of course it is really wasteful from a license point of view. As the developer of most of the search-time extractions, I am actually surprised that you are not seeing even better search-time performance gains (I'd expect them to be similar to the change in data size)... but obviously a lot depends on what searches you've tested. One thing to note is that with .csv files your fields become indexed fields, and thus your index size (.tsidx files) on disk might suffer (depending on the cardinality of your fields). You could avoid this by not using index-time CSV parsing and instead using delimiter-based KV at search time - if the file format doesn't change (i.e. the headers stay the same), then delimiter-based KV has few or no drawbacks.
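A minimal sketch of that suggestion (the sourcetype name retail_csv is hypothetical; the field names come from the question): leave INDEXED_EXTRACTIONS off so nothing extra lands in the tsidx files, and define a delimiter-based search-time extraction instead.

props.conf:

[retail_csv]
# no INDEXED_EXTRACTIONS = csv here, so the fields are not written to the .tsidx files
REPORT-retail_fields = retail_csv_delims

transforms.conf:

[retail_csv_delims]
# split each event on commas and name the resulting fields in order
DELIMS = ","
FIELDS = service, source_port, dest_port, destination_ip, source_ip

Because the extraction happens at search time, index size stays unaffected by field cardinality, at the cost of a little extraction work per search.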

0 Karma

Simeon
Splunk Employee

carnality....

0 Karma

ledion
Path Finder

there, fixed it 🙂

0 Karma


koshyk
Super Champion

It has been a while since I posted the question, so I had to test it myself. We are now following format (2), the one WITHOUT key-value pairs, and we actually got performance improvements of about:

=> 5-10% in indexing (maybe because of the reduction in the size of the event itself)
=> 20-25% in search time with our own extraction logic. This was a shock to me as well, as I thought key-value was better.

0 Karma

sloshburch
Splunk Employee

Interesting stats. To your own point, the indexing gain could just track the reduced size of the data. So is the indexing improvement normalized per byte? Not that it needs to be; at the end of the day it's just how fast you can get your answer #amiright? lol

0 Karma

Simeon
Splunk Employee

@koshyk - if you could share the extraction logic (regex) and the type of search, we could probably tell you why performance is improved.

0 Karma

sloshburch
Splunk Employee

FWIW, I see most customers moving to JSON from the semantic key=value style, and definitely no one switching to CSV or something so restrictive.

I want to highlight @Simeon's key point: if others who are not familiar with the data need to see raw events, then having a more descriptive format will be a win (whereas CSV is NOT self-descriptive).

0 Karma

rfaircloth_splu
Splunk Employee

The key proposition for Splunk is "native" or raw. If the source is natively producing JSON, XML, or KV, the best path, all things considered, is that raw form. Pre-translation, i.e. schema on write, is very high risk and is the Achilles' heel of other solutions: identifying and rectifying data problems caused by translation is difficult and often results in a failure to monitor. If a new solution such as a business application were being implemented today, and that solution was to log in a performant way to, for example, Kafka or SNS, I would use a minified JSON format with a schema indicator. An example of this in use today is AWS CloudWatch Events.

One example of what not to do is wrapping a text message in JSON: packaging a Cisco ASA event inside JSON, for example, requires escaping characters. Parsing fields out of fields inside JSON is very difficult, while a native JSON format is very easy to work with.
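To illustrate both points (these event shapes are hypothetical, not from the thread), a minified JSON event with a schema indicator might look like:

{"schema":"retail-1","service":"retail","source_port":514,"dest_port":22,"destination_ip":"1.2.3.4","source_ip":"7.2.3.4"}

whereas wrapping a raw Cisco ASA line inside JSON forces escaped quotes and leaves you parsing fields out of a string field:

{"schema":"asa-raw","raw":"%ASA-4-106023: Deny tcp src outside:7.2.3.4/514 dst inside:1.2.3.4/22 by access-group \"outside_in\""}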

0 Karma

sloshburch
Splunk Employee

Good highlight! Don't rewrite things to transform and compromise the original data - use its raw form! If you are creating something new, then this thread becomes better guidance. Thanks, @rfaircloth!

0 Karma