Getting Data In

CSV data vs key-value data: which is better for performance?

koshyk
Super Champion

Hi,
We have an incoming custom dataset that consumes approx 700GB a day and is currently used for CIM. It is currently in key-value format. There is a proposal to change it to CSV, which reduces the dataset by approx 60%, to 280GB a day. The data savings are quite significant. We know the client is fixed, so lack of flexibility is NOT an issue.

Existing format in every line

service="retail" source_port="514" dest_port="22" destination_ip="1.2.3.4" source_ip="7.2.3.4" 

Proposed format in every line

"retail",514,22,"1.2.3.4","7.2.3.4"

The key question is: from a performance point of view, would there be an impact if we use CIM on the CSV format? Also, would it have a bad impact on tsidx creation? The data comes in as syslog, and the files are rotated at 100MB (if that matters). I've tried a smaller subset on my test machine, but I couldn't find any difference in performance with a small amount of data, so I would like to hear from others' experience.

0 Karma
1 Solution

Simeon
Splunk Employee

Performance can be measured in different ways, and it covers both indexing and search. You could run your tests again and use the job inspector to see the exact differences, but I would first ask why you would want to remove the key-value pair fidelity in the first place. We typically encourage people to add field names so the events are easier on the eyes, and the performance difference to the user is not noticeable. If your search is slow, it's still gonna be slow regardless of csv or key-value pair format.

From an indexing perspective, you would save on size with csv, and there is optimized/automated field extraction, iirc. You are essentially saving the extra bytes of the field name in each key-value pair (the sample key-value line above is roughly 100 bytes, while the csv version is about 35, which lines up with the ~60% savings). Many years ago, people would switch to csv to save on licensing, but you lose fidelity and searchable terms.

From a search perspective, it kinda depends. If you have terms (field names) you need to search upon, like using service or source_port as a keyword, the csv format won't be as optimized, as I don't believe the term exists in the same way in the tsidx file (would have to double check this). I would imagine an apples-to-apples comparison of a "stats count" by one of the fields would return slightly different results, potentially slightly faster in the csv format, as extracting the value you count from rawdata might be faster. If you consider counting by the last field in your first example line, source_ip, I would imagine that extracting that field from key-value will take much longer than via the csv method, as for csv we just look for the last comma and return that field, compared to running a regex for source_ip and returning that value. I'll reiterate: it really depends what you care about and the type of search.
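For illustration, an apples-to-apples comparison along these lines could be timed with the job inspector (the index and sourcetype names here are made up):

index=netdata sourcetype=retail_kv | stats count by source_ip

index=netdata sourcetype=retail_csv | stats count by source_ip

Comparing the execution cost breakdown of the two jobs in the job inspector (e.g. the command.search.kv and command.search.rawdata components) would show where the extraction cost actually differs.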


sloshburch
Splunk Employee

@koshyk - you got a lot of brilliant engineers helping on this thread. Let us know what else you need, or, if one of the answers helped, go ahead and accept it so we know you're all set.

0 Karma

koshyk
Super Champion

done it mate. thanks

0 Karma

ledion
Path Finder

key=value, from a performance point of view, is not the best format: it is the easiest to write logs in, has great readability, and compresses really well, but of course it is really wasteful from a license point of view. As the developer of most of the search-time extractions, I am actually surprised that you are not seeing even better search-time performance gains (I'd expect gains similar to the change in data size) ... but obviously a lot depends on what searches you've tested. One thing to note is that with .csv files your fields become indexed fields, and thus your index size (.tsidx files) on disk might suffer (depending on the cardinality of your fields). You could avoid this by not using index-time CSV parsing and instead using delimiter-based KV at search time; if the file format doesn't change (i.e. the headers stay the same), delimiter KV has few or no drawbacks.
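For illustration, a search-time delimiter extraction along these lines might look like this (the sourcetype and stanza names are made up):

# props.conf
[retail_csv]
REPORT-retail_fields = retail_csv_delims

# transforms.conf
[retail_csv_delims]
DELIMS = ","
FIELDS = service, source_port, dest_port, destination_ip, source_ip

By contrast, index-time CSV parsing (INDEXED_EXTRACTIONS = csv in props.conf) writes every field value into the tsidx files, which is where the cardinality-driven index growth comes from.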

0 Karma

Simeon
Splunk Employee

carnality....

0 Karma

ledion
Path Finder

there, fixed it 🙂

0 Karma

koshyk
Super Champion

It has been a while since I put up the question, and in the end I had to test it myself. We are now following format (2), which is WITHOUT key-value, and we actually got a performance improvement of about:

=> 5-10% in indexing (maybe because of the reduction in the size of the event itself)
=> 20-25% in search time, with our own extraction logic. This was a shock to me as well, as I thought key-value was better.

0 Karma

sloshburch
Splunk Employee

Interesting stats. To your own point, it could be normalized to the size of the data. So is the indexing improvement normalized per byte? Not that it needs to be; at the end of the day it's just how fast you can get your answer #amiright? lol

0 Karma

Simeon
Splunk Employee

@koshyk - if you could share the extraction logic (regex) and the type of search, we could probably tell you why performance is improved.

0 Karma

sloshburch
Splunk Employee

FWIW, I see most customers moving to JSON from key=value style, and definitely no one switching to CSV or anything so restrictive.

I want to highlight @Simeon's key point: if others who are not familiar with the data need to see the raw events, then having a more descriptive format will be a win (whereas CSV is NOT self-descriptive).

0 Karma

rfaircloth_splu
Splunk Employee

The key proposition for Splunk is "native" or raw data. If the source is natively producing JSON, XML, or KV, the best all-things-considered path is that raw form. Pre-translation, i.e. schema on write, is very high risk and is the Achilles' heel of other solutions: identifying and rectifying data problems due to translation is difficult and often results in a failure to monitor. If a new solution, such as a business application, were being implemented today, and that solution was to log in a performant way to, for example, Kafka or SNS, I would use a minified JSON format with a schema indicator. An example of this in use today is AWS CloudWatch Events.
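For illustration, a minified JSON event with a schema indicator might look like this (the field names and schema label are made up, reusing the sample values from the question):

{"schema":"retail.network.v1","service":"retail","source_port":514,"dest_port":22,"destination_ip":"1.2.3.4","source_ip":"7.2.3.4"}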

One example of what not to do is wrap a text message in JSON: packaging a Cisco ASA event inside JSON, for example, requires escaping characters, and parsing fields out of a field inside JSON is very difficult, while a native JSON format is very easy to work with.
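For illustration, wrapping a raw ASA-style line in JSON produces one big escaped string that still needs a second round of regex parsing (the payload below is mocked up):

{"message":"%ASA-6-302013: Built outbound TCP connection 1105 for outside:1.2.3.4/22 (1.2.3.4/22) to inside:7.2.3.4/49152 (7.2.3.4/49152)"}

None of the fields inside "message" are addressable as JSON; a native JSON event would expose each of them directly.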

0 Karma

sloshburch
Splunk Employee

Good highlight! Don't rewrite things to transform and compromise the original data; use its raw form! If you're creating something new, then this thread is great guidance. Thanks, @rfaircloth!

0 Karma