<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: CSV data vs key-value data. Which is faster for performance? in Getting Data In</title>
    <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323298#M60218</link>
    <description>&lt;P&gt;done it mate. thanks&lt;/P&gt;</description>
    <pubDate>Thu, 26 Jul 2018 18:33:59 GMT</pubDate>
    <dc:creator>koshyk</dc:creator>
    <dc:date>2018-07-26T18:33:59Z</dc:date>
    <item>
      <title>CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323286#M60206</link>
      <description>&lt;P&gt;hi,&lt;BR /&gt;
We have an incoming custom dataset of approximately 700GB a day that is currently used for CIM. It is currently in key-value format. There is a proposal to change it to CSV, which would reduce the dataset by approximately 60% to 280GB a day. The data savings are quite significant. We know the client is fixed, so lack of flexibility is NOT an issue.&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Existing format in every line&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;service="retail" source_port="514" dest_port="22" destination_ip="1.2.3.4" source_ip="7.2.3.4" 
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;STRONG&gt;Proposed format  in every line&lt;/STRONG&gt;&lt;/P&gt;

&lt;PRE&gt;&lt;CODE&gt;"retail",514,22,"1.2.3.4","7.2.3.4"
&lt;/CODE&gt;&lt;/PRE&gt;

&lt;P&gt;The key question is, &lt;STRONG&gt;from a performance point of view, would there be an impact if we use CIM on the CSV format&lt;/STRONG&gt;? Also, would it have a bad impact on tsidx creation? The data comes in as syslog and the files are rotated at 100MB (if it matters). I've tried with a smaller subset on my test machine, but I couldn't find any change in performance with a small amount of data, so I would like to hear from others' experience.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Oct 2017 09:20:06 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323286#M60206</guid>
      <dc:creator>koshyk</dc:creator>
      <dc:date>2017-10-23T09:20:06Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323287#M60207</link>
      <description>&lt;P&gt;Performance can be measured in different ways, and it covers both indexing and search. You could run your tests again and use the job inspector to see any exact differences, but I would first ask &lt;STRONG&gt;why you would want to remove or add the key-value pair fidelity.&lt;/STRONG&gt; We typically encourage people to add field names so events are easier on the eyes, and the performance difference to the user is not noticeable. If your search is slow, it's still gonna be slow regardless of CSV or key-value pair format.&lt;/P&gt;

&lt;P&gt;From an indexing perspective, you would save on size with CSV, and there is optimized/automated field extraction, iirc. You are essentially saving extra bytes by removing the field name from each key-value pair. Many years ago, people would switch to CSV to save on licensing, but you lose fidelity and searchable terms.&lt;/P&gt;

&lt;P&gt;From a search perspective, it kinda depends. If you have terms (field names) you need to search on, like using service or source_port as a keyword, the CSV format won't be as optimized, as I don't believe the field name exists in the same way in the tsidx file (I would have to double check this). I would imagine an apples-to-apples comparison of a "stats count" by one of the fields would perform slightly differently, potentially slightly faster in the CSV format, as extracting the value you count from rawdata might be faster. If you consider counting by the last field in your first example line, source_ip, I would imagine that extracting that field will take much longer than via the CSV method, where we only look for the last comma and return that field, compared to running a regex for source_ip and returning that value. I'll reiterate, it really depends on what you care about and the type of search.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Sep 2020 20:12:14 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323287#M60207</guid>
      <dc:creator>Simeon</dc:creator>
      <dc:date>2020-09-29T20:12:14Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323288#M60208</link>
      <description>&lt;P&gt;FWIW, I see most customers moving to JSON from semantic=style, and definitely no one switching to CSV or something so restrictive.&lt;/P&gt;

&lt;P&gt;I want to highlight @Simeon’s key point that if others, who are not familiar with the data, need to see raw events, then having a more descriptive format will be a win (whereas CSV is NOT self-descriptive), and more interestingly.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jul 2018 12:46:14 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323288#M60208</guid>
      <dc:creator>sloshburch</dc:creator>
      <dc:date>2018-07-03T12:46:14Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323289#M60209</link>
      <description>&lt;P&gt;The key proposition for Splunk is "native" or raw. If the source natively produces JSON, XML or KV, the best path, all things considered, is that raw form. Pre-translation, i.e. schema on write, is very high risk and is the Achilles' heel of other solutions: identifying and rectifying data problems caused by translation is difficult and often results in failure to monitor. If a new solution such as a business application were being implemented today, and that solution had to log in a performant way to, for example, Kafka or SNS, I would use a minified JSON format with a schema indicator. An example of this in use today is AWS CloudWatch Events.&lt;/P&gt;

&lt;P&gt;One example of what not to do is wrapping a text message in JSON; for example, packaging a Cisco ASA event inside JSON requires escaping characters. Parsing fields from fields in JSON is very difficult, while a native JSON format is very easy to work with.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jul 2018 12:59:47 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323289#M60209</guid>
      <dc:creator>rfaircloth_splu</dc:creator>
      <dc:date>2018-07-03T12:59:47Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323290#M60210</link>
      <description>&lt;P&gt;It has been a while since I posted the question, and I had to test it myself. We are now following &lt;STRONG&gt;format (2)&lt;/STRONG&gt;, which is &lt;STRONG&gt;WITHOUT key-value&lt;/STRONG&gt;, and we actually got a performance &lt;STRONG&gt;improvement&lt;/STRONG&gt; of about:&lt;/P&gt;

&lt;P&gt;=&amp;gt; 5-10% of indexing . (may be coz of reduction in size of event itself)&lt;BR /&gt;
=&amp;gt; 20-25% performance in search time with our own extraction logic. This was a shock me as well as I thought key-value was better.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jul 2018 14:37:57 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323290#M60210</guid>
      <dc:creator>koshyk</dc:creator>
      <dc:date>2018-07-03T14:37:57Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323291#M60211</link>
      <description>&lt;P&gt;Good highlight! Don't rewrite things in a way that transforms and compromises the original data. Use its raw form! If creating something new, then this thread becomes great guidance. Thanks, @rfaircloth!&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jul 2018 13:50:36 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323291#M60211</guid>
      <dc:creator>sloshburch</dc:creator>
      <dc:date>2018-07-12T13:50:36Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323292#M60212</link>
      <description>&lt;P&gt;Interesting stats. To your own point, it could be normalized to the size of the data. So is the indexing improvement normalized per byte? Not that it needs to be, at the end of the day it's just how fast can you get your answer #amiright? lol&lt;/P&gt;</description>
      <pubDate>Thu, 12 Jul 2018 13:52:20 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323292#M60212</guid>
      <dc:creator>sloshburch</dc:creator>
      <dc:date>2018-07-12T13:52:20Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323293#M60213</link>
      <description>&lt;P&gt;@koshyk - if you could share the extraction logic (regex) and the type of search, we could probably tell you why performance is improved.  &lt;/P&gt;</description>
      <pubDate>Thu, 12 Jul 2018 18:31:32 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323293#M60213</guid>
      <dc:creator>Simeon</dc:creator>
      <dc:date>2018-07-12T18:31:32Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323294#M60214</link>
      <description>&lt;P&gt;key=value, from a performance point of view, is &lt;STRONG&gt;not&lt;/STRONG&gt; the best format - it is the easiest format to write logs in, has great readability, and compresses really well, but of course it is really wasteful from a license point of view. As the developer of most of the search-time extractions, I am actually surprised that you are not seeing &lt;STRONG&gt;even better&lt;/STRONG&gt; search-time performance gains (I expected gains similar to the change in data size) ... but obviously a lot depends on what searches you've tested. One thing to note is that with .csv files your fields become indexed fields and thus your index size (.tsidx files) on disk might suffer (depending on the cardinality of your fields). You could avoid this by not using index-time CSV parsing and instead using &lt;A href="https://www.splunk.com/blog/2008/02/12/delimiter-based-key-value-pair-extraction.html"&gt;delimiter based KV&lt;/A&gt; at search time - if the file format doesn't change (i.e. headers are the same) then delimiter KV has few/no drawbacks.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jul 2018 04:20:22 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323294#M60214</guid>
      <dc:creator>ledion</dc:creator>
      <dc:date>2018-07-13T04:20:22Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323295#M60215</link>
      <description>&lt;P&gt;carnality....&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jul 2018 12:03:31 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323295#M60215</guid>
      <dc:creator>Simeon</dc:creator>
      <dc:date>2018-07-13T12:03:31Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323296#M60216</link>
      <description>&lt;P&gt;there, fixed it &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jul 2018 16:58:45 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323296#M60216</guid>
      <dc:creator>ledion</dc:creator>
      <dc:date>2018-07-13T16:58:45Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323297#M60217</link>
      <description>&lt;P&gt;@koshyk - you've got a lot of brilliant engineers helping on this thread. Let us know what else you need, or, if one of the answers helped, go ahead and accept it so we know you're all set.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jul 2018 12:58:38 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323297#M60217</guid>
      <dc:creator>sloshburch</dc:creator>
      <dc:date>2018-07-24T12:58:38Z</dc:date>
    </item>
    <item>
      <title>Re: CSV data vs key-value data. Which is faster for performance?</title>
      <link>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323298#M60218</link>
      <description>&lt;P&gt;done it mate. thanks&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jul 2018 18:33:59 GMT</pubDate>
      <guid>https://community.splunk.com/t5/Getting-Data-In/CSV-data-vs-key-value-data-Which-is-faster-for-performance/m-p/323298#M60218</guid>
      <dc:creator>koshyk</dc:creator>
      <dc:date>2018-07-26T18:33:59Z</dc:date>
    </item>
  </channel>
</rss>

