Solved: Line splitting at a regular expression for ""

pgreer_splunk · ‎10-14-2017

I need to split a stream of CSV data (from a REST API call) whose line lengths vary in the initial set. The split should happen wherever the stream has two double quotes together -> "" <-

Example data is:

"AmazonEC2","Asia Pacific (Sydney)","AWS Region","m3.2xlarge","Yes","General purpose","8","Intel Xeon E5-2670 v2 (Ivy Bridge/Sandy Bridge)","2.5 GHz","30 GiB","2 x 80 SSD","High","64-bit",,,,,,,,"Dedicated","Windows","No License required",,,,,,,,"APS2-DedicatedUsage:m3.2xlarge","RunInstances:0002",,,"26",,,,,,,,,,,,,,,"16",,"NA","Intel AVX; Intel Turbo","Amazon Elastic Compute Cloud""FGNPDK5ZFJP4S9NC","MZU6U2429S","FGNPDK5ZFJP4S9NC.MZU6U2429S.2TG2D8R56U","Reserved","Upfront Fee","2016-08-31",,,"Quantity","17896","USD","3yr","All Upfront","convertible","Compute Instance","AmazonEC2","US West (N. California)","AWS Region","c3.4xlarge","Yes","Compute optimized","16","Intel Xeon E5-2680 v2 (Ivy Bridge)","2.8 GHz","30 GiB","2 x 160 SSD","High","64-bit",,,,,,,,"Shared","RHEL","No License required",,,,,,,,"USW1-BoxUsage:c3.4xlarge","RunInstances:0010",,,"55","Yes",,,,,,,,,,,,,,"32",,"NA","Intel AVX; Intel Turbo","Amazon Elastic Compute Cloud""QD2X48Z37JG3VNFX","HU7G6KETJZ","QD2X48Z37JG3VNFX.HU7G6KETJZ.6YS6EN2CT7","Reserved","Windows with SQL Server Enterprise (Amazon VPC), r3.2xlarge reserved instance applied","2016-11-30","0","Inf","Hrs","1.9700000000","USD","1yr","Partial Upfront","standard","Compute Instance","AmazonEC2","Asia Pacific (Tokyo)","AWS Region","r3.2xlarge","Yes","Memory optimized","8","Intel Xeon E5-2670 v2 (Ivy Bridge)","2.5 GHz","61 GiB","1 x 160 SSD","High","64-bit",,,,,,,,"Dedicated","Windows","No License required",,,,,,,,"APN1-DedicatedUsage:r3.2xlarge","RunInstances:0102",,,"26","Yes",,,,,,,,,,,,,,"16",,"SQL Ent","Intel AVX; Intel Turbo","Amazon Elastic Compute Cloud""DCM8ZJ894B27CQ8G","4NA7Y494T4","DCM8ZJ894B27CQ8G.4NA7Y494T4.6YS6EN2CT7","Reserved","Linux/UNIX (Amazon VPC), g3.8xlarge reserved instance applied","2017-06-30","0","Inf","Hrs","2.1400000000","USD","1yr","No Upfront","standard","Compute Instance","AmazonEC2","US West (N. 
California)","AWS Region","g3.8xlarge","Yes","GPU instance","32","Intel Xeon E5-2686 v4 (Broadwell)","2.3 GHz","244 GiB","EBS only","10 Gigabit","64-bit",,,,,,,,"Shared","Linux","No License required",,,,,,,,"USW1-BoxUsage:g3.8xlarge","RunInstances",,"7000 Mbps","0","Yes","2",,,,,,,,,,"Yes","Yes","Yes","64",,"NA","Intel AVX, Intel AVX2, Intel Turbo","Amazon Elastic Compute Cloud""EX33FD39CKVCKNYQ","MZU6U2429S","EX33FD39CKVCKNYQ.MZU6U2429S.2TG2D8R56U","Reserved","Upfront Fee","2017-04-30",,,"Quantity","14074","USD","3yr","All Upfront","convertible","Compute Instance","AmazonEC2","US West (N. California)","AWS Region","m4.4xlarge","Yes","General purpose","16","Intel Xeon E5-2676 v3 (Haswell)","2.4 GHz","64 GiB","EBS only","High","64-bit",,,,,,,,"Dedicated","Linux","No License required",,,,,,,,"USW1-DedicatedUsage:m4.4xlarge","RunInstances",,"2000 Mbps","53.5","Yes",,,,,,,,,,,,,,"32",,"NA","Intel AVX; Intel AVX2; Intel Turbo","Amazon Elastic Compute Cloud""QGQ2W8XX4J2CGD82","4NA7Y494T4","QGQ2W8XX4J2CGD82.4NA7Y494T4.6YS6EN2CT7","Reserved","Red Hat Enterprise Linux (Amazon VPC), m4.xlarge reserved instance applied","2017-04-30","0","Inf","Hrs","0.2154000000","USD","1yr","No Upfront","standard","Compute Instance","AmazonEC2","Asia Pacific (Singapore)","AWS Region","m4.xlarge","Yes","General purpose","4","Intel Xeon E5-2676 v3 (Haswell)","2.4  GHz","16 GiB","EBS only","High","64-bit",,,,,,,,"Shared","RHEL","No License required",,,,,,,,"APS1-BoxUsage:m4.xlarge","RunInstances:0010",,"750 Mbps","13","Yes",,,,,,,,,,,,,,"8",,"NA","Intel AVX; Intel AVX2; Intel Turbo","Amazon Elastic Compute Cloud""DZS3NEJDE8E98442","4NA7Y494T4","DZS3NEJDE8E98442.4NA7Y494T4.6YS6EN2CT7","Reserved","Windows with SQL Server Standard (Amazon VPC), i3.4xlarge reserved instance applied","2017-06-30","0","Inf","Hrs","3.5970000000","USD","1yr","No Upfront","standard","Compute Instance","AmazonEC2","EU (Ireland)","AWS Region","i3.4xlarge","Yes","Storage optimized","16","Intel Xeon E5-2686 v4 
(Broadwell)","2.3 GHz","122 GiB","2 x 1.9 NVMe SSD","Up to 10 Gigabit","64-bit",,,,,,,,"Shared","Windows","No License required",,,,,,,,"EU-BoxUsage:i3.4xlarge","RunInstances:0006",,"3500 Mbps","99","Yes",,,,,,,,,,,,,,"32",,"SQL Std","Intel AVX, Intel AVX2, Intel Turbo","Amazon Elastic Compute Cloud"

acharlieh · ‎10-14-2017

I believe the props.conf settings you want for your sourcetype, on the Splunk instance (indexer or heavy forwarder) that parses your data, are:

[yoursourcetype]
LINE_BREAKER = "()"
SHOULD_LINEMERGE = false

LINE_BREAKER needs a capturing group: whatever that group matches is treated as the boundary between events and removed from the data. By default the group matches any run of consecutive newline and carriage return characters. In this case it matches the empty string between two consecutive double quotes, so nothing is removed: the first quote ends one event and the second quote starts the next.
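To see why an empty capturing group works, here is a rough Python sketch of the behaviour described above. It is only an emulation under assumptions: `break_events` is a hypothetical helper, Python's `re` stands in for Splunk's regex engine, and the sample string is a shortened version of the example data from the question.

```python
import re

def break_events(stream, line_breaker):
    """Roughly emulate LINE_BREAKER: the text matched by capturing group 1
    is dropped and a new event starts there. Text matched before the group
    stays with the previous event; text matched after it starts the next."""
    events = []
    start = 0
    for m in re.finditer(line_breaker, stream):
        events.append(stream[start:m.start(1)])  # up to the start of group 1
        start = m.end(1)                         # resume after group 1
    events.append(stream[start:])
    return events

# Shortened sample in the same shape as the API output (fields trimmed)
sample = ('"AmazonEC2","Amazon Elastic Compute Cloud"'
          '"FGNPDK5ZFJP4S9NC","Reserved"'
          '"QD2X48Z37JG3VNFX","Reserved"')

events = break_events(sample, r'"()"')
for event in events:
    print(event)
```

Because the group in `"()"` is empty, both quotes survive: each event still starts and ends with a double quote. Passing `([\r\n]+)` instead would mimic the default, where the newline characters themselves are discarded.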

You probably also want to configure the timestamp identification properties, as well as the search-time properties that define what the fields of your CSV mean, but those are separate, similar steps 🙂

pgreer_splunk · ‎10-16-2017

That set me on the right track. I'm still having trouble keeping certain events from being ingested, so I'm still working on that front, but the events are breaking as desired at a "" within the data stream from the API.

Thanks!

gjanders · ‎10-14-2017

I'm assuming that if you cannot do:

LINE_BREAKER = \"\"

Then something like:

LINE_BREAKER = \x22\x22

Perhaps?
I can convert this to an answer if it works.
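As a quick sanity check on the two spellings, here is a small Python sketch. It assumes Python's `re` treats these escapes the same way Splunk's regex engine does, and the test string is made up. Note that, as far as I know, LINE_BREAKER still requires a capturing group, so either spelling would need one (as in the accepted answer's `"()"`) before it could break events.

```python
import re

# Made-up test string with one pair of adjacent double quotes
text = '"a""b"'

escaped = re.compile(r'\"\"')      # backslash-escaped quotes
hexed = re.compile(r'\x22\x22')    # hex escape for the same character

spans_escaped = [m.span() for m in escaped.finditer(text)]
spans_hexed = [m.span() for m in hexed.finditer(text)]

# Both patterns find the same single match between "a" and "b"
print(spans_escaped, spans_hexed)
```

Both patterns match the same text, so the hex form only helps if the literal quote character is causing trouble in the config file itself.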
