Line splitting at a regular expression for ""

pgreer_splunk
Splunk Employee

I need to split a stream of CSV data (from a REST API call) whose lines arrive with variable lengths. The split should occur wherever the stream has two double quotes together -> "" <-

Example data is:

"AmazonEC2","Asia Pacific (Sydney)","AWS Region","m3.2xlarge","Yes","General purpose","8","Intel Xeon E5-2670 v2 (Ivy Bridge/Sandy Bridge)","2.5 GHz","30 GiB","2 x 80 SSD","High","64-bit",,,,,,,,"Dedicated","Windows","No License required",,,,,,,,"APS2-DedicatedUsage:m3.2xlarge","RunInstances:0002",,,"26",,,,,,,,,,,,,,,"16",,"NA","Intel AVX; Intel Turbo","Amazon Elastic Compute Cloud""FGNPDK5ZFJP4S9NC","MZU6U2429S","FGNPDK5ZFJP4S9NC.MZU6U2429S.2TG2D8R56U","Reserved","Upfront Fee","2016-08-31",,,"Quantity","17896","USD","3yr","All Upfront","convertible","Compute Instance","AmazonEC2","US West (N. California)","AWS Region","c3.4xlarge","Yes","Compute optimized","16","Intel Xeon E5-2680 v2 (Ivy Bridge)","2.8 GHz","30 GiB","2 x 160 SSD","High","64-bit",,,,,,,,"Shared","RHEL","No License required",,,,,,,,"USW1-BoxUsage:c3.4xlarge","RunInstances:0010",,,"55","Yes",,,,,,,,,,,,,,"32",,"NA","Intel AVX; Intel Turbo","Amazon Elastic Compute Cloud""QD2X48Z37JG3VNFX","HU7G6KETJZ","QD2X48Z37JG3VNFX.HU7G6KETJZ.6YS6EN2CT7","Reserved","Windows with SQL Server Enterprise (Amazon VPC), r3.2xlarge reserved instance applied","2016-11-30","0","Inf","Hrs","1.9700000000","USD","1yr","Partial Upfront","standard","Compute Instance","AmazonEC2","Asia Pacific (Tokyo)","AWS Region","r3.2xlarge","Yes","Memory optimized","8","Intel Xeon E5-2670 v2 (Ivy Bridge)","2.5 GHz","61 GiB","1 x 160 SSD","High","64-bit",,,,,,,,"Dedicated","Windows","No License required",,,,,,,,"APN1-DedicatedUsage:r3.2xlarge","RunInstances:0102",,,"26","Yes",,,,,,,,,,,,,,"16",,"SQL Ent","Intel AVX; Intel Turbo","Amazon Elastic Compute Cloud""DCM8ZJ894B27CQ8G","4NA7Y494T4","DCM8ZJ894B27CQ8G.4NA7Y494T4.6YS6EN2CT7","Reserved","Linux/UNIX (Amazon VPC), g3.8xlarge reserved instance applied","2017-06-30","0","Inf","Hrs","2.1400000000","USD","1yr","No Upfront","standard","Compute Instance","AmazonEC2","US West (N. 
California)","AWS Region","g3.8xlarge","Yes","GPU instance","32","Intel Xeon E5-2686 v4 (Broadwell)","2.3 GHz","244 GiB","EBS only","10 Gigabit","64-bit",,,,,,,,"Shared","Linux","No License required",,,,,,,,"USW1-BoxUsage:g3.8xlarge","RunInstances",,"7000 Mbps","0","Yes","2",,,,,,,,,,"Yes","Yes","Yes","64",,"NA","Intel AVX, Intel AVX2, Intel Turbo","Amazon Elastic Compute Cloud""EX33FD39CKVCKNYQ","MZU6U2429S","EX33FD39CKVCKNYQ.MZU6U2429S.2TG2D8R56U","Reserved","Upfront Fee","2017-04-30",,,"Quantity","14074","USD","3yr","All Upfront","convertible","Compute Instance","AmazonEC2","US West (N. California)","AWS Region","m4.4xlarge","Yes","General purpose","16","Intel Xeon E5-2676 v3 (Haswell)","2.4 GHz","64 GiB","EBS only","High","64-bit",,,,,,,,"Dedicated","Linux","No License required",,,,,,,,"USW1-DedicatedUsage:m4.4xlarge","RunInstances",,"2000 Mbps","53.5","Yes",,,,,,,,,,,,,,"32",,"NA","Intel AVX; Intel AVX2; Intel Turbo","Amazon Elastic Compute Cloud""QGQ2W8XX4J2CGD82","4NA7Y494T4","QGQ2W8XX4J2CGD82.4NA7Y494T4.6YS6EN2CT7","Reserved","Red Hat Enterprise Linux (Amazon VPC), m4.xlarge reserved instance applied","2017-04-30","0","Inf","Hrs","0.2154000000","USD","1yr","No Upfront","standard","Compute Instance","AmazonEC2","Asia Pacific (Singapore)","AWS Region","m4.xlarge","Yes","General purpose","4","Intel Xeon E5-2676 v3 (Haswell)","2.4  GHz","16 GiB","EBS only","High","64-bit",,,,,,,,"Shared","RHEL","No License required",,,,,,,,"APS1-BoxUsage:m4.xlarge","RunInstances:0010",,"750 Mbps","13","Yes",,,,,,,,,,,,,,"8",,"NA","Intel AVX; Intel AVX2; Intel Turbo","Amazon Elastic Compute Cloud""DZS3NEJDE8E98442","4NA7Y494T4","DZS3NEJDE8E98442.4NA7Y494T4.6YS6EN2CT7","Reserved","Windows with SQL Server Standard (Amazon VPC), i3.4xlarge reserved instance applied","2017-06-30","0","Inf","Hrs","3.5970000000","USD","1yr","No Upfront","standard","Compute Instance","AmazonEC2","EU (Ireland)","AWS Region","i3.4xlarge","Yes","Storage optimized","16","Intel Xeon E5-2686 v4 
(Broadwell)","2.3 GHz","122 GiB","2 x 1.9 NVMe SSD","Up to 10 Gigabit","64-bit",,,,,,,,"Shared","Windows","No License required",,,,,,,,"EU-BoxUsage:i3.4xlarge","RunInstances:0006",,"3500 Mbps","99","Yes",,,,,,,,,,,,,,"32",,"SQL Std","Intel AVX, Intel AVX2, Intel Turbo","Amazon Elastic Compute Cloud"

acharlieh
Influencer

I believe the props.conf settings you want for your sourcetype, on the Splunk instance (indexer or heavy forwarder) that parses your data, are:

[yoursourcetype]
LINE_BREAKER = "()"
SHOULD_LINEMERGE = false

LINE_BREAKER needs a capturing group; whatever that group matches is treated as the separator between events and removed from the data. By default the group matches any number of consecutive newline and carriage return characters, but in this case it matches the empty string between two consecutive double quotes, so nothing is removed and the event boundary falls between the two quotes.
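To make the capturing-group behavior concrete, here is a rough Python sketch that imitates the breaking rule described above. It is only an approximation for illustration: `break_events` is a made-up helper, not anything Splunk ships, and Python's `re` stands in for Splunk's regex engine. Note that with this pattern an empty quoted field (`""`) would also trigger a break; the sample data uses bare commas for empty fields, so that does not come up here.

```python
import re


def break_events(stream: str, line_breaker: str = r'"()"') -> list[str]:
    """Imitate LINE_BREAKER: cut the stream at the first capturing group.

    Text matched by group 1 is dropped; text matched before the group stays
    with the previous event, and text matched after it starts the next event.
    """
    pattern = re.compile(line_breaker)
    events = []
    start = 0
    for match in pattern.finditer(stream):
        events.append(stream[start:match.start(1)])  # event ends where group 1 begins
        start = match.end(1)                         # next event starts after group 1
    events.append(stream[start:])                    # remainder after the last break
    return events


# Shortened stand-in for the API stream: two records glued together by "".
sample = '"a","b""c","d"'
for event in break_events(sample):
    print(event)
```

Because the group is empty, both quotes survive: the first closes the last field of one event and the second opens the first field of the next.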

You probably also want to configure the timestamp identification properties, as well as the search-time properties that define what the fields of your CSV mean, but those are separate (though similar) steps 🙂
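As a rough way to see what those two follow-up steps have to work with, the Python sketch below parses one event from the sample data (cut down to its first six fields) and reads a date out of it. The column index is an assumption inferred from the sample only; nothing in the thread confirms that this column is the intended event timestamp. On the Splunk side the equivalent would be timestamp settings in props.conf (for example `TIME_FORMAT`), not Python.

```python
import csv
from datetime import datetime

# First six fields of one event from the sample data, after line breaking.
event = (
    '"FGNPDK5ZFJP4S9NC","MZU6U2429S",'
    '"FGNPDK5ZFJP4S9NC.MZU6U2429S.2TG2D8R56U",'
    '"Reserved","Upfront Fee","2016-08-31"'
)

# csv.reader handles the quoting, so commas inside quoted fields are safe.
fields = next(csv.reader([event]))

DATE_COLUMN = 5  # assumption: the sixth field holds the date to use
event_date = datetime.strptime(fields[DATE_COLUMN], "%Y-%m-%d")

print(fields[0], event_date.date())
```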

pgreer_splunk
Splunk Employee

That set me on the right track. I'm still having trouble excluding certain events from ingestion, so I'm still working on that front, but the events are breaking as desired at each "" within the data stream from the API.

Thanks!

gjanders
SplunkTrust

I'm assuming that if you cannot use:

LINE_BREAKER = \"\"

then something like this might work:

LINE_BREAKER = \x22\x22

Perhaps?
I can convert this to an answer if it works.
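For what it's worth, `\x22` is just the hex escape for a double quote, so the two spellings should match the same text. The quick Python check below illustrates that, with the caveat that Python's `re` is standing in for Splunk's regex engine here. The patterns in the check include the empty capturing group from the accepted answer (`"()"`), since the props.conf spec says the LINE_BREAKER regex must contain a capturing group.

```python
import re

# Same pattern written two ways: literal quotes vs. hex escapes for '"'.
literal = re.compile(r'"()"')
hex_escaped = re.compile(r'\x22()\x22')

# A record boundary copied from the sample data.
sample = '"Amazon Elastic Compute Cloud""FGNPDK5ZFJP4S9NC"'

# Offsets where the (empty) capturing group matched, i.e. the break points.
literal_hits = [m.start(1) for m in literal.finditer(sample)]
hex_hits = [m.start(1) for m in hex_escaped.finditer(sample)]

print(literal_hits == hex_hits)
```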
