
Field Extraction with CSV File

SplunkDash
Motivator

Hello.

I have an issue with field parsing for a CSV file using my props configuration. I should be getting 11 fields for each event/row, but parsing is giving me 17 fields. Here are 3 sample events from that CSV file (the first row is the header row); the props.conf stanza is also provided below:

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10,Col11

APIDEV,4xs54,000916,DEV,Update,Integrate,String\,Set\,Number\,ID,Standard,2024-07-10T23:10:45.001Z,Process_TIME\,URI\,Session_Key\,Services,Hourly

APITEST,4ys34,000916,TEST,Update,Integrate,String\,Set\,Number\,String,Typicall\,Response,2024-07-10T23:10:45.021Z,CPU_TIME\,URI\,Session_Key\,Type\,Request,Monthly

APITEST,4ys34,000916,DEV,Insert,Integrate,Char\,Set\,System\,ID,On_Demand,2024-07-10T23:10:45.051Z,CPU_TIME\,URI\,Session_Key\,Services,Hourly

 

*The backslash-escaped sequences in each of the events (e.g. String\,Set\,Number\,ID) should each count as a single field

 

props.conf

[mypropscon]

SHOULD_LINEMERGE=False

LINE_BREAKER=([\r\n]+)

INDEXED_EXTRACTIONS=CSV

KV_MODE=none

disabled=false

TIME_FORMAT=%Y-%m-%dT%H:%M:%S.%QZ

HEADER_FIELD_LINE_NUMBER=1

 

Any recommendation to resolve this issue would be highly appreciated. Thank you so much for your support, as always.



PickleRick
SplunkTrust

That's an interesting problem, but I think the data is a bit malformed. If your field values contain commas, they should be enclosed in quotes.

If your column order is constant, you can define a regex-based search-time extraction that allows for escaped commas within field values.
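
For example, with standard CSV quoting the first sample row would look something like this, which the CSV parser should then handle as 11 fields without any extra configuration:

APIDEV,4xs54,000916,DEV,Update,Integrate,"String,Set,Number,ID",Standard,2024-07-10T23:10:45.001Z,"Process_TIME,URI,Session_Key,Services",Hourly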

gcusello
SplunkTrust

Hi @SplunkDash ,

Let me make sure I understand: you have 17 fields in your CSV but you want to extract only 11 of them, is that correct?

Do you want to delete the other fields, or do you simply not need them in the visualization?

If you want to delete the extra fields, you can use SEDCMD to remove them before indexing.

In the second case, you can leave everything as is and use only the 11 fields you need.
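
For the delete option, SEDCMD in props.conf applies a sed-style substitution at index time. As a purely illustrative sketch (the class name and the pattern below are hypothetical, only meant to show the syntax for stripping trailing values before indexing):

SEDCMD-drop_extra = s/,[^,]+,[^,]+$//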

Ciao.

Giuseppe

SplunkDash
Motivator

Hello @gcusello and @PickleRick 

Thank you so much for your responses; I truly appreciate it. All values should be part of this extraction, and the outcome of the extraction should look like this, for example:

APIDEV,4xs54,000916,DEV,Update,Integrate,String\,Set\,Number\,ID,Standard,2024-07-10T23:10:45.001Z,Process_TIME\,URI\,Session_Key\,Services,Hourly

Col1=APIDEV

Col2=4xs54

Col3=000916

Col4=DEV

Col5=Update

Col6=Integrate

Col7=String\,Set\,Number\,ID

Col8=Standard

Col9=2024-07-10T23:10:45.001Z

Col10=Process_TIME\,URI\,Session_Key\,Services

Col11=Hourly

In the case of regex, what would be the regex for this extraction within props.conf?


ITWhisperer
SplunkTrust

If it is only Col7 and Col10 that could have escaped commas, try something like this:

(?<Col1>[^,]+),(?<Col2>[^,]+),(?<Col3>[^,]+),(?<Col4>[^,]+),(?<Col5>[^,]+),(?<Col6>[^,]+),(?<Col7>.+?(?<!\\)),(?<Col8>[^,]+),(?<Col9>[^,]+),(?<Col10>.+?(?<!\\)),(?<Col11>[^,]+)$

You may have to double up the backslashes:

(?<Col1>[^,]+),(?<Col2>[^,]+),(?<Col3>[^,]+),(?<Col4>[^,]+),(?<Col5>[^,]+),(?<Col6>[^,]+),(?<Col7>.+?(?<!\\\\)),(?<Col8>[^,]+),(?<Col9>[^,]+),(?<Col10>.+?(?<!\\\\)),(?<Col11>[^,]+)$
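
As a quick way to check that pattern outside Splunk, here is a minimal Python sketch (purely illustrative; Python's re module wants named groups written as (?P<...>), and in a raw string a single escaped backslash, (?<!\\), already means "not preceded by a literal backslash"):

import re

# First sample row from the question; a raw string keeps the \, sequences literal.
row = (r"APIDEV,4xs54,000916,DEV,Update,Integrate,"
       r"String\,Set\,Number\,ID,Standard,2024-07-10T23:10:45.001Z,"
       r"Process_TIME\,URI\,Session_Key\,Services,Hourly")

# Plain [^,]+ columns, plus lazy .+? with a (?<!\\) lookbehind for Col7 and
# Col10 so those fields cannot end on a backslash and therefore absorb any
# escaped commas.
pattern = (r"(?P<Col1>[^,]+),(?P<Col2>[^,]+),(?P<Col3>[^,]+),(?P<Col4>[^,]+),"
           r"(?P<Col5>[^,]+),(?P<Col6>[^,]+),(?P<Col7>.+?(?<!\\)),"
           r"(?P<Col8>[^,]+),(?P<Col9>[^,]+),(?P<Col10>.+?(?<!\\)),"
           r"(?P<Col11>[^,]+)$")

m = re.match(pattern, row)
print(len(m.groupdict()))    # 11
print(m.group("Col7"))       # String\,Set\,Number\,ID
print(m.group("Col10"))      # Process_TIME\,URI\,Session_Key\,Services

The doubled form (?<!\\\\) typically only becomes necessary when the regex passes through an extra layer of string escaping, for example inside a double-quoted rex expression in SPL.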

SplunkDash
Motivator

Hello @ITWhisperer,

Thank you so much for sharing that with me. But it's not only Col7 and Col10 every time; the escaped commas might appear in any column, and the number of backslashes might not always be the same for a given column. In that case, how would I address the issue?

 


ITWhisperer
SplunkTrust

The number of backslashes in the data doesn't matter; it is the number of backslashes in the regex string that I was talking about. Backslashes normally need to be escaped (with a backslash); however, sometimes those escaping backslashes have to be escaped as well, hence the need for four backslashes to represent a single backslash. Try something like this (which allows for escaped (backslashed) commas in all of the columns):

(?<Col1>.+?(?<!\\\\)),(?<Col2>.+?(?<!\\\\)),(?<Col3>.+?(?<!\\\\)),(?<Col4>.+?(?<!\\\\)),(?<Col5>.+?(?<!\\\\)),(?<Col6>.+?(?<!\\\\)),(?<Col7>.+?(?<!\\\\)),(?<Col8>.+?(?<!\\\\)),(?<Col9>.+?(?<!\\\\)),(?<Col10>.+?(?<!\\\\)),(?<Col11>.+?(?<!\\\\))$

Again, it might be that you only need two backslashes each time instead of four.
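
If it helps, one hedged way to wire the answer above into props.conf as a search-time extraction could look like this (the extraction class name cols is arbitrary; INDEXED_EXTRACTIONS=CSV and HEADER_FIELD_LINE_NUMBER are dropped here because the indexed CSV parsing is what is splitting on the escaped commas in the first place, and the remaining settings are carried over from the original stanza):

[mypropscon]
SHOULD_LINEMERGE=False
LINE_BREAKER=([\r\n]+)
TIME_FORMAT=%Y-%m-%dT%H:%M:%S.%QZ
KV_MODE=none
disabled=false
# Regex from the answer above, using the two-backslash lookbehind form;
# if it under-matches, try the four-backslash form (?<!\\\\) instead.
EXTRACT-cols = (?<Col1>.+?(?<!\\)),(?<Col2>.+?(?<!\\)),(?<Col3>.+?(?<!\\)),(?<Col4>.+?(?<!\\)),(?<Col5>.+?(?<!\\)),(?<Col6>.+?(?<!\\)),(?<Col7>.+?(?<!\\)),(?<Col8>.+?(?<!\\)),(?<Col9>.+?(?<!\\)),(?<Col10>.+?(?<!\\)),(?<Col11>.+?(?<!\\))$

With this approach the header row (Col1,Col2,...) would still be indexed as an ordinary event, so you may want to filter it out (for example with a TRANSFORMS rule routing it to nullQueue) or simply ignore it at search time.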
