Regex to extract fields from CSV file based on number of delimiters?

nancylawrence00 · ‎09-07-2015

Hi,

I have two types of logs:

Log1:

Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

Log2:

Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;DATABASE01;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

If I use

^([^;]*;){10}(?<Field_Name>[^;]*)

which is based on the number of ";" delimiters, I get the wrong output from Log2 because it has an extra field (DATABASE01).

I need to count the number of delimiters and then extract the field based on that count.

Please advise.

Thanks
Nancy
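
For readers who want to reproduce the problem outside Splunk, here is a minimal Python sketch that applies the same fixed-position pattern to both sample events. It only illustrates the regex logic and is not a Splunk configuration. Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, and the names `PREFIX`, `SUFFIX`, and `extract_fixed` are made up for this illustration.

```python
import re

# The two sample events from the post; they differ only by the DATABASE01 field.
PREFIX = ("Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;"
          "indexserver;lec056741;XBE;06;30893;")
SUFFIX = ("11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;"
          "CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;"
          "invalid username or password;;5562789;;")
LOG1 = PREFIX + SUFFIX
LOG2 = PREFIX + "DATABASE01;" + SUFFIX

# Skip exactly ten ";"-terminated fields, then capture the eleventh.
FIXED_POSITION = re.compile(r"^([^;]*;){10}(?P<Field_Name>[^;]*)")


def extract_fixed(event):
    """Return the field found at the fixed position, or None if no match."""
    match = FIXED_POSITION.match(event)
    return match.group("Field_Name") if match else None


print(extract_fixed(LOG1))  # LEC Audit - SYSTEM Logins
print(extract_fixed(LOG2))  # 9599
```

The extra field in Log2 shifts every later field one position to the right, so the same count lands on a different value.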

gcato · ‎09-09-2015

Not exactly sure what you want, but maybe splitting the events into multivalue fields and then using multivalue eval commands will get the results you need. It will certainly give you a count of the number of fields.

Anyway, here are some examples of how to use eval's mv commands to extract fields and will hopefully get you going (ignore the first half which is just creating the test output).

|stats count as logs | eval logs="Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;|Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;DATABASE01;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;" 
| eval logs=split(logs, "|") |mvexpand logs
| eval fields = split(logs, ";") 
| eval n1 = mvfind(fields, "LEC.*") | eval extract_field1 = mvindex(fields, n1)
| eval n2 = mvfind(fields, "CONNECT") | eval extract_field2 = mvindex(fields, n1,n2)
| eval n3 = mvfind(fields, "DATABASE\d+") | eval n3=if(isnull(n3), 5, n3) | eval extract_field3 = mvindex(fields, n3+1)
| eval c = mvcount(fields)
| eval extract_field4 = case (c==31, mvindex(fields, c-3), c==32, mvindex(fields, c-3))
| ... <etc> ...

Also refer to the eval command in the docs for further reference and even more mv commands that you can use to hopefully get the results you want. http://docs.splunk.com/Documentation/Splunk/6.2.5/SearchReference/CommonEvalFunctions#Multivalue_fun...
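
The same idea (split on ";", count the fields, then index relative to a landmark or to the end) can be sketched in plain Python. This is only an approximation of what `split`, `mvcount`, `mvfind`, and `mvindex` do in the search above, not a description of how Splunk implements them. The helper names `find_index` and `summarise` are hypothetical.

```python
import re

# The two sample events from the thread; they differ only by DATABASE01.
PREFIX = ("Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;"
          "indexserver;lec056741;XBE;06;30893;")
SUFFIX = ("11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;"
          "CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;"
          "invalid username or password;;5562789;;")
LOG1 = PREFIX + SUFFIX
LOG2 = PREFIX + "DATABASE01;" + SUFFIX


def find_index(fields, pattern):
    """Return the index of the first field matching the regex, or None."""
    for i, value in enumerate(fields):
        if re.search(pattern, value):
            return i
    return None


def summarise(event):
    """Return (field count, field after the optional DATABASE field, third field from the end)."""
    fields = event.split(";")
    count = len(fields)
    n = find_index(fields, r"DATABASE\d+")
    # When there is no DATABASE field, fall back to index 5 (the field before the IP).
    anchor = 5 if n is None else n
    return count, fields[anchor + 1], fields[count - 3]


print(summarise(LOG1))  # (31, '11.91.6.128', '5562789')
print(summarise(LOG2))  # (32, '11.91.6.128', '5562789')
```

Indexing from a landmark or from the end gives the same value for both events, even though the absolute position differs by one.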

nancylawrence00 · ‎09-12-2015

Hi,

I can extract all the fields by:

^([^;]*;){1}(?<Field_Name>[^;]*)
^([^;]*;){2}(?<Field_Name>[^;]*)
^([^;]*;){3}(?<Field_Name>[^;]*) and so on...

The issue is that I have an extra field in the middle of the log, and the above regex extracts the wrong field once it gets to the extra field in some of my logs.

For further info, here are the logs:

Log1:
Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

Log2:
Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;DATABASE01;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

You see there is an extra field in Log2 (DATABASE01), so when I use the regex mentioned above the numbering changes and it pulls up the wrong field.

gcato · ‎09-15-2015

Hi nancylawrence007,

I understand the issue, but I'm still not sure exactly what you want from the examples provided. As far as I know, regex cannot count delimiters, so that is not an option. Basically, you need to normalise the logs so they always have the same number of fields, for which the sed command is perfect. So either:

1. add an extra ";" between "field6;field_7" in log1 (32 fields), or
2. substitute the ";" before the DATABASE01 field in log2 (31 fields), or
3. remove the ";DATABASE01" section (31 fields)

I'd recommend #1, as #2 means you would need to perform another regex on the substituted field to extract its two values, and #3 removes a field (though it could be extracted earlier). Anyway, the run-anywhere "rex mode=sed ..." example below would do the trick for option #1 (assuming log1's 6th and 7th fields are consistently a "number;ip_addr").

 |stats count as logs | eval logs="Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;|Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;DATABASE01;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;"
 | eval logs=split(logs, "|") |mvexpand logs
 | rex mode=sed field=logs "s/((?:[^;]+;){5}\d+;)(\d+\.\d+\.\d+\.\d+;.*)/\1;\2/"
 | eval log_fields=split(logs, ";") |eval field_count = mvcount(log_fields)

Now, both logs have the same number of fields (as the field_count shows) and whatever regex extraction you are using will work.
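
As a rough cross-check outside Splunk, the same substitution can be tried with Python's `re.sub`. This sketch reuses the pattern from the sed expression above and assumes, as stated there, that the 6th and 7th fields of the shorter event are always a number followed by an IP address. The name `normalise` is made up for this illustration.

```python
import re

# The two sample events from the thread; they differ only by DATABASE01.
PREFIX = ("Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;"
          "indexserver;lec056741;XBE;06;30893;")
SUFFIX = ("11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;"
          "CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;"
          "invalid username or password;;5562789;;")
LOG1 = PREFIX + SUFFIX
LOG2 = PREFIX + "DATABASE01;" + SUFFIX

# Five non-empty fields, a numeric field, then an IP address: the Log1 layout.
NUMBER_THEN_IP = re.compile(r"((?:[^;]+;){5}\d+;)(\d+\.\d+\.\d+\.\d+;.*)")


def normalise(event):
    """Insert an empty field before the IP address when DATABASE is absent."""
    return NUMBER_THEN_IP.sub(r"\1;\2", event, count=1)


for event in (LOG1, LOG2):
    fields = normalise(event).split(";")
    print(len(fields), repr(fields[6]), fields[7])
# 32 '' 11.91.6.128
# 32 'DATABASE01' 11.91.6.128
```

After normalisation a position-based extraction sees the IP address at the same index in both events; events that already contain the DATABASE field are left unchanged because the pattern does not match them.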

richgalloway · ‎09-09-2015

Each log should be a separate sourcetype with appropriate extractions created for them.

---
If this reply helps you, Karma would be appreciated.

nancylawrence00 · ‎09-09-2015

I am afraid we cannot do so... I need a regex to count the delimiters first and then, based on the number of delimiters, extract the fields accordingly.

MuS · ‎09-09-2015

Sorry to say, but you're contradicting yourself with your comments. In one comment you need to count the ; and extract fields based on the count; in the next comment you want all fields?!?! It's pretty hard to help this way, and since most of us are doing this voluntarily, it would be nice if you provided precise information... otherwise it's like asking the magic 8-ball.

nancylawrence00 · ‎09-12-2015

Hi,

I have to extract all the fields by:

^([^;]*;){1}(?<Field_Name>[^;]*)
^([^;]*;){2}(?<Field_Name>[^;]*)
^([^;]*;){3}(?<Field_Name>[^;]*) and so on...

The issue is that I have an extra field in the middle of the log, and the above regex extracts the wrong field once it gets to the extra field in some of my logs.

For further info, here are the logs:

Log1:
Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

Log2:
Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;DATABASE01;11.91.6.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

You see there is an extra field in Log2 (DATABASE01), so when I use the regex mentioned above the numbering changes and it pulls up the wrong field.

nancylawrence00 · ‎09-07-2015

The logs above are just examples; the exact logs differ but are also separated by delimiters.

nancylawrence00 · ‎09-09-2015

I need to extract all the fields.

nancylawrence00 · ‎09-07-2015

Jun 18 14:10:57 lec05674568 ABC[455135]: 2015-06-18 14:10:57;indexserver;lec056741;XBE;06;30893;10.81.5.128;lec056287312;9580;9599;LEC Audit - SYSTEM Logins;CRITICAL;CONNECT;SYSTEM;;;;;;DAPNAPS;UNSUCCESSFUL;;;;;;invalid username or password;;5562789;;

MuS · ‎09-07-2015

Thanks 🙂 In this example, where is the time you need?

MuS · ‎09-08-2015

Sorry to ask again, but which value do you want to get from each of the two examples?
Your regex captures from Log1:

9599;LEC Audit - SYSTEM Logins

and from Log2:

9580;9599

If you don't provide this information there is no way to help you with this problem.

MuS · ‎09-07-2015

Either provide the exact log events or try the second regex mentioned by @gcato.

MuS · ‎09-07-2015

Hi nancylawrence007,

Or you can start from the end of the event: if the end is always ;VPN; you can use this regex:

;(?<Event_TimeStamp>\d+:\d+);VPN;

cheers, MuS

gcato · ‎09-07-2015

Nice answer MuS.

Just thinking that if the event ending is different, then you may also be able to just extract the time with something like this (if no other fields match the time pattern):

... |eval Event_TimeStamp=_raw |rex mode=sed field=Event_TimeStamp "s/([^;]+;)+(\d\d:\d\d)(;.*$)/\2/"

gcato · ‎09-07-2015

Though, thinking about it again, if the time pattern is unique across all fields, then you do not need to worry about the number of delimiters (or sed), and the following would work too:

 rex "([^;]+;)+(?<Event_TimeStamp>\d\d:\d\d);"
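
A small Python sketch of this pattern-based approach follows. The events containing the time field are not shown in this thread, so `SHORT_EVENT` and `LONG_EVENT` below are invented purely for illustration, as is the helper name `extract_time`. Python writes the named group as `(?P<Event_TimeStamp>...)`.

```python
import re

# Hypothetical events (NOT from the thread): same layout, one has an extra field.
SHORT_EVENT = "hostA;login;10:42;VPN;"
LONG_EVENT = "hostA;login;EXTRA;10:42;VPN;"

# One or more non-empty ";"-terminated fields, then an HH:MM field.
TIME_FIELD = re.compile(r"(?:[^;]+;)+(?P<Event_TimeStamp>\d\d:\d\d);")


def extract_time(event):
    """Return the first HH:MM field that follows at least one other field."""
    match = TIME_FIELD.search(event)
    return match.group("Event_TimeStamp") if match else None


print(extract_time(SHORT_EVENT))  # 10:42
print(extract_time(LONG_EVENT))   # 10:42
```

Because the match keys on the shape of the value rather than on its position, the extra field makes no difference. The approach only holds if no other field in the real events looks like HH:MM.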

