Dashboards & Visualizations

Splunk Regex (mail csv data extraction)

Joannna
Explorer

I have a field that's called file_content on an source type.
This has a CSV inside.

Meaning every event has a field (file_content) that has a csv inside it. Every event is an email Can't be field extraction as the "file_content" is really hard to find inside the data.

I used the regex query to extract the data, and I have 2 issues with it,

 

1. field TicketNumber has both, but my regex ignores the " and I get number2 on the next column
- ," number , number2 ",
- ,number,


2. It's very slow as I get 1 CSV per hour every day. so i wonder if there is any automation or a better way to do this


First i get all lines:

| makemv delim="
" file_content
| mvexpand file_content
| table file_content _time

Then I get the regex per line

| rex field=file_content "(?P<ContactId>[^\s,]+),(?P<Customernumber>[^\s,]+),(?P<AfterContactWorkDuration>[^\s,]+),(?P<AfterContactWorkEndTimestamp>[^\s,]+),(?P<AfterContactWorkStartTimestamp>[^\s,]+),(?P<AgentInteractionDuration>[^\s,]+),(?P<ConnectedToAgentTimestamp>[^\s,]+),(?P<CustomerHoldDuration>[^\s,]+),(?P<Hierarchygroups_Level1_GroupName>[^\s,]+),(?P<Hierarchygroups_Level2_GroupName>[^\s,]+),(?P<Hierarchygroups_Level3_GroupName>[^\s,]+),(?P<LongestHoldDuration>[^\s,]+),(?P<NumberOfHolds>[^\s,]+),(?P<Routingprofile>[^\s,]+),(?P<Agent>[^\s,]+),(?P<AgentConnectionAttempts>[^\s,]+),(?P<ConnectedToSystemTimestamp>[^\s,]+),(?P<DisconnectTimestamp>[^\s,]+),(?P<InitiationMethod>[^\s,]+),(?P<InitiationTimestamp>[^\s,]+),(?P<LastUpdateTimestamp>[^\s,]+),(?P<NextContactId>[^\s,]+),(?P<PreviousContactId>[^\s,]+),(?P<DequeueTimestamp>[^\s,]+),(?P<Duration>[^\s,]+),(?P<EnqueueTimestamp>[^\s,]+),(?P<Name>[^\s,]+),(?P<TransferCompletedTimestamp>[^\s,]+),(?P<HandleTime>[^\s,]+),(?P<TicketNumber>[^\s,]+),(?P<Account>[^\s,]+),(?P<AccountName>[^\s,]+),(?P<Country>[^\s,]+),(?P<Language>[^\s,]+),(?P<Site>[^\s,]+),(?P<WrapCode>[^\s,]+)"


And here is an example of how the data should look like (in csv):

 

ContactId,Customernumber,AfterContactWorkDuration,AfterContactWorkEndTimestamp,AfterContactWorkStartTimestamp,AgentInteractionDuration,ConnectedToAgentTimestamp,CustomerHoldDuration,Hierarchygroups_Level1_GroupName,Hierarchygroups_Level2_GroupName,Hierarchygroups_Level3_GroupName,LongestHoldDuration,NumberOfHolds,Routingprofile,Agent,AgentConnectionAttempts,ConnectedToSystemTimestamp,DisconnectTimestamp,InitiationMethod,InitiationTimestamp,LastUpdateTimestamp,NextContactId,PreviousContactId,DequeueTimestamp,Duration,EnqueueTimestamp,Name,TransferCompletedTimestamp,HandleTime,TicketNumber,Account,AccountName,Country,Language,Site,WrapCode
aaaa-xxxxxx,123456789,90,29/06/2021 01:00,29/06/2021 01:00,111,29/06/2021 01:00,0,country1,xx,yy,90,90,language,dummy,1,29/06/2021 01:00,29/06/2021 01:00,type_x,29/06/2021 01:00,29/06/2021 01:00,,,29/06/2021 01:00,11,29/06/2021 01:00,type_y,29/06/2021 01:00,201,A123,xxx,xxx,country_y,language,type_w,xxxx
bbbb-xxxxxx,987654321,90,29/06/2021 01:00,29/06/2021 01:00,111+P4,29/06/2021 01:00,0,country1,xx,yy,90,90,language,dummy,1,29/06/2021 01:00,29/06/2021 01:00,type_x,29/06/2021 01:00,29/06/2021 01:00,,,29/06/2021 01:00,11,29/06/2021 01:00,type_y,29/06/2021 01:00,201,"""A123,B123""",xxx,xxx,country_y,language,type_w,xxxx

 

Labels (1)
0 Karma
1 Solution

ITWhisperer
SplunkTrust
SplunkTrust

Your timestamps have spaces in (not accounted for in your expression) and the ticket number optionally has pairs of double quotes. Try something like this

| rex field=file_content "(?P<ContactId>[^\s,]*),(?P<Customernumber>[^\s,]*),(?P<AfterContactWorkDuration>[^\s,]*),(?P<AfterContactWorkEndTimestamp>[^,]*),(?P<AfterContactWorkStartTimestamp>[^,]*),(?P<AgentInteractionDuration>[^\s,]*),(?P<ConnectedToAgentTimestamp>[^,]*),(?P<CustomerHoldDuration>[^\s,]*),(?P<Hierarchygroups_Level1_GroupName>[^\s,]*),(?P<Hierarchygroups_Level2_GroupName>[^\s,]*),(?P<Hierarchygroups_Level3_GroupName>[^\s,]*),(?P<LongestHoldDuration>[^\s,]*),(?P<NumberOfHolds>[^\s,]*),(?P<Routingprofile>[^\s,]*),(?P<Agent>[^\s,]*),(?P<AgentConnectionAttempts>[^\s,]*),(?P<ConnectedToSystemTimestamp>[^,]*),(?P<DisconnectTimestamp>[^,]*),(?P<InitiationMethod>[^\s,]*),(?P<InitiationTimestamp>[^,]*),(?P<LastUpdateTimestamp>[^,]*),(?P<NextContactId>[^\s,]*),(?P<PreviousContactId>[^\s,]*),(?P<DequeueTimestamp>[^,]*),(?P<Duration>[^\s,]*),(?P<EnqueueTimestamp>[^,]*),(?P<Name>[^\s,]*),(?P<TransferCompletedTimestamp>[^,]*),(?P<HandleTime>[^\s,]*),(?P<TicketNumber>((\"[^\"]*\")+|[^\s,]*)),(?P<Account>[^\s,]*),(?P<AccountName>[^\s,]*),(?P<Country>[^\s,]*),(?P<Language>[^\s,]*),(?P<Site>[^\s,]*),(?P<WrapCode>[^\s,]*)"

View solution in original post

ITWhisperer
SplunkTrust
SplunkTrust

Your timestamps have spaces in (not accounted for in your expression) and the ticket number optionally has pairs of double quotes. Try something like this

| rex field=file_content "(?P<ContactId>[^\s,]*),(?P<Customernumber>[^\s,]*),(?P<AfterContactWorkDuration>[^\s,]*),(?P<AfterContactWorkEndTimestamp>[^,]*),(?P<AfterContactWorkStartTimestamp>[^,]*),(?P<AgentInteractionDuration>[^\s,]*),(?P<ConnectedToAgentTimestamp>[^,]*),(?P<CustomerHoldDuration>[^\s,]*),(?P<Hierarchygroups_Level1_GroupName>[^\s,]*),(?P<Hierarchygroups_Level2_GroupName>[^\s,]*),(?P<Hierarchygroups_Level3_GroupName>[^\s,]*),(?P<LongestHoldDuration>[^\s,]*),(?P<NumberOfHolds>[^\s,]*),(?P<Routingprofile>[^\s,]*),(?P<Agent>[^\s,]*),(?P<AgentConnectionAttempts>[^\s,]*),(?P<ConnectedToSystemTimestamp>[^,]*),(?P<DisconnectTimestamp>[^,]*),(?P<InitiationMethod>[^\s,]*),(?P<InitiationTimestamp>[^,]*),(?P<LastUpdateTimestamp>[^,]*),(?P<NextContactId>[^\s,]*),(?P<PreviousContactId>[^\s,]*),(?P<DequeueTimestamp>[^,]*),(?P<Duration>[^\s,]*),(?P<EnqueueTimestamp>[^,]*),(?P<Name>[^\s,]*),(?P<TransferCompletedTimestamp>[^,]*),(?P<HandleTime>[^\s,]*),(?P<TicketNumber>((\"[^\"]*\")+|[^\s,]*)),(?P<Account>[^\s,]*),(?P<AccountName>[^\s,]*),(?P<Country>[^\s,]*),(?P<Language>[^\s,]*),(?P<Site>[^\s,]*),(?P<WrapCode>[^\s,]*)"

Joannna
Explorer

wow thank you so much the issue is now , the 2second one that it's too much data and I get this error :

 

command.mvexpand: output will be truncated at 15700 results due to excessive memory usage. Memory threshold of 500MB as configured in limits.conf / [mvexpand] / max_mem_usage_mb has been reached.
 
Anything I can do here or any other way I can do this extraction?
For example you can run a report everyday and save the outcome to an lookup , but this wouldn't work as it would be too much data for a lookup , is there any other solution?
 
0 Karma

kamlesh_vaghela
SplunkTrust
SplunkTrust

@Joannna 

Instead of this

| makemv delim="
" file_content
| mvexpand file_content
| table file_content _time

 

I suggest below solutions. You can pick any of them

| rename file_content as _raw
| multikv forceheader=1
| table _time _raw

OR

| makemv delim="
" file_content
| stats count by _time file_content

 

Thanks
KV
▄︻̷̿┻̿═━一

If any of my reply helps you to solve the problem Or gain knowledge, an upvote would be appreciated.

Joannna
Explorer

Thank you, i tested both and checked the fastest one is the second approach takes 1.321 secs my original after moving the table to the top as sugested was 1.789 secs. 🙂 thanks 

 

Tags (1)
0 Karma

ITWhisperer
SplunkTrust
SplunkTrust

Try moving the table command to before the mvexpand that way _raw doesn't get copied into every event before being eliminated. Other than that, you could try increasing the limit in limits.conf, or there is a longer way around it described in a post I made a while ago (although this doesn't really solve the memory issue, only the row count issue).

Joannna
Explorer

Well you fixed my issues, thank you, 🙂

I looked around and I found somepeople talk about summary index , do you think this would be a good option for me ?

0 Karma
Get Updates on the Splunk Community!

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...

Introducing the 2024 Splunk MVPs!

We are excited to announce the 2024 cohort of the Splunk MVP program. Splunk MVPs are passionate members of ...

Splunk Custom Visualizations App End of Life

The Splunk Custom Visualizations apps End of Life for SimpleXML will reach end of support on Dec 21, 2024, ...