I have a field that's called file_content on an source type.
This has a CSV inside.
Meaning every event has a field (file_content) that has a csv inside it. Every event is an email Can't be field extraction as the "file_content" is really hard to find inside the data.
I used the regex query to extract the data, and I have 2 issues with it,
1. field TicketNumber has both, but my regex ignores the " and I get number2 on the next column
- ," number , number2 ",
- ,number,
2. It's very slow as I get 1 CSV per hour every day. so i wonder if there is any automation or a better way to do this
First i get all lines:
| makemv delim="
" file_content
| mvexpand file_content
| table file_content _time
Then I get the regex per line
| rex field=file_content "(?P<ContactId>[^\s,]+),(?P<Customernumber>[^\s,]+),(?P<AfterContactWorkDuration>[^\s,]+),(?P<AfterContactWorkEndTimestamp>[^\s,]+),(?P<AfterContactWorkStartTimestamp>[^\s,]+),(?P<AgentInteractionDuration>[^\s,]+),(?P<ConnectedToAgentTimestamp>[^\s,]+),(?P<CustomerHoldDuration>[^\s,]+),(?P<Hierarchygroups_Level1_GroupName>[^\s,]+),(?P<Hierarchygroups_Level2_GroupName>[^\s,]+),(?P<Hierarchygroups_Level3_GroupName>[^\s,]+),(?P<LongestHoldDuration>[^\s,]+),(?P<NumberOfHolds>[^\s,]+),(?P<Routingprofile>[^\s,]+),(?P<Agent>[^\s,]+),(?P<AgentConnectionAttempts>[^\s,]+),(?P<ConnectedToSystemTimestamp>[^\s,]+),(?P<DisconnectTimestamp>[^\s,]+),(?P<InitiationMethod>[^\s,]+),(?P<InitiationTimestamp>[^\s,]+),(?P<LastUpdateTimestamp>[^\s,]+),(?P<NextContactId>[^\s,]+),(?P<PreviousContactId>[^\s,]+),(?P<DequeueTimestamp>[^\s,]+),(?P<Duration>[^\s,]+),(?P<EnqueueTimestamp>[^\s,]+),(?P<Name>[^\s,]+),(?P<TransferCompletedTimestamp>[^\s,]+),(?P<HandleTime>[^\s,]+),(?P<TicketNumber>[^\s,]+),(?P<Account>[^\s,]+),(?P<AccountName>[^\s,]+),(?P<Country>[^\s,]+),(?P<Language>[^\s,]+),(?P<Site>[^\s,]+),(?P<WrapCode>[^\s,]+)"
And here is an example of how the data should look like (in csv):
ContactId,Customernumber,AfterContactWorkDuration,AfterContactWorkEndTimestamp,AfterContactWorkStartTimestamp,AgentInteractionDuration,ConnectedToAgentTimestamp,CustomerHoldDuration,Hierarchygroups_Level1_GroupName,Hierarchygroups_Level2_GroupName,Hierarchygroups_Level3_GroupName,LongestHoldDuration,NumberOfHolds,Routingprofile,Agent,AgentConnectionAttempts,ConnectedToSystemTimestamp,DisconnectTimestamp,InitiationMethod,InitiationTimestamp,LastUpdateTimestamp,NextContactId,PreviousContactId,DequeueTimestamp,Duration,EnqueueTimestamp,Name,TransferCompletedTimestamp,HandleTime,TicketNumber,Account,AccountName,Country,Language,Site,WrapCode
aaaa-xxxxxx,123456789,90,29/06/2021 01:00,29/06/2021 01:00,111,29/06/2021 01:00,0,country1,xx,yy,90,90,language,dummy,1,29/06/2021 01:00,29/06/2021 01:00,type_x,29/06/2021 01:00,29/06/2021 01:00,,,29/06/2021 01:00,11,29/06/2021 01:00,type_y,29/06/2021 01:00,201,A123,xxx,xxx,country_y,language,type_w,xxxx
bbbb-xxxxxx,987654321,90,29/06/2021 01:00,29/06/2021 01:00,111+P4,29/06/2021 01:00,0,country1,xx,yy,90,90,language,dummy,1,29/06/2021 01:00,29/06/2021 01:00,type_x,29/06/2021 01:00,29/06/2021 01:00,,,29/06/2021 01:00,11,29/06/2021 01:00,type_y,29/06/2021 01:00,201,"""A123,B123""",xxx,xxx,country_y,language,type_w,xxxx
Your timestamps have spaces in (not accounted for in your expression) and the ticket number optionally has pairs of double quotes. Try something like this
| rex field=file_content "(?P<ContactId>[^\s,]*),(?P<Customernumber>[^\s,]*),(?P<AfterContactWorkDuration>[^\s,]*),(?P<AfterContactWorkEndTimestamp>[^,]*),(?P<AfterContactWorkStartTimestamp>[^,]*),(?P<AgentInteractionDuration>[^\s,]*),(?P<ConnectedToAgentTimestamp>[^,]*),(?P<CustomerHoldDuration>[^\s,]*),(?P<Hierarchygroups_Level1_GroupName>[^\s,]*),(?P<Hierarchygroups_Level2_GroupName>[^\s,]*),(?P<Hierarchygroups_Level3_GroupName>[^\s,]*),(?P<LongestHoldDuration>[^\s,]*),(?P<NumberOfHolds>[^\s,]*),(?P<Routingprofile>[^\s,]*),(?P<Agent>[^\s,]*),(?P<AgentConnectionAttempts>[^\s,]*),(?P<ConnectedToSystemTimestamp>[^,]*),(?P<DisconnectTimestamp>[^,]*),(?P<InitiationMethod>[^\s,]*),(?P<InitiationTimestamp>[^,]*),(?P<LastUpdateTimestamp>[^,]*),(?P<NextContactId>[^\s,]*),(?P<PreviousContactId>[^\s,]*),(?P<DequeueTimestamp>[^,]*),(?P<Duration>[^\s,]*),(?P<EnqueueTimestamp>[^,]*),(?P<Name>[^\s,]*),(?P<TransferCompletedTimestamp>[^,]*),(?P<HandleTime>[^\s,]*),(?P<TicketNumber>((\"[^\"]*\")+|[^\s,]*)),(?P<Account>[^\s,]*),(?P<AccountName>[^\s,]*),(?P<Country>[^\s,]*),(?P<Language>[^\s,]*),(?P<Site>[^\s,]*),(?P<WrapCode>[^\s,]*)"
Your timestamps have spaces in (not accounted for in your expression) and the ticket number optionally has pairs of double quotes. Try something like this
| rex field=file_content "(?P<ContactId>[^\s,]*),(?P<Customernumber>[^\s,]*),(?P<AfterContactWorkDuration>[^\s,]*),(?P<AfterContactWorkEndTimestamp>[^,]*),(?P<AfterContactWorkStartTimestamp>[^,]*),(?P<AgentInteractionDuration>[^\s,]*),(?P<ConnectedToAgentTimestamp>[^,]*),(?P<CustomerHoldDuration>[^\s,]*),(?P<Hierarchygroups_Level1_GroupName>[^\s,]*),(?P<Hierarchygroups_Level2_GroupName>[^\s,]*),(?P<Hierarchygroups_Level3_GroupName>[^\s,]*),(?P<LongestHoldDuration>[^\s,]*),(?P<NumberOfHolds>[^\s,]*),(?P<Routingprofile>[^\s,]*),(?P<Agent>[^\s,]*),(?P<AgentConnectionAttempts>[^\s,]*),(?P<ConnectedToSystemTimestamp>[^,]*),(?P<DisconnectTimestamp>[^,]*),(?P<InitiationMethod>[^\s,]*),(?P<InitiationTimestamp>[^,]*),(?P<LastUpdateTimestamp>[^,]*),(?P<NextContactId>[^\s,]*),(?P<PreviousContactId>[^\s,]*),(?P<DequeueTimestamp>[^,]*),(?P<Duration>[^\s,]*),(?P<EnqueueTimestamp>[^,]*),(?P<Name>[^\s,]*),(?P<TransferCompletedTimestamp>[^,]*),(?P<HandleTime>[^\s,]*),(?P<TicketNumber>((\"[^\"]*\")+|[^\s,]*)),(?P<Account>[^\s,]*),(?P<AccountName>[^\s,]*),(?P<Country>[^\s,]*),(?P<Language>[^\s,]*),(?P<Site>[^\s,]*),(?P<WrapCode>[^\s,]*)"
wow thank you so much the issue is now , the 2second one that it's too much data and I get this error :
Instead of this
| makemv delim="
" file_content
| mvexpand file_content
| table file_content _time
I suggest below solutions. You can pick any of them
| rename file_content as _raw
| multikv forceheader=1
| table _time _raw
OR
| makemv delim="
" file_content
| stats count by _time file_content
Thanks
KV
▄︻̷̿┻̿═━一
If any of my reply helps you to solve the problem Or gain knowledge, an upvote would be appreciated.
Thank you, i tested both and checked the fastest one is the second approach takes 1.321 secs my original after moving the table to the top as sugested was 1.789 secs. 🙂 thanks
Try moving the table command to before the mvexpand that way _raw doesn't get copied into every event before being eliminated. Other than that, you could try increasing the limit in limits.conf, or there is a longer way around it described in a post I made a while ago (although this doesn't really solve the memory issue, only the row count issue).
Well you fixed my issues, thank you, 🙂
I looked around and I found somepeople talk about summary index , do you think this would be a good option for me ?