I am working with the following input and wanted some advice on how/where to specify the field extractions:
"\x00\x00\x00103700079 C9E840 13372786523 7137 210018 51730064 #850 1 000 "
I have documentation from the vendor specifying value lengths and definitions, and we can perform most field extractions via individual regex field extractions, but we wanted to know if there is a better or more efficient method recommended.
For reference, the field mapping table is listed below, and I have included samples of a couple of the current field extractions.
1-2 Time of day-hours
3-4 Time of day-minutes
5 Duration-hours
6-7 Duration-minutes
8 Duration-tenths of minutes
9 Condition code
10-13 Access code dialed
14-17 Access code used
18-32 Dialed number
33-42 Calling number
43-57 Account code
58-64 Authorization code
65-66 Space
67 FRL
68-70 Incoming circuit ID (hundreds, tens, units)
71-73 Outgoing circuit ID (hundreds, tens, units)
74 Feature flag
75-76 Attendant console
77-80 Incoming TAC
81-82 Node number
83-85 INS
86-88 IXC
89 BCC
90 MA-UUI
91 Resource flag
92-95 Packet count
96 TSC flag
97-100 Reserved
101 Carriage return (Not displayed)
102 Line feed (Not displayed)
103-105 Null (displayed as “\x00\x00\x00” at beginning of new line)
For example, to extract the duration hours, minutes, tenths of minutes we use the following regex:
"^.{16}(?<duration_hours>\d{1})"
"^.{17}(?<duration_minutes>\d{2})"
"^.{19}(?<duration_tenths_minutes>\d{1})"
A single regular expression is IMO the most efficient way to extract the fields here. To get rid of the \x00 values in your events, you could adjust the LINE_BREAKER settings of your sourcetype:
props.conf:
[<your sourcetype>]
LINE_BREAKER=([\x00\r\n]+)
EXTRACT-fields=<the regex here>
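To see what that character class does and does not strip, here is a rough Python sketch. It only approximates LINE_BREAKER with `re.split` and is not how Splunk itself is implemented. The second record and the helper name `split_events` are made up for illustration. It compares a stream containing real NUL bytes with one where the NULs have already been escaped into the literal text `\x00`:

```python
import re

# Same character class as the LINE_BREAKER capture group above.
BREAKER = re.compile(r"[\x00\r\n]+")

def split_events(stream):
    """Split on delimiter runs and drop empty leading/trailing pieces."""
    return [piece for piece in BREAKER.split(stream) if piece]

# Synthetic stream with real NUL, CR and LF characters around two records.
real_nulls = "\x00\x00\x00103700079\r\n\x00\x00\x00103800012\r\n"
print(split_events(real_nulls))

# Same records, but the NULs are literal backslash-x-0-0 text (4 chars each).
escaped = r"\x00\x00\x00103700079" + "\r\n" + r"\x00\x00\x00103800012" + "\r\n"
print(split_events(escaped))

# 12 literal characters plus 4 time-of-day digits gives the offset of 16
# used in the question's duration_hours regex.
print(len(r"\x00\x00\x00") + 4)
```

In this sketch the class removes real NUL bytes but leaves the escaped text untouched. The offset of 16 in the question's regex is consistent with the escaped form, though that is an inference from the sample, not something the thread confirms.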
Most efficient would probably be a single search time REGEX extraction:
EXTRACT-fields = (?<hour>.{2})(?<min>.{2})(?<duration_h>.)(?<duration_m>.{2})(?<duration_mtenths>.)(?<cc>.)(?<accesscd_dialed>.{4})(?<accesscd_used>.{4})(?<num_dialed>.{15})(?<num_calling>.{10})
And so on. That way, all fields are extracted in a single pass over the data. Note that with this particular data, you may run into problems searching for particular fields by a specific value (if the value is pressed right up against adjacent fields with no white space). You can deal with that for selected fields you commonly search on by using index-time extractions, but again, do so selectively and only if you determine it's really necessary for that field (e.g., don't do it with the time fields, and probably not with the dialed number).
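Writing the remaining fields out by hand is error-prone, so one option is to generate the pattern from the vendor's column table. The following Python sketch assumes the field names used above (they are illustrative, not vendor-defined) and that the event starts at column 1, i.e. any leading null prefix has already been removed. It produces a pattern of the same shape as the EXTRACT regex, then checks it against a synthetic record. Only the first nine characters of that record come from the sample event; the rest is filler.

```python
import re

# (field name, first column, last column), 1-based inclusive, taken from
# the first rows of the field mapping table. Names are illustrative.
FIELDS = [
    ("hour", 1, 2),
    ("min", 3, 4),
    ("duration_h", 5, 5),
    ("duration_m", 6, 7),
    ("duration_mtenths", 8, 8),
    ("cc", 9, 9),
    ("accesscd_dialed", 10, 13),
    ("accesscd_used", 14, 17),
    ("num_dialed", 18, 32),
    ("num_calling", 33, 42),
]

def build_regex(fields):
    """Build one fixed-width pattern with a named group per field."""
    parts = []
    next_col = 1
    for name, start, end in fields:
        if start > next_col:
            # Skip columns that are not mapped to a named field.
            parts.append(f".{{{start - next_col}}}")
        width = end - start + 1
        quant = "" if width == 1 else f"{{{width}}}"
        parts.append(f"(?<{name}>.{quant})")
        next_col = end + 1
    return "".join(parts)

pattern = build_regex(FIELDS)
print(pattern)

# Python's re module spells named groups (?P<name>...), so convert the
# PCRE-style pattern before testing it locally.
py_pattern = pattern.replace("(?<", "(?P<")

# First 9 characters are from the sample event; the rest is synthetic filler.
record = "103700079" + "AAAA" + "BBBB" + "D" * 15 + "C" * 10
match = re.match(py_pattern, record)
print(match.groupdict())
```

Extending `FIELDS` with the remaining rows of the table would yield the full pattern to paste after `EXTRACT-fields =`.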
Thank you, I think this is the information we were looking for.
Your time and attention is greatly appreciated!
Because if you're not searching for the specific values, indexing more fields will increase the size of the index, which can decrease performance for all searches. If you are searching rarely for specific values of fieldname, you can search with fieldname=*value* (vs fieldname=value), which will work but will be slower for that search only. If you are not searching for specific values, but reporting instead (e.g., stats count by number_dialed), then indexed fields are no better than search-time extracted ones.
It sounds like index-time extraction is best, as many of the fields are adjacent.
Why do you recommend against including items such as time or dialed number in the index-time extraction? The target application will be a Call Detail Record index, and a sub-component of an event correlation system.
The code:
LINE_BREAKER=([\x00\r\n]+)
Does not appear to be removing the "\x00\x00\x00" from the