Looking for assistance on manually building a regex for the following data. Here is the data below and how far along I was able to get with the Splunk regex builder. I continue getting the following error message:
The extraction failed. If you are extracting multiple fields, try removing one or more fields. Start with extractions that are embedded within longer text strings.
Can you assist with that? I appreciate it. Each field is separated by a comma.
Data:
Jan 7 10:34:22, 172.20.1.62, Jan 7 14:34:23, DSO-TW-ASA-Prim-SFR SFIMS: [Primary Detection Engine (252a23cc-7196-11e4-8256-c709c2db90d1)][FMPA - Main Policy], Connection Type: End, User: fred, Client: SSL client, Application Protocol: HTTPS, Web App: Unknown, Access Control Rule Name: Malware | URL Monitor, Access Control Rule Action: Allow, Access Control Rule Reasons: Unknown, URL Category: Government, URL Reputation: High risk, URL: https://sharepoint.fmpa.com, Interface Ingress: MPLS-MFN, Interface Egress: RouterNet, Security Zone Ingress: N/A, Security Zone Egress: N/A, Security Intelligence Matching IP: None, Security Intelligence Category: None, Client Version: (null), Number of File Events: 0, Number of IPS Events: 0, TCP Flags: 0x0, NetBIOS Domain: (null), Initiator Packets: 9, Responder Packets: 9, Initiator Bytes: 2457, Responder Bytes: 2974, Context: unknown {TCP} 172.23.3.151:60442 -> 10.0.0.88:443
Regex:
^(?P<Extract_Date>\w+\s+\d+\s+\d+:\d+:\d+)\s+(?P<Host>[^ ]+)\s+(?P<Date>\w+\s+\d+)\s+(?P<Time>[^ ]+)[^:\n]*:\s+(?P<DSO>\[\w+\s+\w+\s+\w+\s+\([a-f0-9]+\-\d+\-[a-f0-9]+\-\d+\-[a-f0-9]+\)\]\[\w+\s+\-\s+\w+\s+\w+\])(?:[^ \n]* ){3}(?P<Connection_Type>[^,]+)[^,\n]*,\s+\w+:\s+(?P<User>[^,]+),\s+\w+:\s+(?P<Client>[^,]+)[^:\n]*:\s+(?P<App_Protocol>\w+)
I'd recommend you use a tool like RegEx101.com
I'll get you started:
^(?P<extract_date>.*?),(?P<host>.*?),\s(?P<date>\w+\s\d+)\s(?P<time>\d+\:\d+\:\d+),\s(?P<DSO>.*?):.*Connection\sType\:(?P<connection_type>.*?),\sUser\:\s(?P<user>.*?),
That will extract the following fields and values:
extract_date [0-14] Jan 7 10:34:22
host [15-27] 172.20.1.62
date [29-34] Jan 7
time [35-43] 14:34:23
DSO [45-70] DSO-TW-ASA-Prim-SFR SFIMS
connection_type [175-179] End
user [187-191] fred
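To check the expression above outside Splunk, here is a minimal Python sketch. It assumes Python's `re` module is close enough to Splunk's PCRE for this pattern; the `extract` helper and the shortened sample event are made up for illustration. Note that the lazy `.*?` groups pick up the space that follows each delimiter, so the sketch strips whitespace from every value.

```python
import re

# Starter expression from the answer above, unchanged.
PATTERN = re.compile(
    r"^(?P<extract_date>.*?),(?P<host>.*?),\s(?P<date>\w+\s\d+)\s"
    r"(?P<time>\d+\:\d+\:\d+),\s(?P<DSO>.*?):.*Connection\sType\:"
    r"(?P<connection_type>.*?),\sUser\:\s(?P<user>.*?),"
)

# Shortened copy of the sample event (everything after "Client" is cut).
SAMPLE = (
    "Jan 7 10:34:22, 172.20.1.62, Jan 7 14:34:23, DSO-TW-ASA-Prim-SFR SFIMS: "
    "[Primary Detection Engine (252a23cc-7196-11e4-8256-c709c2db90d1)]"
    "[FMPA - Main Policy], Connection Type: End, User: fred, Client: SSL client"
)


def extract(event):
    """Hypothetical helper: return the named groups, or {} if nothing matches."""
    match = PATTERN.search(event)
    if match is None:
        return {}
    # strip() drops the leading space that .*? captures after a delimiter
    return {name: value.strip() for name, value in match.groupdict().items()}


fields = extract(SAMPLE)
for name, value in fields.items():
    print(f"{name}: {value}")
```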
If you want more help, please specify the exact fields you want to extract and the associated values.
Jchampagne, quite honestly I feel lost even in regex101. I am trying to extract all the fields in this report, and I list the remaining ones I need below. The problem is that, because of the structure, I keep getting the same error. Basically, every remaining field whose label comes before a ":" is what I am trying to extract. Let me know if you can assist.
Jan 15 14:09:43 172.20.1.62 Jan 15 18:09:49 DSO-TW-ASA-Prim-SFR SFIMS: [Primary Detection Engine (252a23cc-7196-11e4-8256-c709c2db90d1)][FMPA - Main Policy] Connection Type: End, User: annb, Client: SSL client, Application Protocol: HTTPS, Web App: Unknown, Access Control Rule Name: Malware | URL Monitor, Access Control Rule Action: Allow, Access Control Rule Reasons: Unknown, URL Category: Government, URL Reputation: High risk, URL: https://fmpa.com, Interface Ingress: Internet, Interface Egress: RouterNet, Security Zone Ingress: N/A, Security Zone Egress: N/A, Security Intelligence Matching IP: None, Security Intelligence Category: None, Client Version: (null), Number of File Events: 0, Number of IPS Events: 0, TCP Flags: 0x0, NetBIOS Domain: (null), Initiator Packets: 15, Responder Packets: 17, Initiator Bytes: 4786, Responder Bytes: 9705, Context: unknown {TCP} 172.20.7.90:57535 -> 10.0.0.89:443
Are these all the fields you want?
extract_date: Jan 15 14:09:43
host: 172.20.1.62
date: Jan 15
time: 18:09:49
DSO-TW-ASA-Prim-SFR SFIMS: [Primary Detection Engine (252a23cc-7196-11e4-8256-c709c2db90d1)][FMPA - Main Policy]
I'm not sure the field name or value is correct on the one above
connection_type: End
user: annb
client: SSL client
application_protocol: HTTPS
web_app: Unknown
access_control_rule_name: Malware | URL Monitor
access_control_rule_action: Allow
access_control_rule_reasons: Unknown
url_category: Government
url_reputation: High risk
url: https://fmpa.com
interface_ingress: Internet
interface_egress: RouterNet
security_zone_ingress: N/A
security_zone_egress: N/A
security_intelligence_matching_ip: None
security_intelligence_category: None
client_version: (null)
number_of_file_events: 0
number_of_ips_events: 0
tcp_flags: 0x0
netbios_domain: (null)
initiator_packets: 15
responder_packets: 17
initiator_bytes: 4786
responder_bytes: 9705
context: unknown {TCP} 172.20.7.90:57535 -> 10.0.0.89:443
I'll help you out with the RegEx on this one, but you'll really be better off if you can start picking up a bit of the RegEx syntax so you can use RegEx101.com or other tools. Would it help if I explained the syntax I'm using in the RegEx in my previous response?
Yes, those are the fields.
Absolutely, my goal is to be able to create them myself. In the meantime, are you able to point me to a tutorial on building these expressions?
All of the RegEx resources that @lguinn mentions are fantastic. I also really like the O'Reilly RegEx book: http://shop.oreilly.com/product/9780596528126.do
As for the RegEx I provided you, it is a fairly repetitive expression, so I'll break it down into the basic parts:
(?P<EXTRACT_DATE>\w+\s\d+\s\d+:\d+:\d+)
This is a named capturing group; (?P begins the group.
Anything we put between the less-than < and greater-than > signs will become the name of the extraction. In this example, our extraction will be called EXTRACT_DATE
Everything that comes after that is what we want to capture.
\w - match word characters (letters, numbers, or _ )
+ - match one or more (in this case, capture one or more word characters)
\s - match a whitespace character
\d - match a digit character
We'll capture everything that matches our RegEx values above until we reach the end of the capturing group )
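As a quick way to see the pieces above working together, here is a small Python sketch. Python's `re` module accepts the same `(?P<name>...)` syntax; the input string is just the start of the sample event.

```python
import re

# \w+ -> "Jan", \s -> space, \d+ -> "15", \s -> space, \d+:\d+:\d+ -> "14:09:43"
pattern = re.compile(r"(?P<EXTRACT_DATE>\w+\s\d+\s\d+:\d+:\d+)")

match = pattern.search("Jan 15 14:09:43 172.20.1.62")
extract_date = match.group("EXTRACT_DATE")  # value captured by the named group
start, end = match.span("EXTRACT_DATE")     # character offsets of the capture

print(extract_date, start, end)
```

The group stops at the end of the time because the next character is a space followed by a digit group with dots, which no longer fits the `\d+:\d+:\d+` part of the pattern.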
This next expression is what I use almost entirely for the rest of the data:
(?P<evt_host>[^\s]+)
I'll break down the part that is different from the above example:
[^\s]+
Normally, anything between brackets [] tells RegEx to match any one of the characters inside those brackets. However, we've put a caret ^ as the first character inside the brackets, which tells RegEx to match anything except what's inside the brackets. So this says that we should match any character except a whitespace character. The plus sign + after the brackets tells RegEx to match one or more such characters. What we end up with is a capture group that matches everything until we encounter a space or, in other parts of my RegEx below where I use [^,]+ instead, a comma.
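Here is a short Python sketch of the two negated character classes side by side. The input strings are fragments of the sample event, and the variable names are only for illustration.

```python
import re

# [^\s]+ : one or more characters that are NOT whitespace -> stops at the first space
host_match = re.match(r"(?P<evt_host>[^\s]+)", "172.20.1.62 Jan 15 18:09:49")
evt_host = host_match.group("evt_host")

# [^,]+ : one or more characters that are NOT a comma -> spaces are allowed,
# so a value such as "SSL client" is captured whole
client_match = re.match(r"(?P<client>[^,]+)", "SSL client, Application Protocol: HTTPS")
client = client_match.group("client")

print(repr(evt_host), repr(client))
```

This is why the comma version is the better fit for the key/value part of the event: several values there contain spaces, but none of them contain a comma.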
Please let me know if there is anything I can clarify further!
Okay, the following RegEx:
(?P<EXTRACT_DATE>\w+\s\d+\s\d+:\d+:\d+)\s(?P<evt_host>[^\s]+)\s(?P<evt_date>\w+\s\d+)\s(?P<evt_time>[^\s]+)\sDSO-TW-ASA-Prim-SFR\sSFIMS:\s(?P<DSO_TW_ASA_Prim_SFR_SFIMS>.*)\sConnection\sType:\s(?P<connection_type>[^,]+),\sUser:\s(?P<user>[^,]+),\sClient:\s(?P<client>[^,]+),\sApplication\sProtocol:\s(?P<protocol>[^,]+),\sWeb\sApp:\s(?P<web_app>[^,]+),\sAccess\sControl\sRule\sName:\s(?P<ac_rule_name>[^,]+),\sAccess\sControl\sRule\sAction:\s(?P<ac_rule_action>[^,]+),\sAccess\sControl\sRule\sReasons:\s(?P<ac_rule_reasons>[^,]+),\sURL\sCategory:\s(?P<url_category>[^,]+),\sURL\sReputation:\s(?P<url_reputation>[^,]+),\sURL:\s(?P<url>[^,]+),\sInterface\sIngress:\s(?P<if_ingress>[^,]+),\sInterface\sEgress:\s(?P<if_egress>[^,]+),\sSecurity\sZone\sIngress:\s(?P<sz_ingress>[^,]+),\sSecurity\sZone\sEgress:\s(?P<sz_egress>[^,]+),\sSecurity\sIntelligence\sMatching\sIP:\s(?P<si_matching_ip>[^,]+),\sSecurity\sIntelligence\sCategory:\s(?P<si_category>[^,]+),\sClient\sVersion:\s(?P<client_version>[^,]+),\sNumber\sof\sFile\sEvents:\s(?P<num_file_events>[^,]+),\sNumber\sof\sIPS\sEvents:\s(?P<num_ips_events>[^,]+),\sTCP\sFlags:\s(?P<tcp_flags>[^,]+),\sNetBIOS\sDomain:\s(?P<netbios_domain>[^,]+),\sInitiator\sPackets:\s(?P<init_packets>[^,]+),\sResponder\sPackets:\s(?P<resp_packets>[^,]+),\sInitiator\sBytes:\s(?P<init_bytes>[^,]+),\sResponder\sBytes:\s(?P<resp_bytes>[^,]+),\sContext:\s(?P<context>.*)
will give you the following fields:
EXTRACT_DATE [0-15] Jan 15 14:09:43
evt_host [16-27] 172.20.1.62
evt_date [28-34] Jan 15
evt_time [35-43] 18:09:49
DSO_TW_ASA_Prim_SFR_SFIMS [71-156] [Primary Detection Engine (252a23cc-7196-11e4-8256-c709c2db90d1)][FMPA - Main Policy]
connection_type [174-177] End
user [185-189] annb
client [199-209] SSL client
protocol [233-238] HTTPS
web_app [249-256] Unknown
ac_rule_name [284-305] Malware | URL Monitor
ac_rule_action [335-340] Allow
ac_rule_reasons [371-378] Unknown
url_category [394-404] Government
url_reputation [422-431] High risk
url [438-454] https://fmpa.com
if_ingress [475-483] Internet
if_egress [503-512] RouterNet
sz_ingress [537-540] N/A
sz_egress [564-567] N/A
si_matching_ip [604-608] None
si_category [642-646] None
client_version [664-670] (null)
num_file_events [695-696] 0
num_ips_events [720-721] 0
tcp_flags [734-737] 0x0
netbios_domain [755-761] (null)
init_packets [782-784] 15
resp_packets [805-807] 17
init_bytes [826-830] 4786
resp_bytes [849-853] 9705
context [864-912] unknown {TCP} 172.20.7.90:57535 -> 10.0.0.89:443
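Because the key/value part of the expression repeats the same shape (label, colon, space, `[^,]+`, comma), it can also be generated rather than typed by hand. The following Python sketch illustrates that idea under stated assumptions: `FIELDS` and `build_pattern` are hypothetical names, only four of the fields are listed, and the result is meant for experimenting in a tool, not as a drop-in Splunk configuration.

```python
import re

# (label as it appears in the event, field name to extract); a subset only
FIELDS = [
    ("Connection Type", "connection_type"),
    ("User", "user"),
    ("Client", "client"),
    ("Application Protocol", "protocol"),
]


def build_pattern(fields):
    """Hypothetical helper: join one 'Label:\\s(?P<name>[^,]+)' piece per field."""
    parts = [rf"{re.escape(label)}:\s(?P<{name}>[^,]+)" for label, name in fields]
    # consecutive fields are separated by a comma and one whitespace character
    return re.compile(r",\s".join(parts))


pattern = build_pattern(FIELDS)
sample = ("Connection Type: End, User: annb, Client: SSL client, "
          "Application Protocol: HTTPS, Web App: Unknown")
extracted = pattern.search(sample).groupdict()
print(extracted)
```

Adding a field then means adding one pair to the list, which is easier to review than editing a single very long line.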
Best tutorial: Teach Yourself Regular Expressions in 10 Minutes by Ben Forta
It isn't language-specific, and although it will take more than 10 minutes, it is short and to the point.
Use a tool like RegEx101.com or another to practice the things shown in the book.
Online tutorials:
http://www.regexone.com
http://www.rexegg.com
Other tools (there are a zillion if you search for them...), which often have a tutorial component:
http://www.regexr.com
http://www.regexpal.com
http://www.regular-expressions.info
RegEx Buddy - a Windows-based tool that costs $ but many people love
For your data, I would not use the interactive field extractor in Splunk. Since the data follows a repetitive pattern, it will be easier to manually specify the extraction in the configuration files. Here is the manual page that you need: Create and maintain search-time field extractions through configuration files - scroll down to "Create advanced search-time field extractions with field transforms" to find the section that you need.
Here is a starting point:
props.conf
[yoursourcetype]
REPORT-extfields=extract_my_fields
transforms.conf
[extract_my_fields]
DELIMS = ", ", ": "
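To illustrate the delimiter-based idea outside Splunk, here is a minimal Python sketch that splits the key/value part of an event on ", " between pairs and ": " between key and value. This is not how Splunk implements DELIMS, and Splunk may interpret the delimiter strings differently (check the transforms.conf documentation before relying on it); the `split_key_values` helper and the shortened sample are assumptions made for illustration.

```python
def split_key_values(text, pair_delim=", ", kv_delim=": "):
    """Hypothetical helper: split 'Key: value, Key: value' text into a dict."""
    fields = {}
    for pair in text.split(pair_delim):
        # partition splits on the FIRST ": " only, so values that contain
        # colons (URLs, ip:port) stay intact
        key, sep, value = pair.partition(kv_delim)
        if sep:  # skip chunks that have no key/value delimiter
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


sample = ("Connection Type: End, User: annb, URL: https://fmpa.com, "
          "Context: unknown {TCP} 172.20.7.90:57535 -> 10.0.0.89:443")
kv = split_key_values(sample)
print(kv)
```

The header portion of the event (timestamps, host, and the bracketed policy text) does not follow the key/value shape, so it would still need its own expression.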
You're off to a good start. Just insert a comma after your first capturing group and it will match. Then carry on the same way for the rest of the fields.