Hi,
I am trying to extract some fields which are generally bound by other strings (eg Some Text 1 Some Text 2). I have a situation where a field may or may not have anything following it.
For example, with this data set :
1 Some Text 1 <my field 1> Some Text 2
2 Some Text 1 <my field 1>",
3 Some Text 1 <my field 1> Some Text 2
4 Some Text 1 <my field 1> Some Text 2
5 Some Text 1 <my field 1>",
This regex partly works, in that it correctly extracts items 1, 3, and 4:
Some Text 1\s+(?P<my field 1>.+)\s(Some Text 2|\",)
This regex partly works, in that it correctly extracts items 2 and 5, but it captures the entire remainder of the line for items 1, 3, and 4.
Some Text 1\s+(?P<my field 1>.+)(Some Text 2|\",)
The difference is the "\s". I can't seem to include that in the match group, only before it.
I am sure I am missing something obvious but can't seem to see it. Any help much appreciated.
Thankyou.
Hi rhyjones,
Are you trying to extract these fields at search time with the rex command, or at index time with transforms?
For a search query, you can try the regex below with the rex command:
|rex field=FieldName "(?:Some Text 1\s+)(?P<myfield1>.+)(?=\s+Some Text 2|\",)"
Ensure you have specified field=FieldName if your event data is not coming in _raw field, where FieldName is the name of the column/field in which the string to be extracted is present.
So effectively, I can get it running correctly with either "match" by themselves, but if I put them in a non-capturing match group, only the second match is "hit". That means items that are at the end of the line already are correctly returned, but items that have "Some Text 2" are actually captured all the way until the ", combination is matched.
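The behaviour described above can be reproduced outside Splunk. The following is only a sketch: it uses Python's re module rather than Splunk's PCRE engine (greedy quantifiers backtrack the same way in both), and the sample record is made up for illustration. It suggests that a lookahead on its own does not stop a greedy .+ from running on to the ", at the end of the line.

```python
import re

# Hypothetical record: the field is followed by "Some Text 2",
# and the line ends with the ", combination.
record = 'Some Text 1 alpha beta Some Text 2 trailing data",'

# Greedy .+ followed by a lookahead with two alternatives.
lookahead_greedy = re.compile(
    r'(?:Some Text 1\s+)(?P<myfield1>.+)(?=\s+Some Text 2|",)'
)

# .+ first consumes the whole line, then backtracks one character at a
# time until the lookahead succeeds. The first success is at the final ",
captured = lookahead_greedy.search(record).group("myfield1")
print(captured)
```

The captured value here runs past "Some Text 2", which matches the symptom reported in this thread.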
Hi jincy_18,
I did some more experimenting and unfortunately have the same issue. I can either extract "myfield1" when followed by ",
or I can extract "myfield1" when followed by a space then a "Some Text 2".
If I try to have both in a match group I get the one followed by ", extracted correctly, and all the other rows extract until they get to a ", combination.
I might try a different tack.
Thanks again.
Hi rhys,
Have you checked if the space characters are actually spaces or tabs?
Also, in the sample you provided, " Some Text 1 Some Text 2", is " Some Text 1 " always present and always identical? Likewise, whenever "Some Text 2" is present, is it always the same?
Hi jincy_18,
Excellent question.
"Some Text 1" is always there. This works for records that do have text following the extracted field:
Some Text 1\s+(?P<my field 1>.+)\sSome Text 2
This works for records that do not have text following the extracted field:
Some Text 1\s+(?P<my field 1>.+)\",
This does not work
Some Text 1\s+(?P<my field 1>.+)(?:\sSome Text 2|\",)
This last one returns correct extracts for records that do not have text following the extracted field. For records that do have text following the extracted field it returns all the following text up to the next instance of the ", combination rather than stopping before the "Some Text 2" literal string.
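To make the three cases above concrete, here is a small sketch in Python's re module (an assumption; Splunk itself uses PCRE, but the greedy behaviour is the same). The two records are invented samples, and the group name is written without spaces as myfield1.

```python
import re

# Hypothetical sample records.
records = [
    'Some Text 1 alpha beta Some Text 2 trailing data",',  # text follows the field
    'Some Text 1 gamma delta",',                            # nothing follows the field
]

# Each terminator on its own gives the expected capture.
only_text2 = re.compile(r'Some Text 1\s+(?P<myfield1>.+)\sSome Text 2')
only_quote = re.compile(r'Some Text 1\s+(?P<myfield1>.+)",')

text2_result = only_text2.search(records[0]).group("myfield1")
quote_result = only_quote.search(records[1]).group("myfield1")

# Combined in one alternation, the greedy .+ backtracks from the end of
# the line, so the ", alternative is reached before "Some Text 2".
combined = re.compile(r'Some Text 1\s+(?P<myfield1>.+)(?:\sSome Text 2|",)')
combined_results = [combined.search(r).group("myfield1") for r in records]

print(text2_result)
print(quote_result)
print(combined_results)
```

The alternation itself is fine; the over-capture comes from .+ being greedy, so whichever terminator sits furthest to the right wins.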
Hope that makes sense.
What about:
Some Text 1\s+(?P<my field 1>.+?)(?:\sSome Text 2|\",)
Making the .+ a lazy match ( .+? ) will help it to not include Some Text 2 as part of the match.
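A quick way to check the lazy version is the sketch below, again in Python's re module rather than Splunk, with made-up sample records and a group name without spaces (myfield1). A lazy .+? grows one character at a time and stops at the first position where either terminator matches.

```python
import re

# Hypothetical sample records.
records = [
    'Some Text 1 alpha beta Some Text 2 trailing data",',  # text follows the field
    'Some Text 1 gamma delta",',                            # nothing follows the field
]

# Lazy quantifier: take as little as possible before a terminator.
lazy = re.compile(r'Some Text 1\s+(?P<myfield1>.+?)(?:\sSome Text 2|",)')

lazy_results = [lazy.search(r).group("myfield1") for r in records]
print(lazy_results)
```

Because the lazy match stops at the nearest terminator, spaces and special characters inside the field are still captured as long as they do not themselves contain one of the terminators.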
cpetterborg, that was the missing bit !! Thankyou !
This now appears to be pulling the field in correctly in both cases.
Some Text 1\s+(?P
Thank you both for all your assistance. Very much appreciated!
Thankyou jincy_18. I will have a go when I get to the office tomorrow.
I was experimenting using the rex command, but mostly in the field extraction wizard. Effectively I am only trying to extract "my field 1" and I am identifying it based on the fact it is preceded by the literal string "Some Text 1" and a space, and followed immediately by either "Some Text 2" OR the ", combination.
I discovered in another extract I was doing that, in events where the field was immediately followed by the combination
","text3
I had to use the following regex :
Some Text 1\s+(?P<my field 1>.+)\.{7}text3
This kind of made me think I had a Unicode issue.
Thankyou for the hint. I'll check it out tomorrow.
I'm a bit confused by what you want in the end. Is this what you want to see:
Spot on. 5 Matches regardless of whether there is a string following, or a ", following.
That construct does not appear to be working in Splunk (or in my dataset). For example, if I put the \s inside the match brackets then it seems to be ignored and that side of the match fails.
I don't know if you noticed, but the name I used in the capture group doesn't have spaces. That is a requirement: no spaces in capture group names. I don't know if that might be causing things to not work for you. You could also just try a space character instead of a \s. I'm not sure if either of those will help, but they are worth a try.
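The group-name restriction is easy to confirm in Python's re module, which rejects names containing spaces. This is only a sketch: I am assuming Splunk's PCRE engine is similarly strict, and the helper name compiles is made up for illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the pattern is accepted by Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Same pattern, differing only in the capture group name.
with_spaces = compiles(r'Some Text 1\s+(?P<my field 1>.+?)(?:\sSome Text 2|",)')
without_spaces = compiles(r'Some Text 1\s+(?P<myfield1>.+?)(?:\sSome Text 2|",)')

print(with_spaces, without_spaces)
```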
Thankyou.
Yes, I discovered the requirement for no spaces (apologies, my "sample" didn't convey that). I did play around with just using the space character too. I think I will go home and start tomorrow with fresh eyes!
Thankyou for the suggestions. You have started me on a couple of new paths of testing so much appreciated. I'll update here if I find a solution.
I am partly wondering if the ".+" may be part of the issue. Given that the content of the field can be varied and can contain spaces and special characters, I am not sure how to get around that.