I have some logs from a media server that are all formatted in a consistent way, making field extraction creation very easy. I have created the same group of field extractions numerous times because they stop working within 24hrs even without any change in the format of the logs. I have looked at properly tagged events and I have looked at the logs that were not properly tagged and they are identical. There is no reason that I can think of for these field extractions to only work for a short amount of time.
Looking at your example, you have this in your regex:
(?P<player>\w+\s+\d+\s+\w+)
However, your events have either "Roku 3" or "Roku 2 XS" for the player field. This regex matches "Roku 2 XS", but not "Roku 3", for lack of a third word.
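To see the mismatch outside Splunk, here is a sketch using Python's re module (Splunk uses PCRE, but the two engines agree on this fragment). The "optional third word" fix shown is one possible correction, not part of the original extraction:

```python
import re

# Original player fragment: requires word, number, word -- three tokens.
orig = re.compile(r"on (?P<player>\w+\s+\d+\s+\w+) for")
# Possible fix: make the third word optional so "Roku 3" also matches.
fixed = re.compile(r"on (?P<player>\w+\s+\d+(?:\s+\w+)?) for")

for text in ("on Roku 2 XS for", "on Roku 3 for"):
    m_orig = orig.search(text)
    m_fixed = fixed.search(text)
    print(text, "->",
          m_orig and m_orig.group("player"),   # "Roku 2 XS", then None
          "/",
          m_fixed and m_fixed.group("player"))  # "Roku 2 XS", then "Roku 3"
```

With the original pattern, "on Roku 3 for" finds no third word before the literal " for", so the whole match fails; making the last word optional lets both player values through.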
Very good. Thanks. I noticed there were a couple more regex issues as well. Evidently, when you do multiple extractions in one rule with the field extractions tool, they all have to be accurate or none of them work individually.
That is expected behaviour, a regex can only extract fields if it matches the string.
Agreed. What interested me is that, if you create a single field extraction with multiple fields and any one of them does not match, then none of them match. I get it now that I think about it: it is one long regex. At first I was looking at it as a set of individual regexes, but it hit me that it is not. I might have been better off creating one-off field extractions instead of doing them all in one extraction rule, but I had never created them all in one rule before and it was something I wanted to try. Thanks again for all your help!
"One long regex" and "many short regexes" are fundamentally different things.
Depending on your data, existence (or not) of one field may influence the interpretation of other fields, so you may get wrong extractions if you simply chop up the large regex into smaller regexes in some scenarios. In such a case it may be necessary to have several long regexes, where each understands only one way your data works.
In other cases you can have shorter, more modular regexes to avoid overlapping definitions or, as you experienced, subtle errors.
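The all-or-nothing behaviour of one long regex versus modular ones can be shown with a toy example (a Python sketch with made-up field names, not the poster's actual data):

```python
import re

# A line where the player value only has two words.
line = "user=amy player=Roku 3"

# One long regex: every named group must match or nothing is extracted.
combined = re.compile(r"user=(?P<user>\w+) player=(?P<player>\w+\s+\d+\s+\w+)")
# Modular regexes: each field stands or falls on its own.
user_re = re.compile(r"user=(?P<user>\w+)")
player_re = re.compile(r"player=(?P<player>\w+\s+\d+\s+\w+)")

print(combined.search(line))              # None -- player part fails, user is lost too
print(user_re.search(line).group("user")) # amy -- still extracted on its own
print(player_re.search(line))             # None -- only this field is lost
```

With the combined pattern, the broken player group drags the perfectly matchable user field down with it; the modular patterns lose only the field whose regex is wrong.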
Does the sourcetype name remain the same for events over time? That is, is the source name for events that occur today (when extraction is not working) the same as the source name for events that occurred yesterday (when extraction was working)?
Yes. The source type and the source name both remain the same.
Hmmmm. The only other suggestion I can make (other than getting a sample of the data and the REGEX you are using and helping debug, which I am happy to help with, BTW) is to ask where the extractions are being stored. Specifically, are they in the props.conf of the app in which you are executing the search?
The field extraction is applied only to the search app. I am playing around with a regex tester online to see if I can figure out why the ones that don't work are messed up.
Do extractions work for events older than 24 hours? Or do they just not work at all for any event, no matter their timestamp?
They appear to work only for older events, not recent ones. I imagine that might be an issue with my regex, but I don't know exactly what is off.
Are you defining extractions against sourcetype or source? Are you able to provide the configuration you have defined in your props.conf?
pms_watched : EXTRACT-user,title,transcode,release_year,content_rating,player,play_length,watched_percentage,client_ip
^(?:[^:\n]*:){3}\s+(?P<user>[^ ]+) Watched: (?P<title>[^\[]+)\[(?P<transcode>\w+)[^ \n]* \[(?P<release_year>[^\]]+)[^ \n]* \[(?P<content_rating>\w+\-\d+)\]\s+\w+\s+(?P<player>\w+\s+\d+\s+\w+)\s+\w+\s+(?P<play_length>\d+\s+[a-z]+\s+)\[(?P<watched_percentage>\d+%)\]\s+(?P<client_ip>.+)
I built against the source type using the field extraction tool in the web GUI.
Example of logs that did NOT extract properly:
Mon Aug 17 00:14:14 2015: pvols1979 Watched: CSI: Crime Scene Investigation - Gum Drops - s06e05 [T] [2005] [TV-14] on Roku 3 for 48 minutes [100%] 192.168.1.175
Sat Aug 15 22:21:14 2015: Amy Watched: NCIS: New Orleans - The List - s01e18 [T] [2015] [TV-PG] on Roku 2 XS for 42 minutes [100%] 192.168.1.134
Examples of logs that did extract properly:
Sat Aug 15 21:29:14 2015: Amy Watched: Rizzoli & Isles - Nice to Meet You, Dr. Isles - s06e08 [T] [2015] [TV-14] on Roku 2 XS for 42 minutes [100%] 192.168.1.134
Sat Aug 15 20:44:14 2015: Amy Watched: Rizzoli & Isles - A Bad Seed Grows - s06e07 [T] [2015] [TV-14] on Roku 2 XS for 42 minutes [100%] 192.168.1.134
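A quick way to debug this locally is to run the posted regex over the sample lines with Python's re module (a sketch; Splunk's PCRE and Python's re differ in places, but they agree on this pattern). Note that the two failing lines trip over different groups: "Roku 3" has no third word for the player group, and "TV-PG" has no digits for content_rating's \d+:

```python
import re

# The regex from the EXTRACT rule, split across lines for readability.
pattern = re.compile(
    r"^(?:[^:\n]*:){3}\s+(?P<user>[^ ]+) Watched: (?P<title>[^\[]+)"
    r"\[(?P<transcode>\w+)[^ \n]* \[(?P<release_year>[^\]]+)[^ \n]* "
    r"\[(?P<content_rating>\w+\-\d+)\]\s+\w+\s+(?P<player>\w+\s+\d+\s+\w+)"
    r"\s+\w+\s+(?P<play_length>\d+\s+[a-z]+\s+)\[(?P<watched_percentage>\d+%)\]"
    r"\s+(?P<client_ip>.+)")

lines = [
    # Did NOT extract: "Roku 3" breaks the three-word player group.
    "Mon Aug 17 00:14:14 2015: pvols1979 Watched: CSI: Crime Scene Investigation"
    " - Gum Drops - s06e05 [T] [2005] [TV-14] on Roku 3 for 48 minutes [100%]"
    " 192.168.1.175",
    # Did NOT extract: "TV-PG" breaks content_rating's \w+\-\d+.
    "Sat Aug 15 22:21:14 2015: Amy Watched: NCIS: New Orleans - The List -"
    " s01e18 [T] [2015] [TV-PG] on Roku 2 XS for 42 minutes [100%] 192.168.1.134",
    # DID extract: "Roku 2 XS" and "TV-14" both fit the pattern.
    "Sat Aug 15 21:29:14 2015: Amy Watched: Rizzoli & Isles - Nice to Meet You,"
    " Dr. Isles - s06e08 [T] [2015] [TV-14] on Roku 2 XS for 42 minutes [100%]"
    " 192.168.1.134",
]

for line in lines:
    m = pattern.search(line)
    print("match" if m else "no match")  # no match, no match, match
```

This also explains why the breakage looked time-based: the extraction works until the first event with a player or rating value the regex did not anticipate, which is a data problem rather than a configuration problem.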
You mentioned re-creating the extractions - how, where?
I used the tool to create the field extractions. By recreating, I mean that I delete the extraction and build it again. It works for a day and then just stops working.
Checked _internal for errors?
I don't see anything in _internal that seems to relate.
Does the field extraction config disappear?