Regexes and Windows paths in inputs.conf and props.conf

mfrost8
Builder

I'm confused about the behavior of regexes in inputs.conf and props.conf when using Windows directory paths, particularly the use of '\' as an escape character, or when trying to say something like '\d' for digits.

That is, in props.conf, I could easily say

[source::.../*\.log]
sourcetype=my_log

on Unix/Linux, but if I did that on a Windows box, I'm not sure what '\.' matches. Is it a literal dot, or is it a directory separator followed by any character? If the latter, how do I escape under Windows?

Or something like the following in inputs.conf on a Windows platform

[monitor://D:\Program Files\Foo\*\.log]

That's what I want to say, but I know that doesn't do what I want.

I see plenty of examples in the docs for inputs.conf and props.conf but nothing that really indicates how you would handle Windows paths differently.

I optimistically tried to use '/' as the Windows path separator and while Splunk added it to the list of directories to monitor, it would not select any files or directories until I switched it back to '\'.

Thanks

Lowell
Super Champion

Yeah, it's tricky at first. You can actually be at a disadvantage if you already know regular expressions, because while it seems familiar, there are some hidden gotchas and things that are hard to translate in your head.

First, let's point out the problem so we understand why we need a better solution. If you've done path matching with regexes before, then you know it's generally painful because both "\" and sometimes "/" can have special meaning. (The issues with "/" come more from common regex environments, such as sed's s/find/replace/ syntax, which perl also adopts. So while this is not a regex limitation, it can add some confusion because people get used to needing to write "\/".) Also, not everyone knows regular expressions. Most people are familiar with simple glob patterns, but they are not very powerful. (Most people could tell you that *.exe means files that end with .exe, for example.) So the guys at Splunk came up with a new and yet familiar path-matching syntax that is easy to understand yet still has nearly all of the cool regular expression functionality right under the surface.


Source matching and monitor style syntax are pretty similar but are not identical.

For source matching, you have two wildcard expressions: "*" means match a single directory or file name, and "..." means match anything (including across directory boundaries). A simple "." always means a literal period, which is good since periods appear quite frequently in file names. For monitor matching, you don't have the ability to use "..." since you generally want to monitor a specific group of files and don't want to recursively keep walking through a potentially massive directory structure.

On top of that, you can add a number of common regex features. (I believe that behind the scenes all of the pattern matching logic gets translated into regex anyways.)

Keep in mind that because certain characters have a special meaning, you can't use them in their normal regex way. Specifically, this is true of "*" and ".". For example, you don't want to say ".log.\d*" because of the "*". This gets translated into a regex along the lines of "\.log\.\d[^/\\]*", which will appear to work with the file name myapp.log.1 but will then mysteriously fail to match myapp.log and will actually match myapp.log.1-blah, which you probably don't want. Not being able to use "*" like this generally isn't a problem though. We can easily replace the "*" with a "+". So you can instead use the expression ".log(.\d+)?", which now successfully matches with or without a trailing digit extension and doesn't match any other weird paths.
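To see the difference concretely, here is a small Python check of the two expanded forms. It is only a sketch: it assumes the expansion rules described in this post ("*" becoming "[^/\\]*" and "." becoming "\."), assumes the pattern has to match the whole path, and uses Python's re module rather than whatever Splunk runs internally. The leading ".*" stands in for a "..." prefix, and the file names are made up.

```python
import re

# Assumed expansions, written by hand from the rules described in this post.
# The leading ".*" stands in for a "..." prefix on the source pattern.
star_version = r".*\.log\.\d[^/\\]*"   # from ".log.\d*"
plus_version = r".*\.log(\.\d+)?"      # from ".log(.\d+)?"

names = ["myapp.log", "myapp.log.1", "myapp.log.1-blah"]

def matching(regex):
    # Whole-path matching is an assumption of this sketch.
    return [n for n in names if re.fullmatch(regex, n)]

star_hits = matching(star_version)
plus_hits = matching(plus_version)
print(star_hits)  # misses myapp.log, accepts myapp.log.1-blah
print(plus_hits)  # accepts myapp.log and myapp.log.1 only
```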

Also, please remember that we are only talking about pattern matching in the [monitor://] and [batch://] stanzas in inputs.conf and the [source::] and [host::] stanzas in props.conf. There are many other places in the Splunk configurations that use normal (PCRE-style) regexes, such as white/black lists, field extractions, and so on. (Thanks to gkanapathy for pointing this out.)


I frequently deploy my props.conf files to both Windows and Unix boxes, so I try to always write my source matching patterns to work on both platforms. It's not pretty, but pattern matching hardly ever is. So instead of using a "/" or a "\", I use "[/\\]". It does look funny, especially since the stanza names are wrapped in square brackets too, but it works fine. (The syntax highlighting on this site doesn't like it too much though ;-). I've also seen other configs use (/|\\), which would work too. I chose the first option since (1) it's shorter, and (2) it's a faster regex, since we don't want to create a capture group, and writing (?:/|\\) just seems even more absurd.
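As a rough illustration of the three spellings, the following Python sketch checks that each one accepts both separators and shows which of them creates a capture group. It uses Python's re module as a stand-in, so it says nothing about actual speed inside Splunk.

```python
import re

char_class = re.compile(r"[/\\]")
capturing = re.compile(r"(/|\\)")
non_capturing = re.compile(r"(?:/|\\)")

separators = ["/", "\\"]

# Every spelling should accept both the Unix and the Windows separator.
all_match = all(
    rx.fullmatch(sep) is not None
    for rx in (char_class, capturing, non_capturing)
    for sep in separators
)

# Number of capture groups each spelling creates.
group_counts = (char_class.groups, capturing.groups, non_capturing.groups)
print(all_match, group_counts)
```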

So, starting with your example, we could write it like this:

[source::...[/\\]*.log]

Although there isn't much point in matching a slash at all in this case. Since all paths will have a slash somewhere anyway, it doesn't really add much value. So we could just as easily have written:

[source::....log]

Notice that there are 4 dots. The first three (...) mean match anything, and the last . matches a literal period.

So, say you wanted to match a more complicated path with "oracle" somewhere as a directory name, and the filename must be "sqlnet.log", you would end up with something like this:

 [source::...[/\\]oracle[/\\]...[/\\]sqlnet.log]
 sourcetype=oracle_sqlnet

If you have a log with an optional digit suffix (which is common with rotated log files), you could do this:

[source::...[/\\][Aa]pache[/\\]logs[/\\]access_log(.\d+)?]

Note that we are allowing for both an upper and lower case "A" in the word apache.

Here is another, more complex example that matches logs named emagent.log, emagentfetchlet.log, emias.log, and emoms.log:

[source::...[/\\][Oo]racle...[/\\]em(agent(fetchlet)?|ias|oms).log]
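A quick way to sanity-check a pattern like this is to expand it by hand and try it against a few paths. The sketch below does that in Python, assuming the expansion rules described in this post and whole-path matching. The directory prefixes and the emctl.log name are hypothetical, added only as test data.

```python
import re

# Hand-expanded form of the stanza pattern, assuming the rules in this post:
# "..." -> ".*" and "." -> "\."; the rest is passed through unchanged.
em_regex = re.compile(r".*[/\\][Oo]racle.*[/\\]em(agent(fetchlet)?|ias|oms)\.log")

# Hypothetical directory prefixes, one per platform.
prefixes = ["/opt/oracle/em/log/", "D:\\Oracle\\em\\log\\"]
file_names = ["emagent.log", "emagentfetchlet.log", "emias.log", "emoms.log", "emctl.log"]

# A name counts as matched only if it matches under both prefixes.
matched = {
    name: all(em_regex.fullmatch(prefix + name) is not None for prefix in prefixes)
    for name in file_names
}
print(matched)
```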


In terms of monitor stanzas, you simply use a "\" as the directory separator, just as you found out. And if you want to wildcard anything, you can do so with a "*". Here is an example of probably the most complicated monitor stanza I have. (Note that I never use the [/\\] thing in monitor stanzas, since they are always platform specific anyways. In fact, I'm not sure if you can use [0-9]+, for example.)

[monitor://$SPNK_WMHOME\MWS\server\default\logs\20*_*\(_full_|install).log]

The wildcards here are needed because my app writes log files in different directories based on the date. (One portion of the path is in the YYYY_MM_DD format, which is what the 20*_* is there for.) I'm only monitoring two specific log files in each folder: _full.log and install.log. You may also note that I'm using an environment variable as the base location of my application install, which changes on my different systems. This lets me reuse a common configuration on multiple Splunk installs by setting an environment variable on each server.


If you think in terms of regular expressions, then the following might help you translate in your head:

  1. Find and replace "..." with the regex ".*" (match anything)
  2. Find and replace "*" with "[^/\\]*" (match any non-directory-separator characters)
  3. Find and replace "." with "\." (a literal dot)
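The three steps above can be sketched in a few lines of Python. This is only an approximation of the mental model, not Splunk's actual implementation: the helper names are hypothetical, whole-path matching is assumed, and the rewrite is done in a single pass so that the "*" produced for "..." is not rewritten again by the "*" rule. It does not handle a "." that is already escaped or one inside a character class.

```python
import re

# One token per rule; "..." is listed first so it wins over a single ".".
_TOKEN = re.compile(r"\.\.\.|\*|\.")
_REWRITE = {"...": ".*", "*": r"[^/\\]*", ".": r"\."}

def source_pattern_to_regex(pattern):
    """Hypothetical helper: apply the three rewrite rules in one pass."""
    return _TOKEN.sub(lambda m: _REWRITE[m.group(0)], pattern)

def source_matches(pattern, path):
    """Hypothetical helper: whole-path match (an assumption of this sketch)."""
    return re.fullmatch(source_pattern_to_regex(pattern), path) is not None

print(source_pattern_to_regex("....log"))         # .*\.log
print(source_pattern_to_regex(r"...[/\\]*.log"))  # .*[/\\][^/\\]*\.log
```

For example, source_matches(r"...[/\\]oracle[/\\]...[/\\]sqlnet.log", r"C:\oracle\network\log\sqlnet.log") returns True under these assumptions, and so does the same call with "/u01/oracle/network/log/sqlnet.log".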

If you want to take a peek at how Splunk is translating your source matching rules into regexes, then you can take a look at one of your search.log files (e.g. $SPLUNK_HOME/var/run/splunk/dispatch/<job_sid>/search.log). Search down through the log file until you find lines with "PropertiesMapConfig - Expanded file pattern". These lines show you your patterns being expanded into regex form, and then you'll be even more appreciative that Splunk came up with a more path-friendly pattern matching syntax. (If someone knows an easier way to get to this info, please add a comment.)

Hope that helps you.

emiller42
Motivator

I did not find the above to work in my experience. Full regex is NOT compatible with [source::] stanzas in props.conf. According to the documentation, the only regex-like syntax valid is:

... recurses through directories until the match is met.
* matches anything but / 0 or more times.
| is equivalent to 'or'
( ) are used to limit scope of |.

see the following for details:
http://splunk-base.splunk.com/answers/71035/using-regex-in-source-stanzas-propsconf

dmaislin_splunk
Splunk Employee

Lowell,

This should make it into core props.conf documentation. Probably my favorite answer on splunkbase in two years.

Lowell
Super Champion

Is this your actual example, or something simplified for posting online? I ask because E:\Dir\(Web|web).config really only points to a single file on Windows, since file systems on Windows are case-insensitive. I'm not sure if that's your issue here or not.
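As a small illustration of the case-insensitivity point, Python's ntpath module (its model of Windows path rules, usable on any platform) normalizes both spellings to the same string. This only shows how Windows-style comparison treats the two names; it says nothing about how Splunk itself handles the stanza.

```python
import ntpath

# normcase lowercases the path and converts "/" to "\" under Windows rules.
upper = ntpath.normcase(r"E:\Dir\Web.config")
lower = ntpath.normcase(r"E:\Dir\web.config")

same_file_name = upper == lower
print(same_file_name)
```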

ftk
Motivator

I tried using a monitor stanza as per your example [monitor://E:\Dir(Web|web).config] but it doesn't work: 06-15-2010 16:02:08.821 WARN FilesystemChangeWatcher - error getting attributes of path "E:\Dir(Web|web).config": The filename, directory name, or volume label syntax is incorrect.

Lowell
Super Champion

gkanapathy, it looks like your formatting got screwed up on that last comment. Can you repost it?

gkanapathy
Splunk Employee

The other thing is that a / forward slash will work as a Windows path separator in an inputs.conf monitor stanza header, in that the file will get monitored by Splunk. But I would advise you not to use it, and use the \ backslash. Even if you do use it, I believe if you try to then match on it with a source stanza in props.conf, it will only match with a \.

gkanapathy
Splunk Employee

Note that all of this only applies to the stanza headers for monitor and batch stanzas in inputs.conf and the stanza headers for source (and host) stanzas in props.conf. Regexes in whitelists, transforms, extractions, etc., are straight PCRE regex.
