Solved: Re: regex file names from path and/or url

marquiselee · ‎01-10-2013

I need to extract filenames so I can transact across many logs of different types and such.

Some logs have full URLs - http://www.test1.com/43/test.txt

Some logs have only paths - /43/test.txt

Some logs are standard-looking logs, and some are actually XML data dumps that were indexed as a "standard log" - <url>http://www.test1.com/43/test.txt</url>

Sometimes the whole path may be enclosed in parentheses or quotes too - "/43/test.txt"

The basic principle is that I need to extract file names (filename.ext).

I don't have access to the file system and can only use "Extract Fields" in the web interface.

Any thoughts?

alacercogitatus · ‎01-10-2013

You can use the rex command. This pattern matches a slash, then any run of characters other than a period, then a period followed by a three-character \w extension.

your_search | rex field=_raw "/(?<filename>[^\.]*\.\w{3})"
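
To see what this pattern captures outside of Splunk, here is a minimal Python sketch run against the two sample strings from the question. It assumes Python's re module behaves like Splunk's PCRE for this pattern; the only change is that Python spells the named group (?P<filename>...) instead of (?<filename>...). The helper name extract is made up for illustration.

```python
import re

# Same pattern as the rex above; Python needs (?P<name>...) for named groups.
PATTERN = re.compile(r"/(?P<filename>[^\.]*\.\w{3})")

def extract(text):
    """Return the captured filename, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.group("filename") if match else None

# Bare path: the capture starts after the FIRST slash, so the directory stays in.
print(extract("/43/test.txt"))                      # 43/test.txt

# Full URL: the first period is in the host name, so the match stops there.
print(extract("http://www.test1.com/43/test.txt"))  # /www.tes
```

Because [^\.]* is allowed to cross slashes, the capture begins at the first slash in the event rather than the last one, which is why it works best on short paths with no earlier periods.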

marquiselee · ‎01-15-2013

Thanks, I finally got something to work using your rex as the foundation and by specifying extensions.

[^/]/(?<filename>[\w-]+\.(?:[A-Z]{3}|mpeg|mpg|mp4|mov|ism|ismv|isma|ts|flv|sami|scc|vtt|ttml))
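
For anyone adapting this, here is a rough Python sketch of how the extension-list pattern behaves on strings from this thread. It rests on two assumptions: that the capture group is named filename (as in the rex it was based on), and that the dot before the extension is meant as a literal period. Python writes the group as (?P<filename>...). The name clip.ismv is a made-up example, not from the logs.

```python
import re

# Assumed reconstruction: group named "filename", literal dot before extension.
PATTERN = re.compile(
    r"[^/]/(?P<filename>[\w-]+\.(?:[A-Z]{3}|mpeg|mpg|mp4|mov|ism|ismv|isma"
    r"|ts|flv|sami|scc|vtt|ttml))"
)

def extract(text):
    """Return the captured filename, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.group("filename") if match else None

# Bracketed path from the sample events.
print(extract("Unable to open input file [/pac/output/lcvtv/testmedia.mov] : 4110"))
# testmedia.mov

# file:// URL: requiring a non-slash before the slash skips the host after "//".
print(extract("sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov"))
# TestProcessFile.mov

# Extensions outside the list (and lowercase 3-letter ones) are not captured.
print(extract("/mnt/mezzanine/mezzanine/provider/business.doc"))  # None

# Alternation order matters: "ism" is tried before "ismv" (hypothetical name).
print(extract("/a/clip.ismv"))  # clip.ism
```

If the longer extensions matter, listing ismv and isma before ism, or requiring a terminating character after the extension, would likely avoid the truncated match shown in the last case.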

alacercogitatus · ‎01-11-2013

your_search | rex field=_raw "/(?<filename>[^\./]*\.\w{3,4})[\s'\"<>\(\)]"

This should grab everything after the last slash, with an extension of 3 or 4 \w characters, followed by one of the characters you described earlier.
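
A quick way to check this pattern against the sample events is a small Python sketch like the one below. It assumes Python's re module matches Splunk's PCRE here, with the named group rewritten as (?P<filename>...). The second pattern, LOOSER, is not from this thread: it is a hypothetical variation that swaps the trailing character class for a lookahead that also accepts a closing square bracket or the end of the event.

```python
import re

# Pattern from the post, with Python's named-group spelling.
PATTERN = re.compile(r"""/(?P<filename>[^\./]*\.\w{3,4})[\s'\"<>\(\)]""")

# Hypothetical variation (an assumption, not from the thread): also accept
# "]" or end of string after the extension, without consuming the character.
LOOSER = re.compile(r"""/(?P<filename>[^\./]*\.\w{3,4})(?=[\s'\"<>\(\)\]]|$)""")

def extract(pattern, text):
    """Return the captured filename, or None when nothing matches."""
    match = pattern.search(text)
    return match.group("filename") if match else None

quoted = '"/43/test.txt"'
xml = "<url>http://www.test1.com/43/test.txt</url>"
bracketed = "Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951"
at_end = "sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov"

print(extract(PATTERN, quoted))     # test.txt
print(extract(PATTERN, xml))        # test.txt
print(extract(PATTERN, bracketed))  # None: "]" is not in the trailing class
print(extract(PATTERN, at_end))     # None: nothing follows the extension
print(extract(LOOSER, bracketed))   # TestProcessFile.mov
print(extract(LOOSER, at_end))      # TestProcessFile.mov
```

Excluding the slash inside the capture is what keeps directories and host names out of the result. The two None cases come from sample events posted in this thread, so the list of terminating characters may need a closing bracket and an end-of-line case added.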

Ayn · ‎01-11-2013

Look, if you can't find a pattern that uniquely identifies the data you're after, then neither can Splunk. So what you need is simply to go through all the different encountered variants of filenames in your logs and find a common pattern that catches them all - or, failing that, a set of different patterns that catch them all separately.

marquiselee · ‎01-10-2013

2013-01-10 16:01:55,411 INFO [1357833716802] [775] ts=2013-01-10T21:01:55Z aid=http://access.auth.sp1.internal.net/data/Account/2189541263 id=1357833716802 t=Encoder.Task.CreateProfileJob rt=77 c=1 tm="file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov

marquiselee · ‎01-10-2013

No pattern other than that it's obviously a file name: path/file.ext

Here are a few examples, all from the same sourcetype:

2013-01-10 16:02:27,033 DEBUG [1357833733497] [775] 404 : FAILURE : 100% : Exception encountered in plugin [Encoder.Task.CreateProfileJob]! Plugin Terminated. Encode operations failed: 10102013-01-10T16:02:15-05:00Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951

2013-01-10 16:02:17,601 DEBUG [1357833739023] [742] --- input #0: sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov

lguinn2 · ‎01-10-2013

Is there a pattern within the event that can be used to identify the file name? What does an actual event look like?

marquiselee · ‎01-10-2013

Thanks, this is heading in the right direction, but the paths are much longer than my example and are not uniform in the directory structure...

/mnt/mezzanine/mezzanine/provider/business.doc
or
/pac/output/brand/media.mov
or
http://

Also, some files have 4-letter extensions.

If it helps, the file extension will always be followed by either a space or one of these characters: ' " < > ( )

marquiselee · ‎01-10-2013

test.txt was an example. There are thousands of files that are uniquely named but appear in different logs. The file names themselves aren't what's important, but in many cases they are the only thing I'll be able to join on.

lguinn2 · ‎01-10-2013

You could do this in your search:

source=*test.txt

and it will find events from the test.txt file, whether or not it has a URL or a path or nothing at all.
If you really need a regular expression, you can even do that with the regex command.

yoursearchhere | regex "yourregexhere"

I don't think you need to do any field extractions at all. But perhaps I misunderstood the question. If this doesn't work, can you post a few lines of your data?

Are you talking about the actual name of the log file? If yes, then there is already a field extracted. Its name is source. You don't need to do a "join" - the first search will work.

Are you talking about a file name that is contained within your event data? If yes, then I need to see some of the data to help you with the field extraction.

Finally, do you want to summarize the data based on the file name? If yes, then this should work:

yoursearchhere source=*test.txt
| rex field=source "/(?<filename>[^/]+)$"
| stats count by filename

Of course, you might need to modify the stats command and the initial search, etc.
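
As a rough illustration of the extract-then-count idea behind that search, here is a Python sketch. It is only an analogy for rex followed by stats count by filename, not how Splunk runs it. It uses the pattern /([^/]+)$, which keeps only the text after the last slash, and the source paths are invented for the example.

```python
import re
from collections import Counter

# Keep only the last path component: one or more non-slash characters
# between the final slash and the end of the string.
LAST_COMPONENT = re.compile(r"/(?P<filename>[^/]+)$")

def count_by_filename(sources):
    """Count how many source paths end in each file name."""
    counts = Counter()
    for source in sources:
        match = LAST_COMPONENT.search(source)
        if match:
            counts[match.group("filename")] += 1
    return counts

# Hypothetical source values, made up for this sketch.
sources = [
    "/var/log/app/test.txt",
    "/opt/other/test.txt",
    "/var/log/app/other.log",
]

counts = count_by_filename(sources)
print(counts["test.txt"])   # 2
print(counts["other.log"])  # 1
```

Grouping on the last path component is what lets the same file name be counted together even when it shows up under different directories.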

marquiselee · ‎01-10-2013

The source is not the file name I'm trying to extract. The various logs (sources) contain references to hundreds of thousands of files, so a log line may look like this...

"2013-01-10 11:24:17,345 DEBUG [1357817043844] [649] 439 : FAILURE : 100% : Exception encountered in plugin [Encoder.Task.CreateJob]! Plugin Terminated. Encode operations failed: 10102013-01-10T11:24:05-05:00Unable to open input file [/pac/output/lcvtv/testmedia.mov] : 4110"
