Splunk Search

Regex file names from path and/or URL

marquiselee
Path Finder

I need to extract file names so I can run transactions across many logs of different types.

Some logs have full URLs - http://www.test1.com/43/test.txt

Some logs have only paths - /43/test.txt

Some logs are standard-looking logs, and some are actually XML data dumps that were indexed as a "standard log" - <url>http://www.test1.com/43/test.txt</url>

Sometimes the whole path may be enclosed in parentheses or quotes too - "/43/test.txt"

The basic principle is that I need to extract file names (filename.ext).

I don't have access to the file system and can only use "Extract Fields" in the web interface.

Any thoughts?

alacercogitatus
SplunkTrust

You can use the rex command. This matches a slash, then any run of characters other than a period, then a period followed by a three-character \w extension.

your_search | rex field=_raw "/(?<filename>[^\.]*\.\w{3})"
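For anyone who wants to see what this pattern captures before trying it in Splunk, here is a minimal sketch in Python run against the sample strings from the question. It assumes Python's `re` module behaves like Splunk's PCRE for this pattern; the only change is that Python spells the named group `(?P<filename>...)`. The `extract` helper is a hypothetical name used just for this illustration.

```python
import re

# Python's re module spells named groups (?P<name>...), whereas Splunk's rex
# accepts (?<name>...); the rest of the pattern is unchanged.
PATTERN = re.compile(r"/(?P<filename>[^\.]*\.\w{3})")


def extract(line):
    """Return the captured 'filename' group, or None when nothing matches."""
    match = PATTERN.search(line)
    return match.group("filename") if match else None


# Sample inputs taken from the question.
samples = [
    "/43/test.txt",
    '"/43/test.txt"',
    "http://www.test1.com/43/test.txt",
]

for line in samples:
    print(repr(line), "->", extract(line))
```

On the plain and quoted paths the capture is `43/test.txt`, so the directory comes along with the file name. On the full URL the match starts at the first slash of `http://` and stops at the first period, giving `/www.tes`. Excluding slashes from the character class and allowing longer extensions addresses both.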


marquiselee
Path Finder

Thanks, I finally got something to work using your rex as the foundation and by specifying extensions.

[^/]/(?<filename>[\w-]+\.(?:[A-Z]{3}|mpeg|mpg|mp4|mov|ism|ismv|isma|ts|flv|sami|scc|vtt|ttml))
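One thing worth checking with an explicit extension list is the order of the alternatives, because the regex engine takes the first one that matches. The sketch below is a Python approximation, not the Splunk extraction itself. It assumes the capture group is named `filename` and that the dot before the extension is escaped. `clip.ismv` is a made-up file name, and `WITH_GUARD` is a hypothetical tweak rather than anything posted in this thread.

```python
import re

EXTENSIONS = "[A-Z]{3}|mpeg|mpg|mp4|mov|ism|ismv|isma|ts|flv|sami|scc|vtt|ttml"

# The pattern above (group name and escaped dot assumed), using Python's
# (?P<name>...) spelling for the named group.
AS_POSTED = re.compile(r"[^/]/(?P<filename>[\w-]+\.(?:" + EXTENSIONS + "))")

# Hypothetical tweak: (?!\w) forbids a word character right after the
# extension, so the engine keeps trying alternatives until the whole
# extension has been consumed.
WITH_GUARD = re.compile(r"[^/]/(?P<filename>[\w-]+\.(?:" + EXTENSIONS + r")(?!\w))")


def extract(pattern, line):
    """Return the captured 'filename' group, or None when nothing matches."""
    match = pattern.search(line)
    return match.group("filename") if match else None


for path in ["/pac/output/brand/media.mov", "/pac/output/brand/clip.ismv"]:
    print(path, "->", extract(AS_POSTED, path), "|", extract(WITH_GUARD, path))
```

Without the guard, `clip.ismv` is captured as `clip.ism` because `ism` appears before `ismv` in the list. Listing longer extensions first would be another way to get the same effect.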

alacercogitatus
SplunkTrust

your_search | rex field=_raw "/(?<filename>[^\./]*\.\w{3,4})[\s'\"<>\(\)]"

This should grab anything after the slash, with an extension 3 or 4 \w in length, followed by the characters you described earlier.
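A quick way to sanity-check this pattern is to run it over the sample lines posted in this thread. The sketch below does that in Python, assuming `re` matches PCRE for this pattern apart from the `(?P<name>...)` group spelling. `VARIANT` is a hypothetical adjustment, not part of the answer above: it swaps the trailing character class for a lookahead that also accepts `]` and end of line, since some of the posted events end the file name that way.

```python
import re

# The posted pattern, with Python's (?P<name>...) named-group spelling.
AS_POSTED = re.compile(r"""/(?P<filename>[^\./]*\.\w{3,4})[\s'"<>\(\)]""")

# Hypothetical variant: a lookahead that also accepts ']' and end of line,
# so the terminator is checked but not consumed.
VARIANT = re.compile(r"""/(?P<filename>[^\./]*\.\w{3,4})(?=[\s'"<>()\]]|$)""")


def extract(pattern, line):
    """Return the captured 'filename' group, or None when nothing matches."""
    match = pattern.search(line)
    return match.group("filename") if match else None


# Shortened versions of lines posted in this thread.
samples = [
    '"/43/test.txt"',
    "<url>http://www.test1.com/43/test.txt</url>",
    "Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951",
    "sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov",
]

for line in samples:
    print(extract(AS_POSTED, line), "|", extract(VARIANT, line))
```

The posted pattern handles the quoted path and the XML-wrapped URL, but returns nothing for the bracketed path or for the file name at the very end of the line. The variant picks up all four.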

Ayn
Legend

Look, if you can't find a pattern that uniquely identifies the data you're after, then neither can Splunk. So what you need is simply to go through all the different encountered variants of filenames in your logs and find a common pattern that catches them all - or, failing that, a set of different patterns that catch them all separately.

marquiselee
Path Finder

2013-01-10 16:01:55,411 INFO [1357833716802] [775] ts=2013-01-10T21:01:55Z aid=http://access.auth.sp1.internal.net/data/Account/2189541263 id=1357833716802 t=Encoder.Task.CreateProfileJob rt=77 c=1 tm="file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov

marquiselee
Path Finder

No pattern other than that it's obviously a file name: path/file.ext

Here are a few examples, all from the same sourcetype:

2013-01-10 16:02:27,033 DEBUG [1357833733497] [775] 404 : FAILURE : 100% : Exception encountered in plugin [Encoder.Task.CreateProfileJob]! Plugin Terminated. Encode operations failed: 10102013-01-10T16:02:15-05:00Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951

2013-01-10 16:02:17,601 DEBUG [1357833739023] [742] --- input #0: sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov

lguinn2
Legend

Is there a pattern within the event that can be used to identify the file name? What does an actual event look like?

marquiselee
Path Finder

Thanks, this is heading in the right direction, but the paths are much longer than my example and the directory structure is not uniform...

/mnt/mezzanine/mezzanine/provider/business.doc
or
/pac/output/brand/media.mov
or
http://

Also, some files have four-letter extensions.

If it helps, the file extension will always be followed by either a space or one of the following characters: ' " < > ( )

marquiselee
Path Finder

test.txt was an example. There are thousands of files that are uniquely named but appear in different logs. The file names themselves aren't what's important, but in many cases they are the only thing I'll be able to join on.

lguinn2
Legend

You could do this in your search:

source=*test.txt

and it will find events from the test.txt file, whether or not it has a URL or a path or nothing at all.
If you really need a regular expression, you can even do that with the regex command.

yoursearchhere | regex "yourregexhere"

I don't think you need to do any field extractions at all. But perhaps I misunderstood the question. If this doesn't work, can you post a few lines of your data?

Are you talking about the actual name of the log file? If yes, then there is already a field extracted. Its name is source. You don't need to do a "join" - the first search will work.

Are you talking about a file name that is contained within your event data? If yes, then I need to see some of the data to help you with the field extraction.

Finally, do you want to summarize the data based on the file name? If yes, then this should work:

yoursearchhere source=*test.txt
| rex field=source "(?<filename>[^/]+)$"
| stats count by filename

Of course, you might need to modify the stats command and the initial search, etc.
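To make the extract-then-count idea concrete, here is a rough Python analogue of a `rex` followed by `stats count by filename`, applied to event text rather than to the `source` field. It is only a sketch. The `FILENAME` pattern is a hypothetical one (last path segment, 3 or 4 character extension, followed by whitespace, a quote, a bracket, or end of line), and the events are shortened versions of lines posted in this thread.

```python
import re
from collections import Counter

# Hypothetical extraction pattern; Python spells the named group (?P<name>...).
FILENAME = re.compile(r"""/(?P<filename>[^\./]*\.\w{3,4})(?=[\s'"<>()\]]|$)""")

# Shortened versions of events posted in this thread.
events = [
    "Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951",
    "input #0: sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov",
    "Unable to open input file [/pac/output/lcvtv/testmedia.mov] : 4110",
]


def count_by_filename(lines):
    """Count events per extracted file name, skipping lines with no match."""
    counts = Counter()
    for line in lines:
        match = FILENAME.search(line)
        if match:
            counts[match.group("filename")] += 1
    return counts


for name, count in count_by_filename(events).most_common():
    print(name, count)
```

Events that mention the same file under different directories or URL prefixes end up under one key, which is the property needed for joining across log types.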

marquiselee
Path Finder

The source is not the file name I'm trying to extract. The various logs (sources) contain references to hundreds of thousands of files, so a log line may look like this...

"2013-01-10 11:24:17,345 DEBUG [1357817043844] [649] 439 : FAILURE : 100% : Exception encountered in plugin [Encoder.Task.CreateJob]! Plugin Terminated. Encode operations failed: 10102013-01-10T11:24:05-05:00Unable to open input file [/pac/output/lcvtv/testmedia.mov] : 4110"
