I need to extract filenames so I can transact across many logs of different types and such.
Some logs have full URLs - http://www.test1.com/43/test.txt
Some logs have only paths - /43/test.txt
Some logs are standard-looking logs and some are actually XML data dumps that were indexed as a "standard log" - <url>http://www.test1.com/43/test.txt</url>
Sometimes the whole path may be enclosed in parentheses or quotes too - "/43/test.txt"
The basic principle is that I need to extract files (filename.ext).
I don't have access to the file system and can only use "Extract Fields" in the web interface?
any thoughts?
You can use the rex command. This matches a slash, then any run of characters other than a period, then a period and a three-character \w extension.
your_search | rex field=_raw "/(?<filename>[^\.]*\.\w{3})"
Thanks, I finally got something to work using your rex as the foundation and by specifying extensions.
[^/]/(?
your_search | rex field=_raw "/(?<filename>[^\./]*\.\w{3,4})[\s'\"<>\(\)]"
This should grab the run of non-slash, non-period characters after a slash, plus an extension of 3 or 4 \w characters, followed by one of the characters you described earlier.
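One way to sanity-check a pattern like this before pasting it into Splunk is to run it over the sample lines from this thread. The sketch below is Python, not SPL, so the named group is spelled `(?P<filename>...)` instead of `(?<filename>...)`. It also makes three assumptions that are not in the rex above: filenames contain no whitespace, a closing `]` counts as a terminator, and end of line counts as one too (the sample events in this thread end their paths that way). The function name `extract_filenames` is made up for illustration.

```python
import re

# Variant of the rex pattern above, written for Python's re module.
# Assumptions (not from the original rex): whitespace is excluded from the
# filename, and ']' or end of line are accepted as terminators.
FILENAME_RE = re.compile(
    r"""/(?P<filename>[^./\s]*\.\w{3,4})(?=[\s'"<>()\]]|$)"""
)

def extract_filenames(line):
    """Return every filename.ext found after a slash in one log line."""
    return [m.group("filename") for m in FILENAME_RE.finditer(line)]

samples = [
    "Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951",
    "<url>http://www.test1.com/43/test.txt</url>",
    '"/43/test.txt"',
]
for line in samples:
    print(extract_filenames(line))
```

Using a lookahead for the terminator keeps the delimiter out of the match, so two paths separated by a single space are both found.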
Look, if you can't find a pattern that uniquely identifies the data you're after, then neither can Splunk. So what you need is simply to go through all the different encountered variants of filenames in your logs and find a common pattern that catches them all - or, failing that, a set of different patterns that catch them all separately.
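The "set of different patterns" idea can be sketched like this, again in Python rather than SPL. Each pattern below is a guess at one variant mentioned in this thread (bracketed path, `<url>` element, quoted path, bare path or URL); none of them comes from a tested Splunk configuration, and `PATTERNS` and `first_filename` are hypothetical names.

```python
import re

# Hypothetical per-variant patterns, tried in order; each captures the basename.
PATTERNS = [
    # [/some/path/file.ext]
    re.compile(r"\[[^\]\s]*/(?P<filename>[^/\]\s]+\.\w{3,4})\]"),
    # <url>http://host/path/file.ext</url>
    re.compile(r"<url>[^<]*/(?P<filename>[^/<]+\.\w{3,4})</url>"),
    # "/some/path/file.ext" or '/some/path/file.ext'
    re.compile(r"""["'][^"'\s]*/(?P<filename>[^/"'\s]+\.\w{3,4})["']"""),
    # bare path or URL followed by whitespace or end of line
    re.compile(r"/(?P<filename>[^/\s]+\.\w{3,4})(?=\s|$)"),
]

def first_filename(line):
    """Return the filename from the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return match.group("filename")
    return None
```

The order matters: the more specific wrappers go first so that the loose bare-path pattern only runs when nothing else fits.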
2013-01-10 16:01:55,411 INFO [1357833716802] [775] ts=2013-01-10T21:01:55Z aid=http://access.auth.sp1.internal.net/data/Account/2189541263 id=1357833716802 t=Encoder.Task.CreateProfileJob rt=77 c=1 tm="
No pattern other than that it's obviously a file name: path/file.ext
Here are a few examples... all from the same sourcetype
2013-01-10 16:02:27,033 DEBUG [1357833733497] [775] 404 : FAILURE : 100% : Exception encountered in plugin [Encoder.Task.CreateProfileJob]! Plugin Terminated. Encode operations failed: 10102013-01-10T16:02:15-05:00Unable to open input file [/mnt/ops/file/ingest/TestProcessFile.mov] : 3951
2013-01-10 16:02:17,601 DEBUG [1357833739023] [742] --- input #0: sourceFile=file://strg.cp03.internal.net/data/file/ingest/TestProcessFile.mov
Is there a pattern within the event that can be used to identify the file name? What does an actual event look like?
Thanks, this is heading in the right direction, but the paths are much longer than my example and are not uniform in the directory structure...
/mnt/mezzanine/mezzanine/provider/business.doc
or
/pac/output/brand/media.mov
or
http://
Also, some files have 4-letter extensions.
If it helps, the extension on the file will always be followed only by a space or one of the following characters: ' " < > ()
test.txt was an example. There are thousands of files that are uniquely named but appear in different logs. The file names aren't what's important, but in many cases they are the only thing I'll be able to join on.
You could do this in your search:
source=*test.txt
and it will find events from the test.txt file, whether or not it has a URL or a path or nothing at all.
If you really need a regular expression, you can even do that with the regex command.
yoursearchhere | regex "yourregexhere"
I don't think you need to do any field extractions at all. But perhaps I misunderstood the question. If this doesn't work, can you post a few lines of your data?
Are you talking about the actual name of the log file? If yes, then there is already a field extracted. Its name is source. You don't need to do a "join" - the first search will work.
Are you talking about a file name that is contained within your event data? If yes, then I need to see some of the data to help you with the field extraction.
Finally, do you want to summarize the data based on the file name? If yes, then this should work:
yoursearchhere source=*test.txt
| rex field=source "(?<filename>[^/]+)$"
| stats count by filename
Of course, you might need to modify the stats command and the initial search, etc.
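For anyone who wants to check the grouping logic outside Splunk, here is a rough Python equivalent of "extract the filename from source, then count by it". It assumes the filename is whatever follows the last forward slash. The function name `count_by_filename` and the example paths in the usage note are invented for illustration.

```python
import re
from collections import Counter

# Assumption: the basename is the run of non-slash characters at the end.
BASENAME_RE = re.compile(r"(?P<filename>[^/]+)$")

def count_by_filename(sources):
    """Count how many source values share each trailing filename."""
    counts = Counter()
    for source in sources:
        match = BASENAME_RE.search(source)
        if match:
            counts[match.group("filename")] += 1
    return counts
```

For example, calling it with the made-up values `/var/log/test.txt`, `/opt/app/test.txt` and `/var/log/other.log` groups the first two under `test.txt`.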
The source is not the file name I'm trying to extract. The various logs (sources) contain references to hundreds of thousands of files, so a log line may look like this...
"2013-01-10 11:24:17,345 DEBUG [1357817043844] [649] 439 : FAILURE : 100% : Exception encountered in plugin [Encoder.Task.CreateJob]! Plugin Terminated. Encode operations failed: 10102013-01-10T11:24:05-05:00Unable to open input file [/pac/output/lcvtv/testmedia.mov] : 4110"