
How to filter out temp files when searching with Hunk?

Motivator

I'm running into an issue where a query against a virtual index errors out when it hits *.tmp files in the HDFS directory.

Is there a way to filter out *.tmp files, or otherwise prevent the query from reading them while it runs?

Cloudera said to filter the files in the target input directory to remove any .tmp files, as these are in-progress files from Flume and can get renamed during the job, causing this error.

Thx

Splunk Employee

You can use the whitelist regex; see this blog post for an example.

Splunk Employee

I'm afraid I can't figure out what's going on from the info here. The only other difference I see is that the pattern that works starts with .*?, while the other two start with .?, but that really should not matter. I think you may need to contact support to have somebody go through this with you.
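For anyone comparing the two prefixes, here is a minimal Python sketch (not Hunk's actual matcher) of when `.*?` versus `.?` would make a difference. It assumes the regex has to match the whole path, which this thread does not confirm, and the sample path and file name are hypothetical.

```python
import re

# Hypothetical HDFS path following the /LogCentral/ISE/yyyy-MM-dd layout.
path = "/LogCentral/ISE/2015-12-14/ise.log"

lazy_any = re.compile(r".*?/ISE/(\d+)-(\d+)-(\d+)/.*?")      # starts with .*?
optional_one = re.compile(r".?/ISE/(\d+)-(\d+)-(\d+)/.*?")   # starts with .?

def whole_match(pattern, text):
    """Return the captured groups if the pattern matches the entire text, else None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None

def partial_match(pattern, text):
    """Return the captured groups if the pattern matches anywhere in the text, else None."""
    m = pattern.search(text)
    return m.groups() if m else None

print(whole_match(lazy_any, path))        # .*? can absorb "/LogCentral"
print(whole_match(optional_one, path))    # .? allows at most one character before /ISE/
print(partial_match(optional_one, path))  # an unanchored search still finds the date
```

If the matcher only needs to find the pattern somewhere in the path, the two prefixes behave the same; if it must cover the whole path, only the `.*?` form can match a path with more than one character before `/ISE/`.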


Motivator

Just an update:

For the time-capturing regex I had to set the 'Time Range' to 1 day, as we're saving logs to one folder per day (12/14, 12/15, 12/16, etc.). With the 'Time Range' set to 1 day, I can now search logs per day.

Hope this helps

Thx


Motivator

Thx for the link.

I had the whitelist regex as follows: ISE.*

I then changed the regex to: (ISE.*\.(\d+))

as the Cisco ISE logs end either with a dot followed by digits when fully written, or with .tmp while the file is still being written to.
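A quick way to sanity-check that change outside Hunk is a small Python sketch like the one below. The file names are made up, and how Hunk applies the whitelist (whole-string match versus a match anywhere in the path) is an assumption here, so both behaviours are shown, plus a hypothetical end-anchored variant that is not from the original post.

```python
import re

# Pattern quoted in the post above.
whitelist = re.compile(r"(ISE.*\.(\d+))")

# Hypothetical variant anchored at the end of the name (an illustration only).
anchored = re.compile(r"ISE.*\.\d+$")

# Hypothetical file names: one fully written, one still being written by Flume.
finished = "ISE.log.1450051200"
in_progress = "ISE.log.1450051200.tmp"

def accepts_whole(name):
    """True when the quoted pattern matches the entire name."""
    return whitelist.fullmatch(name) is not None

def accepts_anywhere(name):
    """True when the quoted pattern matches somewhere inside the name."""
    return whitelist.search(name) is not None

def accepts_anchored(name):
    """True when the end-anchored variant matches somewhere inside the name."""
    return anchored.search(name) is not None

for name in (finished, in_progress):
    print(name, accepts_whole(name), accepts_anywhere(name), accepts_anchored(name))
```

Under whole-string matching the quoted pattern rejects the .tmp name; under an unanchored search it would still accept it, which is the case the `$` variant guards against.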

I have a different regex problem (the time-capturing regex) that is driving me mad, if you don't mind taking a look.

We have three directories on HDFS:

• /LogCentral/Firewall
• /LogCentral/ISE
• /LogCentral/WindowsEvent

I have the following regex applied to our Firewall virtual index and I can use the time picker no problem

.*?/Firewall/(\d+)-(\d+)-(\d+)/.*?) 

However, applying the same format to the other two logs

.?/ISE/(\d+)-(\d+)-(\d+)/.*?)
.?/WindowsEvent/(\d+)-(\d+)-(\d+)/.*?)

I get no events at all no matter what dates I select in the time picker, yet I'm using the same format.

Tried the following regex as I got a match on regex101.com:

.+\ISE\/(\d+)-(\d+)-(\d+)

Yet when I enter that and try to run a search, it errors out:

[cdhprovider] Error while running external process, return_code=255. See search.log for more info
[cdhprovider] IOException - No input paths specified in job.

Thx

Splunk Employee

As copied here, your regexes have unbalanced parentheses. For example, ".?/ISE/(\d+)-(\d+)-(\d+)/.*?)" has a final ) char that is not matched on the left. Is that a copying artifact, or what you're really using? If the latter, try removing the final ).
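One way to check for this kind of problem before pasting a pattern into the virtual index settings is to compile it locally. This is a small Python sketch; Python's `re` is not the engine Hunk uses, so treat it only as a rough check for unbalanced parentheses, and the helper name `compiles` is made up for illustration.

```python
import re

def compiles(pattern):
    """Return (True, None) if the pattern compiles, else (False, error message)."""
    try:
        re.compile(pattern)
        return True, None
    except re.error as exc:
        return False, str(exc)

# The ISE pattern as posted, with the unmatched final parenthesis.
as_posted = r".?/ISE/(\d+)-(\d+)-(\d+)/.*?)"
# The same pattern with the final parenthesis removed.
without_final_paren = r".?/ISE/(\d+)-(\d+)-(\d+)/.*?"

print(compiles(as_posted))
print(compiles(without_final_paren))
```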


Motivator

That is a copying artifact - should be:

.*?/ISE/(\d+)-(\d+)-(\d+)/.*?

Splunk Employee

OK, then that regex looks OK to me. Can you verify that the data format is the same for the ISE index as it is for the Firewall index?


Motivator

It's exact, and that's what's driving me crazy

• /LogCentral/Firewall/yyyy-MM-dd
• /LogCentral/ISE/yyyy-MM-dd
• /LogCentral/WindowsEvent/yyyy-MM-dd

and I have yyyyMMdd entered for Format
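In case it helps to see how the regex and the Format value fit together, here is a rough Python sketch; it is not Hunk code. It assumes the captured groups are concatenated and then parsed with the Format string (consistent with yyyyMMdd being used for yyyy-MM-dd folders, but still an assumption), that the regex must match the whole path, and it uses Python's `%Y%m%d` as a stand-in for yyyyMMdd. The year and the file name `events.log` are made up.

```python
import re
from datetime import datetime

TIME_REGEX = re.compile(r".*?/ISE/(\d+)-(\d+)-(\d+)/.*?")
TIME_FORMAT = "%Y%m%d"  # Python stand-in for the yyyyMMdd Format value

def folder_date(path):
    """Concatenate the captured groups and parse them with TIME_FORMAT.

    Returns None when the regex does not match the whole path.
    """
    m = TIME_REGEX.fullmatch(path)
    if m is None:
        return None
    return datetime.strptime("".join(m.groups()), TIME_FORMAT)

# Hypothetical paths under the layout listed above.
print(folder_date("/LogCentral/ISE/2015-12-14/events.log"))
print(folder_date("/LogCentral/Firewall/2015-12-14/events.log"))
```

A path that yields None here would have no extracted date at all, so running a few real paths through a check like this can show whether the regex or the Format is the part that fails.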
