Getting Data In

How to filter out temp files when searching with Hunk?

jwalzerpitt
Influencer

Running into an issue where a query against a virtual index errors out when it hits *.tmp files in the HDFS directory.

Is there a way to filter, or prevent the query from looking at *.tmp files as it's performing the query?

Cloudera's advice was to filter the files in the target input directory to exclude any .tmp files, as these are in-progress files from Flume that can get renamed during the job, causing this error.

Thx

1 Solution

Ledion_Bitincka
Splunk Employee

You can use the whitelist regex; see this blog post for an example.
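
For reference, here's a minimal indexes.conf sketch of that approach. The vix.input.* property names are from the Hunk virtual index documentation as I remember it; the stanza name and path below are made up for illustration:

[ise_vix]
vix.provider = cdhprovider
# recurse into the input directory
vix.input.1.path = /LogCentral/ISE/...
# whitelist: only read files whose path matches this regex
vix.input.1.accept = ISE.*\.\d+$
# blacklist: explicitly skip Flume's in-progress files
vix.input.1.ignore = \.tmp$

Either accept or ignore alone should be enough to keep the *.tmp files out of the job.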


kschon_splunk
Splunk Employee

I'm afraid I can't figure out what's going on from the info here. The only other difference I see is that the pattern that works starts with .*?, while the other two start with .?, but that really should not matter. I think you may need to contact support to have somebody go through this with you.


jwalzerpitt
Influencer

Just an update:

For the time-capturing regex I had to set the 'Time Range' to 1 day, as we're saving logs to one folder per day (12/14, 12/15, 12/16, etc.). By setting the 'Time Range' to 1 day, I can now search logs per day.
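
In case anyone is configuring this in indexes.conf rather than through the UI: I believe the 'Time Range' field maps to the et/lt offset properties of the virtual index. A sketch for one-folder-per-day data using our ISE paths; the property names are from the Hunk docs and the values are illustrative:

# earliest time, extracted from the daily folder name
vix.input.1.et.regex = .*?/ISE/(\d+)-(\d+)-(\d+)/.*
vix.input.1.et.format = yyyyMMdd
vix.input.1.et.offset = 0
# latest time: the same folder date plus one day (86400 seconds)
vix.input.1.lt.regex = .*?/ISE/(\d+)-(\d+)-(\d+)/.*
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400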

Hope this helps

Thx



jwalzerpitt
Influencer

Thx for the link.

I had the whitelist regex as follows: ISE.*

I then changed the regex to: (ISE.*\.(\d+))

as the Cisco ISE logs end either with a dot followed by digits when fully written, or with .tmp while the file is still being written to.
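
A quick sanity check against made-up file names (assuming the whitelist has to match through the end of the name):

ISE.log.2014-12-15.1      -> matches (ISE.*\.(\d+))   (fully written)
ISE.log.2014-12-15.5.tmp  -> no match                 (still being written)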

I have a different regex problem (the time-capturing regex) that's driving me mad, if you don't mind taking a look.

We have three directories on HDFS:

• /LogCentral/Firewall
• /LogCentral/ISE
• /LogCentral/WindowsEvent

I have the following regex applied to our Firewall virtual index, and I can use the time picker with no problem:

.*?/Firewall/(\d+)-(\d+)-(\d+)/.*?) 

However, when I apply the same format to the other two logs:

.?/ISE/(\d+)-(\d+)-(\d+)/.*?)
.?/WindowsEvent/(\d+)-(\d+)-(\d+)/.*?)

I get no events at all no matter what dates I select in the time picker, yet I'm using the same format.

I tried the following regex, as I got a match on regex101.com:

.+\ISE\/(\d+)-(\d+)-(\d+)

Yet when I enter that and try to run a search, it errors out:

[cdhprovider] Error while running external process, return_code=255. See search.log for more info
[cdhprovider] IOException - No input paths specified in job.

Thx


kschon_splunk
Splunk Employee

As copied here, your regexes have unbalanced parentheses. For example, ".?/ISE/(\d+)-(\d+)-(\d+)/.*?)" has a final ) char that is not matched on the left. Is that a copying artifact, or what you're really using? If the latter, try removing the final ).


jwalzerpitt
Influencer

That is a copying artifact - should be:

.*?/ISE/(\d+)-(\d+)-(\d+)/.*?

kschon_splunk
Splunk Employee

OK, then that regex looks OK to me. Can you verify that the data format is the same for the ISE index as it is for the Firewall index?


jwalzerpitt
Influencer

It's exactly the same, and that's what's driving me crazy:

• /LogCentral/Firewall/yyyy-MM-dd
• /LogCentral/ISE/yyyy-MM-dd
• /LogCentral/WindowsEvent/yyyy-MM-dd

and I have yyyyMMdd entered for Format
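
My understanding from the blog post is that the capture groups get concatenated and the result is parsed with the Format string, so this should line up. A worked example with a made-up path:

Path:     /LogCentral/ISE/2014-12-15/ISE.log.1
Regex:    .*?/ISE/(\d+)-(\d+)-(\d+)/.*?
Captures: 2014, 12, 15 -> concatenated to 20141215
Format:   yyyyMMdd     -> parsed as Dec 15, 2014

The hyphens in the folder names fall outside the capture groups, so yyyy-MM-dd directories with a yyyyMMdd Format should work.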
