Getting Data In

How to filter out temp files when searching with Hunk?

jwalzerpitt
Influencer

Running into an issue where a query against a virtual index errors out when it hits *.tmp files in the HDFS directory.

Is there a way to filter, or prevent the query from looking at *.tmp files as it's performing the query?

Cloudera said to filter the files in the target input directory to remove any .tmp files, as these are in-progress files from Flume and can get renamed during the job, causing this error.

Thx

0 Karma
1 Solution

Ledion_Bitincka
Splunk Employee

You can use the whitelist regex, see this blog post for an example


0 Karma

kschon_splunk
Splunk Employee

I'm afraid I can't figure out what's going on from the info here. The only other difference I see is that the pattern that works starts with .*?, while the other two start with .?, but that really should not matter. I think you may need to contact support to have somebody go through this with you.

0 Karma

jwalzerpitt
Influencer

Just an update:

For the time capturing regex I had to set the 'Time Range' to 1 day, as we're saving logs to one folder per day (12/14, 12/15, 12/16, etc.). With the 'Time Range' set to 1 day, I can now search logs per day.
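
To make the arithmetic concrete: with one folder per day, the date taken from the folder name marks the start of that folder's window, and a 'Time Range' of 1 day marks its end. Below is a minimal Python sketch of that reasoning. The year in the sample dates is hypothetical, the function names are made up for illustration, and treating the window as half-open is an assumption, not something confirmed about Hunk.

```python
from datetime import datetime, timedelta

def folder_window(folder_date: str, time_range: timedelta = timedelta(days=1)):
    """Return (start, end) of the time window covered by a yyyy-MM-dd folder."""
    start = datetime.strptime(folder_date, "%Y-%m-%d")
    return start, start + time_range

def overlaps(search_start: datetime, search_end: datetime, folder_date: str) -> bool:
    """True if the search range [search_start, search_end) touches the folder's window."""
    start, end = folder_window(folder_date)
    return start < search_end and search_start < end

# Hypothetical search for a single day; only the matching folder should be read.
search_start = datetime(2015, 12, 15)
search_end = datetime(2015, 12, 16)
folders = ["2015-12-14", "2015-12-15", "2015-12-16"]
selected = [f for f in folders if overlaps(search_start, search_end, f)]
```

With a range shorter or longer than the folder granularity, the same check would either miss data or read more folders than needed, which is consistent with the behavior described above.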

Hope this helps

Thx

0 Karma

Ledion_Bitincka
Splunk Employee

You can use the whitelist regex, see this blog post for an example

0 Karma

jwalzerpitt
Influencer

Thx for the link.

I had the whitelist regex as follows: ISE.*

I then changed the regex to: (ISE.*\.(\d+))

as the Cisco ISE logs either end with . when fully written, or .tmp while the file is still being written to.
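
For anyone who wants to check a whitelist pattern offline before putting it into the virtual index, here is a minimal Python sketch. The file names are hypothetical, the trailing `$` anchor is an addition that is not in the regex above, and Python's `re` module is only a stand-in for whatever regex engine and matching mode Hunk applies, so treat this as a sanity check rather than a description of Hunk's behavior.

```python
import re

# Variant of the whitelist regex above with an explicit end anchor (an assumption
# added here so the result does not depend on partial vs. full matching).
WHITELIST = re.compile(r"ISE.*\.\d+$")

def is_whitelisted(path: str) -> bool:
    """Return True if the path ends in .<digits>, i.e. looks fully written."""
    return WHITELIST.search(path) is not None

# Hypothetical paths, for illustration only.
sample_paths = [
    "/LogCentral/ISE/2015-12-14/ISE.1450051200000",
    "/LogCentral/ISE/2015-12-14/ISE.1450051200001.tmp",
]
kept = [p for p in sample_paths if is_whitelisted(p)]
```

Without the end anchor, a partial-match engine could still accept a name such as `ISE.1450051200001.tmp`, because the digits before `.tmp` satisfy the pattern on their own.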

I have a different regex problem (time capturing regex) that is driving me mad, if you don't mind taking a look.

We have three directories on HDFS:

• /LogCentral/Firewall
• /LogCentral/ISE
• /LogCentral/WindowsEvent

I have the following regex applied to our Firewall virtual index and I can use the time picker no problem

.*?/Firewall/(\d+)-(\d+)-(\d+)/.*?) 

However, applying the same format to the other two logs

.?/ISE/(\d+)-(\d+)-(\d+)/.*?)
.?/WindowsEvent/(\d+)-(\d+)-(\d+)/.*?)

I get no events at all no matter what dates I select in the time picker, yet I'm using the same format.

Tried the following regex as I got a match on regex101.com:

.+\ISE\/(\d+)-(\d+)-(\d+)

Yet when I enter that and try to run a search, it errors out:

[cdhprovider] Error while running external process, return_code=255. See search.log for more info
[cdhprovider] IOException - No input paths specified in job.

Thx

0 Karma

kschon_splunk
Splunk Employee

As copied here, your regexes have unbalanced parentheses. For example, ".?/ISE/(\d+)-(\d+)-(\d+)/.*?)" has a final ) char that is not matched on the left. Is that a copying artifact, or what you're really using? If the latter, try removing the final ).
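
A quick way to rule out this kind of problem is to compile the pattern locally before pasting it into the configuration. The sketch below uses Python's `re` module as a stand-in for Hunk's regex engine (an assumption; the two engines differ in details, but both should reject an unmatched closing parenthesis). The helper name `check_pattern` is hypothetical.

```python
import re

def check_pattern(pattern: str) -> str:
    """Return 'ok' if the pattern compiles, otherwise the regex error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"invalid: {exc}"
    return "ok"

# The same pattern with and without the stray closing parenthesis.
with_stray_paren = r".*?/ISE/(\d+)-(\d+)-(\d+)/.*?)"
balanced = r".*?/ISE/(\d+)-(\d+)-(\d+)/.*?"

results = {p: check_pattern(p) for p in (with_stray_paren, balanced)}
for pattern, status in results.items():
    print(f"{pattern!r}: {status}")
```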

0 Karma

jwalzerpitt
Influencer

That is a copying artifact - should be:

.*?/ISE/(\d+)-(\d+)-(\d+)/.*?

0 Karma

kschon_splunk
Splunk Employee

OK, then that regex looks OK to me. Can you verify that the data format is the same for the ISE index as it is for the Firewall index?

0 Karma

jwalzerpitt
Influencer

It's exact, and that's what's driving me crazy

• /LogCentral/Firewall/yyyy-MM-dd
• /LogCentral/ISE/yyyy-MM-dd
• /LogCentral/WindowsEvent/yyyy-MM-dd

and I have yyyyMMdd entered for Format
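
One way to reason about how the time capturing regex and the Format field fit together is sketched below in Python. It assumes the captured groups are concatenated in order and then parsed with the configured format, which is my understanding of how the extraction works rather than something confirmed in this thread. The sample path is hypothetical, and `%Y%m%d` stands in for `yyyyMMdd`.

```python
import re
from datetime import datetime

# The balanced form of the ISE pattern from this thread.
TIME_REGEX = re.compile(r".*?/ISE/(\d+)-(\d+)-(\d+)/.*?")

def extract_day(path: str):
    """Return the day encoded in the path, or None if the regex does not match."""
    m = TIME_REGEX.match(path)
    if m is None:
        return None
    joined = "".join(m.groups())  # e.g. "2015" + "12" + "14" -> "20151214"
    return datetime.strptime(joined, "%Y%m%d")

# Hypothetical path under the yyyy-MM-dd folder layout listed above.
day = extract_day("/LogCentral/ISE/2015-12-14/ISE.1450051200000")
```

Under that assumption, a folder layout of yyyy-MM-dd and a Format of yyyyMMdd are consistent, because the hyphens sit outside the capture groups and never reach the date parser.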

0 Karma