How to log whole site content whilst excluding specific file extensions and file types

Explorer

(I am an absolute novice at this; the answer may be obvious, but I am still learning the trade, so please bear with me.)

For this exercise I am trying to index the whole site, e.g. www.lkm93.com, whilst excluding file types that may cause my daily indexing allowance to go over the limit.

The regex I have figured out so far is:

[waf_exclude]
DEST_KEY = queue 
FORMAT  = nullQueue 
REGEX = .*\(tif|mp3|jpg|js|css|mp4|java|waf|png|gif|svg|jpeg|JPG|JS|JPEG|MID|MIDI|MP3|MP4|MPG|MPEG|PDF|PNG|TIFF|TXT|WAV|ZIP)

(I have repeated some extensions in capital letters to make sure I match the extensions in both cases.)

I believe this should index everything on my site www.lkm93.com, and the regex I have added should exclude the named file extensions. I have reloaded the transforms.conf file, but I don't seem to be pulling in any data beyond what I was already pulling in. Is there anything obvious that I could be missing here?

Legend

Hi @lkm93,
first, I assume that you also edited props.conf, adding:

[your_sourcetype]
TRANSFORMS-waf_exclude = waf_exclude

Then, where did you put props.conf and transforms.conf? They must be on the Indexers or (when present) on the Heavy Forwarders.

Then, did you restart Splunk after modifying props.conf and transforms.conf?

Then, you didn't escape the last parenthesis: the opening one is escaped but the closing one is not, so they don't match. The correct regex is .*\(tif|mp3|jpg|js|css|mp4|java|waf|png|gif|svg|jpeg|JPG|JS|JPEG|MID|MIDI|MP3|MP4|MPG|MPEG|PDF|PNG|TIFF|TXT|WAV|ZIP\)
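
A quick way to see what the parentheses do is sketched below with Python's re module rather than Splunk's own regex engine (an assumption: the two behave the same for patterns this simple). The shortened extension list, the helper names and the sample lines are made up for illustration. Note that \( and \) match literal parenthesis characters, so if the goal is "a dot followed by one of these extensions", a literal dot plus an unescaped group, \.(tif|...), is the form that expresses it.

```python
import re

# Shortened, hypothetical extension list; the real one is much longer.
UNBALANCED = r".*\(tif|mp3|zip)"      # opening paren escaped, closing one not
BOTH_ESCAPED = r".*\(tif|mp3|zip\)"   # both escaped: they are literal characters
DOT_GROUP = r".*\.(tif|mp3|zip)"      # literal dot, then a real group

def compiles(pattern: str) -> bool:
    """Return True if the pattern is syntactically valid."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def matches(pattern: str, line: str) -> bool:
    """Return True if the pattern matches anywhere in the line."""
    return re.search(pattern, line) is not None

# Made-up sample events.
static_hit = "GET /media/song.mp3 HTTP/1.1"
page_hit = "GET /blog/mp3-players HTTP/1.1"

print("unbalanced compiles:", compiles(UNBALANCED))
print("both escaped matches a normal page:", matches(BOTH_ESCAPED, page_hit))
print("dot + group:", matches(DOT_GROUP, static_hit), matches(DOT_GROUP, page_hit))
```

With both parentheses escaped, the alternation is no longer grouped, so any event that merely contains "mp3" matches; the dot-plus-group form only matches when the extension follows a literal dot.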

In any case, check your regex using the regex command:

index=your_index
| regex ".*\(tif|mp3|jpg|js|css|mp4|java|waf|png|gif|svg|jpeg|JPG|JS|JPEG|MID|MIDI|MP3|MP4|MPG|MPEG|PDF|PNG|TIFF|TXT|WAV|ZIP\)"

Finally, I saw that some extensions appear in uppercase but not in lowercase, or the reverse (ZIP, css, etc.).
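
One way to avoid listing every extension twice is a case-insensitive match. A minimal sketch, using Python's re module as a stand-in for Splunk's regex engine (an assumption) and a shortened, made-up extension list. The inline (?i) flag is also valid PCRE syntax, so the same prefix should carry over to a REGEX = line, but test it against your own data first.

```python
import re

# (?i) makes the whole pattern case-insensitive, so each extension is listed once.
EXCLUDE = re.compile(r"(?i)\.(tif|mp3|jpg|css|zip)")

def is_excluded(line: str) -> bool:
    """Return True if the line contains a dot followed by a listed extension."""
    return EXCLUDE.search(line) is not None

# Made-up sample request lines in mixed case.
for line in ["GET /a/photo.JPG", "GET /a/style.Css", "GET /a/index.html"]:
    print(is_excluded(line), line)
```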

Ciao.
Giuseppe

Explorer

Hello Giuseppe,

thank you for your prompt reply.

I have re-arranged my props.conf file after reading your reply and also re-configured the transforms.conf file.

Here's how my props.conf file looks now:

[waf_log]
pulldown_type = true
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TRANSFORMS-null = waf_include,waf_exclude,waf_include_xapi,waf_drop_x
LEARN_SOURCETYPE = false
TZ = GMT

Transforms.conf looks like this:

[waf_include]
DEST_KEY = queue
FORMAT = indexQueue
REGEX = .*

[waf_exclude]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = .*\.(tif|mp3|jpg|js|css|java|Ico|waf|png|gif|svg|jpeg|avi|mid|midi|mpg|mpeg|mov|qt|png|ram|rar|tiff|txt|wav|zip|TIF|MP3|CSS|JAVA|ICO|WAF|PNG|SVG|AVI|CSS|EXE|GIF|JPG|JS|JPEG|MID|MIDI|MPG|MPEG|MOV|QT|PNG|RAM|RAR|TIFF|TXT|WAV|ZIP).*

[waf_include_xapi]
DEST_KEY = queue
FORMAT = indexQueue
REGEX = blah-blah

[waf_drop_x]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = blahblah

My props.conf and transforms.conf files are on the Splunk manager; I thought that would be the reasonable place to have them.

I also discovered that I can refresh all the .conf files via https://splunk-fqdn/en-US/debug/refresh. Do I definitely need to restart Splunk after the changes I have just made?

And lastly, I have fixed the regex to pick up whole URLs on that domain; it's picking up everything I need in the tests I have done. The extensions have also been fixed; I was in a rush to get the question out to the world. Thank you!

What do you think of this now?

Legend

Hi @lkm93,
first, you don't strictly need the waf_include stanza, but I usually insert it!
Then, you don't need the * in REGEX = .*; you can use REGEX = . instead.

Then, you don't need the other include stanzas when you have REGEX = ., because everything that you didn't discard is already kept. So try something like this:
in props.conf

[waf_log]
pulldown_type = true
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
TRANSFORMS-null = waf_include,waf_exclude
LEARN_SOURCETYPE = false
TZ = GMT

in transforms.conf

[waf_include]
DEST_KEY = queue
FORMAT = indexQueue
REGEX = .*

[waf_exclude]
DEST_KEY = queue
FORMAT = nullQueue
REGEX = .*\.(tif|mp3|jpg|js|css|java|Ico|waf|png|gif|svg|jpeg|avi|mid|midi|mpg|mpeg|mov|qt|png|ram|rar|tiff|txt|wav|zip|TIF|MP3|CSS|JAVA|ICO|WAF|PNG|SVG|AVI|CSS|EXE|GIF|JPG|JS|JPEG|MID|MIDI|MPG|MPEG|MOV|QT|PNG|RAM|RAR|TIFF|TXT|WAV|ZIP).*
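
To make the order of the transforms concrete, here is a small simulation in Python (not Splunk code). It assumes what the configuration above relies on: the transforms listed in TRANSFORMS-null run left to right, and each one whose REGEX matches overwrites the queue key, so the last match decides. The function name, the shortened extension list and the sample events are made up for illustration.

```python
import re

# (name, regex, queue) in the same order as
# TRANSFORMS-null = waf_include,waf_exclude
TRANSFORMS = [
    ("waf_include", re.compile(r"."), "indexQueue"),
    ("waf_exclude", re.compile(r"\.(tif|mp3|jpg|js|css|png|gif|zip)"), "nullQueue"),
]

def route(event: str, default: str = "indexQueue") -> str:
    """Return the queue an event ends up in after all transforms have run."""
    queue = default
    for _name, regex, dest in TRANSFORMS:
        if regex.search(event):
            queue = dest  # a later match overwrites an earlier one
    return queue

# Made-up sample events.
events = [
    "GET /index.html 200",
    "GET /img/logo.png 200",
    "GET /api/items?id=7 200",
]
for event in events:
    print(route(event), event)
```

Because the default is already indexQueue in this sketch, dropping waf_include from the list gives the same result. Also note that an unanchored pattern like this one treats /data.json as a .js hit; adding a boundary after the extension is one way to tighten it.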

Ciao.
Giuseppe

Explorer

Hi @gcusello

Thank you for this. I applied this configuration and it seems to be working as you described! It is no longer picking up the unwanted extensions.
