I am having trouble finding events by keyword when searching log files for anomalies. The search string below misses some matches in the _raw event data (it does not detect %2e in the failed examples), although it detects the same string in other events. What could be causing the different behavior? (The events that were missed come from the same log file as some of the entries that were caught.) Is there a better command for this type of search that avoids rex regex extraction, which is very slow to complete?
`| rex ".*(?<mal>(%2[Ee])|(%[Cc]1)|(%1[Cc])|(%[Cc]0)|(%[Aa][Ee])|(%9[Cc])|(%5[Cc]))" | search mal=*`
Search String
sourcetype=access_combined (%2E OR %C1 OR %1C OR %C0 OR %AE)
Log Entries Captured by Search
10.2.0.1 - - [16/Aug/2015:14:52:25 -0600] "GET /.%c0%80.jsp HTTP/1.1" 302 384
10.2.0.1 - - [16/Aug/2015:14:52:25 -0600] "GET /%2e/WEB-INF/web.xml HTTP/1.1" 302 388
10.50.7.1 10.50.7.2 [16/Aug/2015:11:27:10 -0600] [pid 14680:tid 47938897791296] "GET /EAA/faces/%c0%ae/WEB-INF/web.xml%00.jsp HTTP/1.1" 404 188 75
[Sun Aug 16 14:52:25 2015] [error] [client 10.2.0.1] Invalid URI in request GET /%2e%2e/META-INF/ HTTP/1.1
Log Entries Not Captured by Search
10.2.0.1 20.10.2.2 [21/Aug/2015:03:08:30 -0600] [pid 15599:tid 46921285732672] "GET /nice%20ports%2C/Tri%6Eity.txt%2ebak HTTP/1.0" 404 188 1187
10.2.0.1 - [21/Aug/2015:03:11:20 -0600] [pid 28088:tid 4158601424] "GET /nice%20ports%2C/Tri%6Eity.txt%2ebak HTTP/1.0" 404 344 1095
Thanks
I have finally heard back from the Engineering team, and I understand that this behaviour is by design.
To explain it, please take a look at the following links:
http://docs.splunk.com/Documentation/Splunk/latest/Admin/Segmentersconf
http://docs.splunk.com/Documentation/Splunk/6.2.5/Knowledge/Createandmaintainsearch-timefieldextract...
Let me take one example each of a working and a non-working case:
Event in Set A:
10.2.0.1 - - [16/Aug/2015:14:52:25 -0600] "GET /%2e/WEB-INF/web.xml HTTP/1.1" 302 388
Event in Set B:
10.2.0.1 20.10.2.2 [21/Aug/2015:03:08:30 -0600] [pid 15599:tid 46921285732672] "GET /nice%20ports%2C/Tri%6Eity.txt%2ebak HTTP/1.0" 404 188 1187
When you search for %2E, the event in Set A is listed, but the event in Set B is not. This is because %2e in Set A sits between delimiters (the minor breaker / ) and therefore becomes a token of its own. In Set B, %2e is part of a larger token, so the event is not listed in the results.
Similarly, if you search for %20, the event in Set B is not listed, because %20 is part of a larger token. However, if you search for %2C, the event is listed, because %2C is adjacent to a minor breaker ( / ).
In summary, if you search for a string without making the search greedy (that is, without a wildcard), the string needs to be a token, or a sub-token (if configured as explained in the link above). If your string is only part of a larger token, you need to either search for the whole token, make your search greedy, or extract the string as a sub-token.
To keep your search non-greedy in this case, you would have to extract the required sub-tokens (%2E, %C1, %1C, %C0, %AE, etc.) into a field using props.conf, and then search on that field. For example, this is what I did for the %2e and %20 cases:
In props.conf, I created the following:
[segment]
EXTRACT-hex1 = (?<hex_code1>%2e)
EXTRACT-hex2 = (?<hex_code2>%20)
In fields.conf:
[hex_code1]
INDEXED = False
INDEXED_VALUE = False
[hex_code2]
INDEXED = False
INDEXED_VALUE = False
Next, when I search for, say, hex_code1=%2e, the results returned are the same as for the greedy search %2e*. The same holds for hex_code2=%20.
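For comparison, the greedy alternative mentioned above could be written as a single search covering the whole list of codes. This is only a sketch: it reuses the sourcetype from the question and the codes from the original rex expression, and it has not been checked against your data. Keyword matching in Splunk searches is case-insensitive, so %2e* also covers %2E.

```
sourcetype=access_combined (%2e* OR %c1* OR %1c* OR %c0* OR %ae* OR %9c* OR %5c*)
```

If a code can also appear after other characters within the same token, a leading wildcard (for example *%2e*) may be needed as well. Leading wildcards generally make a search slower, so it is worth comparing the run time against the field-based search.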
Use this -
sourcetype=access_combined ("%2E" OR "%C1" OR "%1C" OR "%C0" OR "%AE" OR "%2e")
I have tried the search with quotation marks around each keyword, and it still fails to detect many of the entries that the much slower rex extraction catches. Is there an option I can enable, or a different search method I can use, to match keywords against the _raw data? Is there any processing performed at index time that could prevent events from being caught by keyword filters because of erroneous optimizations?