I have an issue with event line breaking in an access log that I hope someone can guide me on.
We have an access log where every line is an event. The issue: events are randomly broken mid-line.
Looking at the source file on the app server, event breaking is always correct.
When using "Show source" in the Splunk GUI, it shows the wrong event breaking.
Some more details on our config:
• We use an indexer cluster (4 nodes) with auto load balancing
• We use "useAck"
Discoveries summarized:
• Line breaking is always correct when looking at the source file on the application servers.
• Using "Show source" in Splunk always shows the wrong event breaking when the error occurs.
• When the issue occurs, the break lands a random number of bytes into the line.
• There are 4 application servers generating the access logs, each running a Splunk forwarder; the errors occur on all of the servers, equally distributed.
• The access log files on the servers are rotated at midnight every night.
• The occurrence rate of the issue correlates with volume (the issue is seen more often during high-volume periods than during low-volume periods).
• The issue does not correlate with the times the forwarder switches indexers (as part of the autoLB functionality). I tested this based on a hypothesis that the wrong event breaking occurs when the forwarder switches to the next indexer.
• The ratio of errors to total count varies between 0.05 and 0.12 percent, and correlates with an increased number of events.
One sample event with a wrong event break:
[09/Jun/2017:10:15:36 +0200] clientip=... user=- verb=GET uri=/……&WIDTH=256&HEIGHT=256 useragent="Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, li
ke Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/..." origin=- referer=http://www.***/ respcode=200 resptime=5 xfwdfor="..., ..." guid=b3812af9-6715-a492-0afb-73ddbe57428d
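To quantify the error ratio described above, a search along these lines can count events whose raw text does not start with the expected timestamp (the index and sourcetype names here are placeholders, and the regex assumes every good event starts with the bracketed timestamp):

```
index=web sourcetype=appname-access-log
| eval broken=if(match(_raw, "^\[\d{2}/\w{3}/\d{4}:"), 0, 1)
| stats count AS total sum(broken) AS broken_count
| eval broken_pct=round(100*broken_count/total, 3)
```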
I have added the following props.conf settings (on the indexers) for the sourcetype:
ANNOTATE_PUNCT = True
AUTO_KV_JSON = true
BREAK_ONLY_BEFORE_DATE = true
CHARSET = UTF-8
DATETIME_CONFIG = /etc/datetime.xml
LEARN_SOURCETYPE = true
LINE_BREAKER_LOOKBEHIND = 100
LOOKUP-dropdowns = dropdownsLookup host OUTPUT unixcategory unixgroup
MATCH_LIMIT = 100000
MAX_DAYS_AGO = 2000
MAX_DAYS_HENCE = 2
MAX_DIFF_SECS_AGO = 3600
MAX_DIFF_SECS_HENCE = 604800
MAX_EVENTS = 256
MAX_TIMESTAMP_LOOKAHEAD = 128
SEGMENTATION = indexing
SEGMENTATION-all = full
SEGMENTATION-inner = inner
SEGMENTATION-outer = outer
SEGMENTATION-raw = none
SEGMENTATION-standard = standard
SHOULD_LINEMERGE = true
TIME_PREFIX = \[
TRUNCATE = 10000
detect_trailing_nulls = false
maxDist = 100
Just use this as your sourcetype and test the line breaking once. Remove any other line-breaker-related configuration apart from the following and try:
[appname-access-log]
TIME_PREFIX = \[
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_FORMAT = %d/%b/%Y:%T %z
LINE_BREAKER = ([\n\r]+)(?=\[\d+\/\w+\/\d+\:\d+)
SHOULD_LINEMERGE = False
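A quick way to sanity-check a LINE_BREAKER regex outside of Splunk is to apply it with Python's `re.split`, since Splunk uses the first capture group as the event delimiter. This is only a sketch; the sample lines below are made up, and it only checks the regex, not Splunk's actual pipeline behavior:

```python
import re

# The LINE_BREAKER suggested above: break on newlines that are
# immediately followed by a "[dd/Mon/yyyy:hh..." timestamp.
LINE_BREAKER = r"([\n\r]+)(?=\[\d+/\w+/\d+:\d+)"

raw = (
    "[09/Jun/2017:10:15:36 +0200] verb=GET uri=/a respcode=200\n"
    "[09/Jun/2017:10:15:37 +0200] verb=GET uri=/b respcode=200\n"
)

# re.split keeps the captured delimiter; drop the delimiter-only and
# empty pieces to get the list of events.
events = [part for part in re.split(LINE_BREAKER, raw) if part.strip()]

for e in events:
    print(e)
```

With the sample above this yields two events, each starting at its own timestamp, which is the behavior the suggested stanza is meant to enforce.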
Why are you line merging? Are these events not all on one line? Are you using a custom line breaker?
Are you using a UF on the app servers? Is there a heavy forwarder involved in the data path to the indexers?
Keep in mind that examining the source file after the fact is not a true representation of what was going on, as Splunk's tail reader monitors in close to real time. The general behavior I have found is that there was a break in the file write, so Splunk thinks the line is done or the file has been closed.
I have this situation at one of my clients, described in another Answers post, and am trying to find a surefire way to identify the root cause. https://answers.splunk.com/answers/545031/sending-logs-over-scp-to-heavy-forwarder-why-does-1.html#c...
Are you using forceTimeBasedAutoLB?
forceTimebasedAutoLB = [true|false]
* Forces existing streams to switch to a newly elected indexer every autoLB cycle.
* On universal forwarders, use the EVENT_BREAKER_ENABLE and EVENT_BREAKER settings in props.conf rather than forceTimebasedAutoLB for improved load balancing, line breaking, and distribution of events.
* Defaults to false.
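Per the spec text above, on a universal forwarder the event-breaker settings go in props.conf. A minimal stanza might look like this (the sourcetype name is an example, and the regex is the one suggested earlier in this thread):

```
# props.conf on the universal forwarder
[appname-access-log]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\n\r]+)(?=\[\d+\/\w+\/\d+\:\d+)
```

With this in place, the UF only switches indexers at event boundaries, so autoLB rotation cannot split an event mid-line.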
Also, do you see either of these symptoms on the forwarder(s) involved?
1) Far too many files getting dropped at the same time. There will be a log message "File descriptor cache is full (100), trimming".
2) TcpOutputProc WARN messages regarding read timeouts.
If none of the above apply, feel free to gather debug output from TailReader, WatchedFile, and TailingProcessor and open a case.
Use the CLI commands so that you don't need to restart Splunk.
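For reference, the log channels mentioned above can usually be raised to DEBUG at runtime with the `splunk set log-level` CLI. This is only a sketch; it assumes a standard Splunk install and is run as the splunk user from $SPLUNK_HOME/bin:

```
# Raise the tailing-related channels to DEBUG; no restart required
splunk set log-level TailReader -level DEBUG
splunk set log-level WatchedFile -level DEBUG
splunk set log-level TailingProcessor -level DEBUG
# Revert when done collecting the diag
splunk set log-level TailReader -level INFO
splunk set log-level WatchedFile -level INFO
splunk set log-level TailingProcessor -level INFO
```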
Thank you mmodestino_splunk
On the app servers there is a UF, and there is no HF between the UFs and the indexers.
1) I have tested the following on the forwarder:
- forceTimebasedAutoLB = false
- EVENT_BREAKER_ENABLE = true
The issue remained the same.
2) I have not found any errors of type "File descriptor cache is full (100), trimming".
3) I have not found any warnings of type TcpOutputProc.
I will turn on debugging and open a case.
Were you able to get this issue resolved? I'm seeing this issue on a log that is generated on the search head, with no forwarder involved. The lines themselves are thousands of characters long, so I think I have the same issue as you.