Getting Data In

Short JSON gets truncated even if TRUNCATE is set to 0

bestSplunker
Contributor

I have some JSON data that is forwarded to a universal forwarder (UF) via syslog. The universal forwarder then forwards it to the indexer cluster.

syslog (on log source server) --> UF --> indexer cluster

When I search, I can see that some short JSON events got truncated. They are cut off after about 1000~2000 characters. I set props.conf on the indexers as follows:

[mysourcetype]
INDEXED_EXTRACTIONS = json
category = Structured
SHOULD_LINEMERGE  = false
disabled = false
pulldown_type = true
TIME_FORMAT = %s
TIME_PREFIX = ^\{"timestamp":
TRUNCATE = 0

Raw JSON:

{"timestamp":1527213681,"request_headers":{"host":"172.10.101.200:8888","connection":"keep-alive","referer":"http:\/\/172.10.101.200:8888\/superset\/dashboard\/ptsjyy\/?preselect_filters=%7B%22114%22%3A%7B%22__time_grain%22%3A%22month%22%2C%22source%22%3A%5B%5D%7D%7D","accept-encoding":"gzip, deflate, sdch","x-requested-with":"XMLHttpRequest","cookie":"session=.eJyV0N1qxCQABeB3mesQfxONr7KUMNGxhtq6qNtlW_ruFXrdQudu4HxzYD5hj5VaAtfrjSbYzwAOlJFcxKA3iWa13uChtOJSo0WU1iiYwLca915e6G3kjd7sogXxzUfOURgTOQmzbkoRX6TX64FeLmG4XDxmGuYjje2Kz7Sns_VSH-AukHq_OsaEkbNcZsn5LAV3dgwL2NJRsIbXEii_n3RneUA2rvyhOh6Z_iV-7Xma4Nao_nxIwNc3efpdRg.Ddpwlg.tze07woTDVEN_iZ4sIS_jzT4VGI","accept-language":"en-US,en;q=0.8","user-agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/54.0.2840.71 Safari\/537.36","accept":"application\/json, text\/javascript, *\/*; q=0.01"},"id":"34a2743ad59c93907872","method":"GET","uri":"\/superset\/explore_json\/table\/39\/","client":"10.195.28.22","uri_args":{"form_data":"{\"datasource\":\"39__table\",\"viz_type\":\"area\",\"slice_id\":113,\"granularity_sqla\":\"riqi\",\"time_grain_sqla\":\"month\",\"since\":\"\",\"until\":\"\",\"metrics\":[\"sum__VAL_02\"],\"groupby\":[\"source\"],\"limit\":50,\"timeseries_limit_metric\":null,\"order_desc\":true,\"show_brush\":false,\"show_legend\":true,\"line_interpolation\":\"linear\",\"stacked_style\":\"stack\",\"color_scheme\":\"googleCategory20c\",\"rich_tooltip\":true,\"contribution\":false,\"show_controls\":false,\"x_axis_format\":\"%Y-%m-%d\",\"x_axis_showminmax\":true,\"y_axis_format\":\",\",\"y_axis_bounds\":[null,null],\"y_log_scale\":false,\"rolling_type\":\"None\",\"time_compare\":null,\"num_period_compare\":\"\",\"period_ratio_type\":\"growth\",\"resample_how\":null,\"resample_rule\":null,\"resample_fillmethod\":null,\"annotation_layers\":[],\"where\":\"\",\"having\":\"\",\"filters\":[],\"extra_filters\":[{\"col\":\"__time_grain\",\"op\":\"in\",\"val\":\"month\"},{\"col\":\"source\",\"op
\":\"in\",\"val\":[]}]}","preselect_filters":"{\"114\":{\"__time_grain\":\"month\",\"source\":[]}}"},"alerts":[{"msg":"Repetitive non-word characters anomaly detected","id":51002,"match":6}]}

Truncated json (displayed in search results):

{"timestamp":1527213681,"request_headers":{"host":"172.10.101.200:8888","connection":"keep-alive","referer":"http:\/\/172.10.101.200:8888\/superset\/dashboard\/ptsjyy\/?preselect_filters=%7B%22114%22%3A%7B%22__time_grain%22%3A%22month%22%2C%22source%22%3A%5B%5D%7D%7D","accept-encoding":"gzip, deflate, sdch","x-requested-with":"XMLHttpRequest","cookie":"session=.eJyV0N1qxCQABeB3mesQfxONr7KUMNGxhtq6qNtlW_ruFXrdQudu4HxzYD5hj5VaAtfrjSbYzwAOlJFcxKA3iWa13uChtOJSo0WU1iiYwLca915e6G3kjd7sogXxzUfOURgTOQmzbkoRX6TX64FeLmG4XDxmGuYjje2Kz7Sns_VSH-AukHq_OsaEkbNcZsn5LAV3dgwL2NJRsIbXEii_n3RneUA2rvyhOh6Z_iV-7Xma4Nao_nxIwNc3efpdRg.Ddpwlg.tze07woTDVEN_iZ4sIS_jzT4VGI","accept-language":"en-US,en;q=0.8","user-agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/54.0.2840.71 Safari\/537.36","accept":"application\/json, text\/javascript, *\/*; q=0.01"},"id":"34a2743ad59c93907872","method":"GET","uri":"\/superset\/explore_json\/table\/39\/","client":"10.195.28.22","uri_args":{"form_data":"{\"datasource\":\"39__table\",\"viz_type\":\"area\",\"slice_id\":113,\"granularity_sqla\":\"riqi\",\"time_grain_sqla\":\"month\",\"since\":\"\",\"until\":\"\",\"metrics\":[\"sum__VAL_02\"],\"groupby\":[\"source\"],\"limit\":50,\"timeseries_limit_metric\":null,\"order_desc\":true,\"show_brush\":false,\"show_legend\":true,\"line_interpolation\":\"linear\",\"stacked_style\":\"stack\",\"color_scheme\":\"googleCategory20c\",\"rich_tooltip\":true,\"contribution\":false,\"show_controls\":false,\"x_axis_format\":\"%Y-%m-%d\",\"x_axis_showminmax\":true,\"y_axis_format\":\",\",\"y_axis_bounds\":[null,null],\"y_log_scale\":false,\"rolling_type\":\"None\",\"time_compare\":null,\"num_period_compare\":\"\",\"period_ratio_type\":\"growth\",\"resample_how\":null,\"resample_rule\":null,\"resample_fillmethod\":null,\"annotation_layers\":[],\"where\":\"\",\"having\":\"\",\"filters\":[],\"extra_filters\":[{\"col\":\"__time_grain\",\"op\":\"in\",\"val\":\"month\"},{\"col\":\"source\",\"op
\":\"in\",\"val\":[]}]}","preselect_filters":"{\

I used a JSON validator to verify the JSON syntax, and there are no special characters in the JSON.
I tried searching the internal logs, but I can't find any error logs about JSON truncation.

index=_internal LineBreakingProcessor data_sourcetype=mysourcetype

All help will be appreciated.

0 Karma
1 Solution

bestSplunker
Contributor

@FrankVl @niketnilay thank you for your help. I have solved the problem. The cause was that syslog was sending the data over UDP, and UDP can't carry such large messages. rsyslog and syslog-ng both cut at 2k. Once I switched to a TCP input on the UF, all the problems were solved.
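
For reference, a minimal sketch of what the TCP input on the UF might look like in inputs.conf. Port 9009 is the port mentioned in this thread for the UDP input, and the sourcetype comes from the props.conf in the question; reusing both for TCP is an assumption. The syslog daemon on the source server would also need to be reconfigured to send to this destination over TCP.

```
# inputs.conf on the UF (sketch; port and sourcetype are assumed)
# Takes the place of the UDP input, which was presumably a [udp://9009] stanza.
[tcp://9009]
sourcetype = mysourcetype
# Set the host field from the IP address of the sending system
connection_host = ip
```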


0 Karma

FrankVl
Ultra Champion

Ah, good that you found it 🙂

Yeah, syslog in general is not the best method for transporting big chunks of json data. It wasn't really designed for that and combined with UDP I'm not completely surprised it breaks somehow.

Even though you got it working, might be nice to take a look at using the HTTP Event Collector for this. But that will require a small bit of scripting on the source side, to read the JSON from somewhere and send it to HEC using curl or so.
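
On the receiving side, HEC is enabled through inputs.conf on the Splunk instance that accepts the events. A rough sketch, in which the stanza name json_events and the token placeholder are hypothetical and the port is the usual HEC default:

```
# inputs.conf on the instance receiving HEC traffic (sketch)
[http]
disabled = 0
port = 8088

[http://json_events]
# Placeholder: use the token generated for this input
token = <your-hec-token>
sourcetype = mysourcetype
disabled = 0
```

The source-side script would then POST each JSON document to that instance's /services/collector endpoint, passing the token in an Authorization header.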

0 Karma

FrankVl
Ultra Champion

I wouldn't be surprised if this wasn't truncation happening, but rather event breaking in the wrong place and the second half of the event ending up somewhere out of order in the timescale. Did you search for "All Time" to check if you can find the second half of the event?

How are you collecting the syslog data? As a UDP/TCP input? Or as files written by a syslog daemon? If UDP/TCP: did you run a network capture to see if there was perhaps a newline in the incoming data? Same for when using syslog daemon and files: anything suspicious in the file that splunk would read?

I would suggest configuring your sourcetype with a proper LINE_BREAKER setting (e.g. ([\r\n]+)\{"timestamp":), to prevent Splunk from breaking the event if syslog decides to insert a random newline somewhere (syslog may be a bit quirky with such large, non-standard data). Also, if your UF is on 6.5 or later, consider adding a props.conf on the UF with an EVENT_BREAKER setting, so that the UF already recognizes event boundaries correctly. That also improves AutoLB behaviour 🙂
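
As a sketch of the first suggestion, the sourcetype stanza from the question could be extended like this on the indexers. Only the LINE_BREAKER line is new; the other settings are copied from the original post. Whether this regex fits every event of this sourcetype is an assumption based on the single sample shown.

```
# props.conf on the indexers (sketch)
[mysourcetype]
SHOULD_LINEMERGE = false
# Break only where a newline is followed by the start of a new JSON event.
# The text matched by the first capture group (the newlines) is discarded.
LINE_BREAKER = ([\r\n]+)\{"timestamp":
TIME_PREFIX = ^\{"timestamp":
TIME_FORMAT = %s
TRUNCATE = 0
```

With this pattern, a stray newline in the middle of a JSON document no longer starts a new event, because the text after it does not begin with {"timestamp":.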

bestSplunker
Contributor

@FrankVl

Thank you for your reply.
Firstly, I can't find the second half of the event when searching over "All Time". So I don't think Splunk or syslog split the JSON into two events at index time; the raw text has one JSON object per line.
Secondly, the Splunk UF receives the events forwarded by syslog on UDP port 9009.
Thirdly, I can't rule out any possibility, so I will also try the LINE_BREAKER setting you mentioned.

0 Karma

FrankVl
Ultra Champion

Another thing that comes to mind: how have you set your outputs.conf? Did you enable forceTimebasedAutoLB by any chance?

0 Karma

bestSplunker
Contributor

@FrankVl

Hey, this is my outputs.conf on the UF.

[indexer_discovery:master1]
pass4SymmKey = *****
master_uri = https://master_ip:8089

[tcpout:group1]
autoLBFrequency = 30
forceTimebasedAutoLB = true
indexerDiscovery = master1

[tcpout]
defaultGroup = group1

Does forceTimebasedAutoLB also affect line breaking?

0 Karma

FrankVl
Ultra Champion

forceTimebasedAutoLB does exactly what it says: it forces the UF to switch to another indexer destination as per autoLBFrequency, and since the UF doesn't understand event boundaries, the switch can happen somewhere in the middle of an event. Part of the event then goes to one indexer and part to another (and parts may easily get lost on the indexers, because broken fragments may not have proper timestamps etc.).
There are some mechanisms in place to prevent data from getting lost (the UF actually sends overlapping data rather than making a hard split), but especially with big events, that overlap is not big enough to prevent data loss.

So yes, my gut tells me that that is your problem.

What version of Splunk is that UF? Since 6.5, the UF supports the EVENT_BREAKER config, which enables the UF to recognize event boundaries and removes the need to use forceTimebasedAutoLB.

From the props.conf spec:

EVENT_BREAKER_ENABLE = [true|false]
* When set to true, Splunk will split incoming data with a light-weight
  chunked line breaking processor so that data is distributed fairly evenly
  amongst multiple indexers. Use this setting on the UF to indicate that
  data should be split on event boundaries across indexers especially
  for large files.
* Defaults to false

# Use the following to define event boundaries for multi-line events
# For single-line events, the default settings should suffice

EVENT_BREAKER = <regular expression>
* When set, Splunk will use the setting to define an event boundary at the
  end of the first matching group instance.
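
A minimal sketch of how these two settings might be applied on the UF for the sourcetype in this thread. The regex mirrors the LINE_BREAKER suggested earlier and is an assumption about the data, not a tested configuration.

```
# props.conf on the UF, version 6.5 or later (sketch)
[mysourcetype]
EVENT_BREAKER_ENABLE = true
# The event boundary is the end of the first capture group, i.e. the
# newline(s) that precede the next {"timestamp": ...} object.
EVENT_BREAKER = ([\r\n]+)\{"timestamp":
```

With this in place, forceTimebasedAutoLB = true could presumably be dropped from the [tcpout:group1] stanza in outputs.conf, so that the UF only switches indexers at event boundaries.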
0 Karma

niketn
Legend

@bestSplunker, I tried to ingest the data with Splunk's default settings (except for changing the category to Custom).

I do have the following settings that are different from yours. Instead of the following settings for the timestamp,

TIME_FORMAT = %s
TIME_PREFIX = ^\{"timestamp":

I used the following:

TIMESTAMP_FIELDS=timestamp

Also, SHOULD_LINEMERGE was set to true as opposed to false in your case:

SHOULD_LINEMERGE=true

Following is my props.conf:

CHARSET=
INDEXED_EXTRACTIONS=json
KV_MODE=none
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=true
TIMESTAMP_FIELDS=timestamp
category=Custom
description=JavaScript Object Notation format. For more information, visit http://json.org/
disabled=false
pulldown_type=true

PS: I do not have TRUNCATE=0, though I am not sure whether that has any impact. I was also able to apply spath on the _raw data directly in SPL, which means the JSON is structured correctly and spath is able to traverse all the fields.

Through SPL and indexing I was able to pull the last field of JSON data i.e. uri_args.preselect_filters

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"
0 Karma

bestSplunker
Contributor

@niketnilay thank you for your reply. The default settings for Splunk versions ≤ 6.5 do not seem to include the TIMESTAMP_FIELDS parameter. I don't know whether this parameter is supported, but I can try it.

0 Karma

bestSplunker
Contributor

@niketnilay I tried the props.conf configuration you provided. The result was that some JSON events overlapped; the line breaking was not correct. So I think SHOULD_LINEMERGE should be set to false.

0 Karma

bestSplunker
Contributor

@niketnilay by the way, I couldn't use the TIMESTAMP_FIELDS parameter to extract the timestamp. With it, _time is not extracted correctly and is superseded by the index time:

    _time                       timestamp in the JSON
    2018/05/29 10:55:46.000     1527562341

    1527562341 converted to Beijing time is 2018/5/29 10:52:21
0 Karma

MuS
Legend

can you provide a sample event in raw and let us know where Splunk truncates it?

0 Karma

bestSplunker
Contributor

ok. json data is attached now.

0 Karma