I have seen several questions regarding null (\x00) bytes in data, but none have helped me resolve my issue so far.
I am trying to read a log file from Sophos using Universal Forwarders. I have done the following so far:
Added a new sourcetype in Splunk Web.
props.conf on the indexer:
[my_sourcetype]
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y%m%d %H%M%S
TZ = UTC
pulldown_type = 1
CHARSET = UTF-16LE
Modified inputs.conf on the forwarders:
[monitor://C:\ProgramData\Sophos\Sophos Device Control\logs]
sourcetype=my_sourcetype
Sample data from C:\ProgramData\Sophos\Sophos Device Control\logs\DeviceControl.txt:
20131001 150737 Device control has started on this machine.
20131003 131815 Device control has started on this machine.
When I search sourcetype="my_sourcetype", I see data, but it looks like this:
\x002\x000\x001\x003\x00 \x001\x005\x000\x007\x003\x007....
If I copy that data into Notepad and replace \x00 with nothing, I see the data that I expect.
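This pattern is what UTF-16LE text looks like when its bytes are read one at a time as a single-byte encoding: each ASCII character is stored as two bytes, and the second byte is 0x00. Here is a minimal Python sketch of that effect, using one of the sample lines. The variable names are illustrative, the UTF-16LE encoding is assumed from the CHARSET setting, and this only mimics the symptom rather than showing how Splunk itself processes the file.

```python
# One sample line from DeviceControl.txt, encoded the way the log file
# is assumed to be stored on disk (UTF-16LE, per the CHARSET setting).
line = "20131001 150737 Device control has started on this machine."
raw = line.encode("utf-16-le")

# Reading the bytes with a single-byte charset keeps every 0x00 pad byte,
# which is what shows up as \x00 between the visible characters.
misread = raw.decode("latin-1")

# Decoding with the matching charset restores the original text.
decoded = raw.decode("utf-16-le")

print(repr(misread[:12]))  # first few characters, with the null bytes visible
print(decoded)
```

Removing the nulls from the misread string by hand gives the same result as decoding it properly, which matches the Notepad experiment.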
Before I left tonight, I noticed that the text file I am reading from is blue in Windows Explorer, which indicates the compression bit is set. Every file in this folder is set this way, and removing compression is not an option.
What do I need to do in order to have Splunk index the data without null values? All other data coming from TA-Windows and other apps is fine and does not show null values.
Update 10/17/13:
Wanted to clarify that this is Splunk 4.3.3 on Windows Server 2008 R2 SP1, with Windows 7 SP1 x64 hosts running the Universal Forwarder. Upgrading Splunk is not an option at this time, but we are pushing to do so in the near future.
/etc/system/local/outputs.conf on the forwarder:
[tcpout]
defaultGroup = 1.2.3.4_9997
[tcpout:1.2.3.4_9997]
server = 1.2.3.4:9997
[tcpout-server://1.2.3.4:9997]
/etc/system/local/inputs.conf on the indexer:
[default]
host = my_hostname
[script://$SPLUNK_HOME\bin\scripts\splunk-admon.path]
disabled = 0
[script://$SPLUNK_HOME\bin\scripts\splunk-perfmon.path]
disabled = 0
.... (two more script stanzas)
[monitor://C:\ProgramData\Sophos\Sophos Device Control\logs]
sourcetype=my_sourcetype
Again, all other data coming from the forwarders looks fine without null bytes. Only the data from Sophos is an issue. I am also noticing entries in Splunk with just a single null character as the data (\x00).
Issue resolved for now: I had to set CHARSET = UTF-16LE in props.conf on the forwarders as well as on the indexer. I was mistakenly putting the CHARSET line into inputs.conf on the forwarders.
This also worked in my case, which was exactly the same issue as you described. The key was to put the props.conf with the CHARSET setting on both the UF and the indexer; otherwise it didn't work.
So, do we need to install UTF-16LE on the indexer server to decode it? My server only has UTF-8.
I followed the instructions here (http://answers.splunk.com/answers/83790/how-do-i-remove-x00-characters-from-my-log-message) to remove nulls before indexing (edited props.conf) and the data looks normal now, but for at least one of the hosts so far the timestamp is incorrect. The entry in Splunk that now looks correct has the same timestamp as the previous entries that had \x00 bytes. For another host it is correctly parsing the timestamp from the data.
I'd rather it be correctly processed up front instead of replacing nulls, but if I can get the timestamp correct I can live with it.
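One reason to prefer decoding over stripping: removing 0x00 bytes only recovers the text when every character is plain ASCII. Here is a small Python sketch of the difference. The helper name strip_nulls and the non-ASCII sample string are made up for illustration and do not come from the Sophos log.

```python
def strip_nulls(raw: bytes) -> str:
    """Mimic the strip-the-null-bytes workaround on raw bytes."""
    return raw.replace(b"\x00", b"").decode("latin-1")

# ASCII-only text: stripping nulls happens to give the right answer.
ascii_raw = "20131001 150737".encode("utf-16-le")
print(strip_nulls(ascii_raw))

# Hypothetical text with a non-ASCII character: the euro sign is stored as
# two non-zero bytes in UTF-16LE, so stripping nulls leaves garbage behind.
mixed_text = "5 \u20ac"
mixed_raw = mixed_text.encode("utf-16-le")
print(repr(strip_nulls(mixed_raw)))

# Decoding with the matching charset handles both cases.
print(mixed_raw.decode("utf-16-le"))
```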
Just noticed this line in splunkd.log on the indexer:
WARN UTF8Processor - Using charset UTF-8 for events from 'UTF-16LE', as the monitor is believed over the raw text which may be source:C:\ProgramData\Sophos\Sophos Device Control\logs\DeviceControl.txt|host::my_host|my_sourcetype|remoteport::56789