Hello,
We are using a CSV lookup file in some of our searches. This file was the source of an error, throwing
Error in 'lookup' command: Error initializing index: 'internal_reindex - exiting indexing, Could not add token 'xxxxxxxxxxxxxxxxxxxxxxxxx' to index!
After investigating, with some trial and error, we realized that this error occurs only when a string in the file contains a NULL character AND the file is bigger than 25MB, the default "max_memtable_bytes".
Our file changes daily and its size is usually around 30MB, meaning one quick and easy fix could be to increase this limit. Is there a "cleaner" solution allowing bigger CSV files to be indexed at search-time even when they contain NULL characters?
Thanks
P.S. : We are currently using Splunk 9.4. Migration towards 10.x is being planned.
Hi @Enzo54,
You will find the same problem in 10.x.
Don't be afraid to increase the size of max_memtable_bytes in this or other cases. The setting is there to help you get the best performance out of frequently used large lookup files.
While null is a valid UTF-8 character, it poses a challenge for internal functions expecting null-terminated strings, whether they're in splunkd, Python's CSV reader, or elsewhere.
I work occasionally with binary data in Splunk, so I wouldn't assume your null bytes are corruption as @kml_uvce has done, but their advice is otherwise sound: preprocess your lookup files and use a surrogate character to represent null. Splunk uses \x00 itself when sanitizing _raw, but you may prefer the URL-encoded %00, the UTF-8 encoding for ␀ (the byte sequence E29080), or some other value depending on your source representation. Whatever you choose, deserializing the data is simple with the rex command using mode=sed or the eval command using the replace() and urldecode() functions.
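As a rough illustration of the preprocessing step, here is a minimal Python sketch that rewrites a lookup CSV in place, replacing each null byte with the %00 surrogate. The function name `sanitize_lookup` and the way it would be invoked are hypothetical; it assumes the file is ASCII or UTF-8 (where a 0x00 byte can only be a NUL character) and small enough to read into memory.

```python
import os
import shutil
import tempfile

NUL = b"\x00"
SURROGATE = b"%00"  # assumption: URL-encoded surrogate; any agreed value works


def sanitize_lookup(src_path, dst_path=None, surrogate=SURROGATE):
    """Replace NUL bytes in an ASCII/UTF-8 CSV with a surrogate.

    Works on raw bytes, so no CSV parser ever sees the NULs.
    Returns the number of NUL bytes that were replaced.
    """
    dst_path = dst_path or src_path
    with open(src_path, "rb") as f:
        data = f.read()  # a ~30MB file fits comfortably in memory

    count = data.count(NUL)
    if count == 0 and dst_path == src_path:
        return 0  # nothing to do, leave the file untouched

    cleaned = data.replace(NUL, surrogate)

    # Write to a temp file in the target directory, then rename over the
    # destination so a concurrent reader never sees a half-written file.
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(cleaned)
        shutil.copymode(src_path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, dst_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return count
```

Whatever surrogate is chosen, the value looked up at search time has to use the same representation, otherwise rows that contained a null will no longer match.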
Our search looks like this :
| tstats summariesonly=true values(Web.user) values(Web.src) c from datamodel=Web by Web.url Web.user
| rename values(*) as *
| rename Web.* as *
| lookup csv_url url OUTPUT hashkey as hk
| where isnotnull(hk)
and it is the "lookup" part that fails. With your suggestion of substituting a surrogate for the character, would that be done inside this saved search or somewhere else, and how?
The NULL characters are located in the "url" field, so replacing them with "%00" would be fine.
Treat the NULL byte as what it almost certainly is: corruption in the source data. A \x00 in a text CSV serves no purpose, and stripping it lets the file stay on the efficient on-disk indexed path with no memory bloat. Where you strip it depends on how the file is produced:
If an external process writes the file daily, add a sanitization pass before it lands in the lookups/ dir:
tr -d '\000' < raw.csv > clean.csv
If the lookup is generated by a Splunk search piped to outputlookup, clean it in SPL right before writing, so it's all one pipeline:
... | eval myfield=replace(myfield, "\x00", "") | outputlookup mylookup.csv
(Use a foreach over fields if you don't know which column carries it.)
It is also worth finding out which field or source injects the NULL, because the usual culprit is an upstream encoding problem: for example, UTF-16 data read as UTF-8 leaves a NUL between every character, or a binary value leaks into a text column. Fixing that at the source is better than scrubbing downstream forever.
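One way to check which column carries the NULs is sketched below in Python. The function name `nul_counts_by_column` is hypothetical, and the sketch assumes a UTF-8 file with a header row in which the placeholder character ␀ (U+2400) does not otherwise occur. The NULs are swapped for that placeholder before parsing because some versions of Python's csv module reject input containing NUL.

```python
import csv
import io
from collections import Counter

MARKER = "\u2400"  # "SYMBOL FOR NULL", assumed absent from the real data


def nul_counts_by_column(path, encoding="utf-8"):
    """Return {column name: number of rows whose value contains a NUL}."""
    with open(path, "r", encoding=encoding, newline="") as f:
        # Swap NULs for a visible marker so the csv parser never sees them.
        text = f.read().replace("\x00", MARKER)

    counts = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            # value is None for short rows; skip those
            if value and MARKER in value:
                counts[column] += 1
    return dict(counts)
```

A result that names a single column, such as the url field, suggests values arriving with embedded nulls; NULs spread across every column would point more toward an encoding mismatch for the whole file.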
If this lookup is large, changes every day, and is read constantly, the more architectural option is to move it to a KV Store collection instead of a CSV.
Another addon is actually generating the CSV file from API calls. I checked the original source, which does contain \x00. After checking with the authors of said API, those null bytes are expected and can be present at the end or in the middle of the string.
To paraphrase (and translate) their answer: their data source is supposed to be cleaned already, so they pass the data on to the API and then to the CSV builder inside their Splunk add-on.
The addon then writes the CSV file from its python code. Editing the python code could be possible, but it would also mean that whenever the addon gets updated, the modification would get lost and the issue would come back.
Regarding the KV Store option, the file is ~30MB, updated twice a day. It is used every 5min in a scheduled search. However, the addon does not give us the possibility of outputting its data to a KV Store.
The inputlookup command isn't bothered by null bytes, so you can also use @kml_uvce's earlier advice to sanitize the lookup after the add-on generates it:
| inputlookup foo.csv
| foreach * [| eval "<<FIELD>>"=replace('<<FIELD>>', "\\0", "\\x00") ]
| outputlookup foo.csv
I see. So that would mean creating a new scheduled search sanitizing the CSV file after it has been created.
That would also mean making sure this runs between the daily update of the file and the next scheduled use of said file.
I will have a look at this and let you know what ended up working best.
EDIT : I created a new scheduled search with the suggested inputlookup/eval/outputlookup chain. The modification time of the CSV file is not fixed (API response time can vary), so the search is scheduled to run every 2min during the 10min window when the CSV update usually happens.
OK, you can also try increasing max_memtable_bytes.