New(ish) to Splunk, so RTFM (with a link to the FM) is fine.
The customer has Splunk and wants to link it, via DB Connect, to a document imaging system. The system stores its images in the database as BLOB objects: some as PDFs, some as scanned JPGs, etc.
The goal is to Splunk this data. Obviously the data stored in the database about the files is "good enough" for many lookups and reports, but sometimes they will need to get data from the files themselves.
I can think of several ways to do this: pull the file, feed it through some third-party OCR / PDF-to-text processor, then write the extracted values out as files in a directory path that Splunk would then ingest. Not very elegant, and it would require some API coding in the applications to do the OCR / conversion and then trigger Splunk to start indexing the data.
I have to believe someone else has cracked this cookie... Any ideas?
To use non-text (binary) data such as BLOBs in Splunk, it should first be converted into text. This can be done in SQL with CAST or CONVERT, but the result may not be useful if the data needs to be compared later in the process, unless exactly the same conversion algorithm and transformations are applied to the data every time.
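To illustrate the comparison caveat, here is a minimal Python sketch. It is not Splunk or DB Connect code, and the sample bytes are made up. The same blob rendered as hex and as Base64 gives two different strings, so values will only match if both sides use the same conversion.

```python
import base64
import binascii

# Hypothetical sample: the first four bytes of a JPEG blob.
blob = b"\xff\xd8\xff\xe0"

# Two common ways to turn binary data into plain text.
as_hex = binascii.hexlify(blob).decode("ascii")
as_b64 = base64.b64encode(blob).decode("ascii")

print(as_hex)  # ffd8ffe0
print(as_b64)  # /9j/4A==

# Same bytes, different text: a string comparison across encodings fails.
print(as_hex == as_b64)  # False
```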
While there are some nifty hacks, I agree with the assessment that using an external tool is probably a better choice. It doesn't have to be a commercial one.
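For what it's worth, the external-tool route could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the function names, the `doc_id` / `doc_type` fields, and the drop directory are made up, and the extractors are stand-ins for whatever PDF-to-text or OCR tool you end up using. It sniffs the blob type from its leading bytes, runs the matching extractor, and writes a text file for Splunk to pick up.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Leading "magic" bytes used to tell the stored formats apart.
MAGIC = {b"%PDF": "pdf", b"\xff\xd8\xff": "jpeg"}


def sniff_type(blob: bytes) -> str:
    """Return 'pdf', 'jpeg', or 'unknown' based on the blob's leading bytes."""
    for magic, kind in MAGIC.items():
        if blob.startswith(magic):
            return kind
    return "unknown"


def export_blob(doc_id: str, blob: bytes,
                extractors: Dict[str, Callable[[bytes], str]],
                drop_dir: Path) -> Path:
    """Convert one blob to text and write it into the drop directory."""
    kind = sniff_type(blob)
    extract = extractors.get(kind)
    # No extractor for this type: still emit the header so the doc is visible.
    text = extract(blob) if extract else ""
    out = drop_dir / f"{doc_id}.txt"
    # A key=value header line lets a search tie the text back to the DB row.
    out.write_text(f"doc_id={doc_id} doc_type={kind}\n{text}\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    # Stand-in extractors; real ones would call a PDF-to-text or OCR tool.
    fake = {"pdf": lambda b: "text from pdf", "jpeg": lambda b: "text from ocr"}
    with tempfile.TemporaryDirectory() as tmp:
        path = export_blob("42", b"%PDF-1.4 ...", fake, Path(tmp))
        print(path.read_text(encoding="utf-8"))
```

Point a Splunk file monitor input at the drop directory and the text gets indexed as the files land, so nothing has to call back into Splunk to start indexing.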
We've done some neat stuff with exiftool and the Bro TA, for instance.
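As a rough idea of the exiftool side (a sketch only, not the setup described above), the Python below pipes a blob to `exiftool -j -`, which reads the file from stdin and prints JSON, then flattens the result into key="value" lines that Splunk can extract fields from. The function names are made up, and it assumes exiftool is installed and on the PATH.

```python
import json
import subprocess
from typing import List


def run_exiftool(blob: bytes) -> str:
    """Pipe a blob to exiftool and return its JSON output (needs exiftool on PATH)."""
    result = subprocess.run(["exiftool", "-j", "-"], input=blob,
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8")


def to_events(exif_json: str) -> List[str]:
    """Flatten exiftool's JSON array into one key="value" line per file."""
    events = []
    for record in json.loads(exif_json):
        # Sort the tags so the field order is stable from run to run.
        pairs = [f'{key}="{value}"' for key, value in sorted(record.items())]
        events.append(" ".join(pairs))
    return events
```

Typical use would be `to_events(run_exiftool(blob))` for each blob pulled from the database, with the resulting lines written to a file or scripted input that Splunk reads.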
I would search through the apps at apps.splunk.com, but are you sure Splunk is the right tool for this situation? Whenever people are working with documents, I usually suggest MarkLogic, which has tools to help you generate the metadata that you are describing. It is an incredible product that does things in a totally different way than Splunk and is better suited to non-plain-text data sources: http://www.MarkLogic.com/
P.S. These are the main guys who swooped in and made HealthCare.gov actually work; without them, it probably never would have.