DB Connect: Blob object Search

JeremeyWise
Explorer

PreSales Question.

New(ish) to Splunk, so RTFM (with a link to the FM) is fine.

The customer has Splunk and wants to use DB Connect to link it to a document image system. The system stores its images in the database as BLOB objects: some as PDFs, some as scanned JPGs, etc.

The goal is to Splunk this data. Obviously, some of the data stored in the database about the files is "good enough" for many lookups and reports, but sometimes they will need to get data from the files themselves.

I can think of several ways to do this: pull the file, feed it through some third-party OCR / PDF-to-text processor, then write the returned values as files to a directory path that Splunk would then ingest. Not very elegant, and it would require some API coding in the applications to do the OCR / conversion and then trigger Splunk to start indexing the data.
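To make the first step of that idea concrete, here is a rough sketch of pulling the blobs out of the database into a staging directory. It is only an illustration under stated assumptions: `sqlite3` stands in for the real database, the table `documents` and its columns `id` and `content` are hypothetical names, and the OCR / PDF-to-text step is not shown.

```python
import sqlite3
from pathlib import Path
from typing import List

# Leading "magic" bytes used to guess each blob's file type.
# Only the two formats mentioned above are covered.
MAGIC = {
    b"%PDF": ".pdf",
    b"\xff\xd8\xff": ".jpg",
}


def guess_extension(blob: bytes) -> str:
    """Return a file extension based on the blob's first bytes."""
    for prefix, ext in MAGIC.items():
        if blob.startswith(prefix):
            return ext
    return ".bin"  # unknown type; a converter would have to sniff it itself


def export_blobs(conn: sqlite3.Connection, out_dir: Path) -> List[Path]:
    """Write every blob in the (hypothetical) documents table to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    rows = conn.execute("SELECT id, content FROM documents ORDER BY id")
    for doc_id, blob in rows:
        path = out_dir / f"{doc_id}{guess_extension(blob)}"
        path.write_bytes(blob)
        written.append(path)
    return written
```

The exported files would then go through whatever converter is chosen, and its text output would land in a directory that Splunk monitors.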

I have to believe someone else has cracked this cookie... Any ideas?

Thanks

weeb
Splunk Employee

In order to use non-ASCII data in Splunk, it should first be converted into ASCII data. This can be done in SQL with CAST or CONVERT, but it may not be useful if the data needs to be compared later in the process, unless the exact same conversion algorithm and transformations are used on the data.

While there are some nifty hacks, I agree with the assessment that using an external tool is probably a better choice. It doesn't have to be a commercial one.

We've done some neat stuff with exiftool and the Bro TA, for instance.

http://www.sno.phy.queensu.ca/~phil/exiftool/

Splunk Add-on for Bro IDS
https://splunkbase.splunk.com/app/1617/
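
As one illustration of the external-tool route, here is a minimal sketch that asks exiftool for a file's metadata as JSON and flattens it into a key="value" line, which Splunk can extract fields from at search time. It assumes exiftool is installed and on the PATH; the function names are hypothetical, and this is not the code from the Bro TA work mentioned above.

```python
import json
import subprocess


def parse_exiftool_json(output: str) -> dict:
    """exiftool -json prints a JSON array with one object per file."""
    records = json.loads(output)
    return records[0] if records else {}


def read_metadata(path: str) -> dict:
    """Run exiftool on a single file and return its metadata as a dict."""
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True,
        text=True,
        check=True,  # raise if exiftool exits with an error
    )
    return parse_exiftool_json(result.stdout)


def to_event(metadata: dict) -> str:
    """Render metadata as a single key="value" line for indexing."""
    parts = []
    for key, value in metadata.items():
        # Swap embedded double quotes so the value stays one quoted token.
        text = str(value).replace('"', "'")
        parts.append(f'{key}="{text}"')
    return " ".join(parts)
```

Calling `to_event(read_metadata(path))` for each exported file and appending the lines to a monitored log file would be one way to get the metadata indexed.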

woodcock
Esteemed Legend

I would search through the apps at apps.splunk.com, but are you sure Splunk is the right tool for this situation? Whenever people are working with documents, I usually suggest MarkLogic, which has tools to help you generate the metadata you are describing. It is an incredible product that works in a totally different way from Splunk and is better suited to non-plain-text data sources:
http://www.MarkLogic.com/

P.S. These are the main guys who swooped in and made HealthCare.gov actually work; without them, it probably never would have.
