Does Splunk have OCR capability?

vsabbis
New Member

I am trying to use Splunk to implement entity extraction or text mining. I have a huge number of PDF, TIFF, and HTML files that I need to upload to Splunk, and I need to parse them and extract useful text information, then retrieve specific parts of that information from those files.

Does Splunk have OCR (Optical Character Recognition) capability? How do I upload these files into Splunk and extract the desired tags or text from them?

0 Karma

jkat54
SplunkTrust

PDF, no
TIFF, no
HTML, yes

The best you could do with PDF and TIFF is store them in binary format in Splunk. That is not OCR by any means, but you can use regular expressions to parse data out of the HTML files, which are most likely in an ANSI or UTF encoding.
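To make the regular-expression idea for HTML concrete, here is a minimal sketch in Python of the kind of parsing involved, done outside Splunk as a preprocessing step. It is only an illustration: the `Invoice No` field, the `INVOICE_RE` pattern, and the sample markup are made up for this example and are not from the original question. Inside Splunk, the comparable step would usually be a search-time extraction such as the `rex` command applied to the indexed HTML text.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


# Hypothetical field pattern: assumes the documents contain "Invoice No: <id>".
INVOICE_RE = re.compile(r"Invoice\s+No\.?\s*:?\s*(?P<invoice_id>[A-Z0-9-]+)")


def extract_invoice_id(html):
    """Return the invoice id found in the HTML text, or None if absent."""
    match = INVOICE_RE.search(html_to_text(html))
    return match.group("invoice_id") if match else None


# Made-up sample document for illustration.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Invoice No: <b>INV-2041</b></p></body></html>"
)
print(extract_invoice_id(sample))
```

Stripping the tags first keeps the pattern simple, because the regular expression no longer has to account for markup such as the `<b>` element sitting between the label and the value.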

0 Karma

DalJeanis
Legend

Apparently just text/char... In the following 2015 thread they said "ASCII", but that's obviously an archaic reference to some old-school, last-century version of UTF-8 that only old fogies like me (and the never-forgetting Internet) have heard of...

https://answers.splunk.com/answers/263997/db-connect-blob-object-search.html

There are some useful suggestions and opinions there, well worth reviewing.

0 Karma

jkat54
SplunkTrust

Hey, can you contact me @daljeanis? We need to connect somehow.

0 Karma

DalJeanis
Legend

Always open to connecting with anyone real on LinkedIn at http://linkedin.com/in/daljeanis.

Invitation sent.

0 Karma