How to monitor docx and PDF files in Splunk?



I am planning a new application for our organization, and for that I need to be able to monitor (that is, index and read) the content of doc / docx / PDF files.

When I import such a file into Splunk, the preview looks like hex / binary, so I think we should define a new sourcetype for those files and, in particular, change the charset to something that fits them.

I have searched a lot, but it seems that no one has dealt with this before.

Can you help me with this?





Hi omerr,

Are you trying to use Splunk to search within your docx / PDF files, or simply to store them?

Either way, Splunk doesn't provide a default way to handle this.
You could use a script in combination with some kind of docx / PDF-to-text utility to load the textual content of your docx / PDF files into Splunk.
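
As a rough illustration of the script approach, here is a minimal sketch for the docx case only, using nothing but the Python standard library. It relies on the fact that a .docx file is a zip archive whose body text lives in word/document.xml. The function name docx_to_text is made up for this example, and PDF or legacy .doc files would need a separate converter that is not shown here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per non-empty paragraph.

    Hypothetical helper for illustration; it ignores headers, footers,
    footnotes and comments, which live in other parts of the archive.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text (w:t) of each run.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text:
            lines.append(text)
    return "\n".join(lines)
```

You could write the returned text to a plain .txt file in a directory that Splunk already monitors, or print it from a scripted input, so that Splunk only ever sees text rather than the binary container.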

Or, ...

If you want to try indexing the files as they are, add something like this to your props.conf file:
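
A minimal sketch of such a stanza, assuming the NO_BINARY_CHECK setting in props.conf and a source pattern that matches any path ending in .pdf (the pattern is an assumption, so adjust it to your own paths):

```
# props.conf (sketch; the source pattern below is an assumption)
[source::....pdf]
# Tell Splunk to process the file even if it looks binary
NO_BINARY_CHECK = true
```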


This should force your PDFs to be indexed even though they are binary. I suspect you will not like the results, but you can give it a try.

cheers, MuS
