How to monitor docx and PDF files in Splunk?

Hi,

I am planning a new application for our organization, and for that I need the ability to monitor (i.e. index and read) the content of doc / docx / PDF files.

When I import such a file into Splunk, the preview shows it as hex / binary, so I think we should define a new sourcetype for those files and, in particular, change the charset to something that fits them.

I have searched a lot, but it seems that no one has dealt with this before.

Can you help me with this?

Thanks,

Omer.

SplunkTrust

Hi omerr,

Are you trying to use Splunk to search within your docx / PDF files, or simply to store them?

Either way, Splunk doesn't provide a built-in way to handle this.
You could use a script in combination with some kind of docx / PDF-to-text utility to load the textual content of your docx / PDF files into Splunk.
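
To make the script idea more concrete, here is a minimal sketch for the docx case only, using nothing but the Python standard library. It relies on the fact that a .docx file is a zip archive with the body text in word/document.xml. The function name docx_to_text and the idea of printing to stdout (so the output could be picked up by a scripted input or written to a monitored text file) are my own assumptions, not something Splunk ships. PDFs and old binary .doc files would need a separate third-party converter, which is not shown here.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration: it ignores headers, footers,
    footnotes and formatting, and only reads the main document body.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of every <w:t> node.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the text of each file given on the command line to stdout.
    for docx_path in sys.argv[1:]:
        print(docx_to_text(docx_path))
```

You could run it as `python docx_to_text.py report.docx > report.txt` (file names are just examples) and let Splunk monitor the resulting .txt files like any other text source.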

Or, ...

If you want to try indexing the files as they are, add something like this to your props.conf file:

[source::....pdf]
NO_BINARY_CHECK = true

This should force your PDFs to be indexed even though they are binary. I suspect you will not like the results, but you can give it a try.

cheers, MuS