I'm looking to use Splunk to index malware analysis data. Outputs from tools like install control 5, capture bat, filemon and regmon are already imported easily.
What I'm looking for is the ability to import/index malicious PDF files, or any PDF file for that matter.
Are you trying to use Splunk to search within your PDF or simply store it?
Either way, Splunk doesn't provide a default way to handle this. You could:
While PDFs are partially text-readable, they often contain binary content too, like compressed streams or images. I suppose if you had a utility that could uncompress the compressed portions of your PDFs, you could end up with something half-useful within Splunk, but my guess is that Splunk may not be the right tool to use here.
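As a rough illustration of the kind of "uncompress" utility mentioned above, here is a minimal Python sketch using only the standard library. It assumes the compressed portions are plain Flate (zlib) streams sitting between the `stream` and `endstream` keywords; the function name `inflate_pdf_streams` is made up for this example, and real (especially malicious) PDFs may use other filters, filter chains, or encryption that this will simply skip.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_pdf_streams(pdf_bytes):
    """Return the inflated body of every stream that zlib can decompress.

    Hypothetical helper: it only handles plain Flate (zlib) data and
    silently skips anything else.
    """
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data (image, other filter, damaged stream).
            continue
        if data:
            results.append(data)
    return results
```

The inflated bytes could then be written to a text file that Splunk monitors, rather than pointing Splunk at the PDF itself.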
If you just flat-out point Splunk at some PDFs, you'd have to tell it to skip its normal "binary" content check. It would then just index them, but you'd see lots of \x00 kind of things throughout the file wherever the unprintable characters are.
If you provide more info about what you want to be able to do with your PDFs and Splunk, there may be a better answer.
Based on your followup comments.... What about this as an approach:
http://pdf-web-server.domain/downloadpdf&id=my_unique_id (This could be as simple as a folder shared by Apache, or a simple PHP page that accepts some GET-style arguments.) The unique id could be a sequence number, or just a timestamp plus a cleaned-up version of the original file name. It doesn't matter much as long as it's unique and doesn't contain characters that are illegal in a URL. Make sure that the unique id is in your output log, so that it will be indexed by Splunk. Splunk can be set up to extract this value as a field.
There are lots of variations you can do with this approach, but hopefully this gives you an idea of how you can get started.
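To make the unique-id idea concrete, here is a small Python sketch. It builds an id from a timestamp plus a cleaned-up file name and formats a log line that carries the id. The function names and the `pdf_id` / `pdf_url` key names are assumptions for illustration only; the base URL is the placeholder from the example above.

```python
import re
import time


def make_pdf_id(original_name, now=None):
    """Build a URL-safe id from a UTC timestamp and a cleaned-up file name.

    Hypothetical helper. Second resolution is not unique if the same
    file name arrives twice within one second; add a sequence number
    if that can happen.
    """
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    # Collapse every run of characters outside A-Z, a-z, 0-9, _ and -.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", original_name).strip("_")
    return f"{stamp}_{cleaned}"


def make_log_line(pdf_id,
                  base_url="http://pdf-web-server.domain/downloadpdf&id="):
    """Format a key=value log line that includes the id and its link."""
    return f'pdf_id={pdf_id} pdf_url="{base_url}{pdf_id}"'


if __name__ == "__main__":
    print(make_log_line(make_pdf_id("Invoice #42 (final).pdf")))
```

Writing the id as a `key=value` pair is a deliberate choice here: Splunk's default search-time extraction usually picks such pairs up as fields without extra configuration, though a custom extraction would work just as well.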
If you want to try indexing PDF files straight-up, add something like this to your props.conf file:
[source::....pdf]
NO_BINARY_CHECK = true
This should force your PDFs to be indexed even though they are binary. I suspect you will not like the results, but you can give it a try.