I'm looking to use Splunk to index malware analysis data. Outputs from tools like install control 5, capture bat, filemon and regmon are already imported easily.
What I'm looking for is the ability to import/index malicious PDF files, or any PDF file for that matter.
Are you trying to use Splunk to search within your PDF or simply store it?
Either way, Splunk doesn't provide a default way to handle this. You could:
While PDFs are partially text-readable, they often contain binary content too, like compressed streams or images. If you had a utility that could uncompress the compressed portions of your PDFs, you could possibly end up with something half-useful within Splunk, but I'm guessing that Splunk may not be the right tool to use here.
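To give a rough idea of what such an "uncompress" utility might look like, here is a minimal Python sketch. It assumes the compressed portions are plain Flate (zlib) streams sitting between the `stream` and `endstream` keywords, and it ignores other filters, encryption and object streams entirely. The function name `inflate_pdf_streams` is made up for this example; a real PDF parser would be more reliable.

```python
import re
import zlib
from typing import List

# Heuristic: grab the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_pdf_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the contents of every stream that zlib is able to inflate."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that usually
            # follow the compressed data, unlike zlib.decompress.
            inflated.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate-compressed (or uses a filter this sketch ignores).
            continue
    return inflated
```

The inflated output could then be written to a text file that Splunk monitors, instead of pointing Splunk at the raw PDF.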
If you just flat-out point Splunk to some PDFs, you'd have to tell it to ignore its normal "binary" content check, and it would just index them, but you'd see lots of \x00 kind of things throughout the file where the unprintable characters are.
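To picture what that looks like, here is a small Python illustration of unprintable bytes shown as \xNN escapes. This is not Splunk's own code, only an approximation of the kind of output to expect, and `escape_unprintable` is a name invented for the example.

```python
def escape_unprintable(data: bytes) -> str:
    """Render printable ASCII as-is and everything else as a \\xNN escape."""
    parts = []
    for byte in data:
        if 32 <= byte < 127:
            parts.append(chr(byte))
        else:
            parts.append(f"\\x{byte:02X}")
    return "".join(parts)


# A PDF header followed by a few raw binary bytes.
print(escape_unprintable(b"%PDF-1.4\x00\x01\xff stream"))
```

With a compressed stream, nearly every byte falls into the escaped category, which is why the indexed events end up hard to read or search.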
If you provide more info about what you want to be able to do with your PDFs and Splunk, there may be a better answer.
Based on your follow-up comments... What about this as an approach:
http://pdf-web-server.domain/downloadpdf?id=my_unique_id
(This could be as simple as a folder shared by Apache, or a simple PHP page that accepts some GET-style arguments.) The unique id could be a sequence number, or just a timestamp and a cleaned-up version of the original file name. It doesn't matter so much as long as it's unique (and doesn't contain illegal web characters). Make sure that the unique id is in your output log, so that it will be indexed by Splunk. Splunk can be set up to extract this value as a field. There are lots of variations you can do with this approach, but hopefully this gives you an idea of how you can get started.
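As a rough sketch of the unique id idea, something like the following Python could build an id from a timestamp plus a cleaned-up file name and put it into the log line. The function names and the `pdf_id` key are assumptions made up for this example, not anything Splunk requires; the key=value layout is chosen only because it is easy to extract as a field.

```python
import re
import time


def make_unique_id(original_name: str, timestamp: float) -> str:
    """Combine a timestamp with a web-safe version of the file name."""
    # Replace every run of characters that is not a letter, digit,
    # underscore or hyphen with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", original_name).strip("_")
    return f"{int(timestamp)}_{cleaned}"


def make_log_line(unique_id: str, details: str) -> str:
    """Build one log event that carries the id alongside the analysis output."""
    return f"pdf_id={unique_id} {details}"


uid = make_unique_id("Invoice (final) #2.pdf", time.time())
print(make_log_line(uid, "javascript=1 pages=3"))
```

The same id would then be used as the `id` argument on the web server side, so a search result in Splunk can be turned into a download link for the original PDF.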
Or, ...
If you want to try simply indexing PDF files straight-up, then simply add something like this to your props.conf file:
[source::....pdf]
NO_BINARY_CHECK = true
Which should force your PDFs to be indexed even though they are binary. I suspect you will not like the results, but you can give it a try.
For the pcap thing, check out this question: http://answers.splunk.com/questions/2922/splunk-monitoring-a-wireshark-file
I basically just want Splunk to index what's inside the PDF file, regardless of what it is. There is metadata and certain other characteristics of the PDF that would be useful if captured and could be searched for. The PDF would also already be deflated using a different program to reveal any JavaScript or other data in the file, so if there is a way to bypass the binary check and just make it index the inside of the file regardless, that would be great. Then my next problem is pcap files.