Solved: Index PDF files

dugginsg · ‎06-30-2010

Looking to use splunk to index malware analysis data. Outputs from tools like install control 5, capture bat, filemon and regmon are already imported easily.

What I'm looking for is the ability to import/index malicious PDF files, or any PDF file for that matter.

Lowell · ‎06-30-2010

Are you trying to use splunk to search within your PDF or simply store it?

Either way, splunk doesn't provide a default way to handle this. You could:

Use a script in combination with some kind of PDF-to-text utility to load your PDF's textual content into splunk.
To simply store and retrieve PDFs, you could use an input script to load your PDFs into splunk by first encoding them as base64 (similar to how a PDF attachment would appear if you were viewing the raw text of an email). You could then use splunk to store the content, and then write a small export process that would decode your base64 content and give you a download link of some kind. (I think this would be possible, but there could be some technical gotchas, and this isn't really what splunk was intended to do.) Keep in mind that with this approach, you really can't search your content in any significant way. (Searching over base64 content isn't going to be pretty.)
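
As a rough illustration of the base64 idea, here is a minimal Python sketch. The function names, the `pdf_name=... encoding=base64` header line and the 76-character line width are all made up for this example; nothing here is a splunk API. It just shows that the PDF bytes survive a round trip through a plain-text event.

```python
import base64

def pdf_to_base64_event(pdf_bytes, name, width=76):
    """Return a text event: one header line, then base64 wrapped at `width` chars."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    # The header format is hypothetical; pick whatever your sourcetype expects.
    return "pdf_name={} encoding=base64\n{}".format(name, "\n".join(lines))

def base64_event_to_pdf(event):
    """Reverse the encoding: drop the header line, rejoin the body, decode."""
    _header, _, body = event.partition("\n")
    return base64.b64decode("".join(body.split()))
```

The export process mentioned above would essentially be `base64_event_to_pdf` applied to the raw text of the event you pulled back out of splunk.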

While PDFs are partially text-readable, they often contain binary content too, like compressed content or images. I suppose if you had a utility that could uncompress the compressed portion of your PDFs it's possible you could end up with something half-useful within splunk, but I'm kind of guessing that splunk may not be the right tool to use here.

If you just flat-out point splunk to some PDFs, you'd have to tell it to ignore its normal "binary" content check, and it would just index them, but you'd see lots of \x00 kind of things throughout the file where the unprintable characters are.
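
To give an idea of what "uncompressing the compressed portion" could look like, here is a crude Python sketch that scans for `stream ... endstream` sections and tries to inflate each one with zlib. This is an assumption-heavy shortcut: it only handles plain Flate (zlib) streams, it ignores the `/Length` and `/Filter` entries a real PDF parser would use, and `inflate_pdf_streams` is a name invented for this example. A dedicated PDF analysis tool will do a much better job.

```python
import re
import zlib

# Naive pattern: everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_pdf_streams(pdf_bytes):
    """Return the decompressed bytes of every stream that zlib can fully inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(match.group(1))
        except zlib.error:
            # Not zlib data (uncompressed, or a different filter such as an image codec).
            continue
        if inflater.eof:
            # Only keep streams that decompressed to a clean end; the trailing
            # newline before "endstream" is ignored by the decompressor.
            results.append(data)
    return results
```

Anything this returns (embedded JavaScript, for example) is text you could write to a log file and index normally.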

If you provide more info about what you want to be able to do with your PDFs and splunk, there may be a better answer.

Based on your followup comments.... What about this as an approach:

Write a script that handles the new PDFs that you want to index.
Have the script execute all your PDF investigation utilities against your new PDF file. This should include capturing all your metadata extraction, full text extraction (if you want), and whatever other information you need about the PDF. The script can take all this data, do some minor formatting, and then dump all this information to a log file that splunk will be set up to index. (If you have a high-volume situation, or need this to run concurrently, then you can set up a TCP input and push the log content to splunk over a TCP socket... I recommend the simple log file approach if you don't need the extra complexity.)
If you want to be able to retrieve the actual PDF later, then your script should assign a unique id to each of your PDFs and store them somewhere that is accessible via a web page with that unique id. For example, say your PDF is accessible with something like: http://pdf-web-server.domain/downloadpdf?id=my_unique_id (This could be as simple as a folder shared by apache, or a simple PHP page that accepts some GET style arguments.) The unique id could be a sequence number, or just a timestamp and a cleaned-up version of the original file name; it doesn't matter so much as long as it's unique (and doesn't contain illegal web characters). Make sure that the unique id is in your output log, so that it will be indexed by splunk. Splunk can be set up to extract this value as a field.
Set up a sourcetype in splunk for this log file. (BTW, I can give you some event-breaking pointers if you decide to use a setup like this.) So from within splunk, you can now search on your metadata or full text (assuming you include that).
If you want to be able to link back to your original pdf, then you can use splunk's workflow actions to do this pretty simply. (This is the drop-down menu on the left of your event from within the splunk interface.) All you have to do is tell splunk how to build a URL based on your unique-pdf-id field and you can link back to your PDF file to open or download the file locally.
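
Here is a rough Python sketch of the id and log-line part of that script. Everything in it is illustrative: the `pdf_id` field name, the key="value" layout, the helper names and the `?id=` query parameter are assumptions for the example, and the host is just the placeholder from above. Running your actual analysis tools and collecting their output is left out.

```python
import re
from datetime import datetime

# Placeholder download location; swap in wherever you actually store the PDFs.
BASE_URL = "http://pdf-web-server.domain/downloadpdf"

def make_pdf_id(filename, when):
    """Timestamp plus a cleaned-up file name, limited to URL-safe characters."""
    stem = filename.rsplit(".", 1)[0]
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return f"{when:%Y%m%dT%H%M%S}_{cleaned}"

def format_event(pdf_id, metadata, when):
    """One key="value" log line; values are not escaped, so avoid embedded quotes."""
    fields = " ".join(f'{key}="{value}"' for key, value in sorted(metadata.items()))
    return f'{when:%Y-%m-%d %H:%M:%S} pdf_id="{pdf_id}" {fields}'

def download_url(pdf_id):
    """The URL a workflow action would build from the extracted pdf_id field."""
    return f"{BASE_URL}?id={pdf_id}"
```

The script would append the result of `format_event` to the monitored log file and save the PDF under the same id, so a link built the way `download_url` does will resolve to the original file. Note that a timestamp-plus-name id is only unique if you never process two files with the same name in the same second; a sequence number avoids that.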

There are lots of variations you can do with this approach, but hopefully this gives you an idea of how you can get started.

Or, ...

If you want to try simply indexing PDF files straight-up, then simply add something like this to your props.conf file:

 [source::....pdf]
 NO_BINARY_CHECK = true

This should force your PDFs to be indexed even though they are binary. I suspect you will not like the results, but you can give it a try.

Lowell · ‎07-01-2010

For the pcap thing, check out this question: http://answers.splunk.com/questions/2922/splunk-monitoring-a-wireshark-file

dugginsg · ‎07-01-2010

I basically just want splunk to index what's inside the PDF file regardless of what it is. There is metadata and certain other characteristics of the PDF that would be useful if they could be captured and searched for. The PDF would also already be deflated using a different program to reveal any JavaScript or other data in the file, so if there is a way to bypass the binary check and just make it index the inside of the file regardless, that would be great. Then my next problem is pcap files.
