19

I'm currently looking at ways to prevent malicious PDF files at the network boundary. This will include virus scanning - but there are known limitations to that. I see a common approach is to flatten the PDF file using something like:

gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=flattened.pdf raw.pdf

While this certainly seems to remove the usual suspects from the output of pdfid, that alone does not mean that the associated threats have been eliminated.

Hence:

  1. Will this approach eliminate most Flash and Javascript exploits?

  2. What threats are likely to persist?

Notes:

As this is intended for bulk scanning, suggestions such as this are not really practical at scale.

Links to authoritative sources would be much appreciated.

Update

The method above removes Flash and JavaScript from the PDF. Steffen (see below) highlighted that malware embedded in image files would likely survive. To mitigate this, I am downsampling the images. I've not been able to get a clear answer on whether gs preserves or removes EXIF data, but the downsampling will likely alter the offset of any malware embedded there, nullifying its exploitability. The downsampling should also remove any malware embedded in the image data itself. Hence:

DPI=63

gs -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfwrite \
   -dDownsampleColorImages \
   -dColorImageDownsampleType=/Bicubic -dColorImageResolution=${DPI} \
   -dDownsampleGrayImages \
   -dGrayImageDownsampleType=/Bicubic -dGrayImageResolution=${DPI} \
   -dDownsampleMonoImages \
   -dMonoImageDownsampleType=/Bicubic -dMonoImageResolution=${DPI} \
   -sOUTPUTFILE="${TMPPDF}" "${SRCFILE}"
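
As a regression check on the flattened output, the pdfid counts mentioned earlier can be tested automatically. The following is only a sketch: `pdfid_flags` is a hypothetical helper name, the keyword list is my own selection, and it assumes pdfid's usual "keyword count" line layout. A clean result only shows that the keywords are gone, not that the file is safe.

```shell
# pdfid_flags (hypothetical helper): reads pdfid output on stdin, prints every
# risky keyword whose count is non-zero, and returns 1 if any were found.
# Assumption: each pdfid line looks like " /JS   0" (keyword, then count).
pdfid_flags() {
    awk '
        $1 ~ /^\/(JS|JavaScript|AA|OpenAction|Launch|EmbeddedFile|RichMedia)$/ && $2 + 0 > 0 {
            print $1, $2
            found = 1
        }
        END { exit found ? 1 : 0 }
    '
}
```

Typical use would be to pipe the pdfid report for `${TMPPDF}` into `pdfid_flags` and reject the file if it returns non-zero.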
symcbean
  • For the flags you added to downsample images, could you explain what each of them do? And why did you choose 63 for the DPI? – BobbyA Apr 24 '18 at 15:35
  • 1
    The type is based on preserving as much info as possible - bicubic is definitely the best for JPEG images, although arguably subsample might be better for line art. The value of 63 for DPI was chosen as the least likely value for the original resolution (on the basis that if I tell it to resample the file at the same resolution it may not bother changing the data at all). – symcbean Apr 24 '18 at 15:49
  • In more recent months, a number of issues have emerged in relation to ghostscript security. While I'm still confident that it is much safer than Adobe's tools, and that using it for gateway conversions and then viewing the file with a different reader still has a security benefit, I would suggest running this inside a strong, effective sandbox (and only if you really need to process PDFs). https://www.theregister.co.uk/2019/01/24/pdf_ghostscript_vulnerability/ – symcbean Jan 25 '19 at 13:05

2 Answers

8

I think using gs should remove all active content (JavaScript) and embedded data (videos, Flash...). But I'm not sure whether using pdfwrite directly on the PDF will really remove all active and embedded content. Thus I suggest that you first convert the PDF to PostScript using gs and then convert the PostScript back to PDF using gs with the pdfwrite backend. Since the PostScript format itself does not support active or embedded content, such content should not survive the conversion process. I'm not sure whether this will also help against attacks on image formats, such as exploits for vulnerabilities in libjpeg, libpng or similar. In any case, the call to gs itself should be made inside some kind of protected environment (i.e. a sandbox or similar) so that such vulnerabilities do not affect the security of the security system itself.
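
A minimal sketch of the two-step round trip described above, assuming a Ghostscript build that includes the ps2write device. The function name `flatten_via_ps` and the temporary file handling are illustrative choices, not something from the question:

```shell
# flatten_via_ps (hypothetical helper): PDF -> PostScript -> PDF using gs only.
# Usage: flatten_via_ps raw.pdf flattened.pdf
flatten_via_ps() {
    src=$1
    dst=$2
    tmp_ps=$(mktemp "${TMPDIR:-/tmp}/flatten.XXXXXX") || return 1

    # Step 1: render the untrusted PDF to PostScript (ps2write device).
    # Step 2: rebuild a fresh PDF from that PostScript (pdfwrite device).
    gs -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE=ps2write \
       -sOutputFile="$tmp_ps" "$src" &&
    gs -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfwrite \
       -sOutputFile="$dst" "$tmp_ps"
    status=$?

    # Always remove the intermediate file, then report the gs result.
    rm -f "$tmp_ps"
    return "$status"
}
```

This does not replace the sandboxing advice: both gs invocations still parse attacker-controlled input.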

Another option would be to convert the PDF to images and then create a new PDF from these images. This way you could also protect against exploits for vulnerabilities in image libraries, but at the cost of losing the ability to work with the PDF as text (i.e. search, copy...). If you want the additional protection but need to handle the PDF as text, you could run some OCR software afterwards to reconstruct the text from the images.
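
One way to sketch the image-only variant in a single step is Ghostscript's pdfimage24 device, which renders each page to a bitmap and wraps it in a new PDF. This assumes your gs build includes that device (check the device list printed by `gs -h`); the helper name `rasterize_pdf` and the 150 DPI default are my own assumptions:

```shell
# rasterize_pdf (hypothetical helper): rebuild a PDF whose pages are plain bitmaps.
# Usage: rasterize_pdf raw.pdf rasterized.pdf [dpi]
rasterize_pdf() {
    src=$1
    dst=$2
    dpi=${3:-150}   # assumed default; lower values give smaller but blurrier output

    # pdfimage24 renders every page to a 24-bit image and writes it into a new PDF,
    # so text, fonts, forms and annotations from the input are not carried over.
    gs -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfimage24 \
       -r"$dpi" -sOutputFile="$dst" "$src"
}
```

The output is no longer searchable, which is where the OCR step mentioned above would come in.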

Steffen Ullrich
  • 1
    I'm curious to know why you suggest NOT using pdfwrite (the only ps to pdf convertor I have is the ghostscript one, which is a shell script wrapper around gs + pdfwrite) – symcbean Oct 21 '15 at 13:14
  • 1
    @symcbean: I did not suggest avoiding pdfwrite altogether. I tried to suggest not using pdfwrite on the PDF itself, but instead converting the PDF to PostScript first and then converting the PostScript back to PDF (using pdfwrite). The reason is that I'm not sure how pdfwrite works when given a PDF as input, and the explicit step of converting to PostScript first makes sure that no active content is left, simply because PostScript does not support such content. – Steffen Ullrich Oct 21 '15 at 14:44
  • 2
    When I convert to ps and back to pdf in 2 separate steps (the latter using pdfwrite) I get the same binary file as when doing it in a single call to gs, suggesting that internally both methods are equivalent. – symcbean Oct 22 '15 at 15:20
  • @symcbean: if you are sure that it works the same and that it will also do the same on any kind of input and with any future upgrades of gs you can simplify it. I'm simply not sure and that's why I argued for the extra step. – Steffen Ullrich Oct 22 '15 at 15:23
3

This got too long for a comment.

There are various degrees of flattening. Converting everything to bitmaps is not recommended if (e.g.) you want to typeset the PDF, and it can make PDFs large and unwieldy (perhaps you don't care if the input files are all bulk scans). Going to PS and back again would be reasonable, but note you will lose (e.g.) the ability to fill in forms. Moreover, password-protected PDFs won't be transmissible, and action-protected PDFs will need a PDF renderer that doesn't enforce Adobe DRM.

Going either to PS or bitmap (and back) should eliminate both JS and Flash vulnerabilities, along with a lot of others (e.g. buffer overflows), as far as the person ultimately viewing the PDFs is concerned.

However, whether you go to bitmap or to PS or whatever, you are merely moving (and conceivably duplicating) the problem here. Whatever vulnerabilities previously existed in your PDF viewer may be avoided (assuming the vulnerability can't survive conversion to bitmap / PS and back), but whatever renders the PDF may be subject to the same (or other) vulnerabilities.

Spinning up a VM merely means that VM may get compromised, which is probably less bad than an arbitrary desktop getting compromised.

The best practical idea I can come up with is to spin up, for every PDF, a Docker container (or similar) which is in an exact known state each time, and which provides its output as a PS or bitmap file. Then spin up another Docker container (each time) to render that output as PDF. It would take a targeted and sophisticated attack to get through that, rather than a shotgun approach with generically infected PDFs. Spinning up a Docker container is very fast, and gives you a degree of isolation. For good measure, do the lot in a VM.
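
A rough sketch of the per-file container idea, one throwaway container per conversion stage. Everything specific here is an assumption: `gs-sandbox:latest` is a hypothetical image that has Ghostscript installed and no ENTRYPOINT, `sandboxed_gs` is a made-up helper name, and the resource limits are arbitrary starting points:

```shell
# sandboxed_gs (hypothetical helper): run one gs conversion in a fresh container.
# Usage: sandboxed_gs DEVICE IN_DIR IN_NAME OUT_DIR OUT_NAME
# IN_DIR and OUT_DIR must be absolute host paths; OUT_DIR must be writable
# by the user the container runs as.
sandboxed_gs() {
    device=$1
    in_dir=$2
    in_name=$3
    out_dir=$4
    out_name=$5

    # No network, read-only root filesystem, no capabilities, capped memory
    # and process count; the container is deleted afterwards (--rm).
    docker run --rm \
        --network none \
        --read-only --tmpfs /tmp \
        --cap-drop ALL --security-opt no-new-privileges \
        --memory 512m --pids-limit 64 \
        -v "$in_dir":/in:ro -v "$out_dir":/out \
        gs-sandbox:latest \
        gs -dSAFER -dBATCH -dNOPAUSE -dQUIET -sDEVICE="$device" \
           -sOutputFile="/out/$out_name" "/in/$in_name"
}

# Two stages, two containers (paths are placeholders):
#   sandboxed_gs ps2write /srv/pdf/in    raw.pdf   /srv/pdf/stage stage1.ps &&
#   sandboxed_gs pdfwrite /srv/pdf/stage stage1.ps /srv/pdf/out   flattened.pdf
```

A compromise of the first container then only yields a PostScript file that the second, equally disposable container has to parse.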

abligh