77

I have been thinking about the following question for a long time and have not found much material* on the web, and nothing at all on Security.SE.

I think it's a very interesting question, as it covers different anonymization measures (or countermeasures to possible deanonymization measures in software and hardware), and in modern times it seems more important than ever for protecting the human right to freedom of speech.

How can I publish (scanned) documents anonymously?

To narrow down the question a little bit, let's define some parameters:

  • I have some documents in paper form I want to publish without identifying me as the publisher.
  • These documents have no "fingerprint" or any unique printed information on them to identify me as the owner. (Or I have covered it)
  • I will publish the digital files via a secure network (e.g. Tor) with an open source file hosting website that is guaranteed to not store or even publish any information about the uploader.

Things I thought of that might be a problem:

  • Do scanners add any visual unique fingerprint (or even worse: information about the connected device etc.) to every scanned page?
  • Do scanners add any digital (e.g. binary) fingerprint (or even worse: information about the connected device etc.) to every scanned file?
  • Do scanners have a unique 'technical unavoidable' fingerprint, so every scanner scans differently? And is this fingerprint computable or even stored somewhere?
    Or does the 'institution' that wants to deanonymize me need access to my scanner to make a comparison?
  • Do PDFs 'store' any information related to the host computer in them?

And if the answer to any of these questions is yes, how can I remove or avoid this information?


*Two notable Sources I have found:

Robert
  • 5
    PDF can have metadata but it's removable http://www.prepressure.com/pdf/basics/metadata – Neil Smithline Apr 26 '15 at 19:13
  • 11
    Without wishing to point out the obvious, is there a reason you don't want to use [OCR software](http://finereader.abbyy.com/) to totally obliterate any sort of digital fingerprint? You could scan the doc, extract the text and upload as plain text to anywhere you want. – Richard Apr 27 '15 at 00:22
  • 3
    If the information can be represented suitably as 1bpp/lineart, save it as such in a format with absolutely no headers beyond the image dimensions, like raw pbm. – R.. GitHub STOP HELPING ICE Apr 27 '15 at 02:22
  • 9
    @Richard : the problem with OCR and publishing it as plain text is that it would be perceived as having less authenticity (maybe he just made it all up and just typed whatever he wanted). Publishing the official-looking document itself as an image makes it more credible. – vsz Apr 27 '15 at 16:58
  • 1
    How do you find the leak? Each person gets a copy with some unique changes in it. It doesn't matter if you publish it as PDF, image or mere text. That's how I would catch you. As in Catch 22. – ott-- Apr 27 '15 at 19:53
  • 4
    While I understand that this question is about identifying information added by scanners and PCs, you also need to consider the possibility of a "canary trap": http://www.businessinsider.com/nba-canary-trap-media-2014-12 – Free Radical Apr 28 '15 at 02:16
  • 2
    You did not mention whether you personally printed the docs. Be aware that most color laser printers encode hidden identification information on each page (https://www.eff.org/issues/printers). Although I would assume a scanner would not be able to consistently scan/reproduce those yellow dots, if we are being totally paranoid, it's possible that slight noise-like variations could result from the presence of the dots. A statistical analysis of the fixed dot positions from tens or hundreds of scanned pages all printed from the same color laser printer could perhaps identify it. – cybermike Apr 28 '15 at 09:13
  • First off: post **anonymously** on `security.stackexchange.com`. – Vorac Jul 12 '20 at 16:02

3 Answers

63

Publishing scans without being identified is a tough proposition. There are multiple risks of information leak, and mitigation is technically complex. However, anyone determined to do so can learn the appropriate techniques, and there is free software to accomplish the task.

Disclaimer: Although I consider myself technically knowledgeable about the mentioned issues and I've included references where they exist, some parts of this answer are speculative.


Risks:

Do scanners add any visual unique fingerprint (or even worse: information about the connected device etc.) to every scanned page?

This seems likely, considering that some printers do so. There isn't much information available on scanners, though.

Do scanners add any digital (e.g. binary) fingerprint (or even worse: information about the connected device etc.) to every scanned file?

If you're scanning from an attached PC (as your question implies), the answer is no, the scanner can't. Scanners attached to a PC transfer raster image data, not files, so they can't add data to a file they never have access to.

However, you should consider that a digital fingerprint could be added by the scanning software on the PC.

Also, if the scanner is standalone (it saves files to a USB drive, or sends them by email), this is a definite possibility.

Do scanners have a unique 'technical unavoidable' fingerprint, so every scanner scans differently? And is this fingerprint computable or even stored somewhere? Or does the 'institution' that wants to deanonymize me have to have access to my scanner to make a comparison?

Yes. Most modern scanners use CCD sensors, whose noise patterns make them uniquely identifiable with specialized software.

Other plausible visual fingerprinting targets:

Using these kinds of fingerprinting techniques, it seems likely that the scanner model and paper type can be identified from the scans, but identifying the specific scanner and sheet of paper used would be hard (perhaps impossible) without access to them for comparison purposes.
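To make the "needs a comparison sample" point concrete, here is a toy sketch of how a noise-pattern match might be scored. It is not how real forensic tools work in detail (they use much better denoising); the function names and the simulated scanner are my own assumptions, for illustration only. The idea: estimate each scan's noise residual, then correlate it against the residual of a scan known to come from a given device.

```python
import random

def residual(img):
    """Each pixel minus the mean of its 3x3 neighbourhood (a crude noise estimate)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neigh = [img[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(img[y][x] - sum(neigh) / len(neigh))
        out.append(row)
    return out

def correlation(a, b):
    """Normalized correlation between two equally sized residuals, in [-1, 1]."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = (sum((x - ma) ** 2 for x in fa) * sum((y - mb) ** 2 for y in fb)) ** 0.5
    return num / den if den else 0.0

def simulate_scan(content, sensor_pattern, rng):
    """Hypothetical scanner model: content + fixed sensor pattern + per-scan noise."""
    return [[content[y][x] + sensor_pattern[y][x] + rng.gauss(0, 1)
             for x in range(len(content[0]))]
            for y in range(len(content))]
```

In this toy model, two scans from the same simulated sensor correlate strongly, while scans from different sensors do not. Without a reference scan from the suspected device, there is nothing to correlate against.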

Do PDFs 'store' any information related to the host computer in them?

Yes, there's even an NSA article about it. When dealing with scanned documents, you'll also need to be aware of image file metadata, which can be present in PNG and JPG files, for example.
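As a rough illustration of what such metadata looks like, the sketch below scans raw PDF bytes for the standard document-information keys (`/Author`, `/Creator`, `/Producer` and so on) and for an XMP packet marker. It is deliberately crude and is an assumption-laden check, not a PDF parser: it only sees uncompressed literal strings, so an empty result does not prove a file is clean. The function names are hypothetical.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = (b"Title", b"Author", b"Subject", b"Keywords",
             b"Creator", b"Producer", b"CreationDate", b"ModDate")

def find_pdf_info(data: bytes) -> dict:
    """Crude scan for uncompressed entries such as /Author (alice)."""
    found = {}
    for key in INFO_KEYS:
        # Key, optional whitespace, then a literal string in parentheses
        # (escaped characters such as \) are allowed inside the string).
        m = re.search(rb"/" + key + rb"\s*\(((?:\\.|[^\\)])*)\)", data)
        if m:
            found[key.decode()] = m.group(1).decode("latin-1")
    return found

def has_xmp(data: bytes) -> bool:
    """XMP metadata is embedded XML and normally contains this marker."""
    return b"<x:xmpmeta" in data
```

Running `find_pdf_info(open("scan.pdf", "rb").read())` on a file straight out of a scanning program will often show at least the producing software and a creation timestamp.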

Another risk that you didn't mention is that the scanner itself may store a copy of your scan. Big printers do.

Of course, this isn't an exhaustive list of risks, merely what has come to my mind in the couple of minutes it has taken me to write this answer. I'm pretty sure researchers, intelligence agencies and police paid to do so can come up with better ideas!


Mitigation

The easiest, safest, and most obvious mitigations are to avoid using a scanner that can be tied to your identity and to destroy the scanner afterwards. Of course, this is not always feasible, so what else can you do to protect yourself?

Don't use a stand-alone scanner - especially a networked one. If you really must, convert its output to a pure image without metadata.

To mitigate (at least partially) fingerprints added by software, use open source software for both the OS and the scanning program. Avoid using your personal PC for scanning, or at least use a secure live OS.

To detect deliberate visual fingerprinting, the best option is to scan a blank page and look for obvious anomalies. These might be very small or faint, so you may want to use an image editor to crank up the contrast.
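If you'd rather script the contrast step than do it by hand, a minimal sketch could look like the following. It assumes the blank-page scan has already been loaded as a grayscale grid (a list of rows of 0-255 values); the function names are mine, and on a real scan ordinary sensor noise will also be amplified, so you would still inspect the result by eye.

```python
def stretch_contrast(img):
    """Linearly map the image's own [min, max] range onto [0, 255]."""
    flat = [v for row in img for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Perfectly uniform page: nothing to stretch, return plain white.
        return [[255 for _ in row] for row in img]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in img]

def suspicious_pixels(img, threshold=128):
    """(x, y) coordinates that end up darker than threshold after stretching."""
    stretched = stretch_contrast(img)
    return [(x, y)
            for y, row in enumerate(stretched)
            for x, v in enumerate(row)
            if v < threshold]
```

A dot that is only three gray levels darker than the paper is invisible on screen, but after stretching it becomes pure black on white.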

For sensor, paper and visual fingerprinting in general, you want to destroy subtle scanning artifacts. Use an image editor to:

  • Add noise
  • Use a noise reduction filter (with aggressive reduction)
  • Rotate
  • Distort the image (by applying multiple camera "lens correction", for example)
  • Convert the image to grayscale
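  • Increase the contrast (or, preferably, convert completely to black-and-white)
  • Reduce resolution (preferably by a non-round factor, i.e. a ratio whose reduced fraction has large integers, so that pixels do not map cleanly onto the originals)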
  • Compress the image (high JPEG compression, for example)

In general, do everything you can to obfuscate and reduce the amount of information contained in the image while keeping the document reasonably readable.
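A few of the steps above (add noise, reduce resolution by a non-round factor, convert to pure black-and-white) can be sketched in a handful of lines. This is an illustration under my own assumptions, not a vetted anonymization tool: the image is a grayscale grid (list of rows of 0-255 values), the default factor `0.73` and `sigma` are arbitrary, and rotation, distortion and noise reduction are left out.

```python
import random

def add_noise(img, sigma, rng):
    """Add Gaussian noise to every pixel, clamped to 0..255."""
    return [[min(255, max(0, round(v + rng.gauss(0, sigma)))) for v in row]
            for row in img]

def resize(img, factor):
    """Nearest-neighbour resample by an arbitrary (non-round) factor."""
    h, w = len(img), len(img[0])
    nh, nw = max(1, int(h * factor)), max(1, int(w * factor))
    return [[img[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))]
             for x in range(nw)]
            for y in range(nh)]

def to_black_and_white(img, threshold=128):
    """Discard all gray levels: every pixel becomes 0 (black) or 255 (white)."""
    return [[0 if v < threshold else 255 for v in row] for row in img]

def obfuscate(img, factor=0.73, sigma=8.0, seed=None):
    """Noise, then downscale, then threshold. Leave seed as None for real use."""
    rng = random.Random(seed)
    return to_black_and_white(resize(add_noise(img, sigma, rng), factor))
```

The thresholding step does most of the work here: once every pixel is either black or white, the only place subtle artifacts can survive is along the edges of the characters.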

Finally, after all the other steps, remove the metadata from your files. You can use specialized software to do this.
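One way to be sure no file-level metadata survives is to copy only the pixel values into a format that has no room for anything else. As a sketch (assuming the image has already been reduced to black-and-white and is held as a grid where 0 is black and 255 is white), this writes raw PBM, whose header consists of nothing but the magic number and the image dimensions. The function name is hypothetical.

```python
def to_pbm(bw):
    """Serialize a black-and-white grid (0 = black, 255 = white) as raw PBM (P4)."""
    h, w = len(bw), len(bw[0])
    body = bytearray()
    for row in bw:
        # Each row is packed 8 pixels per byte, most significant bit first,
        # and padded to a whole number of bytes.
        for start in range(0, w, 8):
            byte = 0
            for bit, v in enumerate(row[start:start + 8]):
                if v == 0:  # PBM stores black as 1
                    byte |= 0x80 >> bit
            body.append(byte)
    return b"P4\n%d %d\n" % (w, h) + bytes(body)
```

Note that this only guarantees nothing is carried over from the original file's headers. Anything encoded in the pixels themselves is untouched, which is why it comes after the image-processing steps.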

loopbackbee
  • 3
    This is an excellent answer, but I'd like to point out a simple fact: doing all these steps in order to obfuscate the result, can be **a lot** more expensive than buying a cheap scanner and destroying it afterwards. – o0'. Apr 27 '15 at 10:06
  • 3
    Would it be possible to use the scan of the blank page to "remove" any pattern that was added to both pages after scanning? – Alexander Apr 27 '15 at 10:13
  • 2
    "Reduce resolution (preferably by a irrational factor)" That's impossible. Any image has integer dimensions so the scaling factor from one image to another can only be rational. – David Richerby Apr 27 '15 at 12:39
  • 2
    @Lohoris Indeed, but as I mentioned, that's impossible in some circumstances, such as when you can't move all the paper outside a specific area. It's also not a *lot* more expensive if you automate the process. – loopbackbee Apr 27 '15 at 13:09
  • 1
    @Alexander it will depend on the type of pattern used. If the pattern is absolutely the same for all scans, you could simply subtract the blank page image (a "difference" layer, in photoshop parlance). If it differs from scan to scan but always has the same position, you can simply blank or blur that area enough. – loopbackbee Apr 27 '15 at 13:18
  • 2
    @DavidRicherby You're strictly correct, of course. What I *meant* is a "rational factor whose reduced fraction has very large integers" - it seemed a bit too technical for this answer, though. Do you have a better idea on what to call it? – loopbackbee Apr 27 '15 at 13:23
  • @goncalopp I guess you want a scaling factor that's in some sense "not a round number" but I'm not sure that phrase conveys what's needed precisely enough. – David Richerby Apr 27 '15 at 13:29
  • 1
    @Lohoris If the disposable scanner you use is one that can be cross-indexed to the location it was purchased, that could give away information one might rather not expose. – Dronz Apr 27 '15 at 22:47
  • @Dronz true, I was under the assumption that the comparison would have been made post-facto, and that there wasn't a central database instead. – o0'. Apr 28 '15 at 08:28
  • 1
    +1 for suggesting open source software for both OS and scanner software. I would go one step further though, and suggest only using scanners that can be used with pre-installed free software drivers (no binary blobs from the scanner vendor) and generic scanning software such as Simple Scan (included on a standard installation of Ubuntu for instance). – JeroenHoek Apr 28 '15 at 08:54
  • @Lohoris, wouldn't opening the scanned image with any image processing software and saving it as another image be enough to lose the scanner information? – YoMismo Apr 28 '15 at 13:12
  • @Lohoris Doing stuff like throwing away a perfectly good new piece of hardware is a telltale sign of someone trying to cover their tracks. It could be what gets you caught. – aaaaaaaaaaaa Apr 28 '15 at 13:47
  • @YoMismo definitely not in many cases. – o0'. Apr 28 '15 at 14:06
  • @goncalopp I'd make a few additions to your editing list: Noise reduction filter, preferably with a high threshold, may do a lot more than other techniques to remove signature artefacts, apply after noise for best effect. Increase contrast to make white areas uniformly white, and black areas ditto black, now we only have to worry about data traces at the edge of characters. On the other hand I don't believe lens correction distortion would do much, it has a mathematical purity that makes it almost pixel perfectly undoable. – aaaaaaaaaaaa Apr 28 '15 at 14:12
  • @eBusiness noise reduction and contrast are great suggestions, I've added them to the list, thanks! It's true that lens correction is easily reversible if you know the correction parameters used *or* you can estimate them from the image, but I think you might have a hard time perfectly estimating them if the original lines are not straight (because they've been deformed by the stepper motor non-uniformity). In the absence of a proper study, I'd rather suggest too much than too few countermeasures. – loopbackbee Apr 29 '15 at 18:54
  • Assuming that we scan text, couldn't we also use OCR to create a plain text or PDF that does not have any images in it, and thus does not have the paper. Of course, OCR has its own problems (mostly with accuracy and dealing with complex formatting). – Kat May 01 '15 at 23:41
  • @Mike In theory, if you can live with the limitations you mentioned, OCR would be an ideal solution, as it keeps no information from the original except the text. In practice, unless you want a plain-text result (without any formatting), you probably can't audit the OCR engine nor the files it produces to see it doesn't add another fingerprint. If you then further edit the files in a text processor (to correct them, for example), you'll add still more metadata. Removing metadata from, say, a `.docx` file is harder to do than from an image file. – loopbackbee May 05 '15 at 09:34
  • "though proposition" is a typo but it won't let me fix a single character – endolith Mar 09 '19 at 02:27
  • "Scanners attached to a PC transfer raster image data, not files, so it can't possibly add data to a file it doesn't have access to." but it can still encode digital fingerprints (not visual fingerprints) in the raster data steganographically. https://en.wikipedia.org/wiki/Steganography#Digital_messages – endolith Mar 09 '19 at 02:28
  • @DavidRicherby "Any image has integer dimensions so the scaling factor from one image to another can only be rational." That's not true. You can resample by any real ratio, including irrational. The pixel centers don't need to line up. – endolith Mar 09 '19 at 02:29
  • @endolith OK but whatever irrational scaling factor you use on a particular image will produce identical results to some rational scaling factor on that image. And, in reality, your image is going to be less than a trillion pixels square, so whatever irrational scaling factor _f_ you pick, I can use _f_ to 13 or so decimal places (which is rational, because it's a finite decimal) and get identical results to you on every image you'll ever see. – David Richerby Mar 09 '19 at 10:29
  • @DavidRicherby no the results will not be identical, even if the pixel number rounds to the same value – endolith Mar 10 '19 at 12:53
11

Buy the scanner in cash, and buy a PC from some PC junker shop in cash. Make sure you never input any information about your name etc into the computer. If everything is bought in cash, and you have a virgin OS with only alias information about yourself, then there should be no correct metadata to encode.

Certain programs do embed metadata, such as Microsoft Word and other Microsoft products. I think even text files have operating system metadata associated with them. I can't see any software embedding an IP address or something of that nature as metadata; that would be more invasive than normal.

It is possible to scrub metadata from files programmatically; it just requires a little effort. Images almost always have some form of metadata, such as GPS coordinates if they were taken with a mobile device, but I can't see scanners having GPS chips. It would be a bit of a waste, wouldn't it?

PDFs will probably have a lot of metadata associated with them, although the software would have to get the user's information from somewhere.

Another thing that would help prevent metadata from being transferred is a lack of internet connection. If the programs can't phone home, they can't initialize certain metadata such as location. I realize this says a little less about the actual metadata than you would like, sorry about that. I am an entry-level programmer, but I have had some classes in computer forensics as well as computer programming. I hope this helps.

Kjartan
overwraith
    +1 for the "no metadata when there is no data to collect" idea, when this is a requirement buying a dedicated PC may be a good investment. I would recommend Linux as OS on it which will allow less tracking, you can find specialized distribution lists [on the Internet](http://www.greycoder.com/anonymous-linux-distributions/) but ensure that it is not too minimalistic and provides the software to work with the scanner. Be sure to check if the scanner is [supported natively](http://www.sane-project.org/sane-supported-devices.html), free software being less prone to hidden metadata. – WhiteWinterWolf Apr 26 '15 at 22:15
  • 10
    There is a lot of guessing in this answer but no actual sourced knowledge to answer the questions which were asked. – Philipp Apr 26 '15 at 22:31
  • Some points are wrong, but the idea of "no data to collect" is good – Display Name Apr 27 '15 at 08:53
  • 2
    It seems like buying a lot of Raspberry Pi's would be useful for this. $25-35 USD (plus shipping) is fairly inexpensive. Plus if you were doing this in some sort of covert setting, the small form factor is a bonus. – Wayne Werner Apr 27 '15 at 12:21
  • 3
    Why not buy a throwaway phone instead of throwaway computer+scanner? Camera phones are more efficient for scanning documents than most cheap consumer-grade scanners anyway, as long as you setup proper lighting and 'tripod' to do it. – R.. GitHub STOP HELPING ICE Apr 27 '15 at 16:59
  • 4
    @R.. Camera phones usually have GPS chips as part of their cellular modems, which allow you to be geolocated. Sure, you can scrub the metadata, but having metadata in the first place is exactly what the answer aims to avoid. – March Ho Apr 28 '15 at 06:44
  • @MarchHo: They have metadata which is stored in the EXIF headers, but I would probably worry *less* about steganographic metadata hiding in camera phone pics than in scanner output. I would assume you're already planning to publish in a format with no headers like pbm or ppm. BTW, GPS only works when it's on, and rarely even then. Just do your work in a place with no GPS or cellular reception (or open up the device first and remove the antennas). – R.. GitHub STOP HELPING ICE Apr 28 '15 at 14:17
  • Also, be aware that while not normally visible, even a cheap scanner does pick up part of what is on the opposite side of the page and with processing it is possible to see that, so be sure that there are no identifying marks on the other side of your page. – Rod MacPherson Apr 28 '15 at 16:57
2

Don't do it.

Forget about it.

If the documents that you are trying to surreptitiously reveal are sensitive enough to demand that level of anonymity and "security", you will be found out.

Snowden revealed secret documents, but he did not hide his identity, neither did Manning.

ALL of the "security methods" mentioned above will fail, and badly. Why?

They operate on the premise that there is this huge pool of potential leakers, of which you will be an anonymous entrant with nothing to point you out.

However: Most secure documents have a limited distribution/access list, and many are time sensitive, which fix their release to a certain point in time.

Suspicion will fall on you immediately, and there will be many indicators of your involvement right away, not the least of which is your post on this site!

You will have to prove you did not, not the other way around, and if you are physically seized, you will confess.

For secure documents and most theft cases, the suspect is picked first and then their circumstantial evidentiary trail is used to lock in their guilt!

You used Tor? Not many do. Do you use Tor all the time? Oh no? You only used it just to upload these docs? Guilty.

How about going to a public wifi spot? Is it near where you live? Did you take your cellphone with you? (cell tower access logs)

Seriously, you are not a spy, and even if you are, you will be caught.

Your only hope is if someone else stole them and you got these documents outside of their knowledge, but the arrow is already pointing to you.

schroeder
  • sorry about the capitals-only sentences, peterH! I didn't know how to format the text with italics and such as you have done (thanks) and used caps to emphasize certain areas I felt deserved emphasis. The unfortunate thing (for this guy) is that all the other advice given here will definitely not protect him one bit. While some approach it as a purely academic exercise, I saw a young man about to ruin his life..oh well, thanks for all the fish – Stephen Wilkinson May 07 '15 at 23:56
  • @StephenWilkinson you are making some large assumptions about the nature of the documents the OP wishes to disclose. The OP is also constraining the conversation to technological traces. While your points cover the wider discussion, you cannot say that the other answers will "not protect him one bit". There may be gaps, but they are also not wrong. – schroeder May 08 '15 at 00:01
  • Hi Stephen, thanks for your answer and also your attempt to protect me! First of all: I do NOT want to disclose anything. I have no secret material and so I have no intention of publishing anything anonymously. And I also don't intend to encourage anyone to do that. The whole question is written in the form "If someone wanted to $that [..with these restrictions...], how could he or she do it". And of course your points are somewhat valid, so I won't downvote your answer, but as @schroeder already said, the whole idea of this question was to talk about the technical restrictions. – Robert May 08 '15 at 06:04
  • To address the point of sensitive documents having a limited distribution, one method to combat this is to plant any incriminating evidence on someone else. "no officer it wasn't me, but I did see Jeremy going in there one day on his way out of work, I always have been suspicious of him maybe you should check him out" But @StephenWilkinson is right that posting on this site is the biggest clue in the trail; this is one way they built a case against Ross Ulbricht, the Silk Road admin, by comparing his questions with code snippets against the Silk Road codebase. – user6858980 Mar 10 '19 at 22:14