Methods of proving data integrity

(c) GNU Free Documentation License  2001 Horst Herb, hherb@gnumed.net

In order to prove data integrity we have to check

  1. that the document has not been accidentally corrupted
  2. that the document has not been illegitimately manipulated


Protection against accidental corruption

This is the easy part. The most primitive approach would be to store multiple copies; the likelihood that all of them get corrupted is slim. However, with only a few copies you might end up in a situation where most of them do not match each other. How would you then decide which copy is the true original?

There are mathematical one-way functions, so-called "hashes". Such a function calculates a large, presumably unique number for any given document. We will call this number a message digest. How likely it is that this number is really unique depends, of course, on the choice of hashing algorithm.

Cryptography is an exact science, a branch of mathematics. There is a lot of money in cryptography, so people working professionally in it show similar enthusiasm for their work as we do in medicine. As in any other exact science, good and proven methods are fully disclosed, the rest is likely to be snake oil.

An excellent choice of hashing algorithm to digest the type of documents we are likely to encounter in paperless medicine is RIPEMD-160. You should visit this link to learn more about it. It is reasonably collision resistant (there is no known way of altering the document while preserving its message digest), and its performance is still good enough for daily use on low-end computers.

To generate a RIPEMD-160 message digest ("digital fingerprint") of your document, you can download the "GNU Privacy Guard" (GPG). After installing it, enter at your command prompt:

gpg --print-md ripemd160 <your file name>

Wildcards like "*" are allowed; in that case all files in the current working directory will be digested. You can capture the screen output in a file:

gpg --print-md ripemd160 * >> allfiles.rmd

would create a text file "allfiles.rmd" (if it does not exist yet) and append a list of all file names followed by their message digests. You can make a backup of this list and compare it with another list generated in the future. A digest mismatch for any file proves that the file has been altered in some way.
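
The comparison of an old and a new digest list can also be automated. The following is only a minimal sketch in Python and not part of GnuPG: the function names digest_file, build_manifest and changed_files are made up for illustration, and SHA-256 from the Python standard library is used as a stand-in for RIPEMD-160, which is not available in every Python installation.

```python
import hashlib
from pathlib import Path


def digest_file(path, algorithm="sha256"):
    """Return the hex message digest of one file, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(directory, algorithm="sha256"):
    """Map the name of each regular file in `directory` to its digest."""
    return {
        p.name: digest_file(p, algorithm)
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    }


def changed_files(old, new):
    """Names whose digest differs, or that appear in only one manifest."""
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )
```

You would call build_manifest once, keep the result in a safe place, call it again later and pass both results to changed_files. An empty list means that no file was altered, added or removed.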

In order to understand the power of such a function, I suggest the following experiment:
Get the largest text document you can find on your computer (or download a large text file such as the classic Anomalies and Curiosities of Medicine by George Milbry Gould and Walter Lytle Pyle, which is more than 1000 pages long). Calculate the message digest as outlined above. Now open the document in your favourite text editor, alter one single character anywhere in the file (like changing a period into a comma), save it and calculate the message digest again. Impressed?
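
If you have Python at hand, the same experiment can be sketched in a few lines. This is an illustration under assumptions: the sample text and the helper name differing_bits are invented here, and SHA-256 is used instead of RIPEMD-160 because it is available in every Python installation. The effect is the same for any good hash function.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# A made-up "large document" and a copy with one single character changed.
original = b"The patient was discharged in good health. " * 1000
altered = original.replace(b".", b",", 1)  # first period becomes a comma

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(altered).digest()

print(d1.hex())
print(d2.hex())
print(differing_bits(d1, d2), "of", len(d1) * 8, "bits differ")
```

With a good hash function you would expect roughly half of the bits to differ, although only one character out of tens of thousands was changed.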

Now you know how to track down data corruption in an easy and practical way.



Protection against illegitimate manipulation

In order to prove that a given document has not been altered, we need two things:

  1. a message digest of that document as outlined in the previous paragraph
  2. some sort of proof
    1. when this digest has been generated
    2. who generated this digest

Proving who generated the digest can be done by digitally signing the digest. Theoretically, the "standard" algorithm for digital signatures, "DSA", could be used both for creating the message digest and for the signature: part of the DSA algorithm is generating a message digest with the "SHA" algorithm.

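The signature can again be created with GPG. The following command writes a detached, ASCII-armoured signature to "<message digest file name>.asc":

gpg -ba <message digest file name>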

You can check the signature with:

gpg --verify <message digest file name>.asc
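
If you want to sign and verify digest files from a script, the GPG calls can be wrapped as sketched below. This is an assumption-laden illustration, not a GPG interface: the helper names sign_command, verify_command and run_gpg are hypothetical, and the script assumes that a program called "gpg" is on your PATH and that a secret key is already set up. The options --armor and --detach-sign request an ASCII-armoured detached signature.

```python
import shutil
import subprocess


def sign_command(digest_file):
    """Argument list for an ASCII-armoured detached signature (<file>.asc)."""
    return ["gpg", "--armor", "--detach-sign", digest_file]


def verify_command(digest_file):
    """Argument list that checks <file>.asc against the digest file."""
    return ["gpg", "--verify", digest_file + ".asc", digest_file]


def run_gpg(command):
    """Run a gpg command; return True if gpg exits with status 0."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("gpg was not found on the PATH")
    return subprocess.run(command).returncode == 0
```

For example, run_gpg(sign_command("allfiles.rmd")) would sign the digest list, and run_gpg(verify_command("allfiles.rmd")) would check it later.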

Documents in medical health records can be quite large. The reliability of a hash function declines with the size and complexity of a message. RIPEMD-160 seems to have a definite advantage here. For details about the vulnerability of SHA, see F. Chabaud and A. Joux, "Differential Collisions in SHA-0", Advances in Cryptology - Crypto'98, LNCS 1462, H. Krawczyk, Ed., Springer-Verlag, 1998, pp. 56-71.

Although it is extremely unlikely that a SHA-digested document can be altered in such a way that it still produces the same SHA digest, there is no reason not to use RIPEMD-160 to digest large files, as it is apparently more reliable, free, and has no other known disadvantages.

A digital signature proves that you have signed a particular document as long as your private key has not been compromised. As with the loss of a credit card, you will be liable for the loss of your private key if you don't notify the key certifying authority (which might be yourself or your Division) that your key has been compromised. You would have a hard time in court proving that a document digitally signed by you has not been signed by you.

Now comes the tricky bit: proving when the document has been signed. A timestamp is embedded in the signature. The only way the signature-generating software can tell the time is by querying your computer clock, which you can adjust at will, at any time.

The solution is to deposit your signature with a trusted 3rd party (trusted by both you and a judge in a potential court case). Almost as good a proof is to have the trusted 3rd party countersign your signature: as a timestamp is embedded in both signatures, and you can't forge the signature of the 3rd party (if it is a trusted one), you can't manipulate the timestamp provided by the 3rd party.
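
The idea behind the countersignature can be illustrated with a small Python sketch. Note the assumptions: a keyed hash (HMAC with SHA-256) merely stands in for the 3rd party's signature, and the key, the timestamps and the function names countersign and check are invented for this example. A real 3rd party would use a public key signature, for instance with GPG, so that anybody can verify it without knowing a secret.

```python
import hashlib
import hmac


def countersign(signature: bytes, timestamp: str, third_party_key: bytes) -> str:
    """Bind `signature` to `timestamp` under the 3rd party's key."""
    message = timestamp.encode("utf-8") + b"\n" + signature
    return hmac.new(third_party_key, message, hashlib.sha256).hexdigest()


def check(signature: bytes, timestamp: str, third_party_key: bytes, tag: str) -> bool:
    """True only if neither the signature nor the timestamp was altered."""
    expected = countersign(signature, timestamp, third_party_key)
    return hmac.compare_digest(expected, tag)


# Hypothetical values, for illustration only.
key = b"known only to the 3rd party"
my_signature = b"placeholder for the bytes of my detached signature"

tag = countersign(my_signature, "2001-05-17T10:30:00Z", key)

print(check(my_signature, "2001-05-17T10:30:00Z", key, tag))  # prints True
print(check(my_signature, "2000-01-01T00:00:00Z", key, tag))  # prints False
```

Because the timestamp is part of what the 3rd party signs, a backdated timestamp no longer matches the countersignature, and you cannot produce a matching one without the 3rd party's key.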

You can do this even without Internet access: simply print the signature and the countersignature onto a piece of paper, and deposit it with a trustworthy 3rd party. To verify the signature, you would have to type it in again.

The more extreme the demands on proving authenticity are, the more 3rd parties you have to involve. I believe that two independent non-profit organisations (like the Divisions of General Practice) would be trustworthy enough, but you might choose to involve a justice of the peace, a notary or a similar institution.

The benefit of this method is that you don't have to disclose any confidential data to the signing 3rd party, and that the volume of data to sign is so small that it is practical to distribute the signatures to many different servers through the Internet. Done properly, there is no need for an expensive public key infrastructure (PKI) or key certification authorities (which are rather worthless anyway; read this article written by one of the world's foremost "crypto gurus", Bruce Schneier).

There is one issue of the utmost importance: always bear in mind that you might need to prove the authenticity of a given document many years, even decades, in the future. Therefore you must not use any software that depends on a particular platform such as Microsoft Windows.

You cannot expect that a particular piece of software will survive and be maintained for decades. The same applies to commercial key authorities. They are unlikely to exist forever. Do not make yourself dependent on them!

Quality and dependable software suitable for our purposes will run on virtually any platform, and source code will always be provided. The source code is written in a portable way that will make it easy to maintain on future platforms. Even if you are not able to use the source yourself, there will always be someone whom you can pay for such a service. Without the source, you are lost.

Therefore, at present I cannot recommend any other product than the GNU Privacy Guard for our purposes. It might be a little bit more difficult to use than other products, but it is at least future proof.



(c) 2000, 2001 GNU Free Documentation License Dr. Horst Herb, hherb@gnumed.net