Is there a hash algorithm that can identify similar files or strings?

Question

Is there a hash algorithm that will help you identify similar files or strings? For example, the hash for ABC and XBC would be similar rather than radically different as is usually the case. I know of one measure of similarity, Edit Distance (http://en.wikipedia.org/wiki/Edit_distance). But this doesn't give you a hash for each input to compare, only a score between any two inputs.

Update

The comment by Andan (locality sensitive hashing, LSH) is what I was looking for. My motive for asking the question is I was wondering how LSH might be used in scanning for malware. Is it used for identifying malware? Why or why not?

Update

In line with Tom Leek's, answer I did some investigation of my own. I wrote a program that would XOR the bytes of a file with a predetermined "random" pattern (the seed didn't change). Then it would sum the total 1 bits. This would produce the Hamming distance from the random pattern to the file. Really, it wasn't a very useful metric as it basically (on average) was just halving the file size to come up with a number.

Some examples:

Two related executables I scanned scored 2684964 and 2738772 for a difference of 53808. They are definitely related (different versions of programs I wrote) but the value of 53k is close to half of the file size difference in bits: ~128k. So it's not a useful metric for determining similarity.

I scanned two similarly sized JPEGs that were definitely different images. They scanned as 3124915 and 3110981 for a difference of 13934. So their difference was "smaller" than the difference between the related executable, even though they aren't related. So it's not a useful metric for determining difference either.

Conclusion:

As Tom Leek, said, it's an open problem for a reason.

The closest thing I have in mind is [locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing). But what are you trying to accomplish? If the tags give any indication, I don't think hashes are what you're looking for. — Adi, Oct 31 '13 at 16:46
Yes that's what I was looking for. It's a curiosity. I was wondering if what you pointed me to exists. And I was wondering: do virus scanners use this? If not, why not? — John, Oct 31 '13 at 16:55
See also: http://stackoverflow.com/questions/12952729/how-to-understand-locality-sensitive-hashing — John, Oct 31 '13 at 21:15
I think [Locality Sensitive Hashing](http://en.wikipedia.org/wiki/Locality_sensitive_hashing) will serve your purpose here. — , May 27 '14 at 09:25
@John [*"Fuzzy Hashing, can match inputs that have homologies..............."*](http://www.forensicswiki.org/wiki/Context_Triggered_Piecewise_Hashing) — Pacerier, Mar 21 '15 at 04:17

Jor-el · Answer 1 · 2013-10-31T21:34:49.830

"Approximate matching algorithms" (still a NIST draft) or "similarity preserving hash functions" could be of your interest. These algorithms are specifically designed to determine similarity between two digital objects. Some of the proposed algorithms till now (and useful) are (chronologically): ssdeep, sdhash, mrsh-v2.

To determine the similarity between objects these algorithms require a minimum chunk of data. Mrsh-v2 performing best in terms of minimum size required.

Mrsh-v2 seems to be really promising in terms of performance and minimum chunk size required, but still under development. I hope it potentially solves your issue of handling similar files.

score 6 · Accepted Answer · answered Oct 31 '13 at 17:24

There are good theoretical reasons why such a hash cannot exist, or cannot be "a hash" in the cryptographic sense of the term. To put it simply, if hash values of two "similar" inputs are themselves "similar" to each other, then you can use that to efficiently recover an input from a given output, which contradicts preimage resistance.

From your tags, I suppose that you are trying to design an antivirus software which knows of N virus "signatures" and what to detect any virus which is "similar" (for some notion of similarity) to any of these N values, but with a computational cost substantially lower than N comparisons (because N can be very high). When the notion of similarity is "exact equality" then you can sort the signatures and do a binary search with cost O(log N) (hash functions are then used to make the process even faster by ensuring that all "signatures" have a fixed constant size). However, for a notion of similarity which is not as sharp, the problem becomes hard.

Database similarity searching is a known problem of bioinformatics where it is used for sequences of nucleotids and similar objects which must be matched for in huge databases despite occasional differences. The bottom-line being that:

There are possible solutions, but they rely on a probabilistic model of the actual differences that can be encountered.
People have been searching for a good solution for decades, and are still searching.

The actual methods used by antivirus software to check for signatures without slowing down the machine to a crawl are at the core of their business, so they are understandably not very talkative about it. We can surmise that whatever solution they come up with is likely to involve a lot of tweaking and hypotheses on actual virus variations as observed in the wild.

Thanks Tom. Please see my update to my question and feel free to add your comments. — John, Oct 31 '13 at 21:08
It is never simple because typical user doesn't have the vast sample of antivirus industry to learn complexity of the mouse and cat game. File packer will be the first obstacle. Obfuscate generator is another challenge. Most hash will fail on the packer. — mootmoot, Jul 18 '16 at 17:01

score 1 · Answer 3 · edited Oct 31 '13 at 17:25

Hashing is specifically intended to make inputs look as dissimilar as possible. What you want is a clustering algorithm intended to sort "similar" items into the same or adjacent bin. Similarity is not a well defined concept, you'll need a domain-specific definition.

Just as a thought experiment, suppose you wanted to detect term paper fraud done by cutting and pasting from other documents. You might do something like:

Hash each 4-word sequence and count the number of occurrences of each hash.
Discard all hashes which occur in a large dictionary of common documents.
Bin the n most common hashes that remain.

To compare two documents for similarity, count how many binned hashes they have in common.

score 0 · Answer 4 · answered Dec 10 '21 at 22:24

You are referring to Locality Sensitive Hashing: https://en.wikipedia.org/wiki/Locality-sensitive_hashing

Side note: Hashing is not intrinsically diffuse as some have suggested. That is an explicit property of cryptographic hashes. Hashing itself is just a mapping from variable to fixed-size values (not actually completely true, see eg. ssdeep). In real world hashing functions, they are non-injective surjective functions, which basically means they can have collisions.

Is there a hash algorithm that can identify similar files or strings?

4 Answers4

Linked