The definitive guide to diff algorithms and patch formats (2024)

Back to all blog posts Back

Ably engineering

Ably engineering

16 min readLast updatedUpdated Aug 30, 2023

A diff algorithm outputs the set of differences between two inputs. These algorithms are the basis of a number of commonly used developer tools. Yet understanding the inner workings of diff algorithms is rarely necessary to use said tools. Git is one example where a developer can read, commit, pull, and merge diffs without ever understanding the underlying diff algorithm. Having said that there is very limited knowledge on the subject across the developer community.

The purpose of this article is not to detail how Ably programmatically implemented a diff algorithm across its distributed pub/sub messaging platform, but rather to share our research and support a better understanding of diff algorithms that could be useful to implementers of diff/delta/patch functionality.

Table of contents

  1. A quick bit of context
  2. Diff algorithms: purpose and usage
  3. Three generations of diff algorithms
  4. Current delta generation algorithms
  5. Delta file formats
  6. Testing open-vcdiff and xdelta

A quick bit of context

For Ably customers like Tennis Australia or HubSpot, Message Delta Compression reduces the bandwidth required to transmit realtime messages by sending only the diff of a message. This means subscribers receive only the changes since the last update instead of the entire stream. Sending fewer bits is more bandwidth-efficient and reduces overall costs and latencies for our customers. To develop this feature we needed to implement a diff algorithm that supported binary encoding and didn’t sacrifice latency when generating deltas.

The definitive guide to diff algorithms and patch formats (3)

Diff algorithms

Purpose and usage

The output of a diff algorithm is called patch or delta. The delta format might be human readable (text) or only machine readable (binary). Human readable format is usually employed for tracking and reconciling changes to human readable text like source code. Binary format is usually space optimized and used in order to save bandwidth. It transfers only the set of changes to an old version of the data already available to a recipient as opposed to transferring all the new data. The formal term for this is delta encoding.

Binary vs. text

There seems to be a common misconception that diff algorithms are specialized based on the type of input. The truth is, diff algorithms are omnivorous and can handle any input, as long as the input can simply be treated as a string of bytes. That string might consist of the English alphabet or opaque binary data. Any diff algorithm will generate a correct delta given two input strings in the same alphabet.

The misconception that a different algorithm is required to handle binary data arises from commonly used diff/merge tools treating text and binary as if they were actually different. These tools generally aim to provide a human-readable delta, and as such focus on human-readable input to the exclusion of binary data. The assumption is that binary data is not human-readable so the delta between two binary data inputs will also not be human readable, and thus rendering it human-readable is deemed to be too much effort. Equality is the only relevant output in the case of binary diffs, and as such, a simple bit-by-bit comparison is considered to be the fastest and most appropriate solution. This categorization of algorithms by the efficiency of solution causes a partitioning of inputs into different types.

Another aspect that adds to the confusion is the line-based, word-based, and character-based classification of textual diff outputs produced by diff/merge tools. A diff algorithm that is described as “line-based” gives the impression that it produces “text-only” output, and that this means that it accepts only text input and never binary data inputs. However, line/word/character-based is not a characteristic of a diff algorithm itself; rather, it is an optimization applied to the input before feeding it to the actual diff algorithm.

Because new lines and spaces have meaning as separators in human-readable text, the diff tool can segment the string based on the hashes of the lines or words in the text. This hash string is much shorter than the original text thus saving time at the cost of reduced granularity of the diff. Additionally, line-based granularity might actually even increase human readability of the diff in some cases.

However, if the input is known to be opaque binary data, there are no meaningful separators nor human-readable diff to display, so this optimization cannot be applied. Algorithms capable of optimizing human-readable data before it becomes an input are thus prone to getting miscast as wholly incapable of processing binary data. The truth remains, however: apart from pre-processing optimization, both binary and human-readable data can be treated as strings-of-bytes inputs and easily processed.

Three generations of diff algorithms

The notion of how a diff should be generated has evolved over time.

The definitive guide to diff algorithms and patch formats (4)

String-to-string Correction or Insert/Delete

The first generation of diff algorithms solved the string-to-string correction problem and emerged in the 60s and 70s. Each of the two inputs is interpreted as a string composed of characters in some alphabet. The output is a sequence of character edits, most commonly insert/delete operations, that could be applied to one of the inputs to transform it to the other input. That makes this class of algorithms particularly suitable for generating human readable diffs on human readable inputs, e.g. different versions of the same text/source code resulting from actual edits being made over time. What helps even further is that in theory, and more often than not in practice, there is more than one minimal length sequence of edit operations that gets the job done. Various heuristics can be used to pick the edit sequences that most closely resembles actual human made edits.

The Wagner-Fischer algorithm set the foundation for this generation of diff algorithms. The Myers Algorithm is the latest improvement and the de facto standard for the generation and is currently used in multiple diff tools including the GNU diff utility.This generation of algorithms usually finds either the longest common subsequence or the minimal edit distance (usually that would be the Levenshtein distance) and uses these to generate the sequence of edits needed to transform one input into the other.

Block move or copy/insert

Pure block move

The next generation of diff algorithms was based on seemingly small optimizations over the previous generation. The character edits were upgraded to block-of-characters edits. That is, instead of expressing the diff as operations on single characters, the diff would be expressed as operations on blocks of characters. The operations are usually copy and insert where blocks of data that appear in both inputs are recorded in the delta as copied from one input to the other. The blocks unique to one of the inputs are recorded as insertions. This approach was first proposed by Walter Tichy.

Compression based block move

The definitive guide to diff algorithms and patch formats (5)

At first glance, the block move approach seems like a minor optimization. However, it has significant impact once you take into account the possibility that some block(s) of characters can repeat in some or both of the inputs. Thinking about diff generation in terms of copying blocks of data and keeping an eye out for the same block repeating more than once opens the door to using compression algorithms to generate a diff and delta file.

Compression algorithms do just that: find the biggest possible repeating blocks of data and replace each consecutive occurrence with a reference to the first occurrence. Blocks of data that never repeat are copied straight to the output. Compression algorithms are in essence block move algorithms.

If the block move analysis done by a compression algorithm is performed on both inputs of a diff algorithm, it will easily identify the common parts of both inputs. It will also point out which blocks of data are unique, that is different in both of the inputs. Once you have this data, it is straightforward to come up with a sequence of block copy/delete operations that will convert one of the inputs to the other one.

The major benefit of using compression algorithms is the greatly reduced size of the delta. A block of data will never appear more than once in the delta. It might be referred to multiple times but the actual data of the block will be contained in the delta only once. That is a major difference from the preceding approaches. It should also be mentioned that the delta size is reduced at the cost of reduced human readability.

xDelta, zDelta, Bentley/McIlroy are widely used de-facto standard implementations of diff algorithms of this generation.

Latest Upgrades

This would be the latest generation of diff algorithms. Most of its members exist only in research papers and don’t have any commercial implementations yet. They are largely based on the block move approach but offer substantial implementation optimizations, which result in claims of double-digit factor improvements in speed over the previous generation.

These optimizations are mostly focused on efficiently finding matching blocks of data in the two inputs. Various incremental hashing or compression-like techniques (such as suffix-trees) are used to achieve this purpose.

edelta, ddelta, bsdiff could be assigned to this generation of diff algorithms

Delta generation algorithms currently in use

This is a short overview of the tools and libraries focused on efficient delta/patch file generation and available at the time of writing (June 2019). Various other existing implementations of general purpose diff algorithms in different languages are omitted here for brevity’s sake.

Myers Algorithm – human readable diffs

The Myers Algorithm belongs to the string correction family and is widely used by tools fine tuned to generate human readable delta/patch files out of human readable inputs. This is used by tools such as Git Diff and GNU Diff.

Original Myers time and space complexity is O(ND) where N is the sum of the lengths of both inputs and D is the size of the minimum edit script that converts one input to the other. When there number of differences is small, as is the case with edits of the same code/text file, the algorithm is fast. Various optimizations can and have been applied to the original Myers algorithm resulting in improvements of up to O(NlgN + D^2) time and O(N) space.

Bentley-McIlroy

The Bentley-McIlroy algorithm belongs to the block move family and is focused on producing delta/patch files of optimal size. It has various implementations on different platforms and languages so it can be considered a somewhat de facto standard for scenarios where delta size matters. Google's Open VCDiff is one of the most prominent usages of Bentley-McIlroy that is able to generate a VCDiff format delta/patch.

The Bentley-McIlroy algorithm has time complexity of O(sqrt(N)*N) although the authors claim linear complexity in the average case. The memory complexity is linear.

XDelta

The XDelta algorithm belongs to the block move family and is focused on speed of delta generation. The algorithm sacrifices delta size for improved speed. The xdelta delta generation tool is the most prominent usage of XDelta and is also able to generate a VCDiff format delta/patch.

The XDelta algorithm has linear time and space complexity.

BSDiff

The BSDiff algorithm belongs to the block move family and is focused on achieving minimal delta/patch size. It is also specifically optimized for executable files. The bsdiff tool is the most prominent usage of the BSDiff algorithm. The bsdiff tool uses its own custom delta/patch file format.

BSDiff time complexity is O((n+m)log(n)) where n and m are the sizes of both inputs. Its memory complexity is max (17n,9n+m)+O(1).

Delta file formats

Standards are a good thing. And the really good thing about standards is that there are usually many to choose from. As far as delta/patch files are concerned, however, the problem is more the lack of standards rather than abundance. The many diff tools and libraries produce delta/patch files in their own custom formats and consequently only the producer of the patch is able to apply it.

That being the case, historically, two major attempts of standardization of the delta/patch format have emerged.

Unix .patch

This is a family of delta/patch formats produced by the GNU diff tool that is aimed at human readability. The GNU diff tool has been around for a long time, and these patch formats are widely accepted/used with or without modifications by various text processing tools and source control systems.

VCDiff

VCDiff is the most prominent attempt at creating a data-agnostic and algorithm-agnostic delta/patch format aimed at compactness and speed of application. VCDiff gained quite the adoption related to Google’s SDCH (Shared Dictionary Compression for HTTP) effort. Nowadays several diff algorithm implementations are able to generate delta/patch files in VCDiff format. VCDiff delta application libraries in various states of maturity exist for most of the popular languages and platforms.

VCDiff term disambiguation – patch format vs algorithm

In RFC3284 the term VCDiff is used to name both a delta/patch file format and a diff algorithm. Moreover, the diff algorithm going by the name VCDiff is proprietary. Numerous research papers also test or refer to the VCDiff algorithm. While a proprietary diff algorithm by that name does actually exist, VCDiff is also the name of an algorithm-agnostic delta/patch file format. Any of the algorithms here could generate delta files in the VCDiff format.

Testing open-vcdiff and xdelta

We chose the Google open-vcdiff and xDelta algorithms for testing since they are mature, use the more advanced block move approach, produce small size delta/patch files, and are not line-based but are straightforwardly applied to opaque binaries.

Even more importantly both of them are able to produce delta/patch files in the relatively universal and open VCDiff format. Adopting an open format means we can fix any bugs and/or implement decoders when necessary. Ably as a company also advocates for open standards so it’s important for us to adopt them in our own stack wherever possible.

Last but not least, both of them are open source and can be built as libraries and incorporated in various applications. Indeed there were multiple choices of implementation of the compression algorithms available in a good set of languages for building decoders.

Our tests are far from complete or statistically significant. They are simply aimed at giving you a feel for how these algorithms behave in the field.

Test setup

The tests were done using the latest official implementations of the algorithms found on GitHub at the time of writing this post (June 2019).

Both algorithms expose a great number of tweaks and settings like memory window size that greatly affect their performance. A deliberate effort has been made to run both under the same settings but mistakes are possible.

Tests used the xDelta CLI.

Test results: Average time over 3 minutes execution in a loop

open-vcdiff (in milliseconds)xdelta (in milliseconds)
2MB JSON9630
MIX #13879619044
MIX #233421131
MIX #3239249499
Big binary112397560
Small binary94201
Inverted463119
Small JSON0.00630.0042
Small text0.00210.0059

The above is where:

2MB JSON2.1MiB old, 2.2MiB new JSON files.
MIX #1229.7MiB old, 231.6MiB new. Fiddler Everywhere product, all files concatenated to a single file.
MIX #2137.7MiB old, 139MiB new. Emacs source code all files concatenated to a single file.
MIX #3365.2MiB old, 365.9MiB new. Emacs for Windows all files concatenated to a single file.
Big binary89MiB old, 92.1MiB new. Fiddler Everywhere AppImage files.
Small binary160KiB old, 1.6MiB new. Images.
Inverted2.1MiB old, 2.1MiB new. JSON and bit-by-bit inverted version of the same JSON.
Small JSON1.1KiB old, 1.3KiB new JSON files.

Delta size comparison

open-vcdiffxdelta
2MB JSON221.0KiB237.6KiB
MIX #186.1KiB82.6KiB
MIX #21.9MiB1.2MiB
MIX #358.5MiB48.8MiB
Big binary52.7MiB52.7MiB
Small binary1.6MiB1.6MiB
Inverted1.2MiB2.0MiB
Small JSON1.0KiB809 bytes
Small text615 bytes585 bytes

Ably’s choice: xDelta

In the end we chose xDelta at Ably primarily because there was a good quality implementation of the algorithm in native code with O(n) complexity. That is, in the worst case scenario Ably discards a delta that is bigger than the original message but we don't waste much time generating this delta. This helps us easily handle the trade-off between bandwidth saved by generating deltas and the CPU costs required to generate said deltas.

xDelta and VCDIFF in action at Ably

This is an American transit source. If you happen to be reading this post at a time when there are no busses running - such as early in the morning in Europe - you won't see any data.

Message Delta Compression in action on realtime transit data from CTransit bus times

We sincerely hope this article saves you time and effort researching all this information, and that it provides the required knowledge in a single place for anyone looking to implement diff/delta/patch functionality.

About Ably

Ably is a realtime messaging platform. We deliver billions of realtime messages everyday to more than 50 million end-users across web, mobile, and IoT platforms.

Developers use Ably to build realtime capabilities in their apps with our multi-protocol pub/sub messaging (including message delta compression), presence, and push notifications, free streaming data sources from across industries like transport and finance, and integrations that extend Ably into third party clouds and systems like AWS Kinesis and RabbitMQ.

Both businesses and developers choose to build on Ably because we provide the only realtime platform architected around Four Pillars of Dependability: performance, high availability, reliability, and integrity of data. This allows our customers to focus on their code and data streams while we provide unrivaled quality of service, fault-tolerance, and scalability.

We’re hiring for roles across our commercial and engineering teams :).

Recommended articles

Ably engineeringCollaborative experiences17 min readCursor Everywhere: An experiment on shared cursors for every websiteAug 18, 2023Ably engineering30 min readGroup chat app with OpenAI's GPTMay 24, 2023Ably engineering14 min readCRDTs are simpler and more common than you thinkFeb 14, 2023

Join the Ably newsletter today

1000s of industry pioneers trust Ably for monthly insights on the realtime data economy.

The definitive guide to diff algorithms and patch formats (2024)

FAQs

What is the difference between diff and patch? ›

Use the diff command in Linux to discover subtle differences between code files. Then, use the patch command to update those code files to match. Linux administration requires you to know a large number of commands.

What are the diff algorithms in Git? ›

In Git, there are four diff algorithms, namely Myers, Minimal, Patience, and Histogram, which are utilized to obtain the differences of the two same files located in two different commits. The Minimal and the Histogram algorithms are the improved versions of the Myers and the Patience respectively.

What is the basic diff algorithm? ›

A diff algorithm outputs the set of differences between two inputs. These algorithms are the basis of a number of commonly used developer tools. Yet understanding the inner workings of diff algorithms is rarely necessary to use said tools.

What is diffing algorithms? ›

Diffing short for Differences Algorithm is used to differentiate the DOM Tree for efficient updates. React utilize diffing algorithm to identify the changes in the newly created virtual dom and previous version of dom after any changes are made.

How to convert diff to patch? ›

It's a simple 2 steps process:
  1. Generate the patch: git diff > some-changes.patch.
  2. Apply the diff: Then copy this patch to your local machine, and apply it to your local working copy with: git apply /path/to/some-changes.patch. And that's it! The changes are now in your working copy and ready to be staged/commit/pushed :)

How many different types of patches are there? ›

There are 9 different types of patches, as woven & embroidered patches, leather & PVC patches, chenille & name patches, bullion & printed or iron-on patches. They can make clothes more fashionable and interesting. So, many clothing, bags, etc. will add patches & labels to get more attention and brand promotion.

What is diff format in git? ›

git diff is a command that takes two inputs, and computes the difference between them. Inputs can be commits, but also files, and even files that have never been introduced to the repository.

Does GitHub have a diff tool? ›

You can also compare two arbitrary commits in your repository or its forks on GitHub in a two-dot diff comparison. To quickly compare two commits or Git Object IDs (OIDs) directly with each other in a two-dot diff comparison on GitHub, edit the URL of your repository's "Comparing changes" page.

What are the three 3 types of algorithms? ›

Types of Algorithms:
  • Sorting algorithms: Bubble Sort, insertion sort, and many more. ...
  • Searching algorithms: Linear search, binary search, etc. ...
  • Graph Algorithms: It is used to find solutions to problems like finding the shortest path between cities, and real-life problems like traveling salesman problems.
Oct 16, 2023

What is diff format? ›

Typically, diff is used to show the changes between two versions of the same file. Modern implementations also support binary files. The output is called a "diff", or a patch, since the output can be applied with the Unix program patch.

What is the visual diff algorithm? ›

As we discussed, Visual diff algorithms compare the pixel values of two images. If there is a difference in the pixel values, the algorithm marks that pixel as changed. The algorithm then visually represents the changes, typically as a heatmap or overlay.

What is diff algorithm and flowchart? ›

Algorithm Vs. Flowchart. Algorithms and flowcharts are different mechanisms used for designing different programs, particularly in computer programming. An algorithm is a step-by-step summary of the procedure, while on the other hand, a flowchart illustrates the steps of a program graphically.

What is the tree diff algorithm? ›

At its core, the Tree Diff Algorithm is a mechanism used to identify and update changes in the structure and content of the Document Object Model (DOM). It is a fundamental part of the Virtual DOM, a concept popularized by React and adopted by various modern JavaScript frameworks.

What does patch the difference mean? ›

: to mend or put together especially hastily or clumsily. c. : to deal with successfully : settle. usually used with up. patched up their differences.

What is the purpose of a patch? ›

Patch: Corrects errors in the software and resolves security vulnerabilities. Update: Often a more extensive update to add software features (e.g., an improved user interface), but also to improve performance and generally fix bugs.

What is patch in differential geometry? ›

A patch is a differentiable mapping →x:U↦Rn, where U is an open subset of R2. Wikipedia gives this definition of a chart: A chart for a topological space M is a homeomorphism ϕ from an open subset U of M to an open subset of a Euclidean space.

Why is it called a patch? ›

Note physical patches used to correct punched holes by covering them. Historically, software suppliers distributed patches on paper tape or on punched cards, expecting the recipient to cut out the indicated part of the original tape (or deck), and patch in (hence the name) the replacement segment.

Top Articles
Latest Posts
Article information

Author: Corie Satterfield

Last Updated:

Views: 6704

Rating: 4.1 / 5 (62 voted)

Reviews: 93% of readers found this page helpful

Author information

Name: Corie Satterfield

Birthday: 1992-08-19

Address: 850 Benjamin Bridge, Dickinsonchester, CO 68572-0542

Phone: +26813599986666

Job: Sales Manager

Hobby: Table tennis, Soapmaking, Flower arranging, amateur radio, Rock climbing, scrapbook, Horseback riding

Introduction: My name is Corie Satterfield, I am a fancy, perfect, spotless, quaint, fantastic, funny, lucky person who loves writing and wants to share my knowledge and understanding with you.