Deduplication on Virtual Machine Disk Images

The demand for multiplexing services on a single computer is ever increasing. This demand drives virtual machine applications and research. The more virtual machines that can be properly deployed, the greater the benefit of multiplexing.

However, storage requirements threaten the scalability of mass virtual machine deployment. It is natural to ask: if we deploy the same operating system with merely different software installed, can we store only the differences between the images?

The answer is "yes". Deduplication locates data similarity by chunking the monolithic disk images into finer-grained data blocks and comparing the blocks by their IDs, which are usually SHA-1 hashes. Many factors can affect the rate of space saving, such as the operating system version, vendor, and lineage, the order in which software is installed and removed, and the package management system. Moreover, chunk-wise compression may or may not significantly decrease the image size further. All of these questions need research and answers.
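As a rough illustration of how such a space-saving rate could be measured, the sketch below splits images into fixed-size chunks, identifies each chunk by its SHA-1 hash, and reports the saving with and without chunk-wise compression. The 4096-byte chunk size, the use of zlib as the compressor, and the function names are assumptions made for this example, not choices stated in the text.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed chunk size; the text does not fix one


def chunk_ids(image: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (sha1_hex, chunk) for each fixed-size chunk of an image."""
    for offset in range(0, len(image), chunk_size):
        chunk = image[offset:offset + chunk_size]
        yield hashlib.sha1(chunk).hexdigest(), chunk


def space_saving(images, chunk_size: int = CHUNK_SIZE):
    """Return (dedup_saving, dedup_plus_compression_saving) as fractions."""
    unique = {}  # chunk ID -> chunk data, one entry per distinct chunk
    total = 0    # raw size of all images together
    for image in images:
        total += len(image)
        for cid, chunk in chunk_ids(image, chunk_size):
            unique.setdefault(cid, chunk)
    if total == 0:
        return 0.0, 0.0
    stored = sum(len(c) for c in unique.values())
    # zlib stands in for "chunk-wise compression" here (an assumption).
    compressed = sum(len(zlib.compress(c)) for c in unique.values())
    return 1 - stored / total, 1 - compressed / total
```

Calling `space_saving([image_a, image_b])` on two images that share half of their chunks would report a deduplication saving of 0.25; whether compression adds much on top of that depends on the chunk contents.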

We can use fixed-size chunking or variable-size chunking to slice the 0xF300-byte image files:

[Figures: fixed-size chunking; variable-size chunking]

In fixed-size chunking, if chunk 1 and chunk 121 are identical, we need to keep only one copy of the data and link both chunks to that stored copy, thus saving space.
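The linking described above can be sketched as a small chunk store: each image is kept as a list of chunk IDs (a "recipe"), and identical chunks point to a single stored copy. The class name `ChunkStore`, its methods, and the default chunk size are hypothetical and chosen only for illustration.

```python
import hashlib


class ChunkStore:
    """Toy fixed-size chunk store; identical chunks are stored once."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # SHA-1 hex digest -> chunk bytes (the real storage)

    def put_image(self, image: bytes):
        """Store an image and return its recipe, a list of chunk IDs."""
        recipe = []
        for offset in range(0, len(image), self.chunk_size):
            chunk = image[offset:offset + self.chunk_size]
            cid = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(cid, chunk)  # keep only the first copy
            recipe.append(cid)                  # the link to the stored copy
        return recipe

    def get_image(self, recipe):
        """Rebuild an image by following the links in its recipe."""
        return b"".join(self.chunks[cid] for cid in recipe)
```

For example, storing `b"AAAABBBBAAAA"` with `chunk_size=4` yields a three-entry recipe whose first and last entries are the same ID, while only two chunks are actually kept in `chunks`.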