Honey, I Shrunk the Model (Maybe): blk-archive vs AI Data


Because “It Should Work” Isn’t Data

After reading about the billions spent on AI infrastructure, I kept wondering: how much of that storage is just… the same bytes over and over? So I decided to find out. As I’m involved with https://github.com/device-mapper-utils/blk-archive, I thought it would be good to understand how much storage blk-archive can realistically save when pointed at AI-style datasets.

Let me be clear upfront: this isn’t a speed test or hardware benchmark. I’m only interested in one question: how many bytes go in, and how many come out?

I performed all operations at the filesystem level with normal files in directories, not at the raw block device layer, though blk-archive can work at the block level, especially with thinly provisioned block devices.

blk-archive uses content-defined chunking (CDC) with variable-sized chunks. Unlike traditional fixed-size blocks, CDC determines chunk boundaries from the data content itself, typically by watching for particular bit patterns in a rolling hash. This means that if you insert a single byte at the beginning of a file, CDC will still identify and reuse most of the same chunks, whereas a fixed-block approach would misalign everything that follows. Each unique chunk is compressed before being written to the archive.
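To make this concrete, here is a toy content-defined chunker in Python. It is only a sketch of the general technique (a gear-style rolling hash with min/max chunk sizes); the hash table, mask, and size limits are arbitrary choices for illustration, not blk-archive’s actual parameters or algorithm.

```python
import random

# Illustrative gear-style rolling hash CDC; NOT blk-archive's real algorithm.
random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random value per byte
MASK = (1 << 13) - 1                                 # ~8 KiB average chunk size

def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 65536):
    """Split data at content-defined boundaries.

    The hash state only "remembers" the most recent ~64 bytes (older
    contributions shift out of the 64-bit accumulator), so inserting a byte
    near the start of a file only perturbs boundaries locally.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Insert one byte at the front: most chunks survive unchanged.
blob = random.randbytes(200_000)
original = cdc_chunks(blob)
shifted = cdc_chunks(b"X" + blob)
orig_set = set(original)
shared = sum(len(c) for c in shifted if c in orig_set)
print(f"{shared / len(blob):.0%} of bytes land in identical chunks")
```

With a fixed-block chunker the same one-byte insertion would shift every block boundary and destroy all cross-file matches; here only the chunks near the insertion point change.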


Important Disclaimers

The Data Is Not Truly Representative. AI companies keep their training data composition private. I cannot reproduce exact corpus mixtures, internal filtering pipelines, prior deduplication passes, data refresh strategies, or proprietary preprocessing. The datasets I used are structural approximations: they measure storage behavior on AI-shaped data, not on actual commercial training datasets. The data may also not be large or diverse enough to reliably represent specific data types.

The Tooling Is Early Stage. blk-archive is early-stage software that may change formats, break compatibility, alter chunking behavior, or require re-archiving between versions. There are plans to rename it to blk-stash in the future. Nothing about the on-disk format should be considered stable yet. This post establishes a baseline for current behavior.


What Counts as AI-Style Data?

Since proprietary corpora are opaque, I’m approximating structural categories commonly found in AI workflows.

The test data includes:

- Source code (a sample of bigcode/the-stack)
- Web text (Common Crawl)
- Wiki text (Wikipedia dumps)
- Training checkpoints (generated locally)
- Completed AI models
- A synthetic dataset used as a calibration control
- Image data (CIFAR-100)
- Scrubbed web text (RedPajama)
- Tokenized training data (nanoGPT’s train.bin)

These represent a mix of highly compressible text, structured markup, binary tensor data, and already-compressed formats. Each category has very different redundancy characteristics.


Comparison Baseline: tar + gzip

To make this meaningful, I compare blk-archive against a simple baseline:

$ tar -cvzf dataset.tar.gz <dataset>

This produces a conventional compressed archive without cross-file deduplication.

Conceptually, tar + gzip removes redundancy within each file (limited by gzip’s roughly 32 KiB window), while blk-archive can additionally remove redundancy across files via chunk-level deduplication.

The practical question is: how much additional storage does deduplication actually save, beyond compression alone, for these AI-type files?
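One way to see the difference without blk-archive itself: DEFLATE (the algorithm behind gzip) can only reference matches within a window of about 32 KiB, so a block repeated a megabyte apart is invisible to it, while even a naive fixed-block deduplicator catches the repeat. A small illustrative sketch:

```python
import os
import zlib

# Two identical 1 MiB blocks of incompressible data, a megabyte apart,
# like the same file stored twice in different directories. The block size
# is a multiple of CHUNK so the naive fixed-block dedup below stays
# aligned (content-defined chunking removes the need for that alignment).
block = os.urandom(1 << 20)
corpus = block + block

# DEFLATE back-references reach only ~32 KiB, so zlib/gzip cannot "see"
# the second copy of the block at all.
compressed = zlib.compress(corpus, 9)
print(f"zlib ratio:  {len(compressed) / len(corpus):.3f}")  # near 1.0

# Even naive fixed-size dedup stores each distinct 4 KiB chunk once.
CHUNK = 4096
unique = {corpus[i:i + CHUNK] for i in range(0, len(corpus), CHUNK)}
print(f"dedup ratio: {sum(map(len, unique)) / len(corpus):.3f}")  # near 0.5
```

blk-archive’s CDC goes further than this fixed-block toy: content-defined boundaries mean the duplicate is found even when the copies are not block-aligned.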


Methodology

For each dataset:

1. Measure Raw Size

du -sb <dataset_root>

Record: the total size in bytes (RawSize).


2. Create tar + gzip Archive

tar -czvf dataset.tar.gz <dataset_root>

Record: the compressed archive size in bytes (TarGzipSize).


3. Create blk-archive and pack the files

blk-archive create -a test_archive && blk-archive pack -a test_archive <file or files of interest>

Record: the resulting archive size on disk in bytes (BlkArchiveSize).

I did not capture performance measurements; blk-archive hasn’t had many performance-focused enhancements yet.


Metrics

Deduplication Ratio: DedupRatio = RawSize / BlkArchiveSize

blk-archive Savings (%): BlkSavings% = (1 - (BlkArchiveSize / RawSize)) × 100

gzip Savings (%): GzipSavings% = (1 - (TarGzipSize / RawSize)) × 100

Additional Savings Over gzip (%): AdditionalSavings% = ((TarGzipSize - BlkArchiveSize) / RawSize) × 100, which equals BlkSavings% - GzipSavings%

This last metric isolates the value added by deduplication beyond traditional compression.

Example Calculation (Code Dataset):

RawSize = 10,737,450,312 bytes, TarGzipSize = 2,482,452,936 bytes, BlkArchiveSize = 1,850,310,398 bytes (see the appendix).

Applying the formulas:

GzipSavings% = (1 - 2,482,452,936 / 10,737,450,312) × 100 ≈ 76.88%
BlkSavings% = (1 - 1,850,310,398 / 10,737,450,312) × 100 ≈ 82.77%
AdditionalSavings% = 82.77% - 76.88% = 5.89%

This means blk-archive saves an additional 5.89% of the original data size compared to gzip alone.
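As a sanity check, the formulas can be applied to the code dataset’s byte counts from the appendix with a few lines of Python:

```python
def gzip_savings(raw: int, targz: int) -> float:
    """Percent of the raw size saved by tar + gzip alone."""
    return (1 - targz / raw) * 100

def blk_savings(raw: int, blk: int) -> float:
    """Percent of the raw size saved by blk-archive (dedup + compression)."""
    return (1 - blk / raw) * 100

def additional_savings(raw: int, targz: int, blk: int) -> float:
    """Percentage points of the raw size saved by dedup beyond gzip."""
    return (targz - blk) / raw * 100

# Byte counts for the code dataset, from the appendix table.
raw, targz, blk = 10_737_450_312, 2_482_452_936, 1_850_310_398
print(round(gzip_savings(raw, targz), 2))             # 76.88
print(round(blk_savings(raw, blk), 2))                # 82.77
print(round(additional_savings(raw, targz, blk), 2))  # 5.89
```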


Results

Dataset      | Raw (GiB) | tar+gzip (GiB) | blk-archive (GiB) | gzip Savings | blk Savings | Additional Savings Beyond gzip
-------------|-----------|----------------|-------------------|--------------|-------------|-------------------------------
Synthetic    |    154.56 |          49.96 |              9.73 |       67.68% |      93.71% |                        +26.03%
Scrubbed web |      0.96 |           0.41 |              0.21 |       57.46% |      77.66% |                        +20.20%
AI models    |      6.86 |           5.43 |              4.42 |       20.90% |      35.49% |                        +14.59%
Code         |     10.00 |           2.31 |              1.72 |       76.88% |      82.77% |                         +5.89%
Web txt      |     27.96 |           9.01 |              7.92 |       67.76% |      71.69% |                         +3.93%
Wiki txt     |     21.96 |           7.73 |              7.72 |       64.81% |      64.86% |                         +0.05%
Checkpoint   |      7.49 |           6.85 |              6.93 |        8.48% |       7.39% |                         -1.09%
Train data   |     16.83 |          11.27 |             11.80 |       33.03% |      29.88% |                         -3.15%
Image        |      0.17 |           0.16 |              0.17 |        9.29% |       4.87% |                         -4.42%

Key Patterns:

- Deduplication wins big where duplication is real and structural: the synthetic control, scrubbed web text, and related model files.
- Plain text (web and wiki dumps) compresses well with gzip alone; deduplication adds only modest gains on top.
- High-entropy or container-obfuscated data (tokenized training binaries, zip-serialized checkpoints, compressed images) can come out worse with blk-archive than with gzip alone, because metadata overhead outweighs the redundancy found.


Expected Behavior by Dataset Type, and What Was Actually Encountered

Source Code

For the source code portion of the experiment, I used data from Hugging Face, specifically bigcode/the-stack, and capped the sample at roughly 10 GiB. Given how often developers rely on copy-and-paste, along with common programming idioms, repeated license headers, and shared dependencies, I expected both gzip and deduplication to deliver meaningful space savings. That expectation largely held true. What surprised me, however, was just how compressible source code turned out to be. Considering the sheer volume of source code included in modern training datasets, even a 5%+ improvement in storage efficiency from deduplication translates into a substantial reduction at scale.


Web Text

For this experiment, I pulled a sample of web data from https://data.commoncrawl.org/crawl-data/ and capped it at roughly 28 GiB. The web tends to be repetitive: consistent HTML markup, boilerplate headers and footers, templated layouts, mirrored sites, versioned pages, and plenty of scraped or syndicated content. I therefore expected this dataset to compress extremely well and show clear gains from deduplication. The results told a different story: while the data certainly compressed, deduplication didn’t deliver the dramatic savings I had anticipated, and improvements beyond standard compression were fairly modest.


Wiki Text

When I think of wiki data, I immediately think of Wikipedia, so for this portion of the experiment I pulled data from https://dumps.wikimedia.org/enwiki/latest and capped it at roughly 22 GiB. I wasn’t entirely sure what to expect from this dataset. Text generally compresses well, so it was reasonable to assume gzip would perform strongly. What surprised me, however, was that applying deduplication to the raw input provided virtually no additional benefit.

A possible explanation is that while Wikipedia articles share structure and formatting conventions, the actual article content is highly unique. Unlike web crawls or source repositories, there aren’t large volumes of duplicated pages, mirrored content, or copy-and-paste artifacts across documents. Most redundancy likely exists within individual files rather than between them, which standard compression algorithms already handle effectively. As a result, there simply wasn’t much cross-document duplication left for deduplication to exploit.


Checkpoint Data

Building models from scratch, rather than simply using them for inference, is new territory for me. When it comes specifically to training checkpoints, I initially assumed that consecutive saves would be fairly similar, especially when taken close together during training. However, in local testing with a small model, that assumption didn’t hold up. I constructed this data with a Python script that pulls the distilbert-base-uncased model from Hugging Face; see generate_checkpoints.py in the referenced GitHub repo for the specific code.

One detail that stood out: using PyTorch’s default zip-based serialization, saving the same checkpoint twice produces different files on disk. Even when the underlying tensor data is identical, the zip container introduces differences through embedded metadata—timestamps, file ordering, compression details. A simple test confirms this:

import torch
from transformers import AutoModelForMaskedLM

MODEL_NAME = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
torch.save(model.state_dict(), "test1.pt")
torch.save(model.state_dict(), "test2.pt")
# The SHA-256 hashes of test1.pt and test2.pt differ despite identical weights

For deduplication, this container-level variability obscures underlying tensor similarity and reduces CDC chunking effectiveness.
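The container effect is easy to reproduce with the standard library alone: zipping the same payload with different timestamps yields different archive bytes. This illustrates the general zip-metadata problem rather than PyTorch’s exact writer:

```python
import hashlib
import io
import zipfile

def zip_payload(payload: bytes, timestamp: tuple) -> bytes:
    """Zip a single member with an explicit timestamp, entirely in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("data.bin", date_time=timestamp)
        zf.writestr(info, payload)
    return buf.getvalue()

payload = b"\x00" * 4096  # stand-in for identical tensor bytes
a = zip_payload(payload, (2024, 1, 1, 12, 0, 0))
b = zip_payload(payload, (2024, 1, 1, 12, 0, 2))  # "saved" two seconds later

# Identical contents, different archive bytes.
print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False
```

The timestamp lives in both the local file header and the central directory, so even a one-tick difference changes bytes in two places, splitting chunks that would otherwise deduplicate.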


Completed AI Models

For completed AI models, I grabbed a couple of smaller ones (see build_ai_dataset.sh) to keep things manageable. My initial assumption was that models derived from other models would share a significant amount of underlying data. Intuitively, that feels like it should be true.

My experiments didn’t support that assumption. The overlap wasn’t nearly as substantial as I expected. Because model files are large and my testing environment has limits, I wasn’t able to explore this as deeply as I would have liked, and I ultimately left those results out.

The dataset included here contains only two relatively small models, so the sample size is admittedly limited. Larger models, especially closely related ones, would likely make for a much more interesting analysis. That’s an area worth revisiting with more compute and storage headroom.


Synthetic Dataset

I created the synthetic dataset in build_ai_dataset.sh. It is intentionally simple and serves as a calibration control for the experiments: its contents are deliberately duplicated (see the script for the exact construction).

This dataset is designed to be an easy win. If blk-archive does not perform exceptionally well on this data, that would indicate a fundamental issue in the deduplication approach rather than a limitation of the input data.


Image Data

For image data, I used the CIFAR-100 dataset (https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz).

My expectation going in was that, since image data is typically already compressed, neither gzip nor deduplication would provide significant additional savings. The test results largely confirmed this assumption.

What was somewhat surprising, however, was how poorly deduplication performed relative to simple gzip. One possible explanation is that, in the absence of meaningful redundancy, the metadata overhead introduced by deduplication outweighs any tangible storage savings.

This is an area worth examining more closely within blk-archive.


Scrubbed Web

For this category, I used the LLaMA training data from the RedPajama dataset on Hugging Face, limited to ≤ 1 GiB.

Datasets of this type are often preprocessed with some form of deduplication. My understanding is that this is typically done using MinHash + LSH, which would, in theory, reduce the potential gains from a tool like blk-archive.

However, the results show a fairly strong reduction in size. That leaves a couple of possibilities: either this particular subset did not undergo as much MinHash + LSH deduplication as expected, or blk-archive is capturing a different class of redundancy that remains even after that preprocessing step.


Training Data

I came across this while experimenting with nanoGPT. If you follow the standard workflow, there is a “prepare” phase where you run prepare.py. When applied to the OpenWebText dataset, this process produces a train.bin file that is roughly 17 GiB in size. That file, along with a few others, serves as the input to the training process.

After training completes, the resulting model is approximately 128 MiB, with functionality roughly comparable to GPT-2 from OpenAI.

This turned out to be another case where blk-archive performed poorly relative to simple gzip compression. That result suggests there are certain data layouts, particularly large, tokenized binary blobs like train.bin, where traditional compression is simply a better fit than deduplication.

A likely factor here is entropy. Files like train.bin behave more like already-compressed data than like structured text. When the byte patterns are highly random and lack obvious repetition, there is not much for a deduplication engine to grab onto.

Potential Improvement: This suggests adding a lightweight entropy check during processing. If the data appears highly random, blk-archive could skip chunk-level deduplication for that stream and fall back to plain compression, or store the data as-is when even compression does not help.

This would help identify cases like train.bin early, rather than discovering after the fact that gzip would have been the better choice. Ideally, blk-archive should recognize these patterns and handle them more intelligently.
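As a sketch of what such a check might look like, here is a plain Shannon-entropy estimate over a byte histogram. The 7.5 bits-per-byte threshold and the head-sampling idea are assumptions for illustration; a real implementation inside blk-archive would need something cheaper and stream-oriented:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 is constant, 8.0 is uniform random."""
    if not data:
        return 0.0
    n = len(data)
    ent = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
    return max(0.0, ent)  # avoid returning -0.0 for single-symbol input

def looks_high_entropy(sample: bytes, threshold: float = 7.5) -> bool:
    """Hypothetical gate: sample a stream's head and skip dedup if it looks random."""
    return shannon_entropy(sample) >= threshold

print(shannon_entropy(b"aaaa" * 1024))          # 0.0: trivially redundant
print(shannon_entropy(bytes(range(256)) * 16))  # 8.0: every byte equally likely
```

Already-compressed or tokenized files sit near the top of this scale, which is exactly the population where the results above show deduplication losing to gzip.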


What I Learned

Across these dataset types, a few patterns became clear: the value of deduplication is highly data-dependent; plain text is already well served by compression alone; and high-entropy data can actively penalize a deduplicating archiver through metadata overhead.

Perhaps more importantly, this exploration provided a better understanding of the kinds of artifacts modern AI workflows produce. Not all AI-related files are good candidates for deduplication. Some datasets, particularly large tokenized binaries or already-compressed assets, behave almost like random data. In those cases, there is little structural redundancy to exploit, and tools like gzip may be the more appropriate choice.

There is certainly more to learn. The sample sizes were limited, and it is entirely possible that some assumptions, tooling choices, or experimental methods could be improved. That is part of the process.

What this work does provide is a baseline. I now have a clearer picture of where blk-archive performs well, where it struggles, and where future improvements should be focused. From here, iteration becomes intentional rather than exploratory.


Future Work

This experiment revealed several concrete improvement opportunities for blk-archive:

1. Intelligent Entropy Detection

High-entropy data (tokenized training binaries, already-compressed images) performs worse with deduplication than gzip alone. Adding a lightweight entropy check during processing could catch these cases early and bypass chunk-level deduplication in favor of plain compression, saving both processing time and metadata overhead.

2. Metadata Overhead Optimization

For datasets where deduplication provides minimal benefit (like images), the metadata overhead actually hurts compression ratios. Investigating ways to reduce this overhead would help close the gap with plain gzip on low-redundancy data.

3. Expanded Dataset Testing

The current experiments used limited sample sizes and may not represent all use cases. Additional testing should include larger corpora, more models (especially closely related ones such as fine-tunes of a common base), and a wider variety of data types.


Closing

The results make one thing clear: the value of deduplication depends heavily on the data type.

For synthetic datasets, preprocessed web content, and certain AI model artifacts, blk-archive delivered substantial savings beyond gzip, up to 26% additional reduction in some cases. In contrast, for high-entropy data such as tokenized training binaries and already-compressed image datasets, deduplication performed worse than gzip alone. These cases highlight clear areas where the tool can be improved.

The primary goal of blk-archive is reducing data at rest. In some environments, especially where data is read once and discarded (“read and burn”), long-term archival efficiency may not be a priority. It’s an open question how much demand there is for optimization at this phase of the data lifecycle, but as AI datasets continue to grow, storage efficiency remains a practical concern.

These findings establish a baseline and help define the next phase of blk-archive development. We now have a clearer understanding of where it performs well, where it struggles, and where engineering effort should be focused next.

For more information about the project, see the main repository: https://github.com/device-mapper-utils/blk-archive

How to Reproduce the Data

You can find all scripts and programs used to generate the test data for this post in the GitHub repository:

https://github.com/tasleson/dedupe_blog

The experiments use a custom branch of blk-archive with features that make the tool easier to use:

git clone -b dedupe_blog https://github.com/tasleson/blk-archive

I’ve made an effort to ensure everything is accurate and reproducible. That said, it’s possible I missed something, or that certain environmental assumptions (dependencies, system configuration, available storage, etc.) need to be satisfied for the results to match exactly.

If you run into issues, spot mistakes, or see opportunities to improve the methodology, please open an issue or submit a pull request. Feedback and corrections are welcome.

AI is an intensely competitive space, and much of the interesting work happens behind closed doors. That makes it even more valuable when practitioners share experiments, results, and lessons learned. Even small, practical findings can help move the broader community forward.


Appendix: Detailed Results in Bytes

For those interested in the exact measurements, here are the complete results in bytes:

Dataset      | Raw Size (bytes) | tar+gzip Size (bytes) | blk-archive Size (bytes) | gzip Savings | blk Savings | Extra Savings vs gzip
-------------|------------------|-----------------------|--------------------------|--------------|-------------|----------------------
Synthetic    |  165,958,158,420 |        53,640,196,898 |           10,444,186,805 |       67.68% |      93.71% |               +26.03%
Scrubbed web |    1,029,537,053 |           437,929,200 |              229,990,214 |       57.46% |      77.66% |               +20.20%
AI models    |    7,364,692,743 |         5,825,389,202 |            4,750,546,448 |       20.90% |      35.49% |               +14.59%
Code         |   10,737,450,312 |         2,482,452,936 |            1,850,310,398 |       76.88% |      82.77% |                +5.89%
Web txt      |   30,024,932,043 |         9,679,676,166 |            8,500,761,920 |       67.76% |      71.69% |                +3.93%
Wiki txt     |   23,585,076,069 |         8,299,099,651 |            8,287,008,094 |       64.81% |      64.86% |                +0.05%
Checkpoint   |    8,038,951,640 |         7,357,216,204 |            7,444,980,172 |        8.48% |       7.39% |                -1.09%
Train data   |   18,071,164,978 |        12,102,155,168 |           12,670,437,336 |       33.03% |      29.88% |                -3.15%
Image        |      186,301,098 |           169,001,437 |              177,223,220 |        9.29% |       4.87% |                -4.42%

Note: Written by a human, restructuring, grammar and misc. improvements by AI :-)