Rustbucket

Sorting a terabyte of data in the late 1990s meant serious hardware, serious planning, and probably a serious budget approval process. Today you can do it on a workstation before lunch. I wanted to know how fast, so I wrote rustbucket to find out.

It’s a two-phase external sort implemented in Rust, built around io_uring, and named for reasons that should be obvious to anyone who has spent time with either Rust or storage systems.
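For readers who have not met the term, a two-phase external sort can be sketched roughly as follows. This is a minimal illustration using only the standard library with ordinary buffered file I/O, not io_uring, and it is not taken from rustbucket itself. The function names (`write_sorted_runs`, `merge_runs`, `external_sort`), the `u64` key type, and the run file naming are all assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::{Path, PathBuf};

/// Phase 1 (hypothetical helper): cut the input into runs of at most
/// `run_len` keys, sort each run in memory, and write it to its own file.
fn write_sorted_runs(input: &[u64], run_len: usize, dir: &Path) -> io::Result<Vec<PathBuf>> {
    assert!(run_len > 0, "run_len must be non-zero");
    let mut paths = Vec::new();
    for (i, chunk) in input.chunks(run_len).enumerate() {
        let mut run = chunk.to_vec();
        run.sort_unstable();
        let path = dir.join(format!("run-{}.bin", i));
        let mut out = BufWriter::new(File::create(&path)?);
        for key in &run {
            out.write_all(&key.to_le_bytes())?;
        }
        out.flush()?;
        paths.push(path);
    }
    Ok(paths)
}

/// Read the next little-endian key from a run, or None at end of file
/// (a truncated trailing key is also treated as end of file).
fn next_key(reader: &mut impl Read) -> io::Result<Option<u64>> {
    let mut buf = [0u8; 8];
    match reader.read_exact(&mut buf) {
        Ok(()) => Ok(Some(u64::from_le_bytes(buf))),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(None),
        Err(e) => Err(e),
    }
}

/// Phase 2 (hypothetical helper): k-way merge of the run files, using a
/// min-heap that holds the current head key of every run.
fn merge_runs(paths: &[PathBuf]) -> io::Result<Vec<u64>> {
    let mut readers = Vec::new();
    for p in paths {
        readers.push(BufReader::new(File::open(p)?));
    }
    let mut heap = BinaryHeap::new();
    for (idx, r) in readers.iter_mut().enumerate() {
        if let Some(k) = next_key(r)? {
            heap.push(Reverse((k, idx)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((k, idx))) = heap.pop() {
        out.push(k);
        // Refill the heap from the run that just supplied the smallest key.
        if let Some(next) = next_key(&mut readers[idx])? {
            heap.push(Reverse((next, idx)));
        }
    }
    Ok(out)
}

/// Run both phases, then delete the temporary run files.
fn external_sort(input: &[u64], run_len: usize, dir: &Path) -> io::Result<Vec<u64>> {
    let runs = write_sorted_runs(input, run_len, dir)?;
    let sorted = merge_runs(&runs)?;
    for p in &runs {
        std::fs::remove_file(p)?;
    }
    Ok(sorted)
}
```

Calling `external_sort(&keys, run_len, &dir)` with an existing scratch directory returns the keys in ascending order. A real tool would stream the input and output from disk rather than hold them in a `Vec`; the sketch keeps them in memory only so the two phases stay easy to see.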


Honey, I Shrunk the Model (Maybe): blk-archive vs AI Data

Because “It Should Work” Isn’t Data

After reading about the billions spent on AI infrastructure, I kept wondering: how much of that storage is just… the same bytes over and over? So I decided to find out. Since I’m involved with blk-archive (https://github.com/device-mapper-utils/blk-archive), I wanted to understand how much storage it can realistically save when pointed at AI-style datasets.

Let me be clear upfront: this isn’t a speed test or hardware benchmark. I’m only interested in one question: how many bytes go in, and how many come out?
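To make the "bytes in, bytes out" question concrete, here is a toy measurement. It is only a sketch under simplifying assumptions: it splits the data into fixed-size chunks and counts each distinct chunk once, and it ignores compression and metadata overhead. It makes no claim about how blk-archive actually chunks or stores data, and the function name `dedup_bytes` is hypothetical.

```rust
use std::collections::HashSet;

/// Returns (bytes_in, bytes_out), where bytes_out counts every distinct
/// fixed-size chunk exactly once. Illustrative only.
fn dedup_bytes(data: &[u8], chunk_size: usize) -> (usize, usize) {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    // The set borrows chunks from `data`, so nothing is copied.
    let mut seen: HashSet<&[u8]> = HashSet::new();
    let mut bytes_out = 0;
    for chunk in data.chunks(chunk_size) {
        // `insert` returns true only the first time a chunk is seen.
        if seen.insert(chunk) {
            bytes_out += chunk.len();
        }
    }
    (data.len(), bytes_out)
}
```

For example, `dedup_bytes(b"abcdabcdabcdxy", 4)` sees the chunk `abcd` three times and `xy` once, so 14 bytes go in and 6 come out.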
