A slow disaster has been unfolding since the end of June in the BTRFS community. This open source file system, also known as “ButterFS,” includes modern features that few other Linux file systems offer natively, such as snapshots, pooling and integrated multi-device spanning.
According to BTRFS contributor Goffredo Baroncelli on the BTRFS mailing list, the file system computes parity incorrectly when scrubbing a corrupted RAID 5 filesystem. The behavior isn’t even this straightforward, however, and may require a complete rewrite of the portions of the software in question.
According to the BTRFS mailing list, when a Linux kernel running BTRFS encounters a corrupted strip of data on a RAID disk, it properly writes a fix to disk, but then recomputes the parity of the data incorrectly and overwrites the good parity with the bad. The result is a stripe whose parity no longer matches its data, so a later rebuild that relies on that parity can return bad data.
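To make the failure mode concrete, here is a toy model of RAID 5 style XOR parity in Python. It is only a sketch, not BTRFS code: the function names, the four-byte strips and the way the bad parity is produced (recomputed from the still-corrupt copy of the data) are all assumptions chosen for illustration. The mailing list reports do not say exactly how the kernel arrives at the wrong parity.

```python
from functools import reduce


def xor_parity(strips):
    """Return the byte-wise XOR of equally sized strips (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def rebuild_strip(strips, parity, bad_index):
    """Reconstruct the strip at bad_index from the other strips plus parity."""
    survivors = [s for i, s in enumerate(strips) if i != bad_index]
    return xor_parity(survivors + [parity])


# Toy stripe: two data strips and one parity strip.
good = [b"AAAA", b"BBBB"]
parity = xor_parity(good)

# Simulate on-disk corruption of the first data strip.
on_disk = [b"AXAA", good[1]]

# Correct scrub: rebuild the damaged strip from parity and leave parity alone.
repaired = rebuild_strip(on_disk, parity, bad_index=0)

# Faulty scrub (assumed here): parity is recomputed from the corrupt copy
# and written over the good parity, even though the data itself was fixed.
bad_parity = xor_parity(on_disk)
after_scrub = [repaired, good[1]]

# Later the second disk fails, and its strip is rebuilt from the bad parity.
rebuilt = rebuild_strip(after_scrub, bad_parity, bad_index=1)

print(repaired == good[0])  # the scrub fixed the data
print(rebuilt == good[1])   # the later rebuild does not match the original
```

In this model the damage stays invisible until a disk is lost: the data reads back correctly after the scrub, and the mismatch only surfaces when the overwritten parity is needed to reconstruct a missing strip.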
Over the past few years, users have reported that replacing the first drive in a RAID has worked, but when a second drive is replaced, the RAID array crashes as if the first device were no longer working. When the bug manifests, it seems to destroy both disks in such a scenario.
Chris Murphy, a BTRFS contributor, wrote to the mailing list, “What’s very clear by now is that RAID56 mode as it currently exists is more or less fatally flawed, and a full scrap and rewrite to an entirely different RAID56 mode on-disk format may be necessary to fix it.”
Currently, there is no specific fix planned, though developers are still trying to find out what the cause is. Eventually, it would seem, this entire section of the BTRFS code will need to be rewritten.