Tuesday, October 21, 2008

BSDCan 2008: ZFS Internals | KernelTrap

Pawel Dawidek first ported ZFS to FreeBSD from OpenSolaris in April of 2007. He continues to actively port new ZFS features from OpenSolaris, and focuses on improving overall ZFS stability. During the introduction to his talk at BSDCan, he explained that his goal was to offer an accessible view of ZFS internals. His discussion was broken into three sections: a review of the layers ZFS is built from and how they work together; a look at unique features found in ZFS and how they work internally; and a report on the current status of ZFS in FreeBSD.

The BSDCan website notes that Pawel is a FreeBSD committer, adding:

"In the FreeBSD project, he works mostly in the storage subsystems area (GEOM, file systems), security (disk encryption, opencrypto framework, IPsec, jails), but his code is also in many other parts of the system. Pawel currently lives in Warsaw, Poland, running his small company."


Derived from notes taken at a one-hour BSDCan talk by Pawel Dawidek titled "A closer look at the ZFS file system: simple administration, transactional semantics, end-to-end data integrity."

ZFS Layers

In a series of slides titled "ZFS, the internals", Pawel started with a diagram illustrating the many layers of ZFS, offering a quick overview of how it all fits together, and how it fits into FreeBSD. He then quickly moved from layer to layer.

  • SPA: Storage Pool Allocator
    Pawel explained that the Storage Pool Allocator is responsible for managing all pools: creating pools, attaching disks, replacing disks, and related tasks. The 'zpool' command communicates directly with this layer and maintains a history allowing you to review what has happened with these pools. SPA is also responsible for logging persistent data errors, accessible with the command zpool status -v, which shows all errors and lists all files affected by them. As an example use of this, Pawel pointed out that it makes it easy to quickly determine exactly which files need to be restored from a backup. (A few example zpool commands, covering this layer and the vdev layout below, follow this list.)
  • VDEV: Virtual Devices
    This layer provides a unified method of arranging and accessing devices. Pawel suggested that it looks like a mini, less flexible GEOM built inside of ZFS. It creates mirrors, RAID-Z volumes, additional caching devices, and real vdevs (disks and files). It is also responsible for handling I/O requests and laying out the blocks. Pawel's slides included a diagram showing the vdev tree.
  • ZIO: ZFS I/O Pipeline
    This layer is responsible for compression, checksum verification, encryption, and decryption. It is also responsible for managing I/O request priority. Pawel went on to explain that regular I/O requests are considered more important than those used to sync both halves of a mirror, so the act of synchronizing doesn't saturate I/O bandwidth and significantly slow down regular activities.
  • ARC: Adjustable Replacement Cache
    ZFS follows a copy on write model which is very fast for writing data, but is not necessarily fast for reading data as it may be scattered all around the disk as data is modified over time. The ARC layer tries to cache as much as possible to make reads fast.

    A feature found in the latest ZFS release, which Pawel is actively porting to FreeBSD, is the ability to use an entire device for caching, which he noted was similar to an L2 cache.

  • DMU: Data Management Unit
    Pawel described this as the heart of ZFS, explaining that all operations are transactional. Any change to the filesystem is done as a transaction, and all transactions are handled by the DMU. Every 5 seconds ZFS syncs pending transactions to the disk, assuring atomicity and removing the need for utilities like fsck. When data is modified, the changed version is written to a new place on the disk rather than overwriting the old copy of the data. Once written, the pointers are updated to point to the new data. If there's a crash in the middle of an operation, the old pointers will still lead to the old data, which will remain consistent and unmodified.
  • DSL: Dataset and Snapshot Layer
    This layer is responsible for snapshots and datasets; datasets can inherit various properties from their parent filesystems. It is also where quotas and reservations are enforced. (An example of setting these properties follows this list.)
  • ZIL: ZFS Intent Log
    ZFS provides consistency on disk through transactions, rather than through a journal. It does provide an internal journaling layer called ZIL which is used so that applications using fsync() and O_FSYNC (such as databases like MySQL and PostgreSQL) behave as expected. Writes go into the ZIL for the 5 seconds between transaction writes, but the log is never used except in the event of a crash, in which case the log will be replayed.
  • ZAP: ZFS Attribute Processor
    File and directory names are stored using ZAP layer functionality. There are two types of ZAP, micro ZAP for small entries, and fat ZAP for large directories and files with long names. The layer utilizes scalable hash algorithms to create arbitrary associations within an object set.

    Traversing the live filesystem is not easy when you have multiple datasets mounted, and that capability is provided here. It allows mirrors to be synchronized and is used when verifying all checksums in your pool.

  • ZVOL: ZFS Emulated Volumes
    In FreeBSD, this layer exports GEOM providers, allowing access to storage pool data via /dev/zvol/<dataset>. (A short example of creating a volume follows this list.)
  • ZPL: ZFS POSIX Layer
    Pawel noted that this was the most difficult layer to port from OpenSolaris. This layer provides all filesystem operations, such as open, read, chmod, etc. Pawel noted that the POSIX API is quite limiting. This layer also provides /dev/zfs, the communication gate between userland tools such as zfs(8) and zpool(8) and ZFS, used to configure the kernel and modify ZFS pools.
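
For the SPA and VDEV layers above, a minimal illustration of the corresponding zpool(8) commands; the pool name 'tank' and the disk names are invented for the example:

    # create a pool whose single vdev is a two-way mirror
    zpool create tank mirror da0 da1
    # show pool health, persistent errors and the files they affect
    zpool status -v tank
    # review the administrative history of the pool
    zpool history tank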
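
For the DSL layer, dataset properties such as quotas and reservations are managed with zfs(8); a sketch with invented dataset names:

    # create a filesystem and a child dataset; the child inherits properties
    zfs create tank/home
    zfs create tank/home/pjd
    # cap the child's space usage and guarantee it a minimum amount of space
    zfs set quota=10G tank/home/pjd
    zfs set reservation=1G tank/home/pjd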
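
For the ZVOL layer, a short sketch of creating an emulated volume (the names are illustrative):

    # create a 10 GB emulated volume
    zfs create -V 10g tank/vol0
    # in FreeBSD it shows up as a GEOM provider under /dev/zvol
    ls -l /dev/zvol/tank/vol0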

ZFS Features

  • RAID-5 versus RAID-Z
    Pawel started by talking about RAID-5, noting that RAID is supposed to stand for Redundant Array of Inexpensive Disks, then suggesting that this isn't really true: to make it work properly you either have to take a big performance hit synchronizing data to all disks whenever a write happens, in case there's a crash, or you have to use very expensive RAID controllers with sufficient RAM and battery backup. He went on to describe the write hole problem, in which following a power failure you can have valid data on your disks but parity that hasn't been updated yet, which suggests that the data is invalid. Avoiding this requires either very expensive controllers or a lengthy synchronization process after every crash.

    Pawel described RAID-Z as "similar to RAID-5, and yet so much different". RAID-Z gains from the fact that ZFS uses copy on write, and never overwrites data, avoiding the above limitations with RAID-5.

    RAID-Z is also self healing, because a checksum is written when data is written with RAID-Z, and then each time data is read the checksum is always validated. If the checksum doesn't validate, ZFS automatically attempts to reconstruct the data from the parity information, then validates this reconstructed data - if valid, it writes the corrected data back to the disk.

    Another advantage to RAID-Z is that when a disk is replaced, it doesn't blindly copy the entire disk. Instead, it only copies actual data, so if a pool is almost empty synchronization can happen very quickly. (Example commands for creating a RAID-Z pool and replacing a disk follow this list.)

  • End-to-end data integrity
    Pawel noted that other filesystems don't check whether the data written to a disk is the same data you get back when you read from the disk, allowing silent data corruption. This corruption can come from many different places, including a bad cable, a controller driver bug, or even a disk silently corrupting the data. He then offered an analogy to a mail carrier such as UPS saying something like: "Here's your package: it may be broken, and it may not even be yours, but we don't care."

    He then discussed hardware that does checksumming in the controller. For example, disks might be formatted with 520-byte sectors rather than 512-byte sectors, with the extra 8 bytes used to store checksum data. Pawel pointed out that this still does not provide end-to-end integrity: the data can still be corrupted by a bad cable, in memory, or even by a buggy driver. Returning to the mail carrier analogy, he suggested they'd be saying something like: "We can only guarantee that when the package left our office, it was okay."

    Other filesystems offer checksums providing block consistency verification, checking the block itself but not guaranteeing that the block is in the right place. Thus, a controller bug could mistakenly send writes to the wrong place, or phantom writes can happen when you think you wrote data but you didn't. Continuing the mail carrier analogy, he offered: "Here is a package. It's not broken, but it may not be yours."

    Finally, he looked at how every block in ZFS is verified against an independent checksum. The pointer to a block is stored in a parent block along with the block's checksum, so when data is read, ZFS can verify both the data itself and that it really is the block being asked for. Stepping back, he noted that because data is stored in a tree, the checksums propagate all the way up to the topmost block, which offers a single checksum of all blocks in the filesystem. He described this global checksum as a cryptographically strong signature of the entire pool. (A scrub, shown in the examples after this list, verifies every checksum in the pool on demand.)

  • Snapshots
    ZFS does not impose any limits on the number of snapshots that can be taken, and snapshot creation is a constant-time operation that doesn't slow down other filesystem operations. Only removing snapshots takes time, as ZFS has to check whether it needs to free blocks that were in the snapshot but are no longer in the filesystem. (Example snapshot commands follow this list.)

    To maintain snapshots, ZFS tracks when a block was stored using a counter incremented each time an operation is written to disk, as well as a pointer to the block and a checksum. Every snapshot maintains its own dead block list, which is reviewed when a snapshot is destroyed, freeing blocks that meet the following conditions: they were born after the previous snapshot, were born before the destroyed snapshot, died after the destroyed snapshot was created, and died before the next snapshot was created.

  • Resilvering
    When mirroring drives, it would be possible to simply copy one entire disk to another one. In RAID-Z it's not that simple, as it has to know where the block boundaries are. ZFS first traverses the metadata when synchronizing and when mirroring, giving several benefits: data integrity verification happens before the data is copied, only live data is copied so you don't waste time copying areas of the disk not containing live data, and it's possible to only copy the differences in data between two disks (similar to rsync).

    This synchronization happens from the top of the tree and works its way down, so if it is stopped mid-process by a crash, it is possible to pick up where it left off, or to obtain at least some of the data from the partially synchronized disk.
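
To illustrate the RAID-Z and resilvering discussion above, a hypothetical zpool(8) session (pool and disk names are invented):

    # create a single-parity RAID-Z pool from three disks
    zpool create tank raidz da0 da1 da2
    # replace a failed disk; only live data is resilvered
    zpool replace tank da1 da3
    # watch the resilver progress
    zpool status tank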
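
The end-to-end checksums can also be exercised on demand: a scrub traverses the pool and verifies every block against its checksum, repairing from redundancy where possible. A sketch, again with an invented pool name:

    # verify every checksum in the pool in the background
    zpool scrub tank
    # afterwards, list any persistent errors and the files they affect
    zpool status -v tank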
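
Snapshot creation and removal, as described above, map to simple zfs(8) commands; dataset names here are illustrative:

    # constant-time snapshot creation
    zfs snapshot tank/home@before-upgrade
    # list existing snapshots
    zfs list -t snapshot
    # only destruction takes time, since blocks unique to the snapshot must be freed
    zfs destroy tank/home@before-upgrade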

ZFS Status in FreeBSD
Pawel explained that he has already ported the most recent version of ZFS from OpenSolaris, and that it currently lives in his private Perforce source code repository. He noted that the port is complete code-wise and everything works, but that he's working on writing regression tests. He's already written 2,000 tests, but these only cover half of ZFS functionality -- an illustration of just how many features ZFS has. The new code will not be committed until he finishes the regression tests, so he suggests "be patient".

Cool New Features in the Latest Port

  • Delegated administration
    This makes it possible to delegate administrative rights to a non-privileged user, allowing them to create their own filesystems, take snapshots of them, and so on. (See the example after this list.)
  • L2ARC
    L2ARC, or Level 2 caching in ARC, makes it possible to use a disk in addition to RAM for caching reads. This improves read performance, which can otherwise be slow because of the fragmentation caused by the copy-on-write model used by ZFS.
  • Additional device for ZIL
    Useful for applications such as databases that make heavy use of fsync(), allowing a small, fast disk to be dedicated to ZIL entries. (Example commands for adding both cache and log devices follow this list.)
  • Access to a list of corrupted files
  • Stability improvements
    Pawel noted that he has been unable to duplicate the kmem_map panics reported with the earlier version of ZFS that was committed to FreeBSD, though he's not fully convinced yet the problem is solved.
  • ZFSboot
    In the latest code, it's possible to boot directly from ZFS, no longer requiring a small non-ZFS boot partition. "Now you can use only ZFS and just enjoy it," he stated.
  • zpool properties
  • failure modes
    The current version of ZFS in FreeBSD will panic if it's unable to write. The upcoming version that Pawel recently ported from OpenSolaris adds two other modes: wait mode is useful if a disk disappears for a short moment, as ZFS will wait and then continue to work when the disk returns; continue mode denies all further writes but allows you to continue reading data from the disk. (An example of setting the failure mode follows this list.)
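
Delegated administration is handled through the zfs allow subcommand; a minimal sketch, assuming a user named alice and a dataset tank/home/alice (both invented for the example):

    # let alice create, destroy, mount and snapshot datasets under her home
    zfs allow alice create,destroy,mount,snapshot tank/home/alice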
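
The L2ARC cache device and the separate ZIL log device are attached with zpool add; a hypothetical example with invented device names (a small, fast disk such as an SSD would typically be used):

    # add a dedicated read-cache (L2ARC) device to the pool
    zpool add tank cache da4
    # add a dedicated intent-log (ZIL) device to the pool
    zpool add tank log da5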
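
The failure modes described above correspond to the pool's failmode property (values wait, continue, and panic); a sketch of how it would be set:

    # wait for a missing device to return instead of panicking
    zpool set failmode=wait tank
    # or: refuse new writes but keep serving reads
    zpool set failmode=continue tank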

When Will ZFS Be Production Ready?
Pawel noted that he's heard this question a lot. "The experimental status is very inconvenient," he commented, to lots of laughter from the crowded room. He noted that he's currently the only maintainer, and suggested that until someone comes along to co-maintain the code and help debug things as the filesystem gains more users, he won't be marking the code as production ready. He also commented that nobody has stepped up yet to co-maintain the code, so he expects it will be a while yet.

He went on to note that he's personally used ZFS on FreeBSD in production for two years, and on his laptop for more than a year: "it just works, and it doesn't lose data. It doesn't corrupt data, and you don't have to wait for fsck."

Questions and Answers
With this, Pawel opened the floor to questions.

Q: Is the latest port available?
A: Not yet. The regression tests are being written first, then the patch will be published, then it will go into CVS.

Q: Will the new version of ZFS be able to talk to partitions created with the old version of ZFS?
A: Yes, but you will need to use a command to update the pool if you want to access the new ZFS features.
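
The command isn't named in the notes, but the standard way to upgrade a pool's on-disk version is zpool upgrade; a sketch with an invented pool name:

    # list pools whose on-disk format is older than the current version
    zpool upgrade
    # upgrade a specific pool so the new features can be used
    zpool upgrade tank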

Q: How does ZFS handle bad sectors on the disk?
A: This can be handled by mirrored disks or by using RAID-Z. In addition, ZFS always replicates its metadata, and it's possible to configure it to also replicate data on a single disk.
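
The single-disk data replication he refers to is exposed as a dataset property, commonly set like this (dataset name invented for the example):

    # keep two copies of every data block, even on a single-disk pool
    zfs set copies=2 tank/important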

Q: Does it support ACLs?
A: The new version does. In OpenSolaris they use filesystem attributes; in FreeBSD we use extended attributes. In the new version the two can be translated. It's also possible to implement POSIX ACLs, but this isn't likely to happen as it would make ZFS on FreeBSD incompatible with ZFS on OpenSolaris. There's also a Google Summer of Code project related to this.

Q: How does ZFS work with 64-bit architectures?
A: Another nice ZFS feature is that it has no endian dependencies. ZFS always writes in the architecture's native endianness, so it doesn't slow down writes by translating. When reading, it simply checks the byte order in which the data was stored and swaps bytes as needed.

Q: Can you dynamically expand filesystems?
A: Yes.
Pawel then popped up a terminal and offered a live demonstration of how it works.
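
Growing a pool, and with it the filesystems in it, is done by adding vdevs; a hypothetical example with invented device names:

    # add another mirror to the pool; datasets can use the new space immediately
    zpool add tank mirror da6 da7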

Q: How much space is allocated for snapshots?
A: No space is allocated when a snapshot is created. Space is only consumed as the live filesystem is modified and diverges from the snapshot.
