zfs

Flying Zones and x86 Virtualization

When we started using our Flying Zone architecture we were really limited to Solaris and some Linux (via a branded Zone).  That was still pretty interesting, but it didn’t really get us to where I wanted to be.  One day, while sitting at Equinix in one of their Ashburn data centers building out a cage for Sun Services, I had an idea.

Sun acquired a company called Innotek at the beginning of 2008 which produced a product called VirtualBox.  It is an x86 type 2 hypervisor and a competitor to Parallels Desktop and VMware Workstation/Fusion - one major difference is that it is free!  Once the acquisition closed I had access to the developers.  I had several dialogues with them asking for features based on what I was trying to accomplish…I wanted to run VirtualBox inside a Zone.

I was eventually able to create a Zone on Solaris 10 x86 and I installed VirtualBox (VBOX) inside it.  Now this was getting pretty cool.  VirtualBox has a mode where you can run it “headless” - in this mode you can configure everything via the CLI.  This is analogous to configuring Xen or KVM from the CLI, but the difference was I could utilize the power of ZFS (snapshots, cloning, etc.) and Zones.
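
To give a feel for the headless workflow, here is a minimal sketch of creating and booting a VM entirely from the Zone’s shell (the VM name and memory size are just placeholders):

# "winguest" and the 1024 MB memory size are placeholders
VBoxManage createvm --name winguest --register
VBoxManage modifyvm winguest --memory 1024
VBoxManage startvm winguest --type headless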

VBOX was running successfully within a Zone.  The first test I did was to create a Windows VM, take a ZFS snapshot, and then send/receive it (again, cloning wasn’t available yet) to see if I could quickly clone the VM.  That test was successful and I had a new VM in a matter of seconds.  Awesome!  Thoughts of building a hosting company around this methodology quickly filled my mind because I figured no one else was using ZFS, Zones, and VBOX in this manner.
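
The snapshot-and-clone test boils down to a couple of ZFS commands along these lines (the pool and dataset names are hypothetical):

# hypothetical pool/dataset layout holding the VM's disk image
zfs snapshot tank/vbox/win-gold@v1
zfs send tank/vbox/win-gold@v1 | zfs receive tank/vbox/win-clone01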

I eventually dug a bit deeper and I hit a roadblock.  At the time there were some licensing restrictions on using the pre-compiled Sun version and the open source version was missing some key components.  That effectively killed my idea for a quick way to create and host VMs.  

My idea to host x86 VMs inside a Zone has effectively been on hold for a while.  Sun did start up an xVM project to run Xen on Solaris, but unfortunately Oracle killed off that project so they could focus on their own Oracle VM (Xen on Linux) product.  Then something very exciting happened on August 22, 2011…Joyent ported KVM to Illumos (a fork of OpenSolaris).

Joyent is now running KVM within a Zone and taking advantage of all the great features ZFS offers today.  They’ve actually created their own custom distribution of Illumos called SmartOS.  To see more about what they’re up to, check out the SmartOS Wiki.  This has become the foundation for their cloud offering, Joyent Cloud.

In my next post I’ll go through sharing my experiences with KVM on Illumos.

Porthos

My last server, Porthos, was an Intel Pentium D 915 with 4GB of RAM. It had six 3.5" disks - two 200GB drives mirrored for the operating system (OpenSolaris) and four 300GB drives in a RAIDZ1 giving a little less than 1TB of usable primary storage.

I say “was,” but it’s still sitting here beside my hulking desktop machine. They’re both in Lian-Li PC-61 cases (18" tall x 19" deep x 8" wide) with 400w Enermax Liberty PSUs. The reason it’s not my server now is because the RAIDZ filesystem got punk’d by a software bug and is no longer able to mount. All four of the drives were almost five years old at the time that the filesystem went down; it was overdue.

I learned a lesson when it went down, one that I really already knew but hadn’t yet taken to heart: RAID is not backup. This is not the sort of lesson one wants to learn the hard way.

My current server, Law, is a Seagate Dockstar that has been modded to boot Arch Linux from a 2.5" 160GB laptop disk attached via USB. The whole setup uses 7W under load and is completely silent. The old server used 120w at idle and was kept in the garage because it was loud enough to wake the dead.

I’ve got a plan now to build a new server using some lower-power Mini-ITX components. Average wattage for the machine below should be 20w or so, with peaks up to 80w. The machine will be almost entirely silent (barring disk noise, which from 5400RPM disks should be minimal) and will have no active cooling. Primary storage will be on an external USB disk, which is currently a miserly 160GB drive but will eventually be replaced with a 1-2TB drive in a Samsung Story Station enclosure.



Some tests of power usage of devices:

  • Linksys router 6 - 10w
  • Ooma 6 - 10w
  • Panasonic phone 1w
  • Sony Receiver (off) 0w
  • Sony Receiver (on, idle) 46w
  • Powered Speakers (off) 3 - 4w
  • Powered Speakers (on, idle) 6w
  • Desktop PC (off) 4 - 5w
  • Desktop PC (on, idle) 117 - 128w
  • Server PC (off) 4 - 5w
  • Server PC (on, idle) 110w
  • Monitors (on, no signal) 2 - 3w (each)
  • Monitors (on, active) 65w (each)
  • USB external HDD (on, idle) 7w
  • Dockstar + 160GB HDD 7w

Some info about an alternative project: attaching a bare LCD from a laptop to a Mini-ITX board, which is then attached to a balanced-arm lamp stand.

The connector type between the board and the LCD is LVDS. The LCD itself is a donor from a Gateway MT6711 laptop with a dead systemboard/CPU. The display itself is a CHI MEI N154I2-L02:

N154I2-L02 is a 15.4" TFT Liquid Crystal Display module with single CCFL Backlight unit and 30 pins LVDS interface. This module supports 1280 x 800 Wide-XGA mode and can display 262,144 colors. The optimum viewing angle is at 6 o'clock direction. The inverter module for Backlight is not built in.

The connector on the LCD is JAE FI-XB30SL-HF10; one possible LVDS cable to connect from this display is available from ebay for $6.


some links

ZFS Installation on Linux (ubuntu)

ZFS, or the Zettabyte File System, is an extremely robust, ground-up rethinking of how a filesystem should work.  It has been designed from the start to prevent data loss and corruption, and it offers complete file system management, as opposed to just being another layer on top of physical hardware.  This includes built-in support for SMB/CIFS and NFS file sharing.

ZFS works very differently from most previous file systems.  Once it was open-sourced, a few Linux groups attempted ports so it could be used on POSIX systems.  Due to licensing conflicts between ZFS and the Linux kernel, there are two primary ways of using ZFS on Linux:

  1. A user-space implementation (pre-packaged binary)
  2. A kernel-space implementation (manual compilation and installation)

This post details how to install ZFS and use the basic commands on Ubuntu Lucid (10.04) Server, with a RAIDz (RAID-5-style) implementation example.

ZFS Installation

user-space

In Ubuntu, the fastest way to get ZFS support is to install the user-space (FUSE) module.  This is a great and easy way to play around with ZFS without getting your hands too dirty, but I’d recommend installing ZFS in kernel-space if you want to do any serious data storage.  Just apt-get install zfs-fuse:

sudo apt-get install zfs-fuse

… and you are ready to roll on to setting up ZFS

kernel-space

To build ZFS support into the kernel, we need to build ZFS from source on our own machine to get around kernel licensing/shipping issues.  We’ll need to add the PPA (Personal Package Archive) for ZFS:

sudo add-apt-repository ppa:dajhorn/zfs

sudo apt-get update

sudo apt-get install ubuntu-zfs

… Just sit back and watch as the ZFS source is downloaded, compiled and installed on your system.  You’re now ready to play with ZFS.

ZFS Drive Configuration

1) Identify disks by path

First, we want to make sure our drives always appear in /dev in the same locations, so we can consistently identify the same drive on Linux.  Because drives spin up and are added asynchronously as they become ready, they can be labeled in a seemingly random order.  This step fixes that problem and creates device nodes specifically for the ZFS disks.

First, we want to find all of our disks using the full PCI bus path:

ls -l /dev/disk/by-path

2) Update /etc/zfs/zdev.conf to map disks by path to zpool

Now we’ll create/edit the /etc/zfs/zdev.conf configuration file, which automagically creates the zpool device nodes on boot, in the following format:

<label> <PCI bus location> <# optional note; I use HD serials>

sudo vim /etc/zfs/zdev.conf

disk1 pci-0000:00:1f.2-scsi-1:0:0:0 #HD: 22d3d234-s3a1

disk5 pci-0000:00:1f.2-scsi-5:0:0:0 #HD: 2243d634-s3a7

3) Refresh /dev with new device nodes

Now just trigger a udev update to populate the new device nodes in /dev:

sudo udevadm trigger

ls -l /dev/disk/zpool

disk1

disk2

disk3

disk4

disk5

4) Create GPT partitions for 2TB+ support

Now we will initialize the drives with a GPT partition table to allow 2TB+ support per disk.  Since we’ve programmatically created the disk nodes, this is easy to do with a quick bash script:

for x in $(seq 1 5); do

  sudo parted -s /dev/disk/zpool/disk${x} mklabel gpt

done

… and now we are ready to create our first ZFS RAIDz zpool!

5) Enable ZFS mounts at boot, and un-mounts at shutdown

By default on Ubuntu, ZFS filesystems will not be automatically mounted or unmounted during startup or shutdown.  To enable automatic mounting and unmounting, edit /etc/default/zfs and enable these options:

sudo vim /etc/default/zfs

# Automatically run `zfs mount -a` at system startup if set non-empty.

ZFS_MOUNT='yes'

# Automatically run `zfs unmount -a` at system shutdown if set non-empty.

ZFS_UNMOUNT='yes'

Creating RAIDz zpool

A zpool is a pool of disks for use by the ZFS filesystem.  In classical setups, this is where you would use hardware or software RAID to make a container for your filesystem.  ZFS does this for you.  In our example, we’re going to make a 5-disk RAIDz pool with autoexpansion enabled (automatically grow the pool when drives are upgraded).  The amazing thing with ZFS is that as soon as the pool is created, you can start using the filesystem.  There is no long format/allocation step, as required by ext3/ext4 filesystems.

1) Create zpool

We create a zpool with auto-expansion enabled, name the pool 'uber', use our programmatically created disk IDs for the disks, and mount the pool at /data when it is created:

sudo zpool create -o autoexpand=on -m /data uber raidz /dev/disk/zpool/disk{1..5} 

2) Check zpools on system (optional)

Once this is done, you can list the current zpools with the list command:

sudo zpool list

NAME   SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
uber   13.6T  10.0T  3.58T  73%  1.00x  ONLINE  -

3) Check zpool status (optional)

And you can check the current status of the zpool with the status command:

sudo zpool status uber

4) Disable built-in SMB/CIFS sharing and NFS sharing

ZFS has built-in support for SMB/CIFS and NFS sharing.  However, I still like to create my own NFS shares, so I disable this functionality on my zpool.  To disable these properties, just run the following commands:

sudo zfs set sharenfs=off uber

sudo zfs set sharesmb=off uber

There are a ton of properties you can set per pool and dataset.  You should check them all out and set the ones that are best for you.  To get a list of all the properties and their current values, just run:

sudo zfs get all uber
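
For example, to turn on compression for the pool’s top-level dataset (just one property out of many; whether it helps depends on your data):

sudo zfs set compression=on uber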

As you can tell, there are a ton of things that this filesystem can do.

Scrubbing zpool

Scrubbing a ZFS pool runs through all of the stored data and attempts to detect errors.  It does this by verifying the checksum on every block and automatically repairing a bad block from a redundant copy (mirror or RAIDz parity), if one exists.  For cheap hardware this is often recommended weekly, but in my practical experience monthly is enough.  To kick off a ZFS scrub (which keeps your filesystem mounted and accessible) just run the following command:

sudo zpool scrub uber

This runs pretty quickly on current hardware.  You can check the status at any time by running a status check:

sudo zpool status uber

This is really only an introduction to an extremely flexible and powerful filesystem.  It became my primary filesystem almost overnight.

Flying Zones

A great friend of mine (@timkennedy) and I architected something at Sun Microsystems we called “Flying Zones.”  We had been using ZFS for quite a while and were very impressed with its ease of use and overall simplicity, and we wanted to take it to the next level.  Solaris 10 introduced containers, also known as Zones.  The great thing about a Zone was that you could virtualize Solaris in two distinct fashions.  First, you could create a “sparse root” Zone, which means the virtualized instance shares libraries with the host OS (the Global Zone).  Second, you could create a “full root” Zone, which gets its own complete copy of the system directories.

Creating a Zone or a virtual instance (VM) of Solaris is not in itself a big deal.  The secret sauce is what you do with your VM.  At a previous employer an engineer accidentally overwrote /etc/passwd and /etc/shadow on several machines.  We were able to recover in a few hours, but what if we had another way?  Enter the Global Zone.  The Global Zone has full access to the underlying filesystem of a sparse root or full root zone.  To fix a mistake like this we could simply restore the /etc/passwd and /etc/shadow files directly to each zone from the Global Zone…this is extremely powerful!
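
As a concrete sketch, assuming a hypothetical zonepath of /zones/web01, the fix from the Global Zone is just an ordinary file copy into the zone’s root:

# /zones/web01 is a hypothetical zonepath; <zonepath>/root is the zone's filesystem root
cp /etc/passwd /zones/web01/root/etc/passwd
cp /etc/shadow /zones/web01/root/etc/shadow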

Moving on, we realized the power of Zones and began to think, “What else can we do?”  We began to implement our customers’ (Sun Services) deployments all within full root Zones.  The next big thing we thought about was how we could speed up deployments.  We began creating “gold disk” images within a Zone and taking ZFS snapshots of them; back then ZFS did not have the “clone” feature, so we would send and receive the snapshot to a new filesystem.  We were able to stand up our customers’ systems within minutes.

Continuing down this path, we thought this was great and all, but what happens when you lose a machine?  Of course your system is down and now you have to recover or fix the host.  We came up with another idea: what if we put the zpool on a shared SAN LUN?  This would give us the ability to import the underlying ZFS zpool onto another host!  Well, we did, and it worked - with a little elbow grease we imported the Zone config file on the other host and were able to boot the Zone there.  Now we had a way, at least in theory, to quickly recover a customer Zone if the host failed.  Tim and I were both extremely excited by this.  We weren’t done yet.
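
A rough sketch of that kind of manual recovery might look like the following (the pool, zone, and file names are hypothetical, and the exact attach options vary by Solaris release):

# on the surviving host: import the shared pool that holds the zone root
zpool import -f zonepool
# recreate the zone from a config saved earlier with 'zonecfg -z web01 export'
zonecfg -z web01 -f /zonepool/configs/web01.cfg
# attach the existing zone root and boot it
zoneadm -z web01 attach
zoneadm -z web01 boot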

The ability to import a ZFS zpool hosting a full root Zone was pretty amazing, but we weren’t totally satisfied yet.  Average engineers are by nature somewhat lazy and managers occasionally have to use the stick to motivate them, but great engineers are smart enough to just automate everything they do so a manager never has to use that stick.  Tim and I went the automation route and decided to use another piece of Sun software, Sun Cluster.  With Sun Cluster you can automate the failover of a ZFS zpool and bring the Zone config file along with it, giving you the ability to fail a Zone over from host to host, much like VMware or Xen high availability works today.
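
For the zpool piece, a minimal Sun Cluster sketch might look like the commands below (the resource group, resource, and pool names are hypothetical, and the zone boot itself is handled by the separate Solaris Containers agent, which I’m leaving out here):

# register the storage resource type and create a failover resource group (names are hypothetical)
clresourcetype register SUNW.HAStoragePlus
clresourcegroup create zone-rg
# tie the shared ZFS pool to the group via HAStoragePlus
clresource create -g zone-rg -t SUNW.HAStoragePlus -p Zpools=zonepool zonepool-rs
# bring the group online; it can now fail over between cluster nodes
clresourcegroup online -M zone-rg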

Tim and I architected and deployed this solution over 5 years ago, and we still use the same methodology today.

Why is ZFS not in Linux?

ZFS is a popular open source file system that was developed by Sun Microsystems for Solaris. This wonderful file system is licensed under the CDDL, an OSI-approved open source license.

Over time ZFS became part of OpenSolaris (which is no more) and the BSDs, but it has not become part of Linux. Many people, like Alan Cox, blame Sun for using the CDDL, a GPL-incompatible license, for ZFS. There are even stories saying that Sun was afraid of Linux, which is why it chose the CDDL [1].


Installing FreeBSD with ZFS from the PC-BSD installer

I didn't take screenshots of the install.

Set a hostname,
then choose a new install and select FreeBSD (not PC-BSD).

For partitioning, use the automatic setup:
check "Use Entire Disk" (use the whole disk as one),
set the file system to ZFS,
and check the option about partitioning it as GPT.

I'll retake the screenshots if I feel like it.

Set a root password, create a user,
and don't add any of the optional packages.


The install is pretty fast.


After the install, the DHCP-assigned IP address is shown near the top, so note it down (if it isn't shown, you can just run ifconfig later).
Log in and become root.

Change the keyboard map (check the symbol mapping with vi).
Editing the system config file is apparently the proper way, but for now switch it with a command:

# kbdcontrol -l /usr/share/syscons/keymaps/jp.106.kbd

;; for the persistent setting,
;; add keymap=jp.106 to the config and reboot

Fiddle with the rc.d settings.
If DHCP is running on the network it will already be configured for DHCP,
so leave that alone and add:

# nameserver [only if you have a separate name server; probably unnecessary with DHCP]
sshd_enable="YES"

then start sshd:
# /etc/rc.d/sshd start

If the mail aliases.db doesn't exist:
# newaliases

Check the time:
# date
Wed Nov 9 01:38:09 UTC 2011

It's UTC, so set the timezone:

# touch /etc/wall_cmos_clock
# ln -s /usr/share/zoneinfo/Asia/Tokyo /etc/localtime
;; I don't fully understand these two lines

Set the clock and configure NTP:
# ntpdate ntp.nict.jp

# vi /etc/ntp.conf
server ntp.nict.jp
:wq

# vi /etc/rc.conf
ntpd_enable="YES"
:wq

Update the system:
# freebsd-update fetch
# freebsd-update install
# reboot
;; but the fetch failed when it couldn't find a mirror
;; maybe you can't update on STABLE?? Does it need to be RELEASE?

That's it for today.

# /etc/rc.d/ntpd start