zfs

Flying Zones and x86 Virtualization

When we started using our Flying Zone architecture we were really limited to Solaris and some Linux (with a branded Zone).  That was still pretty interesting but didn’t really get us to where I wanted to be.  One day, while sitting at Equinix in one of their Ashburn data centers building out a cage for Sun Services, I had an idea.

Sun acquired a company called Innotek back at the beginning of 2008 which produced a product called VirtualBox.  It is an x86 type 2 hypervisor, a competitor to Parallels Desktop and VMware Workstation/Fusion - one major difference being that it is free!  Once we made the acquisition I had access to the developers.  I had several dialogues with them asking for features based on what I was trying to accomplish…I wanted to run VirtualBox inside a Zone.

I eventually was able to create a Zone on Solaris 10 x86 and installed VirtualBox (VBox) inside it.  Now this was getting pretty cool.  VirtualBox has a mode where you can run it “headless” - in this mode you can configure everything via the CLI.  This is analogous to configuring Xen or KVM from the CLI, but the difference was that I could utilize the power of ZFS (snapshots, cloning, etc.) and Zones.
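For flavor, the headless workflow looks roughly like this - the VM name and settings are hypothetical, and flag spellings vary between VirtualBox releases, so treat it as a sketch rather than exact commands:

# create and register a VM, give it some memory, then start it with no GUI
VBoxManage createvm --name "win2003" --register
VBoxManage modifyvm "win2003" --memory 1024
VBoxHeadless --startvm "win2003"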

VBox was running successfully within a Zone.  The first test I did was to create a Windows VM, take a ZFS snapshot, and then send/receive it (again, cloning wasn’t available yet) to see if I could quickly clone the VM.  That test was successful and I had a new VM in a matter of seconds.  Awesome!  Thoughts of building a hosting company around this methodology quickly filled my mind because I figured no one else was using ZFS, Zones, and VBox in this manner.
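The cloning step boils down to a snapshot plus a send/receive of the dataset backing the VM’s disk image; a minimal sketch with hypothetical dataset names:

# snapshot the dataset holding the Windows VM's disk image and replicate it for a new VM
zfs snapshot tank/vbox/winxp@golden
zfs send tank/vbox/winxp@golden | zfs receive tank/vbox/winxp-clone

The clone would still need to be registered as its own VM in VirtualBox, but the heavy lifting of copying the disk is done by ZFS.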

I eventually dug a bit deeper and I hit a roadblock.  At the time there were some licensing restrictions on using the pre-compiled Sun version and the open source version was missing some key components.  That effectively killed my idea for a quick way to create and host VMs.  

My idea to host x86 VMs inside a Zone has effectively been on hold for a bit.  Sun did start up an xVM project to run Xen on Solaris but unfortunately Oracle killed off that project so they could focus on their own Oracle VM (Xen on Linux) product.  Something very exciting happened on August 22, 2011…Joyent ported KVM to Illumos (a fork of OpenSolaris).

Joyent is now using KVM within a Zone and taking advantage of the great features ZFS offers today.  They’ve actually created their own custom distribution of Illumos called SmartOS.  To see more about what they’re up to, check out the SmartOS Wiki.  This has become the foundation for their cloud offering, Joyent Cloud.

In my next post I’ll share my experiences with KVM on Illumos.

Porthos

My last server, Porthos, was an Intel Pentium D 915 with 4GB of RAM. It had six 3.5" disks - two 200GB drives mirrored for the operating system (OpenSolaris) and four 300GB drives in a RAIDZ1 giving a little less than 1TB of usable primary storage.

I say “was,” but it’s still sitting here beside my hulking desktop machine. They’re both in Lian-Li PC-61 cases (18" tall x 19" deep x 8" wide) with 400w Enermax Liberty PSUs. The reason it’s not my server now is because the RAIDZ filesystem got punk’d by a software bug and is no longer able to mount. All four of the drives were almost five years old at the time that the filesystem went down; it was overdue.

I learned a lesson when it went down, one that I really already knew but hadn’t yet taken to heart: RAID is not backup. This is not the sort of lesson one wants to learn the hard way.

My current server, Law, is a Seagate Dockstar that has been modded to boot Arch Linux from a 2.5" 160GB laptop disk attached via USB. The whole setup uses 7W under load and is completely silent. The old server used 120w at idle and was kept in the garage because it was loud enough to wake the dead.

I’ve got a plan now to build a new server using some lower-power Mini-ITX components. Average wattage for the machine below should be 20w or so, with peaks up to 80w. The machine will be almost entirely silent (barring disk noise, which from 5400RPM disks should be minimal) and will have no active cooling. Primary storage will be on an external USB disk, which is currently a miserly 160GB drive but will eventually be replaced with a 1-2TB drive in a Samsung Story Station enclosure.



Some tests of power usage of devices:

  • Linksys router 6 - 10w
  • Ooma 6 - 10w
  • Panasonic phone 1w
  • Sony Receiver (off) 0w
  • Sony Receiver (on, idle) 46w
  • Powered Speakers (off) 3 - 4w
  • Powered Speakers (on, idle) 6w
  • Desktop PC (off) 4 - 5w
  • Desktop PC (on, idle) 117 - 128w
  • Server PC (off) 4 - 5w
  • Server PC (on, idle) 110w
  • Monitors (on, no signal) 2 - 3w (each)
  • Monitors (on, active) 65w (each)
  • USB external HDD (on, idle) 7w
  • Dockstar + 160GB HDD 7w

Some info about an alternative project: attaching a bare LCD from a laptop to a Mini-ITX board, which is then attached to a balanced-arm lamp stand.

The connector type between the board and the LCD is LVDS. The LCD itself is a donor from a Gateway MT6711 laptop with a dead systemboard/CPU. The display itself is a CHI MEI N154I2-L02:

N154I2-L02 is a 15.4" TFT Liquid Crystal Display module with single CCFL Backlight unit and 30 pins LVDS interface. This module supports 1280 x 800 Wide-XGA mode and can display 262,144 colors. The optimum viewing angle is at 6 o'clock direction. The inverter module for Backlight is not built in.

The connector on the LCD is a JAE FI-XB30SL-HF10; one possible LVDS cable to connect this display is available from eBay for $6.


some links

How to Replace a Disk in the ZFS Root Pool
# zpool offline rpool c1t0d0s0
# cfgadm -c unconfigure c1::dsk/c1t0d0
Replace the disk in c1t0d0
# cfgadm -c configure c1::dsk/c1t0d0
# zpool replace rpool c1t0d0s0
# zpool online rpool c1t0d0s0
# zpool status rpool

SPARC# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t0d0s0
x86# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0

ZFS Installation on Linux (Ubuntu)

ZFS, or the Zettabyte File System, is an extremely robust, ground-up rethinking of how a filesystem should work.  It has been designed from the ground up to prevent data loss and corruption; in addition, it offers complete file system management, as opposed to just being another layer on top of physical hardware.  This includes built-in support for SMB/CIFS and NFS file sharing.

ZFS works much differently than most previous file systems.  Once it was open-sourced, a few Linux groups attempted to port it to other POSIX systems.  Due to licensing issues between ZFS and the Linux kernel, there are two primary ways of using ZFS on Linux:

  1. A user-space implementation (pre-packaged binary)
  2. A kernel-space implementation (manual compilation and installation)

This post will detail how to install and use the basic ZFS commands on an Ubuntu Lucid (10.04) server, with a RAIDz (similar to RAID-5) implementation example.

ZFS Installation

user-space

In Ubuntu, the fastest way to get ZFS support is to install the user-space module.  This is a great and easy way to play around with ZFS without getting your hands too dirty, but I’d recommend installing ZFS in kernel-space if you want to do any serious data storage.  Just apt-get install zfs-fuse:

sudo apt-get install zfs-fuse

… and you are ready to roll on to setting up ZFS

kernel-space

To build ZFS support into the kernel, we need to build ZFS from source to get around kernel licensing/shipping issues.  We’ll need to add the PPA (Personal Package Archive) for ZFS:

sudo add-apt-repository ppa:dajhorn/zfs

sudo apt-get update

sudo apt-get install ubuntu-zfs

… Just sit back and watch as the ZFS source is downloaded, compiled and installed on your system.  You’re now ready to play with ZFS.

ZFS Drive Configuration

1) Identify disks by path

First, we want to make sure our drives always appear in /dev in the same locations, so we can consistently identify the same drive on Linux.  Because drives spin up and are added asynchronously as they become ready, they will otherwise be labeled in a seemingly random order.  This fixes that problem and creates device nodes specifically for the ZFS disks.

First, we want to find all of our disks using the full PCI bus path:

ls -l /dev/disk/by-path

2) Update /etc/zfs/zdev.conf to map disks by path to zpool

Now we’ll create/edit the /etc/zfs/zdev.conf configuration file, to automagically create the zpool disk nodes on boot, in the following format:

<label> <PCI bus location> <# optional note; I use HD serials>

sudo vim /etc/zfs/zdev.conf

disk1 pci-0000:00:1f.2-scsi-1:0:0:0 #HD: 22d3d234-s3a1

disk5 pci-0000:00:1f.2-scsi-5:0:0:0 #HD: 2243d634-s3a7

3) Refresh /dev with new device nodes

Now just trigger a udev update to populate the new device nodes in /dev:

sudo udevadm trigger

ls -l /dev/disk/zpool

disk1

disk2

disk3

disk4

disk5

4) Create GPT partitions for 2TB+ support

Now we will initialize the drives with a GPT partition table, to allow 2TB+ support per disk.  Since we’ve programmatically created the device nodes, this is easy to do with a quick bash script:

for x in $(seq 1 5); do
  # -s runs parted non-interactively; write a GPT label on each disk
  sudo parted -s /dev/disk/zpool/disk${x} mklabel gpt
done
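To confirm the labels took, you can print each disk’s partition table (disk1 shown as an example):

sudo parted /dev/disk/zpool/disk1 print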

… and now we are ready to create our first ZFS RAIDz zpool!

5) Enable ZFS mounts at boot, and un-mounts at shutdown

By default on Ubuntu, ZFS filesystems will not be automatically mounted or unmounted during startup or shutdown.  To enable automatic mounting and unmounting, edit /etc/default/zfs and enable these options:

sudo vim /etc/default/zfs

# Automatically run `zfs mount -a` at system startup if set non-empty.

ZFS_MOUNT='yes'

# Automatically run `zfs unmount -a` at system shutdown if set non-empty.

ZFS_UNMOUNT='yes'

Creating RAIDz zpool

A zpool is a pool of disks for use by the ZFS filesystem.  In classical setups, this is where you would use hardware or software RAID to make a container for your filesystem; ZFS does this for you.  In our example, we’re going to make a 5-disk RAIDz pool with autoexpansion enabled (automatically grow the pool on drive upgrades).  The amazing thing with ZFS is that as soon as the pool is created, you can start using the filesystem.  There is no need for the long formatting process required by ext3/4 filesystems.

1) Create zpool

We create a zpool with auto-expansion enabled, labeling the pool 'uber', using our programmatically created disk IDs, and automatically mounting the pool at /data when it is created:

sudo zpool create -o autoexpand=on -m /data uber raidz /dev/disk/zpool/disk{1..5} 

2) Check zpools on system (optional)

Once this is done, you can list the current zpools with the list command:

sudo zpool list

NAME   SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
uber   13.6T  10.0T  3.58T  73%  1.00x  ONLINE  -

3) Check zpool status (optional)

And you can check the current status of the zpool with the status command:

sudo zpool status uber

4) Disable built-in SMB/CIFS sharing and NFS sharing

ZFS has built-in support for SMB/CIFS and NFS sharing.  However, I still like to create my own NFS shares, so I disable this functionality on my pool.  To disable these attributes, just run the following commands:

sudo zfs set sharenfs=off uber

sudo zfs set sharesmb=off uber

There are a ton of options that you can set on a pool and its datasets.  You should check them out and set the ones that are best for you.  To get a list of all the options, just run:

sudo zfs get all uber
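For example, one commonly tuned property is compression; a quick sketch (the available values vary by ZFS version):

sudo zfs set compression=on uber

sudo zfs get compression uber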

As you can tell, there are a ton of things that this filesystem can do.

Scrubbing zpool

Scrubbing a ZFS file system runs through the filesystem and attempts to detect errors in the stored data.  This is done by verifying checksums on the stored blocks and automatically repairing bad data from a redundant copy, if one exists.  For cheap hardware this should be done weekly, but in practical experience monthly is fine.  To kick off a ZFS scrub (which keeps your filesystem mounted and accessible), just run the following command:

sudo zpool scrub uber

This runs pretty quickly on current hardware.  You can check the status at any time by running a status check:

sudo zpool status uber
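To scrub on a schedule, a root crontab entry along these lines works (the timing and the path to zpool are just examples and may differ on your system):

# run a scrub at 03:00 on the first day of every month
0 3 1 * * /sbin/zpool scrub uber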

This is really only an introduction to an extremely flexible and powerful filesystem.  It became my primary filesystem almost overnight.

Flying Zones

A great friend of mine (@timkennedy) and I architected something at Sun Microsystems we called “Flying Zones.”  We had been using ZFS for quite a while and were so impressed with its ease of use and overall simplicity that we wanted to take it to the next level.  Solaris 10 introduced containers, also known as Zones.  The great thing about a Zone was that you could virtualize Solaris in two distinct fashions.  First, you could create a “sparse root” Zone, which means the virtualized instance shares libraries with the host OS (the Global Zone).  Second, you could create a “full root” Zone.

Creating a Zone, or a virtual instance (VM) of Solaris, is not in itself a big deal.  The secret sauce is what you do with your VM.  At a previous employer an engineer accidentally overwrote /etc/passwd and /etc/shadow on several machines.  We were able to recover in a few hours, but what if we had another way?  Enter the Global Zone.  The Global Zone has full access to the underlying filesystem of a sparse root or full root zone.  To fix a mistake like this we could simply restore the /etc/passwd and /etc/shadow files directly to each zone from the Global Zone…this is extremely powerful!

Moving on, we realized the power of Zones and began to think, “What else can we do?”  We began to implement our customers’ (Sun Services) deployments entirely within full root Zones.  The next big thing we thought about was how we could speed up deployments.  We began creating “gold disk” images within a Zone and taking ZFS snapshots of them; back then ZFS did not have the “clone” feature, so we would send and receive each snapshot to a new filesystem.  We were able to stand up our customers’ systems within minutes.
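A rough sketch of that send/receive cloning step, with hypothetical pool and dataset names (tank/zones/gold is the gold image, tank/zones/customer1 the new zone root):

# snapshot the gold zone root and replicate it as the root of a new customer zone
zfs snapshot tank/zones/gold@v1
zfs send tank/zones/gold@v1 | zfs receive tank/zones/customer1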

Continuing down this path, we thought this was great and all, but what happens when you lose a machine?  Your system is down and now you have to recover or fix the host.  We came up with another idea: what if we put the zpool on a shared SAN LUN?  This would give us the ability to import the underlying ZFS zpool onto another host!  Well, we did, and it worked; with a little elbow grease we imported the Zone config file on the other host and were able to boot the Zone there.  We now had a way, in theory, to quickly recover a customer Zone if the host failed.  Tim and I were both extremely excited by this.  We weren’t done yet.
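In outline, the manual failover looked something like this - pool, zone, and path names are hypothetical, and it assumes the zone config had been exported ahead of time with zonecfg export:

# on the surviving host: import the pool that lives on the shared SAN LUN
zpool import zonepool
# recreate the zone from its exported config and bring it up
# (zoneadm attach may need -F if the zone was never cleanly detached)
zonecfg -z webzone -f /zonepool/webzone.cfg
zoneadm -z webzone attach
zoneadm -z webzone boot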

The ability to import a ZFS zpool hosting a full root Zone was pretty amazing, but we weren’t totally satisfied yet.  Average engineers are by nature somewhat lazy and managers occasionally have to use the stick to motivate them, but great engineers are smart enough to just automate everything they do so a manager never has to use that stick.  Tim and I went the automation route and decided to use another piece of Sun software, Sun Cluster.  With Sun Cluster you can automate the failover of a ZFS zpool and bring the Zone config file with it, giving you the ability to fail a Zone over from host to host, just like VMware’s or Xen’s high availability works today.

Tim and I architected and deployed this solution over five years ago and we still use the same methodology today.

Why is ZFS not in Linux?

ZFS is a popular open source file system developed by Sun Microsystems for Solaris.  This wonderful file system is licensed under the CDDL, an OSI-approved open source license.

Over time ZFS became part of OpenSolaris (which is no more) and the BSDs, but it has not become part of Linux.  Many people, like Alan Cox, blame Sun for licensing ZFS under the CDDL, which is GPL-incompatible, and there are stories saying that Sun was afraid of Linux and chose the CDDL for that reason [1].


Installing FreeBSD with ZFS from the PC-BSD installer

I didn’t take screenshots of the install.

Set a hostname.
Choose a fresh install and select FreeBSD (not PC-BSD).

For partitioning, use the automatic setup:
check “Use Entire Disk” (use it as one disk),
set the filesystem to ZFS,
and check the option to partition it as GPT.

I’ll retake the screenshots if I feel like it.

Set a root password, create a user,
and don’t add any optional packages.


The install is pretty quick.


After installation, the DHCP IP address is shown near the top of the screen, so note it (if it isn’t shown, just run ifconfig later).
Log in and become root.

Change the keyboard map (check the symbol mapping in vi).
Apparently editing the sysconfig is the proper way, but for now switch it with a command:

# kbdcontrol -l /usr/share/syscons/keymaps/jp.106.kbd

;; for sysconfig:
;; add keymap=jp.106 and reboot

Tweak rc.d a bit.
If DHCP is running in your environment it will already be set up for DHCP,
so leave that alone and append:

# nameserver [only needed if you have a separate name server; probably unnecessary with DHCP]
sshd_enable="YES"

Then start it:
# /etc/rc.d/sshd start

If the mail aliases.db is missing:
# newaliases

Check the time:
# date
Wed Nov 9 01:38:09 UTC 2011

It’s in UTC, so set the timezone:

# touch /etc/wall_cmos_clock
# ln -s /usr/share/zoneinfo/Asia/Tokyo localtime
;; I don’t quite understand these two lines

Sync the clock and set up NTP:
# ntpdate ntp.nict.jp

# vi /etc/ntpd.conf
server ntp.nict.jp
:wq

# vi /etc/rc.conf
ntpd_enable="YES"
:wq

Update the system:
# freebsd-update fetch
# freebsd-update install
# reboot
;; but fetch failed because it couldn’t find a mirror
;; maybe you can’t update on STABLE?? Does it have to be RELEASE?

Anyway, that’s it for today.

# /etc/rc.d/ntpd start

OpenIndiana with iSCSI to Mac Initiator

Don’t have enough space on your existing Time Machine backup disk?  Need your Time Capsule to be expandable or redundant, or to just work correctly?

Try this recipe!

(next up: Time Machine with OpenIndiana/FCoE) 

Prereqs:
pkg install SUNWiscsit
pkg install SUNWstfm
svcadm enable -r iscsi/target:default

Creation of zvol:
zfs create -V 500G rspool/TimeMachine

Add to SCSI block disk:
sbdadm create-lu /dev/zvol/rdsk/rspool/TimeMachine

Get the GUID for adding to the SCSI target mode framework, etc.:
sbdadm list-lu

Add to SCSI target mode framework:
stmfadm add-view 600144f044bb4a0000004e204e5e0001

Create iSCSI target:
itadm create-target

Get target info (target name, etc) for initiator, deleting target, etc:
itadm list-target -v

Get an iSCSI initiator for the Mac:
http://www.studionetworksolutions.com/products/product_detail.php?pi=11
- Open dmg, install and reboot.
- Use info from itadm list-target to configure adding the target onto the Mac in System Preferences/globalSAN iSCSI target initiator.
- Add IP to iSCSI target initiator.
- Giggity.

Expand the zvol:
zfs set volsize=750G rspool/TimeMachine 

Delete target:
itadm delete-target -f iqn.1986-03.com.sun:02:8f0c7105-0262-e2c5-9c4e-a59edebd42b4

Remove from SCSI target mode framework:
stmfadm remove-view -l 600144f044bb4a0000004e204e5e0001

Remove from SCSI block disk:
sbdadm delete-lu 600144f044bb4a0000004e204e5e0001

Recover space from zvol:
zfs destroy rspool/TimeMachine

Citations:
http://www.tek-blog.com/main/index.php?blog=2&title=comstar_howto&more=1&c=1&tb=1&pb=1
And wherever Mike Schenck’s blog went. 

Nexenta, OpenSolaris and NFS on Softlayer

I’ve been playing around with Nexenta on some new servers at Softlayer. Why Softlayer? Well, by hosting in their new Dal05 data center I can easily get Micron SSD and 10Gbit-E (public and private) for a great price. Since I use ZFS and NFS, both of these options can go a long way for performance.

Installing Nexenta was pretty easy.  I downloaded the latest ISO from NexentaStor and booted it over IPMI using Softlayer’s lockbox.  As for the hardware, I am running it on a 56xx processor, 48GB of RAM, Adaptec 5805Z controllers, SSDs for the boot pool, ZIL, and L2ARC, and a bunch of 2TB disks for storage.

As I said, installing was the easy part.  The real trick is to run through some performance tests to see how it works.  On our current servers at Rackspace we are not using SSDs for the ZIL, which hurts us when it comes to write performance over NFS; the reason is how NFS commits writes using fsync().  A more detailed description of the ZIL can be found here and here.  A good test, for me at least, is to simply extract a large set of files to an NFS share and time how fast it is.  I grabbed a tar of the Linux kernel and went to work.  Before I started running the tests, I wanted to make sure the settings were correct.  When it comes to ZFS, this can be a pain.  Not only do you have a lot of options in ZFS, but you also need to figure out how ZFS should interact with your storage.  For instance, if you are using a RAID controller with write cache and a BBU, you might need to adjust ZFS properties.  Here are some of the settings I used and the results of the tests.

The basic test

My control for the test was to extract the Linux kernel tarball from local disk onto an NFS share (the exact command is sketched after the results below).  Before doing this, I also tested the performance when extracting to local disk (an SSD mirror) and to our existing system that does not have SSDs for the ZIL.

  • Extract tar to local disk: 5.5 seconds
  • Extract tar to NFS on old server: 6 min 26 seconds 
  • Extract tar to NFS on new server: Almost useless (read below)
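For reference, the timing test itself is just a tar extraction onto the NFS mount; a minimal sketch with hypothetical file and mount-point names:

# time the extraction of a large tree of small files onto the NFS share
time tar xzf linux-2.6.tar.gz -C /mnt/nfs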

Disabling nocacheflush

When you first set up Nexenta, it gives you an option to disable cache flushing for your pools.  Usually ZFS is used with JBOD and no RAID controllers.  Since we are using the Adaptec controllers, we already have write caching and flushing on the hardware, along with a battery backup (and UPS) in case of power loss.  In some cases the controller can ignore the default cache flush requests from ZFS, but in our case it does not.  When I tried to extract the tar over NFS it was so slow I had to kill the process.  The easy fix is to set zfs_nocacheflush to yes in the appliance preferences.

  • Extracted tar to NFS: 54 seconds
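For reference, the same tunable can also be set outside the appliance UI on Solaris-derived systems; a sketch of the /etc/system approach (only sensible if the controller cache is battery-backed, and a reboot is required):

# /etc/system - tell ZFS to skip cache flush requests to the controller
set zfs:zfs_nocacheflush = 1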

Turning off the controller cache

A lot of times you will see recommendations to turn off the write cache on the controller if you use ZFS; ZFS was designed to work with cheaper hardware and to manage data integrity in the file system itself.  I figured it would be worth testing this, so I went into the controller settings, disabled write and read caching for all of the disks, and turned cache flushing in ZFS back on.  I then went back and ran the tests again.

  • Extracted tar to NFS: 2 min 30 seconds

I was actually surprised to see this.  I figured that with ZFS handling everything the speed would be about the same.  From what I have read the Adaptec controller is pretty good, so it seems worth it to use the hardware’s NVRAM write cache.

Disabling ZIL (don’t)

If you have ever read the Evil Tuning Guide, you know that one option for great NFS performance is to disable the ZIL.  Essentially this bypasses the data protection that ZFS provides in exchange for better performance: it allows the NFS client to write files without acknowledgment that they have been written to stable storage.  I wanted to test this out of curiosity to see what the difference was.

  • Extracted tar to NFS: 30 sec

That is by far the fastest I can get it over NFS, but not worth it at all for our system.  Data integrity and protection are actually more important than uptime or performance, so I’m not willing to take any risks.  Instead, we can rely on the Micron SSDs as log devices.
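Rather than disabling the ZIL, a dedicated log (slog) device can be added to the pool; a hedged sketch with a hypothetical pool name and SSD device names:

# add a dedicated ZIL device to the pool
zpool add tank log c3t0d0
# or, safer, use a mirrored pair of SSDs as the log
zpool add tank log mirror c3t0d0 c3t1d0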

Conclusion

For now I am going to settle for having a very fast network combined with disabling the cache flush in ZFS.  While it is nowhere near the write speed of local disk, it’s pretty good considering the test is extracting thousands of small files.  When it comes to NFS and shared file systems, the overhead is mostly in the protocol.  For us, problems really happen when we have spikes in activity or heavy load, so these faster disks combined with a better network should really help.  I’ll have to run some load tests next.

techudemy.com

Udemy- Storage Area Network with Oracle ZFS on Centos Linux : L1 [100% OFF]

http://techudemy.com/udemy-storage-area-network-with-oracle-zfs-on-centos-linux-l1-100-off/
This course has 28 lectures, over 12 ratings, and 910 students enrolled.  Instructed by Muhamad Elkenany and Mohamed Nawar.

flickr

Kamikochi by ubic from tokyo
Via Flickr:
Kamikochi (上高地), Nagano Prefecture.  Nikon D700 + Distagon T*2/35 ZF.2