Lazy devops config recoverability with mercurial

At Woome we often suddenly create a new set of key information, such as a regularly changing nginx config. We use mercurial extensively, so the typical next step at this point is to initialise a repo for the data set and check everything in. As a team we’ve trained ourselves to check in changes to most admin data sets automatically, with a ticket number in the comment, so this gives us very nearly self-documenting change tracking with easy rollback.

Now, that gives us good change accountability and recoverability; however, it does tend to leave us with mercurial repos littered around the infrastructure, slowly becoming more and more important. What we now do is auto-locate these repos and publish them to a central box for hgweb access, which gives us rapid bare-OS recovery in the case of hardware failure.

This script runs from cron on our admin server every few hours:

# sysopshosts is a flat file hostlist
hostlist=$(cat ~woome/sysops/releases/sysopshosts | tr -d ':')

# central directory published by hgweb
# (assumed here; it matches the clone URLs in the recovery example below)
base=/data/repos

cd $base || exit 1
for host in $hostlist
do
    # locate any mercurial repos under /etc
    repolist=$(ssh $host find /etc -iname .hg -print \
        | grep -v /etc/.hg | sed -e 's/\/.hg//')
    # Work through them
    for repo in $repolist
    do
        # create a directory of host32-etc-nginx for host32/etc/nginx
        repodir=$(echo $repo | tr '/' '-')
        targetdir=${host}${repodir}
        # if the target directory already exists then check
        # if the repo looks related
        if [ -d $targetdir ]
        then
            # $repo starts with a slash, so the URL has the double slash
            # that mercurial needs for an absolute path
            HGUNRELATED=$(hg -R $targetdir in ssh://${host}/${repo} 2>&1 \
                | grep -cE "(unrelated|no suitable)")
            if [ "${HGUNRELATED}" -gt "0" ]
            then
                echo "found unrelated repo - $targetdir moving aside"
                mv ${targetdir} ${targetdir}.unrelated.$(date +%Y-%m-%d)
            fi
        fi
        # now attempt to either create or update the repo
        if [ -d $targetdir ]
        then
            echo Updating existing repo $targetdir
            pushd $targetdir
            hg pull
            popd
        else
            echo Found new repo $targetdir
            hg clone ssh://${host}/${repo} ${targetdir}
        fi
    done
done

We use hgweb to publish the repo directory.
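
The naming convention from the script comment (host32-etc-nginx for host32/etc/nginx) can be tried out on its own. This is only a sketch; the function name repo_target_name is made up for illustration and is not part of our script:

```shell
# repo_target_name HOST REPO_PATH
# Hypothetical helper: build the central directory name for a repo,
# e.g. host32 and /etc/nginx give host32-etc-nginx.
repo_target_name() {
    local host="$1" repo="$2"
    # drop a trailing slash, then turn the remaining slashes into dashes
    repo=${repo%/}
    printf '%s%s\n' "$host" "$(printf '%s' "$repo" | tr '/' '-')"
}

repo_target_name host32 /etc/nginx
```

Because the repo path starts with a slash, the host name and the path end up joined by a dash without any extra separator.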

An example use case (simplified slightly). We had a hardware failure of our main nagios monitoring system during a data centre/aircon outage. We were immediately blind to the state of our infrastructure. We obtained sufficient monitoring to finish our recovery with the following steps on a spare server:

sudo apt-get install nagios2  lighttpd
rm -fr /etc/nagios2 /etc/lighttpd
hg clone ssh://host30/data/repos/host65-etc-nagios2 /etc/nagios2
hg clone ssh://host30/data/repos/host65-etc-lighttpd /etc/lighttpd
... minor config changes to lighttpd and dns for new host.....
sudo /etc/init.d/lighttpd restart
sudo /etc/init.d/nagios2 restart

Dev-Ops Collaboration

I watched a really good presentation on InfoQ from QCon San Francisco called “Cooperation, Collaboration, and Awareness”. The presenter was John Allspaw, VP of operations at Etsy (a popular web startup) and formerly of Flickr:

Cooperation, Collaboration, and Awareness

The talk was about how Etsy handles devops tasks, and the lessons he has learnt along the way about successful collaboration between dev and ops (something I personally think is really critical).

Here are some of the main points from the talk:

  • They practice continuous delivery. This is a subject that I have studied a lot recently, and am very keen on. The idea is to have a build/deployment pipeline in which a commit can be taken all the way to production in one process (manual steps, such as a big red button, are allowed too). They do many releases per day.
    • This is generally minor changes to existing features. New features and major changes like DB schema changes go out once a week in a specific window, and have a change control process attached to them.
    • When things go wrong the idea is to detect it quickly, and to fix it quickly. That could be either roll-forward or roll-back.
  • New features should be able to be turned on and off with simple config flags. It seems really stupidly simple but is very effective. Deployment of new features then becomes trivial, and A/B testing can be done just as easily (turn it on for only a percentage of users).
  • Deployment is generally triggered by the developers, to give them a sense of responsibility for the change. Having it all done by ops gives it a sense of “it’s someone else’s problem now”. Deploying also sends out notifications that it occurred, increasing the sense of responsibility.
  • There are shared responsibilities between ops and devs:
    • Metrics
    • Alerting
    • Plumbing
    • Graphing
    • Logging
  • Get metrics on everything, and have them automatically coalesce into graphs for everyone to see. They use a tool called Graphite, and for example have graphs for number of new users, number of listings, number of completed transactions etc. They also show on their graphs exactly when deployments happened, so it is obvious whether a deployment had an adverse or positive effect. Ganglia is another tool they use to aggregate server performance statistics, and it makes this stupidly easy (if a script exists in a certain place on a server then it uses that for data gathering automatically). That then allows them to have very fine-grained metrics on individual operations very cheaply.
  • Logging is built into their Apache logs; they append certain information about their A/B testing and performance details to each log entry. They also use Splunk for analysis of logs.
  • Optimise for recovery, not reliability. Always assume your system will go down - how long before you can get it back up? Make it as easy as possible, even for the worst case scenario (doing a clean deploy, setting up a server from scratch etc - do it in one script).
  • They have a monthly meeting to discuss how often they were interrupted by alerts. They give feedback to the devs to let them know there is a sporadic problem etc. that needs to be fixed.
    • Ops should only be paged if there is no way to automatically recover and it can’t wait until tomorrow. It should happen very rarely!
  • Configuration of servers these days should be scripted and in source control (I don’t know how easy this is on Windows, certainly on Linux it’s standard practice to use Chef or Puppet).
  • Trust is very important; no finger-pointing allowed. Coordination on launches is key (Go or No-Go meeting). Embedding ops in dev teams allows for closer collaboration; they also have weekly meetings with each other so they know the big picture for ops.
    • No Asshole rule. This one hit home (it sounds very familiar)
    • Don’t allow snarky, biting and defensive comments between Dev and Ops; allowing them implicitly encourages contention.
    • Condescension and holier-than-thou communication limits your career.
    • Celebrate collaboration.
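
The config-flag idea, including turning a feature on for only a percentage of users, can be sketched in a few lines of shell. This is an illustration under my own assumptions, not how Etsy implements it; the function name in_rollout and the feature and user names are invented:

```shell
# in_rollout FEATURE PERCENT USER_ID
# Succeeds when the feature should be on for this user.
# PERCENT is the share of users (0-100) that should see the feature.
in_rollout() {
    local feature="$1" percent="$2" user_id="$3"
    local bucket
    # Hash feature and user into a stable bucket from 0 to 99, so a given
    # user always gets the same answer for a given feature.
    bucket=$(( $(printf '%s:%s' "$feature" "$user_id" | cksum | cut -d' ' -f1) % 100 ))
    [ "$bucket" -lt "$percent" ]
}

# Example: a feature switched on for a quarter of users.
if in_rollout new_checkout 25 user42; then
    echo "new_checkout is on for user42"
else
    echo "new_checkout is off for user42"
fi
```

Setting the percentage to 0 turns the feature off for everyone and 100 turns it on for everyone, so the same flag covers both the simple on/off switch and the A/B test.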

If you have time and are interested, I recommend this talk.

DevOps for every developer - investigating in log files

In his thorough blog post Mathias Meyer describes how important it is for every developer to know the challenges of running applications in a production environment.

I support almost every statement, but have a different opinion on two (out of more than a dozen) pieces of advice.

Note: the original blog post and my notes are both about the Linux environment.


And yes, I think enabling swap on servers is a terrible idea. Let processes crash when they don’t have any resources left. That at least will allow you to analyze and fix.

Sometimes long-running processes continuously grow in size. Memory is allocated to the process and used for some purpose, but is never accessed again. I once observed the Apache parent process on Ubuntu 10.04 growing by a small amount on every graceful restart. While it is a good idea to investigate all memory leaks, or to restart processes running amok (preferably automatically with monit after they cross some defined threshold), sometimes you can live perfectly well with such problems by simply letting the unused memory be swapped out.

I would not make the swap partition on a server too large. For a server with 4 GB RAM I would create 2 GB of swap; for a server with 8 GB RAM, not more than 3 GB.

Splitting log files

Separate request logging from application logging. Data on HTTP requests is just as important as application logs, but it’s easier if you can sift through them independently…

I found it is much easier to filter one big log file containing everything with ack/grep and awk, or even ruby one-liners, than to stitch information together from different log files and sort it by time.

Some tricks:

  • use ack to filter with regular expressions - extract only what is interesting for your investigation

  • ensure your log files are rotated regularly - do not let them grow infinitely! Use a custom logrotate configuration if needed. man logrotate and reading example files in /etc/logrotate.d should give you enough inspiration.

  • cut the amount of data that needs to be processed with head and tail, or use tail -f if you are analyzing a problem that is happening at this very moment

  • reduce width with cut. Even the default apache log (without virtual host information) contains very long lines with a full user agent string and referrer at the end. If you are only interested in, say, the first 100 characters of each of the last 40 lines, use

    tail -40 apache_custom.log | cut -c1-100

  • if you are only interested in the client IP and the URL it accessed - these are the second and the eighth columns of my apache log - you can use awk to extract those columns:

    tail -40 apache_access.log | awk '{print $2, $8}'

  • see also how to display a block of text (which goes over multiple lines) with awk

  • use less to interactively view and search interesting content. You can filter the logs as described above and put the result into a temporary file. vim is less suitable because it is much slower on huge files, especially if the syntax highlighting is activated.

  • prepend with zcat to access (older) compressed log files like

    zcat apache_error.log.2.gz | ack 'my regex'

  • automate - put the piped commands for frequent cases into a bash or ruby script or at least document it in a wiki

  • you can even run a prepared script through ssh without opening a session manually

    ssh -t myserver '/home/me/my-prepared-script'

As a developer you can always let the computer do the stupid work! Give your eyes a rest, let ack do the scan and reduce the amount of information you have to read! That way you have more energy to analyze the problem.
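
As an example of the "automate" tip, several of the tricks above can be wrapped into one small function. This is a sketch under a few assumptions: the name log_hits is invented, grep -E stands in for ack so that it runs on a box without ack installed, and the default columns 2 and 8 match my apache log format, so yours may differ:

```shell
# log_hits PATTERN FILE [IP_COLUMN] [URL_COLUMN]
# Print client IP and URL for every line of FILE matching PATTERN.
# Handles both current logs and older gzip-compressed ones.
log_hits() {
    local pattern="$1" file="$2" ip_col="${3:-2}" url_col="${4:-8}"
    case "$file" in
        *.gz) gzip -dc "$file" ;;    # older, rotated and compressed logs
        *)    cat "$file" ;;
    esac | grep -E -- "$pattern" \
         | awk -v ip="$ip_col" -v url="$url_col" '{ print $ip, $url }'
}
```

A call such as log_hits ' 500 ' apache_access.log.2.gz would then list who requested what for every line containing a 500 status, and piping the result through less keeps it searchable.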

NoSQL is Right... about no SQL

If the NoSQL DBs are right about anything, it’s that SQL is not worth implementing. SQL is great for poking around a database but it has no place in a live website.

SQL’s problem is that it’s lossy. It loses any information the programmer might have about how a query should be executed. We spend all day programming every other part of a website, but when it comes to the database, DB vendors have us convinced, “Hands off, you can’t be of any help here.” In a website, we want predictable performance, and a lossy language like SQL does not lead to predictability.

Instead we’re left to depend on query planners and optimizers to find the right way to run the query and just hope that that isn’t terrible. Which is most of the time.

What if I want to say this to my database? “Database, do this query using this index on these columns. If the index isn’t there, do me a favor and just fail that query rather than taking down the whole database trying to execute it without the index. Thanks Database.” SQL can’t provide this control, but it would help in the web environment.

There’s no reason that most NoSQL databases couldn’t implement a SQL interface to their data, but they’re right not to. It’s fine for the interface to require some database-specific programming, especially if it comes with reduced risk of performance problems. We’re programmers, we can handle it.

Once Again, the Enterprise Starts to Resemble Reality

Having worked in such environments all my career thus far, one of the more interesting trends in Enterprise IT is that it feels like once again, EIT is starting to reflect academia.

I don’t mean a bunch of people messing around in school, I mean that the software that people use in learning is making obvious inroads into EIT. And as the number of IT employees in all areas of the Enterprise who have graduated in the last 10 or so years starts to gain critical mass, this trend is only going to get stronger.

It’s now common to hear about Enterprise Devs bringing in fast key-value stores and ‘NoSQL’ databases, or suggesting a use for Hadoop. We see jquery and other display frameworks being used for internally developed web UIs. REST and JSON are being adopted as standards. But it doesn’t just stop with Enterprise Dev - EIT is moving there too.

Python and Ruby are finally starting to supplant the reigning king of Enterprise Ops, Perl. Puppet and Chef are starting to pop up in meetings. Collectd, Scribe, and even Hadoop are being mentioned in water cooler conversations between Enterprise Engineers. FC connected storage is barely moving while its IP-based rival is gaining on it fast. Questions are being raised about the value of using the same fully resilient hardware for non-prod environments. The external ‘Enterprise Private Cloud’ has proven an emperor with no clothes; the internal infrastructure services are no longer the economic white elephant. ITIL has failed, if you believed in it in the first place. DevOps has become a dirty word for those in the startup space.

And when the last hope for an x86 alternative to Windows and Linux died as The Red Menace consumed Sun, even the old school declared ‘death to non-commodity hardware’ by pronouncing aggressive strategy changes to their hardware roadmaps. And you can imagine how much of their OS roadmaps talk about Windows Server. Even new HPC components are starting to enforce a Linux only driver policy.

In the days when Enterprises could legitimately claim that IT gave them a competitive advantage, the IT they had was the same IT that was used in academia. It was the same IT that the EIT staff learnt on. Nobody ever learnt Windows, Office, Oracle or VMware at University. But for the last 15 or so years those technologies found a place in the Enterprise, as professionals without a technical background or hacker mentality struggled to get on top of ‘this IT stuff’.

But the end of those days finally seems to be in sight. Maybe these kinds of technologies are things you still only read about in blogs like mine. Maybe at your shop these are only hallway conversations rather than funded projects. But don’t let that get you down - be the change, take some ownership.

It’s gonna take a few years before the inflexion point, but it’s an inevitability. And when the current crop of startup engineers and ops get tired of the always-on lifestyle, the Enterprise who opens their arms to them will be the one who once again can use IT to gain an edge.

And anyone out there whose sole interest is in maintaining the status quo… start embracing disruption, you don’t want to stand in front of this freight train.

Defining DevOps from the Customer's Perspective

© 2011 Jeff Sussna, Ingineering.IT

There has been much discussion lately on the topic of “what is DevOps?” I sense a bit of a struggle to define the basic value proposition. If we view things from the customer’s perspective, though, the value becomes easy to see: Cloud computing turns software products into services. It used to be that software companies designed and developed software, then distributed it to customers, who took responsibility for deploying and operating it. Software-as-a-Service means that the same company that builds the software also operates it on the customer’s behalf. In the cloud, operational excellence is a fundamental part of what the software vendor is selling. Operational excellence implies more than just scalability, availability, and security. Part of the benefit of having vendor-operated software is immediate access to software updates. Operational excellence thus includes the rapidity, and quality, of the software release process.

Software-as-a-service means that sales and marketing need to care about operational excellence. The agile development process needs to incorporate non-functional requirements. The software release lifecycle needs to be optimized just like the design/development process. In a sense, Agile has always been incomplete, even before SaaS: no matter how great the development process, it has no benefit unless functionality can be delivered to the customer.

It’s important to remember that customer delivery includes non-technical components such as documentation and training. By the same token, integrating operational excellence into the product means that operations needs to think about the customer. Ops is no longer about managing servers and networks (assuming it ever really was). Instead, it’s about managing the customer experience. If development introduces code that makes the application slow, operations cares. Ops also needs to care (and in order to care, they first need to know) if they accidentally turn off previously released functionality. The fact that the infrastructure still works is of no solace to users.

Much is made lately about continuous delivery. DevOps groups are very proud of how many times per day they can release new code. But what is the impact on the customer? If they are continually surprised by change, will they consider it good? I know that if I log into a site, and the navigational mechanisms have changed without warning, and I don’t know where to click, I feel like something is broken. Again, ops needs to think in terms of the customer experience. Perhaps, rather than “continuous delivery”, we should call it “frictionless delivery”.

I don’t think DevOps is really about dissolving boundaries between operations and development. I believe it’s about dissolving boundaries between all the groups involved in delivering software services, from marketing and design, through development and QA, to operations and support. Beyond adopting specific agile development techniques, such as scrums and continuous integration, DevOps needs to adopt Agile’s underlying philosophy about delivering customer value. It may be that DevOps is unfortunately named. The name focuses us on a single part (dissolving boundaries between dev and ops) of the whole need. What we’re ultimately striving for is a unified software service delivery organization that aligns all teams with each other and with the customer. This unified organization needs to integrate design, marketing, development, QA, operations and support. Together they can practice what we might call “User-Centered IT”.

self healing in nagios alert

At Woome we’re quite lazy; I’ve mentioned that before. We write quite a lot of bespoke nagios alerts for our applications, and on occasion we’ve found ourselves being woken regularly to perform a simple restart of some buggy third-party tool or a piece of code for which we’re waiting on a fix.

These days we simply modify the nagios alert to become alert, watchdog, self healer and logger.

Approximate pseudocode:

do original alert checking logic
if critical
   check time stamp of last self heal
   if not recent
      self heal in back ground
      change alert to warning
      change alert status to state self healing
      touch time stamp
      log heal attempt
deliver up alert status etc to nagios

This isn’t rocket science, but it does automate recovery from occasional known failures whilst removing the risks associated with repeated automated restarts.
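
The pseudocode can be turned into a rough shell sketch. Everything named here is a placeholder rather than our production code: the file locations, the one-hour interval and the function name self_healing_check are assumptions, and the sketch stores the time of the last attempt inside the stamp file instead of reading the file's modification time. The exit codes follow the nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
# Sketch only: the paths and interval below are placeholders.
STAMP=${STAMP:-/var/tmp/myapp.selfheal.stamp}
HEAL_LOG=${HEAL_LOG:-/var/tmp/myapp.selfheal.log}
MIN_INTERVAL=${MIN_INTERVAL:-3600}    # seconds to wait between heal attempts

# self_healing_check CHECK_CMD HEAL_CMD
# Prints the check output and returns the nagios status.
self_healing_check() {
    local check_cmd="$1" heal_cmd="$2"
    local output status=0 now last=0

    # do original alert checking logic (unquoted so the command may carry arguments)
    output=$($check_cmd 2>&1) || status=$?

    if [ "$status" -eq 2 ]; then
        # check time stamp of last self heal
        now=$(date +%s)
        if [ -f "$STAMP" ]; then
            last=$(cat "$STAMP")
        fi
        if [ $((now - last)) -ge "$MIN_INTERVAL" ]; then
            $heal_cmd >/dev/null 2>&1 &        # self heal in background
            status=1                           # change alert to warning
            output="SELF HEALING - $output"    # state that we are self healing
            echo "$now" > "$STAMP"             # record the time of this attempt
            echo "$(date) heal attempted: $output" >> "$HEAL_LOG"
        fi
    fi

    # deliver up alert status etc to nagios
    echo "$output"
    return "$status"
}
```

A real plugin would finish by calling the function with its existing check and restart commands and then exiting with the returned status. If the problem is still there on the next run and the last heal was recent, the alert stays critical and a human gets paged.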

Watch on

This video was a great intro to the power of repeatable creation of standard configurations for all phases of IT: Dev, to Prod, to Ops, to Sales. Great talk, and I’m anxious to find a use case that will get me working with it in my spare time.

Vagrant, Chef, and Node.js

I recently wrote about Vagrant and Chef. I’ve been using Vagrant as a front-end for VBoxManage for several months now, and I have nothing but awesome things to say about it in that capacity. I’ve also used Chef to provision virtual machines using some of Opscode’s frequently used cookbooks.

In addition to that I had previously set up several node.js environments by ssh’ing into my running virtual machine and manually installing it, but I had never actually provisioned my virtual machine using a node.js recipe for Chef. Today I decided to try doing that.

It turned out to be a bit trickier than I was expecting and so I wrote up some instructions in case you’re also having trouble getting it to work.

As usual, create a project directory, and then use

vagrant init

to create a default vagrant file. We’ll be using chef-solo in this example, so you’ll need to create a local directory to store your cookbooks:

mkdir cookbooks

By default, Vagrant will look there for cookbooks; you can call it something else, but if you do you’ll need to modify your vagrant file. Next you’ll need the opscode build-essential cookbook. You can clone their github repository and copy the build-essential cookbook to your cookbooks directory.

Next, you’ll need to find and download a Chef recipe for node.js. Here is one that was posted in the opscode community. You can download it from the link in the upper-right corner, or you can follow the link to the github page for the repository that holds the recipe and clone it. Either way, you’ll need to get that directory into your cookbooks directory.

Now you’ll need to edit your Vagrantfile to include the recipe for node.js. To do so edit the commented out lines that look like

  # Enable provisioning with chef solo, specifying a cookbooks path (relative
  # to this Vagrantfile), and adding some recipes and/or roles.
  # config.vm.provision :chef_solo do |chef|
  #   chef.cookbooks_path = "cookbooks"
  #   chef.add_recipe "mysql"
  #   chef.add_role "web"
  #   # You may also specify custom JSON attributes:
  #   chef.json = { :mysql_password => "foo" }
  # end

and make them look something like this:

  config.vm.provision :chef_solo do |chef|
    chef.add_recipe "nodejs"
  end

After that,

vagrant up

should create a new virtual machine and install node.js from source. There’s only one problem:

$ vagrant ssh
Welcome to Ubuntu!
Last login: Thu Jul 21 13:07:53 2011 from
vagrant@lucid32:~$ node --version

This isn’t the latest version of node.js, which is at v0.5.5 at the time of this writing. Fortunately, the authors of this cookbook were kind enough to include an optional parameter for the node.js recipe that allows you to select the version. If, for example, we wanted to install version 0.5.0 instead of version 0.4.8, we would modify the provisioning section of our Vagrant file so it looks like:

  config.vm.provision :chef_solo do |chef|
    chef.add_recipe "nodejs"
    chef.json = {
      "nodejs" => {
        "version" => "0.5.0"
      }
    }
  end

Now running

vagrant provision

will rerun the provisioning scripts, giving you a working version of node 0.5.0. You can modify the default.rb file in the attributes directory of the nodejs cookbook if you’d like to install a later version by default. But what happens when we try to install version 0.5.5?

$ vagrant provision
[misc deleted]
---- Begin output of "bash"  "/tmp/chef-script20110914-3732-r572nn-0" ----
STDERR: --2011-09-14 14:21:24--
Connecting to||:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2011-09-14 14:21:25 ERROR 404: Not Found.
---- End output of "bash"  "/tmp/chef-script20110914-3732-r572nn-0" ----
Ran "bash"  "/tmp/chef-script20110914-3732-r572nn-0" returned 8
: stdout
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!

chef-solo -c /tmp/vagrant-chef-1/solo.rb -j /tmp/vagrant-chef-1/dna.json

The output of the command prior to failing is outputted below:

[no output]

Sadly, we got a 404 at some stage of the recipe. After a bit of investigating, I discovered that it was due to the fact that the node.js distribution directory structure changed after the 0.5.0 release. Take a look for yourself.

So to solve this problem, I had to hack the recipe a bit. Here’s the original version (from the recipes/default.rb file in the nodejs cookbook directory):
bash "install nodejs from source" do
  cwd "/usr/local/src"
  user "root"
  code <<-EOH
    wget{node[:nodejs][:version]}.tar.gz && \
    tar zxf node-v#{node[:nodejs][:version]}.tar.gz && \
    cd node-v#{node[:nodejs][:version]} && \
    ./configure --prefix=#{node[:nodejs][:dir]} && \
    make && \
    make install
  EOH
  not_if "#{node[:nodejs][:dir]}/bin/node -v 2>&1 \
            | grep 'v#{node[:nodejs][:version]}'"
end

I changed it so that the path is dependent on the version of node:

path = node[:nodejs][:version]>"0.5.0"?

bash "install nodejs from source" do
  cwd "/usr/local/src"
  user "root"
  code <<-EOH
    wget #{path}node-v#{node[:nodejs][:version]}.tar.gz && \
    tar zxf node-v#{node[:nodejs][:version]}.tar.gz && \
    cd node-v#{node[:nodejs][:version]} && \
    ./configure --prefix=#{node[:nodejs][:dir]} && \
    make && \
    make install
  EOH
  not_if "#{node[:nodejs][:dir]}/bin/node -v 2>&1 \
             | grep 'v#{node[:nodejs][:version]}'"
end

I’m not that familiar with writing chef recipes, so I hope that I haven’t done anything wrong here. In the end, however, this worked perfectly for me.
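
One thing to watch with the version test: Ruby compares strings character by character, so a version such as "0.10.0" would compare as lower than "0.5.0". A field-by-field numeric comparison avoids that. Here is a rough sketch of the idea, written in shell rather than Ruby; the function name version_gt is hypothetical and not part of the cookbook:

```shell
# version_gt A B
# Succeeds when dotted version A is newer than B, comparing each
# field as a number rather than as text.
version_gt() {
    local a="$1" b="$2" fa fb
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%.*}; fb=${b%%.*}           # first remaining field of each
        if [ "${fa:-0}" -gt "${fb:-0}" ]; then return 0; fi
        if [ "${fa:-0}" -lt "${fb:-0}" ]; then return 1; fi
        # fields are equal: drop them and compare the next pair
        case "$a" in *.*) a=${a#*.} ;; *) a= ;; esac
        case "$b" in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 1                               # versions are equal
}

if version_gt 0.10.0 0.5.0; then
    echo "0.10.0 is newer than 0.5.0"
fi
```

Missing fields count as zero, so 0.5 and 0.5.0 are treated as the same version.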

Vimeo's Tailgate: simple real-time access to your logs


Tailgate on Github:

Tailgate is a nodejs app to pipe tail -F into websockets.

It’s a very simple way to have real-time access to your logs. It uses and coffeescript, and is great for keeping track of scribe logs.

Live demo here:


Tailgate exposes its feeds as a simple pub/sub api through connections making it easy to build visualizations or monitoring tools as simple web pages.

cd <install/directory>
git clone
cd tailgate
cp conf/ conf/

Edit conf/ to have the correct values for your installation.


If you want to use the script.

cp startup/ startup/
sudo ln -s <fullpath to tailgate/startup/> /etc/init.d/tailgate

Edit the startup/ script to use the installation directory and tailgate user to run as. Ensure the tailgate user has write permissions to startup/ so that it can write the pidfile.

sudo /etc/init.d/tailgate start
sudo /etc/init.d/tailgate stop
sudo /etc/init.d/tailgate restart

  • Make sure the HTTP port specified in conf/ is not already in use
  • Enable logging in by setting LOGGING="1"

Book Review: Test-Driven Infrastructure with Chef by Stephen Nelson-Smith; O'Reilly Media



Test-Driven Infrastructure with Chef describes a rationale and an approach to developing automated tests for system infrastructure. It includes an explanation of behavior-driven development, and detailed instructions for setting up a testing system using cucumber-chef on EC2.


Surprisingly little time was spent talking about cucumber-chef and how to use it. The majority of the book is spent explaining BDD and why you’d want to apply it to infrastructure, and then explaining in minute detail the process to get RVM, EC2 and Chef configured. The last portion of the book covered the process of using cucumber-chef to set up a server with multiple user accounts.

Being already familiar with the supporting tools, I found this disappointing. The teamserver example was too simple and unrealistic. It would have been more useful to see some examples using cucumber-chef for a more realistic use-case, such as setting up a web server.

Unfortunately, I suspect the text isn’t likely to be helpful to a reader who isn’t familiar with the tools either. The instructions were already outdated when I read them, shortly after the book was released. A reader who is new to Cucumber, Chef, Ruby or EC2 will be in danger of getting lost before they even get to the point where they can run a test against an instance.


In the course of working through the examples, I was continually frustrated by the amount of time it took to get feedback. The tests are really slow. It’s hard to imagine actually developing red/green/refactor-style at this pace. I like the concept of being able to test the infrastructure, but it doesn’t seem practical with these tools.

I’m not convinced that it would really be worth the time involved to write tests at this level anyway. Writing integration tests against the full application stack might be a better use of your testing time. The true test of the infrastructure is how well it supports the needs of the application and its users.

Neither BDD-style infrastructure tests nor full-stack integration tests will help you with the most difficult and interesting infrastructure challenges: scaling and stability. Simulating all the wonderful ways that a server can crash (used up disk space, hung connections, etc.) would be a complex, difficult, slow, and inevitably incomplete endeavor. It’s possible that server ops really is fundamentally different than application development.

Bottom Line
  • Good explanation of BDD and how it could be applied to infrastructure
  • Short on useful examples for readers familiar with the supporting tools
  • Thorough instructions for setting up EC2, Chef and RVM, but likely to have a short shelf-life

Available from O’Reilly: Test-Driven Infrastructure with Chef