
Tumblr's Finagle Redis Implementation

Tumblr uses redis as a cache/store for a number of our internal applications. Some of the things that depend on redis include:

  • dashboard notifications (via staircar)
  • Tiny URLs (via gob)
  • gearman persistent storage (via george)
  • note counts (also via staircar, coming soon)
  • dashboard data (coming soon via ira)

Over the past six months we’ve started shifting internal services from being C/libevent based to being Scala/Finagle based. Finagle is a wonderful little library for building asynchronous RPC servers and clients. We’ve built a number of services (besides the ones above) with it; I talked about some of our usage here.

Back in October we created a native redis implementation for finagle, and as of yesterday Marius (one of the finagle leads at Twitter) pulled it into master. Although I’m sure this will take a slightly different shape over the coming months as Twitter engineers have a chance to work with it, I’m proud to have helped shape a lot of that code along with Bennett, Dallas and Wiktor.

If you’re passionate about massively scalable services like this, we’re hiring.

Cloud Foundry Open PaaS Deep Dive

by Ezra Zygmuntowicz (aka @ezmobius)

You are probably wondering how Cloud Foundry actually works. Hopefully these details will clear things up about how the Cloud Foundry OSS project works, why it works, and how you can use it.

Cloud Foundry is on GitHub. The VCAP repo is the meaty part, or what we call the “kernel” of Cloud Foundry, as it is the distributed system that contains all the functionality of the PaaS. We have released a VCAP setup script that will help you get an Ubuntu 10.04 system running an instance of Cloud Foundry, including all the components of VCAP as well as a few services (mysql, redis, mongodb), so you can play along at home.

We want to build a community around Cloud Foundry, as that is what matters most for now as far as the open source project. We imagine a whole ecosystem of “userland” tools that people can create and plug into our Cloud Foundry kernel to add value and customize for any particular situation. This project is so large in scope that we had to cut a release and get community involvement at some point and we feel that the kernel is in great shape for everyone to dig in and start helping us shape the future of the “linux kernel for the cloud” ;)

So how do you approach building a PaaS (Platform as a Service) that is capable of running multiple languages under a common deployment and scalability idiom, and that is also capable of running on any cloud or any hardware that can run Ubuntu? VCAP was architected by my true rock star coworker Derek Collison (this guy is the man, for realz!). The design is very nice and adheres to one core concept: “the closer to the center of the system, the dumber the code should be”. Distributed systems are fundamentally hard problems, so each component that cooperates to form the entire system should be as simple as it possibly can be and still do its job properly.

VCAP is broken down into five main components plus a message bus: the Cloud Controller, the Health Manager, the Routers, the DEAs (Droplet Execution Agents) and a set of Services. Let’s take a closer look at each component in turn and see how they fit together to form a full platform for the cloud.

NATS is a very simple pub/sub messaging system written by Derek Collison (this dude knows messaging, trust me;) of TIBCO fame. NATS is the system that all the internal components communicate on. While NATS is great for this sort of system communication, you would never use NATS for application level messaging. VMware’s own RabbitMQ is awesome for app level messaging and we plan to make that available to Cloud Foundry users in the near future.

It should be stated here that every component in this system is horizontally scalable and self healing, meaning you can add as many copies of each component as needed in order to support the load of your cloud, and in any order. Since everything is decoupled, it doesn’t even really matter where each component lives, things could be spread across the world for all it cares. I think this is pretty cool ;)

Cloud Controller:
The Cloud Controller is the main ‘brain’ of the system. It is an async Rails 3 app that uses ruby 1.9.2 with fibers plus EventMachine to be fully async; pretty cutting edge stuff. You can find the Cloud Controller here: . This component exposes the main REST interface that the CLI tool “vmc” talks to, as does the STS plugin for Eclipse. If you were so inclined, you could write your own client that talks to the REST endpoints exposed by the Cloud Controller in whatever way you like, but this should not be necessary, as the “vmc” CLI has been written with scriptability in mind. It returns proper exit codes, as well as JSON if you so desire, so you can fully script it with bash, ruby, python, etc.

We made a decision not to tie VCAP to git. Even though we love git, we need to support any source code control system, yet we want the simplicity of a git-push-style deployment; hence vmc push. But we also want differential deploys, meaning that we push diffs when you update your app; we do not want to push your entire app tree every time you deploy. Feeling light and fast is important to us. Our goal is to rival local development.

So we designed a system where we can get the best of both worlds. You as a user can use any source control system you want; when you do a vmc push or vmc update we examine your app’s directory tree and send a “skeleton fingerprint” to the cloud controller. This is basically a fingerprint of each file in your app’s tree plus the shape of your directory tree. The cloud controller keeps every object it ever sees in a shared pool, accessible via its fingerprint plus its size. It then returns to the client a manifest of what files it already has and what files it needs your client to send to the cloud in order to have all of your app. It is a sort of ‘dedupe’ for app code as well as framework, rubygem and other dependency code. Then your client only sends the objects that the cloud requires in order to create a full “Droplet” of your application (a droplet is a tarball of all your app code plus its dependencies, all wrapped up with a start and a stop button).
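That fingerprint-and-manifest exchange can be sketched in a few lines of Ruby (the helper names and the SHA1-plus-size scheme here are illustrative assumptions, not VCAP’s actual code):

```ruby
require "digest/sha1"
require "find"

# Fingerprint every file in an app tree: a content hash plus the size.
# Illustrative sketch of the "skeleton fingerprint" idea.
def skeleton_fingerprint(app_dir)
  entries = {}
  Find.find(app_dir) do |path|
    next unless File.file?(path)
    rel = path.sub(%r{\A#{Regexp.escape(app_dir)}/?}, "")
    entries[rel] = { sha1: Digest::SHA1.file(path).hexdigest,
                     size: File.size(path) }
  end
  entries
end

# Cloud-controller side: given the shared pool of (sha1, size) objects it
# has already seen, return which files the client still needs to upload.
def missing_objects(fingerprint, pool)
  fingerprint.reject { |_rel, meta| pool.include?([meta[:sha1], meta[:size]]) }
             .keys
end
```

Only the files returned by `missing_objects` would be sent over the wire; everything else is deduped against the shared pool.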

Once the Cloud Controller has all the ‘bits’ it needs to fully assemble your app, it pushes the app into the “staging pipeline”. Staging is the process that assembles your app into a droplet: it gathers all the full objects that comprise your application plus all of its dependencies, rewrites its config files to point at the proper services that you have bound to your application, and then creates a tarball with some wrapper scripts called start and stop.

The Droplet Execution Agent can be found here: . This is an agent that runs on each node in the grid that actually runs the applications, so in any particular cloud build of Cloud Foundry you will have more DEA nodes than any other type of node in a typical setup. Each DEA can be configured to advertise a different capacity and a different runtime set, so you do not need all your DEA nodes to be the same size or be able to run the same applications.
So continuing on from our Cloud Controller story, the CC has asked for help running a droplet by broadcasting on the bus that it has a droplet that needs to be run. This droplet has some metadata attached to it, like what runtime it needs as well as how much RAM it needs. Runtimes are the base component needed to run your app, like ruby-1.8.7 or ruby-1.9.2, or java6, or node. When a DEA gets one of these messages, it checks to make sure it can fulfill the request, and if it can, it responds back to the Cloud Controller that it is willing to help.

The DEA does not necessarily care what language an app is written in. All it sees are droplets. A droplet is a simple wrapper around an application that takes one input, the port number to serve HTTP requests on, and has two buttons, start and stop. So the DEA treats droplets as black boxes: when it receives a new droplet to run, all it does is tell it what port to bind to and run the start script. A droplet, again, is just a tarball of your application, wrapped up in a start/stop script and with all your config files like database.yml rewritten to bind to the proper database. Now, we can’t rewrite configs for every type of service, so for some services like Redis or MongoDB you will need to grab your configuration info from the environment variable ENV['VCAP_SERVICES'].
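Reading bound-service credentials out of that variable might look like the following sketch (the JSON layout shown is an assumption about the VCAP_SERVICES payload, not a documented schema):

```ruby
require "json"

# Pull the credentials for the first bound service whose label starts
# with +kind+ (e.g. "redis"). The JSON shape assumed here is a typical
# label => [instances] layout; check your own VCAP_SERVICES to be sure.
def service_credentials(kind, env = ENV)
  services = JSON.parse(env["VCAP_SERVICES"] || "{}")
  services.each do |label, instances|
    next unless label.start_with?(kind)
    return instances.first["credentials"]
  end
  nil
end
```

An app would then call something like `service_credentials("redis")` at boot to find the host and port to connect to.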

In fact there is a bunch of juicy info in the ENV of your application container. If you create a directory on your laptop and make a file in it called env.rb like this:
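A minimal stand-in for such an env.rb, written as a bare Rack-style app that dumps the process environment (the original snippet may well have been a Sinatra app; this reconstruction is only illustrative):

```ruby
# env.rb -- hypothetical reconstruction: a bare Rack-style app that
# answers every request with a plain-text dump of its ENV, including
# the VCAP-provided variables once deployed.
app = lambda do |_rack_env|
  body = ENV.sort.map { |k, v| "#{k}=#{v}" }.join("\n")
  [200, { "Content-Type" => "text/plain" }, [body]]
end
```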

That makes a simple app that shows you what is available in your environment so that you can see what to use to configure your application. If you visit this new app you will see something like this:
ENV output

So the DEA’s job is almost done. Once it tells the droplet what port to listen on for HTTP requests and runs its start script, and the app properly binds the correct port, the DEA broadcasts the location of the new application on the bus so the Routers can know about it. If the app did not start successfully, log messages are returned to the vmc client that tried to push the app, telling the user why it didn’t start (hopefully). This leads us right into what the Router has to do in the system, so we will hand it over to the Router (applause).

The Router is another EventMachine daemon that does what you would think it does: it routes requests. In a larger production setup there is a pool of Routers load balanced behind Nginx (or some other load balancer or HTTP cache/balancer). These Routers listen on the bus for notifications from the DEAs about new apps coming online, apps going offline, etc. When they get a realtime update, they update the in-memory routing table that they consult in order to properly route requests. So a request coming into the system goes through Nginx, or some other HTTP termination endpoint, which load balances across a pool of identical Routers. One of the Routers picks up the phone to answer the request and inspects the headers just enough to find the Host: header, so it can pick out the name of the application the request is headed for. It then does a basic hash lookup in the routing table to find the list of potential backends that represent this particular application. These look like:
{'' => ['', '', etc]}
Once it has looked up the list of currently running instances of the app, the Router picks a random backend instance and forwards the request to it. Responses are also inspected at the header level, for injection if need be, for functionality such as sticky sessions.
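The register/lookup behavior of that routing table can be sketched as follows (an illustrative in-memory structure, not the Router’s actual code):

```ruby
# In-memory routing table: Host header value => list of "host:port"
# backend instances currently running that app.
class RoutingTable
  def initialize
    @table = Hash.new { |h, k| h[k] = [] }
  end

  # Called when a DEA announces an app instance coming online.
  def register(host, backend)
    @table[host] |= [backend]
  end

  # Called when an instance goes offline.
  def unregister(host, backend)
    @table[host].delete(backend)
  end

  # Pick a random live backend for the request's Host: header,
  # or nil if no instance of that app is running.
  def pick(host)
    backends = @table[host]
    backends.sample unless backends.empty?
  end
end
```

The random `sample` mirrors the "pick a random backend" behavior described above; retries on failure would simply call `pick` again with the failed backend unregistered.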

The Router can retry another backend if the one it chose fails and there are many ways to customize this behavior if you have your own instance of Cloud Foundry setup somewhere. Routers are fairly straightforward in what they do and how they do it. They are eventmachine based and run on ruby-1.9.2 so they are fast and can handle quite a bit of traffic per instance, but like every other component in the system, they are horizontally scalable and you can add more as needed in order to scale up bigger and bigger. The system is architected in such a way that this can even be done on a running system.

Health Manager:
The Health Manager is a standalone daemon that has a copy of the same models the Cloud Controller has and can currently see into the same database as the Cloud Controller.

This daemon has an interval on which it wakes up and scans the database of the Cloud Controller to see what the state of the world “should be”, then it actually goes out into the world and inspects the real state to make sure it matches. If there are things that do not match, it initiates messages back to the Cloud Controller to correct them. This is how we handle the loss of an application, or even of a whole DEA node. If an application goes down, the Health Manager will notice and will quickly remedy the situation by signaling the Cloud Controller to start a new instance. If a DEA node completely fails, the app instances running there will be redistributed back out across the grid of remaining DEA nodes.
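That reconcile step can be sketched as a comparison of desired versus observed instance counts (the names and the action tuples here are assumptions, not the Health Manager’s real message format):

```ruby
# desired: app name => instance count the Cloud Controller's DB records.
# actual:  app name => instances actually observed running on the grid.
# Returns the corrective actions the Health Manager would signal back
# to the Cloud Controller.
def reconcile(desired, actual)
  actions = []
  desired.each do |app, want|
    have = actual.fetch(app, 0)
    actions << [:start, app, want - have] if have < want
    actions << [:stop,  app, have - want] if have > want
  end
  actions
end
```

Running this on every wakeup is what makes a crashed app instance, or a whole lost DEA, heal automatically.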

Services:
These are the services your application can choose to use and bind to in order to get data, messaging, caching and other capabilities. This is currently where redis and mysql run, and it will eventually become a huge ecosystem of services offered by VMware and anyone else who wants to offer their service into our cloud. One of the cool things I will highlight is that you can share service instances between your different apps deployed onto a VCAP system, meaning you could have a core Java Spring app with a bunch of satellite Sinatra apps communicating via redis or a RabbitMQ message bus.

Ok, my fingers are tired and my shoulder hurts, so I am going to call this first post done. I plan on blogging a lot more often, as well as trying to help organize the community around Cloud Foundry the open source project. I hope you are as excited as I am by this project; it is basically like “rack for frameworks, clouds and services” rather than just ruby frameworks. Pluggable across the three different axes, and well tested and well coded. This thing is very cool and I am very proud just to be a member of the team working on it.

This has been a huge team effort to get out the door and we hope it will become a huge community effort to keep driving it forward to truly make it “The Linux Kernel of Cloud Operating Systems”. Will you please join the community with me and help build this thing out to meet its true potential?

Scaling Redis

When a database is limited to running on a single computer, only a certain load can be served with acceptable latency. Adding hardware to this single computer will help only so much. Doubling or tripling the load may require significantly more than twice or thrice the cost of hardware to scale up. Such an approach is expensive and eventually hits its limit. Ideally, we would start with a single inexpensive computer and, as load increases, keep adding the same inexpensive computers, resulting in a near-linear function between load and cost. Such horizontal scaling out is commonplace in today’s web applications because it provides a more predictable cost model.

Clusters of inexpensive commodity hardware lead to a disruption in the database ecosystem. What further amplifies this disruption is the falling price of RAM and solid state storage. A number of NoSQL databases are truly leveraging this disruption. At Meshin, one of our key requirements is the lightning-fast delivery of query results. Think of Google Instant search: showing results with “as you type” latency enables a whole new class of use cases. This led us to consider various in-memory database engines. Redis, with its simple and elegant data model and very transparent performance characteristics, came out on top. For in-memory databases, a single-threaded design becomes an obvious choice, essentially removing significant overhead common to traditional database architectures. Another assumption in the Meshin design is that if the index is partially or completely lost, it can be recovered by re-indexing the original data. Sure, this introduces short-term inconsistencies, but on the other hand it allows us to relax durability and further simplify the design. Such a trade-off is a perfect fit for the Redis persistence strategy.

The single-threaded design of Redis brings up another interesting issue. Scaling a single instance of Redis is limited not just by the computer it runs on, which would amount to RAM size, but also by the single core or hardware thread it will utilize. So if your load exceeds a single hardware thread or the size of available RAM, you need a second instance of Redis. This requires some approach to clustering. Effective clustering is all about figuring out the right partitioning scheme. Ideally, you need to split load into identically-sized partitions. If load becomes biased towards one or more partitions, you will have a bottleneck. Such a bottleneck will limit your system’s scalability, or in other words will make your load-to-cost function non-linear. It is also important that partitions are as independent as possible. If an operation spans a group of partitions, its scalability will be limited by the number of such groups in the cluster. In the extreme case of an operation spanning all partitions in a cluster, the scalability will be no better than running on a single partition.

Another aspect is differentiating the load by reads and writes. If the Redis hardware thread capacity is exceeded by reads, it is easy to scale out by adding read-only replicas of the same data. It is much harder to scale out writes by replication, where trading off consistency is often required. Redis again takes a simple approach with its master-slave replication. A writable master replica asynchronously updates one or more read-only slave replicas: right after replying to a write operation, the master notifies all replicas, and no acknowledgment is required by the master. This means that there is a short period of time when slave replicas may return old data. Write-write consistency guarantees remain as strong as without replication; the write-read consistency guarantees, however, are now weaker, or “eventual”. It is important to note that replication not only helps the scalability of reads, but also improves reliability when replicas are placed on different machines. For the Meshin application, eventual write-read consistency is an acceptable tradeoff for higher reliability.
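That read/write split, and the brief staleness it allows, can be sketched with plain hashes standing in for Redis connections (a toy model, not the redis client library):

```ruby
# A toy read/write splitter: writes go to the master, reads are spread
# round-robin across read-only replicas. Hash objects stand in for real
# Redis connections, and replication is simulated by hand.
class ReplicatedStore
  def initialize(master, replicas)
    @master   = master
    @replicas = replicas
    @next     = 0
  end

  def write(key, value)
    @master[key] = value   # replication to slaves is asynchronous
  end

  # Round-robin reads across replicas; may briefly return stale data,
  # which is the "eventual" write-read consistency described above.
  def read(key)
    replica = @replicas[@next % @replicas.size]
    @next += 1
    replica[key]
  end
end
```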

With the Meshin application, we have a large number of user indexes, each of roughly equal size. Meshin keeps a sliding-window index of email, Twitter, Facebook etc. messages to maintain a predictable maximum index size. This provides a great opportunity for partitioning: if each partition handles roughly an equal number of users, we have balanced the storage and load requirements, and at the same time made the most frequent operations touch only one partition. One approach to choosing among N partitions is hashing a unique user identifier, for example an email address, such that the hash space is in the [0..N) range. If the quality of the hash function is sufficient, we will get a well-balanced distribution.
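A sketch of such a hash-based partition function (using SHA1 for illustration; the actual hash Meshin uses is not specified here):

```ruby
require "digest/sha1"

# Map a unique user identifier (e.g. an email address) onto one of
# +n+ partitions via a stable hash into the [0..n) range.
def partition_for(user_id, n)
  Digest::SHA1.hexdigest(user_id).to_i(16) % n
end
```

Because the hash is stable, the same user always lands on the same partition, and a good hash spreads users evenly across all N.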


TripleD: Distributed GFS-like File System Using Redis and zeromq

On GitHub:

This is a very simple distributed file system borrowing many ideas from GFS. The idea is to focus on simplicity and performance, not availability or fault tolerance (at all).

Triple D:

  1. Dead simple…
  2. Done right…
  3. Distributed file system

I’ll leave it to expert Jeff Darcy to comment on TripleD.

Update: Jeff was again kind enough to comment:

Not a filesystem; doesn’t even look serious. Explicit copy in/out, no real directories, single MDS, no security/consistency/etc. In short, it looks like it is dismissing complexity it doesn’t comprehend. Needs to grow up a lot.

Original title and link: TripleD: Distributed GFS-like File System Using Redis and zeromq (NoSQL databases © myNoSQL)

Redis based triple database

The Meshin application relies on a back-end triple store for holding the personal semantic index and for processing front-end queries. This back-end store is built on top of Redis, an open source in-memory key-value database. Before getting into the details of how triples are represented and queried, I will briefly introduce the essential Redis features. Feel free to skip the following paragraph if you are familiar with Redis.

Redis is a key-value store where keys are binary strings and values can be either simple binary strings or higher-order data structures. These data structures include ordered lists, unordered unique sets, secondary-level hash tables (hsets) and weight-sorted sets (zsets). Redis exposes its functionality through a simple text-based protocol, which defines a number of commands and corresponding replies. Commands are either general for all kinds of key-values or specialized per value type. For example, SET A X associates the binary string value X with the key A.
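That text protocol (RESP) frames each command as an array of bulk strings; a from-scratch sketch of encoding `SET A X` for the wire (not using the redis gem):

```ruby
# Encode a Redis command in the RESP wire format: an array header
# (*<count>) followed by one bulk string ($<byte-length>, then the
# payload) per argument, each line terminated by CRLF.
def resp_encode(*args)
  out = "*#{args.size}\r\n"
  args.each { |a| out << "$#{a.bytesize}\r\n#{a}\r\n" }
  out
end
```

Writing `resp_encode("SET", "A", "X")` to a Redis socket is exactly what client libraries do under the hood; replies come back in the same framed format.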


Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison

While SQL databases are insanely useful tools, their monopoly of ~15 years is coming to an end. And it was about time: I can’t even count the things that were forced into relational databases but never really fit them.

But the differences between “NoSQL” databases are much bigger than they ever were between one SQL database and another. This means that software architects bear a bigger responsibility to choose the appropriate one for a project right at the beginning.

In this light, here is a comparison of Cassandra, MongoDB, CouchDB, Redis, Riak, Membase, Neo4j and HBase.


Rails Caching Benchmarked: MongoDB, Redis, Memcached

A couple of Rails caching solutions—file, memcached, MongoDB, and Redis—benchmarked first here by Steph Skardal and then here by Thomas W. Devol. Devol concludes:

Though it looks like mongo-store demonstrates the best overall performance, it should be noted that a mongo server is unlikely to be used solely for caching (the same applies to redis); it is likely that non-caching related queries will be running concurrently on a mongo/redis server, which could affect the suitability of these benchmarks.

I’m not a Rails user, so please take these with a grain of salt:

  • without knowing the size of the cached objects, at 20,000 iterations most probably neither MongoDB nor Redis had to persist to disk.

    This means that all three of memcached, MongoDB and Redis stored data in memory only[1]

  • if no custom object serialization is used by any of the memcached, MongoDB or Redis caches, then the performance difference is mostly caused by the performance of the driver

  • it should not be a surprise to anyone that the size of the cached objects can and will influence the results of such benchmarks

  • there doesn’t seem to be any concurrent access to the caches. Concurrent access and concurrent updates of caches are real-life scenarios, and not including them in a benchmark greatly reduces the value of the results

  • none of these benchmarks seems to contain code that measures the performance of cache eviction

  1. Except in the case where any of these forces a disk write  

Original title and link: Rails Caching Benchmarked: MongoDB, Redis, Memcached (NoSQL databases © myNoSQL)