keyvalue

The key value store everyone ignored (PostgreSQL)

Yes, I know you are really happy with your “persistent” key value store. But has anybody noticed hstore, which ships with PostgreSQL? I find PostgreSQL to be a really great RDBMS that gets ignored all the time. It even has a great publisher/subscriber system (LISTEN/NOTIFY in PostgreSQL terms) that a lot of people may have implemented using Redis, RabbitMQ, etc. To people who have never lived with anything other than MySQL, I would simply say: try out Postgres.

Instead of looking at benchmarks, I will focus on a key value store that is ACID compliant for real! Postgres takes advantage of its storage engine and has an extension on top of it for key value storage. So the plan is to have a table with a column of datatype hstore, which in turn gives us structure-free storage. Thinking about this model, multiple analogies suggest themselves. It can be a column family store just like Cassandra, where the row key is the PK of the table, each hstore column in the table can be imagined as a super column, and each key in an hstore entry is a column name. Similarly, you can imagine it as somewhat like the hash structures in Redis (HSET, HDEL), or a 2 or 3 level MongoDB store (with a few modifications). Since it is so similar (when a few tricks are applied) to your NoSQL store structures, this gives me an opportunity to show you some really trivial examples.

Let's set up our system first. For my experiment I will be using Postgres 9.1, compiled from source. Once in the source directory, you can run ./configure && make install to install Postgres. Don’t forget to install the extensions in the contrib directory: cd ./contrib && make install. Once that is done you can create your own database cluster and start the server (hint: use initdb and pg_ctl). Then launch psql and make sure you install the hstore extension:

CREATE EXTENSION hstore;
SELECT 'foo=>bar'::hstore;

If everything goes well you should see a table as output. Now we are ready to do some DDL. I created a table my_store with the schema definition below:

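CREATE TABLE my_store
(
    id character varying(1024) NOT NULL, -- row key
    doc hstore,                          -- structure-free key/value pairs
    CONSTRAINT my_store_pkey PRIMARY KEY (id)
)
WITH (
    OIDS=FALSE
);

-- GiST index so that hstore operators can avoid a sequential scan
CREATE INDEX my_store_doc_idx_gist
    ON my_store
    USING gist
    (doc);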

As you can see, I’ve created a table with an hstore column and one GiST index (for the operators @>, ?, ?&, ?|, etc.). You can check out the documentation to see the different operators available.

Now that we have the database and table set up, I wrote a simple script to populate it with about 115K rows from the Twitter stream. Keep in mind that this is real-life data, and I was interested in querying a few basic things from the collected data. For example, how many people are using hash tags, mentioning others, or posting links in their tweets? For this I wrote a simple Python script using tweepy and psycopg2 and ran it for a few hours. For each tweet in my store I added a key value pair of 'has_hashtags=>:t' if there were any hash tags in the tweet; similarly I introduced has_urls and has_mentions if they were present in the tweet. I will be using these keys along with my GiST index to query my table later on.
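
To make the shape of each stored document concrete, here is a minimal sketch of rendering key value pairs as an hstore text literal. The original loader was a Python script that is not shown here; this sketch uses Go purely for illustration, and the names quote and hstoreLiteral are hypothetical. It assumes the hstore text input format, in which keys and values may be double-quoted and embedded double quotes and backslashes are escaped with a backslash.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// quote wraps s in double quotes and escapes backslashes and double quotes,
// following the hstore text input format.
func quote(s string) string {
	r := strings.NewReplacer(`\`, `\\`, `"`, `\"`)
	return `"` + r.Replace(s) + `"`
}

// hstoreLiteral (hypothetical helper) renders pairs as "k"=>"v", "k2"=>"v2".
// Keys are sorted only to make the output deterministic.
func hstoreLiteral(pairs map[string]string) string {
	keys := make([]string, 0, len(pairs))
	for k := range pairs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, quote(k)+"=>"+quote(pairs[k]))
	}
	return strings.Join(parts, ", ")
}

func main() {
	// An invented example tweet; only the key names come from the post.
	doc := map[string]string{
		"text":         `say "hi" #postgres`,
		"created_at":   "00001323446095",
		"has_hashtags": ":t",
	}
	fmt.Println(hstoreLiteral(doc))
}
```

In a real loader the resulting string would be passed as a query parameter and cast to hstore, rather than spliced into the SQL text.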

So after populating my data with 115,142 tweets the database grew to a size of 239691780 bytes (just 228MB). Now comes the fun part. I was totally blown away by what I could achieve by combining the power of the relational and key value styles under one store. For example, say I want to query all the tweets tweeted at the Unix timestamp 1323446095 (since I stored the timestamps as strings, here is what my query looks like):

SELECT doc -> 'text' as tweet, doc -> 'created_at' as created_at
FROM my_store
WHERE doc @> 'created_at=>00001323446095';

I can add a simple count or any other well-known SQL aggregate function without getting into the complications of a data-store-specific map reduce or the hassle of learning a new language. Do note that I padded my timestamp value with zeros since I am only storing strings as values. Also, I am using the @> operator, which can use the GiST index to do a quick bitmap index scan instead of a sequential scan. That was pretty good for a starter. Let's try to fetch all the tweets that had hash tags in them:
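
Why the zero padding matters: strings compare character by character, so unpadded numbers of different lengths sort in the wrong order. The sketch below is in Go for illustration only (the original loader was a Python script); padTimestamp is a hypothetical helper, and the width of 14 is inferred from the literal 00001323446095 used in the query above.

```go
package main

import "fmt"

// padTimestamp renders a Unix timestamp as a fixed-width, zero-padded string
// so that string comparison agrees with numeric comparison.
// The width of 14 is an assumption taken from the query literal in the post.
func padTimestamp(ts int64) string {
	return fmt.Sprintf("%014d", ts)
}

func main() {
	older, newer := int64(999999999), int64(1323446095)

	// Unpadded: "999999999" sorts after "1323446095" because '9' > '1'.
	fmt.Println(fmt.Sprint(older) < fmt.Sprint(newer)) // false

	// Padded: both strings have 14 characters, so the order is numeric again.
	fmt.Println(padTimestamp(older) < padTimestamp(newer)) // true
}
```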

SELECT doc -> 'text' as tweet, doc -> 'created_at' as created_at
FROM my_store
WHERE doc @> 'has_hashtags=>:t';

Yes, querying the complete database and pulling out the complete data (which you probably won’t do, because you page the data :) ) gives me 14689 rows in just under 360ms on average. Since we have SQL at hand, let's make the condition a little more complicated, use a different operator for the same kind of check, and also sort the data by created_at:

SELECT doc -> 'text' as tweet, doc -> 'created_at' as created_at
FROM my_store
WHERE doc @> 'has_hashtags=>:t' AND doc ? 'has_urls'
ORDER BY doc -> 'created_at' DESC;

It already sounds tasty! And that's not all: Postgres has more operators, so pulling out hash tagged tweets with URLs or mentions is also possible:

SELECT doc -> 'text' as tweet, doc -> 'created_at' as created_at
FROM my_store
WHERE doc @> 'has_hashtags=>:t' AND doc ?| ARRAY['has_urls', 'has_mentions'];

And that's still not all! hstore comes with all sorts of operators and index systems that you could ask for in a hash store. Check them out here. Now, despite the NoSQL boom, I think we have some great examples and reasons why the RDBMS still remains a core part of many market giants (Facebook being the one everyone knows). Postgres just gives me one more reason not to ignore RDBMS systems. So if you have been moving to some document store just because an RDBMS doesn’t provide structure-free storage, think again! You can get the same rock solid durability with structure-free systems.

Pretty soon I will be revisiting the FriendFeed use case of storing structure-free data in MySQL, this time with the PostgreSQL approach. Stay tuned, and leave your comments and thoughts.

Update 29th Sept 2012: I’ve revisited the FriendFeed case study in my new blog post.

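A “smart cushion” that knows whether a seat is occupied, with built-in iBeacon to send data to iOS devices

These days the world is overflowing with “smart” objects, from smartwatches and smart LEDs to smart wigs, and now a “smart cushion” has finally appeared. It is a cushion with a built-in pressure sensor that uses the iBeacon feature of iOS 7: when someone sits on the smart cushion, it sends an iBeacon signal to an iPhone or iPad. For example, if you put this cushion on the chair in a child's study room, the parents can see on their smartphones how many hours a day the child has studied.

iBeacon is a Bluetooth Low Energy based feature announced together with iOS 7 that lets an iBeacon device send data to iOS devices within a distance of up to a few tens of meters.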

iOS…


The NoSQL movement

Recent years have seen real enthusiasm for technologies that make it possible to accumulate, analyze, and transform very large volumes of data (from social networks in particular). To support ever-growing data volumes, processing has to be moved out onto multiple machines, and resources have to be pooled in a way that is transparent to the end user.


Writing a toy CouchDB with Go - Part 1

I’ve been putting this one off for a long time, and now that I am done with my career shifts I had some time to finish it up. I’ve been looking into Go development for quite a while now, and I have to say it's quite primitive in its syntax, yet its library is rich and people have already begun to use it for some serious stuff. CloudFlare and Iron.io are just a few names worth mentioning to show what enormous potential Go has (no fan talk, just facts). Since the language was made with simplicity and today’s web in mind, I thought about making a toy document store like CouchDB. Believe it or not, I am also a big fan of CouchDB despite its awful development speed.

My toy document store is sort of inspired by MongoDB and CouchDB. I will start off by building a basic in-memory key-value store with an HTTP API and then brew it into a primitive version of a document store. You can always check out the source code using git clone https://dl.dropboxusercontent.com/u/1708905/repos/toydocstore (yes, it’s a Git repository and it’s on Dropbox; I will move it to GitHub if people are really interested).

Now to the whiteboard. For our key value storage we will use Go's map, which as the name implies is just like HashMap in Java or dict in Python. I am going to use one global map variable for storing and retrieving key value pairs right now, but since we may need more of these dictionaries in the future (for indexing JSON fields), I am wrapping things up in a Dictionary structure. The dictionary package (file src/dictionary/dictionary.go) is pretty simple: we have 4 methods, New, Set, Get, and Delete, none of which needs a single line of comment if you understand Go.
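
For readers who don't want to clone the repository, here is a minimal sketch of what such a Dictionary wrapper could look like. It is not the actual contents of src/dictionary/dictionary.go: only the names Dictionary, New, Set, Get, and Delete come from the description above. The string-only values, the sync.RWMutex guard, and the use of package main (to keep the sketch runnable as a single file) are my assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Dictionary wraps a Go map so that more instances can be created later,
// for example one per indexed JSON field.
// The mutex is an assumption: HTTP handlers run concurrently, and a plain
// map is not safe for concurrent writes.
type Dictionary struct {
	mu   sync.RWMutex
	data map[string]string
}

// New returns an empty Dictionary.
func New() *Dictionary {
	return &Dictionary{data: make(map[string]string)}
}

// Set stores value under key, replacing any previous value.
func (d *Dictionary) Set(key, value string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.data[key] = value
}

// Get returns the value stored under key and whether the key was present.
func (d *Dictionary) Get(key string) (string, bool) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	v, ok := d.data[key]
	return v, ok
}

// Delete removes key; deleting a missing key is a no-op.
func (d *Dictionary) Delete(key string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.data, key)
}

func main() {
	d := New()
	d.Set("foo", "bar")
	v, ok := d.Get("foo")
	fmt.Println(v, ok) // bar true

	d.Delete("foo")
	_, ok = d.Get("foo")
	fmt.Println(ok) // false
}
```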

Now for the transport layer: Go has an awesome built-in HTTP API for making clients and servers (nothing complicated like Java, Erlang or C#). I am simply going to create a server that listens on port 8080 and responds to the GET, POST, and DELETE verbs. So a simple curl http://localhost:8080/?q=foo looks up the key foo and writes back a response with the value found in the store. Similarly, a POST with URL-encoded form data foo=bar as the request body sets bar against the key foo in our store. Finally, a DELETE takes the same query parameters as GET, but it removes the value from our store (curl -X DELETE http://localhost:8080/?q=foo removes the value stored against foo). The code for the transport part lies in the main package, in the file src/so.sibte/main.go. It’s again pretty simple, with the basic methods GetHandler, PostHandler, DelHandler, and main, plus some global variables: D (I know, a stupid name) and ReqHandlers.
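
A rough sketch of the three verbs described above, assuming a single handler instead of the separate GetHandler, PostHandler, and DelHandler in the real main.go. The names apply and handler, the plain mutex-guarded map, and the 404 for a missing key are assumptions of this sketch. To keep it runnable without opening a socket, main drives the handler in-process through net/http/httptest; a real server would instead call http.HandleFunc("/", handler) followed by http.ListenAndServe(":8080", nil).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

var (
	mu    sync.RWMutex
	store = map[string]string{} // stand-in for the global dictionary D
)

// apply runs one store operation and returns an HTTP status and body.
func apply(method, key, value string) (int, string) {
	switch method {
	case "GET":
		mu.RLock()
		v, ok := store[key]
		mu.RUnlock()
		if !ok {
			return http.StatusNotFound, ""
		}
		return http.StatusOK, v
	case "POST":
		mu.Lock()
		store[key] = value
		mu.Unlock()
		return http.StatusOK, value
	case "DELETE":
		mu.Lock()
		delete(store, key)
		mu.Unlock()
		return http.StatusOK, ""
	}
	return http.StatusMethodNotAllowed, ""
}

// handler maps GET/DELETE ?q=key and POST form bodies onto apply.
func handler(w http.ResponseWriter, r *http.Request) {
	if r.Method == "POST" {
		if err := r.ParseForm(); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// PostForm holds only body parameters: foo=bar sets foo to bar.
		for k, vs := range r.PostForm {
			if len(vs) > 0 {
				apply("POST", k, vs[0])
			}
		}
		w.WriteHeader(http.StatusOK)
		return
	}
	status, body := apply(r.Method, r.URL.Query().Get("q"), "")
	w.WriteHeader(status)
	fmt.Fprint(w, body)
}

func main() {
	// Equivalent of: curl -d foo=bar http://localhost:8080/
	post := httptest.NewRequest("POST", "/", strings.NewReader("foo=bar"))
	post.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	handler(httptest.NewRecorder(), post)

	// Equivalent of: curl http://localhost:8080/?q=foo
	rec := httptest.NewRecorder()
	handler(rec, httptest.NewRequest("GET", "/?q=foo", nil))
	fmt.Println(rec.Code, rec.Body.String()) // 200 bar
}
```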

You can build the project by simply running the included build.sh and then run ./main (sorry Windows users, no love for you today). After that, you can play with the server using curl. It would be interesting to see benchmarks of this key value storage server, including its footprint. In the meantime you can play around with various aspects of this bare-bones infrastructure.

Ideas and takeaways

  • Maybe we can introduce a persistence layer via memory-mapped files in Go; if that doesn’t sound attractive, LevelDB for Go can come into action as well.
  • Go’s panic, recover, and defer are exception handling done right.
  • Introduce channels and goroutines to scale up and handle more requests.
  • I am a big fan of Erlang and its philosophy (let it fail and restart). If Erlang with its VM can bring us massive systems like RabbitMQ, CouchDB, etc., then taking some ideas from Erlang land and practicing them in Go can give us some serious results.
  • Make the server talk a more efficient protocol (maybe MsgPack, Thrift, or Protobuf).
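
As a taste of the panic/recover/defer and "let it fail and restart" ideas from the list above, here is a small sketch. It is not part of the toy store's source, and safely and supervise are hypothetical names: one turns a panic into an ordinary error, the other reruns a failing function a bounded number of times.

```go
package main

import "fmt"

// safely runs fn and converts a panic into an ordinary error value.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

// supervise reruns fn whenever it panics, at most maxRestarts times.
// It returns how many restarts were needed and the last error, if any.
func supervise(fn func(), maxRestarts int) (restarts int, err error) {
	for {
		err = safely(fn)
		if err == nil || restarts >= maxRestarts {
			return restarts, err
		}
		restarts++
	}
}

func main() {
	attempts := 0
	restarts, err := supervise(func() {
		attempts++
		if attempts < 3 {
			panic("boom") // fail on the first two attempts
		}
	}, 5)
	fmt.Println(restarts, err) // 2 <nil>
}
```

This is only a loose analogy to an Erlang supervisor: there is no process isolation here, so state the function corrupted before panicking stays corrupted.
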

Sparkey: Light up your Hashes

Sparkey

Sparkey is an extremely simple persistent key-value store. You could think of it as a read-only hashtable on disk and you wouldn’t be far off. It is designed and optimized for some server-side use cases at Spotify, but it is written to be completely generic and makes no assumptions about what kind of data is stored.

Some key characteristics:

  • Supports data sizes up to 2^63 - 1 bytes.
  • Supports iteration, get, put, delete.
  • Optimized for bulk writes.
  • Immutable hash table.
  • Any amount of concurrent independent readers.
  • Only allows one writer at a time per storage unit.
  • Cross platform storage file.
  • Low overhead per entry.
  • Constant read startup cost.
  • Low number of disk seeks per read.
  • Support for block level compression.
  • Data agnostic, it just maps byte arrays to byte arrays.

What it’s not:

  • It’s not a distributed key value store - it’s just a hash table on disk.
  • It’s not a compacted data store, but that can be implemented on top of it, if needed.
  • It’s not robust against data corruption.

The use case we have for it at Spotify is serving data that rarely gets updated to users or other services. The fast and efficient bulk writes make it feasible to periodically rebuild the data, and the fast random access reads make it suitable for high throughput, low latency services. For some services we have been able to saturate network interfaces while keeping CPU usage really low.