
Crittercism: Scaling To Billions Of Requests Per Day On MongoDB

This is a guest post by Mike Chesnut, Director of Operations Engineering at Crittercism. This June, Mike will present at MongoDB World on how Crittercism scaled to 30,000 requests/second (and beyond) on MongoDB.

MongoDB is capable of scaling to meet your business needs; its name, after all, comes from the word humongous. That doesn’t mean there aren’t growing pains you’ll encounter along the way, of course. At Crittercism, we’ve seen tremendous growth over the past two years and hit some bumps in the road, but we’ve also learned some important lessons that can hopefully be of use to others.

Background

Crittercism provides the world’s first and leading Mobile Application Performance Management (mAPM) solution. Our SDK is embedded in tens of thousands of applications, and used by nearly a billion users worldwide. We collect performance data such as error reporting, crash diagnostics details, network breadcrumbs, device/carrier/OS statistics, and user behavior. This data is largely unstructured and varies greatly among different applications, versions, devices, and usage patterns.

Storing all of this data in MongoDB allows us to collect raw information that may be of use in a variety of ways to our customers, while also providing the analytics necessary to summarize the data down to digestible, actionable metrics.

As our request volume has grown, so too has our data footprint: over the course of 18 months, our daily request volume increased more than 40x.

Our primary MongoDB cluster now houses over 20TB of data, and getting to this point has helped us learn a few things along the way.

Routing

The MongoDB documentation suggests that the most common topology is to include a router — a mongos process — on each client system. We started doing this and it worked well for quite a while.

As the number of front-end application servers in production grew from tens to several hundred, we found that we were creating heavy load via hundreds (or sometimes thousands) of connections between our mongos routers and our mongod shard servers. This meant that whenever chunk balancing occurred (an integral part of maintaining a well-balanced, sharded MongoDB cluster), the chunk location information stored in the config database took a long time to propagate, because every mongos router needs a full picture of where in the cluster all of the chunks reside.

So what did we learn? We found that we could alleviate this issue by consolidating mongos routers onto a few hosts. Our production infrastructure is in AWS, so we deployed 2 mongos servers per availability zone. This gave us redundancy per AZ, as well as the shortest possible network path from the clients to the mongos routers. We were concerned about adding an extra hop to the request path, but using Chef to configure all of our clients to talk only to the mongos routers in their own AZ helped minimize this issue.

Making this topology change greatly reduced the number of open connections to our mongod shards, which we were able to measure using MMS, without a noticeable reduction in application performance. At the same time, there were several improvements to MongoDB that made both the mongos updates and the internal consistency checks more efficient in general. Combined with the new infrastructure this meant that we could now balance chunks throughout our cluster without causing performance problems while doing so.

Shard Replacement

Another scenario we’ve encountered is the need to dynamically replace mongod servers in order to migrate to larger shards. Again following recommended deployment best practices, we deploy MongoDB onto server instances with large RAID10 arrays running XFS; we use m2.4xlarge instances in AWS with 16 disks. We’ve used basic Linux mdadm for performance, but at the expense of flexibility in disk configuration. As a result, when we are ready to allocate more capacity to our shards, we need to perform a migration procedure that can sometimes take several days. This not only means that we need to plan ahead appropriately, but also that we need to be aware of the full process in order to monitor it and react when things go wrong.

We start with a replica set where all replicas are at approximately the same disk utilization. We first create a new server instance with more disk allocated to it, and add it to this replica set with rs.add().

The new replica will enter the STARTUP2 state and remain there for a long time (in our case, usually 2-3 days) while it first clones data, then catches up via oplog replication and builds indexes. During this time, index builds will often stop the replication process (note that this behavior is set to change in MongoDB 2.6), and so the replication lag time does not strictly decrease the whole time — it will steadily decrease for a while, then an index build will cause it to pause and start to fall further behind. Once the index build completes the replication will resume. It’s worth noting that while index builds occur, mongostat and anything else that requires a read lock will be blocked as well.

Eventually the replica will enter the SECONDARY state and will be fully functional. At this point we can rs.stepDown() one of the old replicas, shut down the mongod process running on it, and then remove it from the replica set via rs.remove(), making the server ready for retirement.

We then repeat the process for each member of the replica set, until all have been replaced with the new instances with larger disks.
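
In mongo shell terms, one replacement cycle looks roughly like this (a sketch; the host names are placeholders):

> rs.add("new-larger-host:27017")      // new member enters STARTUP2 and begins cloning data
> rs.status()                          // watch until the new member reaches SECONDARY
> rs.stepDown()                        // only if the member being retired is the current primary
> rs.remove("old-smaller-host:27017")  // after shutting down mongod on the old member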

While this process can be time-consuming and somewhat tedious, it allows for a graceful way to grow our database footprint without any customer-facing impact.

Conclusion

Operating MongoDB at scale, as with any other technology, requires some knowledge that you can gain from documentation, and some that you have to learn from experience. By being willing to experiment with different strategies such as those shown above, you can discover flexibility that may not have been obvious beforehand. Consolidating the mongos router tier was a big win for Crittercism’s Ops team in terms of both performance and manageability, and developing the migration procedure described above has enabled us to continue growing to meet our needs without affecting our service or our customers.

See how Crittercism, Stripe, Carfax, Intuit, Parse and Sailthru are building the next generation of applications at MongoDB World. Register now and join the MongoDB Community in New York City this June.
Trying Out MongoDB

MongoDB is one of the databases commonly labeled NoSQL. Roughly speaking, it offers functionality and performance somewhere between an RDB (relational database) and a KVS (key-value store): by dropping features such as relations and transactions from the RDB model, it achieves performance approaching that of a KVS. In this post, I’ll try out MongoDB through the client bundled with it.

First, install MongoDB. On OS X, the easiest way is via Homebrew.

$ brew install mongodb

Once installed, start the MongoDB server.
$ mongod --config /usr/local/etc/mongod.conf

Next, connect using 'mongo', the command that serves as the client.
$ mongo

Once connected, declare the database to use with the 'use' command.
> use testdb
switched to db testdb

At this point it has no contents yet, so it seems nothing has actually been created.
> show dbs
admin  (empty)
local  0.078GB

Next, create a 'document' (the equivalent of an RDB record) inside a 'collection' (the equivalent of an RDB table). The following creates a collection named 'users' in the database and inserts a document with 'name' and 'age' fields into it. As you can see, MongoDB works with JSON (strictly speaking, BSON).
> db.users.insert({"name": "Foo", "age": 10});
WriteResult({ "nInserted" : 1 })

Now that it has some content, the database shows up in the listing.
> show dbs
admin   (empty)
local   0.078GB
testdb  0.078GB

Checking the collections confirms that a new one named 'users' has been created. 'system.indexes' holds information about the indexes used to speed up queries.
> db.getCollectionNames()
[ "system.indexes", "users" ]

The inserted document can be viewed with find(). A document identifier (_id) has been added automatically; in RDB terms it is the primary key, and a surrogate key at that.
> db.users.find()
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 10 }

To have something to work with, add more documents in the same way.
> db.users.insert({"name": "Bar", "age": 20});
WriteResult({ "nInserted" : 1 })
> db.users.insert({"name": "Baz", "age": 30});
WriteResult({ "nInserted" : 1 })
> db.users.find()
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 10 }
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }
{ "_id" : ObjectId("544cedccc5c20be818a66323"), "name" : "Baz", "age" : 30 }

find() can narrow down results like a SQL 'where' clause. For example, to select only the documents whose name is 'Bar':
> db.users.find({"name": "Bar"});
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }

Values don’t have to be matched exactly; various operators let you specify conditions flexibly. For example, to get the documents where age exceeds 15, use the $gt operator:
> db.users.find({age: {$gt: 15}});
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }
{ "_id" : ObjectId("544cedccc5c20be818a66323"), "name" : "Baz", "age" : 30 }

You can also use regular expressions, for example to get the names that start with 'B':
> db.users.find({ "name": { $regex: /^B/}})
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }
{ "_id" : ObjectId("544cedccc5c20be818a66323"), "name" : "Baz", "age" : 30 }

To combine multiple conditions, use the $or and $and operators. Let’s select the users who are under 25 or whose name starts with 'F':
> db.users.find({ $or: [{ "age": { $lt: 25} }, { "name": { $regex: /^F/ }}]});
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 10 }
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }

To retrieve only specific fields of a document, pass a second argument to find(). Let’s extract just the names:
> db.users.find({}, {"name": true, "_id": false})
{ "name" : "Foo" }
{ "name" : "Bar" }
{ "name" : "Baz" }

Results from find() can be sorted according to a given rule. Let’s display the ages in ascending order:
> db.users.find().sort({"age": 1})
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 10 }
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }
{ "_id" : ObjectId("544cedccc5c20be818a66323"), "name" : "Baz", "age" : 30 }

For descending order, specify -1.
> db.users.find().sort({"age": -1})
{ "_id" : ObjectId("544cedccc5c20be818a66323"), "name" : "Baz", "age" : 30 }
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 20 }
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 10 }

Combining sort() with limit() returns the top results after sorting. For example, to select the document with the lowest age:
> db.users.find().sort({"age": 1}).limit(1)
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 10 }

Since the console also supports JavaScript-style operations, you can mix in calculations like the following, which makes things like fetching the top n% of results easy (see the sketch after the snippet).
> var cnt = db.users.count()
> Math.floor(cnt / 2)
1 
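
Feeding that computed value into limit() then selects, for example, the youngest half of the users:

> db.users.find().sort({"age": 1}).limit(Math.floor(cnt / 2))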

Next, let’s update a document. For example, change the 'Foo' user’s age to 5. Values are updated with update().
> db.users.update({"name": "Foo"}, {$set: {"age": 5}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.users.find({"name": "Foo"})
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "name" : "Foo", "age" : 5 }

The $set operator modifies only the specified field of an existing document. Without it, the matched document is replaced wholesale by the document you pass in.
> db.users.update({"name": "Foo"}, {"age": 10})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.users.find({"name": "Foo"})
> db.users.find({"age": 10})
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "age" : 10 }

Note that by default update() modifies only the first matching document. To update all matching documents, set 'multi' to true in the third argument. For example, let’s change the age of every document whose age exceeds 5 to 100.
> db.users.update({"age": {$gt: 5}}, {$set: {"age": 100}}, {"multi": true})
WriteResult({ "nMatched" : 2, "nUpserted" : 0, "nModified" : 2 })
> db.users.find()
{ "_id" : ObjectId("544cec8cc5c20be818a66321"), "age" : 5 }
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 100 }
{ "_id" : ObjectId("544cedccc5c20be818a66323"), "name" : "Baz", "age" : 100 }

update() offers other useful operators besides $set. For example, the $inc operator increments a field’s value.
> db.users.update({"name": "Bar"}, {$inc: {"age": 1}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.users.find({"name": "Bar"})
{ "_id" : ObjectId("544ceda3c5c20be818a66322"), "name" : "Bar", "age" : 101 }

Documents that are no longer needed can be deleted with remove(). For example, to delete all documents:
> db.users.remove({})
WriteResult({ "nRemoved" : 3 })
> db.users.find()

If you’re ever unsure about an operation, just check the help.
> help
> db.help()
That about covers the basic operations.

Honestly, I have a feeling MongoDB’s use cases are more limited than they might appear. This comes partly from my own view that enforcing relations at the application level is not realistic, and mostly from the fact that MongoDB itself provides neither relations between documents nor transactions spanning multiple documents. Even its schemaless design, which frees you from defining a schema up front as in an RDB, is in practice a breeding ground for inconsistencies; despite the easygoing impression it gives, it demands that users rigorously manage whatever the system won’t guarantee for them. That said, it does seem like a good match for domains that need neither relations nor transactions; the obvious example is accumulating data in an append-only fashion. For instance, it looks like a good place to store events generated by an application. Next, I’d like to try operating it from the various language bindings.
NoSQL databases, Hadoop, Big Data: Pinned tabs Oct. 30th

00: This post format is still experimental. For each entry, the main link is under its prefix number (e.g. [01]). The ★ is a link to the item.


01: osquery is a new open source project from Facebook which exposes the operating system metrics as a resource that can be queried using SQL.


02: Polyglot persistence delivers, but it’s overcomplicated. The Weather Company is using Riak for storing data, Cassandra for serving API data, and MongoDB as a caching layer.


03: If you don’t test for possible failures, you might be in for a surprise. Stripe tried a more organized chaos monkey attack and discovered a scenario in which their Redis cluster loses all its data. They’ll move to Amazon RDS PostgreSQL: from an in-memory smart key-value engine to a relational database.


04: How a distributed database should really behave in front of massive failures. Netflix recounts their recent experience of having 218 Cassandra nodes rebooted without losing availability. At all.


05: Today I read twice about multi-model databases: once about ArangoDB and once about FoundationDB. The storage layer is the same, so apart from the data access layer (and potentially some in-memory transformations) I don’t really understand the idea of multi-model databases.


06: I really hope this announced presentation on database performance benchmarking from Tokutek’s VP of Engineering will be published. I have seen tons of benchmarks, most done wrong, but I’ve never run into an extended best-practices guide.


07: Emil Eifrem [1] says that the super cool iPad Paper app uses Neo4j for their sharing platform.


08: Grupo Globo [2], broadcasters of the FIFA World Cup, migrates from Redis to Cassandra hoping to rewind the 7 goals from Germany. One of these goals was achieved.


09: Java-based relational JSON database on top of PostgreSQL. Not very clear what it adds on top of the PostgreSQL JSON support.


10: Continuent is taking the VMware route. The pretty well-known MySQL clustering and replication tool Tungsten is now part of the VMware portfolio.


11: Learn how to use Scala to access MapR’s distributed file system. The post refers to it as a “file system similar to HDFS”, the first time I’ve seen such a Hadoop-related component referred to correctly.


12: What’s the difference between a data lake and a data mart? A data lake doesn’t wash the data and doesn’t try to structure it according to a predefined use case. You could think of it as the difference between a relational database (structured data) and data stored in a NoSQL database or Hadoop.


13: Using regular expressions to shard Redis, a new sharding option added by Redis Cloud. That’s a first.


14: Full text indexing in Neo4j: manual vs automatic vs schema indexes. (nb: make sure you check the first link in the blog post which explains the different types of indexes available in Neo4j).


15: Forrester’s Wave covering document databases: Cloudant, Couchbase, MarkLogic, MongoDB. In case you expected to see some other NoSQL database names, Forrester requires a referenceable install base for each product.


  1. Emil Eifrem is founder and CEO of Neo4j. 

  2. Grupo Globo is the largest media business in Latin America. 

Original title and link: NoSQL databases, Hadoop, Big Data: Pinned tabs Oct. 30th (NoSQL database©myNoSQL)

Announcing MongoDB 3.0

Today we announced MongoDB 3.0. The upcoming release will be generally available in early March. We’re making major improvements that have a big impact. Write performance has improved by 7x - 10x with WiredTiger and document-level concurrency control, compression reduces storage needs by up to 80%, and new management software called Ops Manager can reduce operational overhead by up to 95% for most tasks. To learn more, please check out the following resources:

MongoDB vs. Azure DocumentDB

Azure DocumentDB is a NoSQL document database service designed from the ground up to natively support JSON and JavaScript directly inside the database engine. It’s the right solution for cloud applications where predictable throughput, low latency, and flexible queries are key. Microsoft consumer applications like MSN use DocumentDB in production to support millions of users.

While both eschew the traditional relational approach to data, the greatest distinction between the two database platforms is that DocumentDB is PaaS by design, whereas MongoDB is not.

Great post from David Green at justazure.com.


Robert Haas takes a comparative look at PostgreSQL and the MongoDB features emphasized by MongoDB’s CEO in an interview:

Schireson also mentions another advantage of document stores: schema flexibility. Of course, he again ignores the possible advantages, for some users, of a fixed schema, such as better validity checking. But more importantly, he ignores the fact that relational databases such as PostgreSQL have had similar capabilities since before MongoDB existed. PostgreSQL’s hstore, which provides the ability to store and index collections of key-value pairs in a fashion similar to what MongoDB provides, was first released in December of 2006, the year before MongoDB development began. True JSON capabilities were added to the PostgreSQL core as part of the 9.2 release, which went GA in September of 2012. The 9.4 release, expected later this year, will greatly expand those capabilities. In today’s era of rapid innovation, any database product whose market advantage is based on the format in which it is able to store data will not retain that advantage for very long.

It’s difficult, if not impossible, to debate or contradict the majority of the facts and arguments the author is making. But in order to understand the history and future of developer tools, it’s worth emphasizing one aspect that has been almost completely ignored for way too long, and that the author mentions only briefly.

Developers want to get things done. Fast and Easy.

For too long, vendors thought that having a feature covered was enough, even if the user had to read a book or two, hire an army of consultants, postpone the deadlines, and finally make three incantations to get it working. This strategy worked well for decades. It worked especially well in the database space, where buying decisions were made at the top level due to the humongous costs.

MySQL became one of the most popular databases because it was free and perceived to be easier than any of the alternatives. Not because it was first. Not because it was feature complete. And definitely not because it was technically superior: PostgreSQL was always technically superior, but never got the install base MySQL did.

MongoDB replays this story by the book. It’s free. It promises features that were missing from, or are considered complicated in, the other products. And it’s perceived as the easiest database to use; a look at MongoDB’s history immediately reveals its primary focus on ease of use: great documentation, friendly setup, a fast getting-started experience. For a lot of people, it really doesn’t matter anymore that technically superior alternatives exist. They’ve got their things done. Fast and Easy. Tomorrow is another day.

Original title and link: Why the clock is ticking for MongoDB (NoSQL database©myNoSQL)

Getting Started with MongoDB and Java: Part I

By Trisha Gee, Java Engineer and Advocate at MongoDB

Java is one of the most popular programming languages in the MongoDB Community. For new users, it’s important to provide an overview of how to work with the MongoDB Java driver and how to use MongoDB as a Java developer.

In this post, which is aimed at Java/JVM developers who are new to MongoDB, we’re going to give you a guide on how to get started, including:

  • Installation
  • Setting up your dependencies
  • Connecting
  • What are Collections and Documents?
  • The basics of writing to and reading from the database
  • An overview of some of the JVM libraries

Installation

The installation instructions for MongoDB are extensively documented, so I’m not going to repeat any of that here. If you want to follow along with this “getting started” guide, you’ll want to download the appropriate version of MongoDB and unzip/install it. At the time of writing, the latest version of MongoDB is 2.6.3, which is the version I’ll be using.

A note about security

In a real production environment, of course you’re going to want to consider authentication. This is something that MongoDB takes seriously and there’s a whole section of documentation on security. But for the purpose of this demonstration, I’m going to assume you’ve either got that working or you’re running in “trusted mode” (i.e. that you’re in a development environment that isn’t open to the public).

Take a look around

Once you’ve got MongoDB installed and started (a process that should only take a few minutes), you can connect to the MongoDB shell. Most of the MongoDB technical documentation is written for the shell, so it’s always useful to know how to access it, and how to use it to troubleshoot problems or prototype solutions.

When you’ve connected, you should see something like

MongoDB shell version: 2.6.3                           
connecting to: test
> _  

Since you’re in the console, let’s take it for a spin. Firstly we’ll have a look at all the databases that are there right now:

> show dbs

Assuming this is a clean installation, there shouldn’t be much to see:

> show dbs
admin  (empty)
local  0.078GB
>

That’s great, but as you can see there’s loads of documentation on how to play with MongoDB from the shell. The shell is a really great environment for trying out queries and looking at things from the point-of-view of the server. However, I promised you Java, so we’re going to step away from the shell and get on with connecting via Java.

Getting started with Java

First, you’re going to want to set up your project/IDE to use the MongoDB Java Driver. These days IDEs tend to pick up the correct dependencies through your Gradle or Maven configuration, so I’m just going to cover configuring these.

At the time of writing, the latest version of the Java driver is 2.12.3 - this is designed to work with the MongoDB 2.6 series.

Gradle

You’ll need to add the following to your dependencies in build.gradle:

compile 'org.mongodb:mongo-java-driver:2.12.3'

Maven

For maven, you’ll want:

<dependencies>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>2.12.3</version>
    </dependency>
</dependencies>

Alternatively, if you’re really old-school and like maintaining your dependencies the hard way, you can always download the JAR file.

If you don’t already have a project that you want to try with MongoDB, I’ve created a series of unit tests on github which you can use to get a feel for working with MongoDB and Java.

Connecting via Java

Assuming you’ve resolved your dependencies and you’ve set up your project, you’re ready to connect to MongoDB from your Java application.

Since MongoDB is a document database, you might not be surprised to learn that you don’t connect to it via traditional SQL/relational DB methods like JDBC. But it’s simple all the same:
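
MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));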

Where I’ve put mongodb://localhost:27017, you’ll want to put the address of where you’ve installed MongoDB. There’s more detailed information on how to create the correct URI, including how to connect to a Replica Set, in the MongoClientURI documentation.

If you’re connecting to a local instance on the default port, you can simply use:
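
MongoClient mongoClient = new MongoClient();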

Note that this does throw a checked Exception, UnknownHostException. You’ll either have to catch this or declare it, depending upon what your policy is for exception handling.

The MongoClient is your route in to MongoDB, from this you’ll get your database and collections to work with (more on this later). Your instance of MongoClient (e.g. mongoClient above) will ordinarily be a singleton in your application. However, if you need to connect via different credentials (different user names and passwords) you’ll want a MongoClient per set of credentials.

It is important to limit the number of MongoClient instances in your application, hence why we suggest a singleton - the MongoClient is effectively the connection pool, so for every new MongoClient, you are opening a new pool. Using a single MongoClient (and optionally configuring its settings) will allow the driver to correctly manage your connections to the server. This MongoClient singleton is safe to be used by multiple threads.

One final thing you need to be aware of: you want your application to shut down the connections to MongoDB when it finishes running. Always make sure your application or web server calls MongoClient.close() when it shuts down.

Try out connecting to MongoDB by getting the test in Exercise1ConnectingTest to pass.

Where are my tables?

MongoDB doesn’t have tables, rows, columns, joins etc. There are some new concepts to learn when you’re using it, but nothing too challenging.

While you still have the concept of a database, the documents (which we’ll cover in more detail later) are stored in collections, rather than your database being made up of tables of data. But it can be helpful to think of documents like rows and collections like tables in a traditional database. And collections can have indexes like you’d expect.

Selecting Databases and Collections

You’re going to want to define which databases and collections you’re using in your Java application. If you remember, a few sections ago we used the MongoDB shell to show the databases in your MongoDB instance, and you had an admin and a local.

Creating and getting a database or collection is extremely easy in MongoDB:
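
DB database = mongoClient.getDB("TheDatabaseName");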

You can replace "TheDatabaseName" with whatever the name of your database is. If the database doesn’t already exist, it will be created automatically the first time you insert anything into it, so there’s no need for null checks or exception handling on the off-chance the database doesn’t exist.

Getting the collection you want from the database is simple too:
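
DBCollection collection = database.getCollection("TheCollectionName");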

Again, replacing "TheCollectionName" with whatever your collection is called.

If you’re playing along with the test code, you now know enough to get the tests in Exercise2MongoClientTest to pass.

An introduction to documents

Something that is, hopefully, becoming clear to you as you work through the examples in this blog, is that MongoDB is different from the traditional relational databases you’ve used. As I’ve mentioned, there are collections, rather than tables, and documents, rather than rows and columns.

Documents are much more flexible than a traditional row, as you have a dynamic schema rather than an enforced one. You can evolve the document over time without incurring the cost of schema migrations and tedious update scripts. But I’m getting ahead of myself.

Although documents don’t look like the tables, columns and rows you’re used to, they should look familiar if you’ve done anything even remotely JSON-like. Here’s an example:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  age: 34,
  address: {
    street: "123 Fake St",
    city: "Faketon",
    state: "MA",
    zip: "12345"
  },
  books: [ 27464, 747854, ...]
}  

There are a few interesting things to note:

  1. Like JSON, documents are structures of name/value pairs, and the values can be one of a number of primitive types, including Strings and various number types.
  2. It also supports nested documents - in the example above, address is a subdocument inside the person document. Unlike a relational database, where you might store this in a separate table and provide a reference to it, in MongoDB if that data benefits from always being associated with its parent, you can embed it in its parent.
  3. You can even store an array of values. The books field in the example above is an array of integers that might represent, for example, IDs of books the person has bought or borrowed.

You can find out more detailed information about Documents in the documentation.

Creating a document and saving it to the database

In Java, if you wanted to create a document like the one above, you’d do something like:
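
// Build the document with BasicDBObject from the 2.x driver; append() chains the fields.
DBObject person = new BasicDBObject("_id", "jo")
        .append("name", "Jo Bloggs")
        .append("age", 34)
        .append("address", new BasicDBObject("street", "123 Fake St")
                .append("city", "Faketon")
                .append("state", "MA")
                .append("zip", "12345"))
        .append("books", Arrays.asList(27464, 747854));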

At this point, it’s really easy to save it into your database:
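
MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
DB database = mongoClient.getDB("Examples");
DBCollection collection = database.getCollection("people");
collection.insert(person);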

Note that the first three lines are set-up, and you don’t need to re-initialize those every time.

Now if we look inside MongoDB, we can see that the database has been created:

> show dbs
Examples  0.078GB
admin     (empty)
local     0.078GB
> _

…and we can see the collection has been created as well:

> use Examples
switched to db Examples
> show collections
people
system.indexes
> _ 

…finally, we can see that our person, “Jo”, was inserted:

> db.people.findOne()
{
    "_id" : "jo",
    "name" : "Jo Bloggs",
        "age": 34,
    "address" : {
        "street" : "123 Fake St",
        "city" : "Faketon",
        "state" : "MA",
        "zip" : "12345"
    },
    "books" : [
        27464,
        747854
    ]
}
> _

As a Java developer, you can see the similarities between the Document that’s stored in MongoDB, and your domain object. In your code, that person would probably be a Person class, with simple primitive fields, an array field, and an Address field.

So rather than building your DBObject manually like the above example, you’re more likely to be converting your domain object into a DBObject. It’s best not to have the MongoDB-specific DBObject class in your domain objects, so you might want to create a PersonAdaptor that converts your Person domain object to a DBObject:
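
// A sketch of such an adaptor; the Person/Address getters here are assumed for illustration.
public static DBObject toDBObject(Person person) {
    return new BasicDBObject("_id", person.getId())
            .append("name", person.getName())
            .append("age", person.getAge())
            .append("address", new BasicDBObject("street", person.getAddress().getStreet())
                    .append("city", person.getAddress().getCity())
                    .append("state", person.getAddress().getState())
                    .append("zip", person.getAddress().getZip()))
            .append("books", person.getBookIds());
}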

As before, once you have the DBObject, you can save this into MongoDB:
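
// myPerson is assumed to be an instance of your Person domain class
collection.insert(PersonAdaptor.toDBObject(myPerson));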

Now you’ve got all the basics to get the tests in Exercise3InsertTest to pass.

Getting documents back out again

Now you’ve saved a Person to the database, and we’ve seen it in the database using the shell, you’re going to want to get it back out into your Java application. In this post, we’re going to cover the very basics of retrieving a document - in a later post we’ll cover more complex querying.

You’ll have guessed by the fact that MongoDB is a document database that we’re not going to be using SQL to query. Instead, we query by example, building up a document that looks like the document we’re looking for. So if we wanted to look for the person we saved into the database, “Jo Bloggs”, we remember that the _id field had the value of “jo”, and we create a document that matches this shape:
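
// Query by example: a document with the same _id as the one we saved.
DBObject query = new BasicDBObject("_id", "jo");
DBCursor cursor = collection.find(query);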

As you can see, the find method returns a cursor for the results. Since _id needs to be unique, we know that if we look for a document with this ID, we will find only one document, and it will be the one we want:
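
DBObject jo = cursor.one();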

Earlier we saw that documents are simply made up of name/value pairs, where the value can be anything from a simple String or primitive, to more complex types like arrays or subdocuments. Therefore in Java, we can more or less treat DBObject as a Map<String, Object>. So if we wanted to look at the fields of the document we got back from the database, we can get them with:
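
String name = (String) jo.get("name");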

Note that you’ll need to cast the value to a String, as the compiler only knows that it’s an Object.

If you’re still playing along with the example code, you’re now ready to take on all the tests in Exercise4RetrieveTest.

Overview of JVM Libraries

So far I’ve shown you the basics of the official Java Driver, but you’ll notice that it’s quite low-level - you have to do a lot of taking things out of your domain objects and poking them into MongoDB-shaped DBObjects, and vice-versa. If this is the level of control you want, then the Java driver makes this easy for you. But if it seems like this is extra work that you shouldn’t have to do, there are plenty of other options for you.

The tools I’m about to describe all use the MongoDB Java Driver at their core to interact with MongoDB. They provide a high-level abstraction for converting your domain objects into MongoDB documents, whilst also giving you a way to get to the underlying driver as well in case you need to use it at a lower level.

Morphia

Morphia is a really lightweight ODM (Object Document Mapper), so it’s similar to ORMs like Hibernate. Documents can be in a fairly similar shape to your Java domain objects, so the mapping can be automatic, but Morphia also lets you point the mapper in the right direction.

Morphia is open source, and has contributors from MongoDB. Sample code and documentation can be found here.

Spring Data

Another frequently used ODM is Spring Data. This supports traditional relational and non-relational databases, including MongoDB. If you’re already using Spring in your application, this should be a familiar way to work.

As always with Spring projects, there’s a lot of really great documentation, including a Getting Started guide with example code.

MongoJack

If you’re working with web services or something else that supports JSON, and you’re using Jackson to work with this data, it probably seems like a waste to be turning it from this form into a Java object and then into a MongoDB DBObject. But MongoJack might make your job easier, as it’s designed to map JSON objects directly into MongoDB. Take a look at the example code and documentation.

Jongo

This is another Jackson-based ODM, but provides an interesting extra in the form of supporting queries the way you’d write them in the shell. Documentation and example code is available on the website.

Grails MongoDB GORM

The Grails web application framework also supports its own Object-Relational Mapping (GORM), including support for MongoDB. More documentation for this plugin can be found here.

Casbah

This isn’t an ODM like the other tools mentioned, but the officially supported Scala driver for MongoDB. Like the previous libraries, it uses the MongoDB Java Driver under the covers, but it provides a Scala API for application developers to work with. If you like working with Scala but are searching for an async solution, consider ReactiveMongo, a community-supported driver that provides async and non-blocking operations.

Other libraries and tools

This is far from an extensive list, and I apologise if I’ve left a favourite out. But we’ve compiled a list of many more libraries for the JVM, which includes community projects and officially supported drivers.

Conclusion

We’ve covered the basics of using MongoDB from Java - we’ve touched on what MongoDB is, and you can find out a lot more detailed information about it from the manual; we’ve installed it somewhere that lets us play with it; we’ve talked a bit about collections and documents, and what these look like in Java; and we’ve started inserting things into MongoDB and getting them back out again.

If you haven’t already started playing with the test code, you can find it in this github repository. And if you get desperate and look hard enough, you’ll even find the answers there too.

Finally, there are more examples of using the Java Driver in the Quick Tour, and there is example code in github, including examples for authentication.

If you want to learn more, try our 7-week online course, “Intro to MongoDB and Java”.

Try it out, and hopefully you’ll see how easy it is to use MongoDB from Java.

Read Part II

NoSQL databases, Hadoop, Big Data: Pinned tabs Nov. 5th

01: A brief overview or rather a cheatsheet of MongoDB’s index types and commands.


02: I didn’t know that replicating data from Couchbase Lite to Couchbase requires an extra tool, the Sync Gateway.


03: A very nice read about how to make some of the most popular sequential clustering algorithms (k-means, single-linkage, correlation clustering) scale to large amounts of data using a map-reduce massively parallel computation model.


04: An intro to using Spark Streaming with some HBase and data visualization.


05: Benchmarking Amazon EBS options, spinning vs SSD vs Provisioned IOPS SSD, using Redis. No surprises here.


06: Researchers from MIT and the Israel Institute of Technology have proved that for a large class of non-blocking parallel algorithms, lock-free and wait-free implementations perform equally.

Lock-free algorithms guarantee that some concurrent operation will make progress. Wait-free algorithms guarantee that all threads make progress.


07: Facebook organized a summit to discuss their storage engines and the challenges they are facing across small and big data, as well as hardware.

Facebook’s storage is based on Tao and Memcached. Tao operates at a rate of billions of queries per second, and the Memcached caching layer has a critical impact on service availability.

The problems Facebook would like to address at both small data and big data layers are quite challenging. A couple of examples:

  1. how to deal with geographically distributed caches
  2. how to deal with huge amounts of log data, which is difficult to store in its entirety for analysis
  3. Facebook’s data warehouse must be partitioned globally, which has important implications for the types of queries that can be executed

Original title and link: NoSQL databases, Hadoop, Big Data: Pinned tabs Nov. 5th (NoSQL database©myNoSQL)