data-modeling

nytimes.com
Data-Crunching Program Guides Santa Cruz Police Before a Crime - NYTimes.com

In July, Santa Cruz began testing the prediction method for property crimes like car and home burglaries and car thefts. So far, said Zach Friend, the police department’s crime analyst, the program has helped officers pre-empt several crimes and has led to five arrests.

Based on models for predicting aftershocks from earthquakes, it generates projections about which areas and windows of time are at highest risk for future crimes by analyzing and detecting patterns in years of past crime data. The projections are recalibrated daily, as new crimes occur and updated data is fed into the program.

The notion of predictive policing is attracting increasing attention from law enforcement agencies around the country as departments struggle to fight crime at a time when budgets are being slashed.

Campfires May Have Triggered Emergence of Tuberculosis

Fire brought warmth and comfort to early humans but may also have triggered the emergence of deadly tuberculosis, Australian researchers suggest.

Smoke-damaged lungs, as well as the closeness of humans around a campfire, could have created the ideal conditions for tuberculosis to mutate from a harmless soil bacterium into our number one bacterial killer, according to the researchers’ data model.

The model, published in the Proceedings of the National Academy of Sciences, showed controlled use of fire would have increased the likelihood of tuberculosis emerging by several orders of magnitude.

Mathematical biologist Mark Tanaka of the University of New South Wales has had a long-standing interest in the evolution of disease-causing microorganisms such as tuberculosis, but a sudden insight led him to think about the role of fire in catapulting tuberculosis into the medical limelight. Read more.

Modeling a Simple Social App Using SQL and Redis

Felix Lin sent me a link to the slides he presented at the NoSQL Taiwan meetup. There are 105 of them!

The deck covers:

  • how to build a simple social site using SQL
  • the performance issues you run into with SQL
  • how to use Redis data structures to deliver the same features
  • how to solve SQL’s performance issues by using Redis (a minimal sketch of the idea follows this list)
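To give a flavor of that last substitution before you dive in (a minimal sketch of my own, not taken from the slides; the key names are illustrative):

import redis

r = redis.Redis()

def follow(follower_id, followee_id):
    # One Redis set per user; cheap set operations stand in for JOINs
    # against an SQL followers table.
    r.sadd("followers:%s" % followee_id, follower_id)
    r.sadd("following:%s" % follower_id, followee_id)

def common_followees(a, b):
    # SINTER replaces an SQL self-join: users that both a and b follow.
    return r.sinter("following:%s" % a, "following:%s" % b)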

Check them out after the break:

Keep reading

Walkthrough: MongoDB Data Modeling

Last week’s post about MongoDB Map/Reduce was pretty well received, so it seems like there is a need for some more discussion of the details involved in real-world MongoDB deployments. I thought we’d try and do a couple more posts and walk through some more details about how we’re using MongoDB at Fiesta.

Flexibility

One of the most touted features of MongoDB is its flexibility. I personally have emphasized flexibility in countless talks introducing MongoDB to technical audiences. Flexibility, however, is a double-edged sword; more flexibility means more choices to face when deciding how to model data (this reminds me of the Zen of Python: “There should be one - and preferably only one - obvious way to do it”). Nevertheless, I like the flexibility that MongoDB provides; it’s just important to review some best practices before settling on a data model.

The Problem

In this post we’ll take a look at how we’ve modeled mailing lists and the people that belong to them. Here are the requirements:

  • Each person can have one or more email addresses.
  • Each person can belong to any number of mailing lists.
  • Every person who belongs to a mailing list can choose what name they want to use for the list.

These requirements have obviously been simplified somewhat, but they are enough to express the core mechanics that power Fiesta.

0-Embed

Let’s examine how our data model looks if we never embed anything - we’ll call this a 0-embed strategy.

We have People, who have a name and password:

{
  _id: PERSON_ID,
  name: "Mike Dirolf"
  pw: "Some Hashed Password"
}

We have a separate collection of Addresses, where each address maintains a reference to a single Person:

{
  _id: ADDRESS_ID,
  person: PERSON_ID,
  address: "mike@corp.fiesta.cc"
}

We have Groups, each of which is basically just an ID (IRL there is some more group-specific metadata that would be in here as well, but we’re going to ignore it to focus on the relationships):

{
  _id: GROUP_ID
}

Lastly, we have Memberships, which associate a Person with a Group. Each Membership includes the list name that the Person is using for the Group, and a reference to the Address that they want to receive mail at for that Group:

{
  _id: MEMBERSHIP_ID,
  person: PERSON_ID,
  group: GROUP_ID,
  address: ADDRESS_ID,
  group_name: "family"
}

This data model is easy to design, simple to reason about, and easy to maintain. We are basically modeling the data as we would in an RDBMS, though; we aren’t leveraging MongoDB’s document-oriented approach. For example, let’s walk through how we would get the other member addresses of a group, given a single incoming address and group name (this is a very common query for Fiesta):

  1. Query the Addresses collection to get the ID of the relevant Person.
  2. Query the Memberships collection with the Person ID from step 1 and the group name to get the Group ID.
  3. Query the Memberships collection again to get all of the Memberships with the Group ID from step 2.
  4. Query the Addresses collection to get the Address to use for each of the Memberships from step 3.

Things get a bit complicated :).
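For concreteness, here’s roughly what those four steps look like in PyMongo (a sketch of my own; the collection and field names follow the documents above):

from pymongo import MongoClient

db = MongoClient().fiesta

def member_addresses(incoming_address, group_name):
    # 1. Addresses -> the Person behind the incoming address.
    address = db.addresses.find_one({"address": incoming_address})
    # 2. Memberships (person + group name) -> the Group ID.
    membership = db.memberships.find_one({"person": address["person"],
                                          "group_name": group_name})
    # 3. Memberships again -> every Membership in that Group.
    memberships = db.memberships.find({"group": membership["group"]})
    # 4. Addresses again -> the Address for each Membership.
    return [db.addresses.find_one({"_id": m["address"]})["address"]
            for m in memberships]

That’s four collections consulted (plus one extra query per member in step 4) to answer one routine question.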

Embed Everything

The strategy that a lot of newcomers use when modeling their data is what we’ll call the embed everything strategy. To use this strategy for Fiesta, we’d take all of a Group’s Memberships and embed them directly within the Group document. We’d also embed Addresses and Person metadata directly within each Membership:

{
  _id: GROUP_ID,
  memberships: [{
    address: "mike@corp.fiesta.cc",
    name: "Mike Dirolf",
    pw: "Some Hashed Password",
    person_addresses: ["mike@corp.fiesta.cc", "mike@dirolf.com", ...],
    group_name: "family"
  }, ...]
}

The theory behind the embed everything strategy is that by keeping all of the related data in one place we can make common queries a lot simpler. With this strategy, the query we performed above is trivial (remember, the query is “given an address and group name, what are the other member addresses of the group”):

  1. Query the Groups collection for a group containing a membership where the address is in person_addresses and the group_name matches.
  2. Iterate over the resulting document to get the other membership addresses.
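In PyMongo (reusing the db handle from the earlier sketch) this is a single query; $elemMatch makes both conditions apply to the same embedded Membership:

group = db.groups.find_one({"memberships": {"$elemMatch": {
    "person_addresses": incoming_address,  # matches a scalar against the array
    "group_name": group_name,
}}})
other_addresses = [m["address"] for m in group["memberships"]]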

That’s about as easy as it gets. But what if we wanted to change a Person’s name or password? We’d have to change it in every single embedded membership. Same goes for adding a new person_address or removing an existing one. This highlights the characteristics of the embed everything model: it can be great for doing a single specific query (because we’re basically pre-joining), but can be a nightmare for long-term maintainability. I’d highly recommend against this approach in general.
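To see that maintenance cost concretely, here’s a sketch of renaming one person under embed-everything. It has to touch every Group document that embeds them, and it leans on the filtered positional operator from modern MongoDB:

db.groups.update_many(
    {"memberships.person_addresses": "mike@corp.fiesta.cc"},
    {"$set": {"memberships.$[m].name": "Michael Dirolf"}},
    # Update every embedded membership belonging to this person.
    array_filters=[{"m.person_addresses": "mike@corp.fiesta.cc"}],
)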

Embed Trivial Cases

The approach we’ve taken at Fiesta, and the approach I most often recommend, is to start by thinking about the 0-embed model. Once you’ve got that model figured out, you can pick off easy cases where embedding just makes sense. A lot of the time these cases tend to be one-to-many relationships.

For example, our Addresses each belong to a single Person (and are also referenced by Memberships). Addresses are also not likely to change very often. Let’s embed them as an array to save some queries and keep our data model in sync with our mental model of a Person.

Memberships are each associated with a single Person and a single Group, so we could imagine embedding them in either the Person model or the Group model. In cases like this, it’s important to think about both data access patterns and the magnitude of relationships. We expect People to have at most 1000s of group Memberships, and Groups to have at most 1000s of Memberships as well, so the magnitude doesn’t tell us much. Our access pattern, however, does - when we display the Fiesta dashboard we need to have access to all of a Person’s Memberships. To make that query easy, let’s embed Memberships within the Person model. This also has the advantage of keeping a Person’s addresses all within the Person model (since they are referenced both at the top-level and within Memberships). If an address needs to be removed or changed, we can do it all in one place.

Here’s how things look now (this is the Person model - the only other model is Group, which is identical to the 0-embed case):

{
  _id: PERSON_ID,
  name: "Mike Dirolf",
  pw: "Some Hashed Password",
  addresses: ["mike@corp.fiesta.cc", "mike@dirolf.com", ...],
  memberships: [{
    address: "mike@corp.fiesta.cc",
    group_name: "family",
    group: GROUP_ID
  }, ...]
}

The query we’ve been discussing now looks like this:

  1. Query for a Person with the matching address and an embedded Membership with the right group_name.
  2. Use the Group ID in the embedded Membership from step 1 to query for other People with Memberships in that Group - get the addresses directly from their embedded Memberships.
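Sketched in PyMongo (same assumptions as the earlier sketches):

person = db.people.find_one({"addresses": incoming_address,
                             "memberships.group_name": group_name})
group_id = next(m["group"] for m in person["memberships"]
                if m["group_name"] == group_name)
other_addresses = [m["address"]
                   for p in db.people.find({"memberships.group": group_id})
                   for m in p["memberships"] if m["group"] == group_id]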

It’s still almost as simple as in the embed everything case, but our data model is a lot cleaner and easier to maintain. Hopefully this walkthrough has been helpful - if you have any questions let us know!

Mike

juhonkoti.net
Example how to model your data into nosql with cassandra

This may be a nifty idea if you want to create a community within your organization’s realm. Call it a privatized version of FB.

————————————————————————

“We have built a facebook style “messenger” into our web site which uses cassandra as storage backend. I’m describing the data schema to serve as a simple example how cassandra (and nosql in general) can be used in practice….”

————————————————————————

Read the rest here:  http://www.juhonkoti.net/2010/09/25/example-how-to-model-your-data-into-nosql-with-cassandra

sciencealert.com
Scientists just figured out continental plates can move up to 20 times faster than we thought
Just before they split.
By David Nield

Geophysicists have discovered something startling about tectonic plates: when under extreme stress, they hit the gas and can speed up by a factor of 20.

When they’re about to split, the plates can move about as fast as the human fingernail grows, and that’s very fast indeed as far as continental drift is concerned.

Scientists from the University of Sydney, Australia, and the University of Potsdam, Germany, used a combination of seismic data and computer modelling to map the varying speed of plate breakups, and you can view the results online.

The University of Sydney’s Dietmar Müller compared the process to pulling a piece of dough apart. “At first, separating it requires a lot of effort because the dough resists your pulling and stretches slowly between your hands,” he said.

“If you’re persistent, you’ll eventually reach a point where the dough becomes thin enough to separate quite easily and quickly. The same principle applies to rifting continents once the connection between them has been thinned sufficiently.”

Continue Reading.

MongoDB, Data Modeling, and Adoption

Micheal Shallop describes in this post how he “built and re-built” a geospatial table, replacing several tables in MySQL with MongoDB:

The mongo geospatial repository will be replacing several tables in the legacy mySQL system – as you may know, mongodb comes with full geospatial support so executing queries against a collection (table) built in this manner is shocking in terms of its response speeds — especially when you compare those speeds to the traditional mySQL algorithms for extracting geo-points based on distance ranges for lat/lon coordinates. The tl;dr for this paragraph is: no more hideous trigonometric mySQL queries!

But what actually caught my attention was this paragraph:

What I learned in this exercise was that the key to architecting a mongo collection requires you to re-think how data is stored.  Mongo stores data as a collection of documents.  The key to successful thinking, at least in terms of mongo storage, is denormalization of your data objects.

This made me realize that MongoDB adoption is benefiting hugely from the fact that its data model and querying are the closest to those of relational databases, requiring no radical mind shift from developers who have touched a database at least once. It is like knowing a programming language and learning a second one that follows almost the same paradigms.

The same cannot be said about key-value stores, multi-dimensional maps, MapReduce algorithms, or graph databases. Any of these would require one to dismiss pretty much everything learned in the relational model and completely remodel the world. It’s a tougher job, but when used right the reward pays off.

Original title and link: MongoDB, Data Modeling, and Adoption (NoSQL database©myNoSQL)

blog.8thlight.com
NO DB - the Center of Your Application Is Not the Database

Uncle Bob:

The center of your application is not the database. Nor is it one or more of the frameworks you may be using. The center of your application are the use cases of your application. […] If you get the database involved early, then it will warp your design. It’ll fight to gain control of the center, and once there it will hold onto the center like a scruffy terrier. You have to work hard to keep the database out of the center of your systems. You have to continuously say “No” to the temptation to get the database working early.

Original title and link: NO DB - the Center of Your Application Is Not the Database (NoSQL database©myNoSQL)

From Dice to Predictive Analytics

Gambling – the wagering of money or something of material value on an event with an uncertain outcome, with the primary intent of winning additional money or material goods – has been with us since ancient times. Homer wrote about the mythological hero Palamedes – who is credited with inventing dice – creating games of chance to entertain his troops during the Trojan War, and Greek mythology tells the story of Poseidon, Zeus and Hades dividing the world between themselves in a dice game: Poseidon won the sea, Zeus the heavens and Hades the underworld. The land, I presume, was left to the rest of us.

From these intriguing beginnings have arisen the opulent gambling temples of today: Las Vegas’s Bellagio, Palazzo and Wynn, Atlantic City’s Taj Mahal, Tropicana and Caesars, and Macau’s Venetian, Grand Lisboa and, ironically coming full circle, the clumsily named casino ‘Greek Mythology’ in Taipa.

Just like executives in every other industry, today’s casino executives are faced with an inherent problem: how do I differentiate myself from my competitors when the tools I have at my disposal are basically the same industry-wide? The answer lies in business intelligence, customer intelligence, data modeling and predictive analytics. Companies now have to be smarter than their competitors in attracting, understanding and marketing to their customers. Although marketing is best when it’s subtle and taps into a customer’s subconscious wants, desires and needs, the goal of marketing is anything but subtle: its sole objective is to generate revenue. Nothing else matters. ROI trumps all.

There is an old adage in marketing circles that says it is far more cost effective to keep a customer than to acquire a new one. On average, it takes five times as much time and money to find a new customer as it does to retain a current one. Business intelligence, customer intelligence, data modeling and predictive analytics aren’t just about analyzing data; they are about creating a customer relationship that can be analyzed, scrutinized and, most importantly, predicted, as ROI is all about future return. These tools help predict future patterns of behavior so that a business can create a 360-degree view of its customer. This analysis can include not just a customer’s basic demographics, but also such psychological traits as his or her wants, desires and needs, thereby making any marketing offer much more enticing and, therefore, much more difficult to resist. Today, BI, CI, data modeling and customer analytics can be used to:

  • Anticipate a customer’s needs. 
  • Enhance a customer’s experience, thereby maximizing profitability. 
  • Anticipate a customer’s ultimate profitability to a business.
  • Retain existing customers while also providing knowledge to acquire new ones.
  • Provide the right offer to the right customer at the right time and at the right price.
  • Take market share away from competitors.

In the casino industry today, analytics and data modeling can be used to project the likelihood of a customer’s response to an offer based upon his or her past history, which can help identify the optimum offer to send out to that specific customer. For example, if a customer has shown a history of dining in one of the casino’s finer restaurants, he or she may not respond well to – or actually be offended by – an offer for the buffet. Logistics must also be considered. Too often marketing departments send out ‘2-for-1’ buffet coupons to patrons who either have to fly or drive several hours to get to that property. In these cases, the buffet offer has to be accompanied by a free hotel night or the offer’s response rates will be exceedingly low.

One key thing that must be factored into these types of analytical data models is not simply to look at whether the customer redeems the offer, but also to look at how much additional revenue the offer generates. For example, if two different patrons receive the same offer and one customer redeems it and also hits the tables and gambles extensively while the other customer just takes the free food and doesn’t generate much additional revenue, the casino will obviously value the former customer over the latter simply because he or she is bringing more revenue into the property. Models that include gambling and overall property spend can help casinos target only their most profitable customers and not waste offers – and valuable hotel rooms – on customers who, comparably, generate less than their counterparts.

Models such as these aren’t unique to the gaming industry. The financial services, telecommunications, retail, insurance and healthcare industries can all benefit from using such data modeling and analytics techniques. Based upon company customer segments and offers selected for a marketing campaign, a company can even project the anticipated redemptions and potential revenue that will be generated from a campaign before it even begins. This gives a company’s marketing department enormous leverage as they will have the opportunity to review an anticipated campaign and make adjustments to meet the desired results of the campaign before it has even started. Following the close of the campaign, an accounting report of the campaign that compares projected responses to actual responses can be generated to discover the true ROI value of the campaign. This comparison feeds into the creation of the next campaign, and so on, and so on.
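As a toy illustration of that projection step (hypothetical segments and numbers, nothing more):

# Projected campaign value per segment: expected redemptions times the
# incremental revenue a redeeming customer is predicted to generate.
segments = [
    {"name": "table players", "size": 1200, "p_redeem": 0.18, "rev_per_redeem": 850.0},
    {"name": "buffet only", "size": 9500, "p_redeem": 0.31, "rev_per_redeem": 12.0},
]

for s in segments:
    redemptions = s["size"] * s["p_redeem"]
    print("%s: ~%d redemptions, ~$%.0f projected revenue"
          % (s["name"], redemptions, redemptions * s["rev_per_redeem"]))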

There is a touch of irony in the fact that casino operators all over the world now use analytics and probability theory to increase their ROI. It was, after all, Blaise Pascal, one of the founding fathers of probability theory, who initially presented a solution to the problem of the division of stakes – how to divide the take of a card game when it is interrupted because one of the players (usually a nobleman) has to leave the game to tend to some other pressing duties. The casino industry’s need for BI, CI, data modeling and customer analytics is substantial; it is a daunting task to fill thousands of hotel rooms and hundreds of gaming tables every week, on both a customer and employee level, but the lessons learned here are not unique to the gaming industry. Customer relationship management is important to almost every industry and the lessons learned in the gaming industry should be heeded by all.

Fiesta at the NY MongoDB User Group

Last night we had the chance to speak at the NY MongoDB User Group (great event - check it out!) about how we’re using MongoDB at Fiesta. A lot of the talk was focused on giving real-world examples of the concepts I used to discuss when giving “Intro to MongoDB” talks. The bulk of those examples were about how we approach data modeling. Here are the slides from the talk:

Big thanks to 10gen and Buddy Media for having us. We really enjoyed speaking and getting to listen to some other talks about how people are putting MongoDB to use. Also, we got to do some white-boarding after the talk:

(image via Francesca Krihely)

qz.com
Why coding is not the new literacy

Coding requires us to break our systems down into actions that the computer understands, which represents a fundamental disconnect in intent. Most programs are not trying to specify how things are distributed across cores or how objects should be laid out in memory. We are not trying to model how a computer does something. Instead, we are modeling human interaction, the weather, or spacecraft. From that angle, it’s like trying to paint using a welder’s torch. We are employing a set of tools designed to model how computers work, but we’re representing systems that are nothing like them.

Even in the case where we are talking specifically about how machines should behave, our tools aren’t really designed with the notion of modeling in mind. Our editors and debuggers, for example, make it difficult to pick out pieces at different depths of abstraction. Instead, we have to look at the system laid out in its entirety and try to make sense of where all the screws came from. Most mainstream languages also make exploratory creation difficult. Exploring a system as we’re building it gives us a greater intuition for both what we have and what we need. This is why languages that were designed with exploration in mind (LISP, Smalltalk, etc.) seem magical and have cult followings. But even these suffer from forcing us to model every material with a single tool. Despite having different tools for various physical materials, in programming we try to build nearly everything with just one: the general purpose programming language.

On the surface, it seems desirable to have “one tool to rule them all,” but the reality is that we end up trying to hammer metal with a chef’s knife. Excel, by contrast, constrains us to the single material that it was intentionally designed to work with. Through that constraint we gain a tool with a very intuitive and powerful interface for working with grids. The problem of course is that Excel is terrible for doing anything else, but that doesn’t mean we should try to generalize a chef’s knife into a hammer. Instead, we should use the right tools for the job and look for a glue that allows us to bring different materials together.

bradley-holt.com
CouchDB and DDD

Bradley Holt:

I’ve found CouchDB to be a great fit for domain-driven design (DDD). Specifically, CouchDB fits very well with the building block patterns and practices found within DDD. Two of these building blocks include Entities and Value Objects. Entities are objects defined by a thread of continuity and identity. A Value Object is an object that describes some characteristic or attribute but carries no concept of identity. Value objects should be treated as immutable.

Aggregates are groupings of associated Entities and Value Objects. Within an Aggregate, one member is designated as the Aggregate Root. External references are limited to only the Aggregate Root. Aggregates should follow transaction, distribution, and concurrency boundaries. Guess what else is defined by transaction, distribution, and concurrency boundaries? That’s right, JSON documents in CouchDB.

The way I read this is that the impedance mismatch between the object model and the document-based model is lower than what we’ve seen in the object-relational world.
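To make that concrete, here is a hypothetical Order aggregate stored as a single CouchDB document: the Order is the Aggregate Root, the embedded line items are Value Objects, and the document boundary doubles as the transaction, distribution, and concurrency boundary:

{
  "_id": "order-1042",
  "type": "order",
  "customer_id": "customer-7",
  "status": "open",
  "lines": [
    {"sku": "widget-9", "qty": 2, "unit_price": 4.50},
    {"sku": "gadget-3", "qty": 1, "unit_price": 19.95}
  ]
}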

Original title and link: CouchDB and DDD (NoSQL database©myNoSQL)

Data Modeling for Document Databases: An Auction and Bids System

Staying with data modeling, but moving to the world of document databases, Ayende has two great posts about modeling an auction system: part 1 and part 2. They are great not only because it’s not the Human-has-Bird-and-Cat-and-Dogs example, but also because he looks at different sets of requirements and offers different solutions.

That is one model for an Auction site, but another one would be a much stronger scenario, where you can’t just accept any Bid. It might be a system where you are charged per bid, so accepting a known invalid bid is not allowed (if you were outbid in the meantime). How would we build such a system? We can still use the previous design, and just defer the actual billing for a later stage, but let us assume that this is a strong constraint on the system.
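Ayende works this out in RavenDB; as a database-agnostic sketch of the core trick (hypothetical names, shown in PyMongo only because the technique boils down to an atomic conditional update), a bid is accepted only if it still beats the current high bid:

from pymongo import MongoClient

auctions = MongoClient().auction_site.auctions

def place_bid(auction_id, user_id, amount):
    # Assumes the auction document starts with current_bid.amount set to
    # the opening price. The filter matches only while our offer is still
    # the highest, so a concurrent higher bid makes the update match nothing.
    result = auctions.update_one(
        {"_id": auction_id, "current_bid.amount": {"$lt": amount}},
        {"$set": {"current_bid": {"user": user_id, "amount": amount}}},
    )
    if result.modified_count == 0:
        raise ValueError("outbid in the meantime; bid rejected, not billed")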

Original title and link: Data Modeling for Document Databases: An Auction and Bids System (NoSQL database©myNoSQL)

daniellang.net
6 Ways to Handle Relations in RavenDB and Document Databases

Daniel Lang presents 6 solutions for dealing with relations in RavenDB:

If you’re coming from the sql world, chances are you will be confused by the lack of relations in document databases. However, if you’re running RavenDB you’ve got plenty of options to address this trade-off. I personally cannot think of any situation where I’d wish back SQLServer because of this (there could be other reasons).

Two solutions that are not recommended:

  • go to the database twice
  • include one document inside the other

Two RavenDB specific solutions:

  • implement a read trigger to do server-side joins
  • implement a custom responder

Two recommended solutions:

  • use the .Include<T>() method
  • denormalize your references

A couple of comments:

  • the difference between “include one document inside the other” and “denormalize your references” is very subtle—the latter suggests including only the information needed by the presentation layer (see the sketch after this list).
  • I think one should consider both “include one document inside the other” and “denormalize your references” and choose between them depending on how likely the embedded documents are to be updated often vs how likely the presentation layer is to change often.
  • aside from RavenDB, all other document databases seem to offer only two options: “go to the database twice” and “denormalize your references”.
  • once Redis releases its version with embedded server-side Lua, that could be used as a form of stored procedure.
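A hypothetical sketch of that subtle difference, for a post document referencing its author. “Include one document inside the other” means the full author document travels with the post:

{
  id: "posts/1",
  title: "...",
  author: {
    id: "users/7",
    name: "Jane",
    email: "jane@example.com",
    pw: "Some Hashed Password"
  }
}

“Denormalize your references” keeps the reference plus only what the presentation layer needs:

{
  id: "posts/1",
  title: "...",
  author: {
    id: "users/7",
    name: "Jane"
  }
}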

Original title and link: 6 Ways to Handle Relations in RavenDB and Document Databases (NoSQL database©myNoSQL)