Matthew Hayes introduces a very interesting new framework from LinkedIn

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
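The payoff of both styles is that yesterday's result can be reused instead of reprocessing the whole window. A minimal Python sketch of the arithmetic (function names and figures are illustrative, not Hourglass's actual API):

```python
# Illustrative sketch of incremental sliding-window updates over daily
# partitions. `daily` maps a date string to that day's aggregate
# (e.g. visitor count); totals are carried over from the previous run.

def fixed_length_update(window_total, daily, new_day, expired_day):
    """30-day-style window: add the newest day, subtract the day that fell out."""
    return window_total + daily[new_day] - daily[expired_day]

def fixed_start_update(running_total, daily, new_day):
    """Since-launch-style window: only the newest day needs to be processed."""
    return running_total + daily[new_day]

daily = {"2013-10-01": 120, "2013-10-02": 95, "2013-10-31": 210}

# Reuse yesterday's 30-day total instead of re-summing all 30 partitions:
total = fixed_length_update(5000, daily, "2013-10-31", "2013-10-01")  # 5090
```

Either way, only one new partition is read per day, rather than the full window.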

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)

Fortune 500 companies are using Hadoop and big data technologies to transform their use of data for financial analysis, retail customer intelligence, IT operational insight, environmental and biomedical research, energy management, and even national security. But what results are they seeing? What have the early adopters of big data systems learned, and what business benefits have been realized from these investments?

Now that companies are starting to capture all of this data, translating it from the raw data source into information that makes sense to business users is no small task. This webinar will examine this issue and the techniques companies are using to generate meaningful business insights from big data.

We will discuss these topics and more:

  • Customer examples of big data analytics
  • What business benefits are customers seeing from big data implementations today?
  • What were the challenges and how were they overcome?
  • What are the next steps for the early adopters?
Infinite Data

Since people liked my last opinion piece on #big data, here’s another one.

Imagine there was a technology that allowed me to record the position of every atom in a small room, thereby generating some ridiculous amount of data (Avogadro’s number is 𝒪(10²³), so some prefix around that order of magnitude — e.g. yottabytes). And also imagine that there was a way for other scientists to decode and view all of that. (Maybe the latency and bandwidth can still be restricted even though neither capacity nor resolution nor fidelity nor coverage of the measurement are restricted — although that won’t be relevant to my thought experiment, it would seem “like today” where MapReduce is required.)

Let’s say I am running some behavioural economics experiment, because I like those. What fraction of the data am I going to make use of in building my model? I submit that the psychometric model might be exactly the same size as it is today. If I’m interested in decision theory then I’m going to be looking to verify/falsify some high-level hypothesis like “Expected utility” or “Hebbian learning”. The evidence for/against that idea is going to be so far above the atomic level, so far above the neuron level, I will basically still be looking at what I look at now:

  • Did the decisions they ended up making (measured by maybe 𝒪(100), maybe even 𝒪(1) numbers in a table) correspond to the theory?
  • For example if I draw out their assessment of the probability and some utility ranking then did I get them to violate that?

If I’ve recorded every atom in the room, then with some work I can get up to a coarser resolution and make myself an MRI. (Imagine working with tick-level stock data when you really are only interested in monthly price movements—but in 3-D.) (I guess I wrote myself into even more of a corner here, if we have atomic level data then it’s quantum, meaning you really have to do some work to get it to the fMRI scale!) But say I’ve gotten to fMRI level data, then what am I going to do with them? I don’t know how brains work. I could look up some theories of what lighting-up in different areas of the brain means (and what about 16-way dynamical correlations of messages passing between brain areas? I don’t think anatomy books have gotten there yet). So I would have all this fMRI data and basically not know what to do with it. I could start my next research project to look at numerically / mathematically obvious properties of this dataset, but that doesn’t seem like it would yield up a Master Answer of the Experiment because there’s no interplay between theories of the brain and trying different experiments to test it out — I’m just looking at “one single cross section” which is my one behavioural econ experiment. Might squeeze some juice but who knows.

Then let’s talk about people critiquing my research paper. I would post all the atomic-level data online of course, because that’s what Jesus would do. But would the people arguing against my paper be able to use that granular data effectively?

I don’t really think so. I think they would look at the very high level of 𝒪(100) or 𝒪(1) data that I mentioned before, where I would be looking.

  • They might argue about my interpretation of the numbers or statistical methods.
  • They might say that what I count as evidence doesn’t really count as evidence because my reasoning was bad.
  • They couldn’t argue that the experiment isn’t replicable because I imagined a perfect-fidelity machine here.
  • They could go one or two levels deeper and find that my experimental setup was imperfect—the administrator of the questions didn’t speak the questions in exactly the same tone of voice each time; her face was at a slightly different angle; she wore a different coloured shirt on the other day. But in my imaginary world with perfect instruments, those kinds of errors would be so easy to see everywhere that nobody would take such a criticism seriously. (And of course because I am the author of this fantasy, there actually aren’t significant implementation errors in the experiment.)

Now think about either the scientists 100 years after that or if we had such perfect-fidelity recordings of some famous historical experiment. Let’s say it’s Michelson & Morley. Then it would be interesting to just watch the video from all angles (full resolution still not necessary) and learn a bit about the characters we’ve talked so much about.

But even here I don’t think what you would do is run an exploratory algorithm on the atomic level and see what it finds — even if you had a bajillion processing power so it didn’t take so long. There’s just way too much to throw away. If you had a perfect-fidelity-10²⁵-zoom-full-capacity replica of something worth observing, that resolution and fidelity would be useful to make sure you have the one key thing worth observing, not because you want to look at everything and “do an algo” to find what’s going on. Imagine you have a videotape of a murder scene: the benefit is that you’ve recorded every angle and every second, and then you zoom in on the murder weapon or the grisly act being committed or the face of the person or the tiny piece of hair they left, and that one little sliver of the data space is what counts.

What would you do with infinite data? I submit that, for analysis, you’d throw most of the 10²⁵ bytes away.

There is a growing body of evidence, at least in text processing, that of … data, features, [and] algorithms … data probably matters the most. Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data. But why can’t we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive, they don’t scale well.

Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that most writers make correct choices (the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and Brill [14] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy, using this task as the specific example. They explored several classification algorithms (the exact ones aren’t important, as we shall see), and not surprisingly, found that more data led to better accuracy.

Across many different algorithms, the increase in accuracy was approximately linear in the log of the size of the training data. Furthermore, with increasing amounts of training data, the accuracy of different algorithms converged, such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don’t matter, all that matters is the amount of data you have. This resulted in an even more controversial recommendation, delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data (while waiting for computers to become faster so we can process the data).

In 2007, Brants et al. [25] described language models trained on up to two trillion words. Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique the authors affectionately referred to as “stupid backoff”. Not surprisingly, stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data.
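For intuition, stupid backoff scores a seen n-gram by its relative frequency and otherwise backs off to the shorter n-gram, discounted by a constant (the paper used 0.4). A toy Python sketch with made-up counts — the real system operates over distributed tables of trillions of n-grams, and S() is a score, not a normalized probability:

```python
# Toy "stupid backoff" scorer. `counts` is a fabricated n-gram count table;
# the empty tuple holds the total token count for the unigram base case.

ALPHA = 0.4  # constant backoff discount, per Brants et al.

counts = {
    ("the", "quick", "fox"): 1,
    ("the", "quick"): 3,
    ("quick", "fox"): 1,
    ("quick",): 5,
    ("fox",): 2,
    (): 100,  # total number of tokens in the corpus
}

def stupid_backoff(ngram):
    """Score an n-gram: relative frequency if seen, else back off."""
    if len(ngram) == 0:
        return 0.0
    context = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    if len(ngram) == 1:
        return counts.get(ngram, 0) / counts[()]  # unigram frequency
    return ALPHA * stupid_backoff(ngram[1:])  # drop the oldest context word

seen = stupid_backoff(("the", "quick", "fox"))    # 1/3: trigram was observed
unseen = stupid_backoff(("a", "quick", "fox"))    # 0.4 * (1/5): backed off
```

No smoothing arithmetic, no normalization — which is exactly why it scales so easily compared to Kneser-Ney.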

Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [65]. Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy.

[14] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France, 2001.

[25] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867, Prague, Czech Republic, 2007.

[65] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.


Jimmy Lin & Chris Dyer, Data-Intensive Text Processing with MapReduce (2010)

This is the best summary of the “pro big-data” argument I’ve found. I still haven’t made up my mind on this topic. But my prior is closer to @johndcook’s comment #4 on

The Three Pillars to improve Apple Pay: Data Science, Cloud Computing and Security


On Sept 9th, Apple announced the new models of its iPhone in two sizes: iPhone 6 (4.7″) and iPhone 6 Plus (5.5″); the Apple Watch, its new feature-packed smart watch, in two sizes and three versions (Apple Watch, Apple Watch Sport and Apple Watch Edition); and the most critical announcement (at least from my perspective) was Apple Pay, a totally new mobile payments platform using…


Python + Hadoop: Real Python in Pig trunk

Jeremy Karn


For a long time, data scientists and engineers had to choose between leveraging the power of Hadoop and using Python’s amazing data science libraries (like NLTK, NumPy, and SciPy). It’s a painful decision, and one we thought should be eliminated.

So about a year ago, we solved this problem by extending Pig to work with CPython, allowing our users to take advantage of Hadoop with real Python (see our presentation here). To say Mortar users have loved that combination would be an understatement.

However, only Mortar users could use Pig and real Python together…until now.


A tempest is brewing with Apache Storm - what’s going on there?

Lost faith in Apache Foundation Governance - a tempest is brewing with Apache Storm

I’ve lost a bit of faith in the Apache Foundation of late. I understand that a project that is in incubation is still getting its house in order, but how long should that take? How much confidence should one put in interim releases, and when should you just throw your hands up in the air and say enough is enough, time to find another tool?

My Apache Storm cluster blew up for no apparent reason. I…


Types of data that land in Hadoop

Jim Walker in The Business Value of Hadoop as seen through the Big Data:

While every organization is different their big data is often very similar. For the most part, Hadoop is collecting massive amounts of data across six basic types of data: social media activity, clickstream data, server logs, unstructured (videos, docs, etc) and machine/sensor data from equipment in the field.

From these categories, I think only machine/sensor data can be considered critical data. Actually, if you think about it, server logs, clickstreams, and even social media activity are also sensor data, originating in servers and humans, respectively.

The future of data processing is platforms that are able to bring together all critical data regardless of its main storage location. Some call this federated databases. Some call this logical data warehouses. The specific term doesn’t matter though. It’s the core principles that will make the difference: integration and integrated processing in close to real time.

Original title and link: Types of data that land in Hadoop (NoSQL database©myNoSQL)

Naive Bayes classification for large data sets

I put together a Naive Bayes classifier in scalding. It’s modeled after the scikit-learn approach, and I’ve tried to make the API look similar. The advantage of using scalding is that it is designed to run over enormous data sets in Hadoop. I’ve run some binary classification jobs on 100GB+ of data and it works quite well.

Here’s an example usage on the famous iris data set.

The three classes are the species of iris, and the four features/attributes are the length and width of the flower’s sepal and petal. The train method (line 21) returns a Pipe containing all of the information required for classification.

The scikit-learn documentation contains a good explanation of Bayes’ theorem, which boils down to the following equation:

\[ \hat{y} = \underset{y} {\mathrm{argmax}} ~P(y|x_i) = \underset{y} {\mathrm{argmax}} ~P(y) \prod_{i=1}^{n}P(x_i | y) \]

Where \( \hat{y} \) is the predicted class, \( y \) is the training class, and \( x_i \) are the features (sepalWidth, sepalLength, petalWidth, petalLength). Most Naive Bayes examples that I’ve seen are dealing with word counts, therefore \( P(x_i | y) = \frac{\text{number of times word } x_i \text{ appears in class } y}{\text{total number of words in class } y} \). That’s how the MultinomialNB class works, but in the iris data set, we’re dealing with continuous, normally distributed measurements and not counts of objects in a multinomial distribution.

Therefore, we want to use GaussianNB which uses the following equation in the classification:

\[ P(x_i | y) = \frac{1}{\sqrt{2\pi {\sigma_y}^2}} \exp\left(-\frac{(x_i-\mu_y)^2}{2{\sigma_y}^2}\right) \]

That means that our model pipe must contain the following information in order to calculate \(P(y|x_i)\)

  • classId, \(y\) - The type of flower.
  • feature, \(i\) - The name of the feature (sepalWidth, sepalLength, petalWidth, or petalLength).
  • classPrior, \(P(y)\) - Prior probability of an iris belonging to the class.
  • mu, \(\mu_y\) - The mean of the given feature within the class.
  • sigma, \(\sigma_y\) - The standard deviation of the given feature within the class.

The model Pipe is then crossed with the test set and we calculate the likelihood that the point belongs in each class, \(P(y|x_i)\). Once we have the probability of a data point belonging to each class, we simply group by the point’s ID field and keep only the class with the maximum likelihood.
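To make the pipeline concrete, here is a single-machine Python sketch of the same Gaussian Naive Bayes statistics; the scalding job computes these per-class priors, means, and standard deviations over Hadoop, and the names here (train, predict, row shapes) are illustrative, not the library's API:

```python
import math
from collections import defaultdict

# Each training row is (class_id, {feature_name: value}),
# e.g. ("setosa", {"sepalWidth": 3.5, "petalLength": 1.4, ...}).

def train(rows):
    """Compute (classPrior, {feature: (mu, sigma)}) per class."""
    by_class = defaultdict(list)
    for class_id, features in rows:
        by_class[class_id].append(features)
    model = {}
    for class_id, members in by_class.items():
        prior = len(members) / len(rows)                  # P(y)
        stats = {}
        for name in members[0]:
            vals = [m[name] for m in members]
            mu = sum(vals) / len(vals)                    # per-class mean
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            stats[name] = (mu, math.sqrt(max(var, 1e-9)))  # avoid sigma == 0
        model[class_id] = (prior, stats)
    return model

def log_gaussian(x, mu, sigma):
    """Log of the Gaussian likelihood P(x_i | y)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(model, features):
    """argmax over classes of log P(y) + sum_i log P(x_i | y)."""
    def score(class_id):
        prior, stats = model[class_id]
        return math.log(prior) + sum(
            log_gaussian(x, *stats[name]) for name, x in features.items())
    return max(model, key=score)
```

Working in log space avoids the underflow you would hit multiplying many small likelihoods, which matters even more at 100GB+ scale.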

The results are shown below (plotting only two of the four features). The x's are the training points used, and o's are successfully classified points.

For now, your best bet for using this is to just copy the code off of github. If I get some time, I’d love to port this over to scalding’s typed API, combine it with some other machine learning functions (such as the K nearest neighbor classifier I wrote) to provide a nice little library of tools for scaling machine learning algorithms. If you’d like to be involved, get in touch.

Hadoop Streaming Support for MongoDB

MongoDB has some native data processing tools, such as the built-in JavaScript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistence and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.


Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.

Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers who are familiar with MongoDB and popular dynamic programming languages to leverage the power of Hadoop.

It works like this:

Once a developer has Java installed and Hadoop ready to rock, they download and build the adapter. With the adapter built, they compile the streaming assembly, load some data into Mongo, and get down to writing some MapReduce jobs.

The assembly streams data from MongoDB into Hadoop and back out again, running it through the mappers and reducers defined in a language you feel at home with. Cool right?

Ruby support was recently added and is particularly easy to get started with. Let’s take a look at an example where we analyze Twitter data.

Import some data into MongoDB from Twitter:

This script curls the Twitter status stream and pipes the JSON into MongoDB using mongoimport. The mongoimport binary takes a couple of flags: “-d”, which specifies the database (“twitter”), and “-c”, which specifies the collection (“in”).

Next, write a Mapper and save it in a file called mapper.rb:

The mapper calls the MongoHadoop.map function and passes it a block. This block takes an argument “document” and emits a hash containing the user’s timezone and a count of 1.

Now, write a Reducer and save it in a file called reducer.rb:

The reducer calls the MongoHadoop.reduce function and passes it a block. This block takes two parameters, a key and an array of values for that key, reduces the values into a single aggregate and emits a hash with the same key and the newly reduced value.
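The same timezone-count logic can be written as a plain Hadoop Streaming mapper and reducer in Python — an illustrative stdin/stdout sketch, not the Mongo-Hadoop Ruby API described above; in a real job each function would be wrapped in a small script reading sys.stdin and printing its output:

```python
import json
from itertools import groupby

def mapper(lines):
    """Each input line is one tweet as JSON; emit 'timezone<TAB>1'."""
    for line in lines:
        doc = json.loads(line)
        tz = (doc.get("user") or {}).get("time_zone") or "unknown"
        yield "%s\t%d" % (tz, 1)

def reducer(lines):
    """Streaming delivers lines sorted by key; sum the counts per key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))
```

The tab-separated key/value contract is what lets Hadoop shuffle and sort between the two stages, exactly as it does for the Ruby version.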

To run it all, create a shell script that executes hadoop with the streaming assembly jar and tells it how to find the mapper and reducer files as well as where to retrieve and store the data:

Make them all executable by running chmod +x on all the scripts, then run the shell script to have Hadoop process the job.

7 books to supercharge your data education

Scott Haylon

Working with data is HARD.  Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.

Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.

Here are 7 books we recommend for picking up some new skills in 2013:
