Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler
Eric Baldeschwieler's keynote from Hadoop Summit has been published on YouTube. It's mainly about the goals and effort behind Hadoop 2.0 and the new tools in the Hadoop ecosystem meant to simplify different aspects of a Hadoop deployment (HCatalog, Ambari, Tez, Stinger Initiative).
✚ Datanami has published a summary of the keynote here
Original title and link: Hadoop Now, Next and Beyond - Keynote by Eric Baldeschwieler (NoSQL database©myNoSQL)
Yahoo creates a software company, Hortonworks
Internet company Yahoo has announced plans to spin off the division responsible for developing the open-source Apache Hadoop software into … Read more…
Exciting month in the Hadoop world
January has so far been an exciting month in the Hadoop world, with three heavyweights making announcements around different offerings of the MapReduce platform commonly used in data analytics.
First up was Netflix, introducing us to Genie. Genie provides an additional abstraction on top of Amazon's EMR through a RESTful API and also allows the management of job flows on different clusters. What is not clear is whether there are any plans to make Genie available as open source or otherwise. Will we see it on GitHub soon?
In typical Netflix fashion, the blog post also provides technical insights into the Netflix kitchen, this time about their data warehouse architecture. The biggest surprise to me was the reliance on Amazon S3, which I know from experience can stifle Hadoop performance. Their main reason seems to be that with S3 it is possible to spin up multiple clusters simultaneously without needing to manually replicate data stores or worry about inadvertent data loss.
Next is Hortonworks Sandbox, a virtual machine packed with a couple of things that a Hadoop novice would find handy: Hortonworks’ own flavor of Hadoop preinstalled, as well as a step-by-step tutorial to get started with Hadoop. I wonder if the VM is useful beyond its standalone environment, e.g. to easily manage an external Hadoop cluster. There is also an associated webinar, and a promise to provide more tutorials and exercises in the coming months.
Last but not least is Joyent, who announced an enterprise Hadoop-as-a-Service offering that is supposed to deliver much higher performance than others in the market. They claim it is "3x faster or at 1/3 the cost of other cloud offerings". Something's gotta give; not sure what yet.
Hadoop might indeed be old hat to some, but no one can deny that it still is an important programming framework that is widely used for crunching data.
Hadoop World 2011 Recap
I'm going to give a brief description of Hadoop World and what I learned from it.
Disclaimer: the opinions expressed are my own, as is my understanding of Hadoop. Since most of this is from memory and notes, there may be some areas where the numbers are inaccurate; I will gladly update them.
Tuesday Nov 8th
Hugh Williams, VP of Experience and Search, eBay
Hugh gave a pretty insightful talk about how eBay is using Hadoop/HBase to rewrite their current search engine. eBay developed their search engine back in 2001 and hasn't updated it since.
eBay's current search engine, codenamed Voyager, is so antiquated that it only searches on titles. It also can't do proper recommendations given its current structure.
Williams said that eBay currently has more than 97 million active buyers and sellers and over 200 million items across 50,000 categories for sale. The auction site handles close to 2 billion page views, 250 million search queries and tens of billions of database calls each day, he said.
Hadoop and HBase allow eBay to build a far more sophisticated search engine than Voyager, codenamed Cassini. Cassini will deliver more accurate and more context-based results to user search queries, he said.
With more than 100 engineers assigned to Project Cassini full time, the development effort is one of the largest ever at eBay.
The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount of data is growing quickly, he said.
Cassini is expected to go live next year.
Larry Feinsmith, Managing Director, Office of the CIO, JPMC
To me this was one of the more surprising and interesting talks. My impression of financial companies was that they were always the last to embrace new technology, so I was surprised that Chase was diving headlong into Hadoop. Chase was Cloudera's first enterprise customer almost 3 years ago, which makes them one of the pioneers of Hadoop adoption.
He told a keynote audience that the financial services firm has been using the open source storage and data analysis framework for close to three years now and is currently leveraging the technology for fraud detection, IT risk management, self-service and other applications.
Chase still relies heavily on core relational database technologies for transaction processing, but uses Hadoop-based products for a growing number of tasks, Feinsmith said. Five out of seven Chase business units use Hadoop in some way, he added.
Hadoop’s ability to store vast volumes of unstructured data has allowed Chase to collect and store weblogs, transaction data and social media data, Feinsmith said. The company is aggregating the data into a common platform, and runs a range of customer-focused data mining and data analytics applications to utilize it, he said. With over 150 petabytes of online storage, 30,000 databases and 3.5 billion logins to Chase user accounts, data is the lifeblood of the company, Feinsmith said.
For the moment at least, relational database technologies appear to be more suited for running transaction applications, he said. The big debate among technologists at the bank right now is whether incumbent relational database technologies will evolve to meet the bank’s emerging big data needs, or Hadoop-based technology can become adept at transaction processing, Feinsmith said.
BTW, both eBay and JPMC were looking to hire Hadoop engineers, with JPMC saying they would beat eBay salaries by 10%. :)
There were several tracks and I couldn't attend all of them, though I would have loved to, so I focused mainly on the developer and HBase tracks.
Building real-time services at Facebook with Hadoop and HBase by Jonathan Gray, Software Engineer, Facebook
Jonathan joined Facebook in early 2010 from his startup Streamy.com. Facebook was built on PHP/MySQL. As Facebook grew, they were pushing the limits of MySQL; in fact, they rewrote MySQL to push those limits further, an implementation they call "Sharded MySQL". Even then they were having scaling issues. With the new Facebook messaging system, they realized they couldn't continue to rely on MySQL, so they compared the NoSQL options out there. The three databases they tested were Cassandra (developed at Facebook), Sharded MySQL (Facebook's implementation of MySQL) and HBase. The reason they chose HBase was Hadoop: Facebook had already spent a considerable amount of effort building out their Hadoop clusters, and from a maintenance perspective, Hadoop's self-healing (the ability to handle server failures automatically) made HBase a viable candidate. HBase still wasn't proven to be stable, though, and its initial versions broke quite often. After a lot of effort, Facebook engineers came up with a stable version, which they have also pushed back into the community edition of HBase.
Some numbers from Facebook.
6B+ messages per day, which translates to 75B+ HBase calls per day (up from 20B+ HBase calls when they launched)
15M ops/sec at peak
Facebook Scribe (Facebook's own log-collection system, comparable to Flume) + Hadoop.
Using just this implementation, their queries took between 8 and 24 hours to run. This rings true, because we have seen similar performance on our test-bed Hadoop cluster.
Once they added HBase to the implementation, query run times dropped to 2-32 seconds, with an average of 5 seconds. This is amazing. Think about it: given the amount of unstructured data Facebook has, queries run in an average of just 5 seconds.
Hadoop Troubleshooting 101 by Kate Ting, Customer Operations Engineer, Cloudera
This was an interesting presentation, and one I wanted to attend because of Cloudera's extensive customer base and the issues they see. The default configuration shipped with CDH (Cloudera's Distribution for Hadoop) still needs to be tuned: most of the defaults assume a very basic architecture and need to be changed based on your system architecture.
Most of the errors that Cloudera sees are mainly due to misconfiguration.
MapReduce heap oversubscription: this occurs when you don't have enough heap memory to allocate to all the processes. If this happens, HBase will start swapping and will eventually fail. In the Hadoop world, swapping is a bad thing and you want to avoid it as much as possible.
Increase heap size
Memory errors: some simple rules of thumb to prevent them: for every 4 mappers, configure 3 reducers; for every 1 GB of MapReduce data, allocate 512 MB of heap; and total MapReduce memory must stay below total RAM minus 3 GB (see the sketch after this list).
JobTracker out of memory: run the following command to see the current JobTracker heap usage: sudo -u mapred jmap -J-d64 -histo:live <PID of JobTracker>.
Increase JT heap space.
Split the JobTracker and NameNode onto different boxes.
Decrease the number of jobs the JobTracker has to keep track of.
Too many fetch failures: causes can include DNS issues, not enough threads, or JVM issues.
Set the reducers' parallel copies (mapred.reduce.parallel.copies) to SQRT(node count), with a floor of 10.
No such file or directory:
Change permissions to 755
Change owner to mapred
Too many open files:
Increase the open-file limit (ulimit -n) to 32K+
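To make the heap rules above concrete, here is a minimal sketch, assuming a hypothetical 24 GB worker node and assuming you set these values programmatically (in practice they normally live in mapred-site.xml on each TaskTracker):

    import org.apache.hadoop.conf.Configuration;

    public class HeapSizingSketch {
        public static void main(String[] args) {
            // Hypothetical worker: 24 GB RAM, keep 3 GB back for the OS and daemons.
            long totalRamMb = 24 * 1024;
            long mapReduceBudgetMb = totalRamMb - 3 * 1024;   // total MapReduce memory < RAM - 3 GB

            int mappers = 8;                                   // 4:3 mapper-to-reducer rule of thumb
            int reducers = 6;
            long heapPerTaskMb = mapReduceBudgetMb / (mappers + reducers);

            Configuration conf = new Configuration();
            conf.setInt("mapred.tasktracker.map.tasks.maximum", mappers);
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", reducers);
            conf.set("mapred.child.java.opts", "-Xmx" + heapPerTaskMb + "m");
            System.out.println("Per-task heap: " + heapPerTaskMb + " MB");
        }
    }

With these made-up numbers, each of the 14 task slots gets roughly 1.5 GB of heap, keeping the total comfortably under the RAM-minus-3GB budget.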
HDFS Name Node High Availability by Aaron Myers, Cloudera and Suresh Srinivas, Hortonworks
This was a very interesting topic. HA has been a big issue for Hadoop, and many people consider it the downside of Hadoop; a lot of enterprises have been asking for stability here. This presentation focused on the next generation of HA. Currently there is no concept of a backup for the NameNode: if the NameNode were to go down, it would take time to recover, which is why Cloudera recommends that the NameNode be a beefy box.
Some brief numbers from Hortonworks:
In 2009, Yahoo lost 19 blocks out of 329 million. That is seven 9's of reliability, which is pretty good.
They had 22 failures across 25 clusters, which translates to 0.56 failures per year per cluster. Of those 22 failures, only 8 would have benefited from HA, which works out to 0.23 failures per year per cluster. So even though this seems like an issue, it really isn't.
But they still want to address the issue.
First, the changes that will happen to the NameNode:
They are adding the concepts of an Active NN and a Standby NN. As the names suggest, the active NameNode is the one currently serving, and the standby is ready to take over in case of failure. The standby NameNode can be a cold, warm or hot standby. The goal is for failover to happen within a minute. Different configurations are possible:
Single NameNode – no failover
Active & Standby (cold/warm/hot) – manual failover
Active & Standby (hot) – automated failover
The HA daemon that monitors the NameNode would run outside the NameNode itself.
The current estimate is that HA will land in release 0.23.1.
The JIRA ticket that covers this work is primarily HDFS-1623; associated tickets are 1971, 1972, 1973, 1974, 1975, 2005, 2064 and 1073. You can go to the Apache Hadoop developer site to get the URL.
Another feature being added is BookKeeper for transaction logging.
Integrating Hadoop with Enterprise RDBMS using SQOOP by Guy Harrison, Quest Software and Arvind Prabhakar, Cloudera
Another interesting talk was on the state of Sqoop. Sqoop is a database connector, much like the JDBC/ODBC connectors out there, except that it works between Hadoop and the database.
Quest Software are the folks who created the OraOop connector that lets Hadoop connect to Oracle. This came about because Oracle was initially reluctant to join the Hadoop bandwagon; things have since changed and Oracle now has its own connector, though it will only be released next year. (From my discussion with one of the Oracle reps, it would only be made available to people who purchase the Oracle Hadoop hardware. I could be wrong, but that was the impression I got from the rep.)
Anyway, back to Sqoop. Currently most database connectors are optimized for a particular database because of the nuances every DB has. One of the problems before OraOop came out was the way parallelism worked in Oracle: when a multi-threaded Hadoop connection came in, Oracle's own parallelism would kick in and try to respond, but it would set up a parallel plan only for the first Hadoop connection and then assign single threads to the other Hadoop connections. This was very inefficient. Oracle's fix was to make all secondary connections wait until the first thread had completed its request, which again was not acceptable. Quest Software got rid of this issue by making Oracle agnostic to the incoming calls and serving the Hadoop requests as if they were one. Sorry, I don't have any images to go with this, but Guy explained it pretty well.
Currently a newer version of Sqoop is being developed, called Sqoop 2.0. The ticket for this is SQOOP-365; you can get more info on the Apache Sqoop website.
Quest Software also has an amazing tool called Toad for Cloud Databases, which is basically a UI for Hadoop. It also has a plugin for Eclipse, which I thought was awesome. And it's completely free.
Next generation Mapreduce by Mahadev Konar, Hortonworks
This topic is particularly important because it addresses changes coming to the Hadoop core. In fact, all of these changes are part of 0.23.0, which is in beta right now. Cloudera plans to release CDH4 based on this, which should be ready early next year.
Current drawbacks of the existing MapReduce framework:
You need hacks to run MPI or graph software like Giraph
Max cluster size: 4,000 nodes
Max concurrent tasks: 40,000
Because of coarse-grained synchronization, the JobTracker was a SPOF (single point of failure)
Requirements for the next version of MapReduce:
100k concurrent tasks
10k concurrent jobs
A couple of new concepts have been added (see the sketch below):
Application – a job submitted to the framework
Container – the basic unit of allocation (this replaces the fixed map/reduce slots)
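As a rough, hedged illustration of the container model (not official documentation: the property names below are the MRv2 settings as I understand them, and the sizes are made up), a job no longer occupies fixed map/reduce slots but instead asks the ResourceManager for containers of a given size:

    import org.apache.hadoop.conf.Configuration;

    public class ContainerSizingSketch {
        public static void main(String[] args) {
            // Instead of occupying fixed map/reduce slots, each task asks the
            // ResourceManager for a container of a given memory size.
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.map.memory.mb", 1536);      // container requested per map task
            conf.setInt("mapreduce.reduce.memory.mb", 3072);   // container requested per reduce task
            conf.set("mapreduce.map.java.opts", "-Xmx1280m");  // the JVM heap must fit inside the container
            conf.set("mapreduce.reduce.java.opts", "-Xmx2560m");
            // This configuration would then be handed to a normal Job submission.
        }
    }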
Major changes that are happening:
The JobTracker is now split into two parts: the ResourceManager and the ApplicationMaster.
The TaskTracker now becomes the NodeManager.
The ResourceManager is a global resource scheduler that combines the roles of applications manager, scheduler and resource tracker.
The ApplicationMaster handles application scheduling and task execution.
The NodeManager manages the life cycle of containers.
By splitting the JobTracker in two, Hadoop is able to scale well beyond the current limitations.
Another important feature this version brings is the ability for clients to upgrade their Hadoop irrespective of the cluster version.
Speculative execution (the ability for Hadoop to start a duplicate task if one of the TaskTrackers is slow) still exists but is now handled by the ApplicationMaster.
The ticket for this is MAPREDUCE-279.
You can also follow blog posts for this at
This was the end of Day 1.
Wednesday, Nov 9th
James Markarian, Executive Vice President and CTO, Informatica
This presentation was pretty much about the future of Hadoop from a company's perspective. James's presentation asked what Hadoop will look like in 30 years: would Hadoop be the Beatles or the Justin Bieber of big data? Would it live up to its promise, or just be another hyped technology that dies out in a couple of years? James truly believes that Hadoop is the one; he said Hadoop is the most important technology to come out in the last 30 years, and that it will survive and become bigger than it is, but he also said the time has come to stop giving Hadoop a free pass. Security is paramount in business, and Hadoop has to do a better job of it. He mentioned a very interesting story about one of his clients, a chain of big hotels. At one hotel where they had recently added a new wing, a lot of customers were complaining of noise; this information was gleaned from websites like TripAdvisor. Based on it, the chain decided to gut the old wing and build a new one, spending millions of dollars in the process. At the same time, they also decided to analyze all the feedback they had received using Hadoop, and what they learned was surprising. They were able to pinpoint the actual rooms the customers were staying in when they complained about the noise, and they learned the noise was coming from the new wing. They also learned that they just needed to put a rubber pad under the doors to solve the noise problem, and that they only had to renovate the old wing, saving millions in the process.
Doug Cutting, Architect, Cloudera and Founder of Hadoop
Doug basically gave a recap of the current state of Hadoop and where the technology is heading. One of the new additions to Hadoop is Bigtop, a test framework built for Hadoop; it is primarily used to test the components of Hadoop and make sure they gel together well. Another project that Doug is currently working on is Avro, a serialization tool for Hadoop. It has been integrated very well into Hadoop, and Doug is advocating that everyone use Avro for their serialization needs.
Hadoop and Performance, Todd Lipcon, Cloudera and Lingpen Wei, University of California, Berkeley
The main takeaway from this presentation was that Hadoop doesn't have a very good performance-testing tool. Most of the test cases came from working with Facebook and some of Cloudera's customers to come up with performance metrics. Some of the metrics currently measured are overhead (time spent doing work other than the actual task), CPU overhead, lock contention, scheduler/framework overhead and disk utilization.
Some of the myths about Hadoop:
Myth: Java is slow. In reality, most of the current slowness is I/O bound, not CPU bound.
Myth: Java does not allow you to interact with low-level system calls. In reality, you can through JNI (the Java Native Interface), dropping down to the system level to write faster code.
Myth: Hadoop I/O has too many layers, making it unwieldy. In reality, HDFS is built on top of existing filesystems like ext4 and XFS, so it is only as unwieldy as the base filesystem.
Some MapReduce performance gains come from I/O caching tricks (illustrated in the sketch after this list):
Linux provides useful syscalls here: posix_fadvise (with POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED) and sync_file_range
Explicit read-ahead
Dropping unnecessary data from the page cache
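To illustrate the JNI point above, here is a minimal, hypothetical sketch (the class and library names are invented for illustration and are not Hadoop's actual native code) of how posix_fadvise could be exposed to Java:

    public class NativePosixAdvice {
        // Advice constants mirroring the Linux values.
        public static final int POSIX_FADV_WILLNEED = 3;
        public static final int POSIX_FADV_DONTNEED = 4;

        static {
            // Hypothetical JNI library whose C side simply calls posix_fadvise(fd, offset, len, advice).
            System.loadLibrary("nativeposix");
        }

        // Declared native: the implementation lives in the loaded C library.
        public static native int posixFadvise(int fd, long offset, long len, int advice);
    }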
Hadoop performance is a community effort and primarily driven by the direction the community is taking.
HBase Roadmap by Jonathan Gray, Facebook
Jonathan gave a brief history of HBase which I would like to share. HBase was written by the folks at Powerset, who were acquired by Microsoft, which then stopped using it. They initially developed HBase to help with their offline logs. In 2009, three companies started working towards supporting and developing HBase: Streamy, StumbleUpon and Trend Micro. Initial versions of HBase started with 0.10, moved to 0.18 and then 0.20 (to keep it on the same cycle as Hadoop), and then jumped to 0.90 (to break away from Hadoop's release cycle). 0.90 was also the first release to achieve zero data loss. One thing I forgot to mention yesterday is ZooKeeper. ZooKeeper is very important not only in HBase but also in Hadoop (it's not used as much there today, but in the newer version of MapReduce it will come to the forefront). ZooKeeper is a quorum-based technology in which decisions are made by a majority of machines, which is why you always need an odd number of machines for ZooKeeper.
0.92 is set to release any week now and will be more stable than the current release. One of the big features is the concept of coprocessors (think triggers and stored procedures), developed by Trend Micro. It will also support multi-master cluster replication, pre/post hooks for all client calls and server operations, dynamically added RPC calls, and ACL security built on top of coprocessors. This release also introduces HFile v2, which can handle much larger files. The previous HFile format could not, because it loaded whole Bloom filters into memory; they fixed this with inline Bloom filters, making HFile v2 something like a multi-level B-tree. Other things that will be noticeable in 0.92 are performance improvements (more seeking and early-out hints, distributed log processing, cache-on-write, evict-on-close), compaction improvements (multi-threaded compactions, vastly improved file selection), operational improvements (a slow-query log for tracing, online schema modification, i.e. rolling schema changes you can make while the cluster is online), and usability and API improvements.
0.94 is the next stable version and is supposed to bring even more changes, like fast backups with point-in-time recovery, multi-slave replication, Thrift 2.0, and TTL + min-versions. 0.94 is currently in development.
Advanced HBase Design by Lars George, Cloudera, author of HBase: The Definitive Guide
This presentation was about designing HBase for real-world problems based on real use cases.
HBase configuration is done at many levels on a server.
HBase automatically shards for you. The way to visualize an HBase table is to think of a DB table (the logical layer) broken into several regions (the physical layer). On top of these regions you have the region server, which maintains the regions for reading and writing.
The unit of scalability in HBase is the region. In HBase, all rows are stored in a sorted, contiguous range, and the data is spread across region servers. Capacity is a function of cluster size. HBase is a columnar DB and hence supports column families. HBase uses Bloom filters; they are generated when an HFile is persisted, stored at the end of each file, and loaded into memory. They currently allow you to filter on row or row + column. The drawbacks are that they are heavy on memory and can yield around 1% false positives.
The full key in HBase is Row Key + Column Family + Column Qualifier + Timestamp. The best selectivity is obtained through the row key. Also remember that the logical layout will not match the physical layout, i.e. the way you design the table is not the way it will be stored on disk.
One of the design considerations in HBase is whether to create tall-narrow tables or flat-wide tables.
Both have their drawbacks. With tall-narrow tables, you lose row-level atomicity because related data is spread across multiple rows. With flat-wide tables, a very wide row cannot be split, since HBase only splits data across region servers on row boundaries, never within a row.
Another important thing is to avoid hotspotting, i.e. choosing keys that will always result in hitting one region server more often than the others; that's why it's often better to salt the keys. Sequential keys are good for reading but bad for writing (hotspotting); random keys are good for writing but bad for reading. Key design depends on the data you have; typically it's good to use a combination of sequential and random components, as in the sketch below.
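As a hedged illustration of salting (the table and column names below are made up, and this uses the classic HBase client API of that era), prefixing a sequential key with a small salt bucket spreads writes across regions while keeping per-bucket scans possible:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeyExample {
        private static final int BUCKETS = 16; // number of salt buckets, a made-up choice

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "events"); // hypothetical table name

            long sequentialPart = System.currentTimeMillis();   // purely sequential component
            int salt = (int) (sequentialPart % BUCKETS);         // deterministic salt bucket
            byte[] rowKey = Bytes.toBytes(String.format("%02d-%d", salt, sequentialPart));

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("some value"));
            table.put(put);
            table.close();
        }
    }

The trade-off is that reads of a time range then have to fan out over all 16 bucket prefixes, which is the usual cost of salted keys.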
Practical HBase by Ravi Veeramchaneni, Informatica
Ravi was at Navteq prior to joining Informatica as Lead Architect. This session was about using HBase in production at Navteq.
Navteq Use Case
100s of millions of content records
100s of content suppliers + community support
On average one record has 120 attributes
Certain type of content attributes have close to 400 attributes
Content classified across 270+ categories
Being in such a situation made Navteq realize they needed a solution that would work well with these requirements. Navteq was one of the original users of HBase and as such has seen all the problems that occurred in the older, less stable versions. One of the biggest limitations in HBase is that there is only a single key (the row key). Secondary indexes are currently being worked on using coprocessors, but this is still being tested and won't be released as part of 0.92.
Navteq has done some work on Hive/HBase integration. One of their biggest reasons for using HBase was the ability to scale as they grow (just add nodes to the Hadoop cluster). Navteq also saw very limited use for running reducers in their use case.
Things to look out for in HBase:
Too many column families are not good. Navteq initially had close to 30 column families, which was causing issues with the region servers, especially while reading. They reduced their column families to 8-10; the ideal is 3-4.
Snappy is recommended as the compression format since it is 20% better compression than LZO
THBase/ITHBase (experimental libraries) don't work well for secondary indexes.
Do not run ZooKeeper on a DataNode.
Do not run the HBase Master on the NameNode.
Up the number of open files (ulimit -n) to 32K+.
Use Puppet/Chef/SCM Manager for configuration.
Set vm.swappiness to 0 or very low to prevent swapping.
Increase xceivers (dfs.datanode.max.xcievers) to 2047.
Set the socket timeout to 0 (dfs.datanode.socket.write.timeout).
Turn off the block cache.
Increase the max region file size to 2 GB (hbase.hregion.max.filesize); the default is 256 MB. Tune this based on the hardware specs (see the sketch below).
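As a minimal sketch of a few of these settings (in practice they belong in hdfs-site.xml and hbase-site.xml on every node rather than in client code; the values simply mirror the recommendations above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HBaseTuningSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Raise the DataNode transceiver limit (note the property's historical misspelling).
            conf.setInt("dfs.datanode.max.xcievers", 2047);
            // Disable the DataNode socket write timeout.
            conf.setInt("dfs.datanode.socket.write.timeout", 0);
            // Let regions grow to 2 GB before splitting (the default in this era was 256 MB).
            conf.setLong("hbase.hregion.max.filesize", 2L * 1024 * 1024 * 1024);
            System.out.println("Region max filesize: " + conf.getLong("hbase.hregion.max.filesize", 0));
        }
    }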
Practical H/W characteristics of an HBase node
General node specs:
24 GB RAM, 8 cores, 4x1TB drives, running 6 mappers and 6 reducers
Navteq node specs:
64 GB RAM, 24 cores, 8x2TB drives, running 16 mappers and 4 reducers
The way to split the memory for a general DataNode is as follows (Navteq's allocation in parentheses):
DataNode – 1GB (2GB)
TaskTracker – 1GB (2GB)
Mappers – 6x1GB (16x1.5GB)
Reducers – 6x1GB (4x1.5GB)
Region server – 8GB (24GB)
Total – 24GB (64GB)
Extending Enterprise Datawarehouse with Hadoop by Jonathan Seidman & Rob Lancaster, Orbitz
Orbitz currently uses Hadoop to drive hotel recommendations, airfare prediction and segmentation analysis. One of the common usage patterns at Orbitz is using Hadoop to complement ETL. Their process for this is:
Raw logs -> Hadoop -> ETL DB. Previously, they just moved data directly into the ETL database. They mentioned a use case where a few of their analysts had not been able to run queries against their DB for a year to do geo-targeted ads; with the arrival of Hadoop, they were able to do that. Orbitz also said they are moving their click-data processing to Hadoop.
Currently they transform 500 GB of uncompressed data per day in Hadoop down to about 5 GB, and then reprocess that in Hadoop again to get roughly 5 MB of summarized data (a sketch of what such a summarization job could look like follows).
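This is not Orbitz's actual code; the field position, delimiter and job name below are invented, and it uses the standard Hadoop MapReduce API of the time. It simply collapses raw log lines into per-key counts, which is the shape of job that turns gigabytes of logs into megabytes of summaries:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogSummarizer {

        // Mapper: emit (someKeyFromTheLogLine, 1) for each raw log record.
        public static class LogMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                if (fields.length > 2) {
                    outKey.set(fields[2]);   // hypothetical: third field holds the dimension to roll up on
                    context.write(outKey, ONE);
                }
            }
        }

        // Reducer (also used as combiner): sum the counts per key, producing the small summarized output.
        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "log-summarizer");
            job.setJarByClass(LogSummarizer.class);
            job.setMapperClass(LogMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }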
This was the end of Day 2 and of the Hadoop World conference.
Thanks for your patience.
Search Hadoop with Search-Hadoop.com
10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS…
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
2011 - The Year of Big Data
At the beginning of 2011, there was only one major vendor offering a distribution of Hadoop – the core technology for Big Data. That company was Cloudera. Over the past year, we have seen a mad rush among the Big Iron vendors to have a presence in Big Data:
- IBM Launches InfoSphere BigInsights
- Oracle Announces a partnership with Cloudera
- Yahoo spins off its Hadoop development into a new company called Hortonworks
- Microsoft announces Partnership with Hortonworks
- MapR Technologies awarded a $20M round of financing
With an abundance of data, and the availability of cost effective technology solutions, organizations lack only two key ingredients to succeed. First, a pool of skilled resources - both in the technology and data science fields. Second, the political will to recognize and exploit the findings revealed by the data. It is entirely possible that product lines may be sidelined, or that legacy work teams are supplanted by online dashboards and visualization tools.
Let’s be honest, if your organization is unwilling or unable to change based on the data, then you should reassess the purpose of your initiative. Readiness is not as much about the technology, but about the business objectives you set forth.
The key take-aways are:
- Big Data is still in its infancy at the enterprise level, attracting lots of pilots and exploration. There are very few trained technicians to implement the technology and even fewer experienced analysts and marketers that use Big Data in a “new way”.
- Data Movement (bandwidth) is still a big issue within established companies. Once we implement a Big Data Platform, we have to extract and import Terabytes of raw data on a daily basis, just to keep the platform relevant.
- Don’t assume that a big solution will come to you after you deploy the technology. Start with a well defined business objective.
Check out Dion Hinchcliffe’s Blog: Enterprise Web 2.0. In this refreshing post, Mr. Hinchcliffe explores what he calls “the clue gap”, where companies lack the internal capabilities to implement and execute on the platform. Good news for aspiring Data Scientists.
Hadoop.Next benchmark performance blog post up!
My second post on the Hortonworks blog is up. We've finally cracked the 0.23 benchmarking nut!
In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:
- Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
- NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the major functionalities of the JobTracker, resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, Graph processing, Iterative processing etc.
As we have discussed previously, delivering a major Apache Hadoop release takes a significant amount of effort to meet very strict reliability, scalability and performance requirements. Since Apache Hadoop (HDFS & MapReduce) are the core parts of the ecosystem, compatibility and integration of components in the upper layers of the stack (HBase, Pig, Hive, Oozie etc.) are critical for success of the new release.
In the tradition that we've followed for every single major (stable) release of Apache Hadoop, Hortonworks partnered with Yahoo! to benchmark and certify hadoop-0.23.1 on a performance cluster of 350 machines. Although performance improvements have been a continuous process since the beginning, they became the principal focus after the alpha release of Hadoop .Next (0.23.0).
We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop .Next (0.23.1) compared to the current stable hadoop-1.0 release. Even those that don’t perform significantly better are on par with hadoop-1.0.
The performance benchmarks are the same ones that we’ve been using to harden & stabilize major Hadoop releases throughout the lifetime of the project.
The aim of this process is to verify every single aspect of core Hadoop – to validate that there are no regressions at scale. These include the core HDFS and MapReduce (i.e. NextGen MapReduce, or YARN) and the applications that run on top of this framework.
Here are some details on the benchmark tests:
- The dfsio benchmark for measuring HDFS I/O (read/write) performance.
- The slive benchmark for measuring NameNode operations.
- The scan benchmark to measure HDFS I/O performance for MapReduce jobs.
- The shuffle benchmark to calibrate how fast the map-outputs are shuffled
- The famous sort benchmark which measures time for sorting data with MapReduce.
- The compression benchmark to validate how fast we compress intermediate and the final outputs of MapReduce jobs.
- The gridmix-V3 to measure the throughput of the cluster using a production trace of thousands of actual user jobs.
We also started using a couple of new benchmarks to cater to the architectural changes due to YARN:
- The ApplicationMaster Scalability benchmark to figure out how fast task/container scheduling happens at the MapReduce ApplicationMaster. Compared to hadoop-1.0, this benchmark ran twice as fast with hadoop-0.23.1.
- The ApplicationMaster Recoverability benchmark for measuring how fast jobs recover on restart.
- The ResourceManager Scalability to evaluate the central master’s scalability by simulating lots of nodes in a cluster.
- The Small Jobs benchmark to measure performance for very small jobs, which also runs more than twice as fast due to improvements where the tasks execute within the ApplicationMaster itself (as opposed to launching separate containers for the job's small number of tasks).
Many of the performance improvements can be attributed to the new architecture itself. Stay tuned for additional blogs on this topic.
Leaving YARN aside, i.e. the resource-management layer, the MapReduce runtime (map task, sort, shuffle, merge etc.) itself has many improvements when compared to hadoop-1.0. Some examples are: MAPREDUCE-64, MAPREDUCE-318, MAPREDUCE-240.
More information is available on MAPREDUCE-3561, which is the umbrella Apache Hadoop JIRA where we were tracking all our benchmarking efforts.
Benchmarking distributed systems is a very challenging task. It involves debugging, constant focus on one problem at a time, knowing which threads of investigation to follow and which to ignore and last, but not the least, patience and persistence. We had so much fun doing it and learnt some valuable lessons along the way. The process itself merits its own post.
Summary & Acknowledgements
We thank the Yahoo! Performance team for the cluster resources, development & performance teams for all the help along the way!
We are very excited to be delivering on the promise of Hadoop .Next and hope you can derive even better value from your Hadoop clusters.
- Vinod Kumar Vavilapalli a.k.a @tshooter
I step into the Hortonworks office and the first thing they do is convert my religion: they put a MacBook Pro in my lap and say nothing else can be done.
I am sure they did this only because they know way too well how big a fan(boy) I am of linux/ubuntu ;)
Here I go off, whining, hissing and trying to convert this machine to my linux setup as much as possible.
Hortonworks and Microsoft bring open-source Hadoop to Windows
There's probably no better way to open up big data to the masses than making it accessible and manipulatable — if that's a word — via Microsoft Excel. And that ability gets closer to reality Monday with the beta release of Hortonworks Data Platform for Windows. The product of a year-old collaboration between Hortonworks and Microsoft is now downloadable. General availability will come later in the second quarter, said Shaun Connolly, Hortonworks' VP of corporate strategy, in an interview.
The combination should make it easier to integrate data from SQL Server and Hadoop and to funnel all that into Excel for charting and pivoting and all the tasks Excel is good at, Connolly added.
He stressed that this means the very same Apache Hadoop distribution will run on Linux and Windows.
Microsoft opted to work with Hortonworks rather than to continue its own “Dryad” project, as GigaOM’s Derrick Harris reported a year ago. Those with long memories will recall this isn’t the first time that Microsoft relied on outside expertise for database work. The guts of early SQL Server came to the company via Sybase.
The intersection of structured SQL and Hadoop universes is indeed a hotspot, with companies including Hadoop rivals Cloudera and EMC Greenplum all working that fertile terrain so Hortonworks and Microsoft face stiff competition. This topic, along with real-time data tracking, will be discussed at GigaOM’s Structure Data conference in New York on March 20-21.
Auspiciously, this appropriate shot was the Bing image from yesterday. After 7 years in numerous guises at Microsoft - although always developer, developer, developer - I’m delighted to let you know that I’m heading to pastures new.
I couldn’t be more excited to be joining a team with a great vision and clarity of purpose, at a time of fantastic potential in the ‘Big Data’ market* and with such an exciting project and product. Even Microsoft is in on it. And Xbox.
Microsoft has been a blast - the people and the eccentricity of an organization so huge have made it an adventure from day to day. There’s too much to talk about there but you can always trawl the archives. But it’s time for more and new.
I can’t thank enough the people who’ve been part of that journey, and right now, I can’t thank enough my amazingly brave and supportive family as we embark on yet another adventure - not quite so far, but still a big shift.
*OK, OK, that sounds too mature for me so you figured that the real reason I’m interested in Big Data is because I think we’re all living in a simulation and really we’re just headed into ever increasingly fine grained quantification of life that ultimately will enable us to build a simulation so real that we’ll be able to have our own little universes, and then spend time wondering how far into the onion skin of similar universes we exist and then wonder what the point is and suffer a simultaneously crushing feeling of pointlessness and overwhelming release and freedom… OK, maybe it is :)
A few stats, rumors and stories on Hadoop's rapid growth
There's little hard data on the size of the largely private Hadoop market yet, but you can get a clue from looking at what's going on inside Silicon Valley. The money changing hands and the sizes of the largest players in the space alone are enough to paint a telling picture of a market that's growing fast in uncharted territory. I've collected some of the insights I've gleaned over the past few months to try and add some perspective.
Everything, of course, is relative and we might never see a Hadoop vendor reach the size of a database company such as Oracle with more than 100,000 employees and tens of billions in annual revenue. After all, Hadoop is a new technology for most companies, so it’s not really moving in on an already lucrative market and stealing budgetary dollars from incumbents. Further — and possibly more importantly — the core Hadoop technology is free and open source, meaning there are lots of unpaid downloads so money comes from services, support and large enterprises willing to buy software licenses for value-added products.
Here’s a chart showing how much money Hadoop-based companies have raised thus far (although the grand total will likely rise by at least $10 million next week). Keep in mind, Cloudera only launched in 2009 and Hortonworks launched in June 2011. And these aren’t companies that merely bury Hadoop under an application or can connect their technologies to it — these are companies either selling Hadoop or applications designed specifically for it.
In terms of revenue, one might look to a May 2012 report from research firm IDC estimating the size of the Hadoop ecosystem at around $77 million, growing to $813 million by 2016. Those are both impressive numbers, but they might actually be short-changing reality. For one, as I noted at the time, the authors attributed almost no revenue to Amazon Web Services' Elastic MapReduce service, which is almost certainly generating at least a few million in revenue each year.
Speaking to me in June, Cloudera CEO Mike Olson also took issue with the number, claiming it didn’t even take Cloudera’s revenue into account — which seems entirely possible considering the business Cloudera is doing. I’ve heard from reliable sources that Cloudera is doing very well and is on track to do about $100 million in revenue this year, very possibly more. And as early as April 2011, Cloudera executives were touting that software license revenue had already surpassed services revenue (although it’s arguable whether that will, or even has to, remain the case).
More anecdotally, I’ve heard from several sources that Hortonworks has already declined at least one potentially appealing acquisition offer. That it wouldn’t sell isn’t surprising: sources say the company is valued at $225 million after its last round of funding and is looking to raise more money. And although it just released its first product in June, the company has impressive and potentially lucrative partnerships in place with Microsoft, Teradata, Rackspace, VMware and other large vendors.
MapR, the proprietary thorn in the sides of both Cloudera and Hortonworks, appears to be doing quite well, too. Vice President of Marketing Jack Norris told me in June that his company had higher license revenue than many would expect and predicted that deals with Amazon Web Services and Google Compute Engine would help the company become “the license revenue leader within the next quarter.”
Former Cloudera VP of Technology Solutions Omer Trajman, who just left to join HBase-centric startup WibiData, shared some insights with me from his days at Cloudera that seem to back up vendor confidence. He said most mature production clusters (excluding monster users such as Facebook) consist of about 200 nodes, and many double in size after the first year. That's part of the reason Cloudera grew in size about 10x during the three years he was there.
“It has definitely been a rocket ship,” he said. “… You just strap in and hope you make it up.”
Interest is only picking up, too: "There are more people that have started big data projects in the past six months than have big data projects running [in production]," Trajman said.
It’s probably not accurate to call companies such as Cloudera, Hortonworks and MapR startups anymore, and we might start to see signs of this shift in personnel moves. Here’s how big they are and expect to become:
* Cloudera: More than 300 employees globally and growing, especially in the sales department.
* Hortonworks: 145 employees as of late October and hiring a person per day, on average, through the end of 2012.
* MapR: More than 125 employees, mostly in technical and engineering positions; starting to build sales team and looks to more than double headcount in 2013.
While Cloudera and Hortonworks, for example, are still young, nimble and agile enough to lure a fair amount of talent from now-officially large enterprises such as VMware, their employees who joined on early and really love the startup life might not stick around.
Trajman’s new home, WibiData, is a fine example of this. It was launched last year by former Cloudera employees Christophe Bisciglia (who actually co-founded Cloudera) and Aaron Kimball to help companies build behavioral-analysis applications on top of Hadoop.
(Maybe there’s a Cloudera mafia shaping up: WibiData’s officemates — MemCachier and Thanx — both count former Cloudera employees as key members or founders of their teams, as does HBase-centric startup Drawn to Scale.)
Trajman, who was one of the first couple dozen employees at Cloudera (and who previously joined Vertica at around the same stage in its growth), told me he likes the rush of getting in at the ground level of new technologies and helping companies do something really new. While he enjoyed establishing and implementing some of the core foundational use cases for Hadoop (e.g., ETL and data exploration) with Cloudera's early customers, that's still much of what Cloudera provides to customers because it's so difficult to build higher-level and higher-value applications at the infrastructural level where Cloudera operates.
“For me, it was very personal in terms of the impact I wanted to have,” Trajman said. At WibiData, he can help users who have the infrastructure part resolved and now want to develop applications that make data analysis a core part of their businesses. Where there’s a focus on innovation, he said, that’s where the innovators go.
This isn’t a bad thing, it’s just a side effect of growth — and when employees stay and innovate in the Hadoop space, it just creates a bigger pie for everyone to share.
Feature image courtesy of Shutterstock user GuskovaNatalia.