LexisNexis HPCC Takes On Hadoop as Battle for Big Data Supremacy Heats Up
Over the last ten years, LexisNexis Risk Solutions has developed what it calls a High Performance Computing Cluster (HPCC), a proprietary platform for processing and analyzing large volumes of data for its clients in finance, utilities and government. This week the company made HPCC open source and spun off HPCC Systems to develop and market the technology.
LexisNexis is positioning HPCC as a competitor to Apache Hadoop, the open source software framework for Big Data processing and analytics. The entry of LexisNexis and HPCC into the Big Data ecosystem is yet another validation of the Big Data space and should spur innovation from all parties – HPCC, Hadoop and others.
Whether HPCC is a viable competitor to Hadoop for Big Data dominance is another question. LexisNexis, which has vast experience in collecting and processing large volumes of media and industry data, certainly thinks it is. The answer, of course, depends on a number of factors, most of which are not yet clear. Here is my initial analysis:
Maturity – Ten years in the making, HPCC has a three-year head start over Hadoop, which was developed, more or less, in 2004. Since then, however, Hadoop has benefited from the contributions of thousands of developers via the Apache Software Foundation. HPCC was developed behind closed doors by an undetermined number of LexisNexis developers. If you subscribe to the notion that two heads are better than one, Hadoop is likely the more mature technology thanks to its open source heritage, despite being a few years younger than HPCC.
Programming language – While Hadoop is, overall, the more mature of the two Big Data technologies, HPCC may have the edge in some specific functional areas. LexisNexis claims HPCC’s programming language, Enterprise Control Language (ECL), enables “data analysts and developers to define ‘what’ they want to do with their data instead of giving the system step-by-step instructions.” Though it doesn’t specifically call out Pig Latin, the most popular Hadoop programming language, LexisNexis is implying that ECL is the easier and faster of the two languages in which to create Big Data processing jobs. If true, this could be an advantage for HPCC over Hadoop, whose main drawback is that it requires significant expertise to use.
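To make the "what" versus "step-by-step" distinction concrete, here is a minimal sketch in plain Python. It is neither ECL nor Pig Latin, and the sample records and function names are hypothetical, made up purely for illustration. Both functions compute the same per-state totals; the first spells out how to accumulate them, while the second only states the result that is wanted.

```python
from collections import defaultdict

# Hypothetical sample records, invented for this illustration only.
records = [
    {"state": "NY", "amount": 120},
    {"state": "CA", "amount": 75},
    {"state": "NY", "amount": 30},
    {"state": "CA", "amount": 25},
]


def totals_step_by_step(rows):
    """Imperative style: spell out how to loop, accumulate and store."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["state"]] += row["amount"]
    return dict(totals)


def totals_declarative(rows):
    """Declarative-leaning style: describe the result (sum of amount per state)."""
    states = {row["state"] for row in rows}
    return {s: sum(r["amount"] for r in rows if r["state"] == s) for s in states}


print(totals_step_by_step(records))
print(totals_declarative(records))
```

In a truly declarative language the system, not the programmer, decides how to execute the request across the cluster. That is the kind of benefit LexisNexis appears to be claiming for ECL, though this toy example cannot confirm or refute it.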
Real-time data analytics – HPCC’s Rapid Data Delivery Engine, a.k.a. Roxie, allows users to run real-time queries against HPCC. This appears to be an advantage over Hadoop, which is generally batch-oriented and used for “rear-view mirror” analysis. On closer inspection, however, although Roxie can return query results in under a second, the data it hits has already passed through Thor, HPCC’s data processing cluster. So the queries run in near real-time, but the data is not updated in real-time. In other words, HPCC is not capable of real-time data analytics, as far as I can tell, so on this point it’s a wash. Facebook is working on some interesting real-time analytics jobs using Hadoop, however, which, if applicable to other use cases, could be a significant improvement for Hadoop and a differentiator over HPCC.
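The gap between fast queries and fresh data can be shown with a toy Python sketch. This is not how Thor or Roxie are implemented; `build_index` and `query` are hypothetical stand-ins for a batch processing step and a low-latency serving step, and the customer data is invented.

```python
def build_index(raw_records):
    """Batch step (stand-in for a Thor-like job): aggregate raw records into a lookup table."""
    index = {}
    for customer, amount in raw_records:
        index[customer] = index.get(customer, 0) + amount
    return index


def query(index, customer):
    """Serving step (stand-in for a Roxie-like query): a fast lookup on prebuilt data."""
    return index.get(customer, 0)


# Hypothetical raw data, processed once by the batch step.
raw = [("alice", 10), ("bob", 5), ("alice", 7)]
index = build_index(raw)
before = query(index, "alice")

# New data arrives after the batch run; the served index does not see it.
raw.append(("alice", 100))
stale = query(index, "alice")

# Only a fresh batch run makes the new record visible to queries.
index = build_index(raw)
fresh = query(index, "alice")

print(before, stale, fresh)
```

The lookup is quick every time, but its answer is only as current as the last batch run, which is the distinction drawn above between near real-time queries and real-time data.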
Open source acceptance – Gaining acceptance by the open source community is not as easy as just joining the Linux Foundation, as HPCC has done. You have to put in your time, not to mention your code. HPCC Systems has made HPCC open source, but LexisNexis “will not release its data sources, data products, the unique data linking technology, or any of the linking applications that are built into its products. These assets will remain proprietary and will not be released as open source.” This will not endear HPCC to the open source community. In contrast, Hadoop has been open source for virtually its entire existence and has a hardcore following of dedicated contributors. In order for HPCC to benefit from the open source model, it needs to attract talented developers to contribute new, innovative features. It remains to be seen if the open source community will embrace HPCC.