We live in the age of the algorithm. Increasingly, the decisions that affect our lives—where we go to school, whether we get a car loan, how much we pay for health insurance—are being made not by humans, but by mathematical models. In theory, this should lead to greater fairness: Everyone is judged according to the same rules, and bias is eliminated. But as Cathy O’Neil reveals in this shocking book, the opposite is true. The models being used today are opaque, unregulated, and uncontestable, even when they’re wrong. Most troubling, they reinforce discrimination: If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his race or neighborhood), he’s then cut off from the kind of education that could pull him out of poverty, and a vicious spiral ensues. Models are propping up the lucky and punishing the downtrodden, creating a “toxic cocktail for democracy.” Welcome to the dark side of Big Data.
7 steps: Using digital for your 'entrepreneurship and brand' for impact


A 7-step guide to help you build a brand and empower your entrepreneurship and business for maximum impact, with valuable and tangible results, in today’s digital world:

  1. Goals: Be very clear on your goals and at the same time be very specific. For example, instead of saying I want to achieve “sales”, say I…


It is easier to charge those Big Data waves when you have afternoon tea courtesy @cherrylaurelstudio . Thank you, Elizabeth! 👍🏻😄 Keyword research, I’m ready! 📖📚📝🔍❤️ #SEO #keyword #research #bigdata #tea #biscuits #cookies #teapot #ladyboss #workfromhome #impactzoneco [Photo ©. 2015 Jewel H. Ward.] (at Chapel Hill, North Carolina)

“Like Bringing a Club to a Gunfight”

Many readers will be familiar with the old adage ‘like bringing a knife to a gunfight’. I recently encountered a small multi-store retailer, with Walmart as its primary foe, that decided to drop its loyalty program. This retailer will be entering a retail gunfight armed with nothing more than a club when competing with the world’s largest retailer.

Sadly, this is not the first retailer that has decided to drop loyalty. Retailers such as this simply do not understand that loyalty programs are not about cards or key tags, and they are not about two-tier pricing or points programs. These initiatives are really shopper-intelligence programs. The goal is gaining shopper-identified purchase data and using the insights gained to improve your business through more efficient marketing and an improved understanding of merchandising, pricing, operations, and so on. Today’s retail battle is being waged for the shopper and share-of-wallet. Without the ability to measure shoppers and their purchasing, a retailer cannot even get on the playing field.

I’ve seen this story before. The buyers or merchandisers or department heads - guaranteed to be older and in the business for decades - argue that tying sale items to the card harms sales. The reason for their weak sales performance isn’t the fact they are relying on the same old items and prices promoted through the same old weekly ad. No, they say the problem is the ‘card’ or loyalty program.

How on earth does a small retailer hope to compete with the largest retailer on price? They may say they compete on customer service - code for ‘we’ve been around for a lot of years’ - but that just doesn’t cut it in today’s over-stored world, with anything you want available online at the touch of a button. This retailer, and retailers like them, are blind to the reality that they are going to have to change how they do business. And they can’t do it. Sometimes it’s an older owner who just wants to keep things going for a couple more years to retirement; other times it’s a culture of ‘we always did it this way and it worked’. No matter. As a friend of mine was fond of saying, ‘someone has to be cannon fodder’.

What is frustrating is that today, for the first time ever, technology is riding to the rescue of smaller retailers in the form of incredibly sophisticated solutions and capabilities being made available through the cloud very cost effectively. Retailers of any size can deploy tools that enable them to gather, understand, and use shopper intelligence throughout their business.

Smaller and even regional retailers must decide to up their game or risk being relegated to the role of large, empty convenience stores as larger retailers leverage shopper-centric initiatives to grow shopper share-of-wallet.

Data Preprocessing: Things to Consider

Data in the real world is inconsistent, noisy, and riddled with missing values. To improve the quality of the data, preprocessing is required. There are several approaches to preprocessing: data cleaning, data integration, data reduction, and data transformation.
Raw Data:
Noisy: the data contains outliers, values that do not follow the regular pattern and differ from the rest; it can also contain anomalies and errors.
Missing: the data can have missing values.
Inconsistent: names or values contain discrepancies.
Aggregated: the dataset contains only aggregated data.

Sample Dataset:

The sample dataset is the US Census Income data, taken from the UCI Machine Learning Repository.

Let’s discuss the various stages of data preprocessing:

Data Cleaning:
1. Fill in missing values.
2. Smooth out noisy data.
3. Manually clean the data.
4. Convert one form of data to another (nominal to numeric, binary to numeric).

There are various ways of handling missing values:
a> Ignore the record.
b> Replace with a fixed constant.
c> Replace with the mean of the values if they are numeric, or the mode if they are categorical.
d> Predict the values using a learning algorithm.
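A minimal sketch of strategies a> through c> in Python, using only the standard library; the two columns are made up, and d> (predicting with a learned model) is omitted for brevity:

```python
# Hypothetical columns: a numeric one and a categorical one, with gaps.
from statistics import mean, mode

ages = [25, None, 47, 33, None, 51]        # numeric column
jobs = ["clerk", "sales", None, "sales"]   # categorical column

# a> Ignore: drop the records with missing values.
ages_dropped = [a for a in ages if a is not None]

# b> Replace with a fixed constant.
ages_const = [a if a is not None else 0 for a in ages]

# c> Replace numeric gaps with the mean, categorical gaps with the mode.
age_mean = mean(a for a in ages if a is not None)
ages_mean_filled = [a if a is not None else age_mean for a in ages]

job_mode = mode(j for j in jobs if j is not None)
jobs_mode_filled = [j if j is not None else job_mode for j in jobs]

print(ages_mean_filled)   # the two gaps become 39, the mean of the rest
print(jobs_mode_filled)   # the gap becomes "sales", the most common value
```

Note that a> shrinks the dataset while b> and c> preserve its size, which matters if other columns in the same rows are still useful.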

For handling outliers:
a> Binning: partition the values into bins (sort them first), then smooth the values within each bin.
b> Clustering: cluster the values and remove the outliers.
c> Employ regression functions.
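As one concrete sketch of option a>, here is smoothing by equal-frequency bin means in Python: sort, split into bins, and replace each value with its bin’s mean. The price list is a made-up example:

```python
def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-frequency bins, and
    replace each value with the mean of its bin."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # the last bin absorbs any leftover values
        end = len(data) if i == n_bins - 1 else (i + 1) * size
        bin_ = data[i * size:end]
        bin_mean = sum(bin_) / len(bin_)
        smoothed.extend([bin_mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Extreme values get pulled toward their bin’s mean instead of dominating it; clustering-based removal (option b>) would instead discard them outright.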

Data Transformation:
1. Normalization.
2. Smoothing.
3. Aggregation.
4. Generalization.
5. New feature addition.

Normalization scales attribute values to fit within a specific range.
Can be done by:
-> MinMax:
Linearly transform the data so that the minimum and maximum fall within a specified range, such as (-1, 1) or (0, 1).
-> ZScore:
Transform the data so that its mean equals 0 and its standard deviation equals 1.
-> Decimal Scaling:
Move the decimal point so that every value falls between -1 and 1.
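The three scaling methods can be sketched in a few lines of standard-library Python; the values below are hypothetical, and the min-max version maps to the (0, 1) variant:

```python
from statistics import mean, pstdev

values = [200.0, 300.0, 400.0, 600.0, 1000.0]

# MinMax: linearly map each value into [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# ZScore: subtract the mean and divide by the standard deviation,
# so the transformed data has mean 0 and standard deviation 1.
mu, sigma = mean(values), pstdev(values)
zscore = [(v - mu) / sigma for v in values]

# Decimal scaling: divide by 10**j for the smallest j that puts
# every value strictly inside (-1, 1).
j = 0
while max(abs(v) for v in values) / 10 ** j >= 1:
    j += 1
decimal_scaled = [v / 10 ** j for v in values]

print(minmax)          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaled)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```

Min-max is sensitive to outliers (one extreme value squashes everything else), which is one reason z-score scaling is often preferred on noisy data.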

Smoothing involves removing noise from data.

To be continued.

We must learn how to ask hard questions of technology and of those making decisions based on data-driven tech. And opening the black box isn’t enough. Transparency of data, algorithms, and technology isn’t enough. We need to build assessment into any system that we roll out. You can’t just put millions of dollars of surveillance equipment into the hands of the police in the hope of creating police accountability, yet, with police body-worn cameras, that’s exactly what we’re doing. And we’re not even trying to assess the implications. This is probably the fastest roll-out of a technology driven by hope, and it won’t be the last. How do we get people to look beyond their hopes and fears and actively interrogate the trade-offs?

#BigData Landscape 2016 showing important #Startups and players in the area. Interesting area to follow. I think the era of large tech companies providing all the solutions in this space is over, and we are back to a best-of-breed approach.

Visualization Capabilities, QuickApps in 1010data


Visualization Capabilities, QuickApps in 1010data : 1010data is a leading provider of big data discovery and data sharing solutions. It is used by hundreds of the world’s largest retail, m


Shifting from “big data,” because it’s become code for “big brother,” tech deployed the language of “artificial intelligence” to mean all things tech, knowing full well that decades of Hollywood hype would prompt critics to ask about killer robots. So, weirdly enough, it was usually the tech actors who brought up killer robots, if only to encourage attendees not to think about them. Don’t think of an elephant. Even as the demo robots at the venue revealed the limitations of humanoid robots, the conversation became frothy with concern, enabling many in tech to avoid talking about the complex and messy social dynamics that are underway, except to say that “ethics is important.” What about equality and fairness?

Learn more about the #bigdata that’s #revolutionizing our #world 🌎 http://datatalkshow.com/page/2/
#superbigdata #datarevolution #dataforbreakfast #bigdatabigideas #bigdatanews #datanews #learnbigdata #bigdataadventure #bignewscoming #bignews #datatalkshow #datatalk

Apache Hadoop at 20

A week or two ago, Doug Cutting wrote up a ten-year retrospective on Apache Hadoop for the project’s birthday. I enjoyed it. As co-creator of the project, Doug’s had a privileged seat from which to watch the decade unfold. I really liked the fact that he called out the contributions of the global Apache developer community so strongly. I believe that’s the key reason that Hadoop has been so successful.

Plus, I had never seen Hadoop the elephant sitting on the Stone of Destiny before.

Doug’s post was, though, backward-looking. It talked a lot about Hadoop’s history, and not so much about its future. I’d like to point the lens in the other direction, and consider what’s likely to happen over the next ten years.

What the heck is Hadoop?

The mass of software that we mean when we talk about Hadoop today has little in common with the code that Yahoo! rolled into production in 2006. The original project was based on two components developed at Google and described in the research literature: a distributed storage layer and the MapReduce processing framework.

A new data processing layer — distributed, easy to program compared with grid systems of the day — was built on top of a large-scale, inexpensive storage layer. That combination was new, and allowed web companies to do crazy things with way more data than they’d ever been able to handle before.

Those two components weren’t enough to solve the wide variety of problems that businesses have, though. New projects have emerged in the ten years to broaden its utility. (In fact, those two components weren’t enough to solve the problems that Google had — it’s running a substantially evolved and much more diverse collection of systems today, too).

A partial snapshot of our offering today looks like:

[image: snapshot of the current platform components]

All that blue stuff is new. It doesn’t just add new capabilities to the original Hadoop components. It’s shifted the center of gravity in the community dramatically. The platform layer in particular is dominated now by real-time support and rich interactive processing models:

[image: the platform layer’s real-time and interactive components]

In fact, the story is even more powerful for new components than the picture suggests. I didn’t have room to put in pieces like YARN, the resource scheduling framework, or Apache Sentry for security — those cross-component pieces that touch all parts of the platform are just hard to fit into the space available. They, and others, are all rolled up in the “dot dot dots” at the right end of the picture.

The original components — HDFS and MapReduce — are down to just ten percent of the code contributions in the total ecosystem, while the newer components are at 90% and growing fast. More importantly, when we look at the share of work being done by the various components, new workloads are rolling out mostly on top of the new pieces. Legacy MapReduce will always matter, but that’s just not where the action is anymore.

And yet when people say “Hadoop,” informally, they generally mean the two original Apache Hadoop components, plus all the new projects that have grown up around them. Hadoop today still shows the outlines of its original incarnation, but is a dramatically larger, more powerful and more interesting collection of technologies than it was.

Where the software is going

The last-ten-years trend is certain to continue. We’ll see the Hadoop ecosystem grow over the coming years as new projects are created to handle new data and to offer new analytic approaches.

Three years ago, Apache Spark was thought to be a deadly threat to Hadoop. Today it’s an essential part of the broader ecosystem. We’re madly in love with it, but we’re also scanning the horizon for its successor as the Cool New Project. I’ve no doubt that such a successor will come — again, and again, and again. That’s not to say that Spark is doomed, but that there’ll be new frameworks, especially in such a fast-moving open source ecosystem. Smart people are going to keep having new ideas. The flexibility that has allowed Hadoop to incorporate so many diverse projects over time bodes well for its continued growth.

There is, of course, evolutionary work to be done on parts of the platform that exist today. Spark needs to get better in a variety of ways for secure enterprise use. YARN has opened up the door to multi-tenant workloads, but it’s too prescriptive, and not responsive enough. We need that to change if we’re to deliver real multi-user, multi-workload support for big data. We’re really excited about Apache Kudu (incubating), but it’s young and has much to prove in production.

A few young open source projects already show real promise of joining the Hadoop party. Google developed its Dataflow software for internal use, but has collaborated with the Apache community to create a new project, Apache Beam (incubating), for managing data processing flows, including ingest and data integration across components. Developer tools like Apache HTrace (incubating) are aimed at testing and improving performance of distributed systems.

We’ve already seen cloud-native big data solutions built on Hadoop in the market. Amazon EMR, Microsoft’s Azure HDInsight and Google Cloud Dataproc are on-demand offerings that make it fast and easy to spin up a cluster in the public cloud. The coming years will, without question, see the platform embrace both datacenter and cloud deployments natively, including elasticity and consumption-based pricing. Users absolutely want that flexibility, and it’ll be baked into the platform over the near term.

There may well come a point when we all decide that the birds have left the dinosaurs behind. As the evolution continues, we may one day stop calling this platform Hadoop. Even if we do, I am confident that it will be defined by the key strengths that Doug and Mike Cafarella created when they started the project a decade ago:

  • A vibrant open source developer community, collaborating around the world to innovate faster than any single company could alone.
  • Extensibility as a fundamental design property. Hadoop’s ability to embrace SQL, real-time processing with Spark, and new storage substrates like Apache HBase and Kudu allows it to evolve.
  • A deep systems focus. Knitting together large-scale distributed infrastructure — managing memory, disks and CPUs as a fabric — was fantastically hard when Hadoop was created. By doing that hard work for developers, the Hadoop community has allowed application programmers to concentrate on business services and interfaces, instead of on knotty infrastructure issues. Those problems get no easier in the future. Whatever Hadoop becomes, it will be driven by hard-core systems people.

The action is in hardware

Most of the action for the next ten years, though, will be in hardware.

Google — the inspiration for Hadoop — built its infrastructure to run on the cheap pizza-box systems available in the late 1990s. The fundamental design decisions in the current-generation software were made then. Everywhere you look, you see them: Disk is cheap; memory is expensive. A random disk bit is about a million times further away than a random RAM bit. You need lots of copies of data because so many things can go wrong. Processors inside the chassis are really close; processors in the same rack are pretty close; processors in other racks are far away; processors in other data centers do not exist. And so on.

Those were laws of physics in 1999. We violate them regularly today. Ten years hence, they’ll all be wrong.

Because of the relationship that we have with Intel, Cloudera gets to look at the future of hardware early. There are dramatic changes coming in pretty much every part of the hardware ecosystem. Those changes will mean much better, faster and more powerful systems. To deliver them, though, we’ll have to make fundamental changes to the Hadoop software platform.

One example, now publicly announced, is 3D XPoint™ (pronounced “crosspoint”) technology. Intel solid-state drives based on the technology will begin to ship in 2016. They’re non-volatile, so they survive power outages without losing data. 3D XPoint™ technology isn’t as fast as RAM, but it’s vastly faster than disk, and up to 1,000 times faster than NAND flash. More importantly, it offers up to 10x the storage density of traditional RAM. You can pack a lot of bits in a very little space.

Google’s architecture was disk-heavy. It needed three times more disk than the data it stored, for redundancy (disks have lots of moving parts that fail) and for performance (latency is so bad you want to spread workloads out to reduce head contention). Disk is ravenous for power — it takes lots of electrons to spin those motors and move those arms.

3D XPoint™ technology has exactly zero of those problems. Granted, it costs more than disk, but that curve is pointed in the right direction, and over ten years that story will get way better.

More interestingly: beginning in the 1960s, we came up with complicated ways to organize data because we had to move it from high-latency, cheap systems like tape and disk to fast, expensive systems like RAM. We created logs and B-trees and all kinds of other page-based systems to accommodate that separation.

If 3D XPoint™ technology is disky in persistence, density and price, and RAMmy in its latency and throughput, then maybe we can get rid of the split. Perhaps we can organize our data for the convenience of our algorithms and our users, instead of for our disk heads. There is enormous potential to wipe fifty years’ worth of complexity from the slate, and to build for a new generation of storage.

This is one example of many underway. The coming years will see further innovation in storage, dramatic improvements in networking and new ways to push special-purpose processing into the overall fabric of our computer systems. Ten-year-old Hadoop was built on the bedrock of Google’s pizza-box assumptions. As that foundation shifts, so must the software.

Data, data everywhere

The hardware and software are locked in a feedback loop. Each will change as a result of the other, and each will drive more changes from there.

One thing, though, won’t change. There’s going to be a ton of data.

That’ll be driven by the hardware and software, of course. We’ll have sensors and actuators everywhere, and we’ll run code that wants to talk about its status.

Most of all, though, the continued growth in data will be driven by the value of creating and collecting it. The more detailed and specific information we have about the world, the better we can understand it. The more powerful the software tools that we have for analyzing it — SQL and machine learning and all the new techniques the next decade will create — the more we’ll know, and the better we will be able to anticipate and act.

The post Apache Hadoop at 20 appeared first on Cloudera VISION.

Surgical outcomes prediction at Dartmouth-Hitchcock Medical Center using SAP HANA


Surgical outcomes prediction at Dartmouth-Hitchcock Medical Center using SAP HANA : Dartmouth-Hitchcock health system (D-H) is working with analytics solutions from SAP and the SAP HANA platform to


Last #weekend, the @hackworksinc team and I ran the #Scotiabank Hack IT: Debt Challenge. A pretty rad #FinTech #hackathon and the bank’s first ever! $25,000 found a new home that day with the three incredibly talented finalist teams. The best part for me: the astonishing #openness of the #bank for #digital #disruption, the passion of the participants and knowing that I’ve contributed a teeny, tiny little bit in changing #banking for the #better.
#whatididlastweekend #innovation #banking #canada #toronto #entrepreneurship #entrepreneurs #tech #techstartup #startuplife #coding #codingjam #bigdata #debt #debtmanagement #socialgood #Finance #scotiabankcentre #bigbanks

The Mainframe Takes Me on Vacation

Yes indeed, because besides originating and storing 80% of the world’s corporate data, the mainframe processes almost every airline reservation made, and it also powers the work of hotel chains, railway companies, and more. Innovation and industrial transformation to improve our lives, in the era of the IoT and Big Data.

We are not aware of the mainframe’s applications until we stop to think about it: bank transactions, credit card use, city traffic control, weather forecasting… these are just some everyday examples of its use. But one question remains: what is a mainframe to you?

Image | IBM


Sometimes when people hear ANOVA, they think of a “nova” or a “supernova”. Supernovas are star explosions, while novas are sudden, temporary changes in brightness. A big difference indeed.

In layman’s terms, ANOVA analysis tells you whether or not there is a big enough difference to matter.

Chi Yau (2013) writes “Analysis of Variance… enables a researcher to differentiate treatment results based on easily computed statistical quantities from the treatment outcome.”

Why is that important? In statistics, a factor is a nominal variable with several levels (i.e., options). Think of a factor as a question to be pondered, and a level of the factor as a possible solution.

In a restaurant, a question could be: which sandwich is best? The levels could be the different sandwiches.

Which solutions are best beyond a reasonable doubt of chance? There is always some amount of chance that could account for the differences. ANOVA is a formal test for determining which differences are most likely real. Reliance on formal tests reduces the occurrence of “research bias”.

“Research bias, also called experimenter bias, is a process where the scientists performing the research influence the results, in order to portray a certain outcome.” (Martyn Shuttleworth 2009)
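To make the idea of a formal test concrete, here is a by-hand sketch of the one-way ANOVA F statistic in Python; the sandwich groups and their ratings are invented, echoing the restaurant example above:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_vals = [v for g in groups for v in g]
    grand_mean = mean(all_vals)
    k = len(groups)      # number of levels (sandwiches)
    n = len(all_vals)    # total observations

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)   # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)     # n - k degrees of freedom
    return ms_between / ms_within

club = [7, 8, 9, 8]
blt = [5, 6, 5, 6]
reuben = [9, 9, 8, 10]
print(round(one_way_anova_f([club, blt, reuben]), 2))  # 23.4
```

The resulting F is then compared against an F distribution with (k - 1, n - k) degrees of freedom; a large F means the differences between sandwiches are unlikely to be chance alone.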

Data Quantity and Degrees of Freedom

Of note is that ANOVA requires enough degrees of freedom to do its calculations. If the model becomes saturated, then ANOVA cannot be done.

“A saturated model has as many parameters as data values. When a model has as many parameters as there is information in the data, the model is saturated, and the model is said to have zero degrees of freedom.” (Carolyn J. Anderson)

If one is able to find enough rows, then one will be able to perform the complete analysis. On many occasions, knowing that you need more data is also a useful result.
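That “enough rows” check can be sketched for a one-way ANOVA, where the fitted parameters are the k group means; the group sizes here are hypothetical:

```python
def anova_residual_df(group_sizes):
    """Residual degrees of freedom: data values minus fitted parameters."""
    n = sum(group_sizes)   # total data values
    k = len(group_sizes)   # one mean estimated per group
    return n - k

print(anova_residual_df([4, 4, 4]))  # 9: plenty of room for the test
print(anova_residual_df([1, 1, 1]))  # 0: saturated, ANOVA cannot be run
```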


ANOVA is a powerful tool: it can inform researchers when they need more data, and whether different test results are truly different enough to matter.

Now you know your ANOVA from a nova.


Carolyn J. Anderson (n.d.). “Saturated Model.” The SAGE Encyclopedia of Social Science Research Methods. https://srmo.sagepub.com/view/the-sage-encyclopedia-of-social-science-research-methods/n889.xml

Chi Yau (2013). “R Tutorial with Bayesian Statistics Using OpenBUGS.” http://www.r-tutor.com/category/statistical-concept/anova

Martyn Shuttleworth (Feb 5, 2009). “Research Bias.” Retrieved Oct 30, 2015 from Explorable.com: https://explorable.com/research-bias

NASA/ESA, The Hubble Key Project Team and The High-Z Supernova Search Team (1999). Hubble Space Telescope Image of Supernova 1994D.