A handy and rather complete cheat sheet on probability and inferential statistics. It comes from William Chen, Data Scientist at Quora, a website which, by the way, is one of the best references on Data Science on the Internet (start, for example, here).
A tendency to search for or interpret information in a way that
confirms one’s preconceptions, leading to statistical errors. A phenomenon wherein decision makers have been shown to actively seek out and assign more weight to evidence that confirms their hypothesis,
and ignore or under-weigh evidence that could disconfirm their hypothesis.
Here come the robot reporters. This week the AP announced it will use software to automatically generate news stories about
college sports that it didn’t previously cover. Specifically, it’s
turning to a content generation tool called Wordsmith, created by a
Durham, North Carolina-based company called Automated Insights.
It’s the latest case of big news organizations turning to algorithms to
create content. The AP — which is an investor in Automated Insights —
already uses Wordsmith to generate stories on corporate quarterly
earnings reports. Meanwhile, automated content competitor Narrative Science provides similar services to publications such as Fortune and Big Ten Network. And a Los Angeles Times journalist used custom software to auto-generate a story minutes after an earthquake hit Los Angeles last year.
But is anyone actually reading any of this machine generated content?
Automated Insights CEO Robbie Allen says that’s the wrong question to
ask. Although the company generated over one billion pieces of content
in 2014 alone, most of this verbiage isn’t meant for a mass audience.
Rather, Wordsmith is acting as a sort of personal data scientist,
sifting through reams of data that might otherwise go un-analyzed and
creating custom reports that often have an audience of one.
For example, the company generates Fantasy Football game summaries
for millions of Yahoo users each day during the Fantasy Football season,
and it helps companies turn confusing spreadsheets into short, human
readable reports. One day you might even have your own personal robot
journalist, filing daily stories just for you on your fitness tracking
data and your personal finances.
“We sort of flip the traditional content creation model on its head,”
he says. “Instead of one story with a million page views, we’ll have a
million stories with one page view each.”
Wordsmith essentially does two things. First, it ingests a bunch of
structured data and analyzes it to find the interesting points, such as
which players didn’t do as well as expected in a particular game. Then
it weaves those insights into a human readable chunk of text. You can
think of it as a highly complex form of Mad Libs — one that takes an understanding of both data and writing to create.
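The two-step pipeline described above — find the interesting points in structured data, then weave them into text — can be sketched in a few lines of Python. Everything here (field names, the "underperformance" heuristic, the sentence template) is invented for illustration; Wordsmith's actual rules and templates are proprietary.

```python
# A toy "data to story" pipeline: pick a notable fact out of structured
# data, then fill it into a sentence template (the Mad Libs step).
# All field names and the heuristic below are made up for illustration.

def find_insight(player_stats):
    """Return the player who fell furthest short of expectations."""
    return min(player_stats, key=lambda p: p["points"] - p["projected"])

def render(insight):
    """Weave the selected data point into human-readable text."""
    template = ("{name} disappointed, scoring {points} points "
                "against a projection of {projected}.")
    return template.format(**insight)

stats = [
    {"name": "Smith", "points": 21, "projected": 15},
    {"name": "Jones", "points": 4, "projected": 18},
]

story = render(find_insight(stats))
print(story)
```

Real systems layer many such heuristics and templates, but the shape — analysis pass, then narration pass — is the same.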
Allen came up with the idea eight years ago, back when he was working
as an engineer for Cisco. Allen, who has written ten books, wanted to
create something new, so he decided to combine his passion for computer
science, writing, and sports analysis into a company called StatSheet.
“The traditional approach of hiring a lot of writers wasn’t
attractive to me,” he says. “What’s exciting about sports recaps is that
90 percent of what you do is write about the numbers.”
Soon, however, Allen realized that the idea could be applied to any
quantitative data — not just sports. So the company changed its name to
Automated Insights to bring its technology to a wide range of
industries, including finance, health care and, of course, journalism.
Today Wordsmith can only work with structured, quantitative data —
the sort of thing you find in well-formatted spreadsheets and
databases. Allen says there’s certainly potential for other companies to
create software that can go further in automating research or writing
by summarizing lengthy texts, rewriting press releases, or sifting
through unstructured documents for insights. But he doubts that
Automated Insights will stray from its roots in quantitative data in the near future.
Last month the company was acquired by private equity firm Vista
Equity Partners, which also owns the sports data company STATS and
business intelligence company TIBCO. By partnering with Vista’s other
companies, Allen says Automated Insights will have more than enough work
to keep them busy. “It’s kind of a no brainer for us,” he says. “We
have so much opportunity ahead of us in structured data, why take on a
space that people have struggled with for years?”
In the meantime, expect to see more stories written for a very particular audience: you, and you alone.
Correction 3/6/2015 2:10 PM: An earlier version of this piece
stated that Automated Insights is based in Raleigh, North Carolina. It’s
actually based in Durham, North Carolina.
This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation, including maps. And we did have a brief introduction to the leaflet package.
Co-worker. One of my co-workers asked about the project I’m testing in 2015. I spent 5 minutes explaining the project to her. After listening, she turned around to everyone nearby and got their attention and said this. I may not get a MacArthur for this work, but I’m planning on writing a paper about it at least, once the secret is out.
Praise is cool, but money and resources are cooler. The CEO promised to show me how to raise money and build my vision. I keep poking him to start the training.
“Now the key here that makes data science special and distinct from statistics is that this data product then gets incorporated back into the real world, and users interact with that product, and that generates more data, which creates a feedback loop.”
Rachel Schutt and Cathy O’Neil, Doing Data Science, 1st edition, O’Reilly, 2013, 406 p., Amazon.
Cause And Effect: The Revolutionary New Statistical Test That Can Tease Them Apart
One of the most commonly repeated maxims in science is that correlation is not causation. And there is no shortage of examples demonstrating why. One of the most famous is the case of hormone replacement therapy, which was studied by numerous epidemiologists at the end of the last century.
How to Get Top N Results for Each Attribute Combo: SQL
Let’s say you need the top three image results for a car make, model, brand grouping. Not SELECT TOP 3 images overall, but 3 image results for each unique car in your inventory table (where all the columns will be coming from for this example).
I was asked to do a similar task in Hive, a querying language used with Hadoop. Most of my code writing days have been spent with C-style code so intuitively I thought to make a solution that involved hashmaps where the key is the make-model and the value is a linked list of size 3 containing the image URLs. Although SQL is Turing complete (ergo there must be a way to write this solution), it’s not a solution that effectively uses the strengths of querying languages.
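For contrast, the hashmap-and-list approach I first reached for looks something like this in Python (the rows and column values are invented; a length-capped list stands in for the linked list of size 3):

```python
# C-style approach: one pass over the rows, keeping at most 3 image
# URLs per (make, model) key. Row values are invented for illustration.
rows = [
    ("Cadillac", "Escalade", "img1.jpg"),
    ("Cadillac", "Escalade", "img2.jpg"),
    ("Cadillac", "Escalade", "img3.jpg"),
    ("Cadillac", "Escalade", "img4.jpg"),  # dropped: already have 3
    ("Cadillac", "CTS", "img5.jpg"),
]

top_images = {}
for make, model, image in rows:
    bucket = top_images.setdefault((make, model), [])
    if len(bucket) < 3:          # keep only the first three per key
        bucket.append(image)
```

It works, but it is row-at-a-time thinking; the set-based query below does the same job in the database's own idiom.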
When learning new languages, computer or otherwise, I find it easier to put them in context of what they’re conveying. Just as French is not akin to Americans using different words to convey the exact same thing, SQL is not just a way to write C style code to solve different problems. SQL has strengths and when I was newer to writing queries, I found it was easier to start with a mindset and then get a solution rather than translating a solution to a mindset.
This solution takes advantage of the rank() function and collect_set:
CREATE TABLE carStaging AS
SELECT makeID, modelID, brand, image, rank
FROM (
    -- assuming rank() here is a custom Hive UDF that resets its counter
    -- each time its argument changes, hence the DISTRIBUTE BY / SORT BY
    -- below to bring each car's rows together in order
    SELECT makeID, modelID, brand, image,
           rank(concat(brand, '-', modelID)) AS rank
    FROM (
        SELECT makeID, modelID, brand, thumbnail AS image
        FROM inventory
        WHERE makeID = 'Cadillac'
        DISTRIBUTE BY brand
        SORT BY brand, modelID
    ) sorted
) ranked
WHERE rank <= 3;

CREATE TABLE carStaging2 AS
SELECT brand, makeID, modelID,
       collect_set(image) AS image
FROM carStaging
GROUP BY brand, makeID, modelID;

SELECT A.makeID, A.modelID, A.brand, B.image
FROM carStaging A
JOIN carStaging2 B
  ON A.makeID = B.makeID
 AND A.modelID = B.modelID
 AND A.brand = B.brand;
Until I ran the query, I had not found any documentation about collect_set with multiple grouping columns and non-integer values. It works! If you’re still stuck on C-style thinking, the GROUP BY statement in carStaging2 is where you make the keys, and the array returned by collect_set is the corresponding value.
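For what it’s worth, Hive (0.11+) and most modern engines now ship RANK() as a built-in window function, which collapses the staging-table dance into a single query. Here is a runnable sketch of the same top-3-per-group idea using Python’s sqlite3 with invented toy rows (SQLite ≥ 3.25 is needed for window functions):

```python
import sqlite3

# Toy inventory; columns mirror the Hive example above.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE inventory (makeID TEXT, modelID TEXT, brand TEXT, image TEXT)"
)
rows_in = [("Cadillac", "Escalade", "Cadillac", "esc%d.jpg" % i) for i in range(5)]
rows_in.append(("Cadillac", "CTS", "Cadillac", "cts1.jpg"))
con.executemany("INSERT INTO inventory VALUES (?, ?, ?, ?)", rows_in)

# RANK() OVER a per-car partition replaces the custom rank UDF plus the
# DISTRIBUTE BY / SORT BY dance; ORDER BY image just keeps results stable.
query = """
SELECT makeID, modelID, brand, image FROM (
    SELECT *, RANK() OVER (
        PARTITION BY brand, modelID ORDER BY image
    ) AS rk
    FROM inventory
) ranked
WHERE rk <= 3
"""
rows = con.execute(query).fetchall()
```

Each (brand, modelID) group keeps at most three rows, no staging tables required.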
Good luck with your data driven coding, fellow nerds!
I just started delving into SQL books and online tutorials, with a view to implementing some queries on a small database of information I’m currently working with. The database is in its initial phase, of simply taking measurements and producing text files of those measurements and associated variables such as date, time, position, angle, etc. I figured now is the perfect time to learn about database management and SQL as I build a catalog of results.
I’m finding SQL very accessible and easy to implement. I installed MySQL on my MacBook and have been running through examples such as those in this great tutorial: http://joshualande.com/data-science-sql/ I have managed to output a set of csv files and load them into tables with MySQL in an overarching database. Simple operations like combining tables and querying specifics have been very straightforward. I’m excited to see how powerful it will become as my database grows, and I anticipate building more complicated queries on the growing datasets.
Ecommerce Product Discovery & Recommendation: How you can be (like) @LazadaID & @bliblidotcom too!
The ecommerce industry is going absolutely crazy in Indonesia. I wouldn’t be surprised if someone told me a new one is created each day. With increased proliferation also comes increased competition. What should a new local ecommerce startup do to generate revenue in the face of heavy competition? They may not have money to burn on marketing like the big boys do, but they have full control over their websites, and they can try absolutely everything to improve customer experience. So why not learn from the big boys and craft a better experience than they currently offer? That’s what this post is about.
Statistically speaking, conversion rates for ecommerce are not that high in Indonesia. The industry average conversion for ecommerce in SEA is around 0.5%-1%. One of the reasons the figure is abysmally low is that Indonesians have their own shopping sessions prior to the actual shopping session. It’s a hobby called “window shopping”; folks in more developed countries call it “research”; and we, ecommerce business owners, call it “opportunity lost”.
All ecommerce businesses face the same challenge
But what separates great ecommerce websites from the rest is the fact that they have lots of tricks up their sleeves to mitigate the loss and, more often than not, turn “opportunity lost” into actual opportunities.
When visitors are browsing/window shopping, they don’t easily convert, although they may already know what kind of products they want. So what’s an ecommerce business manager to do? Well, one way to go about it is to give them a strong reason to hang around, under the assumption that an increase in browsing duration breeds trust and affinity. If we can make customers think that we are a reliable resource for finding out more about the goods they currently care about, and that we can satisfy their window shopping desire, they should stick around on our website and either convert (fingers crossed!) or put the item in their wishlist and come back another day to seal the deal.
Then the question becomes: “How exactly can we make them stick around?”. One of the tricks of the trade used by veteran ecommerce players is showcasing other products that are similar in one way or another to the one a customer is currently viewing.
At the end of the day, these showcases are there to help ecommerce persuade customers, at scale, to stay and hopefully pay (more). Think of it like a personal assistant that helps give you more vantage points prior to purchasing. It is a great feature to have on your ecommerce shop, but it can be tricky to setup.
The way they do it: Lazada Indonesia & Blibli
Note: I don’t work for either company so the following is my best interpretation as to how their engine actually works
The way Lazada ID does it is by storing information about which product each visitor is currently seeing, as well as all previously viewed products. For each product a visitor is currently seeing, check every view history that contains that product, and display the top X products that occur most often across those histories. For instance, suppose we have 3 visitors currently seeing the Xiaomi Redmi, and prior to that they each had their own view history:
Visitor A saw 3 products: iPhone 6, Samsung Galaxy A, LG G3
Visitor B saw 2 products: Asus Zenfone, iPhone 6
Visitor C saw 5 products: Samsung Galaxy A, Asus Zenfone, iPhone 6, Samsung Note 4, Sony Experia Z3
What each of these visitors would see in their recommendations is: iPhone 6 (3 occurrences), Samsung Galaxy A (2 occurrences), Asus Zenfone (2 occurrences), and Samsung Note 4 (1 occurrence, alphabetical precedence over Z3)
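My reading of that counting logic, sketched in Python using the view histories from the example above. Note one assumption: I break count ties alphabetically across all tied items, so the ordering of the 1-occurrence items can differ slightly from the listing in the text.

```python
from collections import Counter

# Previous view histories of visitors currently on the Xiaomi Redmi page.
histories = {
    "A": ["iPhone 6", "Samsung Galaxy A", "LG G3"],
    "B": ["Asus Zenfone", "iPhone 6"],
    "C": ["Samsung Galaxy A", "Asus Zenfone", "iPhone 6",
          "Samsung Note 4", "Sony Experia Z3"],
}

# Count how often each product occurs across all histories, then take
# the top X, breaking count ties alphabetically.
counts = Counter(p for history in histories.values() for p in history)
recommended = sorted(counts, key=lambda p: (-counts[p], p))[:4]
```

Lazada’s real tie-breaking and weighting rules are not public; this only captures the “most co-viewed first” idea.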
In contrast, blibli took a different approach to show related products. Take a look at this example.
Their recommendation is based on products that are of the same kind to the one currently being viewed. Typically this is as simple as filtering products based on category (smartphone), then sorting by some criteria (import date, prices, custom tags) and display the top 5 items.
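That filter-then-sort logic is simple enough to sketch directly. The catalog rows below are invented, and I pick import date (newest first) as the sort criterion, which is just one of the options mentioned:

```python
# Toy catalog; rows and dates are invented for illustration.
catalog = [
    {"name": "Phone A", "category": "smartphone", "imported": "2015-03-01"},
    {"name": "Laptop B", "category": "laptop",    "imported": "2015-03-02"},
    {"name": "Phone C", "category": "smartphone", "imported": "2015-02-20"},
    {"name": "Phone D", "category": "smartphone", "imported": "2015-03-05"},
]

def related(category, top_n=5):
    """Filter by category, sort by import date (newest first), take top N."""
    same_kind = [p for p in catalog if p["category"] == category]
    same_kind.sort(key=lambda p: p["imported"], reverse=True)
    return [p["name"] for p in same_kind[:top_n]]

picks = related("smartphone")
```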
Each approach has its own pros and cons.

Lazada’s behavior-based approach:
Pro: much better relevancy, because the engine generates results from actual visitor behavior. It also has the huge benefit of not being biased towards any particular attribute of a product (price, color, size, etc.) because it crowdsources the result.
Con: depending on infrastructure, it can be hard to serve quickly in production.
Con: it needs another system to keep track of visitor behavior at scale, and for startups that are not as well-funded, this extra load may not be worth the potential conversion lift.

Blibli’s attribute-based approach:
Pro: easy on production; it can be set up from the backend using existing attributes or some tagging system.
Con: the tagging system may need to be updated manually.
Con: apple-to-apple relevancy can be bad (as we’ve seen above) because it is based on human input, not on actual behavior.
What if there’s another way to create a recommendation system that is easy, relevant, requires only minimal production overhead, and can be set up from day 1?
It’s not magic; I’ll show you how.
Note: this is another side project that I recently conducted in my journey to study Data Science.
First things first, these are a few results from the engine that I built:
Recommendation #1 - Nexian Journey
Comparison #1 - Recommendation by Lazada for Nexian Journey
Recommendation #2 - Microsoft Lumia 535
Comparison #2 - Recommendation by Lazada for Microsoft Lumia 535
Recommendation #3 - Samsung Galaxy Alpha. Love Samsung specs but want to try a new brand? Check these out!
Comparison #3 - Recommendation by Lazada for Samsung Galaxy Alpha
Time to lift the curtain
I used a machine learning algorithm called k-Nearest Neighbors (kNN) to achieve the above results. The algorithm determines a class for an unobserved instance based on a majority vote among its k nearest neighbors. All instances are neighbors, but some are close and others are far away. So if the majority of the 5 instances closest to you belong to class A, then you are most likely of the same class.
kNN is typically used as a classifying mechanism. I stopped the algorithm short of classifying and instead used it to reveal which items are my closest neighbors.
What I did:
I first scraped the smartphone data from Lazada.co.id. At the time of this project, there were 5196 unique smartphones based on URLs.
I used the kNN algorithm on the smartphone specifications.
kNN suffers from a famous problem in machine learning called the “curse of dimensionality”. I personally experienced it while doing this project; my computer almost ran out of memory! To pare down the processing load without sacrificing accuracy, I kept only the critical factors that help persuade a potential smartphone buyer to convert (remember: we do this in an attempt to increase conversions). See the appendix for the top 10 list. Specifically, I used only the following features: price, operating system, screen size, weight, camera MP, and colors.
Some data cleaning to keep variable values as sensible as possible. The big ones were:
Canonicalizing colors and operating system
Removed all instances with NA’s
Some data exploration to find all the rough edges. I ended up deciding to just leave them be
Since products are discrete entities (there is no “1/2 iPhone, 1/2 Samsung Galaxy A” smartphone), I computed the similarity measure using Manhattan distance.
Found the 5 to 10 nearest neighbors for all the examples above.
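The core of those steps — Manhattan distance over a feature matrix, stopping short of the classification vote — fits in a few lines. The toy rows below (price, screen size in inches, camera MP) are invented; in the real project the features were scraped from Lazada, and in practice you would also scale the features before computing distances:

```python
# Stop kNN short of classifying: just return the k nearest rows.
# Feature rows (price, screen size, camera MP) are invented toy data.

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbors(query, items, k):
    """Return the names of the k items closest to `query`."""
    ranked = sorted(items, key=lambda it: manhattan(query, it["features"]))
    return [it["name"] for it in ranked[:k]]

phones = [
    {"name": "Phone A", "features": [2.0, 5.0, 13]},
    {"name": "Phone B", "features": [1.9, 5.0, 13]},
    {"name": "Phone C", "features": [6.5, 5.5, 20]},
    {"name": "Phone D", "features": [2.1, 4.7, 8]},
]

viewed = [2.0, 5.0, 13]          # specs of the product being viewed
similar = nearest_neighbors(viewed, phones, k=2)
```

Precompute `similar` for every SKU and store it in an index, and you have the serving side of the system described below.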
Can be set up from day 1: do you have multiple products? Do you have detailed descriptions entered in your management system for each of your products? If you answer yes to both on the first day of operation, then slap this baby on and you’re in for some fancy recommendations.
Accurate: unlike blibli’s, this system’s accuracy piggybacks on the fact that the employees who upload the SKUs must get the product specs right. If a product is not accurately described, customers won’t be pleased anyway.
Minimal overhead: the nearest neighbors can be stored in an index, which can then be used to quickly serve results to visitors. The index only needs to be updated every once in a while, or when a significant number of new SKUs comes in.
Easy to implement: assuming you have the data well-formatted and ready to go, the actual production code is less than 10 lines of R code.
As a bonus, this approach, unlike Lazada’s, is memoryless, which means the engine automatically adapts to newly imported SKUs. New products can appear in the recommendations every now and then if they’re similar enough to the current product.
This technique only works for ecommerce whose products are well-documented. But hey, documenting your products well should not be optional in the first place.
This system is only meant to kickstart your recommendation automation as it does not account for user behaviour. Use this as an entry point for a more sophisticated recommender system based on supervised machine learning (”people who bought this also see…” or “people who like this also like…”).
Because kNN uses a pre-defined cutoff value (the ‘k’), there could be instances whose neighbors are all very distant, i.e. no one wants to be their neighbor (poor guy..), in which case there would be no similar products to recommend. Such cases could be handled through another mechanism. A quick fix would be to increase k while sacrificing some relevancy.
By implementing this, ecommerce websites can leverage their existing SKUs from the get go. But by all means, this is not the cure to a fundamental business problem. If people won’t buy from you now, this system cannot guarantee anything better. What this system is really good at is improving your user experience which leads to higher engagement which should lead to higher chance to convert visitors.
Another exciting aspect of this engine is remarketing. If a person came to see a product but did not convert, maybe we can nudge them with retargeting banners featuring similar products. This approach can be executed through email as well, where the seen product takes the spotlight and other similar products are served as secondary options.
As usual, the script has been pushed to my github. But there is serious cleaning up needed (brutally messy). Please don’t download it yet or I’ll be severely embarrassed :)
Feel free to use the script. I sincerely hope it can help a lot of emerging local startups to increase their conversion rate. Feel established already? I don’t discriminate; Go right ahead (Zalora, I’m looking at you…)
I worked with the Lazada website (specifically, the smartphone section) because, unlike blibli or tokopedia, they have a great product spec section that is quite ready to be processed. (Maybe they’re the next challenge!)
I only used observations without any missing values. No imputation was done.
There are a few bogus values that I did not fix, which may make a few recommendation results go wonky (e.g. weight should be in kg but some entries use grams)
Looking at Today's Child Poverty Rates: Are We Using the Right Measure?
The official poverty measure was developed in 1963 by a former U.S. Social Security Administration employee who calculated a family’s needs by taking the costs of groceries, which consumed a higher share of a family’s budget back then, and multiplying it by three. At the time, food represented one-third of the expenses for many families.
Fortunately, we have a tool that does a much better job of measuring poverty based on today’s household economics. The Supplemental Poverty Measure, or SPM, is based on a modern family budget and is adjusted to account for variations in costs across the country. It more accurately accounts for changes over time in expenses such as food, health care, housing, transportation and child care.
And more importantly, it enables everyone who is concerned about reducing poverty to understand the effectiveness of programs such as the Earned Income Tax Credit and the Supplemental Nutrition Assistance Program, which help provide household resources and food for millions of children. By using the Supplemental Poverty Measure, we see that without these federal and state supports for families, the child poverty rate would have been almost double, rising as high as 33 percent, rather than the still-unacceptable 18 percent, according to the most recent data.
Despite the important information provided by the Supplemental Poverty Measure, the official poverty measure continues to be useful because it is a yardstick that is embedded in numerous government programs and enables us to track trends consistently from year to year, decade to decade. While this measure is fine for those purposes, it fails to give local, state and federal policymakers the best possible data to drive decisions, and it doesn’t give us a true sense of how government programs can make a difference in addressing poverty. The Supplemental Poverty Measure is a much more effective, accurate measure for these purposes.