Data-Mining

Wearable AI system can detect a conversation's tone

It’s a fact of nature that a single conversation can be interpreted in very different ways. For people with anxiety or conditions such as Asperger’s, this can make social situations extremely stressful. But what if there were a more objective way to measure and understand our interactions?

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Institute of Medical Engineering and Science (IMES) say that they’ve gotten closer to a potential solution: an artificially intelligent, wearable system that can predict if a conversation is happy, sad, or neutral based on a person’s speech patterns and vitals.

Nothing is more hypocritical than the “Popplio Protection Squad” members being assholes to people about Litten’s final evolution. They are saying things like, “this is what you get for hating Popplio.” It pretty much confirms that protection squads aren’t really about stopping hate toward something (it’s a fictional creature, so its feelings can’t get hurt); they’re about being assholes to people for having a different opinion. Though the joke’s on you, at least Litten doesn’t have a lazy as hell shiny. (Note 1: I do not hate Popplio. I think it’s cute, but you have to admit that its shiny is really lazy. It pretty much has the objectively worst shiny) (Note 2: I’m actually a Rowlet fan, so I’m a bit of a neutral party. I just don’t like assholes and hypocrites. Like or dislike any Pokémon you want. Just don’t be a jerk about it)

Today on Twitter, The Guardian asked, “Can you design a better Google logo?” Well, here is my attempt. See what I did there? 😊 For the record, I like Google and I use all the things. I don’t want to make them angry; they probably know more about me than I do.

Just an FYI: Twitter is LYING and telling users their accounts “appear” to exhibit “automated behavior.” They are locking accounts and forcing you to enter a telephone number so they can send you a code to enter in order to unlock your account. This is but one way social media platforms and email providers try to attach phone numbers, and thus identifying information, to your account. I gave them a fake number, got the code, and entered it. Data mining won’t work here, assholes. Twitter will hear about this from me.

As an engineer in security data mining, I find that the data mining cartoon hits home. I promised myself I will only work for companies that collect data to help protect us and improve our lives. I don’t believe Facebook and Google are on the same trajectory. Theirs is to understand our behavior so they can monetize us. This is not a new concept: if you use a credit card, your purchasing habits have already been data mined. Google and FB have added several new dimensions to this.

As a Palestinian who witnessed U.S. intervention in the region, I find that those cartoons hit home, and they frustrate me. Saddam was put into power by the CIA; when they changed their minds about him, they created a power vacuum that destabilized the region. Israel is another M.E. destabilizer.

Anyway, you take away what you want from this.

Introducing SAMOA, an open source platform for mining big data streams.

https://github.com/yahoo/samoa

Machine learning and data mining are well-established techniques in the world of IT, especially among web companies and startups. Spam detection, personalization, and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety: together, the 3 Vs of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to analyze data as soon as it arrives in the system, in near real time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously, and the model starts becoming outdated the moment it is deployed: all the new data sits without creating any value until the next model update. By contrast, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast-changing data.
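To make the contrast with periodic retraining concrete, here is a minimal sketch of a model that absorbs each spam report as it arrives. It is plain Java and not part of SAMOA; the class name OnlineSpamModel, the two toy features, and the learning rate are assumptions made only for illustration. It uses online logistic regression with one gradient step per example.

```java
/** Online logistic regression: one gradient step per arriving example (illustrative only). */
final class OnlineSpamModel {
    private final double[] weights;
    private final double learningRate;

    OnlineSpamModel(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the message is spam. */
    double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Incorporates a single labeled report (label 1 = spam, 0 = not spam). */
    void update(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        OnlineSpamModel model = new OnlineSpamModel(2, 0.5);
        double[] spammy = {1.0, 0.0}; // hypothetical feature: message contains a known spam phrase
        double[] benign = {0.0, 1.0}; // hypothetical feature: sender is in the address book
        for (int i = 0; i < 100; i++) {
            model.update(spammy, 1); // each report updates the model immediately
            model.update(benign, 0);
        }
        System.out.printf("P(spam | spammy) = %.3f%n", model.predict(spammy));
        System.out.printf("P(spam | benign) = %.3f%n", model.predict(benign));
    }
}
```

The point of the sketch is that there is no separate retraining job: the model is usable after every update, so new reports create value as soon as they arrive.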

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, there is currently no common solution for mining big data streams, that is, for doing machine learning on streams in a distributed environment.

Enter SAMOA

SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. Like most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.


SAMOA is both a platform and a library. As a platform, it allows the algorithm developer to abstract from the underlying execution engine, and therefore to reuse the same code on different engines. It also makes it easy to write plug-in modules that port SAMOA to different execution engines.

As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release allows classification and clustering.

For classification, we implemented a Vertical Hoeffding Tree (VHT), a distributed streaming version of decision trees tailored for sparse data (e.g., text). For clustering, we included a distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging.
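For readers unfamiliar with Hoeffding trees, the following sketch shows the statistical test that this family of streaming decision trees is generally built on: a leaf is split once the observed gain of the best attribute exceeds that of the runner-up by more than the Hoeffding bound. This is the textbook form of the test, not code from SAMOA; whether VHT uses exactly these parameters is an assumption, and the class and method names are hypothetical.

```java
/** Textbook Hoeffding bound used by streaming decision trees (illustrative, not SAMOA code). */
final class HoeffdingBound {

    /**
     * With probability 1 - delta, the true mean of a variable with the given range
     * lies within epsilon of the mean observed over n samples.
     */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split when the best attribute beats the second best by more than epsilon. */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;  // range of information gain for a two-class problem (log2 of 2)
        double delta = 1e-7; // allowed probability of choosing the wrong attribute
        for (long n : new long[] {200, 2_000, 20_000}) {
            System.out.printf("n=%d epsilon=%.4f split=%b%n",
                    n, epsilon(range, delta, n), shouldSplit(0.30, 0.25, range, delta, n));
        }
    }
}
```

Because epsilon shrinks as more instances reach a leaf, the tree can commit to a split after seeing only part of the stream, which is what makes the approach suitable for unbounded data.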


HOW DOES IT WORK?

An algorithm in SAMOA is represented by a series of nodes communicating via messages along streams that connect pairs of nodes (a graph). Borrowing the terminology from Storm, this is called a Topology. Each node in the Topology is a Processor that sends messages to a Stream. The user code that implements the algorithm resides inside a Processor. Figure 3 shows an example of a Processor joining two streams from two source Processors. Here is a code snippet to build such a topology in SAMOA.

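// Assumes builder has been initialized elsewhere (e.g., by the enclosing task).
TopologyBuilder builder;

// First source Processor and its output Stream.
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);

// Second source Processor and its output Stream.
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);

// The join Processor reads streamOne with shuffle grouping and streamTwo with key grouping.
Processor join = new JoinProcessor();
builder.addProcessor(join).connectInputShuffle(streamOne).connectInputKey(streamTwo);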

[Figure 3: a Processor joining two streams from two source Processors]
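The snippet above connects one input with a shuffle grouping and the other with a key grouping. The sketch below illustrates the difference in plain Java, assuming these groupings behave like their Storm counterparts: shuffle spreads messages evenly across the parallel instances of a Processor, while key routing sends all messages with the same key to the same instance. Nothing here is SAMOA API; GroupingSketch, ShuffleRouter, and routeByKey are hypothetical names, and round-robin stands in for whatever balancing strategy the engine actually uses.

```java
/** Toy illustration of shuffle versus key routing (not SAMOA or Storm code). */
final class GroupingSketch {

    /** Shuffle-style routing: spread messages evenly, ignoring their content. */
    static final class ShuffleRouter {
        private final int parallelism;
        private int next = 0;

        ShuffleRouter(int parallelism) {
            this.parallelism = parallelism;
        }

        /** Returns the index of the instance that receives the next message. */
        int route() {
            int target = next;
            next = (next + 1) % parallelism;
            return target;
        }
    }

    /** Key-style routing: equal keys always map to the same instance. */
    static int routeByKey(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        ShuffleRouter shuffle = new ShuffleRouter(3);
        for (String key : new String[] {"a", "b", "a", "c", "a"}) {
            System.out.printf("key=%s shuffle->%d byKey->%d%n",
                    key, shuffle.route(), routeByKey(key, 3));
        }
    }
}
```

Key routing matters for a join or for any per-key state: if messages for the same key could land on different instances, no single instance would see everything it needs to combine.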


SWEET! HOW DO I GET STARTED?

1. Download and build SAMOA.

git clone git@github.com:yahoo/samoa.git
cd samoa
mvn -Pstorm package

2. Download the Forest CoverType dataset.

wget "http://downloads.sourceforge.net/project/moa-datastream/Datasets/Classification/covtypeNorm.arff.zip"
unzip covtypeNorm.arff.zip

Forest CoverType contains the forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and it has been used in several papers on data stream classification.

3. Download a simple logging library.

wget "http://repo1.maven.org/maven2/org/slf4j/slf4j-simple/1.7.2/slf4j-simple-1.7.2.jar"

4. Run an example: classify the CoverType dataset with the VerticalHoeffdingTree in local mode.

java -cp slf4j-simple-1.7.2.jar:target/SAMOA-Storm-0.0.1.jar com.yahoo.labs.samoa.DoTask "PrequentialEvaluation -l classifiers.trees.VerticalHoeffdingTree -s (ArffFileStream -f covtypeNorm.arff) -f 100000"

The output will be a sequence of accuracy measurements, taken every 100,000 instances.
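The task name in the command refers to prequential evaluation, also known as interleaved test-then-train: each instance is first used to test the current model and only then to train it, so accuracy is always measured on data the model has not yet seen. The sketch below shows the idea in plain Java. It is not SAMOA's implementation; the toy majority-class learner and the names PrequentialSketch and evaluate are assumptions for illustration, with reportEvery playing the role of the -f option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Test-then-train evaluation over a stream of labels (illustrative only). */
final class PrequentialSketch {

    /** Toy learner that always predicts the most frequent label seen so far. */
    static final class MajorityClassLearner {
        private final Map<Integer, Integer> counts = new HashMap<>();
        private int majority = 0; // default prediction before any training

        int predict() {
            return majority;
        }

        void train(int label) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > counts.getOrDefault(majority, 0)) {
                majority = label;
            }
        }
    }

    /** Returns the cumulative accuracy, sampled every reportEvery instances. */
    static List<Double> evaluate(int[] labels, int reportEvery) {
        MajorityClassLearner learner = new MajorityClassLearner();
        List<Double> reports = new ArrayList<>();
        long correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (learner.predict() == labels[i]) {
                correct++;               // test on the instance first...
            }
            learner.train(labels[i]);    // ...then use it for training
            if ((i + 1) % reportEvery == 0) {
                reports.add((double) correct / (i + 1));
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        int[] stream = {1, 1, 0, 1, 1, 1, 0, 1};
        System.out.println(evaluate(stream, 4)); // prints [0.5, 0.625]
    }
}
```

No separate held-out test set is needed, which suits streams where the data never stops arriving.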

To run the example on Storm, please refer to the instructions on the wiki.


I WANT TO KNOW MORE!

For more information about SAMOA, see the README and the wiki on GitHub, or post a question on the mailing list.

SAMOA is licensed under the Apache Software License v2.0. You are welcome to contribute to the project! SAMOA accepts contributions under an Apache-style contributor license agreement.

Good luck! We hope you find SAMOA useful. We will continue developing the framework by adding new algorithms and platforms.

Gianmarco De Francisci Morales (gdfm@yahoo-inc.com) and
Albert Bifet (abifet@yahoo.com) @ Yahoo Labs Barcelona