clojure

My friends made regra.ph, a cool app that shows you the path a Tumblr post takes as it gets reblogged. It’s really cool to actually SEE how the community interacts and intersects. Right now it only works for posts that haven’t had a deleted reblog in the chain. That’s a known issue that’s being fixed, but the fix won’t go out until after voting for Clojure Cup is finished.

Like I said, this was built for Clojure Cup, so if you like it, toss a vote our way. And of course, pass it on to your friends.
:) Thanks!

{:Clojure_Fn_of_the_day juxt}

Takes a set of functions and returns a fn that is the juxtaposition of those fns. The returned fn takes a variable number of args, and returns a vector containing the result of applying each fn to the args (left-to-right).

Binary relation:
((juxt a b c) x) => [(a x) (b x) (c x)]
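
A small sketch of what that looks like at the REPL. The `people` data and the particular functions chosen here are made up for illustration.

```clojure
;; Apply three functions to the same argument.
((juxt inc dec str) 5)
;; => [6 4 "5"]

;; Keywords are functions too, so juxt can pull several fields out of a map.
((juxt :first :last) {:first "Ada" :last "Lovelace"})
;; => ["Ada" "Lovelace"]

;; A common use: sort by several keys at once (last name, then first name).
(def people [{:first "Grace" :last "Hopper"}
             {:first "Ada"   :last "Lovelace"}
             {:first "Alan"  :last "Hopper"}])

(map :first (sort-by (juxt :last :first) people))
;; => ("Alan" "Grace" "Ada")
```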

Mind Blown at how useful that could be!

It’ll surely make an appearance or two in my dissertation.

Programming Languages: In layman's terms, what are the major programming languages, and what are they used for?

(reposted from this Quora answer because it’s just great)

Programmers have a surprisingly intimate relationship with the programming languages they use. Your programming language will frustrate you, and enlighten you. Over time you will learn your programming language’s inner workings and little quirks. It will get inside your head, too, and change the way your mind works. 

Choose the right programming language and together you will create something new and beautiful. Choose wrongly and things can get very messy indeed.

In other words, choosing a programming language is much like choosing a romantic partner…

(Note: I’m a straight guy. If you’re not, feel free to do a mental find/replace with whatever you’re into).

PHP is your teenage sweetheart, the girl you first awkwardly fumbled around with that one summer. Just don’t try and start a more serious relationship - this girl has serious issues.

Perl is PHP’s older sister. She might be a bit old for you, but she was pretty popular back in the 90s. In a long-term relationship with Larry Wall, so her standards have dropped, and she’s looking seriously fugly now. “I don’t care what y’all say, I still love her!”, he says. No-one else does.

Ruby is the cool kid of the scripting family. When you first saw her, she took your breath away with her beauty. She was fun, too. At the time she seemed a bit slow and ditzy - though she’s matured a lot in the last few years.

Python is Ruby’s more sensible sister. She’s elegant, classy, and sophisticated. She’s perhaps too perfect. Most guys are like “dude, how can you not like Python!?”. Sure, you like Python. You just consider her the boring version of the edgy and romantic Ruby.

Java is a successful career woman. Some people who’ve worked with her feel she owes her position less to ability and more to her knack for impressing the middle-management types. You might feel that she’s the sensible type you should settle down with. Just prepare for years of “NO THAT DOESNT GO THERE GOD YOU ALWAYS USE THE WRONG TYPE INTERFACE AND YOU MISSED A SEMICOLON” nagging.

C++ is Java’s cousin. Similar to Java in many ways, the main difference being she grew up in a more innocent time and doesn’t believe in using protection. By “protection”, I mean automatic memory management, of course. What did you think I meant?

C is C++’s mom. Mention her name to some old greybeard hackers and they’re sure to reminisce with a twinkle in their eye.

Objective C is another member of the C family. She joined that weird church a while back, and won’t date anyone outside of it.

Haskell, Clojure, Scheme and their friends are those hipster, artsy, intellectual girls you probably spent a blissful college summer with a few years ago. The first girls who really challenged you. Of course, it could never have become something more serious (you tell yourself). Though you’ll always be left asking “what if?”

You might be put off C# due to her family’s reputation. But they’ve gone legit, the last few years, they tell you. Once you’re one of us, you’re one of us, you hear? You need a database? Her brother MSSQL will hook you up. Need a place to stay? Heck, her daddy will even buy you your own mansion on Azure avenue. What’s that, you’re having second thoughts about all these overly friendly relatives? No, you can never leave. You’re part of the family, now, ya hear?

Javascript - hey, wasn’t that the girl you first kissed, way before even PHP came on the scene? I wonder what she’s doing now. I hear her career’s really taken off in the last few years. Would be cool to catch up some time, if only for old time’s sake… (You catch sight of her, dressed head to toe in designer jQuery)… wow, somebody grew into a beautiful swan…

Hickey on Values, Identity and State

Rich Hickey delivers a must-watch presentation.

I have a hard time watching presentations on my computer, but this one is very worth it.

In an approachable and high-level manner, Hickey effectively attacks variables as insufficient abstractions, making a compelling argument and offering effective replacements.
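
To make the distinction a little more concrete, here is a minimal sketch using a Clojure atom as one possible replacement for a plain variable. It is not taken from the talk; the `visitor` data is made up for illustration.

```clojure
;; A value: an immutable map describing one moment in time.
(def v1 {:name "Ada" :visits 1})

;; An identity: a reference that points at a succession of values.
(def visitor (atom v1))

;; swap! computes a new value from the old one and moves the reference to it.
(swap! visitor update-in [:visits] inc)

@visitor
;; => {:name "Ada", :visits 2}

;; The old value is untouched; anyone still holding it sees what they saw before.
v1
;; => {:name "Ada", :visits 1}
```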

Though I’m a Lisp fan, I haven’t been too interested in Clojure since I viewed it as just a Lisp-on-the-JVM. I now understand Hickey and friends are doing tremendous work.

(link via Dave Dribin)

OpenTSDB proxy

We use OpenTSDB to store the majority of our time series server and application statistics here at Tumblr. We recently began a project to migrate OpenTSDB from an existing HBase cluster running an older version of HBase to a new cluster with newer hardware and running the latest stable version of HBase.

We wanted a way to have some historical data in the new cluster before we switched to it. Within Tumblr we have a variety of applications generating these metrics and it was not very practical for us to change all of them to double write this data. Instead, we chose to replace the standard OpenTSDB listeners with a proxy that would do this double writing for us. While we could have used HBase copy table or written our own tool to backfill historical data from the old cluster, double writing for an initial period allowed us to avoid adding additional load on our existing cluster. This strategy also allowed us to move queries for recent data to the new cluster earlier than the full cutover.

The tsd_proxy is written in Clojure and relies heavily on Lamina and Aleph, which in turn build on top of Netty. We have been using it in our production infrastructure for over two months now, sustaining writes at or above 175k/s (across the cluster), and it has been working well for us. We are open sourcing this proxy in the hope that others might find a use for it as well.

The tsd_proxy listens on a configurable port and can forward the incoming data stream to multiple end points. It can filter the incoming stream, rejecting data points that don’t match a (configurable) set of regular expressions, and it can queue the incoming stream and re-attempt delivery if one of the end points is down. The queue size can be limited so you don’t blow through your heap. The README has more information on how to set this up.
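
As a rough illustration of the filter, fan-out, and bounded-queue ideas described above, here is a minimal sketch in plain Clojure. It is not the tsd_proxy code and does not use Lamina or Aleph; the names `allow-patterns`, `make-endpoint`, and `forward!`, the patterns themselves, and the queue sizes are all hypothetical.

```clojure
(import '(java.util.concurrent LinkedBlockingQueue))

;; Hypothetical allow-list; a real deployment would read this from config.
(def allow-patterns [#"^put sys\." #"^put app\."])

(defn allowed?
  "True when the line matches at least one allow-list pattern."
  [line]
  (boolean (some #(re-find % line) allow-patterns)))

(defn make-endpoint
  "An end point is a name plus a bounded queue, so a slow or dead
  end point cannot grow memory without limit."
  [name capacity]
  {:name name :queue (LinkedBlockingQueue. (int capacity))})

(defn forward!
  "Offers line to every end point's queue. Returns :rejected when the
  line fails the filter, otherwise a vector naming the end points whose
  queue was full (and which therefore dropped the line)."
  [endpoints line]
  (if (allowed? line)
    (vec (for [{:keys [name queue]} endpoints
               :when (not (.offer ^LinkedBlockingQueue queue line))]
           name))
    :rejected))

;; Two end points, mirroring the old-cluster/new-cluster double write.
(def endpoints [(make-endpoint "old-cluster" 1)
                (make-endpoint "new-cluster" 2)])
```

A separate worker per end point would drain its queue and retry on failure; that part is left out here to keep the sketch short.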

getting started with clojure

I’m about to try to teach a bunch of people (primarily Python devs running OS X) how to use Clojure, and I’m not satisfied with any of the currently existing documentation on how to get up and running from scratch. When I was going through all this myself a few months back, there was a weird period of a good few weeks when I had basically no mental map of the Clojure ecosystem and had no idea how to assemble one.

My goal for this post is to create the resource I wish I had six months ago. I’ll assume that you’re running on OS X and have a non-zero amount of programming experience.

The Clojure Book

Your first step should be to buy and begin reading Clojure Programming. There’s another book called (confusingly enough) “Programming Clojure”, and I can’t vouch for whether it’s better or worse, but I used “Clojure Programming” and liked it very much, so it’s what I recommend. It’s written by people whose names you’re going to get used to seeing everywhere as you explore the Clojure ecosystem; all the main figures in the Clojure community seem to be inhumanly prolific.

Let’s Get Started

Now, let’s start getting your environment assembled. Get Homebrew - “the missing package manager for OS X” - if you don’t have it already, and then run

brew install leiningen

Congratulations, now you have Leiningen! (Make sure you ended up with version 2.0 or greater - you can check that by running lein --version.)

So what the hell is Leiningen?

Leiningen’s the main tool you’ll be using for:

  • starting up a REPL
  • downloading+installing libraries
  • running your programs
  • starting a server to run the webapps you’ve written

Go ahead and run `lein repl`. You’ve now got a working Clojure REPL! In addition, if you run that command from the top-level directory of one of your Clojure projects, it’ll deal with wiring up classpaths and whatnot so that you’ll be able to import and play around with your project’s code and the libraries that it depends on. We’ll get to that later. Right now, let’s create a skeleton project for us to play around with by running

lein new foo

When that’s done, cd into foo and you’ll see that it’s already got some files and directories:

[jrheard@jrheard-air:~/dev/foo] $ ll
total 16
-rw-r--r--  1 jrheard  staff   193B Jan  5 15:17 README.md
-rw-r--r--  1 jrheard  staff   263B Jan  5 15:17 project.clj
drwxr-xr-x  3 jrheard  staff   102B Jan  5 15:17 src
drwxr-xr-x  3 jrheard  staff   102B Jan  5 15:17 test

Whenever you write a Clojure library/program/anything, your source code will live in the “src” directory and your tests will live in the “test” directory. Straightforward enough. Let’s take a look around in src:

[jrheard@jrheard-air:~/dev/foo] $ cat src/foo/core.clj
(ns foo.core)

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

Looks like Leiningen’s already created a file called “src/foo/core.clj”. It’s a Clojure program that defines a namespace called “foo.core” and then declares that that namespace contains a function called “foo”. Let’s check it out. Start up a repl with `lein repl` and poke around. Remember when I mentioned earlier that leiningen takes care of setting up your classpath and associated goop such that you’re able to access your project’s code from the REPL? Check this out:

user=> (use 'foo.core)
nil
user=> foo
#<core$foo foo.core$foo@6ad591a6>
user=> (foo "jrheard")
jrheard Hello, World!
nil

Awesome - we were able to import our code and run it. The `use` function basically serves the same purpose as `from foo.core import *` would in Python, and its use in source code is discouraged for the same reasons that `import *` is. Like `import *`, though, it’s pretty useful to have when you’re poking around in the REPL.
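
For comparison, here is a small sketch of the alternatives usually preferred in source code: `require` with an alias, or with an explicit list of names. It uses `clojure.string`, which ships with Clojure, so it runs in any REPL; the alias `string` is just a choice made for this example.

```clojure
;; Roughly `import clojure.string as string` in Python terms.
(require '[clojure.string :as string])

(string/upper-case "hello")
;; => "HELLO"

;; Roughly `from clojure.string import join`: only the named vars are pulled in.
(require '[clojure.string :refer [join]])

(join ", " ["a" "b" "c"])
;; => "a, b, c"
```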

So that’s cool - we’ve created a project, it’s got code in it, we’ve found out how to start up a working REPL that can play around with that code. Bullet point 1: accomplished. Let’s take a look at the second bullet point:

Downloading and installing libraries

You’re probably used to getting your libraries by running something from the command-line, e.g. `pip install this_great_library_i_found`, which would download the specified library and install it either globally or within your current virtualenv. Things work a little bit differently in Clojure.

First, you’ve got to find a library that looks useful. The Clojure Toolbox is a fantastic tool for this, and is the best such resource I’ve found. Let’s choose a library to play around with: making HTTP requests is fun - let’s go down to the “HTTP Clients” section and see what our options are. Looks like we’ve got to pick between clj-http and http.async.client - but how do we choose?

Currently, my favorite way of deciding between competing libraries is: pull up their respective github repos, compare the number of stars+forks, and give bonus points to any libraries that have commits from within the past month or two. Not exactly scientific, but it’s served me well so far as a good proxy for the strength of the library’s community/influence/adoption. As of this writing, clj-http has 242 stars to http.async.client’s 127, so let’s pick clj-http.

So… how do we get it?

Let’s go to clj-http’s github repo. Check out how the README’s installation section has this block of code:

[clj-http "0.6.3"]

That’s the information we need - it’s a Clojure vector with two items, the first of which is the name of the library, and the second of which identifies the most up-to-date stable version available. We’re going to add this to our project.clj, which you saw earlier when we looked at the contents of the ‘foo’ directory. Open up project.clj, it’ll look like this:

(defproject foo "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.4.0"]]) 

Note the :dependencies section - it’s a Clojure vector containing one item, and that item is itself a Clojure vector containing two items. This vector indicates to Leiningen that we want our project to run on version 1.4.0 of Clojure. Fair enough - now let’s add the clj-http vector we saw earlier. Our project.clj should now look like this:

(defproject foo "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [clj-http "0.6.3"]])

And that’s it! We’ve now specified to Leiningen that we want the clj-http library, and which version we need. Let’s try it out - start a REPL with `lein repl`, and let’s play around with our fancy new library. Notice that Leiningen will first download clj-http before starting up the REPL - that’s because it first runs `lein deps` behind the scenes any time you ask it to do basically anything, and that causes it to scan your project.clj and make sure that it’s already fetched all the dependencies you’ve asked it to.

Okay, back to our REPL session. Looks like clj-http’s github repo’s README suggests that you require it in the REPL by running

(require '[clj-http.client :as client])

So let’s do that - it’s the same thing as `from clj.http import client` in Python (as opposed to `from clj.http.client import *`, which is again what the `use` function does.)

user=> (require '[clj-http.client :as client])
nil
user=> (client/get "http://www.yelp.com")
;; a big huge blob of data pops out!

Okay, wow, looks like that worked! That’s sort of hard to read - you’ll notice that the big huge blob of data ends with a “}”, which is a hint that it might be a Clojure map. Let’s try poking at it:

user=> (def resp (client/get "http://www.yelp.com"))
#'user/resp
user=> (type resp)
clojure.lang.PersistentArrayMap
user=> (keys resp)
(:cookies :trace-redirects :request-time :status :headers :body)
user=> (:status resp)
200
user=> (:headers resp)
{"server" "Apache", "content-encoding" "gzip", "x-proxied" "lb2",
"content-type" "text/html; charset=UTF-8", "date" "Sun, 06 Jan 2013 00:02:58 GMT",
"cache-control" "private", "vary" "Accept-Encoding,User-Agent",
"transfer-encoding" "chunked", "x-node" "wsgi, web40, www_all",
"x-mode" "ro", "connection" "close"}

And there you have it - we’ve found an HTTP client library, downloaded it, and figured out how to use it interactively in the REPL!
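
Since the response is an ordinary Clojure map, the usual map tools apply to it. Here is a small sketch that uses a hand-written stand-in (`fake-resp`, a made-up name with made-up contents shaped like the output above) so that it runs without a network connection:

```clojure
;; A stand-in for the response map; the real one comes from client/get.
(def fake-resp
  {:status 200
   :headers {"content-type" "text/html; charset=UTF-8"
             "server" "Apache"}
   :body "<!DOCTYPE HTML>"})

;; Header names are strings, so look them up with get-in rather than a keyword.
(get-in fake-resp [:headers "content-type"])
;; => "text/html; charset=UTF-8"

;; Destructuring pulls out several keys at once.
(defn summarize [{:keys [status body]}]
  (str status " (" (count body) " chars)"))

(summarize fake-resp)
;; => "200 (15 chars)"
```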

It took me a while to figure this all out - after beating my head against a wall for a day, I eventually had to jump into the #clojure IRC channel and plead for help. Now you don’t have to! For further reading, check out the official Leiningen tutorial.

Putting it all together

Let’s finish up by figuring out how to actually run a Clojure program. Let’s try a good old `lein run`:

[jrheard@jrheard-air:~/dev/foo] $ lein run
No :main namespace specified in project.clj.

Okay, that didn’t work. Referring back to the Leiningen tutorial mentioned earlier and doing a search for :main, we see that you can define a :main key in your project.clj definition that specifies the namespace that `lein run` will run, and that said namespace has to contain a `-main` function, which serves as the entry point into your program. Let’s add this line to our project.clj spec:

:main foo.core
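
With that line added, the whole project.clj would look something like this. Putting `:main` last is just one choice; it is an ordinary key in the defproject form.

```clojure
(defproject foo "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [clj-http "0.6.3"]]
  ;; The namespace whose -main function `lein run` will call.
  :main foo.core)
```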

And finally let’s modify src/foo/core.clj so that it looks like this:

(ns foo.core
  (:require [clj-http.client :as client]))

(defn -main
  "Prints the first 50 characters of the HTML source of yelp.com."
  [& args]
  (println (apply str
                  (take 50
                        (:body (client/get "http://www.yelp.com"))))))

Here we go - let’s try it out with `lein run`!

[jrheard@jrheard-air:~/dev/foo] $ lein run
Compiling foo.core
<!DOCTYPE HTML>

<!--[if lt IE 7 ]> <html xmlns:fb

It works!
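
As a side note, the "first 50 characters" logic can be pulled out into a small pure function, which makes it easy to try in the REPL without hitting the network. This is only a sketch; `preview` is a made-up name, and it uses `subs` instead of `take` so the result is a string directly.

```clojure
(defn preview
  "Returns at most the first n characters of the string s."
  [s n]
  (subs s 0 (min n (count s))))

(preview "<!DOCTYPE HTML>" 5)
;; => "<!DOC"

;; Strings shorter than n come back unchanged instead of throwing.
(preview "hi" 50)
;; => "hi"
```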

That’s it for now - you now have a working REPL to play around with, the ability to install and use libraries, the knowledge to give your programs access to those libraries and run them, and a really good book that’ll take you through everything else you need to know about the Clojure language.

The reason I had to write this post

Clojure’s still a pretty young language. The community is extremely small relative to e.g. Python’s, and although the core language’s API is (I’m told) remarkably stable, a lot of the tools around it are new and in a state of rapid change. Add on top of that the fact that most of the up-to-date documentation you’ll find has poor SEO - to the degree that a lot of your Google searches will turn up documentation on richhickey.github.com that’s years out of date and deprecated - and you’ll find that getting started from scratch can be a little tricky.

I hope that this post has helped save you the few weeks of bewilderment that I went through when I was getting started - I promise that the joy of actually programming in Clojure is well worth putting up with these growing pains.

Assorted resources

  • Watch these two lectures by Rich Hickey, the creator of Clojure: “Are We There Yet?” and “Simple Made Easy”. In particular, watch the first one three or four times over the course of several months.
  • I’ve spent the past few weekends watching a whole lot of Clojure lectures. The official Clojure youtube channel is a great resource, and the amount of great content on InfoQ is really astounding; I’ve probably watched at least 25 lectures there.
  • The Clojure Toolbox, mentioned above.
  • Where Did Clojure.Contrib Go - you’re going to see references to libraries like “clojure.contrib.monads” as you explore. clojure.contrib no longer exists, and this page will tell you where the libraries it used to contain have gone.
  • This example project.clj shows you how to take advantage of the thousand different hooks Leiningen provides for customizing how your project is built and run.
  • Assorted core members of the Clojure community worth following on Twitter: @cgrand, @cemerick, @marick, @weavejester, @stuartsierra, @seancorfield, @Baranosky, @richhickey
  • Read "The Joy of Clojure" once you’re done with “Clojure Programming”.
  • Heads up - Noir is deprecated. Use Compojure instead.

Luke Amdor

Developer, T8 Webware

Who are you, and what do you do?

I’m Luke Amdor and I code a lot.


My work is as a developer at T8 Webware. I’m mainly involved with the backend platform that powers the magic behind our newest product Grip and much more to come. We’re very cross-functional developers at T8, involved with everything from architecture, to coding, to deployment and more. T8 Webware is based out of Cedar Falls, but a Des Moines office is in the works.

What hardware are you using?

My main workhorse is a System76 Gazelle Professional beast of a laptop. It’s pretty loaded with a quad core i7, 16 gigs of RAM, and a hybrid SSD. It’s definitely the fastest machine I’ve developed on, but isn’t one of the most mobile laptops I’ve used. It’s usually relegated to be docked on my desk. The battery life and weight aren’t as good as I’d like. I use a single 24” HP monitor as I’m not a big fan of the multi-monitor setups. I get distracted easily so I feel having only one thing in front of me allows me to focus on just that. Also in use is an external Apple keyboard with an old school Logitech mouse. I love the feel and action of Apple’s chiclet keyboards.


All of this sits on top of a poor man’s standing desk. I’ve basically put an IKEA coffee table on my regular desk to get everything to a height of 4 feet, which is perfect for me. The standing experiment started back in May of this year and I have been loving it. I no longer feel like I have any energy lows. Very recommended. I’m currently thinking of ways to add a treadmill to the mix, but have to figure out the logistics first.


For my mobile coding needs on the couch or the patio, I use a trusty 2009ish Macbook Pro. It has served me quite well over the years.


I use an iPhone 4 on the go and an iPad 2 to read news, RSS feeds, books, comics and pretty much consume all types of media.

And what software?

I’ve been a Mac user for about the last 7 or so years up until earlier this year when I finally switched over to Ubuntu. I had been becoming more disenchanted with OS X’s developer package support and general bloat. (Yes, I know about homebrew. I’ve actually contributed some formulae to it. It’s just not quite comparable to other package managers. Don’t even get me started on macports.) I’ve flirted with Linux distros in the past. I was even a hardcore gentoo user at one time. So in a quest for greater minimalism and power, I’ve fallen into Ubuntu. I use it both on my system76 laptop and Macbook.


Emacs is pretty much my one and only workbench. I’d estimate I spend about 80% of my coding time in it. I believe it’s the programmer’s text editor as almost everything is written in and extendable using Emacs Lisp. There’s so much that one can do in Emacs. The tar pit of Emacs configuration that I use is up at Github.


The other time is split between Chrome, the terminal using zsh, and Spotify. I use xmonad for window management. My Swiss army knife for scripting is Ruby.


Being the GTD nerd that I am, I need a reliable system of trust for keeping things out of my mind. I’ve used OmniFocus on the Mac in the past, but moved over to using the awesome org-mode in Emacs and Evernote for keeping track of things. That’s worked pretty well so far. However, lately, I’ve been feeling that as my number of inboxes grows and my dependence on mobile devices grows, my current system isn’t growing with me. So being the hacker I am, I’ve been slowly developing my own opinionated system at my own pace. Sorta like an ifttt for task collection into a single inbox. We’ll see how that goes. Hopefully I’ll be able to open it up more later.


At T8 we use a ton of open source: Scala, Akka, Unfiltered, Lift, Hadoop, HBase, Solr, MongoDB, PostgreSQL, and more. Chef and Vagrant make our deployments sane. For team/project communication we’re currently using Campfire, Trello, Github, Skype, and the usual Google apps. We’re hoping to get an engineering blog set up sometime in the near future so we can tell more of our technical story. There’s some really cool stuff that we have going on.


Other things I’m currently hacking with are Haskell, Clojure, and the very promising ClojureScript. I also have a neglected hobby with music generation and music as code, so playing around with extempore and overtone is interesting right now.


The essential backup apps I use are Dropbox and Crashplan.

What would be your dream setup?

Tools and machines that grow with you and don’t limit you.

Arcadia 0.1a Launched

We are very excited to announce the first public release of Arcadia, our integration of the Clojure programming language and the Unity 3D game development platform!

Clojure brings the live editing, dynamic typing, and persistent data structures of a modern Lisp to the world of video games. Unity brings the cutting edge graphics, real time physics, and multi-platform export offered by an industry standard engine to the world of functional programming.

Arcadia’s goal is to make these two powerful tools interoperate seamlessly to provide a powerful and fluid game development experience. To achieve this, we forked the Clojure CLR compiler and introduced our own optimizations, most of which have been merged upstream. We also provide a powerful library to convert between Clojure’s persistent data structures and Unity’s GameObjects and Components.

Getting Started

Make sure to read our wiki for more information!

Early Adopters

People are already making things in Arcadia! Brooklyn based game developer Joseph Parker built Parade Route for the 2014 Clojure Cup and released its source code, and is now working on a whale diving game.

The Team

Arcadia is developed by Clojure developer Tims Gardner and Unity developer Ramsey Nasser. The project would not be possible without the invaluable support of David Nolen, Kovas Boguta, Brandon Bloom, David Miller and others.

What started as a proof of concept hack in April 2014 is turning into the workflow we’ve always wanted. We hope you find it as fun to use as we do!

Try Clojure notes

http://try-clojure.org/ 

Clojure> (- 3 2)
1
Clojure> (+ 3 3)
6
Clojure> (/ 10 3)
10/3
Clojure> (type (/ 10 3))
clojure.lang.Ratio
Clojure> (/ 10 3.0)
3.3333333333333335
Clojure> (+ 1 2 3 4 5 6)
21
Clojure> (defn square [x] (* x x))
#'sandbox260217/square
Clojure> (square 10)
100
Clojure> back
Clojure> (square 10)
100
Clojure> (fn [x] (* x x))
#
Clojure> (defn square (fn [x] (* x x)))
Parameter declaration fn should be a vector
Clojure> ((fn [x] (* x x)) 10)
100
Clojure> (def square (fn [x] (* x x)))
#'sandbox260525/square
Clojure> (map inc '(1 2 3 4))
(2 3 4 5)
Clojure> (map inc [1 2 3 4])
(2 3 4 5)
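
A note on the error in the transcript: `defn` expects a parameter vector right after the name, because it is shorthand for `def` plus `fn`. A minimal sketch of the two equivalent forms (the names `square-a` and `square-b` are made up here):

```clojure
;; def binds a name to any value, including an anonymous function.
(def square-a (fn [x] (* x x)))

;; defn does the same thing in one step: name, parameter vector, body.
(defn square-b [x] (* x x))

(map square-a [1 2 3])
;; => (1 4 9)

(square-b 10)
;; => 100
```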

Cluster (with Clojure)

I was watching a video of Berkeley professor Michael Jordan lecturing on the Chinese Restaurant process and for a moment he showed a slide of a tree of documents that were matched up by word frequencies.  It seemed cool so I coded up my own version of it, mostly to learn about the topic and to get some practice with Clojure.  It turned out there’s a name for this:  hierarchical clustering.  I went with the ‘agglomerative’ version of it, which is repeatedly pairing up things and pairs of those things until you have a single pairing that, beneath it, contains everything you started with.  Usually you choose pairings based on similarity - you pair up the two available things that are most similar. 

To make documents into something you can easily compare, I’m converting each into a hashtable of the relative frequencies of the words it contains, like this:  {"purple" 0.0015, "it" 0.0023, "this" 0.0083  ...}  etc.  The code finds the two most similar docs based on those frequencies, and matches them up, making a new hashtable like that by averaging the frequencies of those two docs [1]. This “pairing” is now on the same footing with all the other documents.  So we again find the two most similar docs (allowing this new pairing to be treated as a doc), repeating that process until we’re left with only a single pairing.  It contains every pairing we made, and ultimately every document we started with.  This is our finished product, a hierarchical cluster.
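
The procedure above can be sketched roughly as follows. This is not the code from the GitHub repo, just a minimal illustration under some assumptions: words are extracted with a simple English-only regex, similarity is plain Euclidean distance over the frequency maps, and all function names are made up.

```clojure
(require '[clojure.string :as string])

(defn rel-freqs
  "Map of word -> relative frequency for one text."
  [text]
  (let [words (re-seq #"[a-z']+" (string/lower-case text))
        n     (count words)]
    (into {} (for [[w c] (frequencies words)] [w (/ c (double n))]))))

(defn all-words [a b]
  (set (concat (keys a) (keys b))))

(defn distance
  "Euclidean distance between two frequency maps; missing words count as 0."
  [a b]
  (Math/sqrt
   (reduce + 0.0 (for [w (all-words a b)]
                   (let [d (- (get a w 0.0) (get b w 0.0))] (* d d))))))

(defn average
  "Frequency map for a pairing: the mean of its two members."
  [a b]
  (into {} (for [w (all-words a b)]
             [w (/ (+ (get a w 0.0) (get b w 0.0)) 2.0)])))

(defn closest-pair
  "The two nodes (docs or pairings) with the smallest distance."
  [nodes]
  (apply min-key
         (fn [[x y]] (distance (:freqs x) (:freqs y)))
         (for [[i x] (map-indexed vector nodes)
               y     (drop (inc i) nodes)]
           [x y])))

(defn cluster
  "Takes a map of doc name -> text (at least one entry) and returns a
  nested vector of names: the hierarchical cluster."
  [docs]
  (loop [nodes (for [[name text] docs]
                 {:tree name :freqs (rel-freqs text)})]
    (if (= 1 (count nodes))
      (:tree (first nodes))
      (let [[x y]  (closest-pair nodes)
            merged {:tree  [(:tree x) (:tree y)]
                    :freqs (average (:freqs x) (:freqs y))}]
        ;; The new pairing replaces its two members and competes as a doc.
        (recur (conj (remove #{x y} nodes) merged))))))

(def sample-docs
  {"en1" "the cat sat on the mat"
   "en2" "the dog sat on the mat"
   "es1" "el gato se sento en la alfombra"})

(cluster sample-docs)
```

On this tiny sample the two English snippets are paired first, and the Spanish one joins at the top level.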

You can see the code on my GitHub.

I turned it loose on some text files and got the following (plotted with the Protovis JavaScript toolkit) [3]

The blue nodes are the documents and the green nodes are pairings.  From top to bottom, the docs are: 

-the German Wikipedia article on the lambda calculus,
-the first several hundred words of a German novel, Wilhelm Meister’s Apprenticeship by Johann Wolfgang von Goethe,
-an excerpt from Shakespeare’s King Lear,
-the scifi short story, “They’re Made out of Meat”,
-the English Wikipedia article on the lambda calculus,
-the English Wikipedia article on “The Buzzer” or UVB-76,
-the Sherlock Holmes story, “The Red Headed League”,
-the Sherlock Holmes story, “A Scandal in Bohemia”,
-the scifi short story, “The Long Watch” by Robert Heinlein,
-Shakespeare’s first and second sonnets,
-Shakespeare’s third and fourth sonnets,
-the Spanish poems, “Candor” and “Reto” from Julio Flórez,
-the Spanish Wikipedia article on the lambda calculus,
-the Spanish Wikipedia article on functional programming,
-the Dutch Wikipedia article on the lambda calculus.

Notice that each pairing shows three words.  Those are the most ‘interesting words’: those whose frequencies do the most to make that pairing stand out from the average.  For example, if you’ve got a document that uses the word “mooloolaba” a few times, that’s probably going to be one of its interesting words because it’s so rare elsewhere.  But a word could also be interesting for not showing up, ex. if the word “the” never shows up in a document or only a few times in a long text.  In that case the word is in parens.
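
One way to pick out such words is sketched below. It assumes "standing out" means the largest absolute difference between a pairing's relative frequency and the overall average; the actual scoring in the project may differ, and `interesting-words` and the sample numbers are made up.

```clojure
(defn interesting-words
  "The n words whose frequency in doc-freqs differs most from avg-freqs.
  Words that are rarer than average are wrapped in parens."
  [doc-freqs avg-freqs n]
  (let [words     (set (concat (keys doc-freqs) (keys avg-freqs)))
        deviation (fn [w] (- (get doc-freqs w 0.0) (get avg-freqs w 0.0)))]
    (->> words
         ;; Negate so the largest deviations sort first.
         (sort-by (fn [w] (- (Math/abs (double (deviation w))))))
         (take n)
         (map (fn [w] (if (neg? (deviation w)) (str "(" w ")") w))))))

(interesting-words {"mooloolaba" 0.05 "a" 0.02}
                   {"the" 0.06 "a" 0.02 "mooloolaba" 0.001}
                   2)
;; => ("(the)" "mooloolaba")
```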

It seems to have done an OK job here.  It strongly leans toward matching up docs (and pairings) in the same language when possible.  I was hoping that it would be able to pull out the two science fiction stories, but that isn’t happening.  It’s not smart enough for that.  I’m pleased that it grouped the Spanish functional programming and lambda calculus articles before including the Spanish poetry.  It put Shakespeare’s sonnets together, but failed to associate them very closely with the excerpt from King Lear.  [4] 

I was also hoping the Dutch and German articles would cluster together before being joined with the Spanish or English docs, but this didn’t happen.  It might be that “de” is a common word in Dutch and Spanish, whereas “is” and “in” are common in Dutch and English.  So the classifier might see Dutch documents as partway between English and Spanish ones (even though the opposite is closer to the truth).  

Interesting.  I think this could be improved by using a statistical distance to measure the distance between vectors, instead of unscaled distance of relative frequencies as I’m doing now.

In playing with this code I stumbled on an interesting way to refactor it.  I’ll talk about that in the next post.






[1] This is called the vector space model, declaring each word to be a dimension, and each document a point, or vector, in that word-space.  Identifying docs by the words they contain and not worrying about word order is in general called using the “bag of words”.  I’m comparing those frequencies using Euclidean distance.  It’s often said to be better to use the cosine of the two vectors but that doesn’t matter here since the dimensions of any document vector sum to 1 (is there a name for such a vector?), and I’m only looking for a ranking of distance.

[2] I omitted the thirty lines of code to translate the s-expression result to the JSON needed for Protovis.

[3] I found in trial runs over all of Shakespeare’s sonnets that it did a pretty good job of sorting out the earlier sonnets from the later ones.

[4] The regular expression I used for extracting words is poor for non-English languages, but the algorithm can probably handle it anyway, as the fragments it creates will be unique to the words they came from.

In the Bay Area and need help with your machine learning project? Contact me at info@reatlas.com or twitter.com/herdrick  

New Years Coding Resolutions / Goals

I reached my year-end goals for 2011 by graduating with my Master’s degree, and finished the year strong with a 4.00 GPA for the past two semesters. Now it’s time to set some new goals for the year. At the beginning of the year, I read a few blog posts on New Year’s resolutions for coding and decided I would share my own.

First off, I want to spend some time learning new programming languages along with strengthening my current arsenal. Last year, I spent the majority of my time programming Ruby with a little bit of Python and Objective-C. For this year, I have decided that I will spend time learning 3 different categories of programming languages (compiled, scripted, and functional).

The languages I have chosen:

  • JavaScript in combination with Ruby or Python
  • Java / Objective-C
  • Clojure

Lately, my interest has leaned towards JavaScript, because I initially learned how to use jQuery but never took the additional time to learn JavaScript itself. I’ve been spending more time in the JavaScript world so I can program and experiment with tools such as Node.js and CoffeeScript. I initially had an interest in Node.js, but did not spend enough outside time to see what sort of projects I could create using it. I’m looking forward to writing an application using Node.js and putting it up on my Github account. I’m still looking to improve my knowledge of the Ruby programming language, which I have loved programming in since I started a year and a half ago.

I never really got a chance to program in Java that much since my university focused heavily on C++ as the primary language. Most people who had Java backgrounds in my school were mostly transfer students. I had only taken 1 semester’s worth of Java making very basic games. I probably won’t be spending a tremendous amount of time to learn Java this year, but just want to get my feet wet again.

My first exposure to a functional programming language was Lisp. Granted, I did not program in it, but some of my college buddies were taking a class on artificial intelligence and I could not understand a damn thing about all those curly braces or parentheses (I forget which one it uses). So what got me interested in learning a functional programming language? I have recently been going on interviews for a new job and talking with other developers about what sort of projects they have been working on and what languages they use. Two developers mentioned they used Clojure, which is a dialect of Lisp. I don’t know too much about it except that the data structures used in Clojure are immutable, which makes them safe to share between threads and easier to reason about than mutable data structures. Other options that I was going to consider were Scala and Haskell, but I’ll stick with learning Clojure for now. I’ll start off watching the peepcode screencast and then read a book.
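
For anyone else starting from the same place, here is a tiny sketch of what "immutable" means in practice. The names `v1`, `v2`, `m1`, and `m2` are made up for the example.

```clojure
;; conj does not change v1; it returns a new vector.
(def v1 [1 2 3])
(def v2 (conj v1 4))

v1
;; => [1 2 3]
v2
;; => [1 2 3 4]

;; The same goes for maps: assoc returns a new map and leaves m1 alone.
(def m1 {:lang "Clojure"})
(def m2 (assoc m1 :family "Lisp"))

m1
;; => {:lang "Clojure"}
```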

Another goal I have is to write at least one or two iPhone applications and put them up on the App Store. I think the hardest part is coming up with a great idea. I wrote one application last semester and I might revisit it to make it more polished and submit the app.

I noticed that a lot of “big” companies have been using NoSQL databases such as CouchDB, MongoDB, or Redis. I’ll probably play around with one of these in conjunction with other projects just to see what the benefits are.