So I decided to take a bit of a break and mess around with a service called Exversion http://exversion.com that was started by some friends I met at YCombinator some time back. It’s kind of like a GitHub for public datasets and it offers an API for you to run queries against the data set. What this means is ONE bloody unified way to access your data. Of course there still exists the bootstrapping problem. Public datasets don’t conform to any specific format. They are spread across multiple files, the files aren’t always easily accessible, and to top it off, the formats aren’t easily convertible to a CSV (which is what Exversion’s alpha can import at the moment). Sure, Exversion could possibly (and they are probably working on *some* of this already) provide a set of import algorithms that would try their best to decipher the format in the datasets and attempt to flatten it out, but that is something that will never execute with 100% certainty. It would have to be a very heuristic set of algorithms and possibly some that learn on their own. Look how complicated this is already getting and we’re just talking about text in a spreadsheet! Whatever the case may be, they might provide a fixed set of algorithms that know of some of the formats that data.gov might use in exporting the public datasets. If they work, cool. If not, oh wellzzzz. I don’t think it makes sense for them to waste capital on that development since it’s not core to their product and it would only encourage the data publishers to keep publishing crap formats.
I think the beauty of this approach is how they are attempting to gather the masses of data-loving people in a call-to-arms style to “seed” exversion with datasets. If you’ve ever worked with a public dataset, you have had to transform it, smack it, spank it, and maybe even say a prayer to get it into a workable format that served your needs. Then what? Now it’s sitting in your hard drive or Dropbox collecting dust. Why not convert it (you have to do this anyway) and upload it to Exversion so that now the rest of the world doesn’t have to go through that pain for the dataset you just converted. If this practice scales, and enough volunteers start converting, it’s going to be a great resource for anyone that wants to study the public datasets. It would eliminate hours of pointless conversion work (this means save $$$ and do some actual work). I’m now going to contribute my first dataset to the community…hopefully.
When we met, I expressed my disgust when trying to work with public data sets from data.gov or other federal agencies way back in 2008 or 2009. I figured something must have improved since then so I decided to upload my data set provided by the FBI @ http://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2012/preliminary-annual-uniform-crime-report-january-december-2012
After unzipping the archive, I had to clap because this one didn’t include a useless Access database file (thank you FBI) like some of the datasets I pulled from the Census site. Instead it had XLS files so I figured I could open it up in my Numbers app and export to CSV and upload. WRONG! Exversion told me that the names of some of my columns were wrong so I went into the CSV that was exported and I find that there were empty column names. Why in the f*@!? … whatever … I’ll just fill in the names for them by a an algorithm known as O-Grep (Optical Grep) and using my vast knowledge of context clues that I learned about in 2nd grade. Cool so now we have some column names. Wooooooooo…
Now what in the @#@% is wrong? Exversion says that the columns and values don’t match because line 2 has 16 items, but it should have 14. Well @#@# Me then! Let’s whip out emacs again and compare that with the Numbers app view of the data. Ugh… this particular data set has 3 columns of hierarchical data involving the creative use of the “Merge Cells” function. This is going to be a real F#*@#! hoot! So now I have to unmerge all the merged cells. The first column is states so that’s going to be at least 100 clicks. The 2nd column can have any number of cities in the state so I’ll have to unmerge those as well. That means it will be at least 50 clicks, but that’s if there was only 1 city per state in the data. This is going to be more like 1200 clicks. I have to unmerge the city cells because the next column contains 2011,2012 for every city (more hierarchical nonsense). Uh oh. The 4th column I just looked at has a blank cell for what should correspond to 2011… not sure what that is so I’m going to put a 0 there. Others can easily fork the dataset and fix it if they want and I might soon be able to take a merge request for that fix… It’s now 12:55AM PST…let me get to clicking……..
clicking and fixing and clicking and now it’s 2:17AM and I finally got it to work. Here it is: http://exversion.com/data/view/MOG9O2VZAH4GT7K I don’t know why it won’t preview, but you can click download and get the csv representation of it. When you are signed in, you can make queries against it. It seems like once I log out, something happens to the dataset and it puts it into a weird state that causes the data preview error. That’s fine though. This is in alpha and under active development so it should be sorted out soon.
So, as you can see, I spent toooooo long trying to do anything with these freely available public datasets simply because their formats make the entire thing inaccessible. I don’t even understand why I had to go through the unmerging, copying, pasting, fixing, etc. Clearly this data is not stored on government computers in this format. It just seems like someone ran a “report” that took the data from an actual database and exported it to CSV. Then some actual person decided to spend the day getting creative with Excel and making the data pretty with merged cells … but ummmm guess what … they rendered it useless to those of us that want to consume it. So, hey! data.gov and other agencies want to release public datasets, just please export it FLAT as a csv. Better yet, why not just throw them directly onto Exversion that way no one has to spend the time converting and importing for you.
Sleep Time ZZZZzzzzzZZZZzzzZZZZZzzz