A few really bad tools have risen to ubiquity in data science, and they’re an immense drag on the productivity of almost everyone in the field.
Someday someone is going to create, and then successfully promote, a serious competitor to these tools, and I will be so happy. It won’t actually be that hard, because the tools are so bad.
The tools I’m thinking of are:
- Jupyter Notebook (which is such an inherently bad idea it feels like a mean joke)
- Pandas (which is much less actively harmful than Jupyter Notebook, but is a very cumbersome and confusing way of doing some very basic and foundational tasks)
- “Jupyter + Pandas,” the synergetic combination of these two tools (pandas clearly expects you to use Jupyter so you can see its HTML output) that has data science in a tighter grip than either bad tool could manage on its own
What is Jupyter Notebook? It’s basically an interactive interpreter that looks like an IDE. You can write long blocks of code at once easily, and you can go back and edit/delete/rewrite your code … and all the while you are in the same interpreter session, with the same global state, which was produced by code you ran earlier and then rewrote or deleted.
The state of the session is the context in which your code executes, yet it quickly diverges from anything your code could ever have produced! Indeed, any Jupyter Notebook quickly develops a mysterious state which is impossible for anyone to reproduce perfectly. A huge fraction of all code written by data scientists is first executed inside one of these phantom, inexplicable states.
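To make the divergence concrete, here is a minimal sketch that fakes a notebook session in plain Python. Everything in it is an illustrative assumption rather than anything Jupyter actually exposes: the `session` dict stands in for the kernel's global namespace, `run_cell` is a hypothetical helper, and `clean` / `drop_missing` are made-up function names.

```python
# A plain dict stands in for the notebook kernel's global namespace.
session = {}


def run_cell(source):
    """Execute one 'cell' in the shared session, roughly as a kernel would."""
    exec(source, session)


# Cell 1, first draft: defines a helper called `clean`.
run_cell("def clean(xs):\n    return [x for x in xs if x is not None]")

# Cell 2: uses the helper.
run_cell("result = clean([1, None, 3])")

# Later, Cell 1 is rewritten in place and re-run. The helper now has a new
# name, and the old definition no longer appears anywhere on the page.
run_cell("def drop_missing(xs):\n    return [x for x in xs if x is not None]")

# Cell 2 still runs in the live session, because `clean` lingers in memory.
run_cell("result = clean([1, None, 3])")
live_result = session["result"]

# Replaying the notebook as it now reads, top to bottom, in a fresh session:
fresh = {}
visible_cells = [
    "def drop_missing(xs):\n    return [x for x in xs if x is not None]",
    "result = clean([1, None, 3])",
]
replay_error = None
try:
    for src in visible_cells:
        exec(src, fresh)
except NameError as err:
    # The code on the page cannot reproduce the state it was developed in.
    replay_error = err

print(live_result)
print(repr(replay_error))
```

The live session happily produces a result, while the code that is actually visible on the page fails with a `NameError` when run from scratch.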
Yet we develop our code in this nightmare joke IDE anyway, because nothing else has the same (fairly simple, but essential) visualization tools. And because we like doing computations that take a while, and doing all of them in a single, convoluted, stateful process running alongside development is a simple (albeit horrible) way to avoid doing them more than once.
Some people embrace this tool to an extent I do not understand, seeing some untapped potential in it. For example, Google made Colaboratory/Colab, and Netflix built some vast complex system around it so they could … so they could do … honestly, I watched that whole video and I’m still not sure.
Pandas is … okay, I guess, it’s just very un-Pythonic. Python is great! That’s why these ubiquitous add-ons to Python are so frustrating.
Python likes having one conceptually simple way to do each thing. Pandas has a huge, inconsistent API with 5 different ways to do everything.
Quick, do you want `pd.read_sql` or `pd.read_sql_query` or `pd.read_sql_table`? Do you want `isna` or `isnull`? `join` and `merge` do overlapping things with different argument syntax. By default, there is no concept of an integer or boolean column with nullable type, so the moment you add a null value to one, its type degrades (integers to float, booleans to “object”). Everything is fuzzy and squishy and changes from version to version.
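A few of these complaints are easy to check for yourself. The sketch below assumes a reasonably recent pandas with default dtypes; the column names and values are made up for illustration. Which dtype you fall back to depends on the column: in my understanding an integer column with a null becomes float, and a boolean column with a null becomes “object.”

```python
import pandas as pd

# Two spellings, same answer: isna and isnull are duplicates of each other.
values = pd.Series([1.0, None, 3.0])
same_mask = pd.isna(values).equals(pd.isnull(values))
print("isna and isnull agree:", same_mask)

# A clean integer column, then the same column with one missing value.
ints = pd.Series([1, 2, 3])
ints_with_null = pd.Series([1, 2, None])
print(ints.dtype, "->", ints_with_null.dtype)

# Same story for booleans.
bools = pd.Series([True, False, True])
bools_with_null = pd.Series([True, False, None])
print(bools.dtype, "->", bools_with_null.dtype)

# merge and join: the same lookup, two different calling conventions.
left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [1, 2], "b": ["p", "q"]})

# merge matches on a named column.
merged = left.merge(right, on="key")

# join matches on the index, so the key has to be moved there and back.
joined = left.set_index("key").join(right.set_index("key")).reset_index()

print(merged.to_dict("list") == joined.to_dict("list"))
```

Pandas does ship opt-in nullable dtypes such as `"Int64"` (capital I), but you have to ask for them explicitly, which is one more thing to remember.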
But it prints the outputs of SQL queries in a pretty way that everyone loves. … except only if you’re in a Jupyter Notebook. You’re in a Jupyter Notebook, right? You’re using pandas, right? Right???