R versus Python: Programming Languages for Data Science

I love R, but almost everything my team has deployed to production is written in Python. Python can be deployed almost anywhere with just a few changes, from clusters to cloud functions. It’s flexible and fast enough (to those who mumble about how it’s slower than C++ or whatever: a few milliseconds of CPU time is very cheap). Maybe it’s not a good fit for deploying massive code projects, but most of our scripts are a hundred or so lines: a data processing job or a machine learning microservice.

Still, exploring data in Python sucks, in my opinion, and a lot of project time goes into exploring data and basic analysis. I have never particularly liked Jupyter notebooks – I find RStudio’s R Markdown to be quite superior. R has the best data exploration, manipulation, and graphing language – the Tidyverse. Seriously, it slays pandas. The Tidyverse is a collection of packages, generally under the wise oversight of RStudio and Hadley Wickham, that unifies many tasks under a common workflow. Dplyr and friends beat pandas any day for simplicity and clarity of operations.

I think R could theoretically become the machine learning language of choice for most production use cases, but for a few problems:

R and the Tidyverse do not have a dominant machine learning package like sklearn in Python. There are a number of packages, particularly caret, but no one-stop shop.
R seems to need more external packages to get things done than Python, which can handle most projects with just numpy, pandas, and sklearn. The Tidyverse is cleaning this up a bit, but it’s not there yet, especially for ML, where a different package is needed for every model…
R seems to have more breaking changes than Python. This might not be so true if you stick with the Tidyverse, but if you use more niche packages – and one tends to use more niche packages with R – breaking changes across versions are something I’ve observed routinely.
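
To make the "one-stop shop" point concrete, here is a minimal sketch of what I mean on the Python side. The data is synthetic and the column names (`x1`, `x2`, `label`) are made up for illustration; the point is only that two quite different models sit behind the same `fit`/`score` interface, with nothing imported beyond numpy, pandas, and sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["label"] = (df["x1"] + df["x2"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["label"], test_size=0.25, random_state=0
)

# Two different model families, wrapped in the same kind of pipeline.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
    "forest": make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ),
}

# The training and evaluation loop does not care which model it is handed.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows

print(scores)
```

Swapping in another estimator means changing one line in the `models` dict, which is the convenience I have not found an equally dominant equivalent for in R.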

I think the way forward for R is to have a well-controlled Tidyverse, including a sklearn equivalent. You could just import one package (‘Tidyverse’) for many projects. Tidyverse would then have fewer breaking changes, and if changes did occur you would only have to specify the version of one package – that of Tidyverse. That would beat Python on many fronts.

Notes:

Another advantage to R is that it tends to be where academics play, creating new packages here before anywhere else. This may be changing?
I haven’t used the R package MLR, but based on user feedback I think it is a little more stable than caret. I suspect that in a few years (2021?) R will have a nice, production-ready API like sklearn.
This looks like a good article for people who want to bring more of a Tidyverse style to Python: https://stmorse.github.io/journal/tidyverse-style-pandas.html
Another language sometimes mentioned is Julia. Mostly it stands out for being great for high-speed math operations. Yet most of my work is dataframe operations, not math. Therefore I think Julia is largely going to stay in the realm of mathematicians and other pretentious people.
Scala, JavaScript, C++: I see no reason why I would want to switch to any of these over Python. Simplicity is the goal for most projects, and Python wins there. I see no clear threat to Python supremacy, at least not for a few years yet.
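
On the note above about bringing a Tidyverse style to Python: the rough idea is method chaining, where each pandas step plays the role of a dplyr verb. The sketch below is my own illustration, not taken from the linked article, and the `sales` table and its columns are made up.

```python
import pandas as pd

# Made-up sales data, purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "units": [10, 0, 7, 3, 5],
        "price": [2.0, 2.0, 3.0, 3.0, 4.0],
    }
)

summary = (
    sales
    .query("units > 0")                                 # roughly dplyr::filter()
    .assign(revenue=lambda d: d["units"] * d["price"])  # roughly dplyr::mutate()
    .groupby("region", as_index=False)                  # roughly dplyr::group_by()
    .agg(                                               # roughly dplyr::summarise()
        total_revenue=("revenue", "sum"),
        orders=("units", "size"),
    )
    .sort_values("total_revenue", ascending=False)      # roughly dplyr::arrange(desc())
    .reset_index(drop=True)
)

print(summary)
```

It reads top to bottom much like a dplyr pipe, though I still find the R version cleaner.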
