In September 2015, when I started working in a lab that requires bioinformatics skills, I made a new friend whose name is R. Before then, the last time I had programmed was in 2008, in C, and I didn't do well at it. Since then, R has become my de facto mother tongue in programming. Three years later, I'm writing a single cell RNA-seq package to be submitted to Bioconductor, and I have fixed bugs in other packages. I still have a long way to go to improve my R skills and deepen my understanding of R, but I can't wait to reflect on my friendship with R so far.

The Best of R

Here are my favorite aspects of R:

The Comprehensive R Archive Network (CRAN) and Bioconductor I can find an R package for pretty much everything I need in data analysis, and many cutting-edge bioinformatics tools are written in R. CRAN task views and Bioconductor biocViews are convenient ways to discover existing packages that solve particular kinds of problems. In addition, there're some great packages currently only on GitHub and not yet on CRAN or Bioconductor, such as VISION and dynverse. Recently, Johns Hopkins started Neuroconductor, a Bioconductor-inspired platform for R packages for medical image analysis.

The R community If something is painful to use, then most likely someone somewhere will write a package to make it less painful. There're helpful people who answer our burning questions on Stack Overflow. Moreover, those cool packages are all contributed free of charge by the community, and in return, members of the community get to use lots of cool packages for free. There're also many great online books and tutorials that are free, such as Hadley Wickham's R for Data Science. I like this kind of open and collaborative culture. Moreover, the R community is interdisciplinary, consisting not only of statisticians and data scientists, but also of sociologists, geologists, psychologists, physicians, biologists, and so on who need to analyze data. Bioinformatics has absorbed a lot from other fields, from the good old principal component analysis (PCA) to the way the famous Seurat package in single cell RNA-seq used a technique from time series analysis to align multiple datasets.

The R community is not so hostile to women I know CS majors are predominantly male, and it has been said that for all its veneer of liberalism, Silicon Valley has a problem with misogynist culture. However, as a woman, I have not experienced such misogyny in the R community and have never received R-related misogynist comments. The R community supports the R-Ladies organization from the very top, as seen in the R Project's Forwards task force and in the attention R-Ladies gets at UseR! conferences and rstudio::conf. It has also been said that while R contributors are still predominantly male, there are at least many more women here than in other programming communities, including those of Python and Java.

The RStudio integrated development environment (IDE) I like how I can see what's in the environment, the files I have, and help pages in one place. RStudio also works seamlessly with other packages developed by the RStudio company, such as rmarkdown, shiny, testthat, and devtools, by providing shortcuts and easy-to-use buttons and panes for those packages. Plus those packages are great. I also really like the git panel, which saves a lot of the thinking needed to use git from the command line. RStudio Server is also easy to use: you don't need to ssh to use it or the bash terminal inside it, and ssh can be scary to those unfamiliar with the command line. What's even cooler, we can make slides, write books, and build websites within RStudio, and it previews LaTeX math as we type; all of this is very helpful for communicating the results of our data analysis. Now I use RStudio instead of Microsoft Word as my text editor, and this blog itself is made with blogdown, with posts written in RStudio. While I can use Jupyter notebook for R (and I have done so before), I prefer R Notebooks, partly because of all the other stuff in RStudio. With R Markdown, RStudio gives us the best of both worlds of notebook and IDE.

Tidyverse It is said that R has a steep learning curve for beginners, and that was the case for me back in 2016. In 2017, I finally got comfortable wrangling data with base R, but then I switched to the Tidyverse. It's so easy to learn and so intuitive that I picked it up within a month; no more steep learning curve.
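
For instance, here's a minimal sketch of the kind of pipeline that made the Tidyverse click for me (my own toy example, using the built-in mtcars dataset):

library(dplyr)
# Among cars with at least 6 cylinders, compute the mean mpg for each gear count
mtcars %>%
  filter(cyl >= 6) %>%
  group_by(gear) %>%
  summarize(mean_mpg = mean(mpg))

Each verb does one thing, and the pipe reads like a sentence, which is why it's so quick to pick up.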

The RStudio company I just mentioned packages like the Tidyverse, rmarkdown, shiny, blogdown, bookdown, and devtools. There're other packages I like, such as roxygen2 and keras (I mean the R package, not the Python package), also developed by RStudio. Using roxygen2 to write documentation feels just like writing casual comments to explain what my functions do; almost no LaTeX is required. Furthermore, RStudio is developing RStudio Cloud, in which we can set up a project with all the packages installed and share it with others, so they can play with our project without having to worry about installing packages and all the related complexities, especially for packages that require compilation. This is a great way to teach data science. The downside is that the free plan only gives us 1GB of memory, so we can only use small datasets. RStudio's products are amazing, but they are either free or very expensive.
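
To show what I mean about roxygen2, here's a toy function (hypothetical, not from any real package) documented the roxygen2 way; running devtools::document() turns these comments into a help page:

#' Add two numbers
#'
#' @param x A numeric vector.
#' @param y A numeric vector.
#' @return The element-wise sum of x and y.
#' @examples
#' add(1, 2)
#' @export
add <- function(x, y) {
  x + y
}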

Metaprogramming This is what made the nice tidyeval, in which we can write column names directly without quoting, possible. This is also what made shiny nice; shiny allows us to write HTML in R style, a style that is more intuitive to R programmers.
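
As a minimal tidyeval sketch (the wrapper function and its name are my own toy example), a function can capture a bare column name and splice it into a dplyr verb:

library(dplyr)
library(rlang)
# A toy wrapper: the caller passes a bare, unquoted column name
mean_by_col <- function(df, col) {
  col <- enquo(col)                    # capture the unevaluated column name
  summarize(df, mean = mean(!!col))    # splice it back in with !!
}
mean_by_col(mtcars, mpg)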

R's flexibility This allows users to implement nice features that are absent from base R. For instance, why do Python and MATLAB allow something like x, y = vec, where vec has 2 elements and the elements of vec are automatically unpacked into x and y, but not R? No problem; a package called zeallot implemented this feature. R's flexibility is also what made formulas and ggplot2 possible. R doesn't inherently support interactive outputs? No problem, we've got shiny, which looks interactive while doing everything in R's way. It's also convenient that different data types can be converted easily.
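
Here's zeallot's multiple assignment in action (a minimal sketch of the package's %<-% operator):

library(zeallot)
# Unpack a length-2 vector into two variables, Python style
c(x, y) %<-% c(1, 2)
x  # 1
y  # 2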

Interoperability What if there's a cool Python package that has no R equivalent? No problem, we've got the package reticulate, which calls Python from R with R-like syntax. What if a colleague is using closed source software such as SAS? No problem; the package plumber allows us to write web APIs for our R code that can be used from any platform, and the haven package nicely deals with SAS file formats. There're also R packages that allow us to call MATLAB and Julia from R. And there's the famous Rcpp, which calls C++ functions from R and makes cool C++ libraries available to R users. As R is built on top of C, R comes with a C API. It's also possible to call Fortran from R; I don't really know how to do it, but I know that it's possible since I have used R packages that contain Fortran code, such as deSolve, which is used to numerically solve differential equations.
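
For instance, a minimal reticulate sketch (assuming a Python installation with numpy is available):

library(reticulate)
# Import a Python module and call it with R-like $ syntax
np <- import("numpy")
v <- np$arange(9L)
np$reshape(v, c(3L, 3L))  # a 3x3 matrix, computed by numpy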

Functional programming R is a functional programming language. Functional programming is the part of Hadley Wickham's Advanced R that I enjoyed the most. It's just cool to pass functions as arguments, write functions that return functions, and write functions that modify the behaviors of other functions. I also find R's function-centric style of object oriented programming more intuitive to non-computer science majors in many cases than C++'s style. In R, methods belong to functions rather than objects (for S3 and S4, which are the most commonly used systems; this is not the case for R6). It's nice that I can implement new methods for classes from packages I often use, such as Seurat. It's as if a function were a service, and whether a function has a method for a class is like whether that service is provided to different kinds of customers. For instance, say we have a function restaurant, and there're classes vegetarians, Muslims, and Jews, while those without dietary restrictions are "default". When a function has a method for a class, it is providing service to that class, just as a halal restaurant is a method of the function restaurant for the class Muslim. In C++'s object oriented programming, by contrast, the class is something like a phone, which has methods like texting and calling. At least in the case of single cell RNA-seq, it's more intuitive to think about services provided to those datasets (as in R) than about those datasets having some special powers (as in C++). That said, I may be biased, since I'm not experienced enough with C++ and Python, in which methods belong to classes.
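
The restaurant analogy translates directly into a toy S3 sketch (all names here are made up for illustration):

# A generic function dispatches to a method based on the class of its argument
restaurant <- function(customer) UseMethod("restaurant")
restaurant.default <- function(customer) "regular menu"
restaurant.Muslim <- function(customer) "halal menu"
restaurant.vegetarian <- function(customer) "vegetarian menu"

customer <- structure(list(), class = "Muslim")
restaurant(customer)  # "halal menu"

Anyone can add, say, restaurant.Jewish later without touching any class definition, which is exactly why I can write new methods for classes from other packages.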

Culture for open and reproducible research Honestly, I really think I should work more on this myself. We have the cool organization rOpenSci, and packages like rmarkdown, knitr, and workflowr that help make research reproducible.

Documentation Functions are usually very well explained and have examples. Furthermore, Bioconductor requires examples in function documentation and vignettes in packages. The C++ linear algebra library Eigen has such terrible documentation that I just gave up trying to learn to use it; that's why I use Armadillo.

The built-in statistics functions R is built by statisticians, for statisticians, so of course it has good built-in statistics functionality. We don't need to install extra packages to do linear regression, generalized linear models, ANOVA, Student's t-test, PCA, MDS, logistic regression, and so on.
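
All of these are one-liners out of the box (a quick sketch using built-in datasets):

fit <- lm(mpg ~ wt, data = mtcars)   # linear regression
summary(fit)
t.test(extra ~ group, data = sleep)  # Student's t-test
prcomp(USArrests, scale. = TRUE)     # PCA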

Vectorization, all the amazing ways to avoid reading and writing loops, and the expressive power in general that makes R more human readable. Yes, someone has to write loops somewhere, but it doesn't have to be you, to paraphrase Jenny Bryan in the Thinking Inside the Box RStudio webinar. For example, which one is easier to read, write, and reason about:

sum(x)

Or

out <- 0
for (i in seq_along(x)) {
  out <- out + x[i]
}

R can help us avoid writing loops in more complicated cases as well. For instance, for a homework assignment in which we have to find the number of transcripts in the Ensembl human transcriptome, in R, it's simply

library(Biostrings)
# Read the transcriptome FASTA into a DNAStringSet, one entry per transcript
cdna <- readDNAStringSet("./hs_cdna.fa.gz")
# The number of transcripts is just the number of entries
length(cdna)

How many base pairs are there in this transcriptome? Easy:

sum(width(cdna))

But my colleagues who use biopython have to write hard-to-read loops to accomplish these tasks. Another example of avoiding writing loops is how rsample makes it unnecessary to write loops for cross validation, as sketched below. The apply family of functions and purrr allow us to think in terms of data frames, which is more intuitive from a data analysis point of view than loops. However, nesting a long anonymous function inside an apply or a purrr function does make the code hard to read.
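
Here's a minimal sketch of loop-free cross validation with rsample and purrr (the model and dataset are arbitrary examples of mine):

library(rsample)
library(purrr)
# Split mtcars into 5 folds; each split holds an analysis and an assessment set
folds <- vfold_cv(mtcars, v = 5)
# Fit a model on each analysis set; no explicit for loop in sight
fits <- map(folds$splits, ~ lm(mpg ~ wt, data = analysis(.x)))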

The Worst of R

Here are my least favorite aspects of R:

So slow! R, as a high level interpreted language, is pretty slow; equivalent C++ code can often speed things up dozens if not hundreds of times. However, the R community has worked on improving R's performance. For instance, starting from R 3.4, functions are byte compiled by default, which makes them several times faster than they would otherwise be, and starting from R 3.5, packages are byte compiled by default as well. Moreover, we've got Rcpp, which makes C++ look more like R and thus easier for R users. We've also got powerful C++ libraries accessible from R, such as the great Boost library (via the BH package) and the linear algebra libraries Eigen (RcppEigen) and Armadillo (RcppArmadillo); I really like Armadillo thanks to its documentation. The foreach and parallel packages also make parallel computing easy for R users, which can significantly speed up code when there're multiple cores. So yeah, with good coding practices (e.g. don't grow vectors, do vectorize, and so on) and those tools, we can bypass R's speed limit.
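
For instance, Rcpp lets us compile a small C++ function right from the R console (a toy example of mine, reimplementing sum):

library(Rcpp)
# Compile a C++ function and make it callable from R
cppFunction('
double sum2(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) total += x[i];
  return total;
}')
sum2(c(1, 2, 3))  # 6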

However, more work should still be done to make R more efficient. For instance, while mclapply and foreach can speed up intensive and parallelizable jobs, they also have a large memory overhead, since a copy of the data is made for each child R process and each child process does not know when memory is running out; I often run out of memory when using mclapply. While makeForkCluster can allow the child processes to share objects in memory, this sometimes results in object-not-found errors in child processes that I don't have enough computer science knowledge to resolve. Furthermore, when we really can't avoid writing loops, R loops are so slow and the memory overhead of mclapply and foreach is so high that we often have to resort to Rcpp, which means having to know one more programming language; fortunately, we don't need to know that much of it, and Rcpp and RStudio have made C++ easier for R users.

Everything is loaded into memory This is becoming less and less practical as single cell RNA-seq datasets get larger and larger. Sure, we can simply use a server that has a lot of memory, but there're already many ways to bypass this obstacle, such as hdf5r and rhdf5, R interfaces to the C library HDF5, which allow us to manipulate data on disk; only part of the data is loaded into memory and operations are done piecemeal. There's also sparklyr, which allows us to use the nice dplyr syntax to manipulate big data on Spark. There're also packages like bigmemory and DelayedArray that make life easier when dealing with data that doesn't fit into memory. That said, there's a bit of a learning curve, and before getting to know those packages, I foolishly tried to load large datasets into memory, causing R to crash.
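
For instance, with rhdf5 we can read just a slice of an on-disk dataset (the file and dataset names here are hypothetical):

library(rhdf5)
# Read only the first 100 rows of a dataset; the rest stays on disk
mat_chunk <- h5read("counts.h5", "matrix", index = list(1:100, NULL))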

Related to the last point: there isn't an easy way to stream files line by line, which can be done easily in Python and C++. When we do have to process a file line by line, we typically load it entirely, process it entirely, and write it entirely, rather than reading, processing, and writing line by line. That said, people in the R community are working on ways around this; for instance, see the chunked package.
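
A minimal chunked sketch (the file and column names are hypothetical):

library(chunked)
library(dplyr)
# Read, process, and write the file chunk by chunk, never all at once
read_chunkwise("big.csv", chunk_size = 1e4) %>%
  filter(!is.na(value)) %>%
  write_chunkwise("cleaned.csv")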

R is not backward compatible This is a complaint from some of my colleagues and the reason why R does not get updated on one of our lab's servers. However, I haven't experienced this problem myself, since I update the packages on my computer and in my personal directory very often, so I normally use the newest versions. Perhaps this is because R users tend not to be computer science majors and don't know the best practices of the software life cycle, such as that a function or argument slated for removal should first be deprecated and kept functional for a while, so people using old code won't immediately get into trouble.
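
Base R does provide the machinery for this practice (a toy sketch; both function names are made up):

new_fun <- function(x) x + 1
old_fun <- function(x) {
  .Deprecated("new_fun")  # warn callers, but keep old code working for now
  new_fun(x)
}
old_fun(1)  # still returns 2, with a deprecation warning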

Too many options for the same task This is the flip side of the vibrancy of the R ecosystem. Often we have several different packages that do the same thing; for example, irlba, rsvd, base R's prcomp, and several other packages can all do PCA. Sometimes this is overwhelming: which one shall I use? How shall I decide? This is happening in the field of single cell RNA-seq. According to this database of scRNA-seq tools, as of the writing of this post, there are already 108 packages for clustering, 96 for dimension reduction, 66 for data normalization, and 54 for differential expression. Those are parts of the standard, basic scRNA-seq data analysis pipeline, and we already have too many choices. While not all of them are written in R, R is by far the most commonly used language here. But who has the time to read all the papers behind those packages to decide which one to use? Even worse, such papers often present the packages as working better than they actually do. So what to do? Just in R, even among comprehensive packages that cover multiple steps of scRNA-seq data analysis, we already have several options (Seurat, scater, pagoda2), and more are coming up. Which one to use? Guess what? I'm too lazy to decide, so I'll stick to good old Seurat, since that's where I got started, and good old log1p normalization, PCA, tSNE, and Louvain clustering, since I use them for exploratory data analysis, which is pretty qualitative anyway. I'll only consider something fancier when it becomes a dependency when writing packages or when it informs experiment design.

Some long running commands can't be interrupted without force killing the R session. While a long running command is executing, RStudio becomes really slow at tasks unrelated to that running job, such as looking up documentation, and when an interruption is necessary, RStudio's "stop" button above the console often doesn't work, at least not until after quite a wait. When I need that R session and don't want to wait, I have to go to the terminal and kill the R session myself.

R is too flexible Sometimes it's hard to predict the output: matrices and data frames with only one column are automatically converted to vectors, data types are casually converted, something somehow returns a vector when you expect a scalar, and there're objects of length 0. Avoiding these surprises requires knowing R inside out and using some not so well-known techniques, such as drop = FALSE and vapply. C++'s rigidity regarding data types prevents these problems from the beginning, but R's flexibility makes it more convenient for interactive use.
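
Here's a quick sketch of those two defensive techniques:

m <- matrix(1:6, nrow = 3)
m[, 1]                # silently drops to a plain vector
m[, 1, drop = FALSE]  # stays a one-column matrix
# Unlike sapply, vapply enforces the type and length of each result
vapply(mtcars, mean, numeric(1))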

Syntax is often inconsistent across packages, so I have to waste some time learning new syntax. This isn't a problem within the Tidyverse, but look at the myriad machine learning packages and see how their syntaxes differ. That's why Max Kuhn wrote the caret package, which provides a consistent syntax for over 200 machine learning models, and that's also an objective of the newer and more Tidyverse-like tidymodels metapackage. This is also why Bioconductor urges developers to use existing Bioconductor data structures.
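
With caret, the same call pattern works regardless of the underlying package (a minimal sketch; method = "rf" assumes the randomForest package is installed):

library(caret)
# Swap method = "rf" for "svmRadial", "glmnet", etc. without changing anything else
fit <- train(Species ~ ., data = iris, method = "rf")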

We need more computer science professionals! When I read the source code of packages to better understand how they work, I'm often appalled by how inefficient the implementations are. Jenny Bryan said in her 2018 UseR! presentation that computer scientists and engineers are less likely to use R, while those in the natural sciences, social sciences, and (of course) statistics are more likely to. We really need more computer science professionals, at least to make R more efficient. What bothers me more is that R is often considered a second class citizen by computer scientists. Probably as a result, cool new deep learning frameworks (deep down, they're still based on C++) don't come to R first.

Some people with a computer science background may find R weird, though this is not a problem for me. I remember that back in a rotation, an older graduate student said he immediately quit R as soon as he saw that stupid arrow. Anyway, I'm perfectly fine with using <-. Did you know that there's also -> in R? Well, English speakers may find Arabic weird because Arabic has declension and is written from right to left, but this doesn't mean that Arabic is not an expressive language with great literature. BTW, I just found out that there's a programming language written entirely in Arabic script called Qalb (قلب), and just like R, it's a functional programming language!

About the cover image: A long time ago, when I was new to Rcpp, I read a blog post about using R to make beautiful plots of the Clifford attractor and tried it out with Rcpp. This is one of the images I produced, using the parameters in that blog post. I plotted 10 million points with ggplot2; there were no problems at all.