\figh{Figs/i_dont_get_your_code.jpg}{0.8}
\footnotesize \lolit [geekandpoke.typepad.com](http://geekandpoke.typepad.com/geekandpoke/2008/02/the-art-of-prog.html)
It’s funny because it’s true.
Clear code is more likely to be correct.
Clear code is easier to use.
Clear code is easier to maintain.
Clear code is easier to extend.
But we also want to write code that we (and others) will be able to use again. Efficiency is way down on the list of needs.
Write clearly, and re-write for greater clarity.
Break things down into small, re-usable functions, written in a general way, but not {\nhilit too general. You want to actually solve a particular problem.
Step 1: stop and think.
Step 2: write re-usable functions rather than a script.
By reproducible/re-runnable, I mean: don’t have a bunch of code from which you copy-and-paste in a special way. }
\begin{frame}[c]{}
\vspace*{30mm}
\centerline{\large Write programs for people, not computers}
{\footnotesize Wilson et al. (2014) PLoS Biol 12:e1001745}
The computer doesn’t care about comments, or style, or the naming of things. But users and collaborators (and yourself six months from now) will.
get_grid_index <-
function(vec, step)
{
grid <- seq(min(vec), max(vec), by=step)
index <- match(grid, vec)
if(any(is.na(index)))
index <- sapply(grid, function(a,b) {
d <- abs(a-b)
wh <- which(d==min(d))
if(length(wh)>1) wh <- sample(wh, 1)
wh
}, vec)
index
}
This is a bit of code that was originally part of a larger function.
Separated as a function, the code is more likely to be reused. And the smaller and more focused the function, the more likely you’ll use it again.
Avoid having scripts that are long streams of code. Write functions. Give each function a single, focused task. Break long functions into pieces.
Comment on the input and output; ideally, these are obvious from the names.
What might be improved about this function?
sampleone <-
function(vec)
ifelse(length(vec)==1, vec, sample(vec, 1))
get_grid_index <-
function(vec, step)
{
grid <- seq(min(vec), max(vec), by=step)
index <- match(grid, vec)
if(any(is.na(index)))
index <- sapply(grid, function(a,b) {
d <- abs(a-b)
sampleone(which(d == min(d)))
}, vec)
index
}
Ideally, a function does only one thing.
Pulling out the {\ttsm sampleone code as a separate function makes the body of {\ttsm get_grid_index} a bit simpler.
Is the {\ttsm ifelse} line clear? Would it be better to just write that function with {\ttsm if} and {\ttsm else}?
Could {\ttsm get_grid_index} be improved further? }
sampleone <-
function(vec)
ifelse(length(vec)==1, vec, sample(vec, 1))
get_grid_index <-
function(vec, step)
{
grid <- seq(min(vec), max(vec), by=step)
index <- match(grid, vec)
if(any(is.na(index))) {
for(i in seq(along=grid)) {
d <- abs(grid[i] - vec)
index[i] <- sampleone(which(d==min(d)))
}
}
index
}
The use of {\ttsm sapply is a bit confusing and hard to follow. I think it’s a bit more clear to use a {\ttsm for} loop.
Whether {\ttsm apply}-type functions are more readable or less readable depends on the complexity of the function that is applied but also on the experience of the reader. }
sampleone <-
function(vec)
ifelse(length(vec)==1, vec, sample(vec, 1))
get_grid_index <-
function(vec, step)
{
grid <- seq(min(vec), max(vec), by=step)
index <- match(grid, vec)
missing <- is.na(index)
if(any(missing)) {
for(i in which(missing)) {
d <- abs(grid[i] - vec)
index[i] <- sampleone(which(d==min(d)))
}
}
index
}
One last change: The loop really only needs to be over the parts of {\ttsm index that are missing.
Of course, I should also add a comment for each function, to explain the input and output. }
# rmvn: simulate from multivariate normal distribution
rmvn <-
function(n, mu=0, V=diag(rep(1, length(mu))))
{
p <- length(mu)
if(any(dim(V) != p))
stop("Dimension problem!")
D <- chol(V)
matrix(rnorm(n*p),ncol=p) %*% D + rep(mu,each=n)
}
I’m often wanting to simulate from a multivariate normal
distribution. It’s not hard, but you have to remember: do you do
Z \%*\% D
or D \%*\% Z
? And in any case, it’s a lot
easier to just write rmvn(n, mu, V)
.
Note the use of default values.
# colors from blue to red
revrainbow <-
function(n=256, ...)
rev(rainbow(start=0, end=2/3, n=n, ...))
# move values above/below quantiles to those quantiles
winsorize <-
function(vec, q=0.006)
{
lohi <- quantile(vec, c(q, 1-q), na.rm=TRUE)
if(diff(lohi) < 0)
lohi <- rev(lohi)
vec[!is.na(vec) & vec < lohi[1]] <- lohi[1]
vec[!is.na(vec) & vec > lohi[2]] <- lohi[2]
vec
}
If there’s a bit of code that you write a lot, turn it into a function.
I can’t tell you how many times I typed
rev(rainbow(start=0, end=2/3, n=256))
before I thought to make
it a function. (And I can’t tell you how many times I’ve typed it
since, having forgotten that I wrote the function.)
(And really, I should forget about {\ttsm revrainbow; such rainbow color scales are generally considered a bad idea.)
I learned about Winsorization relatively recently; it’s a really
useful technique to deal with possible outliers. It’s not hard. But
how much easier to just write winsorize()
!
}
No long streams of code; rather, a short set of function calls.
I’m particularly bad at making use of others’ code.
You don’t want to have to edit the code for different data.
We {\nhilit will write scripts. But make them a bit more general by allowing command-line arguments to specify input and output file names. You can always include defaults, for your particular files. Even better is to have these specified in your {\tt Makefile.} }
Global variables make code difficult to reuse.
It turns out that, when you pass data into a function, R doesn’t actually make a copy. If your function makes any change to the data object, {\nhilit then it will make a copy. But you shouldn’t be afraid to pass really big data sets into and between functions. }
max_iter <- 1000
tol_convergence <- 0.0001
12
?”When you include such things as function arguments, include reasonable default values.
# move values above/below quantiles to those quantiles
winsorize <-
function(vec, q=0.006)
{
lohi <- quantile(vec, c(q, 1-q), na.rm=TRUE)
if(diff(lohi) < 0)
lohi <- rev(lohi)
vec[!is.na(vec) & vec < lohi[1]] <- lohi[1]
vec[!is.na(vec) & vec > lohi[2]] <- lohi[2]
vec
}
I taught part of a course on statistical programming at Johns Hopkins. It included a lecture like this one. Still, many students submitted their programming assignment flush left.
Most people recommend indenting by 4 spaces. Still, I prefer 2.
(4 spaces are easier to track.)
Don’t use tabs, as they get messed up when moving between computers.
# move values above/below quantiles to those quantiles
winsorize<-function(vec,q=0.006)
{lohi<-quantile(vec,c(q,1-q),na.rm=TRUE)
if(diff(lohi)<0)lohi<-rev(lohi)
vec[!is.na(vec)&vec<lohi[1]]<-lohi[1]
vec[!is.na(vec)&vec>lohi[2]]<-lohi[2]
vec}
The computer doesn’t care about white space, but people do.
get_grid_index <-
function(vec, step)
{
grid <- seq(min(vec), max(vec), by=step)
index <- match(grid, vec)
if(any(is.na(index)))
index <- sapply(grid, function(a,b) { d <- abs(a-b); sampleone(which(d == min(d))) }, vec)
index
}
$<$ 72 characters per line
if( (ndraws1==1) && (ndraws2>1) ) {
...
}
leftval <- which( (map - start) <=0 )
Even if they aren’t {\nhilit necessary, they can be helpful to you (or others) later. }
tmp1
, tmp2
, …i
, j
, x
, y
in the simplest
situationsfv
, what might it do?nms
, what could it be?Descriptive names make code self-documenting.
If functions are named by what they do, you may not need to explain further.
If the object names are meaningful, the code will be readable as
it stands. If the objects are named g
, g2
, and {\tt
gg, it’ll need a lot of comments.
I admit, I’m a terrible offender in this regard, especially when I’m coding in a hurry. Do as I say, not as I do. }
markers
vs mnames
camelCase
vs. pothole_case
nind
vs n.var
If a function/object has one of these, there shouldn’t be a function/object with the other.
Consistency makes the names easier to remember.
If the names are easier to remember, you’re less likely to use the wrong name and so introduce a bug.
And, of course, other users will appreciate the consistency.
total
and totals
n.cluster
and n.clusters
result
and results
Mat
and mat
g
and gg
Don’t use similar but different names, in the same function or
across functions.It’s confusing, and it’s prone to bugs from mistyping.
\figw{Figs/richierocks_tweet.png}{0.8}
\footnotesize \lolit [@richierocks](https://twitter.com/richierocks/status/388609208293556224)
drop_the_bom
might actually be quite memorable, which
would make it a good choice.
And think of knitr: knit
, purl
, spin
, {\tt
stitch. Those work.
But generally I’d say: be direct rather than cute. Perhaps
remove_byteorder_marks
}
But in many cases, code can be rewritten to be clear and self-documenting.
Commenting takes time. And no one ever goes back and adds comments later. It’s another one of those “invest for the future” type of things.
{\small
* Explain what’s wrong (and where)
* {\ttsm error(“nrow(X) != nrow(Y)”)}
* Suggest corrective action
* {\ttsm “You need to first run calc.genoprob().”}
* Give details
* **"nrow(X) ("**, nrX, **") != nrow(Y) ("**, nrY, **")"**
* Don’t give error/warning messages that users won’t understand.
* {\ttsm X’X is singular.}
* Don’t let users do something stupid without warning
Meaningful error messages are hard.
Writing them to give detailed information takes time.
And of course {\nhilit you wouldn't be making these errors, right?
But remember that you, six months from now, will be like a whole new
person.
}
Check that the input is as expected, or give warnings/errors.
Write these in the first pass (though they’re dull).
Hadley Wickham’s assertthat
package is great for this. We’ll
talk about it in a couple of weeks.
Break code into separate files (say 300 lines?)
Each file includes related functions
Files should be named meaningfully
Include a brief comment at the top.
My R/qtl package has some {\nhilit really long files, like {\tt
util.R}, which seemed like a good idea at the time. I have to use
a lot of grep
to find things. And there are plenty of
functions in there that I have totally forgotten about.
I used to make really long boiler-plate comments at the top of each file. That means that you {\nhilit always} have to page down to get to the code. Now I try to write really minimal comments. }
Mine is R/broman, \ttsm github.com/kbroman/broman
# qqline corresponding to qqplot
qqline2 <- function(x, y, probs = c(0.25, 0.75), qtype = 7, ...)
{
stopifnot(length(probs) == 2)
x <- quantile(x, probs, names=FALSE, type=qtype, na.rm = TRUE)
y <- quantile(y, probs, names=FALSE, type=qtype, na.rm = TRUE)
slope <- diff(y)/diff(x)
int <- y[1L] - slope*x[1L]
abline(int, slope, ...)
invisible(c(intercept=int, slope=slope))
}
R packages are great for encapsulating and distributing R code with documentation.
Every R programmer should have a personal package containing the various functions that they use day-to-day.
We’ll talk about writing R packages next week. It’s really rather easy.
I give an example here of one function in my R/broman package. Some of the earlier examples are in this package.
R has two systems for object-oriented programming: S3 and S4. They can make your code easier to use and understand. But they can also make things more opaque and harder to understand.
We all write differently. That’s okay. But be consistent. If every function is named in a different way, it will be that much harder to remember.
Plan for the future: write code that is readable and easy to maintain. And write things in a modular and reasonably general way, so that you (or others) will be more likely to be able to re-use your work.
Be organized. Well-organized software is easier to understand and maintain.
It’s when I’m hurried that I write crap code. It means more work later.
There are a lot of great programmers out there; read their code and learn from it.