I am baffled that a programming language is getting coverage in the NYT. I'd like to figure out what, exactly, it is about this one that merits mainstream coverage, but I'm afraid I've been at the office too long and I think my perspective on the matter is permanently skewed.
As an aside, I used R briefly for an econometrics/philosophy course I took. I recognized immediately that it was a powerful, functional language. What I wonder, though, is whether the scientific programming libraries in Python might eventually be a better environment. Surely there must be some R users here who can comment.
There are a number of packages for relatively obscure statistical techniques (say, nonmetric multidimensional scaling) that are available for R, but not for Python (that I have seen).
I also believe R is easier for statisticians who cannot program and do not want to learn how. There are many advanced techniques that are one-liners.
You've hit the nail on the head there. A programmer looks at R (or MATLAB) and says "this sucks as a programming language". An engineer or a scientist or a statistician says "so what, I'm not a programmer, nor do I want to be one, I've got actual work to do".
To some people C is just the 3rd letter of the alphabet. To others, it’s the grade they got in journalism, a measure of bust size or what pirates in movies sail across.
Having only very minimal experience with R, I get the impression that while Python is fantastic for programmers, and the tools are probably close to or better than R's (no idea about library support), it's still a general-purpose programming language, and requires one to think and structure things like a programmer.
R, at least in the beginning, comes across more like a powerful set of Excel formulas, so I think a non-programmer might pick it up faster without having good programming form.
(Not to say one is more powerful than the other, or that R programmers aren't as good as Python programmers, etc, I just think it's a difference in community viewpoints)
R is not at all like Excel, in form or function. R is actually a complete programming language, much more similar to Python in that respect. It supports two types of OOP (S3 and S4) and has hundreds (if not more) of contributed packages that do things like survival analysis, 2D/3D plotting, bioinformatics, machine learning, Bayesian statistics, econometrics, numerical integration, spatial statistics, and more (http://cran.r-project.org/web/packages/).
One major advantage over Python is that it's vectorised, so you can say things like A + 1 when A is a vector (or matrix). A bit like Matlab in that regard, only that it doesn't suck as much as the Matlab "language".
"One major advantage over Python is that it's vectorised, so you can say things like A + 1 when A is a vector (or matrix). A bit like Matlab in that regard, only that it doesn't suck as much as the Matlab 'language'."
This is not very accurate. NumPy / SciPy provide vectorized matrix libraries, significantly faster than both R and Matlab for matrix operations. No argument though that Matlab as a language truly sucks =)
Well, I don't see how NumPy can be significantly faster than R, as R basically passes most linear algebra down to BLAS and LAPACK. NumPy does the same, no?
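For what it's worth, the NumPy side of this exchange is easy to sketch: elementwise arithmetic looks just like R's vectorised `A + 1`, and dense matrix products are dispatched to whatever BLAS NumPy was compiled against, same as R. A minimal illustration (array values are made up):

```python
import numpy as np

# Vectorized arithmetic, analogous to R's `A + 1` on a vector or matrix
A = np.array([1.0, 2.0, 3.0])
print(A + 1)        # elementwise add: [2. 3. 4.]

M = np.arange(4.0).reshape(2, 2)
print(M * 2)        # elementwise scaling, no explicit loop

# Matrix multiplication is handed down to the BLAS NumPy was built
# against, much as R passes dense linear algebra to BLAS/LAPACK
print(M.dot(M))     # [[ 2.  3.] [ 6. 11.]]
```

So for plain matrix operations the speed difference between the two should mostly come down to which BLAS each is linked against, not the language on top.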
It strikes me that it would be pretty trivial to implement such syntax in Smalltalk, seeing as whitespace is not an issue and you can implement operators however you please per class. Or, better yet, an "array-based language" like Nial would be a good fit as well (think APL/J but without the "that's just noise" readability issue).
The full course title was - deep breath - The Philosophical Foundations of Statistical Modeling and Causal Inference. It was an economics professor and a visiting philosophy professor teaching their research.
- All statistical models have assumptions. Even if a model looks like it fits the data, make sure the data doesn't violate those assumptions. If it does, the model doesn't fit.
- Causation can be inferred, with confidence, just by analyzing data.
Honestly, the econometrics stuff was presented poorly. What looked like pages from a book were put up on the projector (and in some cases, I think they were book pages), and the professor would just talk through the page. Picking up anything worthwhile from his lectures was hard - he knew the class had a varied background (some CS, some philosophy, some economics, even one person from marketing), but he still went faster than my prob/stat background could keep up with.
The causal inference stuff was presented better, but I think the subject matter is more intuitive in general. His (the philosophy professor's) math was graph theory, which I have a firmer grasp of.
Having programmed R on the job for some heavy statistics, I will say this: good for quick analyses but burdened by legacy functionality from the S+ days. I switched to Python/NumPy and rewrote all the R code I had, and could not be happier with the results. Of course, you have to create your own data structures if you want something like R's data frame, but at least you have a rich language to do that with.
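One way to roll a rough data-frame stand-in without leaving NumPy is a structured array: named, heterogeneous columns in a single table. This is just a sketch of one approach (the field names and values here are made up for illustration):

```python
import numpy as np

# A minimal stand-in for an R data frame: a structured array with
# named columns of different types.
frame = np.array(
    [("alice", 34, 1.70),
     ("bob",   29, 1.82)],
    dtype=[("name", "U10"), ("age", "i4"), ("height", "f8")],
)

print(frame["age"].mean())       # column access, like frame$age in R
print(frame[frame["age"] > 30])  # row filtering by a boolean mask
```

It lacks most of a data frame's conveniences (merges, factors, formula interfaces), but it covers named-column access and row subsetting.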
However, if you need to do anything systematic, do NOT use R; bugs are elusive and extremely tedious to debug.
I have used R for a number of minor projects. I really like its functional programming design, but its syntax can be obtuse. It's preferable to Matlab, IMHO. What do you consider to be the legacy aspects that weigh it down? What do you like better about NumPy? While I'm asking questions, have you tried Sage? If so, what do you think of it as a meta-package of mathematical software?
Also, here are interesting links I've found comparing R and similar statistical programs. Sorry, don't know much about NumPy's capabilities.
One of R's big benefits is the huge amount of statistical functionality; for example, numerous different quantile algorithms.
If you are working with a lot of heterogeneous data in R, it becomes a real headache. Merging data frames seems like it should work the way you think it should, but if one of your sets of keys (strings) is a 'factor' (what I am calling 'legacy S+ functionality'; I'm sure they're useful for many algorithms), you'll end up with garbage. There's a hack you can put in your code ('options(stringsAsFactors=FALSE)') which alleviates some of this, but in general I found aligning data to be a huge pain. If you're running regressions, this is pretty important.
Haven't tried Sage but have heard good things. NumPy is a good alternative because it's extremely well implemented and has consistent behavior across the board. Extensibility (with Fortran, Cython/Pyrex, C/C++) is clean and easy. Never thought I'd write Fortran 77 code, being born quite a few years after '77, but it's an easy way to speed up simple procedural algorithms 50x or more.
Factors are nothing but enums and are used to shrink data and speed processing. Plus, that matches what you typically want to happen in regressions: strings turn into (n-1) indicator variables. Otherwise, what is the meaning of using a string as an explanatory variable in a regression?
If you want to merge data frames that were created w/ different factors, perhaps the easiest thing to do is turn your factors into strings?
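The (n-1) indicator-variable encoding described above is easy to sketch in plain NumPy. Assumptions: the column values are made up, and dropping the alphabetically first level as the baseline mirrors R's default treatment coding for factors in regression formulas:

```python
import numpy as np

# Dummy-code a string column into (n-1) indicator variables, dropping
# the first level as the baseline.
colors = np.array(["red", "green", "blue", "green", "red"])
levels = np.unique(colors)          # sorted: ['blue' 'green' 'red']
baseline, rest = levels[0], levels[1:]

# One 0/1 indicator column per non-baseline level
indicators = np.column_stack([(colors == lvl).astype(int) for lvl in rest])
print(rest)        # ['green' 'red']
print(indicators)  # shape (5, 2); a 'blue' row is all zeros
```

A baseline observation gets all zeros, so n levels fit in n-1 columns without the collinearity a full one-hot encoding would introduce alongside an intercept.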
"We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet." - Ugh, SAS isn't about science, it's about administration.
I had a job interview where the guy told me my first task would be to code a neural network. I said that honestly he would be better off using one of the hundreds of open source versions already out there. He said "We don't use freeware here". That's when I knew the job wasn't for me.
Our largest and most profitable fund (around US$25B), which includes some of our brightest people and uses the Linux server infrastructure I design, also uses R and Python.
Trades are made based on the models created in these languages by our research teams. Our traders simply execute what our researchers' models tell them to, when the models tell them to.
So both R and Python are fundamental parts of our business without which our best products could not function.
We don't use freeware either, as none of R, Python, or Linux is freeware. They are licensed software, with OSD-compatible licenses.
It's a fair question. Everyone is aware that the traders, who are normally a big deal, are slaves to the research gents and their algorithms.
My guess is that, largely, they can be. The actual trades could be automated (this would become increasingly necessary with rapid-fire [millisecond] trading, which we don't do now but could in the future). The meatware oversight could be consolidated to a smaller group of individuals.
Well, for starters, the video screens on the seatbacks of 777s (and other aircraft) are running some flavor of Linux. But that's not a counterexample to her argument, per se, since the video screens aren't critical to the plane's operation.
The counterargument is that, if I am to fly, I would like the ability to check for myself that the physics equations used to design the plane are correct.
I meant SAP, but that or SAS is about 'business intelligence', which means running an organisation - staffing, payroll, etc. - but certainly not jet engines. My aunt programs in SAP for a university. So SAP determines how businesses are run, and is paid for the privilege.
SAP and SAS are completely different. SAS is for science (well, for statistical programming in general); SAS has nothing to do with SAP, which is for back-office stuff. The article and the quote are about statistics packages like SAS.
“I think it addresses a niche market for high-end data analysts that want free, readily available code," said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
Wow, is that a FUD statement if I ever heard one. Pretty cynical stuff from Anne H. Milley.
I don't get your comment. You mean mission-critical systems should not use open source software? Google runs on free software; what's wrong with using free software?
But the person he's quoting, like you, confuses Open Source software / Free Software with freeware, which is generally considered to be unlicensed or public domain software.
At least in the physics world people just post everything to arXiv and their personal websites. This way you only need to pay attention to a couple of places to keep up to date.
As I said in another thread, most computer science papers are on authors' webpages. Then this one comes along - but it looks like the authors are stat people. I don't know what their culture is - and what kind of copyright agreements they sign.
I can access the paper because my school subscribes. If anyone wants to read it, figure out how to email me and I'll send a copy.
Even in certain areas of computer science (specifically, machine learning) nearly all recent articles are open. I find in most cases that both statisticians and computer scientists tend to post links to pdfs or ps on their websites as well.
Statistics and Lisp have a long, long history together. Before R took off, many academic statisticians were using Lisp-Stat, Luke Tierney's tool from UMN. Luke has now been spending time helping R, and it shows in how it is starting to adopt more of a Lisp mentality. The referenced paper also shows the Lisp background in R's (and S's) history. R syntax can be confusing enough; Lisp has a steep enough learning curve that most analysts just don't have the time to invest in it unless they were trained in it early on.
In the life science space, R dominates research informatics. A large chunk of molecular profiling methods and techniques use R, or quite often the Bioconductor package, http://www.bioconductor.org/ (from Gentleman's group). Most commercial bioinformatics apps also implement a number of methods using R and provide the ability to implement R-based classifiers, etc.
In the clinical space, it's all SAS. Pretty much the de facto standard.
Is it just me, or was this article not very well written? I felt it was all over the place. It brings up S, then mentions that S isn't open source. It mentions open source and brings up things like apache, the web, and Microsoft, I don't see how it relates much to R. Though, I'm probably spoiled due to the usual quality of news I get here.
Interesting. My girlfriend is a statistician for the WHO and they definitely still use SAS, at least in her area (calculating global burden of disease). I'm going to ask her if anyone there uses R.
My 2 cents: NumPy + SciPy + Matplotlib and other packages, which you can download together in a convenient package from Enthought. That Enthought distribution also comes with IPython, which is REALLY nice. I checked out R and Sage, and am still sometimes forced into Matlab, but you just can't beat a programming language (Python) which can be used OUTSIDE of whatever problem space you happen to be working on.
As earl said, it's free! But that's not the most important aspect (although nice for a student).
It's an actual programming language. Like Python, you can use both non-OOP and OOP styles. You can define your own packages and namespaces (unlike MATLAB where there's just one big namespace). There are hundreds of contributed packages, you don't need to buy separate toolboxes. Not to forget, it's GPL, you can look at the source and learn a few things from people who know what they're doing (to name a few: Brian Ripley, Terry Therneau, Douglas Bates, Bill Venables).
On the other hand, the learning curve can be a bit steep at the beginning, but I feel it's definitely worth it. It's not perfect - I stub my toe on obscure language features from time to time - but to paraphrase Winston Churchill, R is the worst form of statistical languages except all those other forms that have been tried from time to time...
The power of R (speaking as a very heavy user who has deployed it in multiple production environments and been using it for 5 years) is that it makes it very fast, easy, and natural to do statistics. It also has the nicest data structure I've ever seen for manipulating table data, called a data frame -- I'll elaborate, if anybody cares. In addition, it encourages people to create packages to extend the functionality. There are extant packages for almost every analysis you can think of -- time series, k-means and other clustering techniques, Box-Cox-style analyses, regular maximum likelihood style GLM, hierarchical regression, HB, etc. Further, the amount of knowledge and the open source nature of the language, base, and packages encourage additional development and widespread adoption.
See:
http://cran.r-project.org/
and
http://cran.r-project.org/web/views/
^ is task views. Explore it -- it's well worth your time.
The downsides are, well, it's slow for large data sets and debugging can be difficult. But as a desktop / rapid development platform for statistics it is without peer, IMO.
PS -- unlike Matlab, which often costs thousands of dollars (and the Statistics Toolbox, more thousands), R is free. This is pretty important on its own: instead of $5k per server, workstation, and home PC, install it on any Linux, Mac, or Windows box you have and get to work for $0.00.