I am baffled that a programming language is getting coverage in the NYT. I'd like to figure out what, exactly, it is about this one that merits mainstream coverage, but I'm afraid I've been at the office too long and I think my perspective on the matter is permanently skewed.
As an aside, I used R briefly for an econometrics/philosophy course I took. I recognized immediately that it was a powerful, functional language. What I wonder, though, is whether the scientific programming libraries in Python might eventually be a better environment. Surely there must be some R users here who can comment.
There are a number of packages for relatively obscure statistical techniques (say, nonmetric multidimensional scaling) that are available for R, but not for Python (that I have seen).
I also believe R is easier for statisticians who cannot program and do not want to learn how. There are many advanced techniques that are one-liners.
You've hit the nail on the head there. A programmer looks at R (or MATLAB) and says "this sucks as a programming language". An engineer or a scientist or a statistician says "so what, I'm not a programmer, nor do I want to be one, I've got actual work to do".
To some people C is just the 3rd letter of the alphabet. To others, it’s the grade they got in journalism, a measure of bust size or what pirates in movies sail across.
Having only very minimal experience with R, I get the impression that while Python is fantastic for programmers, and the tools are probably close to or better than R's (no idea about library support), it's still a general-purpose programming language, and requires one to think and structure things like a programmer.
R, at least in the beginning, comes across more like a powerful set of Excel formulas, so I think a non-programmer might pick it up faster without having good programming form.
(Not to say one is more powerful than the other, or that R programmers aren't as good as Python programmers, etc, I just think it's a difference in community viewpoints)
R is not at all like Excel, in form or function. R is actually a complete programming language, much more similar to Python in that respect. It supports two types of OOP (S3 and S4) and has hundreds (if not more) of contributed packages that do things like survival analysis, 2D/3D plotting, bioinformatics, machine learning, Bayesian statistics, econometrics, numerical integration, spatial statistics, and more (http://cran.r-project.org/web/packages/).
One major advantage over Python is that it's vectorised, so you can say things like A + 1 when A is a vector (or matrix). A bit like Matlab in that regard, only that it doesn't suck as much as the Matlab "language".
"One major advantage over Python is that it's vectorised, so you can say things like A + 1 when A is a vector (or matrix). A bit like Matlab in that regard, only that it doesn't suck as much as the Matlab 'language'."
This is not very accurate. NumPy / SciPy provide vectorized matrix libraries, significantly faster than both R and Matlab for matrix operations. No argument though that Matlab as a language truly sucks =)
Well, I don't see how NumPy can be significantly faster than R, as R basically passes most linear algebra down to BLAS and LAPACK. NumPy does the same, no?
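For what it's worth, the NumPy side of this exchange is easy to sketch: elementwise arithmetic looks just like R's vectorised `A + 1`, and dense matrix products are dispatched to whatever BLAS NumPy was compiled against, same as R. A minimal illustration (array values are made up):

```python
import numpy as np

# Vectorized arithmetic, analogous to R's `A + 1` on a vector or matrix
A = np.array([1.0, 2.0, 3.0])
print(A + 1)        # elementwise add: [2. 3. 4.]

M = np.arange(4.0).reshape(2, 2)
print(M * 2)        # elementwise scaling, no explicit loop

# Matrix multiplication is handed down to the BLAS NumPy was built
# against, much as R passes dense linear algebra to BLAS/LAPACK
print(M.dot(M))     # [[ 2.  3.] [ 6. 11.]]
```

So for plain matrix operations the speed difference between the two should mostly come down to which BLAS each is linked against, not the language on top.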
It strikes me that it would be pretty trivial to implement such syntax in Smalltalk, seeing as whitespace is not an issue and you can implement operators however you please per class. Or, better yet, an "array-based language" like Nial would be a good fit as well (think APL/J but without the "that's just noise" readability issue).
The full course title was - deep breath - The Philosophical Foundations of Statistical Modeling and Causal Inference. It was an economics professor and a visiting philosophy professor teaching their research.
- All statistical models have assumptions. Even if a model looks like it fits the data, make sure the data doesn't violate those assumptions. If it does, the model doesn't fit.
- Causation can be inferred, with confidence, just by analyzing data.
Honestly, the econometrics stuff was presented poorly. What looked like pages from a book were put up on the projector (and in some cases, I think they were book pages), and the professor would just talk through the page. Picking up anything worthwhile from his lectures was hard - he knew the class had a varied background (some CS, some philosophy, some economics, even one person from marketing), but he still went faster than my prob/stat background could keep up with.
The causal inference stuff was presented better, but I think the subject matter is more intuitive in general. His (the philosophy professor's) math was graph theory, which I have a firmer grasp of.
Having programmed R on the job for some heavy statistics, I will say this: good for quick analyses but burdened by legacy functionality from the S+ days. I switched to Python/NumPy and rewrote all the R code I had, and could not be happier with the results. Of course, you have to create your own data structures if you want something like R's data frame, but at least you have a rich language to do that with.
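One way to roll a rough data-frame stand-in without leaving NumPy is a structured array: named, heterogeneous columns in a single table. This is just a sketch of one approach (the field names and values here are made up for illustration):

```python
import numpy as np

# A minimal stand-in for an R data frame: a structured array with
# named columns of different types.
frame = np.array(
    [("alice", 34, 1.70),
     ("bob",   29, 1.82)],
    dtype=[("name", "U10"), ("age", "i4"), ("height", "f8")],
)

print(frame["age"].mean())       # column access, like frame$age in R
print(frame[frame["age"] > 30])  # row filtering by a boolean mask
```

It lacks most of a data frame's conveniences (merges, factors, formula interfaces), but it covers named-column access and row subsetting.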
However, if you need to do anything systematic, do NOT use R; bugs are elusive and extremely tedious to debug.
I have used R for a number of minor projects. I really like its functional programming design, but its syntax can be obtuse. It's preferable to Matlab, IMHO. What do you consider to be the legacy aspects that weigh it down? What do you like better about NumPy? While I'm asking questions, have you tried Sage? If so, what do you think of it as a meta-package of mathematical software?
Also, here are interesting links I've found comparing R and similar statistical programs. Sorry, don't know much about NumPy's capabilities.
One of R's big benefits is the huge amount of statistical functionality; for example, numerous different quantile algorithms.
If you are working with a lot of heterogeneous data in R, it becomes a real headache. Merging data frames seems like it should work the way you think it should, but if one of your sets of keys (strings) is a 'factor' (what I am calling 'legacy S+ functionality'; I'm sure they're useful for many algorithms), you'll end up with garbage. There's a hack you can put in your code ('options(stringsAsFactors=FALSE)') which alleviates some of this, but in general I found aligning data to be a huge pain. If you're running regressions, this is pretty important.
Haven't tried Sage but have heard good things. NumPy is a good alternative because it's extremely well implemented and has consistent behavior across the board. Extensibility (with Fortran, Cython/Pyrex, C/C++) is clean and easy. Never thought I'd write Fortran 77 code, being born quite a few years after '77, but it's an easy way to speed up simple procedural algorithms 50x or more.
Factors are nothing but enums and are used to shrink data and speed processing. Plus, that matches what you typically want to happen in regressions: strings turn into (n-1) indicator variables. Otherwise, what is the meaning of using a string as an explanatory variable in a regression?
If you want to merge data frames that were created w/ different factors, perhaps the easiest thing to do is turn your factors into strings?
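The (n-1) indicator-variable encoding described above is easy to sketch in plain NumPy. Assumptions: the column values are made up, and dropping the alphabetically first level as the baseline mirrors R's default treatment coding for factors in regression formulas:

```python
import numpy as np

# Dummy-code a string column into (n-1) indicator variables, dropping
# the first level as the baseline.
colors = np.array(["red", "green", "blue", "green", "red"])
levels = np.unique(colors)          # sorted: ['blue' 'green' 'red']
baseline, rest = levels[0], levels[1:]

# One 0/1 indicator column per non-baseline level
indicators = np.column_stack([(colors == lvl).astype(int) for lvl in rest])
print(rest)        # ['green' 'red']
print(indicators)  # shape (5, 2); a 'blue' row is all zeros
```

A baseline observation gets all zeros, so n levels fit in n-1 columns without the collinearity a full one-hot encoding would introduce alongside an intercept.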
"We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet." - Ugh, SAS isn't about science, it's about administration.
I had a job interview where the guy told me my first task would be to code a neural network. I said that honestly he would be better off using one of the hundreds of open source versions already out there. He said "We don't use freeware here". That's when I knew the job wasn't for me.
Our largest and most profitable fund (around US$25B), which includes some of our brightest people and uses the Linux server infrastructure I design, also uses R and Python.
Trades are made based on the models created in these languages by our research teams. Our traders simply execute what our researchers' models tell them to, when the models tell them to.
So both R and Python are fundamental parts of our business without which our best products could not function.
We don't use freeware either, as none of R, Python, or Linux is freeware. They are licensed software, with OSD-compatible licenses.
It's a fair question. Everyone is aware that the traders, who are normally a big deal, are slaves to the research gents and their algorithms.
My guess is that, largely, they can be. The actual trades could be automated (this would become increasingly necessary with rapid-fire [millisecond] trading, which we don't do now but could in the future). The meatware oversight could be consolidated to a smaller group of individuals.
Well, for starters, the video screens on the seatbacks of 777s (and other aircraft) are running some flavor of Linux. But that's not a counterexample to her argument, per se, since the video screens aren't critical to the plane's operation.
The counterargument is that, if I am to fly, I would like the ability to check for myself that the physics equations used to design the plane are correct.
I meant SAP, but that or SAS is about 'business intelligence', which means running an organisation - staffing, payroll, etc. - but certainly not jet engines. My aunt programs in SAP for a university. So SAP determines how businesses are run, and is paid for the privilege.
SAP and SAS are completely different. SAS is for science (well, for statistical programming in general); SAS has nothing to do with SAP, which is for back-office stuff. The article and the quote are about statistics packages like SAS.
“I think it addresses a niche market for high-end data analysts that want free, readily available code," said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
Wow, is that a FUD statement if I ever heard one. Pretty cynical stuff from Anne H. Milley.
I don't get your comment. You mean mission-critical systems should not use open source software? Google runs on free software; what's wrong with using free software?
But the person he's quoting, like you, confuses Open Source software / Free Software with freeware, which is generally considered to be unlicensed or public domain software.
At least in the physics world people just post everything to arXiv and their personal websites. This way you only need to pay attention to a couple of places to keep up to date.
As I said in another thread, most computer science papers are on authors' webpages. Then this one comes along - but it looks like the authors are stat people. I don't know what their culture is - and what kind of copyright agreements they sign.
I can access the paper because my school subscribes. If anyone wants to read it, figure out how to email me and I'll send a copy.
Even in certain areas of computer science (specifically, machine learning) nearly all recent articles are open. I find in most cases that both statisticians and computer scientists tend to post links to pdfs or ps on their websites as well.
Statistics and Lisp have a long, long history together. Before R took off, many academic statisticians were using Lisp-Stat, Luke Tierney's tool from UMN. Luke has now been spending time helping R, and it shows in how it is starting to adopt more of a Lisp mentality. The referenced paper also shows the Lisp background in R's (and S's) history. R syntax can be confusing enough; Lisp has a steep enough learning curve that most analysts just don't have the time to invest in it unless they were trained in it early on.
In the life science space, R dominates research informatics. A large chunk of molecular profiling methods and techniques use R, or quite often the Bioconductor package, http://www.bioconductor.org/ (from Gentleman's group). Most commercial bioinformatics apps also implement a number of methods using R and provide the ability to implement R-based classifiers, etc.
In the clinical space, it's all SAS. Pretty much the de facto standard.
Is it just me, or was this article not very well written? I felt it was all over the place. It brings up S, then mentions that S isn't open source. It mentions open source and brings up things like apache, the web, and Microsoft, I don't see how it relates much to R. Though, I'm probably spoiled due to the usual quality of news I get here.
Interesting. My girlfriend is a statistician for the WHO and they definitely still use SAS, at least in her area (calculating global burden of disease). I'm going to ask her if anyone there uses R.
My 2 cents: NumPy + SciPy + Matplotlib and other packages, which you can download together in a convenient package from Enthought. That Enthought distribution also comes with IPython, which is REALLY nice. I checked out R and Sage, and am still sometimes forced into Matlab, but you just can't beat a programming language (Python) which can be used OUTSIDE of whatever problem space you happen to be working on.
As earl said, it's free! But that's not the most important aspect (although nice for a student).
It's an actual programming language. Like Python, you can use both non-OOP and OOP styles. You can define your own packages and namespaces (unlike MATLAB where there's just one big namespace). There are hundreds of contributed packages, you don't need to buy separate toolboxes. Not to forget, it's GPL, you can look at the source and learn a few things from people who know what they're doing (to name a few: Brian Ripley, Terry Therneau, Douglas Bates, Bill Venables).
On the other hand, the learning curve can be a bit steep at the beginning, but I feel it's definitely worth it. It's not perfect - I stub my toe on obscure language features from time to time - but to paraphrase Winston Churchill, R is the worst form of statistical languages except all those other forms that have been tried from time to time...
The power of R (speaking as a very heavy user who has deployed it in multiple production environments and been using it for 5 years) is that it makes it very fast, easy, and natural to do statistics. It also has the nicest data structure I've ever seen for manipulating table data, called a data frame -- I'll elaborate, if anybody cares. In addition, it encourages people to create packages to extend the functionality. There are extant packages for almost every analysis you can think of -- time series, k-means and other clustering techniques, Box-Cox-style analyses, regular maximum likelihood style GLM, hierarchical regression, HB, etc. Further, the amount of knowledge and the open source nature of the language, base, and packages encourage additional development and widespread adoption.
See:
http://cran.r-project.org/
and
http://cran.r-project.org/web/views/
^ is task views. Explore it -- it's well worth your time.
The downsides are, well, it's slow for large data sets and debugging can be difficult. But as a desktop / rapid development platform for statistics it is without peer, IMO.
PS -- unlike Matlab, which often costs thousands of dollars (and the Statistics Toolbox, more thousands), R is free. This is pretty important on its own: instead of $5k per server, workstation, and home PC, install it on any Linux, Mac, or Windows box you have and get to work for $0.00.