Partial Results of the 2011 Survey
Copyright © 2011 Rexer Analytics
The following is an excerpt from the 5th Annual Survey (2011) conducted by Rexer Analytics. This is a partial list of insights from R users. The full survey results can be found by following the survey link above. The following is entirely the work of Rexer Analytics. The surveys they conduct contain more categories, such as Overcoming Data Mining Challenges, Analytic Success Measurement, and The Positive Impact of Data Mining.
Over the years, respondents to the 2007-2011 Data Miner Surveys have shown increasing use of R. In the 5th Annual Survey (2011) we asked R users to tell us more about their use of R. The question asked, “If you use R, please tell us more about your use of R. For example, tell us why you have chosen to use R, why you use the R interface you identified in the previous question, the pros and cons of R, or tell us how you use R in conjunction with other tools.” A total of 225 R users shared information about their use of R, providing an enormous wealth of useful and detailed information. Below are the comments they shared.
- Best variety of algorithms available, biggest mindshare in the online data mining community, free/open source. The previous con of the 32-bit in-memory data limitation was removed with the 64-bit release. Still suffers somewhat compared to other solutions in handling big data. The StatET plugin gives the best IDE for R, uses the widely adopted Eclipse framework, and is available for free. The RapidMiner add-in that supports R is extremely useful, and increasingly used as well.
- The main reason for selecting R, back in the late 1990s, was that it had the algorithms I needed readily available, and it was free of charge. After getting familiar with R, I have never seen a reason to exchange it for anything else. R is open source, and runs on many different system architectures, so distributing code and results is easy. Compared to some of the latest commercial software I’ve evaluated, R is sluggish for certain tasks, and can’t handle very large datasets (mainly because I do not have a 64-bit machine to work with). On top of that, to be really productive with R, one needs to learn other languages, e.g., SQL, but that’s just how things are. Besides, knowledge of those other languages is needed anyway.
- I’ve migrated to R as my primary platform. The fact that it’s free, robust, comprehensive, extensible, and open more than offsets its shortcomings – and those are disappearing over time anyway. I’m using Revolution’s IDE because it improves my productivity. R has a significant learning curve, but once mastered, it is very elegant. R’s shortcomings are its in-memory architecture and immature data manipulation operations (i.e., lack of a built-in SQL engine and inability to handle very large datasets). Nevertheless, I use R for all analytic and modeling tasks once I get my data properly prepared. I usually import flat files in CSV format, but I am now using Revolution’s XDF format more frequently.
- Why I use R: Initial adoption was due to the availability of the randomForest package, which was the best option for random forest research, as rated on speed, modification possibility, model information and tweaking. Since adopting R, I have further derived satisfaction from the availability of open source R packages for the newest developments in statistical learning (e.g. conditional inference forests, boosting algorithms, etc.). I use the R command line interface, as well as scripting and function creation, as this gives me maximum modification ability. I generally stay away from GUIs, as I find them generally restrictive to model development, application and analysis. I use R in conjunction with Matlab, through the statconnDCOM connection software, primarily due to the very slow performance of the TreeBagger algorithm in Matlab. Pros of R: Availability of state-of-the-art algorithms, extensive control of model development, access to model information, relatively easy access from Matlab. Cons of R: Lack of friendly editor (such as Matlab’s Mlint; although I am considering TinnR, and have tried Revolution R); less detailed support than Matlab.
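The randomForest workflow this respondent describes might be sketched as follows. This is a minimal illustration, not the respondent's actual code: it assumes the randomForest package is installed and uses the built-in iris dataset as stand-in data.

```r
# Minimal sketch of a random forest fit in R (assumes the randomForest
# package is installed; iris ships with base R).
library(randomForest)

set.seed(42)                                    # reproducible tree sampling
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 100,                # number of trees in the forest
                    importance = TRUE)          # track variable importance

print(fit)            # shows the OOB error estimate and confusion matrix
importance(fit)       # per-variable importance measures ("model information")
```

Working from the command line like this, rather than a GUI, gives the full access to model internals (`fit$confusion`, `fit$votes`, etc.) that the respondent values.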
- My uses of R: (1) I employ the R-Extension in the RapidMiner interface; (2) I use R for its graphing and data munging capability; (3) I use R for its ability to scrape data from different sources (FTP, ODBC) and implement frequent and automated tasks. Pros: highly customizable; great potential for growth and improved efficiencies; fantastic selection of packages and versatility; growing number of video tutorials and blogs where users are happy to share their code. Cons: x64 not fully integrated yet; steep learning curve; limited number of coding and output examples in package vignettes; difficult to set up properly (Rprofile.site examples are scarce online); memory constraints with i386 32-bit; output and reporting design limitations for the business environment.
- I have about 12 years’ experience with R. It’s free, and the available libraries are robust and efficient. I mostly work with the command line, but I am moving towards RStudio because it’s available both as a desktop application and as a browser-based client-server tool set. I occasionally use Rcmdr. The only other tool I use as heavily as R is Perl – again, lots of experience with it, it’s free, and there are thousands of available library packages.
- The analytics department at my company is very new, so we haven’t yet decided which analytics tool we’ll be using. R is a great interim solution as it is free and compatible with a reasonable number of commercial analytics tools. The reason I use the standard R GUI and script editor is simply that I haven’t invested the time in trying out different GUIs, and I haven’t really had any reason to. The advantage of using R (at least compared to SAS, which was the analytics tool at my old job) is mainly that you have much more control over your data and your algorithms. The main problem with R is that it can’t really handle the data sets we need to analyze. Furthermore, your scripts can tend to be a bit messy if you are not sure what kind of analysis or models you are going to use.
- I use R for the diversity of its algorithms and packages. I use Emacs for other tasks, and it’s natural to use it to run R, and Splus for that matter. I usually do data preparation in Splus; if the technique I want to use is available in Splus, I will do all the analysis in Splus. Otherwise I’ll export the data to R, do the analysis in R, and export the results to Splus, where I’ll prepare tables and graphs for presentations of the model(s). The main drawback to R, in my opinion, is that R loads into live memory the entire workspace it is linked to, which is a big waste of time and memory and makes it difficult to use R in a multi-user environment where typical projects consist of several very large data sets.
- We continue to evaluate R. As yet it doesn’t offer the ease of use and ability to deploy models that are required for use by our internationally distributed modeling team. “System” maintenance of R is too high a requirement at the moment and the enormous flexibility and range of tools it offers is offset by data handling limitations (on 32 bit systems) and difficulty of standardizing quick deployment solutions into our environment. But we expect to continue evaluation and training on R and other open source tools. We do, for instance, make extensive use of open source ETL tools.
- I use R extensively for a variety of tasks and I find the R GUI the most flexible way to use it. On occasion I’ve used JGR and Deducer, but I’ve generally found it more convenient to use the GUI. R’s strengths are its support network and the range of packages available for it; its weaknesses are its ability to handle very large datasets and, on occasion, its speed. More recently, with large or broad datasets I’ve been using tools such as Tiberius or Eureqa to identify important variables and then building models based on the identified variables.
- I’m using R in order to know why R is “buzzing” in analytical areas and to discover some new algorithms. R has many problems with big data, and I don’t really believe that Revolution can effectively support that. The R language is not mature for production, but it is really efficient for research: for my personal research, I also use SAS/IML programming (which is, for me, the real equivalent of R, not SAS/STAT). I’m not against R; it’s a perfect tool for learning statistics, but not really for data mining: don’t forget that many techniques used in data mining come from Operations Research, in convergence with statistics. A good language, but not really conceived for professional efficiency.
- R is used for the whole data loading process (importing, cleaning, profiling, data preparation) and for model building, as well as for creating graphical results through other technologies like Python. We also use the PL/R procedural language to do in-database analytics and plotting.
- Utilize R heavily for survey research sampling and analysis and political data mining. The R TextMate bundle is fantastic although RStudio is quickly becoming a favorite as well. Use heavily in conjunction with MySQL databases.
- I use R in conjunction with Matlab mostly, programming my personalized algorithms in Matlab and using R for running statistical tests, ROC curves, and other simple statistical models. I do this since I feel more comfortable with this setting. R is a very good tool for statistical analysis, basically because there are many packages covering most statistical activities, but I still find Matlab easier to code in.
- I greatly prefer to use R and do use it when working on more “research”-type projects, versus models that will be reviewed and put into production. Because of the type of work that I do, SAS is the main software that everyone is familiar with and the most popular one with a strong license. Our organization needs to be able to share across departments and with vendors and governmental divisions. We can’t use R as much as I would like because of how open the software is – good for sharing code, bad for assuring regulators that data is safe.
- Main reasons for use: 1) A strong and flexible programming language. 2) No cost, allowing me to also have it on my personal computer so that I can test at home the things that I later use at work. I use RODBC to get the data from the server and let the server do some of the data manipulation, but control it from within R. I have also started to use RExcel, with the goal of using it as a method to deploy models to analysts more familiar with Excel than R.
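The RODBC pattern this respondent describes — pushing heavy data manipulation to the database server and pulling only the result into R — might look like the following sketch. The DSN name, table, and column names here are illustrative assumptions, not details from the survey response; the block only attempts a connection when RODBC is available and the DSN resolves.

```r
# Hypothetical sketch: the DSN "sales_dsn" and the orders table are
# illustrative assumptions, not taken from the survey response.
monthly_query <- "
    SELECT order_month, SUM(amount) AS total
    FROM   orders
    GROUP  BY order_month"      # the aggregation runs on the server

if (requireNamespace("RODBC", quietly = TRUE)) {
  ch <- RODBC::odbcConnect("sales_dsn")  # connect via an ODBC data source
  if (inherits(ch, "RODBC")) {           # proceed only if the DSN resolved
    # Only the small summary result crosses the wire into R.
    monthly <- RODBC::sqlQuery(ch, monthly_query)
    RODBC::odbcClose(ch)
  }
}
```

Controlling the server-side work from within R like this keeps the whole pipeline scriptable while sidestepping R's in-memory limits on the raw data.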
- Personally I find R easier to use than SAS, mostly because I am not constrained in getting where I want to go; SAS has a canned approach. I see using GUIs as a “sign of weakness” and as preventing understanding of the language at its core. I have not found Rattle to be particularly helpful. I have also tried JGR and SciViews and found I could not surmount the installation learning curve; their documentation did not produce a working environment for me.
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blog posts on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books including:
- Operations Research using Open-Source Tools
- Discrete Event simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices