Several months ago I posted an article called What is Predictive Analytics? that described the field. In this article, I want to talk about the skills and tools one needs to perform predictive analytics.
I am always at a loss in describing the skills of analytics, for there are many. I am working on another book about analytics that takes a different approach than Predictive Analytics using R, though I am using material from three of its chapters. The new book is an operations research approach to analytics, covering a different set of methods, skills and tools. Combined, the two books are over 1000 pages, so perhaps you can see my dilemma. Hence, this article is going to touch on only the very basics.
What is Predictive Analytics?
In case you missed my previous article, here is a high-level description. Predictive analytics (sometimes used synonymously with predictive modeling) is not synonymous with statistics. It often requires modification of functional forms and use of ad hoc procedures, which makes it a part of data science to some degree. It does, however, encompass a variety of statistical techniques for modeling, incorporate machine learning, and utilize data mining to analyze current and historical facts and make predictions about the future. Beyond the statistical aspect lies a mathematical modeling and programming dimension, which includes linear optimization and simulation, for example. Yet analytics goes even farther by defining the business case and requirements, which are not covered here. I discussed those in How to Build a Model.
Statistical Modeling & Tools
This assumes that you already know the basics of parametric statistics and a little bit of nonparametric statistics. If you are not familiar with these terms, then you are missing a prerequisite. However, this is a gap you can fill with online courses from Coursera. Though I have never taken one, I have many colleagues who swear by them.
By statistical modeling I am referring to subject matter beyond what would be covered in a statistics course for engineering or business. Here we are concerned with linear regression, logistic regression, analysis of variance (ANOVA), multivariate regression and cluster analysis, as well as goodness-of-fit testing, hypothesis testing, experimental design and my friends Kolmogorov and Smirnov. Mathematical statistics could be a plus, as it will take you into the underlying theory.
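To make the simplest of these methods concrete, here is a minimal sketch of ordinary least squares for a single predictor, written in plain Python. The function names `fit_line` and `r_squared` and the small data set are illustrative assumptions, not material from any of the tools discussed in this article. In practice you would use the regression routines in R, SAS, or a Python statistics library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up illustration data: advertising spend versus sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(ad_spend, sales)
fit_quality = r_squared(ad_spend, sales, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={fit_quality:.4f}")
```

The point of the sketch is only to show what the fitting step computes. Diagnostics such as residual analysis and hypothesis tests on the coefficients are where the statistical skill described above comes in.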
The tools one could use are myriad, and are often the tools our company or customer has already deployed. SAS modeling products are well-established tools of the trade. These include SAS Statistics, SAS Enterprise Guide, SAS Enterprise Miner, and others. IBM made its mark on the market with the purchase of SPSS, whose Clementine workbench was repackaged as IBM SPSS Modeler. There are other commercial products like Tableau. I have to mention Excel here, for it is all many will have to work with. But you have to go beyond the basics and into its data tools, statistical analysis tools and perhaps its linear programming Solver, plus be able to construct pivot tables, and so on.
Today, there is a multitude of open-source tools that have become popular, including R and its GUI, RStudio; the S programming package; and the Python programming language (the most used language in 2014). R, for example, is every bit as good as its nemesis SAS, but I have yet to get it to handle the enormous amount of data that I work with in SAS. Part of this is due to server capacity and allocation, so I really don’t know how much data R can handle.
A recent tool I discovered is called KNIME. KNIME, the Konstanz Information Miner, is an open-source data analytics, reporting and integration platform. It integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows assembly of nodes for data preprocessing (ETL: Extraction, Transformation, Loading), for modeling, and for data analysis and visualization. It reminds me a little of SAS Enterprise Miner. Since 2006, KNIME has been used in pharmaceutical research, but it is also used in other areas like CRM customer data analysis, business intelligence and financial data analysis.
For the foregoing methods, data is necessary, and it will probably not be handed to you on a silver platter ready for consumption. It may be “dirty”, in the wrong format, incomplete, or just not right. Since this is where you may spend an abundant amount of time, you need skill with tools that process data. Even if this is a secondary task (it has not been for me), you will probably need to know Structured Query Language (SQL) and something about the structure of databases.
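As a small illustration of the kind of SQL this involves, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are hypothetical, invented only to show a join and an aggregation. Your own database will have a different schema and most likely a different engine.

```python
import sqlite3

# In-memory database with two hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0), (4, 3, NULL);
    """
)

# Join the tables and summarize by region. COUNT(o.amount) skips NULLs,
# so comparing it with COUNT(o.id) exposes missing values early.
rows = conn.execute(
    """
    SELECT c.region,
           COUNT(o.id)     AS n_orders,
           COUNT(o.amount) AS n_with_amount,
           SUM(o.amount)   AS total_amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

for region, n_orders, n_with_amount, total_amount in rows:
    print(region, n_orders, n_with_amount, total_amount)

conn.close()
```

Knowing how tables relate to one another (here, `orders.customer_id` pointing at `customers.id`) is the "structure of databases" part. The query itself is the easy part once that is understood.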
If you do not have clean, complete, and reliable data to model with, you are doomed. You may have to remove inconsistencies, impute missing values, and so on. Then you have to analyze the data, perform data reduction, and integrate the data so that it is ready for use. Modeling with “bad” data results in a “bad” model!
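A minimal sketch of two of these cleaning steps, removing exact duplicate records and imputing missing values with the median, might look like the following. The function `clean_records`, the `income` field and the sample records are assumptions made up for this example. Median imputation is only one of many possible strategies and is not always the right one.

```python
from statistics import median


def clean_records(records, field):
    """Drop exact duplicate rows, then fill missing `field` values with the median.

    Returns new dictionaries; each gets a `<field>_imputed` flag so the model
    builder can see which values were filled in rather than observed.
    """
    seen = set()
    unique = []
    for row in records:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    observed = [row[field] for row in unique if row.get(field) is not None]
    if not observed:
        raise ValueError(f"no observed values for {field!r}")
    fill_value = median(observed)

    for row in unique:
        missing = row.get(field) is None
        if missing:
            row[field] = fill_value
        row[field + "_imputed"] = missing
    return unique


# Hypothetical raw data: one duplicate row and one missing income.
raw = [
    {"id": 1, "income": 52000},
    {"id": 2, "income": None},
    {"id": 1, "income": 52000},
    {"id": 3, "income": 61000},
    {"id": 4, "income": 48000},
]

cleaned = clean_records(raw, "income")
for row in cleaned:
    print(row)
```

Keeping the imputation flag is a deliberate choice in this sketch: it preserves the fact that a value was missing, which can itself be predictive.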
Databases are plentiful and come in the form of Oracle Exadata, Teradata, Microsoft SQL Server Parallel Data Warehouse, IBM Netezza, and Vertica. The Greenplum Database builds on the foundations of the open-source database PostgreSQL. Or you may need to use a data platform like Hadoop. Also, Excel can store “small amounts” of data across multiple worksheets and has built-in data processing tools.
For mathematical modeling there are, again, prerequisites, such as differential and integral calculus and linear algebra. Multivariate calculus is a plus, particularly if you’ll be building models involving differential equations and nonlinear optimization. The skills you need to acquire beyond the basics include mathematical programming: linear, integer, mixed, and nonlinear. Goal programming, game theory, Markov chains, and queuing theory, to name a few, may be required. Mathematical studies in real and complex analysis and linear vector spaces, as well as abstract algebraic concepts like groups, fields and rings, can reveal the foundational theory.
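To give a flavor of one of these topics, the sketch below finds the long-run (steady-state) distribution of a small Markov chain by repeatedly applying the transition matrix. The function `steady_state` and the two-state "active/lapsed customer" matrix are hypothetical, chosen only for illustration. The method assumes the chain actually converges to a single steady state, which is not true of every chain.

```python
def steady_state(P, tol=1e-12, max_iter=10_000):
    """Approximate the stationary distribution of transition matrix P.

    P[i][j] is the probability of moving from state i to state j.
    Starts from a uniform distribution and iterates pi <- pi * P.
    """
    n = len(P)
    for row in P:
        if len(row) != n or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("P must be square with rows summing to 1")

    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("did not converge; the chain may be periodic")


# Hypothetical chain: state 0 = active customer, state 1 = lapsed customer.
P = [
    [0.9, 0.1],  # an active customer stays active with probability 0.9
    [0.5, 0.5],  # a lapsed customer returns with probability 0.5
]

pi = steady_state(P)
print([round(p, 4) for p in pi])
```

For this matrix the balance equation 0.1 × π₀ = 0.5 × π₁ gives π = (5/6, 1/6) by hand, which is a useful check on the iteration.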
Simulation modeling, including Monte Carlo, discrete-time and continuous-time simulation, and discrete event simulation, can be applied in analytics. I have not seen this as common practice in business analytics, but it certainly has its place. These models may rely heavily upon queuing theory, Markov chains, inventory theory and network theory.
The corporate mainstay is the powerhouse combination of MATLAB and Simulink. MATLAB stands for MATrix LABoratory (that is why it is spelled in all caps!). Other noteworthy commercial products include Mathematica and Analytica. Octave is an open-source mathematical modeling tool that reads MATLAB code, and there are add-on GUI environments (like RStudio for R) floating around in hyperspace. I recently discovered the power of Scilab and the world of modules (packages) that are available for this open-source gem.
For simulation, Simulink works “on top of” MATLAB functions/code for a variety of simulation models. I wrote the book “Missile Flight Simulation” using MATLAB and Simulink. ExtendSim is an excellent tool for discrete event simulation and the subject of my book “Discrete Event Simulation using ExtendSim”. In Scilab, I have used Xcos for discrete event simulation and Quapro for linear programming. Both are featured in my next book.
There is a general analytics tool that I do not know much about yet. BOARD, in its newest release, boasts a predictive analytics capability. I will be speaking on predictive analytics at the BOARD User Conference on April 13-14 in San Diego. Again, I would be remiss not to mention Excel, and particularly the Solver add-in for mathematical programming. Another third-party add-in to consider is @RISK.
If you aspire to become an analytics consultant or scientist, you have a lot of open-source tools, free training and online tutorials at your fingertips. If you are already working in analytics, you can easily specialize in predictive analytics. If you are already working in predictive analytics, you have what you need to become an expert. All of the tools will work either with your PC’s native processing power or through a virtual machine (for example, when using Hadoop) or a remote server.
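As a rough illustration of how simulation and queuing theory fit together, here is a small Monte Carlo sketch of a single-server queue with random (exponential) arrivals and service times, the classic M/M/1 model. The function name `simulate_mm1_wait`, the rates and the customer count are assumptions chosen for the example. A dedicated tool such as ExtendSim or Simulink would be used for anything realistic.

```python
import random


def simulate_mm1_wait(arrival_rate, service_rate, n_customers, seed=0):
    """Estimate the average time customers wait in line in an M/M/1 queue.

    Uses the Lindley recursion: the next customer's wait equals the current
    wait plus the current service time minus the gap until the next arrival,
    floored at zero.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    wait = 0.0                 # the first customer finds an empty system
    total_wait = 0.0
    for _ in range(n_customers):
        total_wait += wait
        service = rng.expovariate(service_rate)
        gap_to_next_arrival = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - gap_to_next_arrival)
    return total_wait / n_customers


arrival_rate = 0.5   # customers per minute (hypothetical)
service_rate = 1.0   # customers served per minute (hypothetical)

estimate = simulate_mm1_wait(arrival_rate, service_rate, n_customers=200_000)
# Queuing theory gives the exact long-run average wait for comparison.
theory = arrival_rate / (service_rate * (service_rate - arrival_rate))
print(f"simulated wait: {estimate:.3f}   theoretical wait: {theory:.3f}")
```

The simulated value should land close to the theoretical one, but not exactly on it. The value of simulation shows up when the system is too complicated for a closed-form result like this one.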
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books, including:
- Operations Research using Open-Source Tools
- Discrete Event Simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices