There is absolutely nothing wrong with my grammar, at least in the title. If you do not know what R is, you should probably visit the boutique section on Amazon instead of reading any further. If you do not know what KNIME is, well, I am going to tell you, though I also introduced KNIME in the LinkedIn article “What is KNIME?”
KNIME, pronounced “naim”, is a modular data exploration platform that allows the user to visually create data flows (referred to here as workflows). One key behind the success of KNIME is its inherent modular workflow approach, which documents and stores the analysis process in the order it was conceived and implemented, while ensuring that intermediate results are always available.
Core KNIME features include:
- Scalability through sophisticated data handling (intelligent automatic caching of data in the background while maximizing throughput performance)
- High, simple extensibility via a well-defined API for plugin extensions
- Intuitive user interface
- Import/export of workflows (for exchanging with other KNIME users)
- Parallel execution on multi-core systems
- Command line version for “headless” batch executions
Available KNIME modules cover a vast range of functionality, such as:
- I/O: retrieves data from files or databases
- Data Manipulation: pre-processes your input data with filtering, group-by, pivoting, binning, normalization, aggregation, joining, sampling, partitioning, etc.
- Views: visualize data and results through several interactive views, allowing for interactive data exploration
- Highlighting: ensures highlighted data points in one view are also immediately highlighted in all other views
- Mining: uses state-of-the-art data mining algorithms like clustering, rule induction, decision tree, association rules, naïve Bayes, neural networks, support vector machines, etc. to better understand your data
You can check out the complete node documentation for a comprehensive list of nodes and detailed descriptions at http://www.knime.org/files/node-documentation/index.html.
Supported Operating Systems
- Windows – 32bit (regularly tested on XP and Vista)
- Windows – 64bit (regularly tested on Vista and verified to work under Windows 7)
- Linux – 32bit (regularly tested on RHEL4/5, OpenSUSE 10.2/10.3/11.0, amongst others)
- Linux – 64bit (regularly tested on RHEL4/5, OpenSUSE 10.2/10.3/11.0, amongst others)
- Mac OSX – 64bit Intel-based architecture with Java 1.6
KNIME now integrates R through an external library of R nodes. We are going to construct a workflow in KNIME using elements from this R extension.
Building a Simple Workflow
A workflow is like a Diagram in SAS Enterprise Miner or a Stream in SPSS Modeler. Figure 1 shows the KNIME workbench windows, including the workflow.
Figure 1. The KNIME workbench (GUI)
I am going to walk through the process of building a small, simple workflow using R data for Stage C Prostate Cancer (stagec), so that you can get an idea of the environment using the R interface. We will read in data from a comma-delimited (CSV) file using the File Reader and explore the dataset. We will use R nodes such as R Snippet and R View to run arbitrary R script on their input; the results of an R Snippet node are then returned at the node’s output. We will use the R View node to execute a view command and visualize the generated content in the node’s specific view. We will also display the data with a Scatterplot Matrix, a Colored Scatterplot, and a Cumulative Statistics table.
Next, we will build a predictive model using the R Learner node, which fits an rpart (decision tree) model and returns it on a special out-port, and the R Predictor node, which predicts outcomes using the R Learner’s model. We will use R Snippet nodes to partition the model data. Finally, we will use an R View node to observe the predictions.
KNIME has a Node Repository containing all the functions used in creating a workflow (see Figure 2). We are using the R modules. To import the data, we expand the IO and the contained Read category as depicted below (right picture in Figure 2) and drag & drop the File Reader icon to the Workflow Editor window. The next node for now will be the Column Rename node from the Data Manipulation library (though this will be used for a future model). Since I do not always remember where various nodes are found, I use the search box of the Node Repository: I enter “Column” and press Enter. This limits the nodes shown to the ones with “column” in their name. We then drag the Column Rename node to the workflow. Next, we will add the nodes from the R library, shown in the left picture of Figure 2.
Figure 2. Where to find workflow nodes in the Node repository
Now, from the R library, we drag the following nodes to the workflow window: R Learner, R Predictor, R Snippet (three of these), R View (Table) (two of these), and R View (Workspace). We want to arrange these nodes as indicated in Figure 3. We also want to name the nodes, as I have, by editing the text that appears at the bottom of each node.
Figure 3. Final workflow diagram
Our nodes will not show a green status until they are configured and executed.
Now we need to connect the nodes in order to get the data flowing. We click an output port and drag the connection to an appropriate input port. The complete flow is pictured in Figure 3.
File Reader Node
Fully connected nodes with a red status icon need to be configured. We start with the File Reader by right-clicking it and selecting Configure from the drop-down menu. We navigate to the directory where our data is located. We select the “stagec.csv” file from this location (you can download this set at https://vincentarelbundock.github.io/Rdatasets/datasets.html). The File Reader’s configuration window is shown in Figure 4. Note that the CSV file contains column headers with the variable names, and row IDs. Hence, the read column headers and the read row IDs boxes are checked under Basic Settings. Of course, the file is comma delimited.
Figure 4. Configuring the File Reader node
You can view the data to ensure it imported correctly by clicking on Apply and scrolling down as shown in Figure 5. When the data is imported into KNIME, it is renamed with a KNIME naming convention. The default name is knime.in. Thus, our dataset is now knime.in, instead of stagec.
Figure 5. stagec data preview in the File Reader node configuration window
Column Rename Node
The Column Rename node is found in the Data Manipulation library, under Data Manipulation | Convert & Replace | Column Rename. It can be used to rename columns or change their types. We will not use it for now, but will need it for other models. Notice in Figure 6 that we can change the variable type for pgstat (the tumor progression status) to other types as needed. We could also change the variable names, which may be required for other model types. Feel free to explore the configuration dialog on your own for now.
Figure 6. Configuration for the Column Rename node
R View (Table) Matrix Plot
Now we configure the node that makes a matrix plot of the stagec data, as shown in Figure 7.
Figure 7. R View (Table) Matrix Plot configuration window
Notice that I have entered R code in the R Script window. As it may be hard to read, I repeat it here:
R <- knime.in
pairs(R[, c("age", "g2")], labels = c("age", "g2"))
Since the data is coming from the File Reader node and passing through the Column Rename node with no changes, we just write normal R code here, with one exception: we now use the dataset name knime.in. The pairs function produces a matrix of scatterplots. You can refer to the help for the graphics package for details.
Once the code is entered you can click on Eval Script to run it and check for errors. If no errors occur, click on Show Plot to see a preview of the plot as shown in Figure 8.
Figure 8. Output window of the Matrix Plot of the stagec dataset
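For readers who want to try this outside KNIME, the same matrix plot can be reproduced in plain R. This is a minimal sketch that assumes the rpart package is installed, since it ships a copy of the stagec data (standing in here for knime.in):

```r
# stagec ships with the rpart package; here it stands in for knime.in
data(stagec, package = "rpart")

# Scatterplot matrix of the age and g2 columns
pairs(stagec[, c("age", "g2")], labels = c("age", "g2"),
      main = "Stage C Prostate Cancer")
```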
R View (Table) Colored Scatterplot
Now we configure the node that makes a colored scatterplot of the stagec data, as shown in Figure 9. We start by selecting the Template tab. Here you should find one predefined template called “Colored Scatter Plot”.
Figure 9. R View (Table) Colored Scatterplot configuration window with the Template tab view
Select the template, click on Apply and then select the R Script tab (if it does not appear automatically). The details for the R script are shown in Figure 10. Here, we merely add the source of the data, the variables for the x and y coordinates, the class variable, and the plot title, as shown below (lines 7 through 18). Everything else is scripted from the template.
# Note: variable names are used as plot labels ("x" will be the x axis label)
x = knime.in$"age" #<Choose column from Column List>
y = knime.in$"g2" #<Choose column from Column List>
#Column to color by:
class = knime.in$"ploidy" #<Choose column from Column List>
# Use a flow variable for a title
title = "Stage C Prostate Cancer" #<Choose variable from Flow Variable List or set manually>
Figure 10. R View (Table) Colored Scatterplot configuration window with the R Script tab view
Again, we can check for script errors by clicking Eval Script and then preview the plot, by clicking Show Plot. The preview is shown in Figure 11.
Figure 11. Output window for the Colored Scatterplot of the stagec dataset
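The template boils down to a standard R scatterplot colored by a factor. A plain-R sketch of the same idea, again assuming the rpart package supplies the stagec data as a stand-in for knime.in:

```r
data(stagec, package = "rpart")   # stands in for knime.in

x <- stagec$age
y <- stagec$g2
class <- factor(stagec$ploidy)    # diploid / tetraploid / aneuploid

# Color each point by its ploidy level and add a matching legend
plot(x, y, col = as.integer(class), pch = 19,
     xlab = "age", ylab = "g2",
     main = "Stage C Prostate Cancer")
legend("topright", legend = levels(class),
       col = seq_len(nlevels(class)), pch = 19)
```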
R Snippet Cumulative Stats
Next we configure the node that calculates the cumulative statistics for the stagec dataset, as shown in Figure 12. Again, we use a predefined template from the Template tab. We select Cumulative Statistics and click on Apply.
Figure 12. Template tab of the configuration window for the Cumulative Statistics R Snippet
The resulting R Script is shown in Figure 13. We merely need to replace “<Column Name>” with “age” and evaluate the script.
# Reference a column in your table here.
column = knime.in$"<Column Name>"
Figure 13. R Script tab of the configuration window for the Cumulative Statistics R Snippet
To see the output, we must first run the node: select it, right-click, and choose Execute. Then close the configuration window, right-click the node again, and select Data Output. The output is shown in Figure 14.
Figure 14. Output window for the Cumulative Statistics R snippet node
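For reference, the kind of running statistics this template computes can be written in a few lines of plain R with the cum* family of functions. A sketch for the age column, with stagec from the rpart package standing in for knime.in:

```r
data(stagec, package = "rpart")

column <- stagec$age
stats <- data.frame(
  value   = column,
  cumsum  = cumsum(column),                      # running total
  cummin  = cummin(column),                      # running minimum
  cummax  = cummax(column),                      # running maximum
  cummean = cumsum(column) / seq_along(column)   # running mean
)
head(stats)
```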
Summary of stagec Dataset
All we have done so far is explored the stagec dataset. We have not built any models, yet. If you are unfamiliar with this dataset, you may want to examine the output we produced in more detail, before proceeding. The next section starts the model construction.
To build our predictive model, we need two different sets of data, one for training and one for prediction. This is usually accomplished with data that has been collected from two or more studies. However, it can also be accomplished by partitioning the data from one dataset. The Stage C Prostate Cancer data has 146 cases (rows), which we can partition by taking half of the cases for training and the other half for prediction. To do this, we will use two R Snippet nodes. The R scripts for these are shown in Figures 15 and 16. Since the code may be difficult to read, I repeat it here:
knime.out <- knime.in[1:73,] #Training in Figure 15
knime.out <- knime.in[74:146,] #Prediction in Figure 16
Figure 15. R Snippet node of partitioning the training dataset
Figure 16. R Snippet node of partitioning the prediction dataset
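Splitting on the first and second half of the rows assumes the rows arrive in random order. If that is uncertain, a randomized split inside the R Snippet is safer. A sketch, seeded for reproducibility, with stagec standing in for knime.in:

```r
data(stagec, package = "rpart")
knime.in <- stagec                 # stand-in for the node's input table

set.seed(42)                       # make the random split reproducible
train_idx <- sample(nrow(knime.in), size = nrow(knime.in) %/% 2)

training   <- knime.in[train_idx, ]    # knime.out of the first snippet
prediction <- knime.in[-train_idx, ]   # knime.out of the second snippet
```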
Now we configure the R Learner node, which allows for the execution of an R modeling script within KNIME. The details are shown in Figure 17.
Figure 17. R Script tab of the configuration window for the R Learner node
Now that we are in the modeling nodes, the stagec dataset has been automatically imported in the proper modeling format and renamed knime.in. The code I entered in the R Script window is shown below. The model is called a Recursive Partitioning and Regression Tree model or rpart from the rpart package.
knime.model <- rpart(ploidy ~ ., method = "class", data = knime.in)
Once we enter the code, we can evaluate it using Eval Script. This node does not produce output per se. Rather it produces a model that we will use next for prediction. The default model name is knime.model. For details on rpart, see the R help for rpart package: https://cran.r-project.org/web/packages/rpart/rpart.pdf.
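The same fit can be reproduced outside KNIME, with stagec from the rpart package standing in for knime.in:

```r
library(rpart)
data(stagec)   # Stage C Prostate Cancer data shipped with rpart

# Classification tree predicting ploidy from all other variables
knime.model <- rpart(ploidy ~ ., method = "class", data = stagec)

print(knime.model)   # text summary of the splits
```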
Now we configure the R predictor node. The details are shown in Figure 18.
Figure 18. R Script tab of the configuration window for the R Predictor node
The script entered in the R Script window calls R’s predict function, using the KNIME model name “knime.model” and the KNIME data set “knime.in”. The output is shown in Figure 19.
Figure 19. Output window of the R Predictor model node
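The predictor script itself is not reproduced above, but in plain R the prediction step amounts to a call to predict on the learned model. The sketch below is my own reconstruction, not the node’s exact template; rows 1–73 and 74–146 of stagec mirror the training/prediction split used earlier:

```r
library(rpart)
data(stagec)

# Fit on the first half, predict on the second half
knime.model <- rpart(ploidy ~ ., method = "class", data = stagec[1:73, ])

# 'knime.out' mimics the KNIME convention for a node's output table
knime.out <- cbind(stagec[74:146, ],
                   prediction = predict(knime.model, stagec[74:146, ],
                                        type = "class"))
head(knime.out[, c("ploidy", "prediction")])
```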
R View (Workspace) Tree Plot
The last node shows the tree plot for the rpart model. As expected, this will produce a classification tree. The details for configuring this node are shown in Figure 20.
Figure 20. Configuration window of the R View (Workspace) Tree Plot node
The R Script is:
plot(knime.model, uniform=TRUE, main="Classification Tree for Stage C Prostate Cancer")
text(knime.model, use.n=TRUE, all=TRUE, cex=.8)
Clicking on Show Plot produces the two branch tree shown in Figure 21.
Figure 21. Image Output window of the R View (Workspace) Tree Plot node
Executing the entire workflow also executes this node. To view its output, right-click the Tree Plot node and select Image Output.
KNIME supports other programs including MATLAB and Weka for machine learning.
Though I have not worked with KNIME as much as I would like, my first impression is a favorable one, and the integration of R programming is a plus in my book. To download KNIME for your Mac or PC, use this link: http://knime.org/download.
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the Author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the Financial and Insurance Industries for over 20 years. Jeff is a Certified Modeling and Simulation professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is also a frequently invited guest speaker and the author of 20 books including:
- Operations Research using Open-Source Tools
- Discrete Event Simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices