Data Science Series (Part 4): Visual Programming with the Orange Tool
This blog is about splitting a dataset into training and testing sets and comparing different machine learning models on scores such as accuracy, F1, precision, and recall using the Orange tool. We will evaluate the models both with cross-validation and with a train/test split.
Now let’s start building the workflow.
First, place the File widget on the canvas and load the built-in titanic dataset into the workflow.
File Widget
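For readers who prefer scripting, the same step can be reproduced with Orange’s Python API. This is a minimal sketch, assuming the orange3 package is installed; the titanic dataset ships with Orange.

```python
import Orange

# Load the built-in titanic dataset, as the File widget does
data = Orange.data.Table("titanic")
print(len(data), "instances,", data.domain)
```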
Next, connect this data as input to the Data Sampler widget. The Data Sampler selects a subset of instances from the input data set and outputs two data sets: the sample and its complement. The output is produced once the input data set is provided and the Sample Data button is pressed.
Data Sampler
Here I set the sampling to a fixed proportion of 80%, so 80% of the instances go to the sampled data output and the remaining 20% form the complementary data set.
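In scripting terms, the Data Sampler’s fixed-proportion mode roughly corresponds to shuffling the instances and slicing them 80/20. This is a sketch of the idea, not the widget’s exact internals; the seed value is an arbitrary choice for reproducibility.

```python
import numpy as np
import Orange

data = Orange.data.Table("titanic")

# Shuffle instance indices and slice: 80% sample, 20% complement
rng = np.random.default_rng(42)           # fixed seed so the split is repeatable
indices = rng.permutation(len(data))
cut = int(0.8 * len(data))
sample = data[indices[:cut]]              # "Data Sample" output
remaining = data[indices[cut:]]           # "Remaining Data" output
print(len(sample), len(remaining))
```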
Now, we need to send this sample data from the Data Sampler to the Test and Score widget. This widget tests learning algorithms; different sampling schemes are available, including using separate test data.
Test and Score does two things. First, it shows a table with different classifier performance measures, such as classification accuracy and area under the curve. Second, it outputs evaluation results, which other widgets, such as ROC Analysis or Confusion Matrix, can use to analyze classifier performance.
Workflow
Here we connect three different learning algorithms, namely Neural Network, Naive Bayes, and Logistic Regression, to the Test and Score widget.
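In the scripting API, these three widgets correspond to learner objects. A hedged sketch follows; exact class names can vary between Orange versions, and NNClassificationLearner wraps scikit-learn’s MLP classifier.

```python
import Orange

# One learner object per widget in the workflow
learners = [
    Orange.classification.NNClassificationLearner(),    # Neural Network widget
    Orange.classification.NaiveBayesLearner(),          # Naive Bayes widget
    Orange.classification.LogisticRegressionLearner(),  # Logistic Regression widget
]
```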
Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
It splits the data into a given number of folds (usually 5 or 10); each fold is held out once as the test set while the model is trained on the remaining folds, and the scores are averaged. Cross-validation is primarily used in applied machine learning to estimate how well a model will perform on unseen data.
Test and Score
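The evaluation Test and Score performs can also be reproduced in code. Below is a minimal sketch using Orange’s CrossValidation with 5 folds; the call style may differ slightly between Orange versions.

```python
import Orange

data = Orange.data.Table("titanic")
learners = [
    Orange.classification.NNClassificationLearner(),
    Orange.classification.NaiveBayesLearner(),
    Orange.classification.LogisticRegressionLearner(),
]

# 5-fold cross-validation over all three learners
cv = Orange.evaluation.CrossValidation(k=5)
results = cv(data, learners)

# The same kinds of scores Test and Score reports, one value per learner
print("CA: ", Orange.evaluation.CA(results))
print("AUC:", Orange.evaluation.AUC(results))
print("F1: ", Orange.evaluation.F1(results))
```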
Splitting data into training and testing data in Orange
To split the data into train and test datasets, we will send the 80% Data Sample output of the Data Sampler to Test and Score as the training data and the remaining 20% as the test data. Double-click the link between Data Sampler and Test and Score, then connect the Data Sample box to the Data box and the Remaining Data box to the Test Data box, as shown in the figure below.
Edit Links
Now there are two connections from the Data Sampler to the Test and Score widget: one sends the 80% Data Sample, i.e., the training data, and the other sends the 20% Remaining Data, i.e., the test data.
Workflow
Now let’s compare the scores of the three algorithms on the training data. To do so, double-click the Test and Score widget and select the Test on train data option; the scores for all three algorithms appear in the table.
Test on Train Data
To evaluate the learning algorithms on the held-out test data instead, choose the Test on test data option in the Test and Score widget.
Test on Test Data
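Scripted equivalents of these two options might look like the sketch below, using TestOnTrainingData and TestOnTestData. Again, this is hedged: keyword names and call styles vary somewhat across Orange versions.

```python
import numpy as np
import Orange

data = Orange.data.Table("titanic")
learners = [
    Orange.classification.NNClassificationLearner(),
    Orange.classification.NaiveBayesLearner(),
    Orange.classification.LogisticRegressionLearner(),
]

# Reuse the 80/20 split from the Data Sampler step
rng = np.random.default_rng(42)
indices = rng.permutation(len(data))
cut = int(0.8 * len(data))
train, test = data[indices[:cut]], data[indices[cut:]]

# "Test on train data": evaluate on the same data used for training
res_train = Orange.evaluation.TestOnTrainingData()(data=train, learners=learners)
print("Train CA:", Orange.evaluation.CA(res_train))

# "Test on test data": train on 80%, evaluate on the held-out 20%
res_test = Orange.evaluation.TestOnTestData()(data=train, test_data=test, learners=learners)
print("Test CA: ", Orange.evaluation.CA(res_test))
```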
So far, we have seen how to sample our data and compare different learning algorithms to find out which one works best for our dataset using the Orange tool. We will explore this tool in more detail in the next part of the Data Science series. You can explore more about the Orange tool here.