In my last post, I covered the steps I took to set up the environment required for ML.NET. Recall that we need to load the data, prepare the data, train the model and, lastly, use the model. Today we are focusing on training and getting the model. There are 3 ways to get a model in ML.NET:
- Coding it ourselves
- Using Model Builder Tool
- Using AutoML via the CLI to train models and pick the best algorithm
In this post, I will focus on using the CLI (command line interface) to test and find the best algorithm for us. The sample data I am using is from https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip ; more specifically, I am using the Yelp file. If you are not able to get it there anymore, you may download it from https://www.limguohong.com/wp-content/uploads/2019/06/yelp_labelled.txt
Data
First, we need to understand the data we downloaded and what it means. If you open it in Excel or a text editor, you will notice that every line is a piece of text followed by a digit at the end. The digit is binary: 1 or 0. You will also notice that positive lines are labelled with a 1 and negative lines are labelled with a 0. There are 1000 lines of text (reviews) in total.
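To make the layout concrete, here is a minimal Python sketch of how one might parse lines of this shape and count the labels. It assumes the text and the digit are separated by a tab. The three sample lines are made up for illustration and are not taken from the Yelp file, and `parse_line` and `count_labels` are hypothetical helper names.

```python
from collections import Counter

# Made-up sample lines in the assumed "text<TAB>digit" layout (not real Yelp data).
SAMPLE_LINES = [
    "The food was great.\t1",
    "Service was slow and rude.\t0",
    "I would come back again.\t1",
]


def parse_line(line):
    """Split one line into (review_text, label), using the last tab as the separator."""
    text, digit = line.rstrip("\n").rsplit("\t", 1)
    label = int(digit)
    if label not in (0, 1):
        raise ValueError(f"expected 0 or 1, got {digit!r}")
    return text, label


def count_labels(lines):
    """Count how many lines are positive (1) and how many are negative (0)."""
    return Counter(parse_line(line)[1] for line in lines)


counts = count_labels(SAMPLE_LINES)
print(f"positive: {counts[1]}, negative: {counts[0]}")
```

Running the same two functions over the lines of the real file would be a quick way to check the balance of positive and negative reviews before training.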
Problem we are solving via machine learning
In this tutorial, we are training a model to work out whether a review is positive or negative and return the result accordingly. We are using Yelp reviews to train the model via the AutoML CLI.
As you have probably noticed, predicting whether a new review is likely POSITIVE or NEGATIVE is a binary classification problem, and this sheds some light on which task we should use.
Which tasks should we use?
There are 7 tasks in ML.NET right now. Based on the problem we are solving, we need to choose which task (I sometimes call it the classification of the problem) it falls under.
Tasks include
- Binary Classification
- Multiclass Classification
- Regression
- Clustering
- Anomaly Detection
- Ranking
- Recommendation
For an explanation of what each task does, please check the following link: https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/tasks
Now that we understand which tasks are available, we will leave it to the AutoML CLI to tell us which trainer we should use. The concept of a trainer is
Trainer = Algorithm + Task.
In this tutorial, as the problem is a binary one, the best task to use is Binary Classification.
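As a rough illustration of that reasoning, the Python sketch below guesses a task from the values in a label column. It is only a rule of thumb that I am assuming for illustration, not how ML.NET or AutoML decides anything. `suggest_task` is a hypothetical helper, and it only returns the three task names that the CLI accepts.

```python
def suggest_task(labels):
    """Guess a task name from label values (illustrative heuristic only)."""
    distinct = set(labels)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct label values")
    # Any non-whole number suggests a continuous target.
    if any(isinstance(v, float) and not v.is_integer() for v in distinct):
        return "regression"
    # Exactly two distinct values, such as 1 and 0, suggests a yes/no problem.
    if len(distinct) == 2:
        return "binary-classification"
    return "multiclass-classification"


# The Yelp labels are only ever 1 or 0.
print(suggest_task([1, 0, 0, 1]))
```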
Tutorial on Binary Classification – AutoML CLI ML.NET
- Create a new folder – I created “AutoML CLI Binary”
- Place the txt file into the folder.
- We need to modify the data a bit, as it is missing the header that tells the system which column is the LABEL. Do note that the label must be of boolean type (1 or 0, true or false). As you can see on https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/tasks under Binary Classification, the label column is required to be Boolean. Open the txt file in Excel and add the header “sentiment_label” above the result column.
- Go to the folder, open a command prompt and run this command.
mlnet auto-train --task binary-classification --dataset "yelp_labelled.txt" --label-column-name sentiment_label --max-exploration-time 20
*For an explanation, please refer to the end of this post.
- After running, you will notice that it reports how many iterations it ran and which trainer has the best accuracy.
*You may also want to try a longer exploration time and see whether it suggests a better algorithm. In my screenshots, I used 20 seconds and 60 seconds, and the results were different.
- You will notice that a new folder has been generated.
In my next post, I will share how to make sense of the generated files.
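If you would rather not edit the file by hand for the header step above, it could be sketched in Python as below. It assumes the file is tab-separated with the text first and the digit last. `sentiment_text` is a column name I made up for the text column (the post only specifies `sentiment_label`), and `add_header` is a hypothetical helper.

```python
def add_header(content, text_column="sentiment_text", label_column="sentiment_label"):
    """Return the file content with a tab-separated header row in front.

    text_column is an assumed name; only label_column comes from the post.
    """
    header = f"{text_column}\t{label_column}\n"
    if content.startswith(header):
        return content  # header already present, leave the data untouched
    return header + content


# Made-up sample content standing in for yelp_labelled.txt.
sample = "The food was great.\t1\nService was slow and rude.\t0\n"
print(add_header(sample), end="")
```

To apply it to the real file, read `yelp_labelled.txt` into a string, pass it through `add_header`, and write the result back.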
ML.NET Command breakdown
mlnet auto-train --task binary-classification --dataset "yelp_labelled.txt" --label-column-name sentiment_label --max-exploration-time 20
Notes about the command.
- We are telling mlnet to run the auto-train command, which will “Create a new .NET project using ML.NET to train and run a model”
- We then tell mlnet which task we want it to perform by passing --task. At the time of writing, the ML.NET AutoML CLI supports the following tasks:
- regression
- binary-classification
- multiclass-classification
- We then tell it via --dataset or -d which file mlnet is supposed to read. Since we are running the command in the project root, we just pass the file name “yelp_labelled.txt”
- We can use either --label-column-name or --label-column-index to specify the column to predict. In this case, we used --label-column-name and told it to read the column named “sentiment_label”
- We end the command by telling mlnet the maximum exploration time in seconds via --max-exploration-time. In this case we set it to 20 seconds.
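To show how the pieces above fit together, here is a small Python sketch that assembles the same command from its parts. It only uses the flags already explained in this post, and `build_auto_train_command` is a hypothetical helper name, not part of ML.NET.

```python
import shlex


def build_auto_train_command(dataset, label_column, max_seconds,
                             task="binary-classification"):
    """Assemble the mlnet auto-train argument list from the options explained above."""
    return [
        "mlnet", "auto-train",
        "--task", task,
        "--dataset", dataset,
        "--label-column-name", label_column,
        "--max-exploration-time", str(max_seconds),
    ]


args = build_auto_train_command("yelp_labelled.txt", "sentiment_label", 20)
print(shlex.join(args))
```

Assuming mlnet is installed and on your PATH, the same list could be passed to `subprocess.run(args, check=True)` from the folder that contains the file, which makes it easy to repeat the run with a different exploration time.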
Posts in this series
ML.NET Introduction – Introduction
ML.NET Part 2 – Machine Learning – Environment setup
ML.NET Part 3 – Machine Learning – Generating Model via ML.NET CLI – Binary Classification