{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# More classifiers\n", "\n", "In this section, you will use the dataset you saved in the last section, which is full of balanced, clean data about cuisines.\n", "\n", "You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.\n", "\n", "## Exercise - predict a national cuisine\n", "\n", "1\\. Working in this section's [build-classification-models](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/ml-fundamentals/build-classification-models.ipynb) file, load the cleaned cuisines dataset with the Pandas library:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>cuisine</th><th>almond</th><th>angelica</th><th>anise</th><th>anise_seed</th><th>apple</th><th>apple_brandy</th><th>apricot</th><th>armagnac</th><th>...</th><th>whiskey</th><th>white_bread</th><th>white_wine</th><th>whole_grain_wheat_flour</th><th>wine</th><th>wood</th><th>yam</th><th>yeast</th><th>yogurt</th><th>zucchini</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>indian</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>1</th><td>1</td><td>indian</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>2</th><td>2</td><td>indian</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>3</th><td>3</td><td>indian</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>4</th><td>4</td><td>indian</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 382 columns</p>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 cuisine almond angelica anise anise_seed apple \\\n", "0 0 indian 0 0 0 0 0 \n", "1 1 indian 1 0 0 0 0 \n", "2 2 indian 0 0 0 0 0 \n", "3 3 indian 0 0 0 0 0 \n", "4 4 indian 0 0 0 0 0 \n", "\n", " apple_brandy apricot armagnac ... whiskey white_bread white_wine \\\n", "0 0 0 0 ... 0 0 0 \n", "1 0 0 0 ... 0 0 0 \n", "2 0 0 0 ... 0 0 0 \n", "3 0 0 0 ... 0 0 0 \n", "4 0 0 0 ... 0 0 0 \n", "\n", " whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 1 0 \n", "\n", "[5 rows x 382 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "import pandas as pd\n", "cuisines_df = pd.read_csv(\"https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/classification/cleaned_cuisines.csv\")\n", "cuisines_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2\\. Now, import several more libraries:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve\n", "from sklearn.svm import SVC\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3\\. Divide the x and y coordinates into two dataframes for training. 
`cuisine` can be the labels dataframe:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 indian\n", "1 indian\n", "2 indian\n", "3 indian\n", "4 indian\n", "Name: cuisine, dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cuisines_label_df = cuisines_df['cuisine']\n", "cuisines_label_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4\\. Drop that `Unnamed: 0` column and the `cuisine` column, calling `drop()`. Save the rest of the data as trainable features:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "output-scoll" ] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>almond</th><th>angelica</th><th>anise</th><th>anise_seed</th><th>apple</th><th>apple_brandy</th><th>apricot</th><th>armagnac</th><th>artemisia</th><th>artichoke</th><th>...</th><th>whiskey</th><th>white_bread</th><th>white_wine</th><th>whole_grain_wheat_flour</th><th>wine</th><th>wood</th><th>yam</th><th>yeast</th><th>yogurt</th><th>zucchini</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>1</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
" <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 380 columns</p>\n",
"</div>
" ], "text/plain": [ " almond angelica anise anise_seed apple apple_brandy apricot \\\n", "0 0 0 0 0 0 0 0 \n", "1 1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " armagnac artemisia artichoke ... whiskey white_bread white_wine \\\n", "0 0 0 0 ... 0 0 0 \n", "1 0 0 0 ... 0 0 0 \n", "2 0 0 0 ... 0 0 0 \n", "3 0 0 0 ... 0 0 0 \n", "4 0 0 0 ... 0 0 0 \n", "\n", " whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 1 0 \n", "\n", "[5 rows x 380 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)\n", "cuisines_feature_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you are ready to train your model!\n", "\n", "## Choosing your classifier\n", "\n", "Now that your data is clean and ready for training, you have to decide which algorithm to use for the job. \n", "\n", "Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. [The variety](https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. 
The following methods all include classification techniques:\n", "\n", "- Linear Models\n", "- Support Vector Machines\n", "- Stochastic Gradient Descent\n", "- Nearest Neighbors\n", "- Gaussian Processes\n", "- Decision Trees\n", "- Ensemble methods (voting classifier)\n", "- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)\n", "\n", ":::{seealso}\n", "You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this section.\n", ":::\n", "\n", "### What classifier to go with?\n", "\n", "So, which classifier should you choose? Often, a practical test is to run through several classifiers and compare their results. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a synthetic dataset, comparing KNeighbors, SVC two ways (linear and RBF kernels), GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/comparison.png\n", "---\n", "name: 'comparison of classifiers'\n", "width: 90%\n", "---\n", "Comparison of classifiers [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/comparison.png)\n", ":::\n", "\n", ":::{seealso}\n", "Plots generated from Scikit-learn's documentation.\n", "\n", "AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. 
Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-77952-leestott).\n", ":::\n", "\n", "### A better approach\n", "\n", "A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-77952-leestott). Here, we discover that, for our multiclass problem, we have some choices:\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/cheatsheet.png\n", "---\n", "name: 'cheatsheet for multiclass problems'\n", "width: 90%\n", "---\n", "Cheatsheet for multiclass problems [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/cheatsheet.png)\n", ":::\n", "\n", ":::{note}\n", "A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options.\n", ":::\n", "\n", ":::{seealso}\n", "Download this cheat sheet, print it out, and hang it on your wall!\n", ":::\n", "\n", "### Reasoning\n", "\n", "Let's see if we can reason our way through different approaches given the constraints we have:\n", "\n", "- **Neural networks are too heavy**. Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.\n", "- **No two-class classifier**. Our problem has five classes, so we do not pick a two-class classifier, which rules out the cheat sheet's one-vs-all option. \n", "- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data. \n", "- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.\n", "\n", "### Using Scikit-learn \n", "\n", "We will be using Scikit-learn to analyze our data. 
However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression). \n", "\n", "Essentially, there are two important parameters, `multi_class` and `solver`, that we need to specify when we ask Scikit-learn to perform a logistic regression. The `multi_class` value selects how the multiclass problem is handled, and the `solver` value selects the optimization algorithm. Not all solvers can be paired with all `multi_class` values. Note that Scikit-learn 1.5 deprecated the `multi_class` parameter, so newer releases emit a warning when it is passed.\n", "\n", "According to the docs, in the multiclass case, the training algorithm:\n", "\n", "- **Uses the one-vs-rest (OvR) scheme**, if the `multi_class` option is set to `ovr`.\n", "- **Uses the cross-entropy loss**, if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)\n", "\n", ":::{seealso}\n", "The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [🔗source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)\n", "\n", "The 'solver' is defined as \"the algorithm to use in the optimization problem\". 
[source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).\n", ":::\n", "\n", "Scikit-learn offers this table to explain how solvers handle different challenges presented by different kinds of data structures:\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/solvers.png\n", "---\n", "name: 'solvers'\n", "width: 90%\n", "---\n", "Solvers [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/solvers.png)\n", ":::\n", "\n", "## Exercise - split the data\n", "\n", "We can focus on logistic regression for our first training trial, since you learned about it in a previous section.\n", "Split your data into training and testing groups by calling `train_test_split()`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise - apply logistic regression\n", "\n", "Since this is a multiclass problem, you need to choose which _scheme_ to use and which _solver_ to set. Use LogisticRegression with a multiclass setting and the **liblinear** solver to train.\n", "\n", "1\\. 
Create a logistic regression with `multi_class` set to `ovr` and the solver set to `liblinear`:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy is 0.8256880733944955\n" ] } ], "source": [ "# One-vs-rest logistic regression with the liblinear solver\n", "lr = LogisticRegression(multi_class='ovr', solver='liblinear')\n", "model = lr.fit(X_train, np.ravel(y_train))\n", "\n", "accuracy = model.score(X_test, y_test)\n", "print(\"Accuracy is {}\".format(accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{seealso}\n", "Try a different solver like `lbfgs`, which is the default in current Scikit-learn releases.\n", ":::\n", "\n", ":::{note}\n", "Use NumPy's [`ravel`](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html) function, as the code above does, to flatten your data when needed.\n", ":::\n", "\n", "The accuracy is good at over **80%**!\n", "\n", "2\\. You can see this model in action by testing one row of data (#50):" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ingredients: Index(['cinnamon', 'cream', 'egg', 'milk', 'milk_fat'], dtype='object')\n", "cuisine: indian\n" ] } ], "source": [ "print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')\n", "print(f'cuisine: {y_test.iloc[50]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{seealso}\n", "Try a different row number and check the results.\n", ":::\n", "\n", "3\\. Digging deeper, you can check how confident the model is in this prediction by looking at the predicted probability for each cuisine:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>indian</th><td>0.583259</td></tr>\n",
" <tr><th>japanese</th><td>0.177337</td></tr>\n",
" <tr><th>chinese</th><td>0.130770</td></tr>\n",
" <tr><th>korean</th><td>0.090274</td></tr>\n",
" <tr><th>thai</th><td>0.018360</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0\n", "indian 0.583259\n", "japanese 0.177337\n", "chinese 0.130770\n", "korean 0.090274\n", "thai 0.018360" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test= X_test.iloc[50].values.reshape(-1, 1).T\n", "proba = model.predict_proba(test)\n", "classes = model.classes_\n", "resultdf = pd.DataFrame(data=proba, columns=classes)\n", "\n", "topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])\n", "topPrediction.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{seealso}\n", "Can you explain why the model is pretty sure this is an Indian cuisine?\n", ":::\n", "\n", "4\\. Get more detail by printing a classification report, as you did in the regression sections:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " chinese 0.74 0.76 0.75 243\n", " indian 0.92 0.92 0.92 213\n", " japanese 0.81 0.77 0.79 251\n", " korean 0.84 0.82 0.83 253\n", " thai 0.83 0.87 0.85 239\n", "\n", " accuracy 0.83 1199\n", " macro avg 0.83 0.83 0.83 1199\n", "weighted avg 0.83 0.83 0.83 1199\n", "\n" ] } ], "source": [ "y_pred = model.predict(X_test)\n", "print(classification_report(y_test,y_pred))" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Your turn! 🚀\n", "\n", "In this section, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. 
Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.\n", "\n", "Assignment - [Study the solvers](../../assignments/ml-fundamentals/study-the-solvers.md).\n", "\n", "## Acknowledgments\n", "\n", "Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }