{ "cells": [ { "cell_type": "markdown", "id": "84b2ed6d", "metadata": {}, "source": [ "# Exercise Sheet No. 2\n", "\n", "---\n", "\n", "> Machine Learning for Natural Sciences, Summer 2024, Jun.-Prof. Pascal Friederich, pascal.friederich@kit.edu\n", "\n", "> Instructor: Marlen Neubert (marlen.neubert@kit.edu)\n", "\n", "---\n", "**Deadline**: Monday, April 29th 8am \n", "\n", "**Topic**: This exercise deals with decision trees and random forests. We examine the parameters and properties of these two algorithms on a binary classification example using [`sklearn`](https://scikit-learn.org/stable/) methods." ] }, { "cell_type": "markdown", "id": "3aac18d1", "metadata": {}, "source": [ "### Please put your name and your group members here: \n", "You are encouraged to work in groups of a maximum of 3 people, however **each of you** has to submit a solution.\n", "\n", "Nils Lennart Bruns, usxfs\n" ] }, { "cell_type": "markdown", "id": "eee3c2a3", "metadata": {}, "source": [ "## Preliminaries\n", "If you are not familiar with Python, you may want to learn more about Python\n", "and its basic syntax. Since there are a lot of free and well-written tutorials\n", " online, we refer you to one of the following online tutorials:\n", "\n", "* http://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook\n", "* https://www.learnpython.org/\n", "* https://automatetheboringstuff.com/" ] }, { "cell_type": "markdown", "id": "20c7a0a0", "metadata": {}, "source": [ "## 1.1 Data Preprocessing and Exploration\n", "\n", "The data we will be working with is the breast cancer dataset from the [University of Wisconsin](http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29) - a binary classification dataset for diagnosing breast cancer. \\\n", "It contains 30 features which are derived from digitized images and describe characteristics of the cell nuclei. 
Corresponding labels describe the stage of cancer as either \\\n", "`B`: benign, the tumor doesn’t contain cancerous cells or \\\n", "`M`: malignant, the tumor contains cancerous cells. \n", "\n", "### Problem Description\n", "We want to predict whether a breast cancer tumor is benign or malignant. This is a binary classification problem since we have two output classes.\\\n", "Before we can start training our algorithms we have to get familiar with the data and prepare it:" ] }, { "cell_type": "code", "execution_count": 2, "id": "84713313", "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.preprocessing import LabelEncoder\n", "import requests\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay\n", "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 3, "id": "e982076d", "metadata": {}, "outputs": [], "source": [ "data_url = \"https://bwsyncandshare.kit.edu/s/dCsEn6eK5S453Lq/download\"\n", "data_file = \"breast_cancer_data.csv\"\n", "if not os.path.exists(data_file):\n", " print(\"Downloading dataset ...\")\n", " with open(data_file, \"wb\") as f:\n", " f.write(requests.get(data_url).content)\n", " print(\"Downloading dataset done.\")" ] }, { "cell_type": "markdown", "id": "b38f90bf", "metadata": {}, "source": [ "We load the dataset via the data library ``pandas``, which will return a ``DataFrame`` object. We can print the head of the table with ``.head()``:" ] }, { "cell_type": "code", "execution_count": 4, "id": "2c83c30e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<p>5 rows × 33 columns</p>
" ], "text/plain": [ " id diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n", "0 842302 M 17.99 10.38 122.80 1001.0 \n", "1 842517 M 20.57 17.77 132.90 1326.0 \n", "2 84300903 M 19.69 21.25 130.00 1203.0 \n", "3 84348301 M 11.42 20.38 77.58 386.1 \n", "4 84358402 M 20.29 14.34 135.10 1297.0 \n", "\n", " smoothness_mean compactness_mean concavity_mean concave points_mean \\\n", "0 0.11840 0.27760 0.3001 0.14710 \n", "1 0.08474 0.07864 0.0869 0.07017 \n", "2 0.10960 0.15990 0.1974 0.12790 \n", "3 0.14250 0.28390 0.2414 0.10520 \n", "4 0.10030 0.13280 0.1980 0.10430 \n", "\n", " ... texture_worst perimeter_worst area_worst smoothness_worst \\\n", "0 ... 17.33 184.60 2019.0 0.1622 \n", "1 ... 23.41 158.80 1956.0 0.1238 \n", "2 ... 25.53 152.50 1709.0 0.1444 \n", "3 ... 26.50 98.87 567.7 0.2098 \n", "4 ... 16.67 152.20 1575.0 0.1374 \n", "\n", " compactness_worst concavity_worst concave points_worst symmetry_worst \\\n", "0 0.6656 0.7119 0.2654 0.4601 \n", "1 0.1866 0.2416 0.1860 0.2750 \n", "2 0.4245 0.4504 0.2430 0.3613 \n", "3 0.8663 0.6869 0.2575 0.6638 \n", "4 0.2050 0.4000 0.1625 0.2364 \n", "\n", " fractal_dimension_worst Unnamed: 32 \n", "0 0.11890 NaN \n", "1 0.08902 NaN \n", "2 0.08758 NaN \n", "3 0.17300 NaN \n", "4 0.07678 NaN \n", "\n", "[5 rows x 33 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv(data_file)\n", "data.head()" ] }, { "cell_type": "markdown", "id": "14164182", "metadata": {}, "source": [ "We see that the data consists of 33 columns and 569 rows - corresponding to 569 samples.\\\n", "The first column is called `id`, followed by `diagnosis` which contains the labels.\\\n", "First, we want to check the distribution of classes. Use a pandas method to count the number of benign and malignant data samples. 
The values of your answer should be integers assigned to the variables `B` and `M`:" ] }, { "cell_type": "code", "execution_count": 6, "id": "ab7e9856", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "0b966c5a0ea0e2782eca6e0fc8667893", "grade": false, "grade_id": "cell-22ce838295d464ea", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "357 212\n" ] } ], "source": [ "# look at distribution of classes\n", "B = None \n", "M = None\n", "\n", "# summing a boolean mask counts the rows where the condition is True\n", "B = int((data[\"diagnosis\"] == \"B\").sum())\n", "M = int((data[\"diagnosis\"] == \"M\").sum())\n", "\n", "print(B, M)" ] }, { "cell_type": "code", "execution_count": 7, "id": "441ed263", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2526799b1762effc7e038418ceb7a8f2", "grade": true, "grade_id": "class_distribution", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# check results - 1 point\n", "\n", "assert B != None and M != None, \"Please assign values to B and M!\"\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "641daac8", "metadata": {}, "source": [ "We also see that there is a column `Unnamed: 32` which doesn't contain any information.\\\n", "In the next step we therefore want to clean the data by removing unnecessary columns." ] }, { "cell_type": "code", "execution_count": 11, "id": "1c2e6b0d", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "822f8cdb840f4d7e1676f59806a37941", "grade": false, "grade_id": "cell-5d0c5d760be5c1b0", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "data": { "text/html": [ "
<p>5 rows × 31 columns</p>
" ], "text/plain": [ " diagnosis radius_mean texture_mean perimeter_mean area_mean \\\n", "0 M 17.99 10.38 122.80 1001.0 \n", "1 M 20.57 17.77 132.90 1326.0 \n", "2 M 19.69 21.25 130.00 1203.0 \n", "3 M 11.42 20.38 77.58 386.1 \n", "4 M 20.29 14.34 135.10 1297.0 \n", "\n", " smoothness_mean compactness_mean concavity_mean concave points_mean \\\n", "0 0.11840 0.27760 0.3001 0.14710 \n", "1 0.08474 0.07864 0.0869 0.07017 \n", "2 0.10960 0.15990 0.1974 0.12790 \n", "3 0.14250 0.28390 0.2414 0.10520 \n", "4 0.10030 0.13280 0.1980 0.10430 \n", "\n", " symmetry_mean ... radius_worst texture_worst perimeter_worst \\\n", "0 0.2419 ... 25.38 17.33 184.60 \n", "1 0.1812 ... 24.99 23.41 158.80 \n", "2 0.2069 ... 23.57 25.53 152.50 \n", "3 0.2597 ... 14.91 26.50 98.87 \n", "4 0.1809 ... 22.54 16.67 152.20 \n", "\n", " area_worst smoothness_worst compactness_worst concavity_worst \\\n", "0 2019.0 0.1622 0.6656 0.7119 \n", "1 1956.0 0.1238 0.1866 0.2416 \n", "2 1709.0 0.1444 0.4245 0.4504 \n", "3 567.7 0.2098 0.8663 0.6869 \n", "4 1575.0 0.1374 0.2050 0.4000 \n", "\n", " concave points_worst symmetry_worst fractal_dimension_worst \n", "0 0.2654 0.4601 0.11890 \n", "1 0.1860 0.2750 0.08902 \n", "2 0.2430 0.3613 0.08758 \n", "3 0.2575 0.6638 0.17300 \n", "4 0.1625 0.2364 0.07678 \n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# clean the data by removing columns 'Unnamed: 32' and 'id'\n", "\n", "data = data.drop(['Unnamed: 32', 'id'], axis=1)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 14, "id": "aea25336", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "0c63eefea726df52338e686a1cb57161", "grade": true, "grade_id": "clean_data", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# 1 point\n", "\n", "assert data.shape == (\n", " 
569,\n", " 31,\n", "), \"Your data shape after removing the columns does not match!\"\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "b89ca58a", "metadata": {}, "source": [ "The first column of the cleaned dataset should now correspond to the labels, the rest of the columns correspond to the features which we will assign to `X`:" ] }, { "cell_type": "code", "execution_count": 15, "id": "ff1d12fd", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "379c26187eb932e5005313ac272192b4", "grade": false, "grade_id": "cell-60ccd97d76da9140", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# Features\n", "X = data.drop(\"diagnosis\", axis=1)\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "758a6841", "metadata": {}, "source": [ "Next, we need to convert the categorical labels `B` and `M` into integers `0` and `1` as our model can only handle numeric data. \\\n", "We can do this easily by using the [`LabelEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from `sklearn`." 
] }, { "cell_type": "code", "execution_count": 19, "id": "216fe5e0", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "7c36d8e96a8df0d2841a53b3515bf1bb", "grade": false, "grade_id": "cell-4c39ba7e96a96a4b", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 0 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1\n", " 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0\n", " 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0\n", " 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 0 1\n", " 0 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 1 1 0 0\n", " 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1\n", " 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0\n", " 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1\n", " 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0\n", " 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 0\n", " 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0\n", " 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 1\n", " 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 1 1 1 1 1 1 0]\n" ] } ], "source": [ "# 2 points\n", "\n", "# categorical y values\n", "y_categorical = data[\"diagnosis\"].values\n", "\n", "# Assign a LabelEncoder object to labelencoder_y and obtain the encoded labels as y.\n", "labelencoder_y = None\n", "y = None\n", "\n", "labelencoder_y = LabelEncoder()\n", "labelencoder_y.fit(y_categorical)\n", "y = 
labelencoder_y.transform(y_categorical)\n", "print(y)" ] }, { "cell_type": "code", "execution_count": 20, "id": "beab50b4", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d95bd19da8f9fa60c11dd9317dff27ce", "grade": true, "grade_id": "label_encoder", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "assert isinstance(\n", " labelencoder_y, LabelEncoder\n", "), \"The labelencoder should be an instance of the sklearn LabelEncoder\"\n", "\n", "# hidden test label encoder - 1 point\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "0121eed7", "metadata": {}, "source": [ "In our last preprocessing step we need to divide the data into a training and test set. We use the training set for training and keep the test set for evaluating a trained classifier which gives us the generalization error.\n", "\n", "We use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split) to split 80% of `X` and `y` as training set and use the rest as test set:" ] }, { "cell_type": "code", "execution_count": 21, "id": "424576f4", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ba414b9656d7bf8e57eebd77cd445f88", "grade": false, "grade_id": "cell-4158f8d16c082fa1", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=42\n", ")\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "a342c847", "metadata": {}, "source": [ "## 1.2 Decision Tree Classifier\n", "We are now ready to train a decision tree classifier. 
\\\n", "We will use the [`DecisionTreeClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) available in sklearn.\n", "\n", "### Entropy and Gini Index\n", "One parameter we have to choose is the function that measures the quality of a split, i.e. the `criterion`, which quantifies the impurity of the resulting subsets.\\\n", "Possible criteria are `entropy` and `gini`, both of which you have seen in the lecture. Both quantify the uncertainty or disorder in a dataset's distribution of classes. A higher value implies greater disorder.\\\n", "In decision tree algorithms, the goal is therefore to reduce entropy (or the gini index) by making splits that result in more homogeneous subsets of data.\n", "\n", "Consider a dataset with 100 samples belonging to two classes (class A and class B). Assume that each class has an equal probability of occurrence. Calculate both the entropy and the gini index of the dataset using the formulas given in the lecture.\\\n", "Assign your answers as floats to the variables below." 
] }, { "cell_type": "code", "execution_count": 27, "id": "5fcdc4e3", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ef7e4111192e4fbcc6c40fb8a7eb8ff1", "grade": false, "grade_id": "cell-16abc80a7ce2e979", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0 0.5\n" ] } ], "source": [ "# assign values as floats - 2 points\n", "entropy = None\n", "gini_index = None\n", "\n", "# class probabilities of a balanced two-class dataset (50 of 100 samples each)\n", "p = np.array([0.5, 0.5])\n", "\n", "# entropy H = -sum(p * log2(p)), gini index G = 1 - sum(p**2)\n", "entropy = float(-np.sum(p * np.log2(p)))\n", "gini_index = float(1.0 - np.sum(p**2))\n", "\n", "print(entropy, gini_index)" ] }, { "cell_type": "code", "execution_count": 28, "id": "4fc00fcb", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f71adb6b074026032ce149e523dafa3a", "grade": true, "grade_id": "entropy_gini_index", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "assert (\n", " entropy != None and gini_index != None\n", "), \"Please assign values to entropy and gini index!\"\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "965039b7", "metadata": {}, "source": [ "We can now initialize the decision tree classifier using the gini index as splitting criterion and a fixed maximum depth of the tree. 
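
To make the splitting criterion concrete, here is a minimal sketch of the two impurity measures, assuming the standard definitions (entropy as the negative sum of p log2 p over the classes, Gini index as one minus the sum of squared class probabilities). The helper names `class_probabilities`, `entropy_bits` and `gini_impurity` are illustrative and are not part of `sklearn`.

```python
import math

# Hypothetical helpers (not sklearn functions), assuming the standard definitions:
#   entropy    H = -sum_k p_k * log2(p_k)
#   Gini index G = 1 - sum_k p_k**2


def class_probabilities(counts):
    # Turn per-class sample counts into class probabilities.
    total = sum(counts)
    return [c / total for c in counts]


def entropy_bits(probabilities):
    # Classes with p = 0 contribute nothing and are skipped to avoid log2(0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def gini_impurity(probabilities):
    return 1.0 - sum(p * p for p in probabilities)


balanced = class_probabilities([50, 50])  # two equally likely classes
skewed = class_probabilities([90, 10])    # one class dominates

print(entropy_bits(balanced), gini_impurity(balanced))
print(entropy_bits(skewed), gini_impurity(skewed))
```

Both measures are largest for a balanced node and drop to zero for a pure node, which is why a split that produces purer child nodes lowers the impurity.
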
We use a specific random state to make the results reproducible:" ] }, { "cell_type": "code", "execution_count": 29, "id": "9ad7b8bf", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "e86e3efb9f75c3334390df99ab7c1387", "grade": false, "grade_id": "cell-4fa71eabfa742600", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# Initialize DecisionTreeClassifier\n", "tree_classifier = DecisionTreeClassifier(criterion=\"gini\", max_depth=5, random_state=42)\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "2cef27dc", "metadata": {}, "source": [ "Now, we can train the `DecisionTreeClassifier` on the training data using the `fit()` method:" ] }, { "cell_type": "code", "execution_count": 30, "id": "6a74feb6", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "17a5a0f734417d7878ce1b95fa3ddb03", "grade": false, "grade_id": "cell-aaf56bb6680fbe00", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/html": [ "
DecisionTreeClassifier(max_depth=5, random_state=42)
" ], "text/plain": [ "DecisionTreeClassifier(max_depth=5, random_state=42)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##### DO NOT CHANGE #####\n", "\n", "tree_classifier.fit(X_train, y_train)\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "e0e4182c", "metadata": {}, "source": [ "Use the predict() method to predict the labels of the test set to check how well the model generalizes to unseen data:" ] }, { "cell_type": "code", "execution_count": 32, "id": "af2b2cca", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "3f0434dcbb78de22f2063c8b41371c0f", "grade": false, "grade_id": "cell-93189e66e709ae9d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "y_pred = None\n", "\n", "y_pred = tree_classifier.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 33, "id": "a4324177", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "fa9aa02db633f2043715e7eba0889683", "grade": true, "grade_id": "decision_tree_predict", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# 1 point\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "280aad8e", "metadata": {}, "source": [ "### Accuracy\n", "\n", "Since we now have the predicted labels of the test set we can use them to evaluate the accuracy of the model by comparing them to the 'true' labels.\\\n", "The accuracy is defined as:\n", "\\begin{align}\n", "Accuracy &= \\frac{Number\\,of\\,correct\\,predictions}{Total\\,number\\,of\\,predictions}\n", "\\end{align}\n", "\n", "Use [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to get the accuracy of the trained decision tree:" ] }, { "cell_type": "code", "execution_count": 39, 
"id": "1cde2417", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a5156c5788d87e0561436147ab227301", "grade": false, "grade_id": "cell-80843b60a7bf0f8b", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9473684210526315\n" ] } ], "source": [ "# Evaluate the accuracy of the model - 1 point\n", "\n", "accuracy = None\n", "\n", "accuracy = (y_pred == y_test).sum() / len(y_pred)\n", "print(accuracy)" ] }, { "cell_type": "code", "execution_count": 40, "id": "70bbcc22", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "42441dbf38b69d391f0263b24d74a3f9", "grade": true, "grade_id": "decision_tree_accuracy", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "assert accuracy != None\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "9725ffc5", "metadata": {}, "source": [ "## 1.3 Random Forest Classifier\n", "\n", "A random forest classifier is an ensemble method consisting of multiple decision trees. \n", "By combining bagging and a random split selection, random forests generally have many advantages over single decision trees like improved generalization, robustness to noise and higher accuracy.\n", "By averaging predictions from multiple trees, overfitting can also be reduced.\n", "Since features are intrinsically evaluated the interpretability of single decision trees is kept.\\\n", "In this part of the exercise we fit a random forest classifier to our data and compare its perfomance to the single tree classifier." 
] }, { "cell_type": "code", "execution_count": 41, "id": "b6c5bc90", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ce74131157e9146ca14b14d6528e283f", "grade": false, "grade_id": "cell-6d77200d90a73fdd", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "rf_classifier = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=42)\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "d6b8326a", "metadata": {}, "source": [ "Again, fit the random forest classifier to the training data, predict the labels of the test set and evaluate the accuracy of the trained model:" ] }, { "cell_type": "code", "execution_count": 44, "id": "0706e73b", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ea1ee1eb4d520d971b77e259ea350a04", "grade": false, "grade_id": "cell-e5d3fc5b559d208d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.956140350877193\n" ] } ], "source": [ "y_pred_rf = None\n", "accuracy_rf = None\n", "\n", "rf_classifier.fit(X_train, y_train)\n", "y_pred_rf = rf_classifier.predict(X_test)\n", "# compare the random forest predictions with the true test labels\n", "accuracy_rf = accuracy_score(y_test, y_pred_rf)\n", "print(accuracy_rf)" ] }, { "cell_type": "code", "execution_count": 45, "id": "9c8baec7", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d70a1f33852ada1160360589115c9f06", "grade": true, "grade_id": "rf_implementation", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# check results - 2 points\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "ffd3ddd1", "metadata": {}, 
"source": [ "### Confusion Matrix \n", "There are many other 
metrics besides accuracy to evaluate the performance of a classification model.\n", "Confusion matrices are a visual representation of how many samples were correctly and incorrectly classified.\n", "True labels are assigned to the y-axis, predicted labels to the x-axis and each cell of the matrix contains the number of cases in which the specific combination occurred.\\\n", "For example, cell (0,0) contains the number of times the model predicted the label 0 (B, benign) correctly.\n", "\n", "[`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)" ] }, { "cell_type": "code", "execution_count": 46, "id": "02dd0735", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7c0bc0c714bad914567dfc890e104feb", "grade": false, "grade_id": "cell-0417ea30bba567bc", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "##### DO NOT CHANGE #####\n", "\n", "cm = confusion_matrix(y_test, y_pred_rf)\n", "\n", "ConfusionMatrixDisplay(confusion_matrix=cm).plot()\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "138b94b3", "metadata": {}, "source": [ "From your confusion matrix, how many wrong predictions were made by the model? Assign it to `answer` as an integer:" ] }, { "cell_type": "code", "execution_count": 51, "id": "6ac58752", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d50f8ddebddbeb56bcda4a420fcd7b48", "grade": false, "grade_id": "cell-c8ce2ca4a871f946", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n" ] } ], "source": [ "answer_cm_1 = None\n", "\n", "answer_cm_1 = cm[(0,1)]+cm[(1,0)]\n", "print(answer_cm_1)" ] }, { "cell_type": "code", "execution_count": 52, "id": "de36e00b", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "fdf9444646863dad24c34627304b1716", "grade": true, "grade_id": "rf_confusion_matrix_1", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# check results - 1 points\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "5ffbc8d0", "metadata": {}, "source": [ "From your confusion matrix, how many samples were classified as benign by the model but are actually malignant? 
Assign it to `answer_cm_2` as an integer:" ] }, { "cell_type": "code", "execution_count": 55, "id": "501491a6", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "4defefea4a3ecefab79c7852a24df6ec", "grade": false, "grade_id": "cell-3baa82187d2089f3", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "answer_cm_2 = None\n", "\n", "# row 1 = true label malignant, column 0 = predicted label benign\n", "answer_cm_2 = int(cm[1, 0])\n", "print(answer_cm_2)" ] }, { "cell_type": "code", "execution_count": 56, "id": "6cfba7bf", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "6133d49de231bb75139468daff12af5f", "grade": true, "grade_id": "rf_confusion_matrix_2", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# check results - 1 points\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "f0e5ded2", "metadata": {}, "source": [ "### Hyperparameters\n", "\n", "One essential part of any machine learning application is hyperparameter optimization. Hyperparameters are settings of the learning algorithm itself that are chosen before training rather than learned from the data; by tuning them we can improve the performance of the model.\n", "In the case of a random forest classifier these include for example:\\\n", "`n_estimators`: number of trees in the forest\\\n", "`criterion`: impurity measure \\\n", "`max_depth`: maximum depth of a tree\n", "\n", "Can you improve the accuracy `accuracy_rf` of the random forest classifier by finding more suitable hyperparameters?" 
] }, { "cell_type": "code", "execution_count": 73, "id": "7677e39b", "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "3941eecfab009b40a27077bcf5c3f604", "grade": false, "grade_id": "cell-290803e722f06c0c", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.956140350877193 0.9649122807017544\n" ] } ], "source": [ "accuracy_rf_tuned = None\n", "\n", "# more and slightly deeper trees than the baseline forest (n_estimators=50, max_depth=2)\n", "rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)\n", "rf_classifier.fit(X_train, y_train)\n", "y_pred_rf = rf_classifier.predict(X_test)\n", "accuracy_rf_tuned = accuracy_score(y_test, y_pred_rf)\n", "print(accuracy_rf, accuracy_rf_tuned)" ] }, { "cell_type": "code", "execution_count": 74, "id": "71371943", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5a337de897523b4a50a627fab26d67db", "grade": true, "grade_id": "rf_hyperopt", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "##### DO NOT CHANGE #####\n", "\n", "# check results - 3 points\n", "\n", "\n", "##### DO NOT CHANGE #####" ] }, { "cell_type": "markdown", "id": "a56944cd", "metadata": {}, "source": [ "# Submitting your solution\n", "\n", "As a last step, upload the notebook to Ilias so that we can auto-grade it." ] } ], "metadata": { "jupytext": { "formats": "ipynb,py:percent" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 5 }