{ "cells": [ { "cell_type": "markdown", "id": "84b2ed6d", "metadata": {}, "source": [ "# Exercise Sheet No. 2\n", "\n", "---\n", "\n", "> Machine Learning for Natural Sciences, Summer 2024, Jun.-Prof. Pascal Friederich, pascal.friederich@kit.edu\n", "\n", "> Instructor: Marlen Neubert (marlen.neubert@kit.edu)\n", "\n", "---\n", "**Deadline**: Monday, April 29th 8am \n", "\n", "**Topic**: This exercise deals with decision trees and random forests. We examine the parameters and properties of these two algorithms on a binary classification example using [`sklearn`](https://scikit-learn.org/stable/) methods." ] }, { "cell_type": "markdown", "id": "3aac18d1", "metadata": {}, "source": [ "### Please put your name and your group members here: \n", "You are encouraged to work in groups of up to 3 people; however, **each of you** has to submit a solution.\n", "\n", "Nils Lennart Bruns, usxfs\n" ] }, { "cell_type": "markdown", "id": "eee3c2a3", "metadata": {}, "source": [ "## Preliminaries\n", "If you are not familiar with Python, you may want to learn more about Python\n", "and its basic syntax. Since there are a lot of free and well-written tutorials\n", " online, we refer you to one of the following:\n", "\n", "* http://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook\n", "* https://www.learnpython.org/\n", "* https://automatetheboringstuff.com/" ] }, { "cell_type": "markdown", "id": "20c7a0a0", "metadata": {}, "source": [ "## 1.1 Data Preprocessing and Exploration\n", "\n", "The data we will be working with is the breast cancer dataset from the [University of Wisconsin](http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29) - a binary classification dataset for diagnosing breast cancer. \\\n", "It contains 30 features which are derived from digitized images and describe characteristics of the cell nuclei. 
The corresponding labels give the diagnosis as either \\\n", "`B`: benign, the tumor doesn’t contain cancerous cells or \\\n", "`M`: malignant, the tumor contains cancerous cells. \n", "\n", "### Problem Description\n", "We want to predict whether a breast cancer tumor is benign or malignant. This is a binary classification problem since we have two output classes.\\\n", "Before we can start training our algorithms, we have to get familiar with the data and prepare it:" ] }, { "cell_type": "code", "execution_count": 2, "id": "84713313", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import requests\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": 3, "id": "e982076d", "metadata": {}, "outputs": [], "source": [ "data_url = \"https://bwsyncandshare.kit.edu/s/dCsEn6eK5S453Lq/download\"\n", "data_file = \"breast_cancer_data.csv\"\n", "# Download the CSV only if it is not already cached locally\n", "if not os.path.exists(data_file):\n", "    print(\"Downloading dataset ...\")\n", "    response = requests.get(data_url, timeout=60)\n", "    # Fail early on HTTP errors instead of saving an error page as CSV\n", "    response.raise_for_status()\n", "    with open(data_file, \"wb\") as f:\n", "        f.write(response.content)\n", "    print(\"Downloading dataset done.\")" ] }, { "cell_type": "markdown", "id": "b38f90bf", "metadata": {}, "source": [ "We load the dataset with the ``pandas`` library, which returns a ``DataFrame`` object. We can print the head of the table with ``.head()``:" ] }, { "cell_type": "code", "execution_count": 4, "id": "2c83c30e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |

5 rows × 33 columns
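
Two columns in the preview above carry no information about the tumor itself: `id` is a record identifier, and `Unnamed: 32` holds only `NaN` in the rows shown. The following is a minimal sketch of how such columns could be removed with `DataFrame.drop`. The frame `df_demo` is a hypothetical stand-in holding a few values copied from the preview, not the full dataset, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the loaded table
# (values copied from the first two preview rows).
df_demo = pd.DataFrame(
    {
        "id": [842302, 842517],
        "diagnosis": ["M", "M"],
        "radius_mean": [17.99, 20.57],
        "Unnamed: 32": [np.nan, np.nan],
    }
)

# Drop the identifier column and the column that contains only NaN.
df_clean = df_demo.drop(columns=["id", "Unnamed: 32"])

print(df_clean.columns.tolist())
```

On the real table, the same call would reduce 33 columns to 31: the `diagnosis` label plus the 30 features.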
|  | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |

5 rows × 31 columns
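
The notebook imports `LabelEncoder` and `train_test_split`, so a plausible next step is to turn the `B`/`M` strings into integers and hold out part of the data for testing. The sketch below illustrates this on a hypothetical toy frame. The values in `toy` are made up for illustration and are not taken from the dataset, and the split ratio is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: made-up feature values, NOT rows from the real dataset.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B", "M", "B", "B", "M"],
        "radius_mean": [18.0, 20.5, 12.1, 11.3, 19.7, 13.0, 12.4, 17.2],
        "texture_mean": [10.4, 17.8, 14.2, 15.9, 21.3, 13.1, 16.0, 20.1],
    }
)

# LabelEncoder sorts the class names, so "B" becomes 0 and "M" becomes 1.
encoder = LabelEncoder()
y = encoder.fit_transform(toy["diagnosis"])
X = toy.drop(columns=["diagnosis"])

# Hold out 25% of the samples; stratify keeps the B/M ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models trained with different hyperparameters.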
DecisionTreeClassifier(max_depth=5, random_state=42)
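
The output above is the representation of a decision tree configured with `max_depth=5` and `random_state=42`. As a rough sketch of how such a model could be fitted and evaluated with the imported metrics, the example below uses scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the downloaded CSV. That substitution, the 80/20 split, and the forest settings are assumptions, not the notebook's own choices. Note that the bundled copy encodes malignant as 0 and benign as 1, which may differ from a `LabelEncoder` applied to `B`/`M`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same hyperparameters as in the output shown above.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)

# A random forest for comparison; n_estimators=100 is an illustrative choice.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

tree_acc = accuracy_score(y_test, tree_pred)
forest_acc = accuracy_score(y_test, forest_pred)
# Rows are true classes, columns are predicted classes.
tree_cm = confusion_matrix(y_test, tree_pred)

print(f"Decision tree accuracy: {tree_acc:.3f}")
print(f"Random forest accuracy: {forest_acc:.3f}")
print(tree_cm)
```

For a diagnosis task, the confusion matrix is usually more informative than accuracy alone, because it separates missed malignant cases from false alarms.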