{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Moderne Methoden der Datenanalyse SS2023\n",
    "# Practical Exercise 6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.1: Hypothesis Testing\n",
    "\n",
    "\"Is this a new discovery or just a statistical fluctuation?\" Statistics offers some methods to give a quantitative answer. But these methods should not be used blindly. In particular, one should know exactly what the obtained numbers mean and what they don't mean."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6.1.1 (obligatory to solve either 6.1.1 or 6.1.2)\n",
    "\n",
    "The following table shows the number of winners in a horse race for different track numbers:\n",
    "\n",
    "| track      ||| 1    || 2    || 3    || 4    || 5    || 6    || 7    || 8    |\n",
    "|------------|||------||------||------||------||------||------||------||------|\n",
    "| # winners  ||| 29   || 19   || 18   || 25   || 17   || 10   || 15   || 11   |\n",
    "\n",
    "Use a $\\chi ^2$ test to check the hypothesis that the track number has *no* influence on the chance to win. Define a significance level, e.g., $\\alpha = 5 \\, \\%$ or $\\alpha = 1 \\, \\%$, *before*  you do the test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pick your poison\n",
    "from ROOT import TMath\n",
    "from scipy.stats import chi2\n",
    "\n",
    "def exercise6_1_1(confidenceLevel):\n",
    "  \n",
    "    # numbers given in the exercise\n",
    "    nTracks = 8\n",
    "    nWin = [29, 19, 18, 25, 17, 10, 15, 11]\n",
    "    \n",
    "    return"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6.1.2 (obligatory to solve either 6.1.1 or 6.1.2)\n",
    "\n",
    "In a counting experiment, 5 events are observed while $\\mu _\\mathrm{B} = 1.8$ background events are expected. Is this a significant ($= 3 \\sigma$) excess? Calculate the probability of observing 5 or more events when the expectation value is 1.8 using Poisson statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# pick your poison\n",
    "from ROOT import gRandom, TMath, Math\n",
    "from scipy.stats import poisson, norm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def exercise6_1_2():\n",
    "  \n",
    "    # numbers given in the exercise\n",
    "    nBackground = 1.8\n",
    "    nObserved = 5\n",
    "\n",
    "    # calculate the probability to observe 5 or more events\n",
    "  \n",
    "    # optional: make toy experiments\n",
    "    \n",
    "    return"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 6.2: Parameter Estimation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6.2.1 (voluntary)\n",
    "\n",
    "Consider the following set of values approximately following a Gaussian distribution (see also excercise_6_2_1.csv):\n",
    "\n",
    "<style>\n",
    "    alternative\n",
    "| $x_i$ || $y_i$ || $\\sigma_i$  ||||| $x_i$ || $y_i$ || $\\sigma_i$  ||||| $x_i$ || $y_i$ || $\\sigma_i$  ||||| $x_i$ || $y_i$ || $\\sigma_i$  |\n",
    "|-------||-------||-------------|||||-------||-------||-------------|||||-------||-------||-------------|||||-------||-------||-------------|\n",
    "| 0.46  || 0.19  || 0.05        ||||| 0.69  || 0.27  || 0.06        ||||| 0.71  || 0.28  || 0.05        ||||| 1.04  || 0.62  || 0.01        |\n",
    "| 1.11  || 0.68  || 0.05        ||||| 1.14  || 0.70  || 0.07        ||||| 1.17  || 0.74  || 0.08        ||||| 1.20  || 0.81  || 0.09        |\n",
    "| 1.31  || 0.93  || 0.10        ||||| 2.03  || 2.49  || 0.03        ||||| 2.14  || 2.73  || 0.04        ||||| 2.52  || 3.57  || 0.01        |\n",
    "| 3.24  || 3.90  || 0.07        ||||| 3.46  || 3.55  || 0.03        ||||| 3.81  || 2.87  || 0.03        ||||| 4.06  || 2.24  || 0.01        |\n",
    "| 4.93  || 0.65  || 0.10        ||||| 5.11  || 0.39  || 0.07        ||||| 5.26  || 0.33  || 0.05        ||||| 5.38  || 0.26  || 0.08        |\n",
    "</style>\n",
    "\n",
    "<table id=\"sometable\">\n",
    "  <tr style=\"border-bottom: 1px solid darkgray\">\n",
    "    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style=\"padding-right: 40px;\"> &sigma;<sub>i</sub></th>\n",
    "    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style=\"padding-right: 40px;\"> &sigma;<sub>i</sub></th>\n",
    "    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style=\"padding-right: 40px;\"> &sigma;<sub>i</sub></th>\n",
    "    <th> x<sub>i</sub> </th><th> y<sub>i</sub> </th><th style=\"padding-right: 40px;\"> &sigma;<sub>i</sub></th>\n",
    "  </tr> <tr>\n",
    "    <td>0.46</td> <td>0.19</td> <td style=\"padding-right: 40px;\">0.05</td>   <td>0.69</td><td>0.27</td><td style=\"padding-right: 40px;\">0.06</td>\n",
    "    <td>0.71</td> <td>0.28</td> <td style=\"padding-right: 40px;\">0.05</td>   <td>1.04</td><td>0.62</td><td>0.01</td>\n",
    "  </tr> <tr>\n",
    "    <td>1.11</td> <td>0.68</td> <td style=\"padding-right: 40px;\">0.05</td>   <td>1.14</td><td>0.70</td><td style=\"padding-right: 40px;\">0.07</td>\n",
    "    <td>1.17</td> <td>0.74</td> <td style=\"padding-right: 40px;\">0.08</td>   <td>1.20</td><td>0.81</td><td>0.09</td>\n",
    "  </tr> <tr>\n",
    "    <td>1.31</td> <td>0.93</td> <td style=\"padding-right: 40px;\">0.10</td>   <td>2.03</td><td>2.49</td><td style=\"padding-right: 40px;\">0.03</td>\n",
    "    <td>2.14</td> <td>2.73</td> <td style=\"padding-right: 40px;\">0.04</td>   <td>2.52</td><td>3.57</td><td>0.01</td>\n",
    "  </tr> <tr>\n",
    "    <td>3.24</td> <td>3.90</td> <td style=\"padding-right: 40px;\">0.07</td>   <td>3.46</td><td>3.55</td><td style=\"padding-right: 40px;\">0.03</td>\n",
    "    <td>3.81</td> <td>2.87</td> <td style=\"padding-right: 40px;\">0.03</td>   <td>4.06</td><td>2.24</td><td>0.01</td>\n",
    "  </tr> <tr>\n",
    "    <td>4.93</td> <td>0.65</td> <td style=\"padding-right: 40px;\">0.10</td>   <td>5.11</td><td>0.39</td><td style=\"padding-right: 40px;\">0.07</td>\n",
    "    <td>5.26</td> <td>0.33</td> <td style=\"padding-right: 40px;\">0.05</td>   <td>5.38</td><td>0.26</td><td>0.08</td>\n",
    "  </tr>    \n",
    "</table>  \n",
    "\n",
    "\n",
    "$$ y(x) = a _\\mathrm{1} \\cdot \\exp \\{-\\frac{1}{2} (\\frac{ x-a _\\mathrm{2} }{ a _\\mathrm{3} } )^2 \\} $$\n",
    "\n",
    "\n",
    "with $\\sigma _i$ being the uncertainty on $y _i$.\n",
    "\n",
    " - Determine the values of the three parameters $a _\\mathrm{1}$, $a _\\mathrm{2}$, and $a _\\mathrm{3}$ as well as their uncertainties.\n",
    " \n",
    " - Afterwards, use the transformation $z = \\ln (y)$ to obtain the linear function $z(x) = b _\\mathrm{1} + b _\\mathrm{2} x + b _\\mathrm{3} x^2$. Determine the new parameters $b _\\mathrm{1}$, $b _\\mathrm{2}$, and $b _\\mathrm{3}$ and uncertainties in two ways and compare the results: first, by fitting this new function; second, by a calculation using the results for $a _\\mathrm{j}$ and the transformation $z = \\ln (y)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ROOT approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ROOT import TF1, TGraphErrors, TCanvas\n",
    "import numpy as np\n",
    "\n",
    "def exercise6_2_1(verbose = 0):\n",
    "    # read exercise_6_2_1.csv or type are values explicitly\n",
    "    \n",
    "    # fit a Gaussian function\n",
    "    \n",
    "    #print(\" --------  Fit Parameters ---------- \")\n",
    "    #print(\" a1 =     {0:.3f} +/- {1:.3f} \".format(a1, err_a1))\n",
    "    #print(\" a2 =     {0:.3f} +/- {1:.3f} \".format(a2, err_a2))\n",
    "    #print(\" a3 =     {0:.3f} +/- {1:.3f} \".format(a3, err_a3))\n",
    "    #print(\" chi2 =   {0:.3f} \".format(chi))\n",
    " \n",
    "\n",
    "    # now with log\n",
    "    \n",
    "    # calculation\n",
    "    \n",
    "    \n",
    "    #print(\" --------  Fit Parameters ---------- \")\n",
    "    #print(\" b1 =     {0:.3f} +/- {1:.3f} \".format(b1, err_b1))\n",
    "    #print(\" b2 =     {0:.3f} +/- {1:.3f} \".format(b2, err_b2))\n",
    "    #print(\" b3 =     {0:.3f} +/- {1:.3f} \".format(b3, err_b3))\n",
    "    #print(\" chi2 =   {0:.3f} \".format(chi2))\n",
    "  \n",
    "    #print(\"\\n --------  Estimated Parameters ---------- \")\n",
    "    #print(\" b1 =     {0:.3f} +/- {1:.3f} \".format(b1est, err_b1est))\n",
    "    #print(\" b2 =     {0:.3f} +/- {1:.3f} \".format(b2est, err_b2est))\n",
    "    #print(\" b3 =     {0:.3f} +/- {1:.3f} \".format(b3est, err_b3est))\n",
    "\n",
    "  \n",
    "    # draw graphs\n",
    "\n",
    "    return"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Python approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.optimize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def exercise_6_2_1():\n",
    "    # read exercise_6_2_1.csv or type are values explicitly\n",
    "\n",
    "    # fit a Gaussian function\n",
    "\n",
    "    #print(\"Fit Result:\")\n",
    "    #print(\"a1 = {0:.3f} +- {3:.3f}\\na2 = {1:.3f} +- {4:.3f}\\na3 = {2:.3f} +- {5:.3f}\".\n",
    "    #      format(a1, a2, a3, a1err, a2err, a3err))\n",
    "    \n",
    "    \n",
    "    # now with log\n",
    "        \n",
    "    # calculation\n",
    "    \n",
    "\n",
    "    #print(\"Fit Result:\")\n",
    "    #print(\"b1 = {0:.3f} +- {3:.3f}\\nb2 = {1:.3f} +- {4:.3f}\\nb3 = {2:.3f} +- {5:.3f}\".\n",
    "    #      format(*bopt, np.sqrt(bcov[0, 0]), np.sqrt(bcov[1, 1]), np.sqrt(bcov[2, 2])))\n",
    "    #print(\"Transformed Result:\")\n",
    "    #print(\"b1 = {0:.3f} +- {3:.3f}\\nb2 = {1:.3f} +- {4:.3f}\\nb3 = {2:.3f} +- {5:.3f}\".\n",
    "    #      format(b1, b2, b3, b1err, b2err, b3err))\n",
    "    \n",
    "    # plot\n",
    "    \n",
    "    return"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 6.2.2 (obligatory)\n",
    "\n",
    "This exercise aims at constructing the error band around a function, $f(x)$, fitted to data points $(x,y)$ - i.e., the errors on the fitted parameters are transformed into errors on the value of the function at each value of $x$. (If you do not want to use `ROOT` to optimize the curve fit in this exercise, you can instead use the fitting algorithm provided by scipy, e.g., scipy.optimize.curve_fit - see Python approach)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us attack the problem in steps:\n",
    "\n",
    " - First, define a function for our problem: a straight line as $f(x) = a + bx$ with parameters a and b.\n",
    " \n",
    " - Next, consider `npoints=11` data points in the interval `[xof, xof+npoints-1]` with `xof=10` and y(x) = x. To simulate measurement errors, shift the data points in y-direction by a random shift, drawn from a Gaussian distribution with $\\sigma = 0.5$ and $\\mu = 0.0$.\n",
    "\n",
    " - Now fit the straight line defined above to your data points.\n",
    " \n",
    "   Draw the data points and the fit result. Print the correlation coefficient of the errors on the parameters $a$ and $b$.\n",
    "   \n",
    "   Try to give an intuitive argument why the two parameters $a$ (axis intercept) and $b$ (slope) are strongly correlated.\n",
    "   \n",
    " - Next, construct the error band around the fit function: write a function `make_band(f, cov, withCorr=0)`, which takes the function $f$ and the covariance matrix, as input arguments and calculates for each value of x the error on $y$, $\\Delta _y (x)$. As a first approach, use the simple formula for error propagation, which, in this case, results in $\\Delta _y (x) ^2 = \\Delta _a ^2 + (x \\cdot \\Delta _b )^2$ if correlations are neglected.\n",
    " \n",
    "   Draw the curves $y(x) \\pm \\Delta_y (x)$, i.e., the \"error band\" on top of your data and fit result. Does this look correct?\n",
    "   \n",
    " - Derive the appropriate formula for error propagation taking into account the correlation of the errors. Re-calculate $\\Delta _y (x)$ and plot the corresponding error band. Compare with the above result obtained without correlations.\n",
    "  \n",
    "To get a better idea of what happens here and what the effect of the correlations is, you might want to repeat the whole exercise setting `xof=-5.`, meaning now that the mean of the $x$-values of the data points is approximately at $0$. Alternatively, you may choose the parametrisation of the function as `y(x) = a' + b(x-(xof+(npoints-1)/2))`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ROOT Apprach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ROOT import TF1, gRandom, TGraphErrors, TAxis, TVirtualFitter, TMatrixD, TH1, TCanvas, TFile, TLegend\n",
    "from array import array\n",
    "from math import fabs, sqrt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluates error band with/without taking into account the correlation\n",
    "def make_band_r (f, cov, withCorr=0):\n",
    "    # As first approach, use the simple formula for error propagation\n",
    "    \n",
    "    # Derive the appropriate formula taking into account the correlation of the errors\n",
    "    \n",
    "    return"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "def drawUncertaintyBand(npoints=11, xof=10, sigma=0.5):\n",
    "    # First: Define a function (e.g. a straight line)\n",
    "  \n",
    "\n",
    "    # Next: `npoints` data points in the interval [xof, xof+npoints-1] with y(x) = x.\n",
    "    # Shift the data points in y-direction\n",
    "\n",
    "    \n",
    "    # Store the values in a TGraphErrors\n",
    "  \n",
    "\n",
    "    # Now fit the straight line defined above to your data points\n",
    "\n",
    "    \n",
    "    # Access the correlation matrix\n",
    "    # Hint: you can use the code in the appendix (see bottom of this notebook!)\n",
    "    \n",
    "    \n",
    "    # Define above the function `make_band` and draw the \"error band\" on top of your data and fit result\n",
    "    # w/o correlations  \n",
    "\n",
    "    # w/ correlations\n",
    "    \n",
    "    \n",
    "    return"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Python approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.optimize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First: Define a function (e.g. a straight line)\n",
    "def linear(x, *p):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_band_p(function, popt, pcov, x_range, use_correlation=False):\n",
    "    # As first approach, use the simple formula for error propagation\n",
    "    \n",
    "    # Derive the appropriate formula taking into account the correlation of the errors\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def exercise_6_2_2(xof=10):\n",
    "    # Next: `npoints` data points in the interval [xof, xof+npoints-1] with y(x) = x.\n",
    "    # Shift the data points in y-direction\n",
    "\n",
    "    # Now fit the straight line defined above to your data points\n",
    "\n",
    "    # Define above the function `make_band` and draw the \"error band\" on top of your data and fit result\n",
    "    # w/o correlation\n",
    "\n",
    "    # w/ correlation\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Appendix: Accessing the covariance matrix\n",
    "\n",
    "#### Example code for pyROOT / JupyROOT:"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "fitrp = TVirtualFitter.GetFitter()\n",
    "nPar = fitrp.GetNumberTotalParameters()\n",
    "\n",
    "covmat = TMatrixD(nPar, nPar, fitrp.GetCovarianceMatrix())\n",
    "\n",
    "covmat.Print()\n",
    "\n",
    "cormat = TMatrixD(covmat)\n",
    "for i in range(nPar):\n",
    "    for j in range(nPar):\n",
    "        cormat[i][j] = cormat[i][j] / (sqrt(covmat[i][i]) * sqrt(covmat[j][j]))\n",
    "\n",
    "cormat.Print()"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}