{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Moderne Methoden der Datenanalyse SS2021\n", "# Practical Exercise 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 6.1: Hypothesis Testing\n", "\n", "\"Is this a new discovery or just a statistical fluctuation?\" Statistics offers some methods to give a quantitative answer. But these methods should not be used blindly. In particular, one should know exactly what the obtained numbers mean and what they don't mean." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6.1.1 (obligatory to solve either 6.1.1 or 6.1.2)\n", "\n", "The following table shows the number of winners in a horse race for different track numbers:\n", "\n", "| track ||| 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 |\n", "|------------|||------||------||------||------||------||------||------||------|\n", "| # winners ||| 29 || 19 || 18 || 25 || 17 || 10 || 15 || 11 |\n", "\n", "Use a $\\chi ^2$ test to check the hypothesis that the track number has *no* influence on the chance to win. Define a significance level, e.g., $\\alpha = 5 \\, \\%$ or $\\alpha = 1 \\, \\%$, *before* you do the test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pick your poison\n", "from ROOT import TMath\n", "from scipy.stats import chi2\n", "\n", "def exercise6_1_1(confidenceLevel):\n", " \n", " # numbers given in the exercise\n", " nTracks = 8\n", " nWin = [29, 19, 18, 25, 17, 10, 15, 11]\n", " \n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6.1.2 (obligatory to solve either 6.1.1 or 6.1.2)\n", "\n", "In a counting experiment, 5 events are observed while $\\mu _\\mathrm{B} = 1.8$ background events are expected. Is this a significant ($= 3 \\sigma$) excess? Calculate the probability of observing 5 or more events when the expectation value is 1.8 using Poisson statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# pick your poison\n", "from ROOT import gRandom, TMath, Math\n", "from scipy.stats import poisson, norm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def exercise6_1_2():\n", " \n", " # numbers given in the exercise\n", " nBackground = 1.8\n", " nObserved = 5\n", "\n", " # calculate the probability to observe 5 or more events\n", " \n", " # optional: make toy experiments\n", " \n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 6.2: Parameter Estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6.2.1 (voluntary)\n", "\n", "Consider the following set of values approximately following a Gaussian distribution (see also excercise_6_2_1.csv):\n", "\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xi yi σi xi yi σi xi yi σi xi yi σi
0.46 0.19 0.05 0.690.270.060.71 0.28 0.05 1.040.620.01
1.11 0.68 0.05 1.140.700.071.17 0.74 0.08 1.200.810.09
1.31 0.93 0.10 2.032.490.032.14 2.73 0.04 2.523.570.01
3.24 3.90 0.07 3.463.550.033.81 2.87 0.03 4.062.240.01
4.93 0.65 0.10 5.110.390.075.26 0.33 0.05 5.380.260.08
\n", "\n", "\n", "$$ y(x) = a _\\mathrm{1} \\cdot \\exp \\{-\\frac{1}{2} (\\frac{ x-a _\\mathrm{2} }{ a _\\mathrm{3} } )^2 \\} $$\n", "\n", "\n", "with $\\sigma _i$ being the uncertainty on $y _i$.\n", "\n", " - Determine the values of the three parameters $a _\\mathrm{1}$, $a _\\mathrm{2}$, and $a _\\mathrm{3}$ as well as their uncertainties.\n", " \n", " - Afterwards, use the transformation $z = \\ln (y)$ to obtain the linear function $z(x) = b _\\mathrm{1} + b _\\mathrm{2} x + b _\\mathrm{3} x^2$. Determine the new parameters $b _\\mathrm{1}$, $b _\\mathrm{2}$, and $b _\\mathrm{3}$ and uncertainties in two ways and compare the results: first, by fitting this new function; second, by a calculation using the results for $a _\\mathrm{j}$ and the transformation $z = \\ln (y)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ROOT approach" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ROOT import TF1, TGraphErrors, TCanvas\n", "import numpy as np\n", "\n", "def exercise6_2_1(verbose = 0):\n", " # read exercise_6_2_1.csv or type are values explicitly\n", " \n", " # fit a Gaussian function\n", " \n", " #print(\" -------- Fit Parameters ---------- \")\n", " #print(\" a1 = {0:.3f} +/- {1:.3f} \".format(a1, err_a1))\n", " #print(\" a2 = {0:.3f} +/- {1:.3f} \".format(a2, err_a2))\n", " #print(\" a3 = {0:.3f} +/- {1:.3f} \".format(a3, err_a3))\n", " #print(\" chi2 = {0:.3f} \".format(chi))\n", " \n", "\n", " # now with log\n", " \n", " # calculation\n", " \n", " \n", " #print(\" -------- Fit Parameters ---------- \")\n", " #print(\" b1 = {0:.3f} +/- {1:.3f} \".format(b1, err_b1))\n", " #print(\" b2 = {0:.3f} +/- {1:.3f} \".format(b2, err_b2))\n", " #print(\" b3 = {0:.3f} +/- {1:.3f} \".format(b3, err_b3))\n", " #print(\" chi2 = {0:.3f} \".format(chi2))\n", " \n", " #print(\"\\n -------- Estimated Parameters ---------- \")\n", " #print(\" b1 = {0:.3f} +/- {1:.3f} \".format(b1est, err_b1est))\n", " #print(\" b2 = {0:.3f} +/- {1:.3f} \".format(b2est, err_b2est))\n", " #print(\" b3 = {0:.3f} +/- {1:.3f} \".format(b3est, err_b3est))\n", "\n", " \n", " # draw graphs\n", "\n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Python approach" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.optimize" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def exercise_6_2_1():\n", " # read exercise_6_2_1.csv or type are values explicitly\n", "\n", " # fit a Gaussian function\n", "\n", " #print(\"Fit Result:\")\n", " #print(\"a1 = {0:.3f} +- {3:.3f}\\na2 = {1:.3f} +- {4:.3f}\\na3 = {2:.3f} +- {5:.3f}\".\n", " # format(a1, a2, a3, a1err, a2err, a3err))\n", " \n", " \n", " # now with log\n", " \n", " # calculation\n", " \n", "\n", " #print(\"Fit Result:\")\n", " #print(\"b1 = {0:.3f} +- {3:.3f}\\nb2 = {1:.3f} +- {4:.3f}\\nb3 = {2:.3f} +- {5:.3f}\".\n", " # format(*bopt, np.sqrt(bcov[0, 0]), np.sqrt(bcov[1, 1]), np.sqrt(bcov[2, 2])))\n", " #print(\"Transformed Result:\")\n", " #print(\"b1 = {0:.3f} +- {3:.3f}\\nb2 = {1:.3f} +- {4:.3f}\\nb3 = {2:.3f} +- {5:.3f}\".\n", " # format(b1, b2, b3, b1err, b2err, b3err))\n", " \n", " # plot\n", " \n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6.2.2 (obligatory)\n", "\n", "This exercise aims at constructing the error band around a function, $f(x)$, fitted to data points $(x,y)$ - i.e., the errors on the fitted parameters are transformed into errors on the value of the function at each value of $x$. (If you do not want to use `ROOT` to optimize the curve fit in this exercise, you can instead use the fitting algorithm provided by scipy, e.g., scipy.optimize.curve_fit - see Python approach)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us attack the problem in steps:\n", "\n", " - First, define a function for our problem: a straight line as $f(x) = a + bx$ with parameters a and b.\n", " \n", " - Next, consider `npoints=11` data points in the interval `[xof, xof+npoints-1]` with `xof=10` and y(x) = x. To simulate measurement errors, shift the data points in y-direction by a random shift, drawn from a Gaussian distribution with $\\sigma = 0.5$ and $\\mu = 0.0$.\n", "\n", " - Now fit the straight line defined above to your data points.\n", " \n", " Draw the data points and the fit result. Print the correlation coefficient of the rrors on the parameters $a$ and $b$.\n", " \n", " Try to give an intuitive argument why the two parameters $a$ (axis intercept) and $b$ (slope) are strongly correlated.\n", " \n", " - Next, construct the error band around the fit function: write a function `make_band(f, cov, withCorr=0)`, which takes the function $f$ and the covariance matrix, as input arguments and calculates for each value of x the error on $y$, $\\Delta _y (x)$. As a first approach, use the simple formula for error propagation, which, in this case, results in $\\Delta _y (x) ^2 = \\Delta _a ^2 + (x \\cdot \\Delta _b )^2$ if correlations are neglected.\n", " \n", " Draw the curves $y(x) \\pm \\Delta_y (x)$, i.e., the \"error band\" on top of your data and fit result. Does this look correct?\n", " \n", " - Derive the appropriate formula for error propagation taking into account the correlation of the errors. Re-calculate $\\Delta _y (x)$ and plot the corresponding error band. Compare with the above result obtained without correlations.\n", " \n", "To get a better idea of what happens here and what the effect of the correlations is, you might want to repeat the whole exercise setting `xof=-5.`, meaning now that the mean of the $x$-values of the data points is approximately at $0$. Alternatively, you may choose the parametrisation of the function as `y(x) = a' + b(x-(xof+(npoints-1)/2))`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ROOT Apprach" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ROOT import TF1, gRandom, TGraphErrors, TAxis, TVirtualFitter, TMatrixD, TH1, TCanvas, TFile, TLegend\n", "from array import array\n", "from math import fabs, sqrt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Evaluates error band with/without taking into account the correlation\n", "def make_band_r (f, cov, withCorr=0):\n", " # As first approach, use the simple formula for error propagation\n", " \n", " # Derive the appropriate formula taking into account the correlation of the errors\n", " \n", " return" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def drawUncertaintyBand(npoints=11, xof=10, sigma=0.5):\n", " # First: Define a function (e.g. a straight line)\n", " \n", "\n", " # Next: `npoints` data points in the interval [xof, xof+npoints-1] with y(x) = x.\n", " # Shift the data points in y-direction\n", "\n", " \n", " # Store the values in a TGraphErrors\n", " \n", "\n", " # Now fit the straight line defined above to your data points\n", "\n", " \n", " # Access the correlation matrix\n", " # Hint: you can use the code in the appendix (see bottom of this notebook!)\n", " \n", " \n", " # Define above the function `make_band` and draw the \"error band\" on top of your data and fit result\n", " # w/o correlations \n", "\n", " # w/ correlations\n", " \n", " \n", " return" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Python approach" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.optimize" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First: Define a function (e.g. a straight line)\n", "def linear(x, *p):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_band_p(function, popt, pcov, range, use_correlation=False):\n", " # As first approach, use the simple formula for error propagation\n", " \n", " # Derive the appropriate formula taking into account the correlation of the errors\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def exercise_6_2_2(xof=10):\n", " # Next: `npoints` data points in the interval [xof, xof+npoints-1] with y(x) = x.\n", " # Shift the data points in y-direction\n", "\n", " # Now fit the straight line defined above to your data points\n", "\n", " # Define above the function `make_band` and draw the \"error band\" on top of your data and fit result\n", " # w/o correlation\n", "\n", " # w/ correlation\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appendix: Accessing the covariance matrix\n", "\n", "#### Example code for pyROOT / JupyROOT:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "fitrp = TVirtualFitter.GetFitter()\n", "nPar = fitrp.GetNumberTotalParameters()\n", "\n", "covmat = TMatrixD(nPar, nPar, fitrp.GetCovarianceMatrix())\n", "\n", "covmat.Print()\n", "\n", "cormat = TMatrixD(covmat)\n", "for i in range(nPar):\n", " for j in range(nPar):\n", " cormat[i][j] = cormat[i][j] / (sqrt(covmat[i][i]) * sqrt(covmat[j][j]))\n", "\n", "cormat.Print()" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }