{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Search for New Physics at CMS**\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Introduction\n", "\n", "Despite its success, the Standard Model (SM) of particle physics is still a very incomplete description of nature. Gravity for example has so far not been included consistently into the SM framework. Furthermore, Dark Matter and Dark Energy, which together constitute approximately 95% of the total energy in the universe, lack candidates in the SM. Numerous extensions to the SM have been therefore proposed that address the mentioned shortcomings and often predict the existence of new particles. It is one of the primary missions of the physics programme at the Large Hadron Collider (LHC) to search for signals of any new physics processes [1].\n", "\n", "A prominent example is Supersymmetry (SUSY) [2, 3], which postulates the invariance of the Langrangian under so-called *supersymmetric transformations*, which relate bosonic and fermionic degrees of freedom. In consequence, new particles need to be introduced, that could serve as an extension to the particle spectrum of the Standard Model (SM). SUSY particles could provide excellent candidates for Dark Matter. Moreover, they could help to cancel the otherwise large corrections to the Higgs boson mass through their presence in virtual loops. This presence can also lead to common gauge coupling strengths when extrapolated to higher scales, that hints at some underlying unified theory. Another interesting consequence of SUSY models is the inclusion of a mechanism triggering electroweak symmetry breaking. All of these promising consequences of SUSY models have made the search for such signals one of the primary missions at the Large Hadron Collider. However, so far, no SUSY particles have been observed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Search for Supersymmetry at CMS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](Figure1.png \"Title\")\n", "
Figure 1: Illustration of the signal and background processes as well as the sensitive variables. The acronym LSP stands for lightest supersymmetric particle.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, there are usually SM processes with the same signature (SM background) as the sought-for new physics processes. Such a signature is illustrated in Figure 1 above. Therefore, one can only claim to observe a new process when the observed number of candidate events is significantly larger than the expected number of SM background events. Hence, precise knowledge of the expected SM background is crucial when searching for new-physics processes.\n", "\n", "We will now investigate the SM background processes using simulated (Monte Carlo, MC) events and compare them to the expectations for new-physics signal processes. They are available in the CSV data format for several different SM, as well as potential SUSY-signal processes. They are listed in Table 1 together with their cross sections and with the total number of simulated events, which one can one can access with the functions `Sample.CrossSection(ID)` and `Sample.TotalEvents(ID)` below. Beware that certain kinematic requirements have been applied at the generation of the MC samples, as stated in the comment field of Table 1. This is considered within the quoted cross sections, which are therefore termed fiducial cross sections. The recorded data is also available with the ID `0`." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "
\n", "Table 1: Simulated background and SUSY signal (LM6 and LM9) processes together with their fiducial cross sections $\\sigma$ at $\\sqrt s = 8 \\,\\mathrm{TeV}$ centre-of-mass energy and generated total number of events. The ‘comment’ column lists requirements applied at sample generation. The ID refers to an identifier used across this exercise.\n", "\n", "|ID|Proccess| $\\sigma$(pb) |MC Events|comment|\n", "|---|---|---|----|------|\n", "|0|data||\n", "|1|W($l \\nu$) + jets|30.08|6619654|for $H_{\\mathrm{T}} > 400 \\,\\mathrm{GeV}$ |\n", "|2|$t\\bar{t}$ + jets|127.06|1925324|only semi- and dileptonic decays|\n", "|3|Z($\\nu\\nu$) + jets|6.26|51637|for $H_{\\mathrm{T}} > 400 \\,\\mathrm{GeV}$|\n", "|4|QCD|8630.0|44443155|for $H_{\\mathrm{T}} > 500 \\,\\mathrm{GeV}$|\n", "|5|LM6|0.502|51282|$m_{0} = 85 \\,\\mathrm{GeV}$, $ m_{1/2} = 400 \\,\\mathrm{GeV}$: $m_{\\tilde{g}} > m_{\\tilde{q}}$ |\n", "|6|LM9|9.287|51282|$m_{0} = 1450 \\,\\mathrm{GeV}$, $ m_{1/2} = 175 \\,\\mathrm{GeV}$: $m_{\\tilde{q}} > m_{\\tilde{g}}$ |\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "\n", "The considered SUSY signal processes LM6 and LM9 are expected in a particular type of the MSSM where the SUSY breaking is mediated by gravity (termed: mSUGRA or cMSSM). The parameters $m_{0}$ and $m_{1/2}$ define the common mass of the scalars (sleptons, squarks, and Higgs bosons) and the gauginos and higgsinos at the high scale, respecitvely. The chosen parameter points LM6 and LM9 refer to ‘low-mass’ benchmark scenarios, which represent phase-space regions favoured by the pre-LHC data [4]. In case of the LM9, the squark masses at the electroweak scale are larger than the gluino masses. Therefore, signal events are dominated by gluino pair-production. Since gluinos decay via intermediate squarks into quarks, one expects a particularly large number of jets and large $H_{\\mathrm{T}}$ in this case. In case of LM6 scenario, the squark pair-production dominates, and since squarks can decay directly to quarks, there are typically fewer jets but large $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$. \n", "Accordingly, the sensitive variables of the analysis are: \n", "* $N_{\\mathrm{jets}}$, the number of jets;\n", "* $H_{\\mathrm{T}}$, the sum of the scalar $p_{\\mathrm{T}}$ of all jets, a measure of the visible energy; and\n", "* $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$, the magnitude of the negative vectorial sum of the jet $\\vec{p_{\\mathrm{T}}}$, a measure of\n", "the missing transverse momentum, i.e. the energy of the undetected particles (in the code sometimes referred to as `MHt`).\n", "\n", "In addition to the requirements for MC generation, the following preselection has been applied when creating the CSV-files (Note that the number of events quoted in Table 1 refers to the total number of generated events *before* preseletion!):\n", "* $H_{\\mathrm{T}} > 300 \\,\\mathrm{GeV}$ and $N_{\\mathrm{jets}} \\geq 2$; and in addition\n", "* $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}} > 100 \\,\\mathrm{GeV}$ in case of QCD.\n", "* No well-reconstructed and isolated lepton (electron or muon) with $p_{\\mathrm{T}} > 10 \\,\\mathrm{GeV}$ and $|\\eta| < 2.4$ are present (lepton veto)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 1**\n", " \n", "Can you imagine why the generator-level and preselection requirements have been applied?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started\n", "\n", "Before starting with the exercise some useful libraries need to be imported. The needed data for signal and background distributions as well as distributions from real pp-collisions are provided in the CSV format, which can be easily accessed using predefined functions described in Table 2. The `pandas` library saves the data in so called dataframes with columns representing different variables and rows representing the events. For filtering the data and computing the variables we use the Scikit-HEP Vector library, which was introduced in the second exercise. Finally, the `matplotlib` library is used to plot the final signal and background distributions using pre-defined functions in Sample.py and Utils.py files . " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np # manipulations and operations on arrays\n", "import pandas as pd # easy access to data structures\n", "import matplotlib.pyplot as plt # plotting\n", "from matplotlib.lines import Line2D\n", "from scipy.stats import poisson \n", "from scipy.stats import norm\n", "import ast # useful for converting strings to lists \n", "from Sample import * # provides easy access to data \n", "import utils\n", "import vector as vec" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ " Table 2: Overview of functions in provided by `Sample` class \n", "\n", "|Function|Input (type)|Output (type)|Explanation|\n", "|---|---|---|--|\n", "|toString(x)| ID(int) | Full File Name(str) |Compact sample name without spaces, i.e. suitable for file names etc.|\n", "|fileNameFullSample(x)| ID(int)|Sample Name without Spaces(str) |Ntuple file name|\n", "|label(x)|ID(int)| Name of Data(str) |Full-blown sample name including TLatex commands for plot legends etc.|\n", "|color(x)|ID(int)|Color(str)|A color associated with each sample, e.g. for plotting|\n", "|CrossSection(x)|ID(int)|Cross section(float)|returns cross section of the sample as in Table 1|\n", "|TotalEvents(x)|ID(int)|Events(int)|returns total number of events in sample as in Table 1|\n", "|MaxNJets(x)|df or List(df)|Maximal number of jets(int)|returns the maximal number of jets for a dataframe or a list of dataframes|\n", "|JetNames(x)|df or List(df)|Jet names(list)|returns the names of jets as a list of strings for a dataframe or a list of dataframes|\n", "|Get_Jet(x,y)|df,Jet_number(int)|jet(df)|returns a specific jet from a specific sample|\n", "|nansubtract(x,y)|np.arrays|np.array|subtracts 2 np.arrays while ignoring NaN values|\n", "|Compare(Data,Backgorund,Signal,Bins)|(df, list of df, list of df, list of bins)|plot|Plots a comparison between signal and background distributions and measured data |\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Reading data and setting the dataframes\n", "To read the data you will need to set the path to the location of the data simply by defining the variable `Sample.path`. After trying out the different functions provided in `Sample.py` you can read all the samples and save them as `pandas` dataframes. Samples are saved in CSV-format with columns corresponding to different variables and rows corresponding to events. Variables $p_{T}$, $\\eta$ and $\\phi$ are also saved as columns for each seperate jet." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the data and create the dataframes\n", "\n", "# Set the path to the data \n", "Sample.path = \"/data_share/tp2_bsm/Exercise03/\" \n", "\n", "Files = [] # List containing the dataframes \n", "\n", "for ID in [0,1,2,3,4,5,6]:\n", " \n", " print('Reading',Sample.toString(ID))\n", " print(\"Path:\",Sample.fileNameFullSample(ID))\n", " df = pd.read_csv(Sample.fileNameFullSample(ID), header = [0,1]) # read the sample (two-level column header) and save it as a dataframe \n", " Files.append(df) # add the dataframe to the list \n", " print('Number of Events:', len(df))\n", " print('-'*20)\n", " \n", "print('Done')" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Testing the accessor\n", "\n", "In order to understand how quantities are accessed, it is good to test the four-vector accessor and other useful functions. The four-vector accessor is used similarly to the previous exercises. The only difference is that one should define the sample ID (dataframe ID) in addition to the jet name (column name) when accessing multiple samples. The four-vector quantities can be accessed by appending `.pt`, `.eta`, `.phi` or `.E` to the four-vector object. As output, an array of the desired quantity for all events is returned. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
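Independently of the provided accessor, the standalone behaviour of the vector package can be tried out as follows (a minimal sketch with made-up values):\n",
"\n",
"```python\n",
"import vector\n",
"\n",
"# a single four-vector built from pt, eta, phi and E\n",
"jet = vector.obj(pt=120.0, eta=1.1, phi=0.5, E=210.0)\n",
"print(jet.pt, jet.eta, jet.phi, jet.E)  # stored components\n",
"print(jet.px, jet.py)                   # derived quantities work as well\n",
"```\n",
"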
\n", " \n", "**Optional:**\n", " \n", "Try to access different quantities from different jets and samples before you continue to the other tasks " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test the functions \n", "print('Maximum number of jets in any event for each sample :' ,Sample.MaxNJets(Files)) # Maximal Number of Jets in each Sample \n", "print('\\n')\n", "print('List of jet column headers',Sample.JetNames(Files[0])) # List of of Jets as strings in the Data (Sample(0))\n", "print('\\n')\n", "# accessing the quantities in the 4 vectors based on sample-ID and Jet name \n", "print('Transverse momentum of the hardest jet in every event in data')\n", "print('\\n')\n", "print(Files[0]['Jet 0'].v4.pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample definition and composition\n", "\n", "In this section, we will learn how to apply the event selection criteria and calculate the sensitive variables mentioned above. We will investigate the data and compare it to the expected properties of the SM backgrounds. We will also get a feeling of how hypothetical SUSY signals look like.\n", "\n", "### Event selection\n", "Candidate events are required to pass a *baseline selection* defined as follows:\n", "\n", "1. $N_{\\mathrm{jets}} \\geq 3$, where only jets with $p_{\\mathrm{T}} > 50 \\,\\mathrm{GeV}$ and $|\\eta| < 2.5$ are considered;\n", "\n", "2. $H_{\\mathrm{T}} > 500 \\,\\mathrm{GeV}$, where again only jets with $p_{\\mathrm{T}} > 50 \\,\\mathrm{GeV}$ and $|\\eta| < 2.5$ are considered;\n", "\n", "3. $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}} > 200 \\,\\mathrm{GeV}$, where jets with $p_{\\mathrm{T}} > 30 \\,\\mathrm{GeV}$ and $|\\eta| < 5.0$ are considered - please note the difference in jet selection; and\n", "\n", "4. $\\Delta \\phi(\\mathrm{jet}_{1}, \\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}} ) > 0.5$,\n", "\n", "5. $\\Delta \\phi(\\mathrm{jet}_{2}, \\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}} ) > 0.5$ and\n", "\n", "6. $\\Delta \\phi(\\mathrm{jet}_{3}, \\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}}) > 0.3$, where $\\mathrm{jet}_{n}$ refers to the $n$-th jet ordered in $p_{\\mathrm{T}}$ ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 2**\n", " \n", " Why does one expect the jets in signal events to have relatively large $p_{\\mathrm{T}}$? Likewise, why are we looking for events with relatively large $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 3** \n", " \n", "Which SM processes result in the same final state as our signal process? Given all these background processes, can you motivate the baseline selection? Which of the background processes do you expect to dominate at large $N_{\\mathrm{jets}}$, at large $H_{\\mathrm{T}}$, and which at large $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$? Don’t do any calculation, just use your intuition." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We still want to better understand the motivation for the baseline selection cuts.\n", "Therefore, we will compare the relevant kinematic distributions for the different\n", "processes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialise dataframes for sensitive variables\n", "\n", "\n", "The sensitive variables ($N_{\\mathrm{jets}}$, $H_{\\mathrm{T}}$, and $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$) for each sample are saved in a separate in a separate dataframe. All dataframes are saved in the list called Sensitive_Variables. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List with an empty dataframe for each sample \n", "Sensitive_Variables = [pd.DataFrame(columns=['NJets', 'Ht','MHt','DeltaPhi1','DeltaPhi2','DeltaPhi3','Weights']) \n", " for i in range(len(Files))] \n", "print(Sensitive_Variables[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Event selection, part I\n", "\n", "In order to compute the number of jets ($N_{\\mathrm{jets}}$) and the traverse energy ($H_{\\mathrm{T}}$), one needs first to select the jets with $p_{\\mathrm{T}} > 50 \\,\\mathrm{GeV}$ and $|\\eta| < 2.5$. This is achieved using the particle-based filter introduced in exercises of the previous Particle Physics course (TP1). First, a mask representing the selection criteria is generated for each sample. This mask is then applied to the dataframes, when computing the sensitive variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "**Task 1:**\n", " \n", "Please implement a mask for the conditions $p_{\\mathrm{T}} > 50 \\,\\mathrm{GeV}$ and $|\\eta| < 2.5$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate and save masks for each sample \n", "\n", "def make_particle_filter_mask(df, mask):\n", " return (mask).reindex(df.columns, level=0, axis=1).fillna(True)\n", "\n", "Selection1 = [] # List of Masks(Dataframes) for the Samples\n", "\n", "\n", "for ID in [0,1,2,3,4,5,6]: \n", " \n", " # Get the sample jets\n", " sample = Files[ID]\n", " sample_Jets = Sample.JetNames(sample) \n", " \n", " # Initialise the mask \n", " mask = pd.DataFrame(np.ones((len(sample),len(sample.columns.levels[0]))), dtype = bool, columns = sample.columns.levels[0])\n", "\n", " # pt condition \n", " mask.loc[:,sample_Jets] &= (sample[sample_Jets].v4.pt > 50 ).droplevel(1, axis = 1)\n", " \n", " # eta condition\n", " ### YOUR CODE HERE\n", " \n", " # reindex\n", " mask = make_particle_filter_mask(sample, mask)\n", " \n", " # add mask to list \n", " Selection1.append(mask)\n", " print(f'Mask {Sample.toString(ID)} Generated')\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate $N_{\\mathrm{jets}}$ and $H_{\\mathrm{T}}$\n", "\n", "After applying the masks, one can compute the number of jets and the traverse energy. The final distributions are visualized using the pre-defined function `plot_quantities`, which was introduced in exercises from the previous course. For those who are interested in how this function works, feel free to look at the code in `utils.py`. The basic idea is a simple variable arrangement of histogram plots of physics object quantities by utilizing combinations of zip, itertools.product functions and other minor features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "**Task 2:**\n", " \n", "Compute the transverse energy $H_{\\mathrm{T}}$. \n", " \n", "Hint: First use the accessor to access the transverse momentum of the four-vector object and use the useful [sum function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) of Pandas to sum columns row-wise. At the end, you can transfer the resulting pandas series to a NumPy array simply by adding `.values` to the series. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Calculate NJets and Ht\n", "for ID in [0,1,2,3,4,5,6]:\n", " \n", " #1- Calculate the number of selected jets NJet\n", " \n", " # converts true/false values to 1/0, which can be summed \n", " mask = Selection1[ID]*1 \n", " \n", " # subtract 2 to remove non-jet columns (weight and Jetsnum ) and divide by 4 since each jet has 4 quantities (columns) \n", " Sel_Njets = (mask.sum(axis=1).values-2)/4 \n", " \n", " # save the values in the dataframe\n", " Sensitive_Variables[ID]['NJets'] = Sel_Njets \n", " \n", " #2- Calculate the transversal Energy Ht\n", " \n", " # apply mask \n", " masked_df = Files[ID].mask(~Selection1[ID]) # mask() deletes Value if condition is True, therefore mask was inversed \n", " \n", " # compute Ht\n", " ### YOUR CODE HERE\n", " # Ht = masked_df[Sample.JetNames(Files[ID])] ...\n", " \n", " Sensitive_Variables[ID]['Ht'] = Ht\n", " \n", " # plot the distributions using the pre-defined function from utils.py \n", " \n", " # FOR QCD need weight to properly treat events from different HT bins at generation\n", " if ID == 4: \n", " Weights = Files[ID]['Weight'].values\n", " Weights = np.squeeze(Weights) # change the shape from (N,1) to (N,), does not effect data only changes shape\n", " else:\n", " Weights = None\n", " \n", " utils.plot_quantities(df=Sensitive_Variables[ID], column=['NJets','Ht'], quantity=None,\n", " bins=[10,30], hist_range = [(-0.5,9.5),(0.1, 3000)], density=False, weights=Weights,\n", " label=Sample.toString(ID), unit=['', 'GeV'], yscale = ['linear','log'], suptitle=Sample.toString(ID), color=Sample.color(ID))\n", " print('\\n'*3)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Event selection, part II\n", "\n", "In order to compute the missing transverse energy $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ and $\\Delta \\phi$ defined earlier, one needs first to select the jets with $p_{\\mathrm{T}} > 30 \\,\\mathrm{GeV}$ and $|\\eta| < 5$. First, a mask representing the selection criteria is generated for each sample. This mask is then applied to the dataframes, when computing the sensitive variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "**Task 3:**\n", " \n", "Please implement the conditions $|\\eta| < 5$ and $p_{\\mathrm{T}} > 30 \\,\\mathrm{GeV}$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "Selection2 = [] # List of Masks(Dataframes) for the Samples \n", "\n", "for ID in [0,1,2,3,4,5,6]: \n", " \n", " # get sample jets\n", " sample = Files[ID]\n", " sample_Jets = Sample.JetNames(sample) \n", " \n", "\n", " mask = pd.DataFrame(np.ones((len(sample),len(sample.columns.levels[0]))), dtype = bool, columns = sample.columns.levels[0])\n", " \n", " # condition on Pt\n", " ### YOUR CODE HERE\n", " \n", " # condition on Eta\n", " ### YOUR CODE HERE\n", " \n", " mask = make_particle_filter_mask(sample, mask)\n", " \n", " # save the selection \n", " Selection2.append(mask)\n", " print(f'Mask {Sample.toString(ID)} Generated')\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ and $ \\Delta \\phi (\\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}}, \\mathrm{jet}_{i}) $\n", "\n", "After applying the masks, one can compute the missing transverse energy $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ and $ \\Delta \\phi (\\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}}, \\mathrm{jet}_{i}) $. The final distributions are visualized using the pre-defined function `plot_quantities`. For those who are interested in how this function works, feel free to look at the code in `utils.py`. The basic idea is a simple variable arrangement of histogram plots of physics object quantities by utilizing combinations of zip, itertools.product functions and other minor features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "**Task 4**\n", " \n", "Compute the missing transverse energy and the angle difference between the missing transverse energy and each of the three hardest jets $ | \\Delta \\phi (\\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}}, \\mathrm{jet}_{i})| $: \n", "\n", "1. Find the $x$- and $y$-components of the missing transverse energy using $\\phi$ angle. \n", "2. Initialize a vector object for $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ using the $x$- and $y$-components.\n", "3. Compute the missing transverse energy from the $x$- and $y$-components.\n", "4. Find the angle difference $| \\Delta \\phi (\\vec{H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}}, \\mathrm{jet}_{i})| $ using well known function from the scikit-HEP vector package. \n", " \n", "For reference, please refer to the documentation of the [scikit-HEP vector package](https://github.com/scikit-hep/vector). You might find the function `vector.obj.deltaphi(vector.obj)` especially useful. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for ID in [0,1,2,3,4,5,6]: \n", " \n", " #apply mask to the sample\n", " masked_df = Files[ID].mask(~Selection2[ID])\n", " #get a list of jets as strings \n", " jets = Sample.JetNames(Files[ID])\n", " \n", " \n", " #Calculate MHt components with selection \n", " \n", " ### YOUR CODE HERE\n", " \n", " # save MHT\n", " Sensitive_Variables[ID]['MHt'] = MHt\n", " \n", " # Save MHt as a vector object using the x- and y-components \n", " MHt_vec = vec.arr(dict(x=MHtx, y=MHty)) \n", " \n", " #Calculate Delta Phi \n", " \n", " # Calculate the angle of missing transverse MHt (MHtPhi) using the vector library \n", " MHtPhi = MHt_vec.phi\n", " \n", " ### YOUR CODE HERE\n", " \n", " # save the results \n", " Sensitive_Variables[ID]['DeltaPhi1'] = DeltaPhi1\n", " Sensitive_Variables[ID]['DeltaPhi2'] = DeltaPhi2 \n", " Sensitive_Variables[ID]['DeltaPhi3'] = DeltaPhi3 \n", "\n", " # FOR QCD need weight to properly treat events from different HT bins at generation\n", " if ID == 4:\n", " Weights = Files[ID]['Weight'].values\n", " Weights = np.squeeze(Weights) # change the shape from (N,1) to (N,), does not effect data only changes shape\n", " else:\n", " Weights = None\n", " \n", " utils.plot_quantities(df=Sensitive_Variables[ID]\n", " ,column=['NJets','Ht','MHt','DeltaPhi1','DeltaPhi2','DeltaPhi3']\n", " ,quantity=None\n", " ,bins=[10,30,24,24,24,24]\n", " ,hist_range = [(-0.5,9.5),(0.1, 3000),(0.1, 1000),(0, 3.2),(0, 3.2),(0, 3.2)]\n", " ,density=False\n", " ,weights=Weights\n", " ,label=Sample.toString(ID)\n", " ,unit=['', 'GeV', 'GeV','Radian', 'Radian', 'Radian']\n", " ,yscale = ['linear','log','log','linear','linear','linear']\n", " ,suptitle=Sample.toString(ID)\n", " ,color=Sample.color(ID))\n", " print('\\n'*3)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparison of distributions \n", "After computing the sensitive variables for **all samples**, we will now compare the shapes of the different kinematic distributions of the different processes. In the following cell the $N_{\\mathrm{jets}}$ and $H_{\\mathrm{T}}$ distributions are plotted. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# plot sample comparision \n", "Labels = [Sample.toString(ID) for ID in range(7)]\n", "Colors = [Sample.color(ID) for ID in range(7)]\n", "\n", "utils.plot_quantities(df=Sensitive_Variables, column=['NJets','Ht','MHt'], quantity=None,\n", " bins=[10,30,24], hist_range = [(-0.5,9.5),(0.1, 3000),(0.1, 1000)], density=True, weights=Weights,\n", " label=Labels, unit=['','GeV','GeV'], yscale = ['linear','log','log'], suptitle=Sample.toString(ID), color=Colors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 4**\n", " \n", "How do the distributions for the different processes differ? In particular:\n", "* Explain the differences of the $N_{\\mathrm{jets}}$ and $H_{\\mathrm{T}}$ distributions of QCD and Z($\\nu \\nu $) + jets events.\n", "* Explain the different $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ distributions of the QCD and the other processes. Hint: what is the primary source of $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ in the Z($\\nu \\nu $) + jets, W($ l \\nu$) + jets and $t\\bar{t}$ + jets events? Where does $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ stem from in QCD events?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Expected event yields\n", "We will now use the simulated events to estimate the number of SM-background events after the full baseline event selection. \n", "The yields dataframe initialized below stores the number of simulated events passing the selection. The first column contains the number\n", "of events after the baseline selection. The further columns store the number of events after tighter requirements (on top of the baseline selection) on $H_{\\mathrm{T}}$, $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$, and $N_{\\mathrm{jets}} $. Have a look at the code to identify the additional requirements.\n", "
\n", "\n", "By comparing the number of events after the baseline selection with the total number of events given in Table 1, we can compute the total selection efficiency:\n", "\n", "
$\\epsilon = \\dfrac{\\text{number of MC events after selection requirements}}{\\text{total number of MC events}} $

\n", " \n", "Together with the cross section and the integrated luminosity, we can then compute the expected number of events in a given data sample for a process $i$: \n", "
$N_{\\mathrm{exp},i} = \\epsilon_i \\cdot \\sigma_i \\cdot L $
\n", " \n", "We do not have to compute the cross section normalisation ourselves every time. Instead, we can use the weights already stored in the dataframes, which include the cross section normalisation. In addition, the weights contain a correction to properly describe the impact of additional proton-proton collisions in the same event (pile-up)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# initlize Yields dataframe \n", "\n", "Yields = pd.DataFrame(columns=['baseline', 'Njets<7','Njets>=7','HT<1700','HT>1700','MHT<600','MHT>600'],\n", " index=['Data', 'WJets', 'TTJets','ZJets','QCD','LM6','LM9']) \n", "print(Yields) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**After** computing the $N_{\\mathrm{jets}}$, $H_{\\mathrm{T}}$ and $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ variables in the previous cells, we can apply the full baseline selection and plot the final distributions. We will also determine the expected event yields for the background and signal processes. \n", "We do not have to compute the cross section normalisation ourselves every time. Instead, we can use the weight already stored in CSV files. It is saved in\n", "the column 'Weight' and includes the cross section normalisation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Task 5:**\n", " \n", "Complete the implementation of the baseline selection: $N_{\\mathrm{jets}} \\geq 3$, $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}} > 200 \\,\\mathrm{GeV}$ and $H_{\\mathrm{T}} > 500 \\,\\mathrm{GeV}$ in the cell below. \n", "\n", "**Task 6:**\n", " \n", "Determine the yields after applying the $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}} > 600 \\,\\mathrm{GeV}$ and $H_{\\mathrm{T}} > 1700 \\,\\mathrm{GeV}$ requirements. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "Final_Variables = [] # List of Final_Variables as Dataframes after the baseline selection\n", "\n", "# loop over the samples \n", "for ID in [0,1,2,3,4,5,6]: \n", " \n", " # add the weights to the sensitive variables dataframe \n", " Weights = Files[ID]['Weight'].values\n", " Weights = np.squeeze(Weights) # change the shape from (N,1) to (N,), does not effect data only changes shape\n", " Sensitive_Variables[ID]['Weights'] = Weights \n", " \n", " # complete the Baseline Selection\n", " Baseline_Selection = ((Sensitive_Variables[ID]['NJets']>=3)\n", " ### YOUR CODE HERE\n", " )\n", " \n", " # applying Baseline_Selection\n", " Variables_after_Cut = Sensitive_Variables[ID].mask(~ Baseline_Selection) \n", " Yields.iloc[ID,0] = np.nansum(Variables_after_Cut['Weights'].values) # np.nansum: sums entries, while ignoring nan values \n", " Final_Variables.append(Variables_after_Cut)\n", " \n", " # applying selection criteria 1 \n", " Variables_after_Cut = Sensitive_Variables[ID].mask(~(Baseline_Selection & (Sensitive_Variables[ID]['NJets']< 7)))\n", " Yields.iloc[ID,1] = np.nansum(Variables_after_Cut['Weights'].values)\n", " \n", " # applying selection criteria 2 \n", " Variables_after_Cut = Sensitive_Variables[ID].mask(~(Baseline_Selection & (Sensitive_Variables[ID]['NJets']>= 7)))\n", " Yields.iloc[ID,2] = np.nansum(Variables_after_Cut['Weights'].values)\n", " \n", " # applying selection criteria 3 \n", " Variables_after_Cut = Sensitive_Variables[ID].mask(~(Baseline_Selection & (Sensitive_Variables[ID]['Ht']< 1700)))\n", " Yields.iloc[ID,3] = np.nansum(Variables_after_Cut['Weights'].values)\n", " \n", " # applying selection criteria 4 \n", " ### YOUR CODE HERE\n", " \n", " # applying selection criteria 5 \n", " Variables_after_Cut = Sensitive_Variables[ID].mask(~(Baseline_Selection & (Sensitive_Variables[ID]['MHt']< 600)))\n", " Yields.iloc[ID,5] = np.nansum(Variables_after_Cut['Weights'].values)\n", " \n", " # applying selection criteria 6\n", " ### YOUR CODE HERE\n", " \n", " # Plotting the results \n", " utils.plot_quantities(df=Final_Variables[ID]\n", " ,column=['NJets','Ht','MHt','DeltaPhi1','DeltaPhi2','DeltaPhi3']\n", " ,quantity=None\n", " ,bins=[9,25,16,24,24,24]\n", " ,hist_range = [(2.5,11.5),(500, 3000),(200, 1000),(0,3.2),(0,3.2),(0,3.2)]\n", " ,density=False\n", " ,weights=Final_Variables[ID]['Weights']\n", " ,label=Sample.toString(ID)\n", " ,unit=['', 'GeV', 'GeV','Radian', 'Radian', 'Radian']\n", " ,yscale = ['linear','log','log','linear','linear','linear']\n", " ,suptitle=Sample.toString(ID)\n", " ,color=Sample.color(ID))\n", " \n", " print('\\n'*3)\n", " \n", "print(Yields)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 5**\n", " \n", "Discuss the result. Which background dominates in which phase-space region? Does this match your initial expectation (Question 3)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We will now compare the $N_{\\mathrm{jets}}$, $H_{\\mathrm{T}}$, and $H\\kern-0.575em\\raise-0.375ex{\\large/}_{\\mathrm{T}}$ distributions in real proton-proton collision data with the sum of the background distributions obtained from simulation. Both potential signals are also shown, but not added to the stack of background processes. This is all done using the pre-defined compare plotting function. Please refer to `Sample.py` if you are interested in the exact implantation of the plotting function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Plot_Bins = [np.arange(2.5,11.5,1),range(500,3100,100),range(200,1000,50)]\n", "Data = Final_Variables[0]\n", "Bkg = [Final_Variables[i] for i in [1,2,3,4]]\n", "Sgl = [Final_Variables[i] for i in [5,6]]\n", "Labels = [Sample.toString(i) for i in range(len(Final_Variables))] # note that Labels = [Data_label, Background_labels, Signal_labels] \n", "Colors = [Sample.color(i) for i in range(len(Final_Variables))] # # note that colors = [Data_label, Background_labels, Signal_labels] \n", "\n", "\n", "Sample.Compare(Bkg,Sgl,Data,Bins = Plot_Bins,Labels = Labels, Colors = Colors )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 6**\n", " \n", "Do you observe any deviations of the data from the SM background expectation? What can you say about the existence of any new physics processes?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Question 7**\n", " \n", "Which uncertainty is represented by the uncertainty bars? Are there any further uncertainties to be considered?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "[1] The CMS Collaboration, “Search for new physics in the multijet and missing transverse momentum final state in proton-proton collisions at $\\sqrt{s}= 8$ TeV”, JHEP 06 (2014) 055, arXiv:1402.4770. doi:10.1007/JHEP06(2014)055.\n", "\n", "[2] H. Baer and X. Tata, “Weak Scale Supersymmetry: From Superfields to Scattering Events”. Cambridge University Press, 2006. ISBN 0-521-85786-4.\n", "\n", "[3] S. P. Martin, “A Supersymmetry primer”, arXiv:hep-ph/9709356.\n", "\n", "[4] The CMS Collaboration, “CMS technical design report, volume II: Physics performance”, J. Phys. G34 (2007) 995. doi:10.1088/0954-3899/34/6/S01.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }