diff --git a/docs/analysis/backgrounds.md b/docs/analysis/backgrounds.md new file mode 100644 index 00000000..69a581bc --- /dev/null +++ b/docs/analysis/backgrounds.md @@ -0,0 +1,142 @@ +# Background Modelling Techniques + +!!! Warning + This page is under construction + +Accurate modeling of SM background processes is essential to most searches and measurements in high energy physics. +The dominant background processes depend strongly on the selected objects, particularly leptons, missing +transverse momentum, and b quark jets or jets from boosted particles. Background estimation strategies +are always tailored to an individual analysis, typically as a variation of one or more of the following +common methods. + +## Simulation + +Many SM processes are simulated at NLO, which provides a strong basis for background +estimation. For processes such as W/Z+jets production that are often simulated at LO, *k*-factor +calculations allow one to weight simulated events to reproduce distributions predicted at NLO. +In final states with charged leptons, for which QCD multijet production is unlikely to be a significant background, simulation is a common choice. +Additionally, the majority of searches utilize simulation to model the signal process under consideration. + +Simulated events are weighted so that the efficiencies of certain selections +in simulation match those observed in data. These corrections are referred to as "scale factors". +Common scale factors in searches at the CMS experiment correct +for differences in: + +- the number of pileup interactions +- the efficiencies of trigger selections +- the efficiencies of charged lepton identification and isolation selection criteria +- the efficiencies of various jet identification selection criteria, such as heavy-flavor tagging.
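Scale factors of this kind are typically applied as multiplicative event weights. The sketch below (a hypothetical helper with invented values, not an official CMS recipe) illustrates how per-event and per-object factors combine into a single weight:

```python
# Illustrative sketch only: the total weight of a simulated event is the
# product of its generator weight and all applicable scale factors.
# Function name and numerical values are hypothetical.

def event_weight(gen_weight, pileup_sf, trigger_sf, lepton_sfs, btag_sfs):
    """Combine a generator weight with multiplicative scale factors."""
    weight = gen_weight * pileup_sf * trigger_sf
    for sf in lepton_sfs:   # one factor per selected lepton
        weight *= sf
    for sf in btag_sfs:     # one factor per b-tagged jet
        weight *= sf
    return weight

# Example event with two leptons and one b-tagged jet.
w = event_weight(1.0, 0.98, 0.95, [1.02, 0.99], [0.97])
```

In practice each factor would be looked up from centrally provided correction tables as a function of the object kinematics, and varied up and down to propagate its uncertainty.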
+ +A detailed set of [corrections for the jet energy scale and resolution](systematics/objectsuncertain/jetmetuncertain.md) is computed for simulated events so that the +response of the jet reconstruction algorithms is consistent between observed data and simulation. Searches may also develop +corrections for observed mismodeling of data by simulation in certain distributions of interest. +A common correction of this type is to reweight the reconstructed top quark \(p_{\mathrm{T}}\) spectrum, since the NLO top quark pair simulations +tend to overpredict the rate of high-\(p_{\mathrm{T}}\) top quark pairs. Each correction applied to simulation carries an uncertainty +that should be taken into account in the statistical methods of signal extraction. + +## Tight / loose or "Matrix" methods + +Searches that select multiple charged leptons often have considerable background +from events in which *nonprompt* leptons are selected. Nonprompt leptons are usually charged leptons that arise from +sources other than the hard scatter or decays of massive particles produced in the hard scatter. + +One method to estimate contributions from these events is to measure how often known prompt leptons, typically from the +decay of Z bosons, and known nonprompt leptons, typically from a sample of QCD multijet events, pass a certain set of +lepton selection criteria. A Z boson sample is created in data by selecting events with two same-flavor opposite-sign +leptons whose invariant mass lies very close to the Z boson mass. One lepton, known as the *tag*, is selected using very high-purity +selection criteria, giving confidence that the other *probe* lepton is indeed a prompt lepton. The efficiency for the +probe lepton to pass any criteria of interest can then be measured in this sample (learn more about this calculation +on the [tag and probe page](selection/idefficiencystudy/tagandprobe.md)).
In the context of this background +estimation method, the efficiency of the analysis selection criteria is referred to as the *prompt rate*, \(p\). + +A QCD multijet sample can be created by selecting events that pass a low-momentum, low-purity, single-lepton trigger, but otherwise +exhibit no strong signs of the lepton arising from an SM boson decay. The rate at which these leptons pass the analysis selection +criteria can be measured and is referred to as the *nonprompt* rate (or colloquially, *fake* rate), \(f\). Both of these rates +describe how often either prompt or nonprompt leptons that pass some baseline *loose* selection also pass the *tight* +selection criteria used in the analysis. + +For searches that probe final states with two charged leptons, the probabilities for any prompt or nonprompt lepton to enter the sample must be considered +together to develop a background distribution. +The number of events with leptons passing the tight and/or loose criteria may be observed, in particular the number of events with two tight leptons, \(N_{tt}\); one tight and one loose lepton, \(N_{tl}\); and two loose leptons, \(N_{ll}\). +The prompt and nonprompt rates may then be used to convert those observations into numbers of events with two prompt leptons, \(N_{pp}\); one prompt and one nonprompt lepton, \(N_{pf}\); and two nonprompt leptons, \(N_{ff}\). + +\[ +\begin{pmatrix} + N_{tt} \\ + N_{tl} \\ + N_{ll} +\end{pmatrix} = \left( \begin{array}{ccc} + p^2 & pf & f^2 \\ + 2p(1-p) & f(1-p) + p(1-f) & 2f(1-f) \\ + (1-p)^2 & (1-p)(1-f) & (1-f)^2 +\end{array} \right) +\begin{pmatrix} + N_{pp}\\ + N_{pf}\\ + N_{ff} +\end{pmatrix} +\] + +A matrix inversion provides formulas to calculate \(N_{pf}\) and \(N_{ff}\) from the observed number of events with leptons of +varying quality. For a search selecting two tight leptons, the background from events with nonprompt leptons will be given +by \(N_{\mathrm{bkg}} = pfN_{pf} + f^2N_{ff}\).
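The matrix inversion above can be carried out numerically. This minimal sketch uses NumPy to solve the linear system for hypothetical rates and event counts (all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical prompt and nonprompt rates, as would be measured with
# tag-and-probe and a QCD-enriched sample.
p, f = 0.90, 0.20

# Probability matrix mapping (N_pp, N_pf, N_ff) onto (N_tt, N_tl, N_ll),
# matching the equation above.
M = np.array([
    [p * p,           p * f,                     f * f],
    [2 * p * (1 - p), f * (1 - p) + p * (1 - f), 2 * f * (1 - f)],
    [(1 - p) ** 2,    (1 - p) * (1 - f),         (1 - f) ** 2],
])

# Hypothetical observed counts: two tight, tight + loose, two loose.
N_obs = np.array([1000.0, 400.0, 50.0])

# Invert the relation to recover the prompt/nonprompt composition.
N_pp, N_pf, N_ff = np.linalg.solve(M, N_obs)

# Nonprompt background entering the two-tight-lepton signal region.
N_bkg = p * f * N_pf + f * f * N_ff
```

In a real analysis \(p\) and \(f\) depend on the lepton kinematics, so the inversion is applied event by event or in bins rather than with single global rates.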
This method can be extended to searches targeting final states with more than two charged leptons by expanding the probability matrix. + +A good reference for this method, built on earlier uses within CMS, is the [2022 doctoral thesis of Wing Yan Wong](http://cds.cern.ch/record/2808538). + +## Transfer factors + +In many searches, one important selection criterion is the primary dividing line between +a background-dominated control region (CR) and a region with good signal sensitivity, called the signal region (SR). +A *transfer factor* or *transfer function* +that describes the efficiency of this principal selection criterion can be derived and applied to the observed data in the +CR in order to estimate the background present in the SR. + +### Alpha-ratio method + +The transfer function can be computed in multiple ways. Some searches use simulation for this purpose, in which +case the method is often called the *alpha-ratio method*. The number of background events in the SR, \(N_{\mathrm{SR}}^{\mathrm{bkg}}\), is calculated +as: + +\[ +N_{\mathrm{SR}}^{\mathrm{bkg}} = N_{\mathrm{CR}}^{\mathrm{data}} \times \frac{N_{\mathrm{SR}}^{\mathrm{sim}}}{N_{\mathrm{CR}}^{\mathrm{sim}}}, +\] + +where \(N_{\mathrm{CR}}^{\mathrm{data}}\) is the number of observed collision events in the CR, \(N_{\mathrm{SR}}^{\mathrm{sim}}\) is the number of simulated events in the SR, +and \(N_{\mathrm{CR}}^{\mathrm{sim}}\) is the number of simulated events in the CR. +The transfer factor from simulation can be computed in any bin of an observable, so the shape as well as the rate of +background in the SR may be obtained. + +### ABCD method + +Other searches measure transfer factors using observed data in selection regions that are distinct from the primary SR and CR, +in which case the method might be referred to as the **ABCD method**. This method is particularly popular for dealing with multijet +backgrounds that are not typically modelled well by simulation.
+ +Four selection regions in the observed data are involved, +formed by events passing or failing each of two selection criteria, as shown in the graphic below. The +number of background events in the SR (region C), \(N_\mathrm{C}\), is calculated from observations in regions A, B, and D as +\(N_\mathrm{D} \times (N_\mathrm{B} / N_\mathrm{A})\). This method may also be used in any bin of an observable to obtain a shape-based prediction for the background. +In general, the ABCD method requires that the two selection criteria are statistically independent in order to produce unbiased predictions. + +![Demonstration of the ABCD method](../images/ABCD.png) + +If some background sources are well-modelled by +simulation, these contributions may be subtracted from the observed data in each region before computing or applying the transfer function. +More than four regions can be used to incorporate a method for validation into the procedure, as shown in the second graphic. +The number of background events in the validation region X is estimated from the observations in regions A, D, and Y as \(N_\mathrm{D} \times (N_\mathrm{Y} / N_\mathrm{A})\). If region X has a suitably low rate of +expected signal events, the observed data in this region can be compared to the background prediction to test the validity +of the prediction method. + +![Demonstration of ABCD method with a validation region](../images/ABCDext.png) + +## Sideband fits + +In many searches, the observable most sensitive to the signal is a reconstructed mass +or jet mass distribution, in which the signal is expected to be resonant while the dominant background +processes are non-resonant. The shape of the background distribution may then be predicted by fitting a smooth +functional form to the observed data on either side of the region in which the signal distribution is expected to peak. This method +may be used in multiple dimensions for signals that feature more than one resonance.
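A schematic illustration of a sideband fit on toy data (not a CMS recipe): a smooth falling shape is fitted to the sidebands of a mass spectrum and interpolated under a blinded signal window. The exponential form, the 110-140 GeV window, and the use of SciPy's `curve_fit` are all assumptions for this sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(seed=7)

# Toy "observed" mass spectrum: a smoothly falling background (GeV, hypothetical).
masses = rng.exponential(scale=50.0, size=10000) + 50.0
counts, edges = np.histogram(masses, bins=np.linspace(50, 250, 41))
centers = 0.5 * (edges[:-1] + edges[1:])

# Blind the window where the signal is expected to peak (here 110-140 GeV).
sideband = (centers < 110) | (centers > 140)

def expo(x, norm, slope):
    """Smooth functional form fitted to the sidebands."""
    return norm * np.exp(-slope * x)

popt, _ = curve_fit(expo, centers[sideband], counts[sideband], p0=(1e4, 0.02))

# Interpolated background prediction in the blinded signal window.
n_bkg = expo(centers[~sideband], *popt).sum()
```

The same procedure, with a signal shape added on top of the smooth background form, yields the fit from which the signal rate is extracted.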
+ +When multiple functional forms offer adequate fits to the observed data, an F-statistic may be used to compare the residual sums of +squares for two formulas and determine whether a formula with more parameters provides a significantly better +fit than an alternate formula with fewer parameters (known as the Fisher \(F\)-test). diff --git a/docs/analysis/backgrounds/qcdestimation.md b/docs/analysis/backgrounds/qcdestimation.md deleted file mode 100644 index 07abd139..00000000 --- a/docs/analysis/backgrounds/qcdestimation.md +++ /dev/null @@ -1,4 +0,0 @@ -# QCD Estimation - -!!! Warning - This page is under construction diff --git a/docs/analysis/backgrounds/techniques.md b/docs/analysis/backgrounds/techniques.md deleted file mode 100644 index 0d69a731..00000000 --- a/docs/analysis/backgrounds/techniques.md +++ /dev/null @@ -1,4 +0,0 @@ -# Techniques - -!!! Warning - This page is under construction diff --git a/docs/analysis/interpretation/stats.md b/docs/analysis/interpretation/stats.md deleted file mode 100644 index 5ba770ed..00000000 --- a/docs/analysis/interpretation/stats.md +++ /dev/null @@ -1,4 +0,0 @@ -# Statistics - -!!! Warning - This page is under construction diff --git a/docs/analysis/luminosity/lumi.md b/docs/analysis/lumi.md similarity index 100% rename from docs/analysis/luminosity/lumi.md rename to docs/analysis/lumi.md diff --git a/docs/analysis/selection/validatedRuns.md b/docs/analysis/selection/validatedRuns.md index ca7722a7..0276ba88 100644 --- a/docs/analysis/selection/validatedRuns.md +++ b/docs/analysis/selection/validatedRuns.md @@ -41,4 +41,4 @@ and by adding these two lines after the `process.source` input file definition: process.source.lumisToProcess.extend(myLumis) ``` -This list should also be used as an input to the [luminosity calculation](../luminosity/lumi.md). +This list should also be used as an input to the [luminosity calculation](../lumi.md).
diff --git a/docs/analysis/stats.md b/docs/analysis/stats.md new file mode 100644 index 00000000..8bf16248 --- /dev/null +++ b/docs/analysis/stats.md @@ -0,0 +1,56 @@ +# Statistics + +!!! Warning + This page is under construction + +## Overview of CMS techniques + +CMS searches typically determine an observable or set of observables that is used to measure the potential presence of +signal events. This can be any observable, preferably highlighting unique features of the signal process. +Signal extraction is based on maximum likelihood fits that compare "data" (either collision data or pseudodata +sampled from a test distribution) to the signal (\(s\)) and background (\(b\)) predictions, with the signal scaled by an +unknown signal strength \(\mu\). The observed event counts are assumed to follow Poisson distributions, and all predictions are subject to various +nuisance parameters, \(\theta\), that are given default values \(\tilde{\theta}\) and assigned probability density functions (\(p\)). +The likelihood function can be written as: + +\[ +\mathcal{L}(\mathrm{data}\vert \mu,\theta) = \mathrm{Poisson}(\mathrm{data}\vert \mu\cdot s(\theta) + b(\theta))\cdot p(\tilde{\theta}\vert\theta). +\] + +Systematic uncertainties are incorporated into the fit as nuisance parameters. Lognormal probability distributions are assigned +to uncertainties that affect only the normalization of a histogram or rate of a predicted event yield, and Gaussian probability +distributions are typically assigned to uncertainties provided as histograms that affect the shape of a distribution. +You can learn about several typical sources of uncertainty in CMS analyses in the [Systematics section](systematics/lumiuncertain.md) +of the Guide. + +Observed and expected limits on the signal strength \(\mu\) are extracted by comparing the compatibility +of the observed data with a background-only (\(\mu = 0\)) hypothesis as well as with a signal+background hypothesis.
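The structure of the likelihood above can be illustrated with a toy single-bin counting experiment: one Poisson term for the observed yield and one lognormal-constrained nuisance on the background (all numbers below are hypothetical, and the scan stands in for the minimization that Combine performs):

```python
import math

def nll(mu, theta, n_obs, s, b, kappa=1.10):
    """Toy negative log-likelihood for one counting bin.

    The background is scaled by kappa**theta (a lognormal nuisance),
    and theta carries a unit-Gaussian constraint term.
    """
    expected = mu * s + b * kappa ** theta
    # Poisson term (dropping the constant log(n_obs!)) plus the constraint.
    return expected - n_obs * math.log(expected) + 0.5 * theta ** 2

# Hypothetical inputs: 25 observed events, s = 10 and b = 20 predicted.
n_obs, s, b = 25, 10.0, 20.0

# Crude grid scan over mu at the nominal nuisance value theta = 0.
best_mu = min((m / 100 for m in range(0, 301)),
              key=lambda m: nll(m, 0.0, n_obs, s, b))  # → 0.5
```

A full analysis profiles \(\theta\) at each value of \(\mu\) and builds the profile likelihood ratio test statistic from these fits, rather than scanning a fixed nuisance value.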
+The most common statistical method within CMS is the **CLs** method ([Read, 2002](http://dx.doi.org/10.1088/0954-3899/28/10/313) and [Junk, 1999](https://arxiv.org/abs/hep-ex/9902006)), +which can be used to obtain a limit at the 95% confidence level using the profile likelihood test statistic +([Cowan et al., 2010](https://arxiv.org/abs/1007.1727)) in the asymptotic approximation. + +The ["Higgs Combine"](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit) software framework used by +the CMS experiment to compute limits is built on the [RooFit](https://root.cern/manual/roofit/) and +[RooStats](https://root.cern/doc/master/group__Roostats.html) packages and implements statistical procedures developed +for combining ATLAS and CMS Higgs boson measurements. + +## Tutorials + +Many tutorials and lectures on statistical interpretation of LHC data are available online. Some selected highlights are listed here. + +- *Practical Statistics for LHC Physicists*, a set of three lectures by Prof. Harrison Prosper, 2015. Slides and videos are available for each lecture: + + - [Descriptive Statistics, Probability and Likelihood](https://indico.cern.ch/event/358542/) + - [Frequentist Inference](https://indico.cern.ch/event/358543/) + - [Bayesian Inference](https://indico.cern.ch/event/358544/) + +- Higgs Combine [tutorial on the main features of Combine](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexercise/). + + - [Solutions](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexerciseanswers/) are available + - Note: some links within this tutorial point to CMS internal resources. + +- Open Data Workshop [*Simplified Run 2 Analysis* lesson](https://cms-opendata-workshop.github.io/workshopwhepp-lesson-ttbarljetsanalysis/) + + - Lessons from the Open Data Workshop series use the Docker container environment recommended for processing Open Data.
+ - The overall lesson offers tools for analysis of files in the NanoAOD or [PhysObjectExtractorTool](https://github.com/cms-opendata-analyses/PhysObjectExtractorTool) format. + - Specifically, the final page of the lesson (*5: Systematics and Statistics*) introduces the python-based tool [pyhf](https://pyhf.readthedocs.io/en/v0.7.6/) for performing statistical inference without any ROOT software. diff --git a/docs/analysis/systematics/objectsuncertain.md b/docs/analysis/systematics/objectsuncertain.md deleted file mode 100644 index 86a2caae..00000000 --- a/docs/analysis/systematics/objectsuncertain.md +++ /dev/null @@ -1,4 +0,0 @@ -# Object Uncertainty - -!!! Warning - This page is under construction diff --git a/docs/images/ABCD.png b/docs/images/ABCD.png new file mode 100644 index 00000000..62a9b6b8 Binary files /dev/null and b/docs/images/ABCD.png differ diff --git a/docs/images/ABCDext.png b/docs/images/ABCDext.png new file mode 100644 index 00000000..d3244643 Binary files /dev/null and b/docs/images/ABCDext.png differ diff --git a/docs/javascripts/mathjax.js b/docs/javascripts/mathjax.js new file mode 100644 index 00000000..080801ef --- /dev/null +++ b/docs/javascripts/mathjax.js @@ -0,0 +1,16 @@ +window.MathJax = { + tex: { + inlineMath: [["\\(", "\\)"]], + displayMath: [["\\[", "\\]"]], + processEscapes: true, + processEnvironments: true + }, + options: { + ignoreHtmlClass: ".*|", + processHtmlClass: "arithmatex" + } +}; + +document$.subscribe(() => { + MathJax.typesetPromise() +}) diff --git a/mkdocs.yml b/mkdocs.yml index b3ade756..e7d02a04 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -65,12 +65,8 @@ nav: - 'Reference Guide (Fit)': - 'Overview': analysis/selection/idefficiencystudy/fittingreferenceguide/overview.md - 'Src files': analysis/selection/idefficiencystudy/fittingreferenceguide/src_files.md - - 'Luminosity': - - 'Luminosity': analysis/luminosity/lumi.md - - 'Backgrounds': - - 'Techniques': analysis/backgrounds/techniques.md - - 'QCD Estimation': 
analysis/backgrounds/qcdestimation.md - + - 'Luminosity': analysis/lumi.md + - 'Background Modelling': analysis/backgrounds.md - 'Systematics': - 'Luminosity Uncertainties': analysis/systematics/lumiuncertain.md - 'MC Uncertainty': analysis/systematics/mcuncertain.md @@ -79,9 +75,7 @@ nav: - 'Jet/MET uncertainties': analysis/systematics/objectsuncertain/jetmetuncertain.md - 'Tagging uncertainties': analysis/systematics/objectsuncertain/btaguncertain.md - 'Pileup Uncertainty': analysis/systematics/pileupuncertain.md - - 'Interpretation': - - 'Statistics': analysis/interpretation/stats.md - - 'Upper-limit Calculations': analysis/interpretation/limits.md + - 'Statistical Interpretation': analysis/stats.md - FAQ: faq.md - About: about.md @@ -103,6 +97,11 @@ theme: extra_css: - stylesheets/extra.css +extra_javascript: + - javascripts/mathjax.js + - https://polyfill.io/v3/polyfill.min.js?features=es6 + - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js + # Extensions markdown_extensions: - admonition @@ -115,7 +114,9 @@ markdown_extensions: - pymdownx.superfences - pymdownx.tabbed: alternate_style: true - + - pymdownx.arithmatex: + generic: true + # Plugins plugins: - search