First 2024 updates: new page, mathjax #114

Merged · 9 commits · Jan 12, 2024
142 changes: 142 additions & 0 deletions docs/analysis/backgrounds.md
@@ -0,0 +1,142 @@
# Background Modelling Techniques

!!! Warning
This page is under construction

Accurate modelling of SM background processes is essential to most searches and measurements in high energy physics.
The dominant background processes depend strongly on the selected objects, particularly leptons, missing
transverse momentum, and jets from b quarks or from boosted particles. Background estimation strategies
are always tailored to the individual analysis, typically as a variation of one or more of the following
common methods.

## Simulation

Many SM processes are simulated at NLO, which provides a strong basis for background
estimation. For processes such as W/Z+jets production that are often simulated at LO, *k*-factor
calculations allow one to reweight simulated events so that they reproduce the NLO predictions for distributions of interest.
In final states with charged leptons, for which QCD multijet production is unlikely to be a significant background, simulation is a common choice.
Additionally, the majority of searches use simulation to model the signal process under consideration.

Simulated events are weighted so that the efficiencies of certain selections
in simulation match those observed in data. These corrections are referred to as "scale factors".
Common scale factors in searches at the CMS experiment correct
for differences in:

- the number of pileup interactions
- the efficiencies of trigger selections
- the efficiencies of charged lepton identification and isolation selection criteria
- the efficiencies of various jet identification selection criteria, such as heavy-flavor tagging

A detailed set of [corrections for the jet energy scale and resolution](systematics/objectsuncertain/jetmetuncertain.md) is computed for simulated events so that the
response of the jet reconstruction algorithms is consistent between observed data and simulation. Searches may also derive
corrections for observed mismodelling of certain distributions of interest in simulation.
A common correction of this type reweights the reconstructed top quark \(p_{\mathrm{T}}\) spectrum, since NLO top quark pair simulations
tend to overpredict the rate of high-\(p_{\mathrm{T}}\) top quark pairs. Each correction applied to simulation carries an uncertainty
that should be taken into account in the statistical methods of signal extraction.
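
As an illustration, the sketch below combines several per-event scale factors into a single weight for a simulated event. The helper function and all numerical values are hypothetical, for demonstration only; they are not taken from any official CMS correction tool.

```python
# Minimal sketch: combine per-event scale factors into one event weight.
# All values are illustrative, not official CMS corrections.

def event_weight(pileup_sf, trigger_sf, lepton_sfs, btag_sfs):
    """Multiply all correction factors that apply to one simulated event."""
    weight = pileup_sf * trigger_sf
    for sf in lepton_sfs:  # one factor per selected lepton
        weight *= sf
    for sf in btag_sfs:    # one factor per b-tagged jet
        weight *= sf
    return weight

# An event with two selected leptons and one b-tagged jet
w = event_weight(0.98, 1.02, lepton_sfs=[0.99, 0.97], btag_sfs=[1.05])
print(w)  # the event enters all histograms with this weight
```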

## Tight / loose or "Matrix" methods

Searches that select multiple charged leptons often have considerable background
from events in which *nonprompt* leptons are selected. Nonprompt leptons are usually charged leptons that arise from
sources other than the hard scattering process or the decays of massive particles produced in it.

One method to estimate contributions from these events is to measure how often known prompt leptons, typically from the
decay of Z bosons, and known nonprompt leptons, typically from a sample of QCD multijet events, pass a certain set of
lepton selection criteria. A Z boson sample is created in data by selecting events with two same-flavor, opposite-sign
leptons whose invariant mass lies close to the Z boson mass. One lepton, known as the *tag*, is selected using very high-purity
selection criteria, giving confidence that the other *probe* lepton is indeed a prompt lepton. The efficiency for the
probe lepton to pass any criteria of interest can then be measured in this sample (learn more about this calculation
on the [tag and probe page](selection/idefficiencystudy/tagandprobe.md)). In the context of this background
estimation method, the efficiency of the analysis selection criteria is referred to as the *prompt rate*, $p$.

A QCD multijet sample can be created by selecting events that pass a low-momentum, low-purity, single-lepton trigger, but otherwise
exhibit no strong signs of the lepton arising from an SM boson decay. The rate at which these leptons pass the analysis selection
criteria can be measured, and is referred to as the *nonprompt* rate (or colloquially, *fake* rate), $f$. Both of these rates
describe how often either prompt or nonprompt leptons that pass some baseline *loose* selection also pass the *tight*
selection criteria used in the analysis.

For searches that probe final states with two charged leptons, the probabilities for any prompt or nonprompt lepton to enter the sample must be considered
together to develop a background distribution.
The numbers of events in which the leptons pass the tight and/or only the loose criteria are observed: the number of events with two tight leptons, \(N_{tt}\); with one tight lepton and one lepton passing the loose but failing the tight criteria, \(N_{tl}\); and with two such loose-but-not-tight leptons, \(N_{ll}\).
The prompt and nonprompt rates may then be used to convert those observations into numbers of events with two prompt leptons, \(N_{pp}\); one prompt and one nonprompt lepton, \(N_{pf}\); and two nonprompt leptons, \(N_{ff}\):

\[
\begin{pmatrix}
N_{tt} \\
N_{tl} \\
N_{ll}
\end{pmatrix} =
\begin{pmatrix}
p^2 & pf & f^2 \\
2p(1-p) & f(1-p) + p(1-f) & 2f(1-f) \\
(1-p)^2 & (1-p)(1-f) & (1-f)^2
\end{pmatrix}
\begin{pmatrix}
N_{pp}\\
N_{pf}\\
N_{ff}
\end{pmatrix}
\]

A matrix inversion provides formulas to calculate \(N_{pf}\) and \(N_{ff}\) from the observed numbers of events with leptons of
varying quality. For a search selecting two tight leptons, the background from events with nonprompt leptons is given
by \(N_{\mathrm{bkg}} = pfN_{pf} + f^2N_{ff}\). This method can be extended to searches targeting final states with more than two charged leptons by expanding the probability matrix.

A good reference for this method, built on earlier uses within CMS, is the [2022 doctoral thesis of Wing Yan Wong](http://cds.cern.ch/record/2808538).
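
As a concrete illustration, the sketch below builds the probability matrix for made-up values of \(p\) and \(f\), inverts the relation for toy observed yields, and computes the nonprompt background in the tight-tight selection. All numbers are invented for demonstration.

```python
import numpy as np

# Toy illustration of the two-lepton matrix method
p, f = 0.90, 0.15  # prompt and nonprompt rates (made-up values)

M = np.array([
    [p**2,       p * f,              f**2],
    [2*p*(1-p),  f*(1-p) + p*(1-f),  2*f*(1-f)],
    [(1-p)**2,   (1-p)*(1-f),        (1-f)**2],
])

N_obs = np.array([840.0, 310.0, 95.0])  # N_tt, N_tl, N_ll (toy yields)

# Invert the matrix relation to recover N_pp, N_pf, N_ff
N_pp, N_pf, N_ff = np.linalg.solve(M, N_obs)

# Nonprompt-lepton background entering the tight-tight signal selection
N_bkg = p * f * N_pf + f**2 * N_ff
print(f"N_pf = {N_pf:.1f}, N_ff = {N_ff:.1f}, N_bkg = {N_bkg:.1f}")
```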

## Transfer factors

In many searches, one important selection criterion serves as the primary dividing line between
a background-dominated control region (CR) and a region with good signal sensitivity, called the signal region (SR).
A *transfer factor* or *transfer function*
that describes the efficiency of this principal selection criterion can be derived and applied to the observed data in the
CR in order to estimate the background present in the SR.

### Alpha-ratio method

The transfer function can be computed in multiple ways. Some searches use simulation for this purpose, in which
case the method is often called the *alpha-ratio method*. The number of background events in the SR, \(N_{\mathrm{SR}}^{\mathrm{bkg}}\), is calculated
as:

\[
N_{\mathrm{SR}}^{\mathrm{bkg}} = N_{\mathrm{CR}}^{\mathrm{data}} \times \frac{N_{\mathrm{SR}}^{\mathrm{sim}}}{N_{\mathrm{CR}}^{\mathrm{sim}}},
\]

where \(N_{\mathrm{CR}}^{\mathrm{data}}\) is the number of observed collision events in the CR, \(N_{\mathrm{SR}}^{\mathrm{sim}}\) is the number of simulated events in the SR,
and \(N_{\mathrm{CR}}^{\mathrm{sim}}\) is the number of simulated events in the CR.
The transfer factor from simulation can be computed in any bin of an observable, so the shape as well as the rate of
background in the SR may be obtained.
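
A minimal sketch of a binned alpha-ratio prediction is shown below; all per-bin yields are illustrative.

```python
import numpy as np

# Illustrative per-bin yields in some observable
n_cr_data = np.array([120.0, 85.0, 40.0, 12.0])  # observed data in the CR
n_cr_sim  = np.array([115.0, 80.0, 42.0, 10.0])  # simulation in the CR
n_sr_sim  = np.array([ 30.0, 22.0, 11.0,  3.0])  # simulation in the SR

alpha = n_sr_sim / n_cr_sim    # transfer factor in each bin
n_sr_bkg = n_cr_data * alpha   # predicted background shape in the SR
print(n_sr_bkg)
```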

### ABCD method

Other searches measure transfer factors using observed data in selection regions that are distinct from the primary SR and CR,
in which case the method might be referred to as the *ABCD method*. This method is particularly popular for dealing with multijet
backgrounds that are not typically modelled well by simulation.

Four selection regions in the observed data are involved,
formed by events either passing or failing each of two selection criteria, as shown in the graphic below. The
number of background events in the SR (region C), \(N_\mathrm{C}\), is calculated from the observations in regions A, B, and D as
\(N_\mathrm{D} \times (N_\mathrm{B} / N_\mathrm{A})\). This method may also be applied in any bin of an observable to obtain a shape-based prediction for the background.
In general, the ABCD method requires that the two selection criteria be statistically independent in order to produce unbiased predictions.

![Demonstration of the ABCD method](../images/ABCD.png)
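
The sketch below applies the ABCD relation to toy region counts, propagating the statistical uncertainty under the assumption of independent Poisson counts in each region.

```python
import math

# Illustrative event counts in the three background-dominated regions
N_A, N_B, N_D = 5200.0, 260.0, 410.0

N_C = N_D * (N_B / N_A)  # predicted background in the SR (region C)

# Relative Poisson uncertainties add in quadrature for a product of ratios
rel_unc = math.sqrt(1.0/N_A + 1.0/N_B + 1.0/N_D)
print(f"N_C = {N_C:.1f} +/- {N_C * rel_unc:.1f} (stat.)")
```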

If some background sources are well modelled by
simulation, these contributions may be subtracted from the observed data in each region before computing or applying the transfer function.
More than four regions can be used to incorporate a validation step into the procedure, as shown in the second graphic.
The number of background events in the validation region X is estimated from the observations in regions A, D, and Y as \(N_\mathrm{D} \times (N_\mathrm{Y} / N_\mathrm{A})\). If region X has a suitably low rate of
expected signal events, the observed data in this region can be compared to the background prediction to test the validity
of the prediction method.

![Demonstration of ABCD method with a validation region](../images/ABCDext.png)

## Sideband fits

In many searches, the observable most sensitive to the signal is a reconstructed mass
or jet mass distribution, in which the signal is expected to be resonant while the dominant background
processes are non-resonant. The shape of the background distribution may then be predicted by fitting a smooth
functional form to the observed data on either side of the region in which the signal distribution is expected to peak. This method
may be used in multiple dimensions for signals that feature more than one resonance.

When multiple functional forms offer adequate fits to the observed data, an \(F\)-statistic may be used to compare the residual sums of
squares of two formulas and determine whether the formula with more parameters provides a significantly better
fit than an alternative formula with fewer parameters (a procedure known as the Fisher \(F\)-test).
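
The sketch below computes the \(F\)-statistic and its p-value for two nested fits; the residual sums of squares, parameter counts, and number of bins are illustrative.

```python
from scipy import stats

rss1, p1 = 54.2, 2  # residual sum of squares and parameter count, simpler fit
rss2, p2 = 45.1, 3  # same for the fit with one additional parameter
n_bins = 40         # number of fitted bins

# Fisher F-test for nested functional forms
F = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n_bins - p2))
p_value = stats.f.sf(F, p2 - p1, n_bins - p2)  # survival function = 1 - CDF

# A small p-value favors keeping the extra parameter;
# otherwise the simpler form is preferred.
print(f"F = {F:.2f}, p-value = {p_value:.3f}")
```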
4 changes: 0 additions & 4 deletions docs/analysis/backgrounds/qcdestimation.md

This file was deleted.

4 changes: 0 additions & 4 deletions docs/analysis/backgrounds/techniques.md

This file was deleted.

4 changes: 0 additions & 4 deletions docs/analysis/interpretation/stats.md

This file was deleted.

File renamed without changes.
2 changes: 1 addition & 1 deletion docs/analysis/selection/validatedRuns.md
@@ -41,4 +41,4 @@ and by adding these two lines after the `process.source` input file definition:
process.source.lumisToProcess.extend(myLumis)
```
-This list should also be used as an input to the [luminosity calculation](../luminosity/lumi.md).
+This list should also be used as an input to the [luminosity calculation](../lumi.md).
56 changes: 56 additions & 0 deletions docs/analysis/stats.md
@@ -0,0 +1,56 @@
# Statistics

!!! Warning
This page is under construction

## Overview of CMS techniques

CMS searches typically determine an observable or set of observables that is used to measure the potential presence of
signal events. This can be any observable, preferably highlighting unique features of the signal process.
Signal extraction is based on maximum likelihood fits that compare "data" (either collision data or pseudodata
sampled from a test distribution) to the signal (\(s\)) and background (\(b\)) predictions, with the signal scaled by an
unknown signal ratio \(\mu\). The likelihood is assumed to follow a Poisson distribution, and all predictions are subject to various
nuisance parameters, \(\theta\), that are given default values \(\tilde{\theta}\) and assigned probability density functions (\(p\)).
The likelihood function can be written as:

\[
\mathcal{L}(\mathrm{data}\vert \mu,\theta) = \mathrm{Poisson}(\mathrm{data}\vert \mu\cdot s(\theta) + b(\theta))\cdot p(\tilde{\theta}\vert\theta).
\]

Systematic uncertainties are incorporated into the fit as nuisance parameters. Lognormal probability distributions are assigned
to uncertainties that affect only the normalization of a histogram or rate of a predicted event yield, and Gaussian probability
distributions are typically assigned to uncertainties provided as histograms that affect the shape of a distribution.
You can learn about several typical sources of uncertainty in CMS analyses in the [Systematics section](systematics/lumiuncertain.md)
of the Guide.
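
As a toy illustration of this structure, the sketch below minimizes a single-bin negative log-likelihood in which one lognormal-constrained nuisance parameter modifies the background rate. All inputs are invented for demonstration; this is not the Combine implementation.

```python
from scipy.stats import poisson, norm
from scipy.optimize import minimize

n_obs, s, b = 28, 5.0, 20.0  # observed count, expected signal and background
kappa = 1.10                 # 10% lognormal uncertainty on the background

def nll(params):
    """Negative log-likelihood: Poisson term plus the nuisance constraint."""
    mu, theta = params
    expected = mu * s + b * kappa**theta  # lognormal rate modifier
    return -poisson.logpmf(n_obs, expected) - norm.logpdf(theta)

best = minimize(nll, x0=[1.0, 0.0])
print(best.x)  # fitted values of (mu, theta)
```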

Observed and expected limits on the signal ratio \(\mu\) are extracted by comparing the compatibility
of the observed data with a background-only (\(\mu = 0\)) hypothesis as well as with a signal+background hypothesis.
The most common statistical method within CMS is the **CLs** method ([Read, 2002](http://dx.doi.org/10.1088/0954-3899/28/10/313) and [Junk, 1999](https://www.arxiv.org/abs/hep-ex/9902006)),
which can be used to obtain a limit at the 95% confidence level using the profile likelihood test statistic
([Cowan, 2010](https://arxiv.org/abs/1007.1727)) with the asymptotic limit approximation.
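
For a quick feel for the procedure, the sketch below computes an observed CLs value with pyhf (introduced in the Open Data Workshop lesson listed below), assuming a simple two-bin counting model with illustrative yields.

```python
import pyhf

# Two-bin counting model with uncorrelated background uncertainties
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0, 10.0],          # expected signal per bin
    bkg=[50.0, 60.0],            # expected background per bin
    bkg_uncertainty=[7.0, 8.0],  # absolute background uncertainty per bin
)
observations = [52.0, 65.0]
data = observations + model.config.auxdata

# Observed CLs for mu = 1, using the profile likelihood test statistic
# in the asymptotic approximation
cls_obs = pyhf.infer.hypotest(1.0, data, model, test_stat="qtilde")
print(f"Observed CLs at mu = 1: {float(cls_obs):.3f}")
```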

The ["Higgs Combine"](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit) software framework used by
the CMS experiment to compute limits is built on the [RooFit](https://root.cern/manual/roofit/) and
[RooStats](https://root.cern/doc/master/group__Roostats.html) packages and implements statistical procedures developed
for combining ATLAS and CMS Higgs boson measurements.

## Tutorials

Many tutorials and lectures on statistical interpretation of LHC data are available online. Some selected highlights are listed here.

- *Practical Statistics for LHC Physicists*, a set of three lectures by Prof. Harrison Prosper, 2015. Slides and videos are available for each lecture:

- [Descriptive Statistics, Probability and Likelihood](https://indico.cern.ch/event/358542/)
- [Frequentist Inference](https://indico.cern.ch/event/358543/)
- [Bayesian Inference](https://indico.cern.ch/event/358544/)

- Higgs Combine [tutorial on the main features of Combine](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexercise/).

- [Solutions](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexerciseanswers/) are available
- Note: some links within this tutorial point to CMS internal resources.

- Open Data Workshop [*Simplified Run 2 Analysis* lesson](https://cms-opendata-workshop.github.io/workshopwhepp-lesson-ttbarljetsanalysis/)

- Lessons from the Open Data Workshop series use the Docker container environment recommended for processing Open Data.
- The overall lesson offers tools for analysis of files in the NanoAOD or [PhysObjectExtractorTool](https://github.com/cms-opendata-analyses/PhysObjectExtractorTool) format.
- Specifically, the final page of the lesson (*5: Systematics and Statistics*) introduces the Python-based tool [pyhf](https://pyhf.readthedocs.io/en/v0.7.6/) for performing statistical inference without any ROOT software.
4 changes: 0 additions & 4 deletions docs/analysis/systematics/objectsuncertain.md

This file was deleted.

Binary file added docs/images/ABCD.png
Binary file added docs/images/ABCDext.png
16 changes: 16 additions & 0 deletions docs/javascripts/mathjax.js
@@ -0,0 +1,16 @@
window.MathJax = {
tex: {
inlineMath: [["\\(", "\\)"]],
displayMath: [["\\[", "\\]"]],
processEscapes: true,
processEnvironments: true
},
options: {
ignoreHtmlClass: ".*|",
processHtmlClass: "arithmatex"
}
};

// Re-typeset the math after each page change; the document$ observable
// is provided by the Material for MkDocs theme.
document$.subscribe(() => {
  MathJax.typesetPromise()
})
21 changes: 11 additions & 10 deletions mkdocs.yml
@@ -65,12 +65,8 @@ nav:
- 'Reference Guide (Fit)':
- 'Overview': analysis/selection/idefficiencystudy/fittingreferenceguide/overview.md
- 'Src files': analysis/selection/idefficiencystudy/fittingreferenceguide/src_files.md
-    - 'Luminosity':
-        - 'Luminosity': analysis/luminosity/lumi.md
-    - 'Backgrounds':
-        - 'Techniques': analysis/backgrounds/techniques.md
-        - 'QCD Estimation': analysis/backgrounds/qcdestimation.md
-
+    - 'Luminosity': analysis/lumi.md
+    - 'Background Modelling': analysis/backgrounds.md
- 'Systematics':
- 'Luminosity Uncertainties': analysis/systematics/lumiuncertain.md
- 'MC Uncertainty': analysis/systematics/mcuncertain.md
@@ -79,9 +75,7 @@ nav:
- 'Jet/MET uncertainties': analysis/systematics/objectsuncertain/jetmetuncertain.md
- 'Tagging uncertainties': analysis/systematics/objectsuncertain/btaguncertain.md
- 'Pileup Uncertainty': analysis/systematics/pileupuncertain.md
-    - 'Interpretation':
-        - 'Statistics': analysis/interpretation/stats.md
-        - 'Upper-limit Calculations': analysis/interpretation/limits.md
+    - 'Statistical Interpretation': analysis/stats.md
- FAQ: faq.md
- About: about.md

@@ -103,6 +97,11 @@ theme:
extra_css:
- stylesheets/extra.css

+extra_javascript:
+  - javascripts/mathjax.js
+  - https://polyfill.io/v3/polyfill.min.js?features=es6
+  - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
+
# Extensions
markdown_extensions:
- admonition
@@ -115,7 +114,9 @@ markdown_extensions:
- pymdownx.superfences
- pymdownx.tabbed:
alternate_style: true

+  - pymdownx.arithmatex:
+      generic: true

# Plugins
plugins:
- search