First 2024 updates: new page, mathjax #114

Merged · 9 commits · Jan 12, 2024
142 changes: 142 additions & 0 deletions docs/analysis/backgrounds.md
@@ -0,0 +1,142 @@
# Background Modelling Techniques

!!! Warning
This page is under construction

Accurate modelling of SM background processes is essential to most searches and measurements in high energy physics.
The dominant background processes depend strongly on the selected objects, particularly leptons, missing
transverse momentum, and jets from b quarks or from boosted particles. Background estimation strategies
are always tailored to the individual analysis, typically as a variation of one or more of the following
common methods.

## Simulation

Many SM processes are simulated at NLO, which provides a strong basis for background
estimation. For processes such as W/Z+jets production that are often simulated at LO, *k*-factor
calculations allow one to reweight simulated events so that they reproduce the NLO predictions for distributions of interest.
In final states with charged leptons, for which QCD multijet production is unlikely to be a significant background, simulation is a common choice.
Additionally, the majority of searches use simulation to model the signal process under consideration.

Simulated events are weighted so that the efficiencies of certain selections
in simulation match those observed in data. These corrections are referred to as "scale factors".
Common scale factors in searches at the CMS experiment correct
for differences in:

- the number of pileup interactions
- the efficiencies of trigger selections
- the efficiencies of charged lepton identification and isolation selection criteria
- the efficiencies of various jet identification selection criteria, such as heavy-flavor tagging

A detailed set of [corrections for the jet energy scale and resolution](systematics/objectsuncertain/jetmetuncertain.md) is computed for simulated events so that the
response of the jet reconstruction algorithms is consistent between observed data and simulation. Searches may also derive
corrections for observed mismodelling of certain distributions of interest in simulation.
A common correction of this type reweights the reconstructed top quark \(p_{\mathrm{T}}\) spectrum, since NLO top quark pair simulations
tend to overpredict the rate of high-\(p_{\mathrm{T}}\) top quark pairs. Each correction applied to simulation carries an uncertainty
that should be taken into account in the statistical methods of signal extraction.
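
As an illustration, the sketch below combines several per-event scale factors into a single weight for a simulated event. The helper function and all numerical values are hypothetical, for demonstration only; they are not taken from any official CMS correction tool.

```python
# Minimal sketch: combine per-event scale factors into one event weight.
# All values are illustrative, not official CMS corrections.

def event_weight(pileup_sf, trigger_sf, lepton_sfs, btag_sfs):
    """Multiply all correction factors that apply to one simulated event."""
    weight = pileup_sf * trigger_sf
    for sf in lepton_sfs:  # one factor per selected lepton
        weight *= sf
    for sf in btag_sfs:    # one factor per b-tagged jet
        weight *= sf
    return weight

# An event with two selected leptons and one b-tagged jet
w = event_weight(0.98, 1.02, lepton_sfs=[0.99, 0.97], btag_sfs=[1.05])
print(w)  # the event enters all histograms with this weight
```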

## Tight / loose or "Matrix" methods

Searches that select multiple charged leptons often have considerable background
from events in which *nonprompt* leptons are selected. Nonprompt leptons are usually charged leptons that arise from
sources other than the hard scattering process or the decays of massive particles produced in it.

One method to estimate contributions from these events is to measure how often known prompt leptons, typically from the
decay of Z bosons, and known nonprompt leptons, typically from a sample of QCD multijet events, pass a certain set of
lepton selection criteria. A Z boson sample is created in data by selecting events with two same-flavor, opposite-sign
leptons whose invariant mass lies close to the Z boson mass. One lepton, known as the *tag*, is selected using very high-purity
selection criteria, giving confidence that the other *probe* lepton is indeed a prompt lepton. The efficiency for the
probe lepton to pass any criteria of interest can then be measured in this sample (learn more about this calculation
on the [tag and probe page](selection/idefficiencystudy/tagandprobe.md)). In the context of this background
estimation method, the efficiency of the analysis selection criteria is referred to as the *prompt rate*, $p$.

A QCD multijet sample can be created by selecting events that pass a low-momentum, low-purity, single-lepton trigger, but otherwise
exhibit no strong signs of the lepton arising from an SM boson decay. The rate at which these leptons pass the analysis selection
criteria can be measured, and is referred to as the *nonprompt* rate (or colloquially, *fake* rate), $f$. Both of these rates
describe how often either prompt or nonprompt leptons that pass some baseline *loose* selection also pass the *tight*
selection criteria used in the analysis.

For searches that probe final states with two charged leptons, the probabilities for any prompt or nonprompt lepton to enter the sample must be considered
together to develop a background distribution.
The numbers of events in which the leptons pass the tight and/or only the loose criteria are observed: the number of events with two tight leptons, \(N_{tt}\); with one tight lepton and one lepton passing the loose but failing the tight criteria, \(N_{tl}\); and with two such loose-but-not-tight leptons, \(N_{ll}\).
The prompt and nonprompt rates may then be used to convert those observations into numbers of events with two prompt leptons, \(N_{pp}\); one prompt and one nonprompt lepton, \(N_{pf}\); and two nonprompt leptons, \(N_{ff}\):

\[
\begin{pmatrix}
N_{tt} \\
N_{tl} \\
N_{ll}
\end{pmatrix} =
\begin{pmatrix}
p^2 & pf & f^2 \\
2p(1-p) & f(1-p) + p(1-f) & 2f(1-f) \\
(1-p)^2 & (1-p)(1-f) & (1-f)^2
\end{pmatrix}
\begin{pmatrix}
N_{pp}\\
N_{pf}\\
N_{ff}
\end{pmatrix}
\]

A matrix inversion provides formulas to calculate \(N_{pf}\) and \(N_{ff}\) from the observed numbers of events with leptons of
varying quality. For a search selecting two tight leptons, the background from events with nonprompt leptons is given
by \(N_{\mathrm{bkg}} = pfN_{pf} + f^2N_{ff}\). This method can be extended to searches targeting final states with more than two charged leptons by expanding the probability matrix.

A good reference for this method, built on earlier uses within CMS, is the [2022 doctoral thesis of Wing Yan Wong](http://cds.cern.ch/record/2808538).
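
As a concrete illustration, the sketch below builds the probability matrix for made-up values of \(p\) and \(f\), inverts the relation for toy observed yields, and computes the nonprompt background in the tight-tight selection. All numbers are invented for demonstration.

```python
import numpy as np

# Toy illustration of the two-lepton matrix method
p, f = 0.90, 0.15  # prompt and nonprompt rates (made-up values)

M = np.array([
    [p**2,       p * f,              f**2],
    [2*p*(1-p),  f*(1-p) + p*(1-f),  2*f*(1-f)],
    [(1-p)**2,   (1-p)*(1-f),        (1-f)**2],
])

N_obs = np.array([840.0, 310.0, 95.0])  # N_tt, N_tl, N_ll (toy yields)

# Invert the matrix relation to recover N_pp, N_pf, N_ff
N_pp, N_pf, N_ff = np.linalg.solve(M, N_obs)

# Nonprompt-lepton background entering the tight-tight signal selection
N_bkg = p * f * N_pf + f**2 * N_ff
print(f"N_pf = {N_pf:.1f}, N_ff = {N_ff:.1f}, N_bkg = {N_bkg:.1f}")
```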

## Transfer factors

In many searches, one important selection criterion serves as the primary dividing line between
a background-dominated control region (CR) and a region with good signal sensitivity, called the signal region (SR).
A *transfer factor* or *transfer function*
that describes the efficiency of this principal selection criterion can be derived and applied to the observed data in the
CR in order to estimate the background present in the SR.

### Alpha-ratio method

The transfer function can be computed in multiple ways. Some searches use simulation for this purpose, in which
case the method is often called the *alpha-ratio method*. The number of background events in the SR, \(N_{\mathrm{SR}}^{\mathrm{bkg}}\), is calculated
as:

\[
N_{\mathrm{SR}}^{\mathrm{bkg}} = N_{\mathrm{CR}}^{\mathrm{data}} \times \frac{N_{\mathrm{SR}}^{\mathrm{sim}}}{N_{\mathrm{CR}}^{\mathrm{sim}}},
\]

where \(N_{\mathrm{CR}}^{\mathrm{data}}\) is the number of observed collision events in the CR, \(N_{\mathrm{SR}}^{\mathrm{sim}}\) is the number of simulated events in the SR,
and \(N_{\mathrm{CR}}^{\mathrm{sim}}\) is the number of simulated events in the CR.
The transfer factor from simulation can be computed in any bin of an observable, so the shape as well as the rate of
background in the SR may be obtained.
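
A minimal sketch of a binned alpha-ratio prediction is shown below; all per-bin yields are illustrative.

```python
import numpy as np

# Illustrative per-bin yields in some observable
n_cr_data = np.array([120.0, 85.0, 40.0, 12.0])  # observed data in the CR
n_cr_sim  = np.array([115.0, 80.0, 42.0, 10.0])  # simulation in the CR
n_sr_sim  = np.array([ 30.0, 22.0, 11.0,  3.0])  # simulation in the SR

alpha = n_sr_sim / n_cr_sim    # transfer factor in each bin
n_sr_bkg = n_cr_data * alpha   # predicted background shape in the SR
print(n_sr_bkg)
```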

### ABCD method

Other searches measure transfer factors using observed data in selection regions that are distinct from the primary SR and CR,
in which case the method might be referred to as the *ABCD method*. This method is particularly popular for dealing with multijet
backgrounds that are not typically modelled well by simulation.

Four selection regions in the observed data are involved,
formed by events either passing or failing each of two selection criteria, as shown in the graphic below. The
number of background events in the SR (region C), \(N_\mathrm{C}\), is calculated from the observations in regions A, B, and D as
\(N_\mathrm{D} \times (N_\mathrm{B} / N_\mathrm{A})\). This method may also be applied in any bin of an observable to obtain a shape-based prediction for the background.
In general, the ABCD method requires that the two selection criteria be statistically independent in order to produce unbiased predictions.

![Demonstration of the ABCD method](../images/ABCD.png)
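
The sketch below applies the ABCD relation to toy region counts, propagating the statistical uncertainty under the assumption of independent Poisson counts in each region.

```python
import math

# Illustrative event counts in the three background-dominated regions
N_A, N_B, N_D = 5200.0, 260.0, 410.0

N_C = N_D * (N_B / N_A)  # predicted background in the SR (region C)

# Relative Poisson uncertainties add in quadrature for a product of ratios
rel_unc = math.sqrt(1.0/N_A + 1.0/N_B + 1.0/N_D)
print(f"N_C = {N_C:.1f} +/- {N_C * rel_unc:.1f} (stat.)")
```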

If some background sources are well modelled by
simulation, these contributions may be subtracted from the observed data in each region before computing or applying the transfer function.
More than four regions can be used to incorporate a validation step into the procedure, as shown in the second graphic.
The number of background events in the validation region X is estimated from the observations in regions A, D, and Y as \(N_\mathrm{D} \times (N_\mathrm{Y} / N_\mathrm{A})\). If region X has a suitably low rate of
expected signal events, the observed data in this region can be compared to the background prediction to test the validity
of the prediction method.

![Demonstration of ABCD method with a validation region](../images/ABCDext.png)

## Sideband fits

In many searches, the observable most sensitive to the signal is a reconstructed mass
or jet mass distribution, in which the signal is expected to be resonant while the dominant background
processes are non-resonant. The shape of the background distribution may then be predicted by fitting a smooth
functional form to the observed data on either side of the region in which the signal distribution is expected to peak. This method
may be used in multiple dimensions for signals that feature more than one resonance.

When multiple functional forms offer adequate fits to the observed data, an \(F\)-statistic may be used to compare the residual sums of
squares of two formulas and determine whether the formula with more parameters provides a significantly better
fit than an alternative formula with fewer parameters (a procedure known as the Fisher \(F\)-test).
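
The sketch below computes the \(F\)-statistic and its p-value for two nested fits; the residual sums of squares, parameter counts, and number of bins are illustrative.

```python
from scipy import stats

rss1, p1 = 54.2, 2  # residual sum of squares and parameter count, simpler fit
rss2, p2 = 45.1, 3  # same for the fit with one additional parameter
n_bins = 40         # number of fitted bins

# Fisher F-test for nested functional forms
F = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n_bins - p2))
p_value = stats.f.sf(F, p2 - p1, n_bins - p2)  # survival function = 1 - CDF

# A small p-value favors keeping the extra parameter;
# otherwise the simpler form is preferred.
print(f"F = {F:.2f}, p-value = {p_value:.3f}")
```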
4 changes: 0 additions & 4 deletions docs/analysis/backgrounds/qcdestimation.md

This file was deleted.

4 changes: 0 additions & 4 deletions docs/analysis/backgrounds/techniques.md

This file was deleted.

4 changes: 0 additions & 4 deletions docs/analysis/interpretation/stats.md

This file was deleted.

File renamed without changes.
2 changes: 1 addition & 1 deletion docs/analysis/selection/validatedRuns.md
@@ -41,4 +41,4 @@ and by adding these two lines after the `process.source` input file definition:
process.source.lumisToProcess.extend(myLumis)
```
-This list should also be used as an input to the [luminosity calculation](../luminosity/lumi.md).
+This list should also be used as an input to the [luminosity calculation](../lumi.md).
56 changes: 56 additions & 0 deletions docs/analysis/stats.md
@@ -0,0 +1,56 @@
# Statistics

!!! Warning
This page is under construction

## Overview of CMS techniques

CMS searches typically determine an observable or set of observables that is used to measure the potential presence of
signal events. This can be any observable, preferably highlighting unique features of the signal process.
Signal extraction is based on maximum likelihood fits that compare "data" (either collision data or pseudodata
sampled from a test distribution) to the signal (\(s\)) and background (\(b\)) predictions, with the signal scaled by an
unknown signal ratio \(\mu\). The likelihood is assumed to follow a Poisson distribution, and all predictions are subject to various
nuisance parameters, \(\theta\), that are given default values \(\tilde{\theta}\) and assigned probability density functions (\(p\)).
The likelihood function can be written as:

\[
\mathcal{L}(\mathrm{data}\vert \mu,\theta) = \mathrm{Poisson}(\mathrm{data}\vert \mu\cdot s(\theta) + b(\theta))\cdot p(\tilde{\theta}\vert\theta).
\]

Systematic uncertainties are incorporated into the fit as nuisance parameters. Lognormal probability distributions are assigned
to uncertainties that affect only the normalization of a histogram or rate of a predicted event yield, and Gaussian probability
distributions are typically assigned to uncertainties provided as histograms that affect the shape of a distribution.
You can learn about several typical sources of uncertainty in CMS analyses in the [Systematics section](systematics/lumiuncertain.md)
of the Guide.
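
As a toy illustration of this structure, the sketch below minimizes a single-bin negative log-likelihood in which one lognormal-constrained nuisance parameter modifies the background rate. All inputs are invented for demonstration; this is not the Combine implementation.

```python
from scipy.stats import poisson, norm
from scipy.optimize import minimize

n_obs, s, b = 28, 5.0, 20.0  # observed count, expected signal and background
kappa = 1.10                 # 10% lognormal uncertainty on the background

def nll(params):
    """Negative log-likelihood: Poisson term plus the nuisance constraint."""
    mu, theta = params
    expected = mu * s + b * kappa**theta  # lognormal rate modifier
    return -poisson.logpmf(n_obs, expected) - norm.logpdf(theta)

best = minimize(nll, x0=[1.0, 0.0])
print(best.x)  # fitted values of (mu, theta)
```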

Observed and expected limits on the signal ratio \(\mu\) are extracted by comparing the compatibility
of the observed data with a background-only (\(\mu = 0\)) hypothesis as well as with a signal+background hypothesis.
The most common statistical method within CMS is the **CLs** method ([Read, 2002](http://dx.doi.org/10.1088/0954-3899/28/10/313) and [Junk, 1999](https://www.arxiv.org/abs/hep-ex/9902006)),
which can be used to obtain a limit at the 95% confidence level using the profile likelihood test statistic
([Cowan, 2010](https://arxiv.org/abs/1007.1727)) with the asymptotic limit approximation.
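
For a quick feel for the procedure, the sketch below computes an observed CLs value with pyhf (introduced in the Open Data Workshop lesson listed below), assuming a simple two-bin counting model with illustrative yields.

```python
import pyhf

# Two-bin counting model with uncorrelated background uncertainties
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0, 10.0],          # expected signal per bin
    bkg=[50.0, 60.0],            # expected background per bin
    bkg_uncertainty=[7.0, 8.0],  # absolute background uncertainty per bin
)
observations = [52.0, 65.0]
data = observations + model.config.auxdata

# Observed CLs for mu = 1, using the profile likelihood test statistic
# in the asymptotic approximation
cls_obs = pyhf.infer.hypotest(1.0, data, model, test_stat="qtilde")
print(f"Observed CLs at mu = 1: {float(cls_obs):.3f}")
```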

The ["Higgs Combine"](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit) software framework used by
the CMS experiment to compute limits is built on the [RooFit](https://root.cern/manual/roofit/) and
[RooStats](https://root.cern/doc/master/group__Roostats.html) packages and implements statistical procedures developed
for combining ATLAS and CMS Higgs boson measurements.

## Tutorials

Many tutorials and lectures on statistical interpretation of LHC data are available online. Some selected highlights are listed here.

- *Practical Statistics for LHC Physicists*, a set of three lectures by Prof. Harrison Prosper, 2015. Slides and videos are available for each lecture:

- [Descriptive Statistics, Probability and Likelihood](https://indico.cern.ch/event/358542/)
- [Frequentist Inference](https://indico.cern.ch/event/358543/)
- [Bayesian Inference](https://indico.cern.ch/event/358544/)

- Higgs Combine [tutorial on the main features of Combine](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexercise/).

- [Solutions](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexerciseanswers/) are available
- Note: some links within this tutorial point to CMS internal resources.

- Open Data Workshop [*Simplified Run 2 Analysis* lesson](https://cms-opendata-workshop.github.io/workshopwhepp-lesson-ttbarljetsanalysis/)

- Lessons from the Open Data Workshop series use the Docker container environment recommended for processing Open Data.
- The overall lesson offers tools for analysis of files in the NanoAOD or [PhysObjectExtractorTool](https://github.com/cms-opendata-analyses/PhysObjectExtractorTool) format.
- Specifically, the final page of the lesson (*5: Systematics and Statistics*) introduces the Python-based tool [pyhf](https://pyhf.readthedocs.io/en/v0.7.6/) for performing statistical inference without any ROOT software.
4 changes: 0 additions & 4 deletions docs/analysis/systematics/objectsuncertain.md

This file was deleted.

Binary file added docs/images/ABCD.png
Binary file added docs/images/ABCDext.png
16 changes: 16 additions & 0 deletions docs/javascripts/mathjax.js
@@ -0,0 +1,16 @@
window.MathJax = {
tex: {
inlineMath: [["\\(", "\\)"]],
displayMath: [["\\[", "\\]"]],
processEscapes: true,
processEnvironments: true
},
options: {
ignoreHtmlClass: ".*|",
processHtmlClass: "arithmatex"
}
};

// Re-typeset the math after each page change; the document$ observable
// is provided by the Material for MkDocs theme.
document$.subscribe(() => {
  MathJax.typesetPromise()
})
21 changes: 11 additions & 10 deletions mkdocs.yml
@@ -65,12 +65,8 @@ nav:
- 'Reference Guide (Fit)':
- 'Overview': analysis/selection/idefficiencystudy/fittingreferenceguide/overview.md
- 'Src files': analysis/selection/idefficiencystudy/fittingreferenceguide/src_files.md
-    - 'Luminosity':
-        - 'Luminosity': analysis/luminosity/lumi.md
-    - 'Backgrounds':
-        - 'Techniques': analysis/backgrounds/techniques.md
-        - 'QCD Estimation': analysis/backgrounds/qcdestimation.md
-
+    - 'Luminosity': analysis/lumi.md
+    - 'Background Modelling': analysis/backgrounds.md
- 'Systematics':
- 'Luminosity Uncertainties': analysis/systematics/lumiuncertain.md
- 'MC Uncertainty': analysis/systematics/mcuncertain.md
@@ -79,9 +75,7 @@ nav:
- 'Jet/MET uncertainties': analysis/systematics/objectsuncertain/jetmetuncertain.md
- 'Tagging uncertainties': analysis/systematics/objectsuncertain/btaguncertain.md
- 'Pileup Uncertainty': analysis/systematics/pileupuncertain.md
-    - 'Interpretation':
-        - 'Statistics': analysis/interpretation/stats.md
-        - 'Upper-limit Calculations': analysis/interpretation/limits.md
+    - 'Statistical Interpretation': analysis/stats.md
- FAQ: faq.md
- About: about.md

@@ -103,6 +97,11 @@ theme:
extra_css:
- stylesheets/extra.css

+extra_javascript:
+  - javascripts/mathjax.js
+  - https://polyfill.io/v3/polyfill.min.js?features=es6
+  - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
+
# Extensions
markdown_extensions:
- admonition
@@ -115,7 +114,9 @@ markdown_extensions:
- pymdownx.superfences
- pymdownx.tabbed:
alternate_style: true

+  - pymdownx.arithmatex:
+      generic: true

# Plugins
plugins:
- search