STATISTICAL METHODS and TECHNIQUES for DATA ANALYSIS
(in Particle and Astroparticle Physics)

Professor: Alexis Pompili
(Dipartimento Interuniversitario di Fisica, University of Bari Aldo Moro & I.N.F.N.-Bari)

Course for Ph.D. students / July 2025
Dottorato Nazionale in Tecnologie per la Ricerca Fondamentale in Fisica ed Astrofisica / Bari Ph.D. in Physics XL-Cycle

This is a self-contained course in statistical data analysis,
with examples of applications drawn from High Energy Physics.

It cannot, of course, be an all-encompassing course, but concepts are rigorously introduced
and used coherently, and some selected topics are treated in detail with a hands-on approach.

Ph.D. students will be guided from the simpler concepts of probability and statistics to fairly advanced methods and techniques.
In this way the course aims to accommodate students with different, non-homogeneous levels of entry knowledge.

Hands-on exercises are carried out in a Jupyter notebook executed in the Google Colab framework. Students need to have a (free) Google account.

The course runs for 20+4 hours
[the last 4 hours are devoted to introducing a Machine Learning-based signal selection;
the use of XGBoost is demonstrated, together with the optimization of its hyperparameters carried out with Optuna;
these last 4 hours are co-organized with Dr. Umit Sozbilir, post-doc at the University of Bari]

A final exam (homework type) will be organized as a learning assessment.

The course will be held on the Zoom platform because the Ph.D. students of the Dottorato Nazionale are geographically spread out.


Compact description of contents and topics of the course

Basic concepts of probability theory. Axiomatic probability and the role of Bayes' theorem.
Histograms: sampling and binning. Comparison of histograms: absolute and relative normalization, stacked plots, data-to-simulation comparison, data-to-data comparison.
Ratios of histograms and their uncertainties.
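
As an illustration of the last item, a minimal sketch of a bin-by-bin histogram ratio with its uncertainty is given below. It assumes that the two histograms are statistically independent and that each bin content is a Poisson count, so that the relative uncertainties add in quadrature; the function name and the toy bin contents are hypothetical and are not taken from the course material.

```python
import numpy as np

def histogram_ratio(numerator, denominator):
    """Bin-by-bin ratio of two histograms with propagated uncertainty.

    Assumption: independent Poisson counts in both histograms, so
    sigma_r / r = sqrt(1/n + 1/d). Bins where either count is zero
    are left undefined (NaN).
    """
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    ratio = np.full_like(num, np.nan)
    error = np.full_like(num, np.nan)
    ok = (num > 0) & (den > 0)  # bins where the ratio is defined
    ratio[ok] = num[ok] / den[ok]
    error[ok] = ratio[ok] * np.sqrt(1.0 / num[ok] + 1.0 / den[ok])
    return ratio, error

# Toy bin contents (hypothetical): data and simulation in three bins
data = np.array([100, 400, 0])
simulation = np.array([400, 400, 50])
ratio, error = histogram_ratio(data, simulation)
print(ratio)
print(error)
```

This simple propagation does not apply when the numerator is a subset of the denominator (as for an efficiency), where a binomial treatment is needed instead.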

Probability density functions and their features. Joint and conditional probabilities.
Dependence and correlation between observables. Covariance matrix. Variance propagation.

Generation of distributions. Binomial distribution and efficiency. Stochastic (Poissonian) processes and applicability of the Poissonian distribution.
Gaussian function and its role in the Central Limit Theorem. Gaussian resolution function.
Other important distributions (Crystal Ball, Breit-Wigner, chi-squared).

Hypothesis testing: test statistics, discrimination of signal against background, ROC curve and choice of a suitable working point.

Point estimation theory. Maximum Likelihood fitting, binned and unbinned, extended.
Symmetric and asymmetric uncertainties, Profile Likelihood.
Fitting tasks within a Jupyter notebook. Background modeling with polynomials of different degrees; sideband subtraction method.

Python framework and Jupyter notebook. Uproot and RDataFrame to handle big data.
Extraction of a physical signal from big data with a classical cut-based selection; evaluation of signal significance, signal purity and signal-to-noise ratio.
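
The figures of merit named above can be sketched as follows. This is only an illustrative example: it assumes the commonly used definitions significance = S/sqrt(S+B), purity = S/(S+B) and signal-to-noise ratio = S/B, which may differ from the definitions adopted in the lessons; the function name and the yields are hypothetical.

```python
import math

def selection_figures_of_merit(n_signal, n_background):
    """Return (significance, purity, signal-to-noise ratio) for given yields.

    Assumed definitions: S/sqrt(S+B), S/(S+B) and S/B.
    """
    total = n_signal + n_background
    if n_signal < 0 or n_background < 0 or total == 0:
        raise ValueError("yields must be non-negative and not both zero")
    significance = n_signal / math.sqrt(total)
    purity = n_signal / total
    # S/B is unbounded when no background survives the selection
    s_over_b = n_signal / n_background if n_background > 0 else math.inf
    return significance, purity, s_over_b

# Toy yields (hypothetical) in the signal region after a cut-based selection
significance, purity, s_over_b = selection_figures_of_merit(900, 1600)
print(significance, purity, s_over_b)
```

In practice the yields S and B would come from a fit or from sideband subtraction, and the cuts would be tuned to maximize the chosen figure of merit.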

Extraction of a physical signal from big data with a machine-learning approach (XGBoost) optimized by Optuna. Comparison with the cut-based selection. [U. Sozbilir]

Note: all items are covered by hands-on examples/exercises, executed on the Google Colab platform and drawn from High Energy Physics best practices.


Tentative but realistic Agenda

(material will be added during the execution of the course)

July 7th: 2 hours (15-17)

July 8th: 5 hours (10-13, 15-17)

July 9th: 4 hours (11-13, 15-17)

July 10th: 5 hours (10-13, 15-17)

July 11th: 4 hours (11-13, 15-17)

July 16th: 4 hours (9.30-11.30, 12.00-14.00) [with U. Sozbilir]

The Zoom coordinates will be the same for all sessions and will be sent by email on the first day of lessons.

Copyright: all the material of this course may be used only with the permission of the author (pompili AT ba.infn.it) and with proper acknowledgment.