This program performs linear (predictive) discriminant analysis. It is modeled on the SAS DISCRIM and STEPDISC procedures. It computes the classification functions, predicts the class of new observations from a data set, and implements a variable selection mechanism. Results can also be reported automatically, in HTML or PDF format.
Developed by AbdDia, ASKruchinina and Dataymeric as part of a project in L3 IDS (Bachelor in Computer Science and Statistics for Data Science) at the Université Lumière Lyon 2 (France).
A demo program is available (`demo_program.py`), with the corresponding data in the `data` folder.
After loading your data with the `pandas` library and identifying the name of the categorical variable, create an object of the `LinearDiscriminantAnalysis` class from the `discriminant_analysis.py` module, passing its two parameters (the data set and the name of the categorical variable).
from discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(dataset, varName)
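The preparation step described above can be sketched as follows. This is only an illustration: the toy table stands in for a real file, its column names are invented, and the rule "the only non-numeric column is the target" is an assumption that will not hold for every data set.

```python
import pandas as pd

# Hypothetical toy data; in practice this would come from pd.read_csv or pd.read_excel.
dataset = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 6.5],
    "width": [3.5, 3.0, 3.3, 3.0],
    "species": ["a", "a", "b", "b"],
})

# Assumption: the categorical (target) variable is the only non-numeric column.
categorical_columns = dataset.select_dtypes(exclude="number").columns.tolist()
varName = categorical_columns[0]

print(varName)
```

The resulting `dataset` and `varName` are the two values passed to `LDA(dataset, varName)`.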
From there, you can call the object's methods to run the calculations you need, and read its attributes to retrieve specific parameters. The following attributes are available:
- `dataset`: the data set
- `classLabel`: the name of the target variable
- `classNames`: the names of the values taken by the target variable
- `varNames`: the names of the explanatory variables
- `n`: the sample size
- `p`: the number of explanatory variables
- `K`: the number of classes
- `V`: the total covariance matrix
- `Vb`: the biased total covariance matrix
- `W`: the intra-class covariance matrix
- `Wb`: the biased intra-class covariance matrix
For example, to display the biased total covariance matrix:
lda.Vb
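To make the four matrices concrete, here is a minimal NumPy sketch of how they are commonly defined. The divisors (`n - 1` and `n - K` for the unbiased versions, `n` for the biased ones) are an assumption based on the usual textbook definitions, not a description of the project's code. The function name and the toy data are hypothetical.

```python
import numpy as np

def covariance_matrices(X, y):
    """Return (V, Vb, W, Wb) for data X (n x p) and class labels y (length n)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)

    # Total scatter: deviations from the global mean.
    centered = X - X.mean(axis=0)
    total_scatter = centered.T @ centered

    # Within-class scatter: deviations from each class mean, summed over classes.
    within_scatter = np.zeros((p, p))
    for c in classes:
        dev = X[y == c] - X[y == c].mean(axis=0)
        within_scatter += dev.T @ dev

    V = total_scatter / (n - 1)    # total covariance (assumed divisor n - 1)
    Vb = total_scatter / n         # biased total covariance
    W = within_scatter / (n - K)   # intra-class covariance (assumed divisor n - K)
    Wb = within_scatter / n        # biased intra-class covariance
    return V, Vb, W, Wb

# Toy data: two classes, two explanatory variables.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array(["a", "a", "a", "b", "b", "b"])
V, Vb, W, Wb = covariance_matrices(X, y)
```

Under these definitions the biased and unbiased versions differ only by a scalar factor: `Vb = V * (n - 1) / n` and `Wb = W * (n - K) / n`.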
`fit()` computes the discriminant (classification) functions.
lda.fit()
- `intercept_`: the intercepts computed by the model
- `coef_`: the coefficients of the model
- `infoFuncClassification`: the values of the classification functions
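For readers who want to see where an intercept and a set of coefficients per class come from, the sketch below applies the standard linear discriminant formulas: for class k, the coefficients are W⁻¹μ_k and the intercept is ln(π_k) − ½ μ_kᵀW⁻¹μ_k. It assumes priors estimated by class frequencies and a pooled covariance divided by `n - K`. Whether the project makes the same choices is not confirmed by this README. The function name is hypothetical.

```python
import numpy as np

def fit_classification_functions(X, y):
    """Return (classes, intercept, coef) of the linear classification functions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)

    means = np.array([X[y == c].mean(axis=0) for c in classes])  # shape (K, p)
    priors = np.array([np.mean(y == c) for c in classes])        # shape (K,)

    # Pooled intra-class covariance matrix (assumed divisor n - K).
    W = np.zeros((p, p))
    for c, mu in zip(classes, means):
        dev = X[y == c] - mu
        W += dev.T @ dev
    W /= (n - K)

    # Row k of coef is W^{-1} mu_k; solving avoids forming the inverse explicitly.
    coef = np.linalg.solve(W, means.T).T
    intercept = np.log(priors) - 0.5 * np.sum(means * coef, axis=1)
    return classes, intercept, coef

# Toy data: two well-separated classes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array(["a", "a", "a", "b", "b", "b"])
classes, intercept, coef = fit_classification_functions(X, y)
```

An observation is then assigned to the class whose function `intercept[k] + coef[k] @ x` is largest.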
`predict()` predicts the class of each observation in a data set and returns the vector of predicted classes.
y_pred = lda.predict(values_to_predict)
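The prediction rule itself is short: evaluate every class's linear function on each observation and keep the class with the highest score. The sketch below illustrates it with hand-picked intercepts and coefficients. The helper name and all numbers are hypothetical and only serve the example.

```python
import numpy as np

def predict_classes(X_new, class_names, intercept, coef):
    """Assign each row of X_new to the class with the largest linear score."""
    X_new = np.asarray(X_new, dtype=float)
    scores = X_new @ coef.T + intercept   # shape (m, K): one score per class
    return class_names[np.argmax(scores, axis=1)]

# Hypothetical fitted parameters for two classes and two variables.
class_names = np.array(["a", "b"])
intercept = np.array([-2.0, -25.0])
coef = np.array([[1.75, 0.25],
                 [7.25, -0.25]])

values_to_predict = np.array([[1.5, 2.0],
                              [7.5, 6.0]])
y_pred = predict_classes(values_to_predict, class_names, intercept, coef)
print(y_pred)
```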
Predictions are only useful if their quality can be measured. The confusion matrix and the accuracy rate serve this purpose.
The function `confusion_matrix()` "provides an unbiased assessment of the model's performance in deployment". It takes the vector of true values of the target variable, from observations that were not used during training, and the vector of predicted values, so that the quality of the predictions can be estimated. It returns the numerical matrix together with a graph of the confusion matrix.
The function `accuracy_score()` computes the accuracy rate. It takes the true values of the target variable and the predictions as input, and returns the proportion of correct predictions.
lda.confusion_matrix(y_true, y_pred)
lda.accuracy_score(y_true, y_pred)
- `confusionMatrix`: the confusion matrix
- `confusionMatrixGraph`: the graph of the confusion matrix
- `accuracy`: the accuracy score
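As an illustration of what these two quantities measure, here is a small stand-alone sketch. The function names are hypothetical, and the layout (rows for true classes, columns for predicted classes) is an assumed convention. The project's own methods may present the matrix differently.

```python
import numpy as np

def confusion_counts(y_true, y_pred, class_names):
    """Count observations per (true class, predicted class) pair."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = np.zeros((len(class_names), len(class_names)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1   # row = true class, column = predicted class
    return matrix

def accuracy(y_true, y_pred):
    """Proportion of predictions equal to the true class."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy labels for five held-out observations.
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]

matrix = confusion_counts(y_true, y_pred, ["a", "b"])
acc = accuracy(y_true, y_pred)
```

The accuracy is also the sum of the diagonal of the matrix divided by the number of observations, which is a quick consistency check between the two outputs.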
The function `stepdisc()` performs a stepwise discriminant analysis, using either a `forward` or a `backward` approach. The stopping threshold (risk level) must be set.
lda.stepdisc(slentry, method)
lda.stepdisc(0.01, 'forward')
- `infoStepResults`: the values computed at the last step
- `stepdiscSummary`: the summary of the variable selection procedure
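To give an idea of what one forward step involves, the sketch below ranks the candidate variables by the partial F statistic derived from Wilks' lambda, F = ((n − K − q) / (K − 1)) · (Λ_q / Λ_{q+1} − 1), where q is the number of variables already selected. This is the usual textbook formulation and is only assumed to match the project's implementation. The function names are hypothetical, and the conversion of F into a p-value to compare with the threshold is left out.

```python
import numpy as np

def wilks_lambda(X, y, columns):
    """Wilks' lambda det(W) / det(T) on the given columns (1.0 if none)."""
    if len(columns) == 0:
        return 1.0
    Xs = X[:, columns]
    centered = Xs - Xs.mean(axis=0)
    total = centered.T @ centered
    within = np.zeros_like(total)
    for c in np.unique(y):
        dev = Xs[y == c] - Xs[y == c].mean(axis=0)
        within += dev.T @ dev
    return float(np.linalg.det(within) / np.linalg.det(total))

def forward_step(X, y, selected):
    """Return (index, F) of the best candidate among the unselected variables."""
    n, p = X.shape
    K = len(np.unique(y))
    q = len(selected)
    lambda_q = wilks_lambda(X, y, selected)
    best = None
    for j in range(p):
        if j in selected:
            continue
        lambda_q1 = wilks_lambda(X, y, selected + [j])
        f_stat = (n - K - q) / (K - 1) * (lambda_q / lambda_q1 - 1.0)
        if best is None or f_stat > best[1]:
            best = (j, f_stat)
    return best

# Toy data: variable 0 separates the classes, variable 1 does not.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0],
              [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]])
y = np.array(["a", "a", "a", "b", "b", "b"])
best = forward_step(X, y, [])
```

In a full forward procedure, the best candidate would be added only if the p-value of its F statistic is below the chosen threshold, and the step would be repeated until no candidate qualifies.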
The function `wilks_decay()` computes the value of Wilks' lambda as a function of the number of selected explanatory variables and displays the curve of the Wilks' lambda decay. For example, with 40 explanatory variables, the lambda values are computed for 1, 2, ..., 40 selected variables.
lda.wilks_decay()
lda.figWilksDecay
- `infoWilksDecay`: the list of Wilks' lambda values according to the number of selected variables
- `figWilksDecay`: the Wilks' lambda decay curve
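The decay can be illustrated with a short NumPy sketch that computes Wilks' lambda, taken here as det(W) / det(T) on scatter matrices, for the first 1, 2, ..., p variables. Adding the variables in column order is an assumption made for simplicity, since this README does not state the order the project uses. The function names are hypothetical and the plot is omitted.

```python
import numpy as np

def wilks_lambda(X, y, columns):
    """Wilks' lambda det(W) / det(T) restricted to the given columns."""
    Xs = X[:, columns]
    centered = Xs - Xs.mean(axis=0)
    total = centered.T @ centered
    within = np.zeros_like(total)
    for c in np.unique(y):
        dev = Xs[y == c] - Xs[y == c].mean(axis=0)
        within += dev.T @ dev
    return float(np.linalg.det(within) / np.linalg.det(total))

def wilks_decay_values(X, y):
    """Lambda for the first q columns, q = 1..p (assumed order: column order)."""
    p = X.shape[1]
    return [wilks_lambda(X, y, list(range(q))) for q in range(1, p + 1)]

# Toy data: two classes, two explanatory variables.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0],
              [7.0, 4.0], [8.0, 5.0], [9.0, 3.0]])
y = np.array(["a", "a", "a", "b", "b", "b"])
values = wilks_decay_values(X, y)
print(values)
```

Because each added variable can only leave the lambda unchanged or lower it, the resulting sequence never increases, which is why the curve is described as a decay.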
The reporting feature provides output similar to SAS PROC DISCRIM, in HTML or PDF format. It requires importing the project's `reporting` module.
from reporting import HTML, PDF
HTML().discrim_html_output(lda, "outputFileName.html")
PDF().discrim_pdf_output(lda, "outputFileName.pdf")
The reporting feature also provides output similar to SAS PROC STEPDISC, in HTML format only. It requires importing the project's `reporting` module.
from reporting import HTML
HTML().stepdisc_html_output(lda, "outputFileName.html")
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics) (1st ed. 2013, corr. 7th printing 2017). Springer.
- Li, J. Linear Discriminant Analysis [Slides]. Department of Statistics, The Pennsylvania State University.
- Rakotomalala, R. Cours Analyse Discriminante.
- Rakotomalala, R. (2020). Analyse Discriminante Linéaire sous R.
- Rakotomalala, R. (2020). Pratique de l’Analyse Discriminante Linéaire.