Statistics with Python

In this lab session we use the pandas module, which provides data analysis tools for Python. The first step is to install the module from a command-line terminal. We also need to install the matplotlib, scipy and numpy modules.

pip3 install pandas
pip3 install matplotlib
pip3 install scipy
pip3 install numpy

Once these modules are installed, we can start a notebook and begin our program.

First, the modules must be imported. The last line displays the plots directly in the document.

import matplotlib.pyplot
import pandas
import scipy.stats
import numpy

%matplotlib inline

Data

We will work with the tips dataset. Information about it is available (here). This is how to read the data in Python with the read_csv() function from pandas.

# Read a text file
tips = pandas.read_csv("tips.csv", header = 0, sep = ",")
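The two options passed to read_csv() can be tried without the actual file. The sketch below is only an illustration: it assumes a three-row extract with the same column layout as tips.csv (the values are made up) and reads it from an in-memory string instead of a file.

```python
import io
import pandas

# Hypothetical three-row extract laid out like tips.csv (made-up values).
csv_text = """total_bill,tip,sex,smoker,day,time,size
10.00,1.50,Female,No,Sun,Dinner,2
20.50,3.00,Male,No,Sun,Dinner,3
15.25,2.25,Male,Yes,Sat,Lunch,2
"""

# header = 0: the first line holds the column names.
# sep = ",": fields are separated by commas.
sample = pandas.read_csv(io.StringIO(csv_text), header=0, sep=",")

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # numeric columns are inferred, text columns stay text
```

With a real file, the path simply replaces the io.StringIO object.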

Some standard information about the data can of course be displayed.

type(tips)

pandas.core.frame.DataFrame

# miscellaneous information
tips.shape

(244, 7)

tips.count()

total_bill    244
tip           244
sex           244
smoker        244
day           244
time          244
size          244
dtype: int64

tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB

list(tips.columns)

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

list(tips)

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

Univariate descriptive statistics

The describe() function summarizes all the quantitative variables of a dataset at once.

# basic summary
tips.describe()

tips.describe().round(2)

Quantitative

Variables can be selected either with square brackets [] or with a dot (.).
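The two notations are not always interchangeable. As a minimal sketch on a made-up two-column table (not the real data), the example below shows that the dot notation fails silently for a column named size, because size is also a DataFrame attribute giving the number of cells. The tips dataset has such a column, so brackets are the safer choice there.

```python
import pandas

# Hypothetical mini-table reusing two column names from tips.
df = pandas.DataFrame({"tip": [1.5, 3.0, 2.25], "size": [2, 3, 2]})

# Both notations return the same column for an ordinary name.
same = df["tip"].equals(df.tip)

# "size" is also a DataFrame attribute (number of cells), so the dot
# notation returns that attribute instead of the column.
n_cells = df.size          # 3 rows x 2 columns
size_column = df["size"]   # the actual column

print(same, n_cells, list(size_column))
```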

The functions below describe a quantitative variable (here "total_bill").

tips.total_bill.describe()

count    244.000000
mean      19.785943
std        8.902412
min        3.070000
25%       13.347500
50%       17.795000
75%       24.127500
max       50.810000
Name: total_bill, dtype: float64

tips["total_bill"].describe()

count    244.000000
mean      19.785943
std        8.902412
min        3.070000
25%       13.347500
50%       17.795000
75%       24.127500
max       50.810000
Name: total_bill, dtype: float64

tips.total_bill.mean()

19.78594262295082

tips.total_bill.std()

8.902411954856856

tips.total_bill.var()

79.25293861397827
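The variance shown above is the square of the standard deviation (8.902411954856856 squared is about 79.2529). Both are sample estimators in pandas, dividing by n - 1. The sketch below checks this on a made-up series and contrasts it with numpy.var(), which divides by n by default.

```python
import numpy
import pandas

x = pandas.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

# pandas uses the sample estimator (division by n - 1, i.e. ddof = 1).
sample_var = x.var()
sample_std = x.std()

# numpy defaults to the population formula (division by n, i.e. ddof = 0).
population_var = numpy.var(x.to_numpy())

print(sample_var, sample_std ** 2, population_var)
```

Passing ddof = 0 to the pandas methods, or ddof = 1 to numpy.var(), makes the two libraries agree.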

tips.total_bill.min()

3.07

tips.total_bill.max()

50.81

tips.total_bill.median()

17.795

tips.total_bill.quantile([.01, .1, .9, .99])

0.01     7.250
0.10    10.340
0.90    32.235
0.99    48.227
Name: total_bill, dtype: float64

scipy.stats.normaltest(tips.total_bill)

NormaltestResult(statistic=45.11781912347332, pvalue=1.5951078766352608e-10)

scipy.stats.shapiro(tips.total_bill)

(0.9197188019752502, 3.3245434183371003e-10)
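Both functions test the null hypothesis that the sample comes from a normal distribution and return a statistic followed by a p-value. The very small p-values above (around 1e-10) therefore lead to rejecting normality for total_bill. The sketch below applies the same two tests to a simulated, clearly skewed sample. The seed, the sample size and the 0.05 threshold are arbitrary assumptions of this example.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)               # fixed seed, arbitrary choice
skewed = rng.exponential(scale=1.0, size=200)   # clearly non-normal sample

# Each result unpacks as (statistic, p-value).
w_stat, shapiro_p = scipy.stats.shapiro(skewed)
k2_stat, dagostino_p = scipy.stats.normaltest(skewed)

alpha = 0.05  # conventional threshold, an assumption of this sketch
reject_normality = (shapiro_p < alpha) and (dagostino_p < alpha)

print(reject_normality)
```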

Histogram

To plot this variable, pandas provides plotting functions (built on the matplotlib module, which pandas uses).

To draw a histogram, we use the hist() function, which accepts options. Calling plot() with the kind parameter set to "hist" gives the same result.

tips.plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dff1ebfd0>

tips.total_bill.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfd11c4e0>

tips.total_bill.hist(bins = 20)

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfd0935c0>

tips.total_bill.plot(kind = "hist")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfd04b748>

# density replaces the normed option, which was removed from matplotlib
tips.total_bill.plot(kind = "hist", density = True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfd0d91d0>

tips.total_bill.plot(kind = "kde")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcfef860>

To get the density and the histogram on the same plot, the following two lines must be run together (in the same cell).

tips.total_bill.plot(kind = "hist", density = True, color = "lightgrey")
tips.total_bill.plot(kind = "kde")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcf3d8d0>
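The histogram has to be normalized before overlaying the density curve, because raw counts and a density are not on the same scale. As a sketch of what the normalization does, assuming matplotlib's density option behaves like the one in numpy.histogram(), the example below checks on made-up values that the bar areas sum to 1.

```python
import numpy

values = numpy.array([3.0, 7.0, 8.0, 12.0, 13.0, 14.0, 18.0, 25.0])  # made-up

# Raw counts per bin, then the same bins rescaled to a density.
counts, edges = numpy.histogram(values, bins=4)
density, _ = numpy.histogram(values, bins=4, density=True)

# Each density bar has height count / (n * bin width), so the areas sum to 1.
widths = numpy.diff(edges)
total_area = float((density * widths).sum())

print(total_area)
```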

Box plots

Finally, box plots are drawn from the DataFrame itself, optionally selecting a specific variable.

tips.boxplot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcf500f0>

tips.boxplot(column = "total_bill")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcf50550>

tips.boxplot(column = "total_bill", grid = False)

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfceed9e8>
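The quantities behind a box plot can also be computed directly. The sketch below uses made-up values and assumes the default matplotlib convention of whiskers limited to 1.5 times the interquartile range. Points beyond those limits are the ones drawn individually.

```python
import pandas

x = pandas.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])  # made-up

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = x.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits beyond which a point is drawn as an individual outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, median, q3, list(outliers))
```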

Qualitative

For qualitative variables, there are several ways to obtain the table of occurrences (or counts), as well as the table of proportions of each category.

tips.sex.describe()

count      244
unique       2
top       Male
freq       157
Name: sex, dtype: object

tips.sex.unique()

array(['Female', 'Male'], dtype=object)

tips.sex.value_counts()

Male      157
Female     87
Name: sex, dtype: int64

pandas.crosstab(tips.sex, "freq")

pandas.crosstab(tips.sex, "freq", normalize=True)
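As an alternative to crosstab(), value_counts() can also return proportions directly. The sketch below does not read the data file: it rebuilds a sex column from the counts reported above (157 Male, 87 Female), so only the totals match the real data.

```python
import pandas

# Rebuild a sex column from the counts reported above (157 Male, 87 Female).
sex = pandas.Series(["Male"] * 157 + ["Female"] * 87, name="sex")

counts = sex.value_counts()                      # table of counts
proportions = sex.value_counts(normalize=True)   # shares summing to 1

print(counts)
print(proportions.round(4))
```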

t = pandas.crosstab(tips.sex, "freq")
scipy.stats.chisquare(t)

Power_divergenceResult(statistic=array([20.08196721]), pvalue=array([7.41929371e-06]))
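When no expected frequencies are given, chisquare() tests whether all categories are equally frequent. The sketch below recomputes the statistic by hand from the counts shown earlier (87 Female, 157 Male) and compares it with the scipy result.

```python
import numpy
import scipy.stats

observed = numpy.array([87, 157])               # Female, Male counts shown above
expected = numpy.full(2, observed.sum() / 2)    # equal frequencies: 122 each

# Chi-square statistic: sum of (observed - expected)^2 / expected.
statistic_by_hand = ((observed - expected) ** 2 / expected).sum()

# Same test with scipy, which uses equal expected frequencies by default.
result = scipy.stats.chisquare(observed)

print(statistic_by_hand, result.statistic, result.pvalue)
```

A different reference distribution can be tested by passing it through the f_exp argument.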

Bar chart

Next, to draw a bar chart, we use the "bar" type of plot(). The proportions computed above let us display proportions rather than counts.

t = pandas.crosstab(tips.sex, "freq")
t.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfce11588>

t.plot(kind = "bar")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfce1bbe0>

t = pandas.crosstab(tips.sex, "freq", normalize=True)
t.plot(kind = "bar")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfce1b048>

(t * 100).plot(kind = "bar")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcd92160>

Pie chart

For a pie chart, we start from a table of counts, here the one produced by crosstab() (the Series returned by value_counts() can be plotted in the same way).

t = pandas.crosstab(tips.sex, "freq")
t.plot.pie(subplots=True, figsize = (6, 6))

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcd41780>],
      dtype=object)

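Bivariate descriptive statistics

Quantitative - quantitative

# correlation matrix of the numeric columns (recent pandas versions
# require numeric_only because text columns are no longer dropped silently)
tips.corr(numeric_only = True)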

tips.total_bill.corr(tips.tip)

0.6757341092113647

tips.total_bill.cov(tips.tip)

8.323501629224854

scipy.stats.pearsonr(tips.total_bill, tips.tip)

(0.6757341092113643, 6.692470646864041e-34)

scipy.stats.kendalltau(tips.total_bill, tips.tip)

KendalltauResult(correlation=0.517180972142381, pvalue=2.4455728480214792e-32)
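The Pearson coefficient is the covariance divided by the product of the two standard deviations. With the values above this gives 8.3235 / (8.9024 × 1.3836) ≈ 0.6757. The sketch below checks the relation on made-up values, not on the real data.

```python
import pandas

x = pandas.Series([10.0, 12.0, 15.0, 20.0, 24.0])  # made-up bills
y = pandas.Series([1.5, 2.0, 2.0, 3.5, 3.0])       # made-up tips

r_builtin = x.corr(y)                        # Pearson by default
r_by_hand = x.cov(y) / (x.std() * y.std())   # covariance over product of stds

print(r_builtin, r_by_hand)
```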

Scatter plot

tips.plot.scatter("total_bill", "tip")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcd4b438>

pandas.plotting.scatter_matrix(tips)

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcd056a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfccac208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcc50208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcc62518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcc17208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcbb2c88>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcbd5c88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcb81198>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfcba5198>]],
      dtype=object)

Qualitative - qualitative

pandas.crosstab(tips.sex, tips.smoker)

pandas.crosstab(tips.sex, tips.smoker, margins=True)

pandas.crosstab(tips.sex, tips.smoker, normalize = True)

pandas.crosstab(tips.sex, tips.smoker, normalize = "index")

pandas.crosstab(tips.sex, tips.smoker, normalize = "index", margins=True)

pandas.crosstab(tips.sex, tips.smoker, normalize = "columns")

pandas.crosstab(tips.sex, tips.smoker, normalize = "columns", margins=True)
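The normalize argument controls which total each cell is divided by. The sketch below illustrates this on a table rebuilt from counts that are not printed in the text. They are derived from the group sizes (87 Female, 157 Male) and the proportions in the document's output tables (for example 0.620690 × 87 = 54), so treat them as a reconstruction.

```python
import pandas

# Reconstructed counts (an assumption derived from the reported proportions).
rows = ([("Female", "No")] * 54 + [("Female", "Yes")] * 33
        + [("Male", "No")] * 97 + [("Male", "Yes")] * 60)
df = pandas.DataFrame(rows, columns=["sex", "smoker"])

counts = pandas.crosstab(df.sex, df.smoker)
by_row = pandas.crosstab(df.sex, df.smoker, normalize="index")    # each row sums to 1
by_col = pandas.crosstab(df.sex, df.smoker, normalize="columns")  # each column sums to 1
overall = pandas.crosstab(df.sex, df.smoker, normalize=True)      # all cells sum to 1

print(by_row.round(4))
```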

t = pandas.crosstab(tips.sex, tips.smoker)
scipy.stats.chi2_contingency(t)

(0.008763290531773594, 0.925417020494423, 1, array([[53.84016393, 33.15983607],
        [97.15983607, 59.84016393]]))
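The result above contains, in order, the test statistic, the p-value, the degrees of freedom and the table of expected counts under independence. For a 2 × 2 table, scipy applies a continuity correction to the statistic by default. The sketch below recomputes only the expected counts and the degrees of freedom. The observed counts are not printed in the text: they are reconstructed from the group sizes and the reported proportions, which is an assumption of this example.

```python
import numpy

# Reconstructed counts (rows: Female, Male; columns: No, Yes).
observed = numpy.array([[54, 33],
                        [97, 60]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Under independence: expected count = row total * column total / n.
expected = numpy.outer(row_totals, col_totals) / n

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(dof)
```

The observed and expected tables are nearly identical, which is consistent with the large p-value: there is no evidence of a link between sex and smoker.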

Bar chart

t = pandas.crosstab(tips.sex, tips.smoker)
t.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcb28c50>

t = pandas.crosstab(tips.sex, tips.smoker, normalize=True)
t.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfcb2f390>

t = pandas.crosstab(tips.sex, tips.smoker, normalize="index")
t.plot.bar(stacked=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfca5b940>

t = pandas.crosstab(tips.sex, tips.smoker)
t.plot.pie(subplots=True, figsize = (12, 6))

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfca698d0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfc9c8d30>],
      dtype=object)

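Qualitative - quantitative

# mean of each numeric column by group (recent pandas versions
# require numeric_only because text columns can no longer be averaged)
tips.groupby("sex").mean(numeric_only = True)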

tips.groupby("sex")["total_bill"].agg([numpy.mean, numpy.std, numpy.median, numpy.min, numpy.max])

billFemale = tips.total_bill[tips.sex == "Female"]
billMale = tips.total_bill[tips.sex == "Male"]
scipy.stats.ttest_ind(billFemale, billMale)

Ttest_indResult(statistic=-2.2777940289803134, pvalue=0.0236116668468594)

billGrouped = [tips.total_bill[tips.sex == s] for s in list(tips.sex.unique())]
scipy.stats.f_oneway(*billGrouped)

F_onewayResult(statistic=5.188345638458361, pvalue=0.023611666846859697)
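With only two groups, the one-way ANOVA and the equal-variance t-test are the same test. In the results above, (-2.2778)² ≈ 5.1883 and both p-values equal 0.0236. The sketch below checks this relation on made-up samples.

```python
import scipy.stats

group_a = [12.0, 15.5, 9.8, 20.1, 17.3, 14.2]          # made-up values
group_b = [18.4, 22.0, 16.9, 25.3, 19.7, 21.1, 23.5]   # made-up values

t_result = scipy.stats.ttest_ind(group_a, group_b)   # equal variances assumed
f_result = scipy.stats.f_oneway(group_a, group_b)

# With two groups, F is the square of t and the p-values coincide.
print(t_result.statistic ** 2, f_result.statistic)
print(t_result.pvalue, f_result.pvalue)
```

The equivalence no longer holds if ttest_ind() is called with equal_var = False (Welch's test).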

tips.hist(column = "total_bill", by = "sex")

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfca5bf60>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfc9aacc0>],
      dtype=object)

tips.boxplot(by = "sex")

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfc996b38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfc92b8d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfc8c6b38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7dfc8ec898>]],
      dtype=object)

tips.boxplot(column = "total_bill", by = "sex")

<matplotlib.axes._subplots.AxesSubplot at 0x7f7dfc814cf8>

Exercises

Using the diamonds.csv file (see the help here), analyze the data following the usual workflow:

Describe each variable
Look for relationships between the price (price) and the other variables

Output of tips.describe().round(2):
	total_bill	tip	size
count	244.00	244.00	244.00
mean	19.79	3.00	2.57
std	8.90	1.38	0.95
min	3.07	1.00	1.00
25%	13.35	2.00	2.00
50%	17.80	2.90	2.00
75%	24.13	3.56	3.00
max	50.81	10.00	6.00

Output of pandas.crosstab(tips.sex, tips.smoker, normalize = "index", margins=True):
smoker	No	Yes
sex
Female	0.620690	0.379310
Male	0.617834	0.382166
All	0.618852	0.381148

Output of pandas.crosstab(tips.sex, tips.smoker, normalize = "columns", margins=True):
smoker	No	Yes	All
sex
Female	0.357616	0.354839	0.356557
Male	0.642384	0.645161	0.643443

Output of tips.describe():
	total_bill	tip	size
count	244.000000	244.000000	244.000000
mean	19.785943	2.998279	2.569672
std	8.902412	1.383638	0.951100
min	3.070000	1.000000	1.000000
25%	13.347500	2.000000	2.000000
50%	17.795000	2.900000	2.000000
75%	24.127500	3.562500	3.000000
max	50.810000	10.000000	6.000000

Output of tips.corr() (correlation matrix of the numeric columns):
	total_bill	tip	size
total_bill	1.000000	0.675734	0.598315
tip	0.675734	1.000000	0.489299
size	0.598315	0.489299	1.000000

Output of tips.groupby("sex").mean() (numeric columns):
	total_bill	tip	size
sex
Female	18.056897	2.833448	2.459770
Male	20.744076	3.089618	2.630573

Output of the agg() call on total_bill grouped by sex (amin and amax are the minimum and maximum):
	mean	std	median	amin	amax
sex
Female	18.056897	8.009209	16.40	3.07	44.30
Male	20.744076	9.246469	18.35	7.25	50.81
