Still in the sklearn module, and more specifically its linear_model sub-module:
- LinearRegression to fit a linear regression model
- LogisticRegression to fit a logistic regression model

import pandas
import numpy
import matplotlib.pyplot as plt
import seaborn
seaborn.set_style("white")
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
url_wine = "https://fxjollois.github.io/cours-2022-2023/m1-dci-ecd/wine.csv"
wine = pandas.read_csv(url_wine)
wine
| | class | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline | Alcohol_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 | > 13 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 | > 13 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 | > 13 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 | > 13 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 | > 13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 3 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740 | > 13 |
| 174 | 3 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750 | > 13 |
| 175 | 3 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835 | > 13 |
| 176 | 3 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840 | > 13 |
| 177 | 3 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560 | > 13 |
178 rows × 15 columns
adult
url_adult = "https://fxjollois.github.io/cours-2022-2023/m1-dci-ecd/adult.csv"
adult = pandas.read_csv(url_adult)
adult
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 15 columns
For this, we are going to use the wine data.
The LinearRegression() function (as well as LogisticRegression()) requires the variable to predict and the explanatory variables to be separated. To do so, we create two objects here:
- y: the variable to predict
- X: the explanatory variable(s) (the double brackets [[ ]] are required so that X keeps a 2-D array/DataFrame format)
X = wine[["Malic acid"]]
y = wine["Alcohol"]
m1 = LinearRegression().fit(X, y)
m1.coef_
array([0.06859796])
m1.intercept_
12.840349252602367
m1.score(X, y) # r2
0.00891078245324417
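As a small aside, the coef_ and intercept_ attributes shown above are enough to write out the fitted line explicitly; a minimal sketch:
# fitted line: Alcohol ≈ intercept + coef × Malic acid
print("Alcohol ≈ %.3f + %.3f × Malic acid" % (m1.intercept_, m1.coef_[0]))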
X = wine[["Malic acid", "Color intensity", "Magnesium"]]
y = wine["Alcohol"]
m2 = LinearRegression().fit(X, y)
m2.coef_
array([-0.01888221, 0.1820084 , 0.00940464])
m2.intercept_
11.186085337626954
m2.score(X, y) # r2
0.32632564681673604
The predictions on the training data can be obtained with the predict() function. Of course, if we want to predict on other data, we can use the same function.
m2.predict(X)
array([13.37471294, 12.89013542, 13.12519929, 13.63165444, 13.03320378,
13.43472861, 13.00916478, 13.20259199, 13.01381191, 13.39634935,
13.17933487, 12.96162211, 13.00967878, 12.99208637, 13.4751115 ,
13.53388912, 13.40683993, 13.43922888, 13.75523642, 13.14673113,
13.36863897, 12.89264363, 12.79246461, 12.76460726, 12.69542319,
12.96514178, 12.90093532, 12.75657693, 12.97554302, 12.91264779,
13.14507816, 13.40749027, 12.82852532, 13.38145291, 12.95104263,
13.020615 , 13.02686714, 12.85011975, 12.75284746, 13.2427816 ,
13.36985069, 12.73899086, 13.08857183, 12.8731683 , 13.07628224,
13.10743985, 12.96941226, 13.21448273, 13.24507289, 13.78899459,
13.32910613, 13.05821254, 13.48011531, 13.37839523, 13.32904827,
13.38190942, 13.42494625, 13.20021068, 13.41244161, 12.35086044,
12.71035064, 13.14741746, 12.73815031, 12.79288918, 12.67371309,
12.92213086, 12.86522045, 12.74935409, 12.77981268, 13.10243953,
12.67948814, 12.58156031, 12.65365564, 13.07152463, 12.69961896,
12.75446842, 12.81512871, 12.66715727, 13.06525105, 12.53064103,
12.43253343, 12.67054002, 12.29872855, 12.7862344 , 12.6084416 ,
12.57550177, 12.44802283, 12.45538188, 12.44680095, 12.13599115,
12.35012659, 12.62160216, 12.46469208, 12.35157473, 12.66169139,
13.1541573 , 12.86129732, 12.48667987, 12.81252716, 12.37245604,
12.65968778, 12.43431175, 12.5711018 , 12.33734396, 12.48791789,
12.47577565, 12.52461858, 12.54365864, 12.51837651, 12.52695243,
12.64958631, 12.33182636, 12.79579047, 12.56030637, 12.47765287,
12.30278325, 12.32204359, 12.546323 , 12.49251881, 12.17249346,
12.63514041, 13.35857892, 12.44028752, 12.35858907, 12.3855067 ,
12.44703703, 12.68310737, 12.55711797, 12.36877314, 12.33048457,
13.05419442, 13.09055506, 13.10156966, 13.02598693, 12.95401126,
13.31593054, 12.63470627, 12.89493143, 12.98335407, 12.97554358,
12.87311005, 12.99400654, 12.83021499, 12.75868189, 13.66043058,
12.80598234, 12.73513105, 13.30020132, 13.52264423, 13.88604757,
13.84921534, 14.15467981, 13.53340658, 13.96015676, 13.51366866,
13.40058419, 13.58519893, 13.40618495, 14.44212677, 14.13016336,
13.33373872, 13.2010348 , 13.13683882, 13.07883751, 13.72402816,
13.13626249, 14.10398549, 13.81746645, 13.70104878, 13.69961782,
13.03276352, 13.75163854, 13.75999436, 13.37430596, 13.40019011,
14.0903115 , 13.95841488, 13.68599063])
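If, for example, we want a prediction for a new wine, we can pass a DataFrame with the same three explanatory columns to predict(). The values used below are purely illustrative, not taken from the dataset:
new_wine = pandas.DataFrame({"Malic acid": [2.0], "Color intensity": [5.0], "Magnesium": [100]})  # illustrative values
m2.predict(new_wine)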
We can then plot the observed values (on the x-axis here) against the predicted values (on the y-axis).
plt.figure(figsize = (16,8))
plt.title("Prédit vs Observé")
g = seaborn.scatterplot(x = y, y = m2.predict(X))
g.set(xlabel = "Valeurs observées", ylabel = "Valeurs prédites")
plt.show()
Here we are going to use the adult data, of which we will only keep some of the variables.
For logistic regression, we need to create a complete disjunctive encoding (also called one-hot encoding) of the categorical variables. For example, from a variable with 2 levels, we create 2 binary variables. To do so, we need the OneHotEncoder() function from the preprocessing sub-module.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
Here is a small example of how it works (transform() returns a sparse matrix by default, hence the .toarray() call below).
enc = encoder.fit(adult[["sex"]])
enc.categories_
[array(['Female', 'Male'], dtype=object)]
enc.transform(adult[["sex"]]).toarray()
array([[0., 1.],
[0., 1.],
[0., 1.],
...,
[1., 0.],
[0., 1.],
[1., 0.]])
To reformat our dataset, we apply this to all the categorical explanatory variables and, each time, concatenate the result with the quantitative variables.
X = adult[['age', 'capital_gain', 'capital_loss', 'hours_per_week']]
var_qual = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
for v in var_qual:
enc = encoder.fit(adult[[v]])
enc_df = pandas.DataFrame(enc.transform(adult[[v]]).toarray(), columns = [v+":"+c for c in enc.categories_[0]])
X = pandas.concat([X, enc_df], axis = 1)
X
| | age | capital_gain | capital_loss | hours_per_week | workclass:? | workclass:Federal-gov | workclass:Local-gov | workclass:Never-worked | workclass:Private | workclass:Self-emp-inc | ... | native_country:Portugal | native_country:Puerto-Rico | native_country:Scotland | native_country:South | native_country:Taiwan | native_country:Thailand | native_country:Trinadad&Tobago | native_country:United-States | native_country:Vietnam | native_country:Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 2174 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 50 | 0 | 0 | 13 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 38 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 53 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 28 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | 0 | 0 | 38 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 32557 | 40 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 32558 | 58 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 32559 | 22 | 0 | 0 | 20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 32560 | 52 | 15024 | 0 | 40 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
32561 rows × 106 columns
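Note that with recent versions of scikit-learn (1.0 and later), the encoder can also generate these column names itself via get_feature_names_out(), instead of building them by hand as in the loop above; a minimal sketch:
enc = encoder.fit(adult[["sex"]])
enc.get_feature_names_out()  # e.g. array(['sex_Female', 'sex_Male'], dtype=object)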
We can now create our model.
y = adult["class"]
m3 = LogisticRegression(max_iter = 10000).fit(X, y)
m3.score(X, y) # accuracy (proportion of correctly predicted observations)
0.8516630324621479
As before, the predictions on the training data can be obtained with the predict() function, and the same function can be used to predict on other data.
m3.predict(X)
array(['<=50K', '<=50K', '<=50K', ..., '<=50K', '<=50K', '>50K'],
dtype=object)
We also have access to the predicted probabilities for each individual.
m3.predict_proba(X)
array([[0.86743459, 0.13256541],
[0.57431367, 0.42568633],
[0.97675684, 0.02324316],
...,
[0.96464562, 0.03535438],
[0.9961107 , 0.0038893 ],
[0.00499158, 0.99500842]])
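The columns of predict_proba() follow the order of the classes_ attribute, so we can check which column corresponds to which class:
m3.classes_  # array(['<=50K', '>50K'], dtype=object): column 0 = '<=50K', column 1 = '>50K'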
We can then cross-tabulate the observed values (in columns here) with the predicted values (in rows), which gives the confusion matrix.
pandas.crosstab(m3.predict(X), y)
| predicted \ observed | <=50K | >50K |
|---|---|---|
| <=50K | 23002 | 3112 |
| >50K | 1718 | 4729 |
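The same table, and the usual derived indicators, can also be obtained from sklearn.metrics; a minimal sketch:
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(y, m3.predict(X))   # rows = observed, columns = predicted
accuracy_score(y, m3.predict(X))     # same value as m3.score(X, y)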
We can also plot the ROC curve and compute its AUC, using the roc_curve and auc functions from the sklearn.metrics sub-module.
from sklearn.metrics import roc_curve, auc
fpr, tpr, th = roc_curve(numpy.array([(yy == ">50K") * 1 for yy in y]), pandas.DataFrame(m3.predict_proba(X))[1])
roc_auc = auc(fpr, tpr)
plt.figure(figsize = (16,8))
plt.title("Courbe ROC")
g = seaborn.lineplot(x = fpr, y = tpr)
g.set(xlabel = "Taux de faux positifs", ylabel = "Taux de vrais positifs")
plt.text(.8, .2, "AUC = %0.3f" % roc_auc)
plt.show()
In the same way, the precision/recall curve can be drawn with the precision_recall_curve function.
from sklearn.metrics import precision_recall_curve
pr, rc, th = precision_recall_curve(numpy.array([(yy == ">50K") * 1 for yy in y]),
pandas.DataFrame(m3.predict_proba(X))[1])
plt.figure(figsize = (16,8))
plt.title("Courbe Precision/Recall")
g = seaborn.lineplot(x = rc, y = pr)  # recall on the x-axis, precision on the y-axis
g.set(xlabel = "Recall", ylabel = "Precision")
plt.show()
We are now going to work on spam-detection data (fairly old, for the record), available on this page.
url_spam = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
spam = pandas.read_csv(url_spam, header = None)
spam
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.000 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.010 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4596 | 0.31 | 0.00 | 0.62 | 0.0 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.232 | 0.0 | 0.000 | 0.000 | 0.000 | 1.142 | 3 | 88 | 0 |
| 4597 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.353 | 0.000 | 0.000 | 1.555 | 4 | 14 | 0 |
| 4598 | 0.30 | 0.00 | 0.30 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.102 | 0.718 | 0.0 | 0.000 | 0.000 | 0.000 | 1.404 | 6 | 118 | 0 |
| 4599 | 0.96 | 0.00 | 0.00 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.057 | 0.0 | 0.000 | 0.000 | 0.000 | 1.147 | 5 | 78 | 0 |
| 4600 | 0.00 | 0.00 | 0.65 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.125 | 0.000 | 0.000 | 1.250 | 5 | 40 | 0 |
4601 rows × 58 columns
As you can see, these data have no variable names. They are available in a file spambase.names (see the data web page), starting at line 34. We therefore import them in order to add them to our DataFrame.
url_names = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names"
spam_names = pandas.read_table(url_names, sep = ":", header = None, skiprows = 33, names = ["var", "type"])
spam_names.head()
| | var | type |
|---|---|---|
| 0 | word_freq_make | continuous. |
| 1 | word_freq_address | continuous. |
| 2 | word_freq_all | continuous. |
| 3 | word_freq_3d | continuous. |
| 4 | word_freq_our | continuous. |
We add the spam variable, which indicates whether the e-mail is spam (1) or not (0). We finally obtain our correctly formatted DataFrame.
spam.columns = list(spam_names["var"]) + ["spam"]
spam
| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.000 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.010 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4596 | 0.31 | 0.00 | 0.62 | 0.0 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.232 | 0.0 | 0.000 | 0.000 | 0.000 | 1.142 | 3 | 88 | 0 |
| 4597 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.353 | 0.000 | 0.000 | 1.555 | 4 | 14 | 0 |
| 4598 | 0.30 | 0.00 | 0.30 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.102 | 0.718 | 0.0 | 0.000 | 0.000 | 0.000 | 1.404 | 6 | 118 | 0 |
| 4599 | 0.96 | 0.00 | 0.00 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.057 | 0.0 | 0.000 | 0.000 | 0.000 | 1.147 | 5 | 78 | 0 |
| 4600 | 0.00 | 0.00 | 0.65 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.125 | 0.000 | 0.000 | 1.250 | 5 | 40 | 0 |
4601 rows × 58 columns
You must therefore carry out the following steps:
- create a model predicting the spam variable from all the other variables
- for each word or char variable, create a binary variable (1 or 0) indicating whether the word or character is present or not (choose a reasonable threshold, to be tested)
- decide how to handle the last variables (capital)

One may question the relevance of using so many variables. It is often worth looking at the performance of models with a single explanatory variable: this can sometimes lead to models that are admittedly slightly less accurate, but above all far less demanding in computing time and constraints.
Compare the results of the different models in a DataFrame ($R^2$, proportion of correctly predicted observations and AUC, for example); a possible starting sketch is given below.
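As a purely illustrative starting point for the single-variable comparison, the sketch below binarizes two word_freq variables with an arbitrary threshold of 0, fits a logistic regression on each single column, and stores the accuracy and AUC in a DataFrame. The threshold, the list of variables and the chosen metrics are assumptions to adapt.
from sklearn.metrics import roc_auc_score

results = []
for v in ["word_freq_make", "word_freq_address"]:   # to be extended to all word/char variables
    X1 = (spam[[v]] > 0) * 1                        # presence (1) / absence (0); threshold 0 is arbitrary, to be tested
    y1 = spam["spam"]
    m = LogisticRegression(max_iter = 10000).fit(X1, y1)
    results.append({"variable": v,
                    "accuracy": m.score(X1, y1),
                    "AUC": roc_auc_score(y1, m.predict_proba(X1)[:, 1])})
pandas.DataFrame(results)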