You must submit your .ipynb file (available via File -> Download as -> Notebook (.ipynb)), with your last name in both the file name and the notebook itself, at this address:
not applicable here
We will use a dataset on ozone levels (more precisely, on whether a threshold is exceeded) as a function of other information such as temperature, humidity, and so on. These data are available on this page. In addition to the 72 explanatory variables (the date is not to be taken into account), the last column indicates whether the day showed a high ozone level (1) or not (0). This is the variable (called the target) that we will try to model and predict.
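The layout described above (a date column to ignore, then the explanatory variables, then the target in the last column) can be separated into features and target as in the following minimal sketch. The helper name `split_features_target` and the toy values are assumptions made for illustration; they are not part of the dataset or of the assignment.

```python
import pandas as pd

def split_features_target(df: pd.DataFrame):
    """Split a frame laid out as [date, features..., target] into (X, y)."""
    X = df.iloc[:, 1:-1]  # drop the first column (date) and the last (target)
    y = df.iloc[:, -1]    # last column: 1 = high-ozone day, 0 = normal day
    return X, y

# Tiny made-up frame with the same layout (values are illustrative only)
toy = pd.DataFrame([
    ["1/1/1998", 0.8, 1.8, 0.0],
    ["1/2/1998", 2.8, 3.2, 1.0],
    ["1/3/1998", 2.9, None, 0.0],
])
X, y = split_features_target(toy)
print(X.shape)
print(y.value_counts().to_dict())  # class counts, useful to spot imbalance
```

Applied to the full data, the same call should yield 72 feature columns, provided the file really has the 74-column layout described here.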
Here is how to import the data (without the variable names) into Python. Note the na_values parameter, which specifies that missing values are marked by a "?" in the data file.
import pandas

# The data file has no header row, and "?" marks a missing value
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/ozone/onehr.data"
data = pandas.read_csv(url, header = None, na_values = "?")
data
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/1/1998 | 0.8 | 1.8 | 2.4 | 2.1 | 2.0 | 2.1 | 1.5 | 1.7 | 1.9 | ... | 0.15 | 10.67 | -1.56 | 5795.0 | -12.10 | 17.90 | 10330.0 | -55.0 | 0.00 | 0.0 |
1 | 1/2/1998 | 2.8 | 3.2 | 3.3 | 2.7 | 3.3 | 3.2 | 2.9 | 2.8 | 3.1 | ... | 0.48 | 8.39 | 3.84 | 5805.0 | 14.05 | 29.00 | 10275.0 | -55.0 | 0.00 | 0.0 |
2 | 1/3/1998 | 2.9 | 2.8 | 2.6 | 2.1 | 2.2 | 2.5 | 2.5 | 2.7 | 2.2 | ... | 0.60 | 6.94 | 9.80 | 5790.0 | 17.90 | 41.30 | 10235.0 | -40.0 | 0.00 | 0.0 |
3 | 1/4/1998 | 4.7 | 3.8 | 3.7 | 3.8 | 2.9 | 3.1 | 2.8 | 2.5 | 2.4 | ... | 0.49 | 8.73 | 10.54 | 5775.0 | 31.15 | 51.70 | 10195.0 | -40.0 | 2.08 | 0.0 |
4 | 1/5/1998 | 2.6 | 2.1 | 1.6 | 1.4 | 0.9 | 1.5 | 1.2 | 1.4 | 1.3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.58 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2531 | 12/27/2004 | 0.3 | 0.4 | 0.5 | 0.5 | 0.2 | 0.3 | 0.4 | 0.4 | 1.3 | ... | 0.07 | 7.93 | -4.41 | 5800.0 | -25.60 | 21.80 | 10295.0 | 65.0 | 0.00 | 0.0 |
2532 | 12/28/2004 | 1.0 | 1.4 | 1.1 | 1.7 | 1.5 | 1.7 | 1.8 | 1.5 | 2.1 | ... | 0.04 | 5.95 | -1.14 | 5845.0 | -19.40 | 19.10 | 10310.0 | 15.0 | 0.00 | 0.0 |
2533 | 12/29/2004 | 0.8 | 0.8 | 1.2 | 0.9 | 0.4 | 0.6 | 0.8 | 1.1 | 1.5 | ... | 0.06 | 7.80 | -0.64 | 5845.0 | -9.60 | 35.20 | 10275.0 | -35.0 | 0.00 | 0.0 |
2534 | 12/30/2004 | 1.3 | 0.9 | 1.5 | 1.2 | 1.6 | 1.8 | 1.1 | 1.0 | 1.9 | ... | 0.25 | 7.72 | -0.89 | 5845.0 | -19.60 | 34.20 | 10245.0 | -30.0 | 0.05 | 0.0 |
2535 | 12/31/2004 | 1.5 | 1.3 | 1.8 | 1.4 | 1.2 | 1.7 | 1.6 | 1.4 | 1.6 | ... | 0.54 | 13.07 | 9.15 | 5820.0 | 1.95 | 39.35 | 10220.0 | -25.0 | 0.00 | 0.0 |
2536 rows × 74 columns
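To see what na_values changes, the sketch below reads the same small text twice, with and without the parameter. The three-row sample is made up for illustration (it only imitates the style of the data file), and the variable names `raw` and `clean` are assumptions.

```python
import io
import pandas as pd

# Made-up three-row sample in the same style as the data file
sample = "1/1/1998,0.8,?,0\n1/2/1998,?,3.2,0\n1/3/1998,2.9,2.8,1\n"

# Without na_values, a column containing "?" cannot be parsed as numbers
raw = pd.read_csv(io.StringIO(sample), header=None)

# With na_values="?", each "?" becomes NaN and the column stays numeric
clean = pd.read_csv(io.StringIO(sample), header=None, na_values="?")

print(raw.dtypes.tolist())
print(clean.dtypes.tolist())
print(clean.isna().sum().tolist())  # number of missing values per column
```

The same `isna().sum()` call on the full table gives the number of missing values in each variable, which is worth knowing before fitting any model.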
Information about the dataset and its variables is contained in a text file (with the .names extension), whose contents are shown below.
import requests

# Download the description file and print its raw text
response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/ozone/onehr.names")
print(response.text)
1. Title: Ozone Level Detection

2. Source:
   Kun Zhang
   zhang.kun05 '@' gmail.com
   Department of Computer Science,
   Xavier University of Lousiana

   Wei Fan
   wei.fan '@' gmail.com
   IBM T.J.Watson Research

   XiaoJing Yuan
   xyuan '@' uh.edu
   Engineering Technology Department,
   College of Technology,
   University of Houston

3. Past Usage:
   Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond, Knowledge and Information Systems, Vol. 14, No. 3, 2008.
   Discusses details about the dataset, its use as well as various experiments (both cross-validation and streaming) using many state-of-the-art methods.
   A shorter version of the paper (does not contain some detailed experiments as the journal paper above) is in:
   Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions. ICDM 2006: 753-764

4. Relevant Information:
   The following are specifications for several most important attributes that are highly valued by Texas Commission on Environmental Quality (TCEQ). More details can be found in the two relevant papers.
   -- O 3 - Local ozone peak prediction
   -- Upwind - Upwind ozone background level
   -- EmFactor - Precursor emissions related factor
   -- Tmax - Maximum temperature in degrees F
   -- Tb - Base temperature where net ozone production begins (50 F)
   -- SRd - Solar radiation total for the day
   -- WSa - Wind speed near sunrise (using 09-12 UTC forecast mode)
   -- WSp - Wind speed mid-day (using 15-21 UTC forecast mode)

5. Number of Instances: 2536

6. Number of Attributes: 73

7. Attribute Information:
   1,0 | two classes 1: ozone day, 0: normal day
   Date: ignore.
   WSR0: continuous. WSR1: continuous. WSR2: continuous. WSR3: continuous. WSR4: continuous. WSR5: continuous. WSR6: continuous. WSR7: continuous. WSR8: continuous. WSR9: continuous. WSR10: continuous. WSR11: continuous. WSR12: continuous. WSR13: continuous. WSR14: continuous. WSR15: continuous. WSR16: continuous. WSR17: continuous. WSR18: continuous. WSR19: continuous. WSR20: continuous. WSR21: continuous. WSR22: continuous. WSR23: continuous.
   WSR_PK: continuous. WSR_AV: continuous.
   T0: continuous. T1: continuous. T2: continuous. T3: continuous. T4: continuous. T5: continuous. T6: continuous. T7: continuous. T8: continuous. T9: continuous. T10: continuous. T11: continuous. T12: continuous. T13: continuous. T14: continuous. T15: continuous. T16: continuous. T17: continuous. T18: continuous. T19: continuous. T20: continuous. T21: continuous. T22: continuous. T23: continuous.
   T_PK: continuous. T_AV: continuous.
   T85: continuous. RH85: continuous. U85: continuous. V85: continuous. HT85: continuous.
   T70: continuous. RH70: continuous. U70: continuous. V70: continuous. HT70: continuous.
   T50: continuous. RH50: continuous. U50: continuous. V50: continuous. HT50: continuous.
   KI: continuous. TT: continuous. SLP: continuous. SLP_: continuous. Precp: continuous.
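Since the data were imported without variable names, the attribute list printed above can be used to recover them. The following is a minimal sketch that assumes every attribute is declared as `<name>: continuous.` or `<name>: ignore.`, as in the output above. The function name `parse_attribute_names` and the label `"ozone_day"` for the target column are made up for illustration; the file itself does not name the target.

```python
import re

def parse_attribute_names(names_text: str) -> list[str]:
    """Return the names declared as '<name>: continuous.' or '<name>: ignore.'."""
    return re.findall(r"(\w+):\s*(?:continuous|ignore)\.", names_text)

# Short excerpt in the format of the description file
excerpt = (
    "1,0 | two classes 1: ozone day, 0: normal day "
    "Date: ignore. WSR0: continuous. WSR_PK: continuous. "
    "SLP_: continuous. Precp: continuous."
)

# "ozone_day" is a hypothetical label for the last (target) column
columns = parse_attribute_names(excerpt) + ["ozone_day"]
print(columns)
```

On the full file, one would pass `response.text` to the function and, after checking that the resulting list has as many entries as `data` has columns (74), assign it to `data.columns`.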