This Jupyter Notebook is a practice/learning project that applies Linear Regression and Decision Tree Regression to a clean, small dataset of car specifications. The goal is to predict CO₂ emissions based on engine volume and vehicle weight.
- Source file: `DATA1.csv`
- Size: 36 entries, 5 columns
- Columns:

  | Column | Type | Description |
  |--------|------|-------------|
  | Car | object | Car brand (e.g., Mercedes, Ford, BMW) |
  | Model | object | Car model name |
  | Volume | int64 | Engine displacement (cc) |
  | Weight | int64 | Vehicle weight (kg) |
  | CO2 | int64 | CO₂ emissions (target variable) |

- Missing values: None
- Reads the CSV file using `pandas`
- Displays shape, column types, and missing value counts
- Shows value counts for each column
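
The loading and inspection steps above might look roughly like the sketch below. To stay runnable anywhere, it reads from an in-memory CSV string rather than `DATA1.csv`; the three rows and the brand/model names are made-up placeholders, not values from the dataset.

```python
import io

import pandas as pd

# Placeholder rows for illustration only; NOT taken from DATA1.csv.
SAMPLE_CSV = """Car,Model,Volume,Weight,CO2
BrandA,ModelX,1000,900,95
BrandB,ModelY,1500,1200,105
BrandA,ModelZ,2000,1500,115
"""

# In the notebook this would be pd.read_csv() pointed at the real file.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # type of each column
print(df.isnull().sum())  # missing values per column

# Value counts for each column
for column in df.columns:
    print(df[column].value_counts())
```
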
- Bar charts for car brand and model distributions
- Histogram of CO₂ emissions
- Boxplot for Weight and Volume
- Descriptive statistics table (max, min, quartiles, IQR, mean) for Weight, Volume, and CO₂
- Scatter plots: Volume vs. Weight, Volume vs. Car brand, Weight vs. CO₂
- Correlation matrix (heatmap) for Volume, Weight, and CO₂
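
As a rough illustration of the descriptive statistics table and the correlation matrix listed above, the following sketch uses only `pandas`. The numbers are invented placeholders, not rows from `DATA1.csv`, so the printed values will not match the notebook.

```python
import pandas as pd

# Made-up placeholder values for illustration; NOT the contents of DATA1.csv.
df = pd.DataFrame({
    "Volume": [1000, 1200, 1500, 1600, 2000, 2500],
    "Weight": [900, 1100, 1200, 1300, 1500, 1700],
    "CO2": [95, 99, 104, 105, 110, 120],
})

numeric = df[["Volume", "Weight", "CO2"]]

# Max, min, quartiles, and mean per column, plus IQR = Q3 - Q1.
stats = numeric.describe().loc[["max", "min", "25%", "50%", "75%", "mean"]]
stats.loc["IQR"] = stats.loc["75%"] - stats.loc["25%"]
print(stats)

# Pearson correlation matrix for the three numeric columns.
corr = numeric.corr()
print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` is one way to draw the matrix as a heatmap.
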
- Features: `Volume`, `Weight`
- Target: `CO2`
- Uses `sklearn.linear_model.LinearRegression`
- Prints the regression equation coefficients and intercept
- Makes a sample prediction for a car with Volume=1300, Weight=2300
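
A minimal sketch of the linear regression step described above, fit on the full data with no split. The training rows are made-up placeholders standing in for `DATA1.csv`, so the coefficients it prints will differ from the notebook's.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up placeholder rows for illustration; NOT the contents of DATA1.csv.
df = pd.DataFrame({
    "Volume": [1000, 1200, 1500, 1600, 2000, 2500],
    "Weight": [900, 1100, 1200, 1300, 1500, 1700],
    "CO2": [95, 99, 104, 105, 110, 120],
})

X = df[["Volume", "Weight"]]  # features
y = df["CO2"]                 # target

# Fit on the full dataset (no train/test split).
model = LinearRegression()
model.fit(X, y)

print("coefficients:", dict(zip(X.columns, model.coef_)))
print("intercept:", model.intercept_)

# Sample prediction for a car with Volume=1300, Weight=2300.
sample = pd.DataFrame({"Volume": [1300], "Weight": [2300]})
prediction = model.predict(sample)
print("predicted CO2:", prediction[0])
```
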
- Features: `Volume`, `Weight`
- Target: `CO2`
- Uses `sklearn.tree.DecisionTreeRegressor` with `max_leaf_nodes=20`
- Splits data into train/validation sets using `train_test_split`
- Evaluates with Mean Absolute Error (MAE)
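
The decision tree steps above could be sketched as follows. The data rows are made-up placeholders, and `random_state=0` and the default split ratio are assumptions, since the notebook's split settings are not stated here.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up placeholder rows for illustration; NOT the contents of DATA1.csv.
df = pd.DataFrame({
    "Volume": [1000, 1200, 1300, 1500, 1600, 2000, 2200, 2500],
    "Weight": [900, 1100, 1150, 1200, 1300, 1500, 1600, 1700],
    "CO2": [95, 99, 100, 104, 105, 110, 114, 120],
})

X = df[["Volume", "Weight"]]
y = df["CO2"]

# random_state=0 is an assumption made here for reproducibility.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Cap the tree at 20 leaf nodes, as in the notebook.
tree = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0)
tree.fit(train_X, train_y)

# Mean Absolute Error on the held-out validation rows.
tree_mae = mean_absolute_error(val_y, tree.predict(val_X))
print("Decision tree validation MAE:", tree_mae)
```
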
- Uses `sklearn.ensemble.RandomForestRegressor`
- Same train/val split as the decision tree
- Also evaluated with MAE for comparison
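
For comparison, a random forest can be trained and scored on the same kind of split. This is a sketch under the same assumptions as before: placeholder data rather than `DATA1.csv`, default hyperparameters, and an assumed `random_state=0`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up placeholder rows for illustration; NOT the contents of DATA1.csv.
df = pd.DataFrame({
    "Volume": [1000, 1200, 1300, 1500, 1600, 2000, 2200, 2500],
    "Weight": [900, 1100, 1150, 1200, 1300, 1500, 1600, 1700],
    "CO2": [95, 99, 100, 104, 105, 110, 114, 120],
})

X = df[["Volume", "Weight"]]
y = df["CO2"]

# Using the same random_state for every model keeps the split identical.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(random_state=0)
forest.fit(train_X, train_y)

# Mean Absolute Error on the validation rows, comparable to the tree's MAE.
forest_mae = mean_absolute_error(val_y, forest.predict(val_X))
print("Random forest validation MAE:", forest_mae)
```
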
- Correlation:
  - Volume ↔ Weight: ~0.75 (strong positive)
  - Volume ↔ CO₂: ~0.59 (moderate positive)
  - Weight ↔ CO₂: ~0.55 (moderate positive)
- Linear Regression Equation (approximate): `CO2 ≈ 0.0078 × Volume + 0.0076 × Weight + 79.69`
- MAE (Decision Tree & Random Forest): ~5.38 on validation set
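
As a worked check, the approximate equation can be evaluated by hand for the sample car used in the notebook (Volume=1300, Weight=2300). The helper name `predict_co2` is hypothetical, and because the coefficients are rounded, the notebook's own prediction may differ slightly.

```python
# Rounded coefficients from the approximate regression equation.
COEF_VOLUME = 0.0078
COEF_WEIGHT = 0.0076
INTERCEPT = 79.69


def predict_co2(volume, weight):
    """Evaluate the approximate linear equation for one car."""
    return COEF_VOLUME * volume + COEF_WEIGHT * weight + INTERCEPT


estimate = predict_co2(volume=1300, weight=2300)
print(round(estimate, 2))  # 10.14 + 17.48 + 79.69 = 107.31
```
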

Dependencies: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`

- This notebook is for learning/practice purposes on a clean, small dataset.
- The dataset path is hardcoded (`D:/chenghan/...`) and will need to be updated to run locally.
- No formal train/test split is used for the Linear Regression model (fit on full dataset).