This classification project was built with the scikit-learn library in Python.
You can find the Jupyter notebook and data at my GitHub link:
In the Global AI Hub Machine Learning Bootcamp, participants were asked to create classification or regression projects.
1. Project Topic and Dataset
For this project, I chose a machine learning classification task using the Wine Quality dataset, which was recommended by the project mentors.
2. EDA
About the Dataset:
The dataset consists of 1,599 rows and 12 columns.
The columns include: 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', and 'quality'.
The 'quality' column is the target (dependent) variable, and its data type is integer. The independent variables are of type float.
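A minimal loading sketch that confirms these properties (assuming the semicolon-delimited red wine CSV from the UCI repository; the file name here is an assumption):

```python
import pandas as pd

# Assumption: the red wine file from the UCI Wine Quality dataset,
# which ships as a semicolon-delimited CSV.
df = pd.read_csv("winequality-red.csv", sep=";")

print(df.shape)   # (1599, 12)
print(df.dtypes)  # 'quality' is int64; all other columns are float64
```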
Data Quality of the Columns:
There are no missing values in the dataset. However, there are many outliers.
To examine the relationships between the independent variables affecting quality, I first needed to test the distribution of the data to decide on a correlation method. Since the dataset was small, I applied the Shapiro-Wilk test. Because the p-values were less than 0.05, I rejected the normality hypothesis. To validate this further, I performed the Jarque-Bera test, which also indicated that the independent variables were not normally distributed. As a result, I used the rank-based 'Spearman' method for the correlation analysis. No variable pairs with high correlation (+/- 0.6 or above) were found.
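A sketch of this testing workflow with SciPy and pandas (variable names are assumptions; the 0.05 and 0.6 thresholds follow the text):

```python
from scipy import stats

X = df.drop(columns="quality")

# Shapiro-Wilk and Jarque-Bera tests per independent variable:
# p < 0.05 means we reject the null hypothesis of normality.
for col in X.columns:
    _, p_sw = stats.shapiro(X[col])
    _, p_jb = stats.jarque_bera(X[col])
    print(f"{col}: Shapiro-Wilk p={p_sw:.4f}, Jarque-Bera p={p_jb:.4f}")

# Normality rejected, so use the rank-based Spearman correlation.
corr = df.corr(method="spearman")
print(corr.round(2))  # no pair reaches |rho| >= 0.6
```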
For outlier detection across multiple independent variables, I applied the Local Outlier Factor (LOF) algorithm. However, since many of the flagged outliers belonged to high-quality wines, I decided to keep them. I also observed significant performance drops in my models when removing this data, which reinforced the decision to retain the outliers.
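A minimal LOF sketch, assuming default settings (the exact parameters used in the notebook are not given):

```python
from sklearn.neighbors import LocalOutlierFactor

# LOF flags points whose local density is much lower than that of their
# neighbors: fit_predict returns -1 for outliers and 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=20)  # 20 is the sklearn default
labels = lof.fit_predict(df.drop(columns="quality"))

outliers = df[labels == -1]
print(outliers["quality"].value_counts())  # which quality levels the outliers belong to
```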
Classification Models I Selected (a baseline training sketch follows this list):
Logistic Regression (LG)
Ridge Regression (RR)
Decision Tree (DT)
Naive Bayes (NB)
Neural Network (NN)
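A sketch of this baseline comparison, assuming an 80/20 stratified split, MinMax scaling, and default hyperparameters (the original notebook's exact settings are not given):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = df.drop(columns="quality"), df["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features: the linear models and the MLP are sensitive to feature ranges.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "LG": LogisticRegression(max_iter=1000),
    "RR": RidgeClassifier(),
    "DT": DecisionTreeClassifier(random_state=42),
    "NB": GaussianNB(),
    "NN": MLPClassifier(max_iter=1000, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(accuracy_score(y_test, model.predict(X_test)), 2))
```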
Accuracy Rates Observed Before Removing Data:
LG: 0.61
RR: 0.58
DT: 0.56
NB: 0.56
NN: 0.57
Accuracy Rates After Removing Duplicate Data:
LG: 0.58
RR: 0.57
DT: 0.50
NB: 0.50
NN: 0.59
Accuracy Rates After Removing Features with Very Low Correlation:
LG: 0.55
RR: 0.55
DT: 0.49
NB: 0.45
NN: 0.56
3. Re-Evaluating the Dataset
I identified a class imbalance in the dataset, so I decided it would be more appropriate to oversample with SMOTE, a k-nearest-neighbors-based technique from the imbalanced-learn library.
Based on this, I augmented the data by mixing synthetic samples with the real ones, using SMOTE for KNN-based oversampling and ADASYN to further increase the minority classes. After this process, I encountered feature pairs with significantly high correlations (+/- 0.60 to 0.90).
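A sketch of both oversamplers from imbalanced-learn, shown side by side since the exact combination used is not fully specified (note that resampling should be fit on the training split only, to keep synthetic points out of the test set):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN

print(Counter(y_train))  # the middle quality scores dominate

# SMOTE synthesizes minority samples by interpolating between
# k-nearest neighbors of the same class.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# ADASYN is similar but generates more samples for minority points
# that are harder to learn (those surrounded by other classes).
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X_train, y_train)

print(Counter(y_res), Counter(y_ada))
```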
To correct the data structure without hurting the model's performance, I removed the outliers of the least influential feature in the dataset.
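One way this pruning might look, as a sketch only (both the chosen feature and the 1.5×IQR rule are assumptions; the post names neither):

```python
# Assumption: an IQR filter on a single low-importance feature.
col = "residual sugar"  # hypothetical "least influential" feature
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```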
After applying MinMaxScaler to the features and LabelEncoder to the target, I trained my model and achieved an accuracy of 0.91.
The hyperparameter analysis performed with GridSearchCV showed that the best values were criterion="gini", splitter="best", max_depth=5, min_samples_split=2, and max_leaf_nodes=10.
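A sketch of this final pipeline, assuming the model is a Decision Tree (the tuned parameters are tree-specific) and continuing from the resampled training data above; grid values other than the reported best are assumptions:

```python
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Scale features to [0, 1] and encode the quality labels as 0..n-1.
X_scaled = MinMaxScaler().fit_transform(X_res)
y_enc = LabelEncoder().fit_transform(y_res)

param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
    "max_leaf_nodes": [10, 20, None],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X_scaled, y_enc)
print(search.best_params_)
# Reported best: criterion="gini", splitter="best", max_depth=5,
# min_samples_split=2, max_leaf_nodes=10
```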
When I retrained the model with these hyperparameters, the accuracy score was 0.93.
4. Comments