Wine Quality Modeling with K-Nearest Neighbors

ROLE:

Lead Developer


TIMELINE:

October–November 2024

TEAM:

3 Developers

SKILLS:

Data Mining

Predictive Analytics

Business Insight


Project Overview

This project models wine preferences from physicochemical properties alone: alcohol, sulphates, residual sugar, and so on. Our training and testing data consist of 6,497 red and white Vinho Verde wine samples from Portugal, each paired with tasters’ scores for perceived quality. The data was sourced from UC Irvine’s Machine Learning Repository.

Full Project on GitHub

Data Overview

Source: Two CSV files, winequality-red.csv and winequality-white.csv, merged into a single dataset (loading sketch below).
Total wine samples: 6,497

  • Red wines: 1,599

  • White wines: 4,898
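
A minimal loading sketch, assuming pandas; the UCI repository distributes both files with semicolon separators:

    import pandas as pd

    # The UCI CSVs are semicolon-separated.
    red = pd.read_csv("winequality-red.csv", sep=";")
    white = pd.read_csv("winequality-white.csv", sep=";")

    # Merge into one dataset: 1599 red + 4898 white = 6497 samples.
    wine = pd.concat([red, white], ignore_index=True)
    print(wine.shape)  # (6497, 12): 11 features + the 'quality' target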

Features (11):

  • fixed acidity

  • volatile acidity

  • citric acid

  • residual sugar

  • chlorides

  • free sulfur dioxide

  • total sulfur dioxide

  • density

  • pH

  • sulphates

  • alcohol

Original Target Variable: quality (score between 3 and 9).

Distribution:

  • 3/10: 0.46%

  • 4/10: 3.32%

  • 5/10: 32.91%

  • 6/10: 43.65%

  • 7/10: 16.61%

  • 8/10: 2.97%

  • 9/10: 0.08%

Observation: The dataset is heavily imbalanced, with the vast majority of wines scoring 5 or 6.

Methodology

  1. Discretize Data

Perceived quality scores are given out of 10

  • To create labels, we sort the scores into two groups

  • 3–5 → poor quality; 6–9 → good quality (see the sketch after this list)

These class boundaries were chosen because:

  • no wine scored lower than 3 or higher than 9

  • they create similarly sized groups for poor and good quality
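
A sketch of the binning step, assuming the merged wine DataFrame from above; the 0/1 encoding is an illustrative choice:

    # Bin the 3-9 quality scores into two classes:
    # 0 = poor (scores 3-5), 1 = good (scores 6-9).
    wine["label"] = (wine["quality"] >= 6).astype(int)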

2. Clean Up Data

  1. Labels separated from the feature columns

  2. 42% of good-quality wines dropped to match the poor-quality count

    1. a balanced dataset yields more accurate classification results

  3. All features scaled to [0,1]

    1. to ensure features contribute equally to distance calculations (sketch after this list)
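
A sketch of the clean-up steps, assuming scikit-learn and the labeled DataFrame above; the random seed is arbitrary:

    from sklearn.preprocessing import MinMaxScaler

    # Separate the labels from the feature columns.
    X = wine.drop(columns=["quality", "label"])
    y = wine["label"]

    # Downsample the majority 'good' class to match the 'poor' count.
    n_poor = (y == 0).sum()
    keep = y[y == 1].sample(n=n_poor, random_state=42).index.union(
        y[y == 0].index)
    X, y = X.loc[keep], y.loc[keep]

    # Scale every feature to [0, 1]; in practice, fit the scaler on
    # the training split only to avoid test-set leakage.
    X_scaled = MinMaxScaler().fit_transform(X)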

3. Optimize K-value

  1. Determine the optimal k-value via an elbow graph (sketch after this list)

    1. Error rate vs. k-value

  2. Maximize the k-value within the low-error range to prevent overfitting

    1. with low k-values the model overfits, since predictions hinge on only a few nearby points

  3. Minimize error rate

  4. Optimal value found to be k=23
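
One way to generate the elbow data, assuming scikit-learn and matplotlib; the k range and 5-fold cross-validation are illustrative choices:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Cross-validated error rate (1 - accuracy) for each candidate k.
    ks = range(1, 51)
    errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_scaled, y, cv=5).mean()
              for k in ks]

    # Elbow graph: error rate vs. k-value.
    plt.plot(ks, errors, marker="o")
    plt.xlabel("k-value")
    plt.ylabel("Cross-validated error rate")
    plt.show()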

4. Classify Data

  1. Fit a KNN model to the training data with

    • the empirically optimal k=23

    • an 80-20 train-test split

  2. Use the fitted model to predict labels for the test set (sketch after this list)
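
A sketch of this step under the parameters above, assuming scikit-learn; the random seed is arbitrary:

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # 80-20 train-test split on the balanced, scaled data.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42)

    # Fit KNN with the empirically optimal k, then predict.
    knn = KNeighborsClassifier(n_neighbors=23)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)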

5. Evaluate Model

The model’s classification performance is analyzed via the following (computed as sketched after this list):

  1. Accuracy score

  2. Precision score

  3. Recall score

  4. F1-Macro

  5. Confusion matrix
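
These map directly onto scikit-learn's metrics; a sketch assuming the fitted model above (macro averaging is an assumption):

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, confusion_matrix)

    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, average="macro"))
    print("Recall:   ", recall_score(y_test, y_pred, average="macro"))
    print("F1-macro: ", f1_score(y_test, y_pred, average="macro"))
    print(confusion_matrix(y_test, y_pred))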

6. Results Validation

  1. Create a decision boundary graph to verify the fit (sketch after this list)

  2. If overfitted, test other local minima in the error-vs-k graph
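
One way to draw the boundary, assuming a 2-D PCA projection of the scaled features (the choice of PCA is an assumption) and a KNN refit on the projected data:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    # Project the 11 features down to 2 dimensions for plotting.
    X_2d = PCA(n_components=2).fit_transform(X_scaled)
    knn_2d = KNeighborsClassifier(n_neighbors=23).fit(X_2d, y)

    # Predict over a grid covering the projected data, then contour it.
    xx, yy = np.meshgrid(
        np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
        np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
    zz = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()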

Modeling Results

1. Model Metrics

Accuracy Score: 0.7673 ≈ 76.7%

Precision Score: 0.7672 ≈ 76.7%

Recall Score: 0.7673 ≈ 76.7%

F1-Macro: 0.7673 ≈ 76.7%

2. Confusion Matrix

3. Decision Boundary

Strengths & Limitations

Strengths of our approach:

  1. Binary label bins create an even class distribution

  2. The optimal k-value is found empirically and cross-validated

  3. The KNN model generalizes well to unseen data

Limitations of our approach:

  1. Only two classes: ‘poor’ and ‘good’ quality

    1. The model is unable to predict the degree of good/poor quality

  2. No separate evaluation of models trained on only red or white wine samples

  3. Information lost to dimensionality reduction makes the decision boundary graph hard to interpret

Reflection

What went well:

  • Project development was rapid and smooth

    • team members were attentive during sprint meetings

    • an initial experimentation period helped refine our training parameters

  • We expanded a simple concept into a robust exploration of wine quality modeling

    • it taught all of us the value of cross-validation

    • how to minimize lurking variables and sampling bias

    • visualization methods for high-dimensional data

What could be improved:

No separate insights into red and white wine data

  • the datasets were combined to increase the size of the training data

  • time constraints prevented us from adding this after completing the first working build

The decision boundary graph is difficult to interpret

  • the dimensionally reduced dataset loses detail

  • it does not appear to be a good approximation of the full dataset

    • but the full 11-dimensional data cannot be graphed directly

    • and predicting from projected data performs substantially worse

The implementation of the elbow graph may be flawed

  • maximizing k may not have been the right choice

  • it is hard to check for underfitting without a clear metric

  • perhaps a silhouette score would have been better than graphing the decision boundary