Wine Quality Modeling with K-Nearest Neighbors

ROLE:

Lead Developer


TIMELINE:

October–November 2024

TEAM:

3 Developers

SKILLS:

Data Mining

Predictive Analytics

Business Insight


Project Overview

This project models wine preferences from physicochemical properties alone: alcohol, sulphates, residual sugar, and so on. Our training and testing data consist of 6,497 red and white Vinho Verde wine samples from Portugal, each paired with tasters’ scores for perceived quality. The data was sourced from UC Irvine’s Machine Learning Repository.

Full Project on GitHub

Data Overview

Source: Two CSV files, winequality-red.csv and winequality-white.csv, merged into a single dataset (loading sketch below).
Total wine samples: 6,497

  • Red wines: 1,599

  • White wines: 4,898
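
A minimal loading sketch, assuming pandas; the UCI repository distributes both files with semicolon separators:

    import pandas as pd

    # The UCI CSVs are semicolon-separated.
    red = pd.read_csv("winequality-red.csv", sep=";")
    white = pd.read_csv("winequality-white.csv", sep=";")

    # Merge into one dataset: 1599 red + 4898 white = 6497 samples.
    wine = pd.concat([red, white], ignore_index=True)
    print(wine.shape)  # (6497, 12): 11 features + the 'quality' target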

Features (11):

  • fixed acidity

  • volatile acidity

  • citric acid

  • residual sugar

  • chlorides

  • free sulfur dioxide

  • total sulfur dioxide

  • density

  • pH

  • sulphates

  • alcohol

Original Target Variable: quality (score between 3 and 9).

Distribution:

  • 3/10: 0.46%

  • 4/10: 3.32%

  • 5/10: 32.91%

  • 6/10: 43.65%

  • 7/10: 16.61%

  • 8/10: 2.97%

  • 9/10: 0.08%

Observation: The dataset is heavily imbalanced, with the vast majority of wines scoring 5 or 6.

Methodology

  1. Discretize Data

Perceived quality scores are given out of 10

  • To create labels, we sort the scores into two groups

  • 3–5 → poor quality; 6–9 → good quality (see the sketch after this list)

These class boundaries were chosen because:

  • no wine scored lower than 3 or higher than 9

  • they create similarly sized groups for poor and good quality
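
A sketch of the binning step, assuming the merged wine DataFrame from above; the 0/1 encoding is an illustrative choice:

    # Bin the 3-9 quality scores into two classes:
    # 0 = poor (scores 3-5), 1 = good (scores 6-9).
    wine["label"] = (wine["quality"] >= 6).astype(int)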

2. Clean Up Data

  1. Labels separated from the feature columns

  2. 42% of good-quality wines dropped to match the poor-quality count

    1. a balanced dataset yields more accurate classification results

  3. All features scaled to [0,1]

    1. to ensure features contribute equally to distance calculations (sketch after this list)
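
A sketch of the clean-up steps, assuming scikit-learn and the labeled DataFrame above; the random seed is arbitrary:

    from sklearn.preprocessing import MinMaxScaler

    # Separate the labels from the feature columns.
    X = wine.drop(columns=["quality", "label"])
    y = wine["label"]

    # Downsample the majority 'good' class to match the 'poor' count.
    n_poor = (y == 0).sum()
    keep = y[y == 1].sample(n=n_poor, random_state=42).index.union(
        y[y == 0].index)
    X, y = X.loc[keep], y.loc[keep]

    # Scale every feature to [0, 1]; in practice, fit the scaler on
    # the training split only to avoid test-set leakage.
    X_scaled = MinMaxScaler().fit_transform(X)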

3. Optimize K-value

  1. Determine the optimal k-value via an elbow graph (sketch after this list)

    1. Error rate vs. k-value

  2. Maximize the k-value within the low-error range to prevent overfitting

    1. with low k-values the model overfits, since predictions hinge on only a few nearby points

  3. Minimize error rate

  4. Optimal value found to be k=23
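
One way to generate the elbow data, assuming scikit-learn and matplotlib; the k range and 5-fold cross-validation are illustrative choices:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Cross-validated error rate (1 - accuracy) for each candidate k.
    ks = range(1, 51)
    errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_scaled, y, cv=5).mean()
              for k in ks]

    # Elbow graph: error rate vs. k-value.
    plt.plot(ks, errors, marker="o")
    plt.xlabel("k-value")
    plt.ylabel("Cross-validated error rate")
    plt.show()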

4. Classify Data

  1. Fit a KNN model to the training data with

    • the empirically optimal k=23

    • an 80-20 train-test split

  2. Use the fitted model to predict labels for the test set (sketch after this list)
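
A sketch of this step under the parameters above, assuming scikit-learn; the random seed is arbitrary:

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # 80-20 train-test split on the balanced, scaled data.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42)

    # Fit KNN with the empirically optimal k, then predict.
    knn = KNeighborsClassifier(n_neighbors=23)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)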

5. Evaluate Model

The model’s classification performance is analyzed via the following (computed as sketched after this list):

  1. Accuracy score

  2. Precision score

  3. Recall score

  4. F1-Macro

  5. Confusion matrix
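
These map directly onto scikit-learn's metrics; a sketch assuming the fitted model above (macro averaging is an assumption):

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, confusion_matrix)

    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, average="macro"))
    print("Recall:   ", recall_score(y_test, y_pred, average="macro"))
    print("F1-macro: ", f1_score(y_test, y_pred, average="macro"))
    print(confusion_matrix(y_test, y_pred))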

6. Results Validation

  1. Create a decision boundary graph to verify the fit (sketch after this list)

  2. If overfitted, test other local minima in the error-vs-k graph
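
One way to draw the boundary, assuming a 2-D PCA projection of the scaled features (the choice of PCA is an assumption) and a KNN refit on the projected data:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    # Project the 11 features down to 2 dimensions for plotting.
    X_2d = PCA(n_components=2).fit_transform(X_scaled)
    knn_2d = KNeighborsClassifier(n_neighbors=23).fit(X_2d, y)

    # Predict over a grid covering the projected data, then contour it.
    xx, yy = np.meshgrid(
        np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
        np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
    zz = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()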

Modeling Results

1. Model Metrics

Accuracy Score: 0.7673 ≈ 76.7%

Precision Score: 0.7672 ≈ 76.7%

Recall Score: 0.7673 ≈ 76.7%

F1-Macro: 0.7673 ≈ 76.7%

2. Confusion Matrix

3. Decision Boundary

Strengths & Limitations

Strengths of our approach:

  1. Binary label bins create an even class distribution

  2. The optimal k-value is found empirically and cross-validated

  3. The KNN model generalizes well to unseen data

Limitations of our approach:

  1. Only two classes: ‘poor’ and ‘good’ quality

    1. The model is unable to predict the degree of good/poor quality

  2. No separate evaluation of models trained on only red or white wine samples

  3. Information lost to dimensionality reduction makes the decision boundary graph hard to interpret

Reflection

What went well:

  • Project development was rapid and smooth

    • team members were attentive during sprint meetings

    • an initial experimentation period helped refine our training parameters

  • We expanded a simple concept into a robust exploration of wine quality modeling

    • it taught all of us the value of cross-validation

    • how to minimize lurking variables and sampling bias

    • visualization methods for high-dimensional data

What could be improved:

No separate insights into red and white wine data

  • the datasets were combined to increase the size of the training data

  • time constraints prevented us from adding this after completing the first working build

The decision boundary graph is difficult to interpret

  • the dimensionally reduced dataset loses detail

  • it does not appear to be a good approximation of the full dataset

    • but the full 11-dimensional data cannot be graphed directly

    • and predicting from projected data performs substantially worse

The implementation of the elbow graph may be flawed

  • maximizing k may not have been the right choice

  • it is hard to check for underfitting without a clear metric

  • perhaps a silhouette score would have been better than graphing the decision boundary