Iris Dataset Classification

ROLE:

Sole developer


TIMELINE:

October 2024 (20 days)

(Updated for portfolio showcase)

SKILLS:

Data Mining

Statistical Analysis

  • This project models iris species based on petal length and width as well as sepal (the green part underneath the bloom) length and width. I classified the data using two classic algorithms, k-NN and Decision Trees, and evaluated their performance.

  • The data comes from Fisher's Iris dataset (1936), one of the earliest known datasets used for evaluating classification methods.

    Total Iris Samples: 150

    • Setosa Count = 50

    • Versicolor Count = 50

    • Virginica Count = 50

    Input data (features):

    1. Sepal Length (cm)

    2. Sepal Width (cm)

    3. Petal Length (cm)

    4. Petal Width (cm)
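To illustrate how a k-NN classifier could use these four features, here is a minimal pure-Python sketch. It is not the code used in this project: the function name `knn_predict` and the sample measurements are hypothetical, and Euclidean distance with simple majority voting is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest samples and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical measurements in cm:
# (sepal length, sepal width, petal length, petal width)
train_X = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2),
    (6.4, 3.2, 4.5, 1.5), (5.7, 2.8, 4.1, 1.3),
    (6.3, 3.3, 6.0, 2.5), (7.1, 3.0, 5.9, 2.1),
]
train_y = ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"]

prediction = knn_predict(train_X, train_y, (5.0, 3.4, 1.5, 0.2), k=3)
```

With k = 3, a query close to the two Setosa rows is labelled Setosa even though the third neighbour belongs to another species, because the majority vote decides.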

    1. Classification Results:

    Average Metrics (Stratified 5-Fold Cross-Validation):
    A. 3-NN Average Metrics:

    • Accuracy: 0.9533 ± 0.0340

    • Precision: 0.9560 ± 0.0322

    • Recall: 0.9533 ± 0.0340

    • F1: 0.9532 ± 0.0341

    B. Decision Tree Average Metrics:

    • Accuracy: 0.9333 ± 0.0422

    • Precision: 0.9432 ± 0.0324

    • Recall: 0.9333 ± 0.0422

    • F1: 0.9321 ± 0.0439
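The "mean ± spread over stratified folds" figures above can be sketched as follows. This is an illustration under assumptions, not the project's code: `stratified_folds` is a hypothetical helper, the per-fold accuracies are made-up values, and the spread is assumed to be a population standard deviation (the source does not say which was used).

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so every fold keeps the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


labels = ["Setosa"] * 50 + ["Versicolor"] * 50 + ["Virginica"] * 50
folds = stratified_folds(labels)
fold_sizes = [len(fold) for fold in folds]  # 30 per fold, 10 per species

# Hypothetical per-fold accuracies, summarised as mean ± standard deviation.
fold_accuracies = [0.90, 1.00, 0.90, 1.00, 0.95]
mean_accuracy = statistics.mean(fold_accuracies)
std_accuracy = statistics.pstdev(fold_accuracies)
```

Each fold serves once as the test set while the other four are used for training, so every one of the 150 samples is predicted exactly once.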

    2. Broad Data Insights:

    • Mean Sepal Length = 5.843 ± 0.828 cm

    • Mean Sepal Width = 3.054 ± 0.434 cm

    • Mean Petal Length = 3.759 ± 1.76 cm

    • Mean Petal Width = 1.199 ± 0.763 cm
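A summary in this "mean ± spread" form could be produced along these lines. This is a sketch with assumptions: the ± value is taken to be a sample standard deviation, `summarize` is a hypothetical helper, and the list is a small made-up subset rather than the full 150-sample column.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) rounded to three decimals."""
    return round(statistics.mean(values), 3), round(statistics.stdev(values), 3)


# Hypothetical subset of sepal lengths in cm.
sepal_lengths = [5.1, 4.9, 6.4, 5.7, 6.3, 7.1]
mean_cm, std_cm = summarize(sepal_lengths)
summary = f"Mean Sepal Length = {mean_cm} ± {std_cm} cm"
```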

    3. Species-specific Insights:

    A. Average Petal Area:

    • Setosa = 0.357 cm^2

    • Versicolor = 5.65 cm^2

    • Virginica = 11.2 cm^2

    B. Average Sepal Area:

    • Setosa = 17.1 cm^2

    • Versicolor = 16.4 cm^2

    • Virginica = 19.6 cm^2
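One way to derive per-species areas like these is to approximate each petal or sepal as a rectangle (length × width) and average within each species. The source does not state the formula, so the rectangular approximation is an assumption, and both `average_area_by_species` and the rows below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (species, petal length in cm, petal width in cm).
rows = [
    ("Setosa", 1.4, 0.2), ("Setosa", 1.5, 0.3),
    ("Versicolor", 4.5, 1.5), ("Versicolor", 4.1, 1.3),
    ("Virginica", 6.0, 2.5), ("Virginica", 5.9, 2.1),
]


def average_area_by_species(rows):
    """Average length * width per species (rectangular approximation, cm^2)."""
    areas = defaultdict(list)
    for species, length, width in rows:
        areas[species].append(length * width)
    return {species: round(mean(values), 3) for species, values in areas.items()}


petal_areas = average_area_by_species(rows)
```

The same function works for sepal area by passing sepal length and width instead.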

Limitations

    1. Small test set size
      - A 70-30 train-test split leaves only 45 test predictions

      - Highly subject to sampling bias
      - I counteracted this somewhat by using 5-fold cross-validation

    2. Limited experimentation with model types

      - Only two models were tested

      - Naive Bayes, Logistic Regression, or a linear SVM could be tested in the future
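The arithmetic behind the small-test-set concern can be sketched as below. The binomial standard error is a rough yardstick I am adding for illustration (it treats predictions as independent, which is only approximately true under cross-validation), and `accuracy_standard_error` is a hypothetical helper.

```python
import math

total_samples = 150
test_fraction = 0.30

# A single 70-30 split leaves this many held-out predictions.
holdout_predictions = round(total_samples * test_fraction)

# With 5-fold cross-validation every sample is predicted exactly once.
n_folds = 5
predictions_per_fold = total_samples // n_folds
cross_validated_predictions = predictions_per_fold * n_folds


def accuracy_standard_error(accuracy, n_predictions):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1 - accuracy) / n_predictions)


# The reported 3-NN accuracy is used here purely as an example value.
se_holdout = accuracy_standard_error(0.9533, holdout_predictions)
se_cross_validated = accuracy_standard_error(0.9533, cross_validated_predictions)
```

With only 45 predictions, one extra mistake moves accuracy by more than two percentage points, which is why the cross-validated estimate over all 150 samples is the steadier number.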

Graphical Analysis

  1. Confusion Matrices

At first, I suspected that the two confusion matrices were identical because of faulty code. After triple-checking the code and re-running the experiment with different random state values, I can confirm the matrices are correct. Both models can produce the same result (one Versicolor sample misclassified as Virginica), especially on a clean, well-separated dataset like this one.
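To show how a confusion matrix with a single Versicolor-to-Virginica error maps to the reported metric types, here is a small sketch. The matrix assumes 15 test samples per species, which the source does not state, and macro averaging is likewise an assumption; `macro_metrics` is a hypothetical helper.

```python
def macro_metrics(matrix):
    """Accuracy and macro-averaged precision, recall and F1.

    matrix[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    precisions, recalls, f1_scores = [], [], []
    for i in range(n_classes):
        true_positive = matrix[i][i]
        predicted_as_i = sum(matrix[r][i] for r in range(n_classes))
        actually_i = sum(matrix[i])
        precision = true_positive / predicted_as_i if predicted_as_i else 0.0
        recall = true_positive / actually_i if actually_i else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1_scores) / n_classes,
    }


# Rows and columns ordered Setosa, Versicolor, Virginica (assumed class counts).
confusion = [
    [15, 0, 0],
    [0, 14, 1],
    [0, 0, 15],
]
metrics = macro_metrics(confusion)
```

When every class has the same number of samples, macro-averaged recall equals accuracy, which is consistent with the identical accuracy and recall figures in the cross-validation results.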

  2. Value Distribution of Data Features

Petal length for Iris Setosa closely followed a normal distribution.

Petal width for Iris Virginica was the only feature to approximate a bimodal distribution.

Learn More on GitHub

Petal width, meanwhile, was right-skewed.

Others still, like petal width for Versicolor, were right-skewed.
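Skew can also be checked numerically rather than by eye. The sketch below computes the Fisher-Pearson skewness coefficient, where a positive value indicates a longer right tail. The helper name `skewness` and the petal widths are hypothetical, not values from the dataset.

```python
import statistics


def skewness(values):
    """Fisher-Pearson skewness coefficient: positive means right-skewed."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical petal widths in cm, with one larger value stretching the right tail.
petal_widths = [1.0, 1.0, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.8]
skew = skewness(petal_widths)
is_right_skewed = skew > 0
```

A value near zero would suggest a roughly symmetric distribution, while a negative value would indicate left skew.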
