Houjun Liu

AIBridge Iris Variance Worksheet

SPOILER ALERT for future labs!! Don’t scroll down!

We are going to create a copy of the iris dataset with a random variance.

import sklearn
from sklearn.datasets import load_iris

Let’s load the iris dataset:

x,y = load_iris(return_X_y=True)

Because we need to generate a lot of random data, let’s import random

import random

Put this in a df

import pandas as pd
df = pd.DataFrame(x)
df
       0    1    2    3
0    5.1  3.5  1.4  0.2
1    4.9  3.0  1.4  0.2
2    4.7  3.2  1.3  0.2
3    4.6  3.1  1.5  0.2
4    5.0  3.6  1.4  0.2
..   ...  ...  ...  ...
145  6.7  3.0  5.2  2.3
146  6.3  2.5  5.0  1.9
147  6.5  3.0  5.2  2.0
148  6.2  3.4  5.4  2.3
149  5.9  3.0  5.1  1.8

[150 rows x 4 columns]

Let’s make 150 random numbers with pretty low variance:

random_ns = [random.uniform(65,65.2) for _ in range(0, 150)]
random_series = pd.Series(random_ns)
random_series
0      65.127515
1      65.034572
2      65.123271
3      65.043985
4      65.145743
         ...
145    65.036410
146    65.157172
147    65.034925
148    65.037373
149    65.042466
Length: 150, dtype: float64

Excellent. Now let’s put the two things together!

df["temp"] =  random_series
df
       0    1    2    3       temp
0    5.1  3.5  1.4  0.2  65.127515
1    4.9  3.0  1.4  0.2  65.034572
2    4.7  3.2  1.3  0.2  65.123271
3    4.6  3.1  1.5  0.2  65.043985
4    5.0  3.6  1.4  0.2  65.145743
..   ...  ...  ...  ...        ...
145  6.7  3.0  5.2  2.3  65.036410
146  6.3  2.5  5.0  1.9  65.157172
147  6.5  3.0  5.2  2.0  65.034925
148  6.2  3.4  5.4  2.3  65.037373
149  5.9  3.0  5.1  1.8  65.042466

[150 rows x 5 columns]

And, while we are at it, let’s make new labels

names = pd.Series(["sepal length", "sepal width", "pedal length", "pedal width", "temp"])
df.columns = names
df
     sepal length  sepal width  pedal length  pedal width       temp
0             5.1          3.5           1.4          0.2  65.127515
1             4.9          3.0           1.4          0.2  65.034572
2             4.7          3.2           1.3          0.2  65.123271
3             4.6          3.1           1.5          0.2  65.043985
4             5.0          3.6           1.4          0.2  65.145743
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  65.036410
146           6.3          2.5           5.0          1.9  65.157172
147           6.5          3.0           5.2          2.0  65.034925
148           6.2          3.4           5.4          2.3  65.037373
149           5.9          3.0           5.1          1.8  65.042466

[150 rows x 5 columns]

Excellent. Let’s finally get the flower results.

df["species"] = y
df
     sepal length  sepal width  pedal length  pedal width       temp  species
0             5.1          3.5           1.4          0.2  65.127515        0
1             4.9          3.0           1.4          0.2  65.034572        0
2             4.7          3.2           1.3          0.2  65.123271        0
3             4.6          3.1           1.5          0.2  65.043985        0
4             5.0          3.6           1.4          0.2  65.145743        0
..            ...          ...           ...          ...        ...      ...
145           6.7          3.0           5.2          2.3  65.036410        2
146           6.3          2.5           5.0          1.9  65.157172        2
147           6.5          3.0           5.2          2.0  65.034925        2
148           6.2          3.4           5.4          2.3  65.037373        2
149           5.9          3.0           5.1          1.8  65.042466        2

[150 rows x 6 columns]

And dump it to a CSV.

df.to_csv("./iris_variance.csv", index=False)

Let’s select for the input data again:

X = df.iloc[:,0:5]
y = df.iloc[:,5]
X
     sepal length  sepal width  pedal length  pedal width       temp
0             5.1          3.5           1.4          0.2  65.127515
1             4.9          3.0           1.4          0.2  65.034572
2             4.7          3.2           1.3          0.2  65.123271
3             4.6          3.1           1.5          0.2  65.043985
4             5.0          3.6           1.4          0.2  65.145743
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  65.036410
146           6.3          2.5           5.0          1.9  65.157172
147           6.5          3.0           5.2          2.0  65.034925
148           6.2          3.4           5.4          2.3  65.037373
149           5.9          3.0           5.1          1.8  65.042466

[150 rows x 5 columns]

And use the variance threshold tool:

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(0.1)
sel.fit_transform(X)
5.13.51.40.2
4.931.40.2
4.73.21.30.2
4.63.11.50.2
53.61.40.2
5.43.91.70.4
4.63.41.40.3

As we expected.

And let’s use the select k best tool:

from sklearn.feature_selection import SelectKBest, chi2
sel = SelectKBest(chi2, k=4)
res = sel.fit_transform(X, y)
res
5.13.51.40.2
4.931.40.2
4.73.21.30.2
4.63.11.50.2
53.61.40.2
5.43.91.70.4
4.63.41.40.3
53.41.50.2

Also, as we expected. Got rid of temp.