Skip to main content

Classification of Wisconsin breast cancer dataset

Machine Learning and Deep Learning: Classification of Wisconsin breast cancer dataset

Machine Learning and Deep Learning: Classification of Wisconsin breast cancer dataset

Lai Tai-Yu

Loading dataset and useful packages.

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics

Loading dataset as DataFrame and Series of the Pandas. The DataFrame is the feature, and the Series is a target value. Using isnull() for check NaN. The default test size is 0.25 (25%). Random seed and Random state are both set 1. train_test_split can help to build training set and test set.

In [3]:
data = load_breast_cancer(as_frame=True)
print(data.data.isnull().sum())
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

The 400 estimators got enough accuracy in the pass experiment. If the estimator increase, then accuracy will drop.

In [4]:
clf_e = ExtraTreesClassifier(n_estimators=400, random_state=1)
clf_e.fit(X_train, y_train)
Out[4]:
ExtraTreesClassifier(n_estimators=400, random_state=1)
In [5]:
print("Accuracy on test data: {:.2f}".format(clf_e.score(X_test, y_test)))
scores = cross_val_score(clf_e, X, y)
print("ExtraTrees KFold Accuracy: {:.2f}".format(scores.mean()))
Accuracy on test data: 0.95
ExtraTrees KFold Accuracy: 0.97

The number of estimators in RandomForestClassifier needs more than ExtraTreesClassifier.

In [6]:
clf_r = RandomForestClassifier(n_estimators=600, random_state=1)
clf_r.fit(X_train, y_train)
Out[6]:
RandomForestClassifier(n_estimators=600, random_state=1)
In [7]:
print("Accuracy on test data: {:.2f}".format(clf_r.score(X_test, y_test)))
scores = cross_val_score(clf_r, X, y)
print("Random Forest KFold Accuracy: {:.2f}".format(scores.mean()))
Accuracy on test data: 0.95
Random Forest KFold Accuracy: 0.96

ExtraTreesClassifier has an attribute, it's called feature_importances_. Using descending sort and collecting the score greater and equal to 0.01. Got top K feature of important. At the same time, record whole feature into a list.

In [8]:
indices_all = []
indices_e = []
extra_importance_sorted_idx = np.argsort(clf_e.feature_importances_)[::-1]
for idx in zip(extra_importance_sorted_idx):
    print("%-30s %.8f" % (data.feature_names[idx],\
                          clf_e.feature_importances_[idx]))
    if clf_e.feature_importances_[idx] >= 0.01:
        indices_e.append([data.feature_names[idx],\
                          idx])
    indices_all.append([data.feature_names[idx],\
                       idx])    
worst concave points           0.10236108
mean concave points            0.09020062
worst radius                   0.08011382
worst perimeter                0.06791662
worst area                     0.06649168
mean perimeter                 0.06605122
mean concavity                 0.06333271
mean area                      0.06101473
mean radius                    0.05269444
worst concavity                0.04822469
worst compactness              0.03307735
area error                     0.03087653
worst texture                  0.02742005
mean compactness               0.02410239
mean texture                   0.02111108
perimeter error                0.02069637
radius error                   0.02061688
worst smoothness               0.02043304
worst symmetry                 0.01812615
worst fractal dimension        0.01094089
concavity error                0.01051810
concave points error           0.00945332
mean smoothness                0.00807585
compactness error              0.00798494
mean symmetry                  0.00797939
mean fractal dimension         0.00686189
smoothness error               0.00615091
texture error                  0.00601483
symmetry error                 0.00570167
fractal dimension error        0.00545675

RandomForestClassifier has the attribute, the same name called feature_importances_. The process routine is the same. Using descending sort and collecting greater equal to 0.01 score of the feature.

In [9]:
indices_r = []
forest_importance_sorted_idx = np.argsort(clf_r.feature_importances_)[::-1]
for idx in zip(forest_importance_sorted_idx):
    print("%-30s %.8f" % (data.feature_names[idx],\
                          clf_r.feature_importances_[idx]))
    if clf_r.feature_importances_[idx] >= 0.01:
        indices_r.append([data.feature_names[idx],\
                          idx])
worst concave points           0.13581296
worst perimeter                0.12433113
worst area                     0.11618983
worst radius                   0.11210755
mean concave points            0.09165992
mean concavity                 0.06552638
mean radius                    0.04731921
mean area                      0.04665603
mean perimeter                 0.04090480
worst concavity                0.03938244
area error                     0.02892531
worst compactness              0.01817479
worst texture                  0.01570876
perimeter error                0.01554450
mean texture                   0.01335187
worst smoothness               0.01101596
worst symmetry                 0.01050331
radius error                   0.01004983
mean compactness               0.00946926
worst fractal dimension        0.00613224
concavity error                0.00609868
concave points error           0.00439628
texture error                  0.00427330
mean smoothness                0.00415643
smoothness error               0.00413842
fractal dimension error        0.00391505
compactness error              0.00382008
symmetry error                 0.00370792
mean fractal dimension         0.00365456
mean symmetry                  0.00307320

Dimensionality reduction: There are combining two feature lists of importance. Using the instruction filter to compare the same feature name. The new dataset has 21 features and the original dataset has 30 features.

In [10]:
indices_list = []
match = ""
for idx in range(len(indices_e)):
    indices_list += [indices_e[idx][0]]
for idx in range(len(indices_r)):
    match = list(filter(lambda x: indices_r[idx][0] in x, indices_list))
    if len(match) > 0:
        pass
        #print("Duplicate!")
    else:
        indices_list += [indices_r[idx][0]]
print(f"Feature Name: {indices_list}")
print(f"Feature Total: {len(indices_list)}")
Feature Name: ['worst concave points', 'mean concave points', 'worst radius', 'worst perimeter', 'worst area', 'mean perimeter', 'mean concavity', 'mean area', 'mean radius', 'worst concavity', 'worst compactness', 'area error', 'worst texture', 'mean compactness', 'mean texture', 'perimeter error', 'radius error', 'worst smoothness', 'worst symmetry', 'worst fractal dimension', 'concavity error']
Feature Total: 21

To transform the feature name to feature indices. To create training set and test set.

In [11]:
indices = []
for idx in range(len(indices_list)):
    if indices_list[idx] in indices_all[idx][0]:
        indices += indices_all[idx][1]
    
print(f"Feature indices: {indices}") 
print(f"Feature Total: {len(indices)}")

X_f = pd.DataFrame(X.iloc[:,indices], columns=data.feature_names[indices]).values    
X_train_f, X_test_f, y_train, y_test = train_test_split(X_f, y, random_state=1, stratify=y)
Feature indices: [27, 7, 20, 22, 23, 2, 6, 3, 0, 26, 25, 13, 21, 5, 1, 12, 10, 24, 28, 29, 16]
Feature Total: 21

The XGBClassifier is from XGBoost. It's a famous package in the machine learning field. The eval_metric set error. That means using the binary classification rules. There has 3 FP and 4 FN in the confusion matrix. Only using 21 features.

In [11]:
xgb = XGBClassifier(use_label_encoder=False, eval_metric='error', seed=1)
xgb.fit(X_train_f, y_train)
y_pred = xgb.predict(X_test_f)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on test data: {:.2f}".format(accuracy))
scores = cross_val_score(xgb, X, y)
print("XGBClassifier KFold Accuracy: {:.2f}".format(scores.mean()))
print("Report:\n", metrics.classification_report(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
Accuracy on test data: 0.95
XGBClassifier KFold Accuracy: 0.98
Report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93        53
           1       0.96      0.97      0.96        90

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143

Confusion Matrix:
 [[49  4]
 [ 3 87]]

The GridSearchCV is a brute force way of adjusting the hyperparameters. In several hours, the GridSearchCV gives the hyperparameters of recommendation automatically. Using manual to improve a few values, there has 2 FP and 4 FN in the confusion matrix. Only using 21 features.

In [12]:
param_test = {
    #'max_depth':[i for i in range(5,13,2)],
    #'min_child_weight':[i/10.0 for i in range(5,20,2)],
    #'gamma':[i/10.0 for i in range(-2,7,2)],
    #'subsample':[i/10.0 for i in range(4,11,2)],
    #'colsample_bytree':[i/10.0 for i in range(4,11,2)],
    #'reg_alpha':[0, 1e-2, 0.1, 1],
    #'reg_lambda':[0, 1e-2, 0.1, 1]
    }
from sklearn.model_selection import GridSearchCV
gsearch = GridSearchCV(
    estimator = XGBClassifier(
        base_score=0.5, 
        booster='gbtree', 
        colsample_bylevel=1,
        colsample_bynode=1, 
        colsample_bytree=1, 
        eval_metric='error',
        gamma=0.2, 
        gpu_id=-1, 
        importance_type='gain',
        interaction_constraints='', 
        learning_rate=0.300000012,
        max_delta_step=0, 
        max_depth=5, 
        min_child_weight=1.5, 
        missing=np.nan,
        monotone_constraints='()', 
        n_estimators=100, 
        n_jobs=-1,
        num_parallel_tree=1, 
        random_state=1, 
        reg_alpha=0, 
        reg_lambda=1,
        scale_pos_weight=1, 
        subsample=0.8, 
        tree_method='exact',
        use_label_encoder=False, 
        validate_parameters=1, 
        verbosity=None,
        objective= 'binary:logistic',
        seed=1),
    param_grid = param_test,
    scoring='accuracy',
    n_jobs=-1,
    cv=10)
gsearch.fit(X_train_f,y_train)
print("="*40)
print("gsearch.scorer_\t", gsearch.scorer_)
print("gsearch.best_params_\t", gsearch.best_params_)
print("gsearch.best_score_\t", gsearch.best_score_)
xgb_a = gsearch.best_estimator_
xgb_a.fit(X_train_f,y_train)
y_pred = xgb_a.predict(X_test_f)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on test data: {:.2f}".format(accuracy))
scores = cross_val_score(xgb_a, X, y)
print("XGBClassifier KFold Accuracy: {:.2f}".format(scores.mean()))
print("Report:\n", metrics.classification_report(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
========================================
gsearch.scorer_	 make_scorer(accuracy_score)
gsearch.best_params_	 {}
gsearch.best_score_	 0.9696013289036545
Accuracy on test data: 0.96
XGBClassifier KFold Accuracy: 0.97
Report:
               precision    recall  f1-score   support

           0       0.96      0.92      0.94        53
           1       0.96      0.98      0.97        90

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.95       143
weighted avg       0.96      0.96      0.96       143

Confusion Matrix:
 [[49  4]
 [ 2 88]]

Change an algorithm to a support vector machine. Tuning hyperparameters via grid search.
Here using all features and using 21 features have the same result.

In [32]:
pipe_svc = make_pipeline(StandardScaler(),
                         SVC(random_state=1))

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'svc__C': param_range, 
               'svc__kernel': ['linear']},
              {'svc__C': param_range, 
               'svc__gamma': param_range, 
               'svc__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train_f, y_train)
print(gs.best_score_)
print(gs.best_params_)
0.9811738648947952
{'svc__C': 0.1, 'svc__kernel': 'linear'}
In [33]:
clf_s = gs.best_estimator_
clf_s.fit(X_train_f, y_train)
print('Test accuracy: %.3f' % clf_s.score(X_test_f, y_test))
Test accuracy: 0.972

1 FP and 3 FN. KFold accuracy is 98%.

In [36]:
y_pred = clf_s.predict(X_test_f)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on test data: {:.2f}".format(accuracy))
scores = cross_val_score(pipe_svc, X, y, cv=10)
print("Scores\n", scores)
print("SVMClassifier KFold Accuracy: {:.2f}".format(scores.mean()))
print("Report:\n", metrics.classification_report(y_test, y_pred))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, y_pred))
Accuracy on test data: 0.97
Scores
 [0.98245614 0.96491228 0.94736842 0.98245614 1.         1.
 0.92982456 1.         1.         0.94642857]
SVMClassifier KFold Accuracy: 0.98
Report:
               precision    recall  f1-score   support

           0       0.98      0.94      0.96        53
           1       0.97      0.99      0.98        90

    accuracy                           0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Confusion Matrix:
 [[50  3]
 [ 1 89]]

Let's see where are misclassified.

In [30]:
plt.scatter(X_test[y_test == 0]["mean radius"], X_test[y_test == 0]["mean texture"],\
           color="red", marker="^", alpha=0.5)
plt.scatter(X_test[y_test == 1]["mean radius"], X_test[y_test == 1]["mean texture"],\
           color="green", marker="s", alpha=0.5)
plt.scatter(X_test[y_test != y_pred]["mean radius"], X_test[y_test != y_pred]["mean texture"],\
           color="black", marker="x", s=1000, alpha=0.5, linewidth=2.0)
plt.show()

Deep Learning: Classification of Wisconsin breast cancer dataset

The code is only used under TensorFlow 1.15.x. Therefore, here is a check Tensorflow and Keras version.

In [22]:
import tensorflow as tf
print(f"Tensorflow: {tf.__version__}")
import keras
print(f"Keras: {keras.__version__}")
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
np.random.seed(1)
from tensorflow import set_random_seed
set_random_seed(1)
Tensorflow: 1.15.0
Keras: 2.1.6

Only using 21 features. Using StandardScaler for feature standardization, here is obtained training set and a test set of standardization. For the LSTM layer, the dataset has to transform into a 3D array.

In [23]:
sc = StandardScaler()
sc.fit(X_f)
X_train_std = sc.transform(X_train_f)
X_test_std = sc.transform(X_test_f)
X_train_std = np.reshape(X_train_std, (X_train_std.shape[0], X_train_std.shape[1], 1))
X_test_std = np.reshape(X_test_std, (X_test_std.shape[0], X_test_std.shape[1], 1))

There have 4 layers of the LSTM. It's called Stacked LSTM. Some researchers found a more deep network better than a large number of neurons. The advantage of the deep network is performance and accuracy highly. In addition, it is able to reduce the number of neurons.

In [24]:
ann = Sequential()
ann.add(LSTM(units=42, return_sequences=True, input_shape=(X_train_std.shape[1], 1)))
ann.add(Dropout(0.5))
ann.add(LSTM(units=42, return_sequences=True))
ann.add(Dropout(0.5))
ann.add(LSTM(units=42, return_sequences=True))
ann.add(Dropout(0.5))
ann.add(LSTM(units=42))
ann.add(Dropout(0.5))
ann.add(Dense(units=210, 
              kernel_initializer='normal',
              activation='sigmoid'))
ann.add(Dropout(0.5))
ann.add(Dense(units=105, 
              kernel_initializer='normal',
              activation='sigmoid'))
ann.add(Dropout(0.5))
ann.add(Dense(units=63, 
              kernel_initializer='normal',
              activation='sigmoid'))
ann.add(Dropout(0.5))
ann.add(Dense(units=1, 
              kernel_initializer='normal',
              activation='sigmoid')) 
ann.compile(loss='binary_crossentropy',             
            optimizer='adam', 
            metrics=['accuracy'])
In [25]:
history = ann.fit(x = X_train_std,
                  y = y_train,
                  validation_split = 0.2,       
                  epochs = 25,                  
                  batch_size = 42, verbose = 2) 
Train on 340 samples, validate on 86 samples
Epoch 1/25
 - 7s - loss: 0.6963 - acc: 0.5265 - val_loss: 0.6666 - val_acc: 0.6512
Epoch 2/25
 - 1s - loss: 0.6749 - acc: 0.6206 - val_loss: 0.6484 - val_acc: 0.6512
Epoch 3/25
 - 1s - loss: 0.6583 - acc: 0.6206 - val_loss: 0.6451 - val_acc: 0.6512
Epoch 4/25
 - 1s - loss: 0.6696 - acc: 0.6206 - val_loss: 0.6412 - val_acc: 0.6512
Epoch 5/25
 - 1s - loss: 0.6660 - acc: 0.6206 - val_loss: 0.6363 - val_acc: 0.6512
Epoch 6/25
 - 1s - loss: 0.6615 - acc: 0.6176 - val_loss: 0.6288 - val_acc: 0.6512
Epoch 7/25
 - 1s - loss: 0.6512 - acc: 0.6206 - val_loss: 0.6131 - val_acc: 0.6512
Epoch 8/25
 - 1s - loss: 0.6218 - acc: 0.6294 - val_loss: 0.5903 - val_acc: 0.6512
Epoch 9/25
 - 1s - loss: 0.6038 - acc: 0.6500 - val_loss: 0.5533 - val_acc: 0.6512
Epoch 10/25
 - 1s - loss: 0.5512 - acc: 0.7265 - val_loss: 0.5024 - val_acc: 0.9419
Epoch 11/25
 - 1s - loss: 0.4905 - acc: 0.8676 - val_loss: 0.4436 - val_acc: 0.9302
Epoch 12/25
 - 1s - loss: 0.4267 - acc: 0.9353 - val_loss: 0.3844 - val_acc: 0.9302
Epoch 13/25
 - 1s - loss: 0.3615 - acc: 0.9500 - val_loss: 0.3324 - val_acc: 0.9302
Epoch 14/25
 - 1s - loss: 0.3259 - acc: 0.9471 - val_loss: 0.2700 - val_acc: 0.9535
Epoch 15/25
 - 1s - loss: 0.2849 - acc: 0.9441 - val_loss: 0.2425 - val_acc: 0.9535
Epoch 16/25
 - 1s - loss: 0.2544 - acc: 0.9529 - val_loss: 0.2027 - val_acc: 0.9651
Epoch 17/25
 - 1s - loss: 0.2208 - acc: 0.9618 - val_loss: 0.1886 - val_acc: 0.9651
Epoch 18/25
 - 1s - loss: 0.2071 - acc: 0.9588 - val_loss: 0.1781 - val_acc: 0.9651
Epoch 19/25
 - 1s - loss: 0.1714 - acc: 0.9676 - val_loss: 0.1908 - val_acc: 0.9535
Epoch 20/25
 - 1s - loss: 0.1834 - acc: 0.9676 - val_loss: 0.1941 - val_acc: 0.9535
Epoch 21/25
 - 1s - loss: 0.1679 - acc: 0.9647 - val_loss: 0.1623 - val_acc: 0.9651
Epoch 22/25
 - 1s - loss: 0.1573 - acc: 0.9706 - val_loss: 0.1583 - val_acc: 0.9651
Epoch 23/25
 - 1s - loss: 0.1436 - acc: 0.9706 - val_loss: 0.1571 - val_acc: 0.9651
Epoch 24/25
 - 1s - loss: 0.1556 - acc: 0.9676 - val_loss: 0.1547 - val_acc: 0.9651
Epoch 25/25
 - 1s - loss: 0.1423 - acc: 0.9706 - val_loss: 0.1520 - val_acc: 0.9651
In [26]:
def show_history(count, history, train, validation):
    plt.plot(range(1, count), history.history[train])
    plt.plot(range(1, count), history.history[validation])
    plt.title('Train history')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.xticks(range(1, count))
    plt.legend(['train', 'validation'], loc='best')
    plt.show()

Because the train line is higher than the validation line. That seems a little overfitting. Therefore, using early stop. Only 25 epochs.

In [27]:
show_history(26, history, 'acc', 'val_acc')

With the same logic, the train line is lowered than the validation line. That seems overfitting. Therefore, early stop and only 25 epochs.

In [28]:
show_history(26, history, 'loss', 'val_loss')

Here is obtained 95.1% accuracy in the test set.

In [29]:
scores = ann.evaluate(x = X_test_std,
                      y = y_test)
print(f"Loss: {scores[0]}, Accuracy: {scores[1]}")
143/143 [==============================] - 0s 532us/step
Loss: 0.1926896571070998, Accuracy: 0.9510489514657667

There has 4 FP and 3 FN in the test set.

In [30]:
y_pred = ann.predict_classes(X_test_std)
y_pred = y_pred.reshape(y_test.shape)
print("Report:\n", metrics.classification_report(y_test, y_pred))
pd.crosstab(y_test, y_pred)
Report:
               precision    recall  f1-score   support

           0       0.93      0.94      0.93        53
           1       0.97      0.96      0.96        90

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143

Out[30]:
col_0 0 1
target
0 50 3
1 4 86
In [ ]:
 

Comments

Popular posts from this blog

Python 日期與時間的處理

Visual Basic 6.0 (VB6) 程式語言案例學習 (10. 條碼列印程式)

寫作:波蘭文學習之旅:1-1. 波蘭文字母與發音(注音版)

Python 日期與時間的處理

Image

Visual Basic 6.0 (VB6) 程式語言案例學習 (10. 條碼列印程式)

Image

寫作:波蘭文學習之旅:1-1. 波蘭文字母與發音(注音版)

Image

數位影像處理:最佳化處理策略之快速消除扭曲演算法

Image

Visual Basic .Net (VB.Net) 程式語言案例學習 (06. 題庫測驗系統)

Image

修復損毀的 SQLite DB 資料庫

Image

用10種程式語言做影像二值化(Image binarization)

Visual Basic 6.0 (VB6) 程式語言案例學習 (04. 人事考勤管理系統)

Image

Visual Basic 6.0 (VB6) 程式語言案例學習 (07. 收據列印程式)

Image

Visual Basic .Net (VB.Net) 程式語言案例學習 (03. 場地預約系統)

Image