%%html
<!-- The customized css for the slides -->
<link rel="stylesheet" type="text/css" href="../styles/python-programming-introduction.css"/>
<link rel="stylesheet" type="text/css" href="../styles/basic.css"/>
<link rel="stylesheet" type="text/css" href="../../assets/styles/basic.css" />
# Install the necessary dependencies

import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython tensorflow

43.15. Neural Network#

43.15.1. Introduction#

  • In fact, Logistic Regression (which we learned in our last session) is the simplest form of a Neural Network; artificial neural networks can be viewed as an extension of Logistic Regression (see the sketch after this list)

  • Logistic Regression: produces decision boundaries that are straight lines

  • Neural Networks: can generate more complex decision boundaries

  • (Deep) Neural Networks: a universal approximator!

  • In this session, we will learn to use TensorFlow Keras for digit recognition
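
To make the connection concrete, here is a minimal sketch (not part of the original notebook) showing that binary logistic regression can be written as a Keras model with a single Dense unit and a sigmoid activation, and that adding a hidden layer turns it into a small neural network with a non-linear decision boundary. The feature dimension n_features is a hypothetical placeholder.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features = 20  # hypothetical input dimension, for illustration only

# logistic regression: one dense unit with a sigmoid -> linear decision boundary
logreg = Sequential([Dense(1, activation='sigmoid', input_dim=n_features)])

# a small neural network: a hidden ReLU layer allows a non-linear boundary
mlp = Sequential([
    Dense(16, activation='relu', input_dim=n_features),
    Dense(1, activation='sigmoid'),
])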

43.15.2. Importing the libraries#

%matplotlib inline
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # plotting library
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras import backend as K

43.15.3. Importing the dataset#

  • MNIST is a collection of handwritten digits ranging from the number 0 to 9.

  • It has a training set of 60,000 images and a test set of 10,000 images, each classified into its corresponding category or label.

# import dataset
from tensorflow.keras.datasets import mnist

# load dataset
(x_train, y_train),(x_test, y_test) = mnist.load_data()
# count the number of unique train labels
unique, counts = np.unique(y_train, return_counts=True)
print("Train labels: ", dict(zip(unique, counts)))

# count the number of unique test labels
unique, counts = np.unique(y_test, return_counts=True)
print("\nTest labels: ", dict(zip(unique, counts)))

43.15.4. Data visualization#

  • Let’s sample 25 random MNIST digits from the training set and visualize them.

# sample 25 mnist digits from train dataset
indexes = np.random.randint(0, x_train.shape[0], size=25)
images = x_train[indexes]
labels = y_train[indexes]

# plot the 25 mnist digits
plt.figure(figsize=(5,5))
for i in range(len(indexes)):
    plt.subplot(5, 5, i + 1)
    image = images[i]
    plt.imshow(image, cmap='gray')
    plt.axis('off')
    
plt.show()

43.15.5. Designing model architecture using Keras#

43.15.6. Import Keras layers#

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.utils import to_categorical, plot_model

43.15.7. Compute the number of labels#

num_labels = len(np.unique(y_train))

43.15.8. One-Hot Encoding#

  • At this point, the labels are in digits format, 0 to 9.

  • A more suitable format is a one-hot vector: a 10-dimensional vector that is all zeros except for a 1 at the index of the digit class.

  • For example, if the label is 2, the equivalent one-hot vector is [0,0,1,0,0,0,0,0,0,0]. The first label has index 0.

# convert to one-hot vector
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

43.15.9. Data Preprocessing#

# image dimensions (assumed square)
image_size = x_train.shape[1]
input_size = image_size * image_size
input_size
# resize and normalize
x_train = np.reshape(x_train, [-1, input_size])
x_train = x_train.astype('float32') / 255
x_test = np.reshape(x_test, [-1, input_size])
x_test = x_test.astype('float32') / 255

43.15.10. Setting network parameters#

  • batch_size is the number of samples used for each update of the model parameters.

  • hidden_units is the number of units in each hidden layer.

  • dropout is the dropout rate (a regularization technique that helps prevent overfitting).

# network parameters
batch_size = 128
hidden_units = 256
dropout = 0.45

43.15.11. Designing the model architecture#

# model is a 3-layer MLP with ReLU and dropout after each hidden layer
model = Sequential()
model.add(Dense(hidden_units, input_dim=input_size))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(hidden_units))
model.add(Activation('relu'))
model.add(Dropout(dropout))
model.add(Dense(num_labels))
model.add(Activation('softmax'))

43.15.12. View model summary#

  • The Keras library provides the summary() method to inspect the model description.

model.summary()
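
Since plot_model was imported earlier, the architecture can also be rendered as a diagram. This is an optional sketch; it assumes the pydot and graphviz packages are available in the environment.

# optional: save the architecture as an image (requires pydot and graphviz)
plot_model(model, to_file='mlp-mnist.png', show_shapes=True)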

43.15.13. How big is our model (number of parameters)?#

  • From the input to the first Dense layer: 784 × 256 + 256 = 200,960.

  • From the first Dense layer to the second Dense layer: 256 × 256 + 256 = 65,792.

  • From the second Dense layer to the output layer: 256 × 10 + 10 = 2,570.

  • The total is 200,960 + 65,792 + 2,570 = 269,322.
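
As a quick sanity check, the same counts can be recomputed from the variables defined above and compared against the total Keras reports; this is just an illustrative sketch.

# recompute the parameter counts layer by layer
hidden1 = input_size * hidden_units + hidden_units    # 784*256 + 256 = 200,960
hidden2 = hidden_units * hidden_units + hidden_units  # 256*256 + 256 = 65,792
output = hidden_units * num_labels + num_labels       # 256*10 + 10 = 2,570
print(hidden1 + hidden2 + output)  # 269,322
print(model.count_params())        # should report the same total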

43.15.14. Compile the model with compile() method#

model.compile(loss='categorical_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])

43.15.15. Loss function (categorical_crossentropy)#

  • How far the predicted tensor is from the one-hot ground truth vector is called loss.

  • In this example, we use categorical_crossentropy as the loss function. It is the negative of the sum of the product of the target and the logarithm of the prediction.

  • There are other loss functions in Keras, such as mean_absolute_error and binary_crossentropy. The choice of loss function is not arbitrary; it should match the criterion that the model is learning to optimize.

  • For classification by category, categorical_crossentropy or mean_squared_error is a good choice after the softmax activation layer. The binary_crossentropy loss function is normally used after the sigmoid activation layer while mean_squared_error is an option for tanh output.
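
For intuition, here is a small numpy illustration (not part of the original notebook) of categorical cross-entropy for a single sample: with a one-hot target, the loss reduces to the negative log of the probability assigned to the true class. The prediction vector below is hypothetical.

# one-hot ground truth for digit 2 and a hypothetical softmax output
target = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
prediction = np.array([0.02, 0.03, 0.80, 0.02, 0.02, 0.02, 0.03, 0.02, 0.02, 0.02])

# categorical cross-entropy: -sum(target * log(prediction))
loss = -np.sum(target * np.log(prediction))
print(loss)  # -log(0.80), roughly 0.223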

43.15.16. Optimization (optimizer adam)#

  • With optimization, the objective is to minimize the loss function. The idea is that if the loss is reduced to an acceptable level, the model has indirectly learned the function mapping input to output.

  • In Keras, there are several choices for optimizers. The most commonly used optimizers are Stochastic Gradient Descent (SGD), Adaptive Moments (Adam), and Root Mean Squared Propagation (RMSprop).

  • Each optimizer features tunable parameters like learning rate, momentum, and decay.

  • Adam and RMSprop are variations of SGD with adaptive learning rates. In the proposed classifier network, Adam is used since it has the highest test accuracy.
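
If finer control over these parameters is needed, an optimizer instance can be passed to compile() instead of the string name. The sketch below uses 0.001, which is simply Keras's default learning rate for Adam.

# equivalent to optimizer='adam', but with the learning rate made explicit
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['accuracy'])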

43.15.17. Metrics (accuracy)#

  • Performance metrics are used to determine if a model has learned the underlying data distribution. The default metric in Keras is loss.

  • During training, validation, and testing, other metrics such as accuracy can also be included.

  • Accuracy is the percent, or fraction, of correct predictions based on ground truth.
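
As a tiny illustration of the metric itself (toy numbers, unrelated to MNIST), accuracy is simply the fraction of predictions that match the ground truth.

# toy example: accuracy as the fraction of matching predictions
y_true = np.array([3, 8, 8, 0, 2])
y_pred = np.array([3, 8, 1, 0, 2])
print(np.mean(y_true == y_pred))  # 0.8, i.e. 80% correct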

43.15.18. Train the model with fit() method#

model.fit(x_train, y_train, epochs=20, batch_size=batch_size)
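
Optionally, part of the training data can be held out to monitor generalization during training. This is a variation on the call above, not what the original notebook does.

# optional variation: hold out 10% of the training data for validation
history = model.fit(x_train, y_train, epochs=20, batch_size=batch_size,
                    validation_split=0.1)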

43.15.19. Evaluating model performance with evaluate() method#

loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print("\nTest accuracy: %.1f%%" % (100.0 * acc))

43.15.20. Neural Network from scratch#

  • It’s for tomorrow!

43.15.21. Acknowledgments#

Thanks to PRASHANT BANERJEE for creating the open-source Kaggle Jupyter notebook, licensed under Apache 2.0. It inspired the majority of the content of these slides.