Hotdog or Not Hotdog: Transfer learning in PyTorch

6 minute read

Transfer learning is a useful approach in deep learning: we take an existing model, with pre-trained weights, and repurpose it for another task. Often the domains are similar enough that much of what the network has already learned carries over. This is most useful when we don't have a lot of training data - the pre-trained model will typically have been trained on a very large dataset for a very long time, so its feature extraction abilities in the lower layers are already well optimized. All we have to do is re-train the last few layers so that, using the features the lower layers already extract, we can classify the images of our choosing.

In this example, we’ll use a pre-trained image classifier, and we’ll re-train the last few linear layers using a smaller dataset. The problem here is simpler (two categories, instead of the usual 1000 for ImageNet). Let’s dive in.

Imports

First we import the necessary packages:

import torch
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from torch.autograd import Variable

import torchvision

import os
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook

Loading and editing the VGG19 model

I’ll be using VGG19 as the base for this exercise. I guess you could choose another model, or try a few to see if others work better. The original paper for VGG19 is available here.

To load the model, simply use the torchvision package:

model = torchvision.models.vgg19(pretrained=True)
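
As an aside, if you do want to experiment with a different backbone, the same torchvision call pattern should work. Here's a rough sketch using resnet18 purely as an example - note that ResNets expose their final layer as model.fc rather than model.classifier[6], so the layer-replacement step described below would target that attribute instead:

# Sketch only: loading an alternative pretrained backbone (resnet18 as an example).
# ResNets keep their final fully-connected layer in `model.fc`, so the
# layer-replacement step later in this post would target `model.fc` instead.
alt_model = torchvision.models.resnet18(pretrained=True)
print(alt_model.fc)  # a Linear layer mapping 512 features to the 1000 ImageNet classes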

After loading the model, the first thing to do is set the requires_grad flag to False so that the existing weights will not be updated. Since we're doing transfer learning, we don't need to re-train the weights in the lower layers. I also print the model so you can see the structure of the entire network.

for param in model.parameters():
    param.requires_grad = False
print(model)

And the output from the print() command:

>>>	VGG (
	  (features): Sequential (
	    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (1): ReLU (inplace)
	    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (3): ReLU (inplace)
	    (4): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
	    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (6): ReLU (inplace)
	    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (8): ReLU (inplace)
	    (9): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
	    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (11): ReLU (inplace)
	    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (13): ReLU (inplace)
	    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (15): ReLU (inplace)
	    (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (17): ReLU (inplace)
	    (18): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
	    (19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (20): ReLU (inplace)
	    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (22): ReLU (inplace)
	    (23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (24): ReLU (inplace)
	    (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (26): ReLU (inplace)
	    (27): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
	    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (29): ReLU (inplace)
	    (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (31): ReLU (inplace)
	    (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (33): ReLU (inplace)
	    (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
	    (35): ReLU (inplace)
	    (36): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
	  )
	  (classifier): Sequential (
	    (0): Linear (25088 -> 4096)
	    (1): ReLU (inplace)
	    (2): Dropout (p = 0.5)
	    (3): Linear (4096 -> 4096)
	    (4): ReLU (inplace)
	    (5): Dropout (p = 0.5)
	    (6): Linear (4096 -> 1000)
	  )
	)

Inspecting the model details, you can see above that the final linear layer takes an input of size 4096 and outputs a vector of size 1000 - this corresponds to the 1000 categories in ImageNet. For our purposes, we want a layer that outputs a vector of size 2: hotdog or not_hotdog. To do this, we replace model.classifier[6] with a new nn.Linear layer (simply changing the out_features attribute isn't enough, since the underlying weight matrix would keep its original shape).

Next, we have to ensure that the classifier layers will be trained, so we set their requires_grad flags to True.

Third, we initialize the weights of the classifier's linear layers using the He normal (kaiming) method. We make use of the handy model.apply() method, supplying it with a function that initializes the weights of the Linear layers only (the Dropout and ReLU layers have no weights to initialize).

In the last step, we move the entire model to the GPU for training.

# Replace the final layer so it outputs 2 classes instead of 1000
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)

# Set requires_grad to True on the linear layer
for param in model.classifier.parameters():
    param.requires_grad = True

# Initialize the weights
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Linear') != -1:
        nn.init.kaiming_normal(m.weight.data)
        
model.classifier.apply(weights_init);

# Move the model to the GPU
model = model.cuda()
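
As an optional sanity check (not part of the original recipe), we can count how many parameters will actually receive gradients and confirm that only the classifier is trainable:

# Optional sanity check: only the classifier parameters should require gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print('Trainable parameters: {} of {}'.format(trainable, total))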

Loading the data

The data I’m using to train this model is available from here. There are 998 images, 500 in the test set and 498 in the training set. We’ll use PyTorch’s ready-made ImageFolder class from the torchvision.datasets package to load these images on the fly.

First, we define the data transforms. These are used by the dataset class to transform images on-the-fly. For training, we take random crops of size 224x224 and apply random horizontal flipping. The images are then converted to PyTorch Tensors. For the test set, we scale the images so the shorter side is 256 pixels and take a center crop of the same size as the training crops.

data_transforms = {'train':
                    torchvision.transforms.Compose([
                    torchvision.transforms.RandomSizedCrop(224),
                    torchvision.transforms.RandomHorizontalFlip(),
                    torchvision.transforms.ToTensor()]),
                   'test':
                    torchvision.transforms.Compose([
                    torchvision.transforms.Scale(256),
                    torchvision.transforms.CenterCrop(224),
                    torchvision.transforms.ToTensor()])
                  }
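
One caveat worth flagging (it isn't in the pipeline above): torchvision's pretrained models were trained on inputs normalized with the ImageNet channel statistics, so appending a Normalize transform after ToTensor() may improve results. Something along these lines:

# Optional: normalization with the ImageNet statistics the pretrained weights expect.
# This would be appended after ToTensor() in each Compose() above.
normalize = torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                             std=[0.229, 0.224, 0.225])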

Now we can create the datasets. It’s worth noting here that there are train and test sets in the original dataset but I haven’t used the test set explicitly. Of course, you should use it to validate that your model is working.

image_dataset = {x: torchvision.datasets.ImageFolder(os.path.join('./data/', x), data_transforms[x]) for x in ['train', 'test']}
data_loader = {x: DataLoader(image_dataset[x], batch_size=4, shuffle=True, num_workers=4) for x in ['train', 'test']}

Just for fun, we can quickly plot a few images and see what we’re dealing with:

def imshow(imgs, title=None):
    """Imshow for Tensor."""
    imgs = imgs.numpy().transpose((1, 2, 0))
    plt.imshow(imgs)
    if title is not None:
        plt.title(title)
    


# Get the class names from the dataset (used for the plot title below)
class_names = image_dataset['train'].classes

# Get a batch of training data
inputs, classes = next(iter(data_loader['train']))

# Make a grid from batch
imgs = torchvision.utils.make_grid(inputs)

imshow(imgs, title=[class_names[x] for x in classes])
[Figure: matplotlib output showing a grid of sample training images.]

Training the model

Now we define the optimizer and the loss function. Note that we specify which parameters are being optimized - in this case, all the layers within the classifier.

optimizer = optim.SGD(model.classifier.parameters(), lr=0.001, momentum=0.9, nesterov=True, weight_decay=1e-6)
criterion = nn.CrossEntropyLoss()

Finally, we’re ready to begin training the model. I use the tqdm package to generate handy progress bars. I’ll train for 300 epochs. I’m using an Nvidia GTX1080Ti so this runs at ~6 seconds/epoch, or about 30 minutes. YMMV.

epochs = 300
dataset_sizes = {x: len(image_dataset[x]) for x in ['train', 'test']}

with tqdm_notebook(total=epochs, unit="epoch") as pbar:
    for epoch in range(epochs):
        running_loss = 0
        running_corrects = 0
        for i, data in enumerate(data_loader['train']):
            inputs, labels = data
            inputs = Variable(inputs.cuda())
            labels = Variable(labels.cuda())


            optimizer.zero_grad()
            outputs = model(inputs)
            preds = torch.max(outputs.data, 1)[1]
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # Scale the batch-mean loss by the batch size so the epoch loss is per-image
            running_loss += loss.data[0] * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)
        epoch_loss = running_loss / dataset_sizes['train']
        epoch_acc = running_corrects / dataset_sizes['train']
        pbar.set_postfix(loss=epoch_loss, acc=epoch_acc)
        pbar.update()

I manage to get into the ~0.95 accuracy range within about 150 epochs and loss decreases a bit from there. I’m sure there is a bit more performance to be gained from tuning the hyperparameters. Of course, I’ll need to validate using the held-out test set before I know the true performance, but it’s not like this is going to be pitched to Bream-Hall so I’ll leave that up to Jian-Yang.
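
For completeness, here's a rough sketch of what a validation pass over the held-out test set might look like, written in the same (pre-0.4) PyTorch style as the rest of this post - treat it as a starting point rather than a definitive evaluation script:

# Rough sketch of a validation pass over the test set, using the loaders defined above
model.eval()  # put the Dropout layers into evaluation mode
test_corrects = 0
for inputs, labels in data_loader['test']:
    inputs = Variable(inputs.cuda(), volatile=True)  # inference only, no gradients
    labels = labels.cuda()
    outputs = model(inputs)
    preds = torch.max(outputs.data, 1)[1]
    test_corrects += torch.sum(preds == labels)
print('Test accuracy: {:.3f}'.format(test_corrects / dataset_sizes['test']))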