How To Classify Images With Smaller Datasets

Brendan Hoss
9 min read · Feb 4, 2021

Objectives

  • Collect & Augment Data
  • Create And Tune A Model
  • Visualize Data

The Data

The first step in any data science task is finding and collecting data. This may seem easy enough for large-scale projects, or if you work in a corporate setting with access to large quantities of data. Outside of that realm, finding data isn't always as easy as it seems. Luckily for us, there are innovative and strategic ways to maneuver around this issue. My personal favorite is web scraping. There is a handful of easy-to-use web scrapers at your disposal; my two favorites are Beautiful Soup and Selenium. Beautiful Soup is more tailored towards retrieving data from HTML and XML files. If you want to learn more about Beautiful Soup you can click here. The web scraper I used most recently was Selenium. I was able to follow this Github repo to collect large amounts of images to be used within my image classification problem. The process involved downloading a Google Chrome extension and pip installing selenium through my local machine's command line. After that, all you need to do is open the Python script in the downloaded folder and go to line 105. There you will find a variable called queries, and you can change its text string to your liking to yield results.

The data collection process is typically one of the more challenging parts of building a usable model. I used this scraper to build a directory for my data. Within this directory I used specific queries to sort the downloaded images into their corresponding folders, initially based on criteria such as elementary-level versus high-school-level artwork. After running a few different queries I was able to amass roughly 1,300 photos. These photos then needed to be sorted into more specific directories labeled 1–5 to help with the image classification. More diversity between the sub-directories allowed the model to understand the differences between the artwork clearly. This is the format I used to set up my directories (a sketch of the layout follows the sample images below).

[Image: the directory structure contained within each folder]
[Image: within each of the 1–5 folders are images sorted by skill level]
[Image: sample images from folder 1]
[Image: images from folder 5, showing the contrast with folder 1]
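
To make that layout concrete, here is a minimal sketch of how the Training and Validation directories could be mirrored and how roughly 10% of each class could be set aside for validation. The folder names and the move_validation_split helper are my own assumptions for illustration, not part of the original scraper.

import os
import random
import shutil

def move_validation_split(base_dir='Data', val_fraction=0.1, seed=42):
    # Assumes Data/Training/1 ... Data/Training/5 already exist and are populated
    random.seed(seed)
    train_dir = os.path.join(base_dir, 'Training')
    val_dir = os.path.join(base_dir, 'Validation')
    for label in ['1', '2', '3', '4', '5']:
        src = os.path.join(train_dir, label)
        dst = os.path.join(val_dir, label)
        os.makedirs(dst, exist_ok=True)
        images = os.listdir(src)
        # move roughly val_fraction of each class into the mirrored Validation folder
        holdout = random.sample(images, int(len(images) * val_fraction))
        for name in holdout:
            shutil.move(os.path.join(src, name), os.path.join(dst, name))

move_validation_split()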

Finally, we have the data cleaned into categorical folders and have amassed enough of a dataset to begin working on a model. But there is one last step we need to take before we jump into creating the model, and that is using data augmentation to stretch our small dataset of roughly 1,300 images into something more realistic in size for the model to learn from. The idea behind data augmentation is that you can take a single image and transform it into a series of altered images, giving the model other perspectives on what it is trying to understand. For this project I relied on the Keras and TensorFlow libraries, which made creating a series of augmented images very easy to understand and perform.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    width_shift_range=.2,
    height_shift_range=.2,
    rotation_range=45,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='constant',
    cval=125
)
validation_datagen = ImageDataGenerator(rescale=1./255)

Above is the ImageDataGenerator (IDG) I created for my training data. Remember, we sectioned off 10% of our training data into the 'Validation' folder so we had images to validate on. What this IDG does is take an image, rescale its pixel values into the 0–1 range, and apply random shifts, rotations, shears, zooms, and flips before the image is fed into the model. For our validation datagen we only need to rescale the pixel values the same way as the training set; the validation images are otherwise left untouched. Once we have instantiated our train and validation datagens, we can move on to passing them through the directories mentioned above containing the subclasses of images.

Train = r'story-squad-ds-b\Image_scoring\Data\Training'
Vali = r'story-squad-ds-b\Image_scoring\Data\Validation'

train_generator = train_datagen.flow_from_directory(
    Train,  # this is the input directory
    target_size=(224, 224),  # all images will be resized
    color_mode='rgb',
    batch_size=32,
    class_mode='sparse',
    shuffle=True,
    seed=42,
    interpolation="bilinear",
    follow_links=False,
)

# the validation generator will be similar in style to the train generator
validation_generator = validation_datagen.flow_from_directory(
    Vali,
    target_size=(224, 224),
    color_mode='rgb',
    batch_size=32,
    class_mode='sparse',
    shuffle=True,
    seed=42,
    interpolation="bilinear",
    follow_links=False,
)

Keras has a built-in function, .flow_from_directory, which allows you to pull each individual image from its corresponding directory and immediately apply data augmentation without having to save the newly created images to your local machine. This is a critical feature, as it allows a smaller dataset to be instantly multiplied in size without taking up a massive amount of local space. We call this function on our train and validation data generators; this stacking allows us to pass in our data while adjusting the target size and batch size, two critical hyperparameters that will help us keep the model in line later on.
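
As a quick sanity check (and a first pass at the "Visualize Data" objective), here is a minimal sketch of how you could pull one batch from train_generator and eyeball the augmented images. The matplotlib grid layout is my own assumption, not part of the original project.

import matplotlib.pyplot as plt

# grab one batch of augmented images and their sparse labels from the generator
images, labels = next(train_generator)

fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for ax, img, label in zip(axes.ravel(), images, labels):
    ax.imshow(img)  # pixel values are already rescaled to [0, 1]
    ax.set_title(f'class {int(label)}')
    ax.axis('off')
plt.show()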

The Model

Another issue we face with such small datasets is that there just isn't enough data to make a deep model usable from scratch. These massive models expect tens of thousands of examples to help them narrow down their predictions. One solution to this problem is taking a pretrained model and tuning it to better fit your specific needs, which is the approach I took on this project.

import tensorflow as tf
from tensorflow import keras

# load the pretrained base model
base_model = keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# freeze the base model
base_model.trainable = False

# build the full model on top of the frozen base
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.6),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(5, activation='softmax')
])

# compile the model
model.compile(
    optimizer='adam',  # rmsprop and sgd are also worth trying
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

The model I created builds on Keras' pre-trained ResNet50, a deep learning model that has already been trained on a massive selection of images for classification problems. After importing that model, we build on top of it by adding a few more layers consisting of Flatten, BatchNormalization, Dense, and Dropout. The Flatten layer collapses the multi-dimensional output of the base model into a single vector. We then use BatchNormalization to help with any gradient issues that arise. The Dense layer is your regular densely connected neural network layer, and it is often the most used layer in any model. The Dropout layer helps protect against overfitting by randomly setting input units to 0 during training, forcing the model to adjust to the changes. The last layer in the model is a Dense layer; since we are working with 5 separate classification groups, we tell the model to output 5 class probabilities.

Once your model is complete with the layers of your choosing, it's time to compile it. During compiling there are three major parameters to account for: the optimizer, the loss function, and the metric to use for scoring. I found that Adam worked consistently well as my optimizer, with rmsprop performing equally well; I didn't see as good an accuracy score when using SGD. My loss function had to be sparse categorical crossentropy, as my dataset was structured with integers as the folder names and the labels were not one-hot encoded the way categorical crossentropy expects. Finally, I used the accuracy metric to show how well the model was performing over time as it continued to learn. Before fitting, you can quickly inspect the architecture with a summary, shown below; then the time has come to fit the model and tie together everything we have been working on.
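
Since the base model is frozen, it can be worth confirming that only the new layers are trainable before kicking off training. A minimal check using standard Keras calls, nothing project-specific:

model.summary()  # prints each layer with its output shape and parameter count
print(f'Trainable weight tensors: {len(model.trainable_weights)}')
print(f'Non-trainable weight tensors: {len(model.non_trainable_weights)}')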

model1 = model.fit(
    train_generator,
    steps_per_epoch=len(train_generator),
    epochs=50,
    validation_data=validation_generator,
    validation_steps=len(validation_generator),
    callbacks=[earlyStopping, mcp_save, reduce_lr_loss])

We are left with the last few steps to get our model up and running. We use Keras' built-in fit function on our model, which lets us pass in the train_generator created above (our automatic data augmenter) as the training data while pulling from our sorted directories. Next we set our steps per epoch equal to the length of the train_generator, which Keras computes as the number of samples divided by the batch size; this ensures we take the correct number of steps within each epoch without overfitting or running out of data. We mirror the idea for the validation data and finally finish it off with three callbacks we created. To sum them up shortly: the first stops training if the model sees no significant change in the desired metric, which for us was val_accuracy; the second creates an automatic checkpoint that saves the model from the best epoch found; and the last reduces the learning rate when a metric has stopped improving. We can now run this model and get a result from it.
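
The fit call above references earlyStopping, mcp_save, and reduce_lr_loss without showing how they were built. Here is a minimal sketch of how the three callbacks described above could be defined with standard Keras classes; the patience values, the checkpoint filename, and the learning-rate factor are my own assumptions, not values from the original project.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# stop training when val_accuracy stops improving (patience is an assumed value)
earlyStopping = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)

# save a checkpoint whenever the best val_accuracy so far is beaten (filename assumed)
mcp_save = ModelCheckpoint('best_model.h5', monitor='val_accuracy', save_best_only=True)

# lower the learning rate when validation loss plateaus (factor/patience assumed)
reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5)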

Performance & Visualizations

Although we created a substantial model and used data augmentation, we ultimately run into the problem of not having enough data to feed it. This left us with better-than-guessing accuracy, but not something you would be confident to bet on. Ultimately, the model peaked around 40% validation accuracy, though it did show a steady downward trend in validation loss.
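
Since model.fit returns a History object (captured above as model1), the accuracy and loss curves can be plotted directly from it. A minimal sketch, assuming the default metric names produced by compiling with metrics=['accuracy']:

import matplotlib.pyplot as plt

history = model1.history  # dict of per-epoch metrics recorded by fit

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history['accuracy'], label='train accuracy')
ax1.plot(history['val_accuracy'], label='validation accuracy')
ax1.set_xlabel('epoch')
ax1.legend()

ax2.plot(history['loss'], label='train loss')
ax2.plot(history['val_loss'], label='validation loss')
ax2.set_xlabel('epoch')
ax2.legend()
plt.show()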

About

I was tasked by the app Story Squad to help them create the project mentioned throughout this tutorial. It was created to cluster children's hand-drawn images into ranked classes, allowing the kids to be paired based on skill. During my time with the team, the product I created allows them to accept user images and rank them into teams based on skill. Going into this project I did a fair share of research into people trying to cluster art images by rank, and it seemed like uncharted waters; I was concerned that if I got stuck I wouldn't have many resources to lean on. Thankfully, I was able to piece together small snippets of information scattered around to create a working model.

The largest challenge I ran into when creating this model was collecting the data. This became apparent very quickly after hopelessly searching for art datasets that had been ranked. To add to that, these were not going to be incredibly high-level pieces of art, as the app is focused on 8–12 year olds. Learning to use the Selenium web scraper gave me an outlet for my problem: I could now create my own dataset and populate it just how I envisioned. Moving forward with this model, I hope to collect a larger supply of images and filter them accordingly. On a larger scale I can then see where my model is slipping and fine-tune it further. During the course of creating the model I ran into a few issues that I discussed with my peers, leading me to rethink and change my approach to things I was doing incorrectly. This furthered my career goals not just in education but in communication and listening. Thank you to Story Squad for inviting me, and to all my peers who have helped guide me along the way.
