Automating Classification Tasks My Way
I’m working on a project that requires training a machine learning model, and I have 50,000 image files. I have to categorize and annotate these images and get them ready in two weeks. Doing that by hand would mean sacrificing my sleep and other things I have to do, so I went searching for machine learning models that have already been trained for this. I found one implementation with the final model saved in an .h5 file. It doesn’t have all the classes I want, but it goes a long way toward cutting the manual work in half.
Steps that I took
Step One: Reading through code implementation
I read through the implementation to understand how the model works. Everything about the model, including its training, was in the repository. I checked for a requirements.txt file, but there was none, as it is a relatively simple model.
Step Two: Importing the files
The files were in a zipped folder, so I opened Google Colab and uploaded the zip file from my computer to the notebook.
Step Three: Unzipping, loading and preprocessing files
The next thing I did was to unzip the file.
import zipfile
import os
import tensorflow as tf

data_zip = "the-zipped-file"
extract_path = "extraction-path"

# Function to extract zip file
def unzip_file(zip_path, extract_path):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

# Unzip the file
unzip_file(data_zip, extract_path)
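Before moving on, it can help to confirm that the extraction produced what you expect. The sketch below is not from the repository; it is a small helper of my own naming (count_images) that counts the extracted image files by extension, assuming the same three extensions the classification step accepts later.

```python
from collections import Counter
from pathlib import Path

# Assumption: the same extensions the classification step filters on later.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}

def count_images(folder):
    """Count image files under `folder` (recursively), grouped by lowercase extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            counts[path.suffix.lower()] += 1
    return counts
```

Calling count_images(extract_path) after unzipping gives a quick tally to compare against the number of files you expected to upload.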
I then wrote a script to preprocess the images, using the code snippet from the repository itself.
IMG_SIZE = 224

def process_image(image_path, img_size=IMG_SIZE):
    """
    Take an image file path and turn image into a Tensor.
    Returns None if the image cannot be read or decoded.
    """
    try:
        image = tf.io.read_file(image_path)  # Read image file
        image = tf.image.decode_jpeg(image, channels=3)  # Turn the image into 3 channels RGB
        image = tf.image.convert_image_dtype(image, tf.float32)  # Turn the value 0-255 to 0-1
        image = tf.image.resize(image, size=[img_size, img_size])  # Resize the image to 224x224
        return image  # Return the image
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return None
I added exception handling in case there were any empty or corrupted images.
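Empty or obviously wrong files can also be screened out before they reach TensorFlow. The following is only a sketch under my own assumptions: looks_like_image is a hypothetical helper, and it checks just the first few bytes of the file against the standard JPEG and PNG signatures. A truncated file can still pass this check, so the try/except in process_image remains necessary.

```python
import os

# Leading bytes ("magic numbers") that identify JPEG and PNG files.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_image(path):
    """Return True if the file is non-empty and starts with a JPEG or PNG signature."""
    if os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        header = f.read(8)  # the PNG signature is 8 bytes long, JPEG's is shorter
    return header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)
```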
Step Four: Loading the pretrained model
The next thing I did was to load their pretrained model using TensorFlow (I had to download the model to my local machine first).
import tensorflow as tf
import tensorflow_hub as hub

model_path = "model.h5"

# hub.KerasLayer is not a built-in Keras layer, so it is passed in through
# custom_objects to let load_model rebuild the saved model.
my_model = tf.keras.models.load_model(
    model_path,
    custom_objects={'KerasLayer': hub.KerasLayer}
)
I had to pass hub.KerasLayer through custom_objects because the model contains a TensorFlow Hub layer, which Keras treats as a custom object when loading.
Step Five: Classification
Here is where I performed my classification tasks.
import numpy as np

# List of labels
object_list = ['Label1', 'Label2', 'Label3', 'Label4']
threshold = 0.75
# Folder holding the images to classify (assumed here to be the extraction path)
data = extract_path

# Function to classify images using a pre-trained model
def classify_images(my_model, output_folder, object_list, threshold=0.75):
    misc_folder = os.path.join(output_folder, 'misc')
    os.makedirs(misc_folder, exist_ok=True)
    for label in object_list:
        label_folder = os.path.join(output_folder, label)
        os.makedirs(label_folder, exist_ok=True)
    for filename in os.listdir(data):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            img_path = os.path.join(data, filename)
            img = process_image(img_path)
            if img is None:
                print(f"Skipping {img_path} due to loading error.")
                continue
            img = np.expand_dims(img, axis=0)  # Add batch dimension
            prediction = my_model.predict(img)
            # Pick the label with the highest predicted score
            predicted_label_index = np.argmax(prediction)
            predicted_label = object_list[predicted_label_index]
            # Check if the prediction exceeds the threshold
            if prediction[0][predicted_label_index] >= threshold:
                category_folder = os.path.join(output_folder, predicted_label)
            else:
                category_folder = misc_folder
            destination_path = os.path.join(category_folder, filename)
            os.rename(img_path, destination_path)
I listed the labels associated with the model and set my threshold at 75%. The model predicts a label for each input image; if its certainty is 75% or higher, the image is moved to that label’s folder. If its certainty is below 75%, the image is moved to a folder called ‘misc’. The label folders and ‘misc’ are created if they do not already exist. The function loops through the image folder and only handles files ending in .jpeg, .jpg, or .png. If it reaches a corrupted image file, it skips it and continues until every file has been processed.
# Classify images and move to appropriate folders
classify_images(my_model, "output_folder", object_list, threshold=threshold)
This calls the function ‘classify_images’ with the arguments defined earlier.
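The threshold rule is easier to see in isolation, without the model or the file moves. Here is a minimal sketch in plain Python; choose_folder is a name I made up for illustration, and the probabilities are invented numbers rather than real model output.

```python
def choose_folder(probabilities, labels, threshold=0.75):
    """Return the label with the highest probability, or 'misc' if it is below the threshold."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] >= threshold:
        return labels[best_index]
    return "misc"

labels = ["Label1", "Label2", "Label3", "Label4"]
print(choose_folder([0.05, 0.85, 0.05, 0.05], labels))  # Label2
print(choose_folder([0.40, 0.30, 0.20, 0.10], labels))  # misc
```

Raising the threshold sends more images to ‘misc’ for manual sorting, while lowering it trusts the model more and risks more images landing in the wrong folder.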
Final Step: Zipping and downloading
After ‘classify_images’ finished running, I zipped the output folder using this command:
# Zip the output folder
!zip -r output_folder.zip output_folder
After zipping, I clicked on the three dots beside the file in Colab and selected the download option.
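If you would rather stay in Python than use a shell command, the zipping step can be done with the standard library. This is an alternative sketch, not what I ran; zip_folder is a hypothetical helper built on shutil.make_archive, and it keeps the folder name as the top-level entry in the archive, like zip -r does.

```python
import os
import shutil

def zip_folder(folder):
    """Create <folder>.zip next to `folder` and return the archive's path."""
    folder = os.path.abspath(folder)
    parent, name = os.path.split(folder)
    # root_dir/base_dir keep `name/` as the top-level directory inside the archive.
    return shutil.make_archive(folder, "zip", root_dir=parent, base_dir=name)
```

Calling zip_folder("output_folder") would write output_folder.zip alongside the folder, ready to download in the same way.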
Caveat: You still have to go through the categories manually to make sure the images were sorted correctly.
Even so, this took me far less time than having to build everything from scratch. It saved me time!
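To make that manual check less daunting, one option is to look at the size of each category and spot-check a few random files from it first. The helper below is a sketch of that idea, not part of my original notebook; sample_for_review and its parameters are names I chose for illustration.

```python
import os
import random

def sample_for_review(output_folder, per_category=5, seed=0):
    """Map each category folder to (file count, small random sample of filenames)."""
    rng = random.Random(seed)  # fixed seed so the same sample can be revisited
    report = {}
    for category in sorted(os.listdir(output_folder)):
        category_path = os.path.join(output_folder, category)
        if not os.path.isdir(category_path):
            continue  # ignore stray files at the top level
        files = sorted(os.listdir(category_path))
        sample = rng.sample(files, min(per_category, len(files)))
        report[category] = (len(files), sample)
    return report
```

A category with an unexpectedly large or small count, or a sample full of wrong images, is a sign that the whole folder needs a closer look.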