MLOps - Build an Image Classifier Model
Pneumonia causes over 2.5 million deaths annually worldwide. Traditional diagnosis through chest X-ray analysis is time-consuming and requires expert radiologists. This project aims to develop an automated deep learning system to classify pneumonia from chest X-rays with high accuracy, potentially assisting healthcare professionals in faster diagnosis.
Purpose
Here I will use a local environment with PyTorch and Jupyter Notebook to:
- Demonstrate end-to-end development of a medical image classifier
- Use the Chest X-Ray Images (Pneumonia) dataset from Kaggle
- Create a baseline model for pneumonia detection (Normal vs Pneumonia)
- Use local GPU acceleration for model training and prediction
Typical ML steps to build an image classifier:
- Data Acquisition:
  - Source the Chest X-Ray Images (Pneumonia) dataset from Kaggle
  - 5,863 validated images (Train/Test/Val split)
- Preprocessing:
  - Standardize image size (224x224px)
  - Normalize pixel values
  - Organize into class-specific directories
- Exploratory Analysis:
  - Class distribution visualization
  - Sample image inspection
- Model Development:
  - Leverage pre-trained ResNet18
  - Custom head for binary classification
  - GPU-accelerated training
- Evaluation:
  - Accuracy metrics
  - Model persistence
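The "normalize pixel values" step listed above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the function name `normalize_image` is hypothetical, and the mean and standard deviation are the ImageNet statistics commonly used with pre-trained torchvision models, which is an assumption about how this project normalizes its inputs.

```python
import numpy as np

# ImageNet channel statistics, commonly paired with pre-trained torchvision
# models. Assumption: the post does not state which statistics it uses.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(pixels: np.ndarray) -> np.ndarray:
    """Scale a uint8 HxWx3 image to [0, 1], then standardize each channel."""
    scaled = pixels.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


# Hypothetical usage on a blank 224x224 RGB image.
dummy = np.zeros((224, 224, 3), dtype=np.uint8)
normalized = normalize_image(dummy)
print(normalized.shape)
```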
Implementation Steps
- Get Kaggle Chest X-Ray dataset
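# Prepare the dataset from Kaggle
!pip install -q kaggle
!python -m pip install --upgrade pip
!mkdir -p kaggle
!touch kaggle/kaggle.json
!chmod 600 kaggle/kaggle.json
import json
import os
# Fill in your own Kaggle API key here and keep it out of version control
api_token = {"username":"zhouzack","key":""}
with open('kaggle/kaggle.json','w') as file:
    json.dump(api_token,file)
# The Kaggle CLI looks for kaggle.json in ~/.kaggle unless KAGGLE_CONFIG_DIR points elsewhere
os.environ['KAGGLE_CONFIG_DIR'] = os.path.abspath('kaggle')
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia --force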
Output:
Dataset URL: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
License(s): other
Downloading chest-xray-pneumonia.zip to /workspace
100%|██████████████████████████████████████| 2.29G/2.29G [03:32<00:00, 11.4MB/s]
100%|██████████████████████████████████████| 2.29G/2.29G [03:32<00:00, 11.6MB/s]
# Extract the zip file to the "data" directory
import zipfile
import os
# Create the "data" directory in the current folder if it doesn't exist
os.makedirs('./data', exist_ok=True)
# Note that the class name is ZipFile (capital F), not Zipfile
with zipfile.ZipFile('chest-xray-pneumonia.zip', 'r') as zip_ref:
    zip_ref.extractall('./data')  # Extract to ./data (relative path)
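After extraction, it can help to confirm how many images landed in each split and class folder. The following is a small sketch under the assumption that the archive unpacks to `<root>/<split>/<class>/*.jpeg`, as the paths used later in this post suggest; the helper name `count_images` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_images(root: str) -> Counter:
    """Count .jpeg files per (split, class) under root/<split>/<class>/."""
    counts = Counter()
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing extracted yet, so report an empty count.
        return counts
    for path in root_path.glob("*/*/*.jpeg"):
        split, label = path.parts[-3], path.parts[-2]
        counts[(split, label)] += 1
    return counts


# Prints one line per (split, class) pair found under the extracted dataset.
for (split, label), n in sorted(count_images("./data/chest_xray").items()):
    print(f"{split:5s} {label:10s} {n}")
```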
- Test a random image from folder
# Show a random image from a dataset folder
import glob
import random
import matplotlib.pyplot as plt
def get_random_image(split, condition):
    # Map the short condition code to the class folder name
    if condition == 'n':
        placeholder = 'NORMAL'
    elif condition == 'p':
        placeholder = 'PNEUMONIA'
    else:
        raise ValueError("Sorry, invalid condition: use 'n' or 'p'")
    folder = f'./data/chest_xray/{split}/{placeholder}/*.jpeg'
    img_paths = glob.glob(folder)
    # Pick one path uniformly at random
    item = random.choice(img_paths)
    print(item)
    image = plt.imread(item)
    return plt.imshow(image, cmap='gray')
get_random_image("val","n")
- Image processing: load an image and print its format, size, and mode; convert it to RGB; then display it with Matplotlib at a larger figure size.
# Load the image and print its format, size, and mode
from PIL import Image
# Replace 'path/to/your/image.jpg' with your actual image file path
image = Image.open('./data/chest_xray/val/PNEUMONIA/person1947_bacteria_4876.jpeg')
print(image.format)
print(image.size)
print(image.mode)
Output:
JPEG
(1152, 664)
L
# Convert the image to RGB (Red, Green, Blue) format.
# The image inspected above reports mode "L" (8-bit grayscale), so the source here is grayscale rather than RGBA.
import PIL.Image
gray_image = PIL.Image.open('./data/chest_xray/val/NORMAL/NORMAL2-IM-1436-0001.jpeg')
rgb_image = gray_image.convert('RGB')
# Reads an image using Matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# Provide the correct path to your image file
# Replace with your actual image path
img = mpimg.imread('./data/chest_xray/val/NORMAL/NORMAL2-IM-1436-0001.jpeg')
# Display the image
plt.figure(figsize=(10,8)) # Optional: set figure size
imgplot = plt.imshow(img)
plt.axis('off') # Optional: hide axes
plt.show()
- Resize the validation images to 224x224 and save them in the val folder with val_pneumonia and val_normal filename prefixes
# Resize images and save them in the split folder with a class prefix
import glob
import matplotlib.pyplot as plt
from PIL import Image
split = 'val'  # Rerun with 'train' and 'test': later steps read resized files from those folders too
folder = f'./data/chest_xray/{split}/*/*.jpeg'
counterPneu = 0
counterNormal = 0
img_paths = glob.glob(folder)
for i in img_paths:
    full_size_image = Image.open(i)
    im = full_size_image.resize((224,224))
    # Pneumonia files in this dataset are named person*.jpeg
    if "person" in i:
        plt.imsave(fname=f'./data/chest_xray/{split}/{split}_pneumonia{counterPneu}.jpeg', arr=im, format='jpeg', cmap='gray')
        counterPneu += 1
    else:
        plt.imsave(fname=f'./data/chest_xray/{split}/{split}_normal{counterNormal}.jpeg', arr=im, format='jpeg', cmap='gray')
        counterNormal += 1
print(f"Processed {counterPneu} Pneumonia images and {counterNormal} Normal images.")
Output (these counts correspond to the train split):
Processed 3875 Pneumonia images and 1341 Normal images.
- Create a DataFrame to organize the dataset by type (train, test, val) and condition (pneumonia, normal)
# Create a DataFrame to organize the dataset by type (train, test, val) and condition (pneumonia, normal)
import glob
import pandas as pd
folder = './data/chest_xray/*/*.jpeg'
category = []
filenames = []
condition_of_lung = []
all_files = glob.glob(folder)
for filename in all_files:
    # The first matching keyword wins, as in an if/elif chain
    dataset_type = next((t for t in ("train", "test", "val") if t in filename), None)
    condition = next((c for c in ("pneumonia", "normal") if c in filename), None)
    # Skip files that do not match a known dataset type and condition
    if dataset_type is None or condition is None:
        continue
    category.append(dataset_type)
    filenames.append(filename)
    condition_of_lung.append(condition)
all_data_df = pd.DataFrame({"dataset type": category, "x-ray result": condition_of_lung, "filename": filenames})
print(all_data_df.head())
Output:
   dataset type  x-ray result                                    filename
0          test        normal    ./data/chest_xray/test/test_normal0.jpeg
1          test        normal    ./data/chest_xray/test/test_normal1.jpeg
2          test        normal   ./data/chest_xray/test/test_normal10.jpeg
3          test        normal  ./data/chest_xray/test/test_normal100.jpeg
4          test        normal  ./data/chest_xray/test/test_normal101.jpeg
- Visualize the distribution of pneumonia and normal cases across the train, test, and validation datasets
import seaborn as sns
# Use `hue` with the same variable as `x` and set `legend=False`
g = sns.catplot(
    x="x-ray result",       # Variable for the x-axis
    col="dataset type",     # Facet by dataset type (train, test, val)
    kind="count",           # Plot counts
    palette="ch:.55",       # Set color palette
    data=all_data_df,       # Data source
    hue="x-ray result",     # Assign `x` to `hue` to use `palette`
    legend=False            # Avoid duplicate legend
)
# Add annotations to the bars
for i in range(0, 3):
    ax = g.facet_axis(0, i)
    for p in ax.patches:
        ax.text(
            p.get_x() + 0.3,                    # X position of the text
            p.get_height() * 1.05,              # Y position of the text (slightly above the bar)
            '{0:.0f}'.format(p.get_height()),   # Text to display (bar height)
            color='black',                      # Text color
            rotation='horizontal',              # Text rotation
            size='large'                        # Text size
        )
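The "Processed ..." output earlier in this post reports 3875 pneumonia and 1341 normal images, so the two classes are imbalanced. One common way to account for that in a binary cross-entropy loss is a positive-class weight equal to the ratio of negative to positive samples. The arithmetic is sketched below, assuming those counts describe the training split; the variable names are illustrative.

```python
# Counts taken from the "Processed ..." output earlier in this post.
# Assumption: they describe the training split.
n_pneumonia = 3875  # positive class (label 1)
n_normal = 1341     # negative class (label 0)

# Ratio of negatives to positives. A value below 1 down-weights the
# over-represented positive class in a weighted binary cross-entropy loss.
pos_weight = n_normal / n_pneumonia
print(f"pos_weight = {pos_weight:.3f}")  # prints "pos_weight = 0.346"
```

In PyTorch, `nn.BCEWithLogitsLoss` accepts such a value through its `pos_weight` argument as a tensor. Whether the weighting actually helps this model is something to verify on the validation data.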
- Create DataFrames for the training and testing datasets, labeling images as pneumonia (1) or normal (0)
# Create DataFrames for the training and testing datasets
import glob
import pandas as pd
import os
def build_label_df(folder):
    # Label each image 1 (pneumonia) or 0 (normal) based on its filename
    rows = []
    for path in glob.glob(folder):
        class_arg = 1 if "pneumonia" in path else 0
        rows.append([class_arg, os.path.basename(path)])
    return pd.DataFrame(rows, columns=['labels', 'filename'])
train_df_lst = build_label_df('./data/chest_xray/train/*.jpeg')
# test_df_lst is needed when both DataFrames are saved to .lst files
test_df_lst = build_label_df('./data/chest_xray/test/*.jpeg')
print(train_df_lst.head())
- Save each DataFrame with labels and filenames into a tab-separated .lst file
# Save DataFrame with labels and filenames into a tab-separated .lst file
def save_to_lst(df,prefix):
    return df[["labels","filename"]].to_csv(
        f"{prefix}.lst", sep='\t',index=True,header=False
    )
save_to_lst(train_df_lst.copy(),"train")
save_to_lst(test_df_lst.copy(),"test")
- Install libraries for data handling and image preprocessing
!pip install torch torchvision pandas pillow
Output:
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.3.0a0+6ddf5cf85e.nv24.4)
Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.18.0a0)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: pillow in /usr/local/lib/python3.10/dist-packages (10.2.0)
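- Set up the model for training
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models
# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Load a pre-trained ResNet18 model (the weights argument replaces the deprecated pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Modify the final fully connected layer for binary classification (one output logit)
model.fc = nn.Linear(model.fc.in_features, 1)
# Move the model to the appropriate device
model = model.to(device)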
- Train the model
# Define the loss function (BCEWithLogitsLoss)
criterion = nn.BCEWithLogitsLoss()
# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.0001)
# Training loop (train_loader is a DataLoader yielding (images, labels) batches)
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0  # Loss summed over the current 10-batch window
    epoch_loss = 0.0    # Loss summed over the whole epoch
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        # Zero the parameter gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels.float().view(-1, 1))
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        epoch_loss += loss.item()
        if i % 10 == 9:  # Print every 10 batches
            print(f"Epoch [{epoch+1}/{num_epochs}], Batch [{i+1}/{len(train_loader)}], Loss: {running_loss/10:.4f}")
            running_loss = 0.0
    # Report the mean loss over all batches of the epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss/len(train_loader):.4f}")
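The overview lists accuracy metrics under Evaluation, but the post does not show that step. A minimal sketch of how accuracy, precision, and recall could be computed from the model's raw logits is given below. The function names and the 0.5 threshold are assumptions, and the example values are made up purely to show the calculation. In practice the logits would be collected by running the model over a test loader inside `torch.no_grad()`.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binary_metrics(logits, labels, threshold=0.5):
    """Accuracy, precision, and recall from raw logits and 0/1 labels."""
    tp = fp = tn = fn = 0
    for logit, label in zip(logits, labels):
        pred = 1 if sigmoid(logit) >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up logits and labels, for illustration only.
example = binary_metrics([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 1])
print(example)
```

Reporting recall alongside accuracy matters here because a missed pneumonia case (a false negative) is usually the more costly error, and accuracy alone can look good on an imbalanced dataset.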
- Save the trained model locally
# Save the model locally
torch.save(model.state_dict(), "local_image_classifier_model.pth")
Key Takeaways:
We now have a trained baseline pneumonia classifier saved locally. Its accuracy should be confirmed on the held-out test set before it is deployed or refined further.
As a next step, I will containerize the model as a backend ML application and create a frontend app that talks to the classifier's endpoint. Users will then be able to upload images through the frontend and receive predictions from the backend.