Building a Smart Intruder Detection System with OpenCV and your Phone

Watch the Video Tutorial for this post here

Did you know that you can actually stream live video wirelessly from your phone’s camera to OpenCV’s cv2.VideoCapture() function on your PC and do all sorts of image processing on the spot, like building an intruder detection system?

Cool huh?

In today’s post, not only will we do just that, but we will also build a robust intruder detection surveillance system on top of it. The system will record video clips whenever someone enters your room and will also send you alert messages via the Twilio API.

This post will serve as your building blocks for making a smart intruder detection system with computer vision. Although I’m making this tutorial for a home surveillance experiment, you can easily take this setup and swap the mobile camera with multiple IP Cams to create a much larger system.

Today’s tutorial can be split into 4 parts:

  1. Accessing the Live stream from your phone to OpenCV.
  2. Learning how to use the Twilio API to send Alert messages.
  3. Building a Motion Detector with Background Subtraction and Contour detection.
  4. Making the Final Application

You can watch the full application demo here

Most people have used the cv2.VideoCapture() function to read from a webcam or a video recording on disk, but only a few know how easy it is to stream a video from a URL; in most cases, this URL belongs to an IP camera.

By the way, with cv2.VideoCapture() you can also read a sequence of images, so yes, even a GIF can be read this way.

So let me list all 4 ways to use the VideoCapture() class, depending upon what you pass into the function.

1. Using a live camera feed: You pass in an integer, i.e. 0, 1, 2, etc., e.g. cap = cv2.VideoCapture(0), and you will be able to use your webcam’s live stream. The number depends upon how many USB cameras you attach and on which port.

2. Playing a saved Video on Disk: You pass in the path to the video file e.g. cap = cv2.VideoCapture(Path_To_video).

3. Live streaming from a URL using an IP camera or similar: You can stream from a URL, e.g. cap = cv2.VideoCapture('protocol://host:port/video'). Note that each video stream or IP camera feed has its own URL scheme.

4. Read a sequence of images: You can also read a sequence of images, e.g. a GIF (a short sketch of this is shown below).
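
To make option 4 concrete, here’s a minimal sketch that reads a numbered image sequence with the exact same VideoCapture API; the folder name and the frame_%03d.png pattern are just placeholder assumptions, so adjust them to your own files.

import cv2

# Read a numbered image sequence (frame_000.png, frame_001.png, ...) like a video
cap = cv2.VideoCapture('frames/frame_%03d.png')

while True:

    ret, frame = cap.read()

    # ret becomes False once the sequence runs out of images
    if not ret:
        break

    cv2.imshow('sequence', frame)
    if cv2.waitKey(100) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()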

Part 1: Accessing the Live stream from your phone to OpenCV For The Intruder Detection System:

Those of you who have an Android phone can go ahead and install this IP Camera application from the Play Store.

If you want to try a different application, or you want to try this on an iPhone, you can still follow along with this tutorial by installing a similar IP camera application on your phone. The one issue you could face is that the URL scheme for each application is different, so you would need to figure that out; some applications make it really simple, like the one I’m showing you today.

You can also use the same code I’m sharing here with an actual IP camera; again, the only difference will be the URL scheme, since different IP cameras have different URL schemes. For our IP camera app, the URL scheme is: protocol://host:port/video

After installing the IP Camera application, open it and scroll all the way down and click start server.

After starting the server the application will start streaming the video to the highlighted URL:

If you paste this URL in the browser of your computer then you would see this:

Note: Your computer and mobile must be connected to the same Network

Click on the Browser or the Flash button and you’ll see a live stream of your video feed:

Below the live feed, you’ll see many options on how to stream your video, you can try changing these options and see effects take place in real-time.

Some important properties to focus on are the video Quality, FPS, and the resolution of the video. All these things determine the latency of the video. You can also change front/back cameras.

Try copying the image Address of the frame:

If you try pasting the address in a new tab then you will only see the video stream. So this is the address that will go inside the VideoCapture function.

Image Address: http://192.168.18.4:8080/video

So the URL scheme in our case is protocol://host:port/video, where the protocol is "http", the host is "192.168.18.4", and the port is "8080".

All you have to do is paste the above address inside the VideoCapture function and you’re all set.
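
If you want a quick sanity check before running the full code below, here’s a minimal sketch that just opens the stream and prints whatever properties OpenCV reports for it. The URL is the one from my network, so swap in your own; also note that some IP streams simply report 0 for FPS, so don’t worry if that happens.

import cv2

# Open the IP camera stream
cap = cv2.VideoCapture('http://192.168.18.4:8080/video')

if cap.isOpened():
    # Print the properties OpenCV reports for this stream
    print('Width :', cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    print('Height:', cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    print('FPS   :', cap.get(cv2.CAP_PROP_FPS))
else:
    print('Could not open the stream, check the URL and the network.')

cap.release()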

Download Code


Here’s the Full Code:

# Import the required libraries
import numpy as np
import cv2
import time
import datetime
from collections import deque
# Set Window normal so we can resize it
cv2.namedWindow('frame', cv2.WINDOW_NORMAL)

# Note the starting time
start_time = time.time()

# Initialize these variables for calculating FPS
fps = 0 
frame_counter = 0

# Read the video stream from the camera
cap = cv2.VideoCapture('http://192.168.18.4:8080/video')

while(True):
    
    ret, frame = cap.read()
    if not ret:
        break 
    
    # Calculate the Average FPS
    frame_counter += 1
    fps = (frame_counter / (time.time() - start_time))
    
    # Display the FPS
    cv2.putText(frame, 'FPS: {:.2f}'.format(fps), (20, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255),1)
    
    # Show the Frame
    cv2.imshow('frame',frame)
    
    # Exit if q is pressed.
    if cv2.waitKey(1) == ord('q'):
        break

# Release Capture and destroy windows
cap.release()
cv2.destroyAllWindows()

As you can see I’m able to stream video from my phone.

Now there are some options you may want to consider; for example, you may want to change the resolution. In my case, I have set the resolution to `640×480`. Since I’m not using the web interface, I have used the app to configure these settings.

There are also other useful settings you may want to configure, like setting up a username and a password so your stream is protected. Setting up a password would, of course, change the URL to something like:

cv2.VideoCapture( protocol://username:password@host:port/video)
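
As a concrete (hedged) example, using the same host and port as before with made-up credentials, the call would look something like this:

# 'myuser' and 'mypassword' are placeholders for whatever you set in the app
cap = cv2.VideoCapture('http://myuser:mypassword@192.168.18.4:8080/video')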

I’ve also enabled background mode, so even when I’m out of the app or my phone’s screen is off, the camera keeps recording secretly; now this is super stealth mode.

Finally, here are some other URL schemes for reading this IP camera stream; with these URLs you can even load audio and single images from the stream (a short sketch using the shot.jpg endpoint follows this list):

  • http://192.168.43.1:8080/video is the MJPEG URL.
  • http://192.168.43.1:8080/shot.jpg fetches the latest frame.
  • http://192.168.43.1:8080/audio.wav is the audio stream in WAV format.
  • http://192.168.43.1:8080/audio.aac is the audio stream in AAC format (if supported by the hardware).
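
Here’s a minimal sketch of how the shot.jpg endpoint can be used to grab a single frame over HTTP and decode it with OpenCV. It assumes you have the requests library installed (pip install requests), and the IP and port are just the ones from the list above, so replace them with your own.

import cv2
import numpy as np
import requests

# Download the latest JPEG frame from the IP camera app
response = requests.get('http://192.168.43.1:8080/shot.jpg', timeout=5)

# Convert the raw bytes into a numpy array and decode it into an OpenCV image
frame = cv2.imdecode(np.frombuffer(response.content, dtype=np.uint8), cv2.IMREAD_COLOR)

cv2.imshow('Single Shot', frame)
cv2.waitKey(0)
cv2.destroyAllWindows()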

Part 2: Learning how to use the Twilio API to send Alert messages for the Intruder Detection System:

What is Twilio?

Twilio is an online service that allows us to programmatically make and receive phone calls, send and receive SMS, MMS, and even WhatsApp messages using its web APIs.

Today we’ll just be using it to send an SMS, you won’t need to purchase anything since you get some free credits after you have signed up here.

So go ahead and sign up, after signing up go to the console interface and grab these two keys and your trial Number:

  • ACCOUNT SID
  • AUTH TOKEN

After getting these keys you would need to insert them in the credentials.txt file provided in the source code folder. You can download the folder from above.

Make sure to replace `INSERT_YOUR_ACCOUNT_SID` with your ACCOUNT SID and `INSERT_YOUR_AUTH_TOKEN` with your AUTH TOKEN.

There are also two other things you need to insert in the text file: your trial number, given to you by Twilio, and your personal number, where you will receive the messages.

So replace `PERSONAL_NUMBER` with your number and `TRIAL_NUMBER` with the Twilio number; make sure to include the country code for your personal number.

Note: on a trial account, the personal number can’t be just any random number; it has to be a verified number. After you have created the account, you can add verified numbers here.
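
Since the code below loads this file with eval() and expects the keys account_sid, auth_token, trial_num, and your_num, the filled-in credentials.txt should look roughly like the sketch below; the SID, token, and numbers here are made-up placeholders, and your exact file layout may differ slightly.

# credentials.txt -- a Python dict literal that the code reads and parses with eval()
# (all values below are made-up placeholders)
{
    'account_sid': 'ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'auth_token': 'your_auth_token_here',
    'trial_num': '+15005550006',
    'your_num': '+921234567890'
}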

Now you’re ready to use the Twilio API; first, install its Python client library by running:

pip install twilio

Now just run this code to send a message:

from twilio.rest import Client

# Read text from the credentials file and store in data variable
with open('credentials.txt', 'r') as myfile:
  data = myfile.read()

# Convert data variable into dictionary
info_dict = eval(data)

# Your Account SID from twilio.com/console
account_sid = info_dict['account_sid']

# Your Auth Token from twilio.com/console
auth_token  = info_dict['auth_token']

# Set client and send the message
client = Client(account_sid, auth_token)
message = client.messages.create( to =info_dict['your_num'], from_ = info_dict['trial_num'], body= "What's Up Man")

Check your phone; you should have received a message. Later on, we’ll properly fill in the body text.

Part 3: Building a Motion Detector with Background Subtraction and Contour detection:

Now in OpenCV, there are multiple ways to detect and track a moving object, but we’re going to go for a simple background subtraction method. 

What are Background Subtraction methods?

Basically, these kinds of methods separate the background from the foreground in a video. For example, if a person walks into an empty room, the background subtraction algorithm would know there’s a disturbance by subtracting the previously stored image of the room (without the person) from the current image (with the person).

So background subtraction can be used as an effective motion detector, and even as an object counter, e.g. a people counter that tracks how many people went in or out of a shop.
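
Just to make that idea concrete, here’s a bare-bones sketch of the naive approach: store the first frame as the "empty room" background and difference every new frame against it. This is only for illustration (it breaks down as soon as lighting changes or the camera moves) and it’s not the method we’ll actually use below; the video filename is just an assumption.

import cv2

# Open any video (or swap in your camera stream URL)
cap = cv2.VideoCapture('sample_video.mp4')

# Read the first frame and treat it as the stored background (the "empty room")
ret, background = cap.read()
background_gray = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)

while True:

    ret, frame = cap.read()
    if not ret:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Absolute difference between the stored background and the current frame
    diff = cv2.absdiff(background_gray, gray)

    # Keep only the significant changes (motion) as white pixels
    ret, motion_mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    cv2.imshow('Motion Mask', motion_mask)
    if cv2.waitKey(30) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()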

Now, what I’ve described above is a very basic approach to background subtraction. In OpenCV, you will find a number of more sophisticated algorithms that use background subtraction to detect motion. In my Computer Vision & Image Processing Course, I have talked about background subtraction in detail, including how to construct your own custom background subtraction methods and how to use the built-in OpenCV ones, so make sure to check out the course if you want to study computer vision in depth.

For this tutorial, I will be using a Gaussian mixture-based background/foreground segmentation algorithm. It is based on two papers by Z. Zivkovic: "Improved adaptive Gaussian mixture model for background subtraction" (2004) and "Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction" (2006).

Here’s the code to apply background subtraction:

# load a video
cap = cv2.VideoCapture('sample_video.mp4')

# Create the background subtractor object
foog = cv2.createBackgroundSubtractorMOG2( detectShadows = True, varThreshold = 50, history = 2800)

while(1):
    
    ret, frame = cap.read() 
    if not ret:
        break
        
    # Apply the background object on each frame
    fgmask = foog.apply(frame)
    
    # Get rid of the shadows
    ret, fgmask = cv2.threshold(fgmask, 250, 255, cv2.THRESH_BINARY)
    
    # Show the background subtraction frame.
    cv2.imshow('All three',fgmask)
    k = cv2.waitKey(10)
    if k == 27: 
        break

cap.release()
cv2.destroyAllWindows()

The `cv2.createBackgroundSubtractorMOG2()` function takes in 3 arguments:

detectShadows: The algorithm will also be able to detect shadows if we pass the `detectShadows=True` argument to the constructor. The ability to detect and get rid of shadows gives us smoother and more robust results, although enabling shadow detection slightly decreases speed.

history: This is the number of frames used to create the background model; increase this number if your target object often stops or pauses for a moment.

varThreshold: This threshold helps you filter out noise present in the frame; increase this number if there are lots of white spots in the frame. We will also use morphological operations like erosion to get rid of the noise.
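
As a side note, here’s a small sketch showing that these three parameters can also be read back, or tweaked on the fly, after the subtractor has been created:

import cv2

# Create the subtractor with the same settings used in this post
foog = cv2.createBackgroundSubtractorMOG2(detectShadows=True, varThreshold=50, history=2800)

# Inspect the current settings
print('history      :', foog.getHistory())         # 2800
print('varThreshold :', foog.getVarThreshold())    # 50.0
print('detectShadows:', foog.getDetectShadows())   # True

# Make the subtractor less sensitive to noise without recreating it
foog.setVarThreshold(100)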

Now, after the background subtraction is done, we can further refine the results by getting rid of the noise and enlarging our target object.

We can refine our results by using morphological operations like erosion and dilation. After we have cleaned up the mask, we can apply contour detection to detect those big moving white blobs (people) and then draw bounding boxes around them.
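
Here’s a tiny standalone sketch of that clean-up step with an explicit 3×3 kernel (the code below passes kernel=None, which makes OpenCV fall back to a default 3×3 rectangular kernel); the synthetic mask is only there so the snippet runs on its own.

import cv2
import numpy as np

# Build a tiny synthetic foreground mask: one big blob (the "person") plus one noisy pixel
fgmask = np.zeros((200, 200), dtype=np.uint8)
fgmask[50:150, 80:120] = 255
fgmask[10, 10] = 255

# Explicit 3x3 kernel for the morphological operations
kernel = np.ones((3, 3), dtype=np.uint8)

# Erosion removes the small white specks (noise)
cleaned = cv2.erode(fgmask, kernel, iterations=1)

# Dilation grows the surviving blob back so the person shows up as one solid region
cleaned = cv2.dilate(cleaned, kernel, iterations=4)

cv2.imshow('Before', fgmask)
cv2.imshow('After', cleaned)
cv2.waitKey(0)
cv2.destroyAllWindows()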

If you don’t know about morphological operations or contour detection, then you should go over this Computer Vision Crash Course post I published a few weeks back.

# Initialize the video capture object
cap = cv2.VideoCapture('sample_video.mp4')

# you can set custom kernel size if you want
kernel= None

# Initialize the background subtractor object
foog = cv2.createBackgroundSubtractorMOG2( detectShadows = True, varThreshold = 50, history = 2800)

# Noise filter threshold
thresh = 1100

while(1):
    ret, frame = cap.read()
    if not ret:
        break
        
    # Apply background subtraction
    fgmask = foog.apply(frame)
    
    # Get rid of the shadows
    ret, fgmask = cv2.threshold(fgmask, 250, 255, cv2.THRESH_BINARY)
    
    # Apply some morphological operations to make sure you have a good mask
    # fgmask = cv2.erode(fgmask,kernel,iterations = 1)
    fgmask = cv2.dilate(fgmask,kernel,iterations = 4)
    
    # Detect contours in the frame
    contours, hierarchy = cv2.findContours(fgmask,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
    
    if contours:
        
        # Get the maximum contour
        cnt = max(contours, key = cv2.contourArea)


        # Make sure the contour area is somewhat higher than some threshold to make sure it's a person and not some noise.
        if cv2.contourArea(cnt) > thresh:

            # Draw a bounding box around the person and label it as person detected
            x,y,w,h = cv2.boundingRect(cnt)
            cv2.rectangle(frame,(x ,y),(x+w,y+h),(0,0,255),2)
            cv2.putText(frame,'Person Detected',(x,y-10), cv2.FONT_HERSHEY_SIMPLEX, 0.3, (0,255,0), 1, cv2.LINE_AA)

 
    # Stack both frames and show the image
    fgmask_3 = cv2.cvtColor(fgmask, cv2.COLOR_GRAY2BGR)
    stacked = np.hstack((fgmask_3,frame))
    cv2.imshow('Combined',cv2.resize(stacked,None,fx=0.65,fy=0.65))

    k = cv2.waitKey(40) & 0xff
    if k == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

So in summary 4 major steps are being performed above:

  • Step 1: We’re Extracting moving objects with Background Subtraction and getting rid of the shadows
  • Step 2: Applying morphological operations to improve the background subtraction mask
  • Step 3: Then we're detecting contours and making sure we're not detecting noise by filtering out small contours
  • Step 4: Finally we’re computing a bounding box over the max contour, drawing the box, and displaying the image.

Part 4: Creating the Final Intruder Detection System Application:

Finally, we will combine all the things above. We will also use the cv2.VideoWriter() class to save the frames as a video on our disk, and we will alert the user via the Twilio API whenever someone enters the room.
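
Since we haven’t used cv2.VideoWriter() on its own yet in this post, here’s a minimal sketch of how it works, using the same codec, FPS, and frame-size settings as the final code below; the input and output filenames are just placeholders.

import cv2

# Open a source video (or a camera stream)
cap = cv2.VideoCapture('sample_video.mp4')

# The writer needs the output name, a FourCC codec, the FPS and the frame size up front
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('copy.mp4', cv2.VideoWriter_fourcc(*'XVID'), 15.0, (width, height))

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Write each frame to disk
    out.write(frame)

# Releasing the writer finalizes the file
out.release()
cap.release()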

#time.sleep(15)

# Set Window normal so we can resize it
cv2.namedWindow('frame', cv2.WINDOW_NORMAL)

# This is a test video
cap = cv2.VideoCapture('sample_video.mp4')

# Read the video stream from the camera
#cap = cv2.VideoCapture('http://192.168.18.4:8080/video')

# Get width and height of the frame
width = int(cap.get(3))
height = int(cap.get(4))

# Read and store the credentials information in a dict
with open('credentials.txt', 'r') as myfile:
  data = myfile.read()

info_dict = eval(data)

# Initialize the background Subtractor
foog = cv2.createBackgroundSubtractorMOG2( detectShadows = True, varThreshold = 100, history = 2000)

# Status is True when person is present and False when the person is not present.
status = False

# After the person disappears from view, wait at least 7 seconds before setting the status to False
patience = 7

# We don't consider an initial detection unless it's detected 15 times in a row, this gets rid of false positives
detection_thresh = 15

# Initial time for calculating if patience time is up
initial_time = None

# We are creating a deque object of length detection_thresh and will store individual detection statuses here
de = deque([False] * detection_thresh, maxlen=detection_thresh)

# Initialize these variables for calculating FPS
fps = 0 
frame_counter = 0
start_time = time.time()


while(True):
    
    ret, frame = cap.read()
    if not ret:
        break 
            
    # This function will return a boolean variable telling if someone was present or not, it will also draw boxes if it 
    # finds someone
    detected, annotated_image = is_person_present(frame)  
    
    # Register the current detection status on our deque object
    de.appendleft(detected)
     
    # If we have consecutively detected a person 15 times then we are sure that someone is present
    # We also make sure this is the first time that this person has been detected so we only initialize the VideoWriter once
    if sum(de) == detection_thresh and not status:                       
            status = True
            entry_time = datetime.datetime.now().strftime("%A, %I-%M-%S %p %d %B %Y")
            out = cv2.VideoWriter('outputs/{}.mp4'.format(entry_time), cv2.VideoWriter_fourcc(*'XVID'), 15.0, (width, height))

    # If status is True but the person is not in the current frame
    if status and not detected:
        
        # Restart the patience timer only if the person has not been detected for a few frames so we are sure it wasn't a
        # false positive
        if sum(de) > (detection_thresh/2): 
            
            if initial_time is None:
                initial_time = time.time()
            
        elif initial_time is not None:        
            
            # If the patience has run out and the person is still not detected then set the status to False
            # Also save the video by releasing the video writer and send a text message.
            if  time.time() - initial_time >= patience:
                status = False
                exit_time = datetime.datetime.now().strftime("%A, %I:%M:%S %p %d %B %Y")
                out.release()
                initial_time = None
            
                body = "Alert: \n A Person Entered the Room at {} \n Left the room at {}".format(entry_time, exit_time)
                send_message(body, info_dict)
    
    # If a significant amount of detections (more than half of detection_thresh) has occurred then we reset the initial time.
    elif status and sum(de) > (detection_thresh/2):
        initial_time = None
    
    # Get the current time in the required format
    current_time = datetime.datetime.now().strftime("%A, %I:%M:%S %p %d %B %Y")

    # Display the FPS
    cv2.putText(annotated_image, 'FPS: {:.2f}'.format(fps), (510, 450), cv2.FONT_HERSHEY_COMPLEX, 0.6, (255, 40, 155),2)
    
    # Display Time
    cv2.putText(annotated_image, current_time, (310, 20), cv2.FONT_HERSHEY_COMPLEX, 0.5, (0, 0, 255),1)    
    
    # Display the Room Status
    cv2.putText(annotated_image, 'Room Occupied: {}'.format(str(status)), (10, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, 
                (200, 10, 150),2)

    # Show the patience Value
    if initial_time is None:
        text = 'Patience: {}'.format(patience)
    else: 
        text = 'Patience: {:.2f}'.format(max(0, patience - (time.time() - initial_time)))
        
    cv2.putText(annotated_image, text, (10, 450), cv2.FONT_HERSHEY_COMPLEX, 0.6, (255, 40, 155) , 2)   

    # If status is true save the frame
    if status:
        out.write(annotated_image)
 
    # Show the Frame
    cv2.imshow('frame',frame)
    
    # Calculate the Average FPS
    frame_counter += 1
    fps = (frame_counter / (time.time() - start_time))
    
    
    # Exit if q is pressed.
    if cv2.waitKey(30) == ord('q'):
        break

# Release Capture and destroy windows
cap.release()
cv2.destroyAllWindows()
# Release the video writer only if it was ever initialized
if status:
    out.release()

Here are the final results:

This is the function that detects if someone is present in the frame or not.

def is_person_present(frame, thresh=1100):
    
    global foog
    
    # Apply background subtraction
    fgmask = foog.apply(frame)

    # Get rid of the shadows
    ret, fgmask = cv2.threshold(fgmask, 250, 255, cv2.THRESH_BINARY)

    # Apply some morphological operations to make sure you have a good mask
    fgmask = cv2.dilate(fgmask,kernel,iterations = 4)

    # Detect contours in the frame
    contours, hierarchy = cv2.findContours(fgmask,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
     
    # Check if there was a contour and its area is somewhat higher than the threshold so we know it's a person and not noise
    if contours and cv2.contourArea(max(contours, key = cv2.contourArea)) > thresh:

        # Get the max contour
        cnt = max(contours, key = cv2.contourArea)

        # Draw a bounding box around the person and label it as person detected
        x,y,w,h = cv2.boundingRect(cnt)
        cv2.rectangle(frame,(x ,y),(x+w,y+h),(0,0,255),2)
        cv2.putText(frame,'Person Detected',(x,y-10), cv2.FONT_HERSHEY_SIMPLEX, 0.3, (0,255,0), 1, cv2.LINE_AA)

        return True, frame

    # Otherwise report there was no one present
    else:
        return False, frame

This function uses twilio to send messages.

def send_message(body, info_dict):

    # Your Account SID from twilio.com/console
    account_sid = info_dict['account_sid']

    # Your Auth Token from twilio.com/console
    auth_token  = info_dict['auth_token']


    client = Client(account_sid, auth_token)

    message = client.messages.create( to = info_dict['your_num'], from_ = info_dict['trial_num'], body= body)

Explanation of the Final Application Code:

The function is_person_present() is called on each frame and tells us if a person is present in the current frame or not. If so, we append True to a deque of length 15; once the detection has occurred 15 times consecutively, we change the room-occupied status to True. The reason we don’t change the occupied status to True on the first detection is to avoid our system being triggered by false positives. As soon as the room status becomes True, the VideoWriter is initialized and the video starts recording.

Now, when the person is not detected anymore, we wait for `7` seconds before turning the room status to False. This is because the person may disappear from view for a moment and then reappear, or we may simply miss detecting the person for a few seconds.

When the person disappears and the 7-second timer runs out, we set the room status to False, release the VideoWriter in order to save the video, and then send an alert message to the user via the send_message() function.

Also, I have designed the code so that our patience timer (the 7-second timer) is not affected by false positives.
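
To isolate just that debouncing trick from the rest of the application, here’s a stripped-down sketch of how the deque keeps a single false positive from flipping the room status; the update() helper is only for this illustration and doesn’t exist in the actual code.

from collections import deque

# The room only counts as occupied once the last `detection_thresh` frames were ALL positive
detection_thresh = 15
de = deque([False] * detection_thresh, maxlen=detection_thresh)

def update(detected):
    # Register the newest detection result and check if every slot in the deque is True
    de.appendleft(detected)
    return sum(de) == detection_thresh

# A single spurious detection never flips the status...
print(update(True))        # False

# ...but 15 consecutive detections do.
for _ in range(14):
    occupied = update(True)
print(occupied)            # True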

Here’s a high level explanation of the demo:


See how I have placed my mobile: while the screen is off, it’s actually recording and sending a live feed to my PC. No one would suspect that you have the perfect intruder detection system set up in the room.

Improvements:

Right now your IP Camera has a dynamic IP so you may be interested in learning how to make your device have a static IP address so you don’t have to change the address each time you launch your IP Camera.

Another limitation you have right now is that you can only use this setup when your device and your PC are connected to the same network/WIFI so you may want to learn how to get this setup to run globally.

Both of these issues can be solved with some configuration. All the instructions for that are in a manual, which you can get by downloading the source code for the intruder detection system above.

Summary:

In this tutorial, you learned how to turn your phone into a smart IP camera, and how to work with URL video feeds in general.

After that we went over how to create a background subtraction based motion detector. 

We also learned how to connect the Twilio API to our system to enable alert messages. Right now we are sending an alert message every time there is motion, so you may want to change this and make the API send you a single message each day containing a summary of all movements that happened in the room throughout the day.

Finally we created a complete application where we also saved the recording snippets of people moving about in the room.

This post was just a basic template for a surveillance system; you can take this and make more enhancements to it. For example, for each person coming into the room, you could check with facial recognition whether it’s actually an intruder or a family member. Similarly, there are lots of other things you can do with this.

If you enjoyed this tutorial then I would love to hear your opinion on it, please feel free to comment and ask questions, I’ll gladly answer them.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI


Contour Detection 101: Contour Analysis (Pt:3)

Watch the Full Video Here:

So far in our Contour Detection 101 series, we have made significant progress unpacking many of the techniques and tools that you will need to build useful vision applications. In Part 1, we learned the basics of how to detect and draw contours, and in Part 2 we learned to do some contour manipulations.

Now in the third part of this series, we will be learning about analyzing contours. This is really important because by doing contour analysis you can actually recognize the object being detected and differentiate one contour from another. We will also explore how you can identify different properties of contours to retrieve useful information. Once you start analyzing the contours, you can do all sorts of cool things with them. The application below that I made, uses contour analysis to detect the shapes being drawn!

You can build this too! In fact, I have an entire course that will help you master contours for building computer vision applications, where you learn by building all sorts of cool vision applications!

This post will be the third part of the Contour Detection 101 series. All 4 posts in the series are titled as:

  1. Contour Detection 101: The Basics  
  2. Contour Detection 101: Contour Manipulation
  3. Contour Detection 101: Contour Analysis (This Post) 
  4. Vehicle Detection with OpenCV using Contours + Background Subtraction

So if you haven’t seen any of the previous posts, make sure you check them out, since in this part we are going to build upon what we have learned before; it will be helpful to have the basics straightened out if you are new to contour detection.

Alright, now we can get started with the Code.

Download Code


Import the Libraries

Let’s start by importing the required libraries.

import cv2
import math
import numpy as np
import pandas as pd
import transformations
import matplotlib.pyplot as plt

Read an Image

Next, let’s read an image containing a bunch of shapes.

# Read the image
image1 = cv2.imread('media/image.png') 

# Display the image
plt.figure(figsize=[10,10])
plt.imshow(image1[:,:,::-1]);plt.title("Original Image");plt.axis("off");

Detect and draw Contours

Next, we will detect and draw external contours on the image using cv2.findContours() and cv2.drawContours() functions that we have also discussed thoroughly in the previous posts.

image1_copy = image1.copy()

# Convert to grayscale
gray_scale = cv2.cvtColor(image1_copy,cv2.COLOR_BGR2GRAY)

# Find all contours in the image
contours, hierarchy = cv2.findContours(gray_scale, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

# Draw all the contours.
contour_image = cv2.drawContours(image1_copy, contours, -1, (0,255,0), 3);

# Display the results.
plt.figure(figsize=[10,10])
plt.imshow(contour_image[:,:,::-1]);plt.title("Image Contours");plt.axis("off");

The result is a list of detected contours that can now be further analyzed for their properties. These properties are going to prove to be really useful when we build a vision application using contours. We will use them to provide us with valuable information about an object in the image and distinguish it from the other objects.

Below we will look at how you can retrieve some of these properties.

Image Moments

Image moments are like the weighted average of the pixel intensities in the image. They help calculate some features like the center of mass of the object, the area of the object, etc. Finding image moments is a simple process in OpenCV: we use the function cv2.moments(), which returns a dictionary of various moment properties.

Function Syntax:

retval = cv2.moments(array)

Parameters:

  • array – Single-channel, 8-bit or floating-point 2D array

Returns:

  • retval – A python dictionary containing different moment properties

# Select a contour
contour = contours[1]

# get its moments
M = cv2.moments(contour)

# print all the moments
print(M)

{'m00': 28977.5, 'm10': 4850112.666666666, 'm01': 15004570.666666666, 'm20': 878549048.4166666, 'm11': 2511467783.458333, 'm02': 7836261882.75, 'm30': 169397190630.30002, 'm21': 454938259986.68335, 'm12': 1311672140996.85, 'm03': 4126888029899.3003, 'mu20': 66760837.58548939, 'mu11': 75901.88486719131, 'mu02': 66884231.43453884, 'mu30': 1727390.3746643066, 'mu21': -487196.02967071533, 'mu12': -1770390.7230567932, 'mu03': 495214.8310546875, 'nu20': 0.07950600793808808, 'nu11': 9.03921532296414e-05, 'nu02': 0.07965295864597088, 'nu30': 1.2084764986041665e-05, 'nu21': -3.408407043976586e-06, 'nu12': -1.238559397771768e-05, 'nu03': 3.4645063088656135e-06}

The values returned represent different kinds of image moments, including raw moments, central moments, scale/rotation invariant moments, and so on.

For more information on image moments and how they are calculated, you can read this Wikipedia article. Below we will discuss how some of the image moments can be used to analyze the contours detected.
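
To give you an intuition for what a raw image moment actually is, here’s a tiny sketch that recomputes m00, m10, and m01 by hand on a small binary array and compares them to what cv2.moments() returns; the raw moment m_pq is just the sum of pixel_value * x^p * y^q over the whole image. (Note that when you pass a contour, i.e. a set of points, instead of an image array, OpenCV computes the moments from the outline, but the intuition is the same.)

import cv2
import numpy as np

# A small binary image with a 3x3 white block
img = np.zeros((6, 6), dtype=np.uint8)
img[2:5, 1:4] = 1

M = cv2.moments(img)

# Recompute the first few raw moments by hand
ys, xs = np.nonzero(img)
m00 = img.sum()                      # total "mass" of the image
m10 = (xs * img[ys, xs]).sum()       # intensity-weighted sum of x coordinates
m01 = (ys * img[ys, xs]).sum()       # intensity-weighted sum of y coordinates

print(M['m00'], m00)                 # both give the same value
print(M['m10'], m10)
print(M['m01'], m01)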

Find the center of a contour

Let’s start by finding the centroid of the object in the image using the contour’s image moments. The X and Y coordinates of the centroid are given by two relations of the raw image moments: Cx = M10/M00 and Cy = M01/M00.

# Calculate the X-coordinate of the centroid
cx = int(M['m10'] / M['m00'])

# Calculate the Y-coordinate of the centroid
cy = int(M['m01'] / M['m00'])

# Print the centroid point
print('Centroid: ({},{})'.format(cx,cy))

Centroid: (167,517)

Let’s repeat the process for the rest of the contours detected and draw a circle using cv2.circle() to indicate the centroids on the image.

image1_copy = image1.copy()

# Loop over the contours
for contour in contours:

    # Get the image moments for the contour
    M = cv2.moments(contour)
    
    # Calculate the centroid
    cx = int(M['m10'] / M['m00'])
    cy = int(M['m01'] / M['m00'])

    # Draw a circle to indicate the centroid
    cv2.circle(image1_copy,(cx,cy), 10, (0,0,255), -1)

# Display the results
plt.figure(figsize=[10,10])
plt.imshow(image1_copy[:,:,::-1]);plt.axis("off");

Finding Contour Area

We are already familiar with one way of finding the area of a contour from the last post: using the function cv2.contourArea().

# Select a contour
contour = contours[1]

# Get the area of the selected contour
area_method1 = cv2.contourArea(contour)

print('Area:',area_method1)

Area: 28977.5

Additionally, you can also find the area using the m00 moment of the contour which contains the contour’s area.

# get selected contour moments
M = cv2.moments(contour)

# Get the moment containing the Area
area_method2 = M['m00']

print('Area:',area_method2)

Area: 28977.5

As you can see, both of the methods give the same result.

Contour Properties

When building an application using contours, information about the properties of a contour is vital. These properties are often invariant to one or more transformations such as translation, scaling, and rotation. Below, we will have a look at some of these properties.

Let’s start by detecting the external contours of an image.

# Read the image
image4 = cv2.imread('media/sword.jpg') 

# Create a copy 
image4_copy = image4.copy()

# Convert to gray-scale
imageGray = cv2.cvtColor(image4_copy,cv2.COLOR_BGR2GRAY)

# create a binary thresholded image
_, binary = cv2.threshold(imageGray, 220, 255, cv2.THRESH_BINARY_INV)

# Detect and draw external contour
contours, hierarchy = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

# Select a contour
contour = contours[0]

# Draw the selected contour
cv2.drawContours(image4_copy, contour, -1, (0,255,0), 3)

# Display the result
plt.figure(figsize=[10,10])
plt.imshow(image4_copy[:,:,::-1]);plt.title("Sword Contour");plt.axis("off");

Now, using a custom transform() function from the transformations.py module, which you will find included with the code for this post, we can conveniently apply and display different transformations to an image.

Function Syntax:

transformations.transform(translate=True, scale=False, rotate=False, path='media/sword.jpg', display=True)

By default, only translation is applied but you may scale and rotate the image as well.

modified_contour = transformations.transform(rotate = True,scale=True)

Applied Translation of x: 44, y: 30
Applied rotation of angle: 80
Image resized to: 95.0%

Aspect ratio

Aspect ratio is the ratio of width to height of the bounding rectangle of an object. It can be calculated as AR=width/height. This value is always invariant to translation.

# Get the up-right bounding rectangle for the image
x,y,w,h = cv2.boundingRect(contour)

# calculate the aspect ratio
aspect_ratio = float(w)/h
print("Aspect ratio intitially {}".format(aspect_ratio))

# Apply translation to the image and get its detected contour
modified_contour = transformations.transform(translate=True)

# Get the bounding rectangle for the detected contour
x,y,w,h = cv2.boundingRect(modified_contour)

# Calculate the aspect ratio for the modified contour
aspect_ratio = float(w)/h
print("Aspect ratio After Modification {}".format(aspect_ratio))

Aspect ratio initially 0.9442231075697212
Applied Translation of x: -45 , y: -49
Aspect ratio After Modification 0.9442231075697212

Extent 

Another useful property is the extent of a contour which is the ratio of contour area to its bounding rectangle area. Extent is invariant to Translation & Scaling.

To find the extent we start by calculating the contour area for the selected contour using the function cv2.contourArea(). Next, the bounding rectangle is found using cv2.boundingRect(). The area of the bounding rectangle is calculated as rect_area = width × height. Finally, the extent is then calculated as extent = contour_area / rect_area.

# Calculate the area for the contour
original_area = cv2.contourArea(contour)

# find the bounding rectangle for the contour
x,y,w,h = cv2.boundingRect(contour)

# calculate the area for the bounding rectangle
rect_area = w*h

# calculate the extent
extent = float(original_area)/rect_area
print("Extent initially {}".format(extent))

# apply scaling and translation to the image and get the contour
modified_contour = transformations.transform(translate=True,scale = True)

# Get the area of modified contour
modified_area = cv2.contourArea(modified_contour)

# Get the bounding rectangle
x,y,w,h = cv2.boundingRect(modified_contour)

# Calculate the area for the bounding rectangle
modified_rect_area = w*h

# calculate the extent
extent = float(modified_area)/modified_rect_area

print("Extent After Modification {}".format(extent))

Extent initially 0.2404054667406324
Applied Translation of x: 38 , y: 44
Image resized to: 117.0%
Extent After Modification 0.24218788234718347

Equivalent Diameter 

Equivalent Diameter is the diameter of the circle whose area is the same as the contour area. It is invariant to translation & rotation. The equivalent diameter can be calculated by first getting the area of the contour with cv2.contourArea(); the area of a circle is given by area = π×d²/4, where d is the diameter of the circle.

So to find the diameter we just have to make d the subject in the above equation, giving us: d = √(4×area/π).

# Calculate the diameter
equi_diameter = np.sqrt(4*original_area/np.pi)
print("Equi diameter intitially {}".format(equi_diameter))

# Apply rotation and transformation
modified_contour = transformations.transform(rotate= True)

# Get the area of modified contour
modified_area = cv2.contourArea(modified_contour)

# Calculate the diameter
equi_diameter = np.sqrt(4*modified_area/np.pi)
print("Equi diameter After Modification {}".format(equi_diameter))

Equi diameter initially 134.93924087995146
Applied Translation of x: -39 , y: 38
Applied rotation of angle: 38
Equi diameter After Modification 135.06184863765444

Orientation 

Orientation is simply the angle at which an object is rotated.

# Rotate and translate the contour
modified_contour = transformations.transform(translate=True,rotate= True,display = True)

Applied Translation of x: 48 , y: -37
Applied rotation of angle: 176

Now let’s take a look at the elliptical angle of the sword contour above.

# Fit an ellipse onto the contour, similar to the minimum area rectangle
(x,y),(MA,ma),angle = cv2.fitEllipse(modified_contour)

# Print the angle of rotation of ellipse
print("Elliptical Angle is {}".format(angle))

Elliptical Angle is 46.882904052734375

The method below also gives the angle of the contour, by fitting a rotated rectangle instead of an ellipse.

(x,y),(w,mh),angle = cv2.minAreaRect(modified_contour)
print("RotatedRect Angle is {}".format(angle))

RotatedRect Angle is 45.0

Note: Don’t be confused by why all three angles show different results; they all calculate angles differently. For example, fitEllipse fits an ellipse and then calculates the angle that the ellipse makes, and similarly the rotated rectangle calculates the angle that the rectangle makes. For triggering decisions based on the calculated angle, you would first need to find out what angle the respective method reports at the given orientations of the object.

Hu moments

Hu moments are a set of 7 numbers calculated using the central moments. What makes these 7 moments special is the fact that out of these 7 moments, the first 6 of the Hu moments are invariant to translation, scaling, rotation and reflection. The 7th Hu moment is also invariant to these transformations, except that it changes its sign in case of reflection. Below we will calculate the Hu moments for the sword contour, using the moments of the contour.

You can read this paper if you want to know more about hu-moments and how they are calculated.

# Calculate moments
M = cv2.moments(contour)

# Calculate Hu Moments
hu_M = cv2.HuMoments(M)

print(hu_M)

[[5.69251998e-01]
[2.88541572e-01]
[1.37780830e-04]
[1.28680955e-06]
[2.45025329e-12]
[3.54895392e-07]
[1.69581763e-11]]

As you can see the different hu-moments have varying ranges (e.g. compare hu-moment 1 and 7) so to make the Hu-moments more comparable with each other, we will transform them to log-scale and bring them all to the same range.

# Log scale hu moments 
for i in range(0,7):
    hu_M[i] = -1* math.copysign(1.0,  hu_M[i]) * math.log10(abs(hu_M[i]))

df = pd.DataFrame(hu_M,columns=['Hu-moments of original Image']) 
df

Next up let’s apply transformations to the image and find the Hu-moments again.

# Apply translation to the image and get its detected contour
modified_contour = transformations.transform(translate = True, scale = True, rotate = True)

Applied Translation of x: -31 , y: 48
Applied rotation of angle: 122
Image resized to: 87.0%

# Calculate moments
M_modified = cv2.moments(modified_contour)

# Calculate Hu Moments
hu_Modified = cv2.HuMoments(M_modified)

# Log scale hu moments 
for i in range(0,7):
    hu_Modified[i] = -1* math.copysign(1.0, hu_Modified[i]) * math.log10(abs(hu_Modified[i]))

df['Hu-moments of Modified Image'] = hu_Modified
df

The difference is minimal because of the invariance of Hu-moments to the applied transformations.


Summary

In this post, we saw how useful contour detection can be when you analyze the detected contour for its properties, enabling you to build applications capable of detecting and identifying objects in an image.

We learned how image moments can provide us with useful information about a contour, such as the center of a contour or the area of a contour.

We also learned how to calculate different contour properties invariant to different transformations such as rotation, translation, and scaling. 

Lastly, we also explored seven unique image moments called Hu-moments which are really helpful for object detection using contours since they are invariant to translation, scaling, rotation, and reflection at once.

This concludes the third part of the series. In the next and final part of the series, we will be building a Vehicle Detection Application using many of the techniques we have learned in this series.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

(LearnOpenCV) Playing Rock, Paper, Scissors with AI

Let’s play rock, paper scissors.

You think of your move and I’ll make mine below this line in 1…2…and 3.

I choose ROCK.

Well? …who won. It doesn’t matter cause you probably glanced at the word “ROCK” before thinking about a move or maybe you didn’t pay any heed to my feeble attempt at playing rock, paper, scissor with you in a blog post.

So why am I making some miserable attempts trying to play this game in text with you?

Let’s just say, a couple of months down the road in lockdown you just run out of fun ideas. To be honest I desperately need to socialize and do something fun. 

Ideally, I would love to play games with some good friends, …or just friends…or anyone who is willing to play.

Now I’m tired of video games. I want to go for something old fashioned, like something involving other intelligent beings, ideally a human. But because of the lockdown, we’re a bit short on those for close proximity activities. So what’s the next best thing?

AI of course. So yeah why not build an AI that would play with me whenever I want.

Now I don’t want to make a dumb AI bot that predicts randomly between rock, paper, and scissor, but rather I also don’t want to use any keyboard inputs or mouse. Just want to play the old fashioned way.

Real-Time 3D Pose Detection & Pose Classification with Mediapipe and Python

In this tutorial, we’ll learn how to do real-time 3D pose detection using the mediapipe library in python. After that, we’ll calculate angles between body joints and combine them with some heuristics to create a pose classification system. 

All of this will work on a real-time camera feed using your CPU, as well as on images. See the results below.

pose detection

The code is really simple; for a detailed code explanation do also check out the YouTube tutorial, although this blog post will be enough to get the code up and running in no time.

This post can be split into 3 parts:

Part 1 (a): Introduction to Pose Detection

Part 1 (b): Mediapipe’s Pose Detection Implementation

Part 2: Using Pose Detection in images and on videos

Part 3: Pose Classification with Angle Heuristics

Part 1 (a): Introduction to Pose Detection:

Pose detection, or pose estimation, is a very popular problem in computer vision; in fact, it belongs to a broader class of computer vision problems called keypoint estimation. Today we’ll learn to do pose detection, where we’ll try to localize 33 key body landmarks on a person, e.g. elbows, knees, ankles, etc.; see the image below:

Some interesting applications of pose detection are:

  • Full body Gesture Control to control anything from video games (e.g. kinect) to physical appliances, robots etc. Check this.
  • Full body Sign Language Recognition. Check this.
  • Creating Fitness / exercise / dance monitoring applications. Check this.
  • Creating Augmented reality applications that overlay virtual clothes or other accessories over someone’s body. Check this.

Now, these are just some interesting things you can make using pose detection, as you can see it’s a really interesting problem.

And that’s not it; there are other types of keypoint detection problems too, e.g. facial landmark detection, hand landmark detection, etc.

We will actually learn to do both of the above in the upcoming tutorials.

Key point detection in turn belongs to a major computer vision branch called Image recognition, other broad classes of vision that belong in this branch are Classification, Detection, and Segmentation.

Here’s a very generic definition of each class.

  • In classification we try to classify whole images or videos as belonging to a certain class.
  • In Detection we try to classify and localize objects or classes of interest.
  • In Segmentation, we try to extract/segment or find the exact boundary/outline of our target object/class.
  • In Keypoint Detection, we try to localize predefined points/landmarks.

It should be noted that each of the major categories above has subcategories or different types, a few weeks ago I wrote a post on Selfie segmentation using mediapipe where I talked about various segmentation types. Be sure to read that post.

If you’re new to computer vision and just exploring the waters, check this page from paperswithcode; it lists a lot of subcategories from the above major categories. Now, don’t be confused by the categorization that paperswithcode has done; personally speaking, I don’t agree with the way they have sorted subcategories with applications, and there are some other issues. The takeaway is that there are a lot of variations in computer vision problems, but the 4 categories I’ve listed above are some major ones.

Part 1 (b): Mediapipe’s Pose Detection Implementation:

Here’s a brief introduction to Mediapipe:

 "Mediapipe is a cross-platform, open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media, and it was built by Google."

Not only is this tool backed by Google, but the models in Mediapipe are actively used in Google products, so you can expect nothing less than state-of-the-art performance from this library.

Now, MediaPipe’s pose detection is a state-of-the-art solution for high-fidelity (i.e. high-quality) and low-latency (i.e. damn fast) detection of 33 3D landmarks on a person in real-time video feeds, even on low-end devices, i.e. phones, laptops, etc.

Alright, so what makes this pose detection model from Mediapipe so fast?

They are actually using a very successful deep learning recipe, which is to create a 2-step detector, where you combine a computationally expensive object detector with a lightweight object tracker.

Here’s how this works:

You run the detector on the first frame of the video to localize the person and provide a bounding box around them. After that, the tracker takes over and predicts the landmark points inside that bounding box ROI. The tracker continues to run on any subsequent frames in the video using the previous frame’s ROI, and only calls the detection model again when it fails to track the person with high confidence.

Their model works best if the person is standing 2-4 meters away from the camera, and one major limitation is that this approach only works for single-person pose detection; it’s not applicable for multi-person detection.

Mediapipe actually trained 3 models, with different tradeoffs between speed and performance. You’ll be able to use all 3 of them with mediapipe.

Method           Latency (Pixel 3, TFLite GPU)   Latency (MacBook Pro, 15-inch 2017)
BlazePose.Heavy  53 ms                           38 ms
BlazePose.Full   25 ms                           27 ms
BlazePose.Lite   20 ms                           25 ms

The detector used in pose detection is inspired by Mediapipe’s lightweight BlazeFace model; you can read this paper for details. For the landmark model used in pose detection, you can read this paper for more details, or read Google’s blog on it.

Here are the 33 landmarks that this model detects:

Alright now that we have covered some basic theory and implementation details, let’s get into the code.

Download Code


Part 2: Using Pose Detection in images and on videos

Import the Libraries

Let’s start by importing the required libraries.

import math
import cv2
import numpy as np
from time import time
import mediapipe as mp
import matplotlib.pyplot as plt

Initialize the Pose Detection Model

The first thing that we need to do is initialize the pose class using the mp.solutions.pose syntax and then we will call the setup function mp.solutions.pose.Pose() with the arguments:

  • static_image_mode – It is a boolean value; if set to False, the detector is only invoked as needed, that is, in the very first frame or when the tracker loses track. If set to True, the person detector is invoked on every input image, so you should probably set this value to True when working with a bunch of unrelated images, not videos. Its default value is False.
  • min_detection_confidence – It is the minimum detection confidence with range (0.0 , 1.0) required to consider the person-detection model’s prediction correct. Its default value is 0.5. This means if the detector has a prediction confidence of greater or equal to 50% then it will be considered as a positive detection.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) required to consider the landmark-tracking model’s tracked pose landmarks valid. If the confidence is less than the set value then the detector is invoked again in the next frame/image, so increasing its value increases the robustness, but also increases the latency. Its default value is 0.5.
  • model_complexity – It is the complexity of the pose landmark model. As there are three different models to choose from, the possible values are 0, 1, or 2. The higher the value, the more accurate the results are, but at the expense of higher latency. Its default value is 1.
  • smooth_landmarks – It is a boolean value; if set to True, pose landmarks across different frames are filtered to reduce noise, but this only works when static_image_mode is also set to False. Its default value is True.

Then we will also initialize mp.solutions.drawing_utils class that will allow us to visualize the landmarks after detection, instead of using this, you can also use OpenCV to visualize the landmarks.

# Initializing mediapipe pose class.
mp_pose = mp.solutions.pose

# Setting up the Pose function.
pose = mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.3, model_complexity=2)

# Initializing mediapipe drawing class, useful for annotation.
mp_drawing = mp.solutions.drawing_utils

Downloading model to C:\ProgramData\Anaconda3\lib\site-packages\mediapipe/modules/pose_landmark/pose_landmark_heavy.tflite

Read an Image

Now we will read a sample image using the function cv2.imread() and then display that image using the matplotlib library.

# Read an image from the specified path.
sample_img = cv2.imread('media/sample.jpg')

# Specify a size of the figure.
plt.figure(figsize = [10, 10])

# Display the sample image, also convert BGR to RGB for display. 
plt.title("Sample Image");plt.axis('off');plt.imshow(sample_img[:,:,::-1]);plt.show()

Perform Pose Detection

Now we will pass the image to the pose detection machine learning pipeline by using the function mp.solutions.pose.Pose().process(). But the pipeline expects the input images in RGB color format so first we will have to convert the sample image from BGR to RGB format using the function cv2.cvtColor() as OpenCV reads images in BGR format (instead of RGB).

After performing the pose detection, we will get a list of thirty-three landmarks representing the body joint locations of the prominent person in the image. Each landmark has:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth, with the midpoint of the hips being the origin, so the smaller the value of z, the closer the landmark is to the camera.
  • visibility – It is a value with range [0.0, 1.0] representing the possibility of the landmark being visible (not occluded) in the image. This is a useful variable when deciding if you want to show a particular joint, because it might be occluded or partially visible in the image.

After performing the pose detection on the sample image above, we will display the first two landmarks from the list, so that you get a better idea of the output of the model.

# Perform pose detection after converting the image into RGB format.
results = pose.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

# Check if any landmarks are found.
if results.pose_landmarks:
    
    # Iterate two times as we only want to display first two landmarks.
    for i in range(2):
        
        # Display the found normalized landmarks.
        print(f'{mp_pose.PoseLandmark(i).name}:\n{results.pose_landmarks.landmark[mp_pose.PoseLandmark(i).value]}') 

NOSE:
x: 0.4321258
y: 0.28087094
z: -0.67494285
visibility: 0.99999905

LEFT_EYE_INNER:
x: 0.44070682
y: 0.2621727
z: -0.6380733
visibility: 0.99999845

Now we will convert the two normalized landmarks displayed above into their original scale by using the width and height of the image.

# Retrieve the height and width of the sample image.
image_height, image_width, _ = sample_img.shape

# Check if any landmarks are found.
if results.pose_landmarks:
    
    # Iterate two times as we only want to display first two landmark.
    for i in range(2):
        
        # Display the found landmarks after converting them into their original scale.
        print(f'{mp_pose.PoseLandmark(i).name}:') 
        print(f'x: {results.pose_landmarks.landmark[mp_pose.PoseLandmark(i).value].x * image_width}')
        print(f'y: {results.pose_landmarks.landmark[mp_pose.PoseLandmark(i).value].y * image_height}')
        print(f'z: {results.pose_landmarks.landmark[mp_pose.PoseLandmark(i).value].z * image_width}')
        print(f'visibility: {results.pose_landmarks.landmark[mp_pose.PoseLandmark(i).value].visibility}\n')

NOSE:
x: 310.69845509529114
y: 303.340619802475
z: -485.28390991687775
visibility: 0.9999990463256836

LEFT_EYE_INNER:
x: 316.86820307374
y: 283.1465148925781
z: -458.774720788002
visibility: 0.9999984502792358

Now we will draw the detected landmarks on the sample image using the function mp.solutions.drawing_utils.draw_landmarks() and display the resultant image using the matplotlib library.

# Create a copy of the sample image to draw landmarks on.
img_copy = sample_img.copy()

# Check if any landmarks are found.
if results.pose_landmarks:
    
    # Draw Pose landmarks on the sample image.
    mp_drawing.draw_landmarks(image=img_copy, landmark_list=results.pose_landmarks, connections=mp_pose.POSE_CONNECTIONS)
       
    # Specify a size of the figure.
    fig = plt.figure(figsize = [10, 10])

    # Display the output image with the landmarks drawn, also convert BGR to RGB for display. 
    plt.title("Output");plt.axis('off');plt.imshow(img_copy[:,:,::-1]);plt.show()

Now we will go a step further and visualize the landmarks in three dimensions (3D) using the function mp.solutions.drawing_utils.plot_landmarks(). We will need POSE_WORLD_LANDMARKS, which is another list of pose landmarks in world coordinates: 3D coordinates in meters, with the origin at the center point between the person’s hips.

# Plot Pose landmarks in 3D.
mp_drawing.plot_landmarks(results.pose_world_landmarks, mp_pose.POSE_CONNECTIONS)

Note: This is actually a neat hack by Mediapipe; the coordinates returned are not true 3D measurements. Setting the hip midpoint as the origin allows us to measure the relative distance of the other points from the hips, and since this distance increases or decreases depending on whether you are closer to or further from the camera, it gives us a sense of each landmark's depth.
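
To make that concrete, here is a small illustrative sketch (my own addition, reusing the results object from the earlier pose.process() call) that ranks the landmarks from nearest to farthest using their z values:

# Check if any landmarks are found.
if results.pose_landmarks:

    # Pair every landmark name with its z value.
    depth_order = sorted(((mp_pose.PoseLandmark(i).name, landmark.z)
                          for i, landmark in enumerate(results.pose_landmarks.landmark)),
                         key=lambda item: item[1])

    # Smaller z means closer to the camera, so the first entries are the nearest joints.
    for name, z in depth_order[:5]:
        print(f'{name}: z = {z:.3f}')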

Create a Pose Detection Function

Now we will put all this together to create a function that will perform pose detection on an image and visualize the results or return the results depending upon the passed arguments.

def detectPose(image, pose, display=True):
    '''
    This function performs pose detection on an image.
    Args:
        image: The input image with a prominent person whose pose landmarks needs to be detected.
        pose: The pose setup function required to perform the pose detection.
        display: A boolean value; if set to True, the function displays the original input image, the resultant image, 
                 and the pose landmarks in a 3D plot, and returns nothing.
    Returns:
        output_image: The input image with the detected pose landmarks drawn.
        landmarks: A list of detected landmarks converted into their original scale.
    '''
    
    # Create a copy of the input image.
    output_image = image.copy()
    
    # Convert the image from BGR into RGB format.
    imageRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    # Perform the Pose Detection.
    results = pose.process(imageRGB)
    
    # Retrieve the height and width of the input image.
    height, width, _ = image.shape
    
    # Initialize a list to store the detected landmarks.
    landmarks = []
    
    # Check if any landmarks are detected.
    if results.pose_landmarks:
    
        # Draw Pose landmarks on the output image.
        mp_drawing.draw_landmarks(image=output_image, landmark_list=results.pose_landmarks,
                                  connections=mp_pose.POSE_CONNECTIONS)
        
        # Iterate over the detected landmarks.
        for landmark in results.pose_landmarks.landmark:
            
            # Append the landmark into the list.
            landmarks.append((int(landmark.x * width), int(landmark.y * height),
                                  (landmark.z * width)))
    
    # Check if the original input image and the resultant image are specified to be displayed.
    if display:
    
        # Display the original input image and the resultant image.
        plt.figure(figsize=[22,22])
        plt.subplot(121);plt.imshow(image[:,:,::-1]);plt.title("Original Image");plt.axis('off');
        plt.subplot(122);plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
        
        # Also Plot the Pose landmarks in 3D.
        mp_drawing.plot_landmarks(results.pose_world_landmarks, mp_pose.POSE_CONNECTIONS)
        
    # Otherwise
    else:
        
        # Return the output image and the found landmarks.
        return output_image, landmarks

Now we will utilize the function created above to perform pose detection on a few sample images and display the results.

# Read another sample image and perform pose detection on it.
image = cv2.imread('media/sample1.jpg')
detectPose(image, pose, display=True)
# Read another sample image and perform pose detection on it.
image = cv2.imread('media/sample2.jpg')
detectPose(image, pose, display=True)
# Read another sample image and perform pose detection on it.
image = cv2.imread('media/sample3.jpg')
detectPose(image, pose, display=True)

Pose Detection On Real-Time Webcam Feed/Video

The results on the images were pretty good; now we will try the function on a real-time webcam feed and a video. Depending upon whether you want to run pose detection on a video stored on disk or on the webcam feed, you can comment and uncomment the initialization code of the VideoCapture object accordingly.

# Setup Pose function for video.
pose_video = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.5, model_complexity=1)

# Initialize the VideoCapture object to read from the webcam.
#video = cv2.VideoCapture(0)

# Initialize the VideoCapture object to read from a video stored in the disk.
video = cv2.VideoCapture('media/running.mp4')


# Initialize a variable to store the time of the previous frame.
time1 = 0

# Iterate until the video is accessed successfully.
while video.isOpened():
    
    # Read a frame.
    ok, frame = video.read()
    
    # Check if frame is not read properly.
    if not ok:
        
        # Break the loop.
        break
    
    # Flip the frame horizontally for natural (selfie-view) visualization.
    frame = cv2.flip(frame, 1)
    
    # Get the width and height of the frame
    frame_height, frame_width, _ =  frame.shape
    
    # Resize the frame while keeping the aspect ratio.
    frame = cv2.resize(frame, (int(frame_width * (640 / frame_height)), 640))
    
    # Perform Pose landmark detection.
    frame, _ = detectPose(frame, pose_video, display=False)
    
    # Set the time for this frame to the current time.
    time2 = time()
    
    # Check if the difference between the previous and this frame time > 0 to avoid division by zero.
    if (time2 - time1) > 0:
    
        # Calculate the number of frames per second.
        frames_per_second = 1.0 / (time2 - time1)
        
        # Write the calculated number of frames per second on the frame. 
        cv2.putText(frame, 'FPS: {}'.format(int(frames_per_second)), (10, 30),cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)
    
    # Update the previous frame time to this frame time.
    # As this frame will become previous frame in next iteration.
    time1 = time2
    
    # Display the frame.
    cv2.imshow('Pose Detection', frame)
    
    # Wait until a key is pressed.
    # Retrieve the ASCII code of the key pressed
    k = cv2.waitKey(1) & 0xFF
    
    # Check if 'ESC' is pressed.
    if(k == 27):
        
        # Break the loop.
        break

# Release the VideoCapture object.
video.release()

# Close the windows.
cv2.destroyAllWindows()

Output:


Cool! So it works great on videos too. The model is pretty fast and accurate.

Part 3: Pose Classification with Angle Heuristics

We have learned to perform pose detection, now we will level up our game by also classifying different yoga poses using the calculated angles of various joints. We will first detect the pose landmarks and then use them to compute angles between joints and depending upon those angles we will recognize the yoga pose of the prominent person in an image.


But this approach does have a drawback that limits its use to a controlled environment: the calculated angles vary with the angle between the person and the camera, so the person needs to face the camera straight on to get the best results.

Create a Function to Calculate Angle between Landmarks

Now we will create a function that will be capable of calculating angles between three landmarks. The angle between landmarks? Do not get confused, as this is the same as calculating the angle between two lines.

The first point (landmark) is considered as the starting point of the first line, the second point (landmark) is considered as the ending point of the first line and the starting point of the second line as well, and the third point (landmark) is considered as the ending point of the second line.


def calculateAngle(landmark1, landmark2, landmark3):
    '''
    This function calculates angle between three different landmarks.
    Args:
        landmark1: The first landmark containing the x,y and z coordinates.
        landmark2: The second landmark containing the x,y and z coordinates.
        landmark3: The third landmark containing the x,y and z coordinates.
    Returns:
        angle: The calculated angle between the three landmarks.

    '''

    # Get the required landmarks coordinates.
    x1, y1, _ = landmark1
    x2, y2, _ = landmark2
    x3, y3, _ = landmark3

    # Calculate the angle between the three points
    angle = math.degrees(math.atan2(y3 - y2, x3 - x2) - math.atan2(y1 - y2, x1 - x2))
    
    # Check if the angle is less than zero.
    if angle < 0:

        # Add 360 to the found angle.
        angle += 360
    
    # Return the calculated angle.
    return angle

Now we will test the function created above by calculating the angle between three landmarks with some dummy values.

# Calculate the angle between the three landmarks.
angle = calculateAngle((558, 326, 0), (642, 333, 0), (718, 321, 0))

# Display the calculated angle.
print(f'The calculated angle is {angle}')

The calculated angle is 166.26373169437744

Create a Function to Perform Pose Classification

Now we will create a function that will be capable of classifying different yoga poses using the calculated angles of various joints. The function will be capable of identifying the following yoga poses:

  • Warrior II Pose
  • T Pose
  • Tree Pose
def classifyPose(landmarks, output_image, display=False):
    '''
    This function classifies yoga poses depending upon the angles of various body joints.
    Args:
        landmarks: A list of detected landmarks of the person whose pose needs to be classified.
        output_image: A image of the person with the detected pose landmarks drawn.
        display: A boolean value; if set to True, the function displays the resultant image with the pose label 
        written on it and returns nothing.
    Returns:
        output_image: The image with the detected pose landmarks drawn and pose label written.
        label: The classified pose label of the person in the output_image.

    '''
    
    # Initialize the label of the pose. It is not known at this stage.
    label = 'Unknown Pose'

    # Specify the color (Red) with which the label will be written on the image.
    color = (0, 0, 255)
    
    # Calculate the required angles.
    #----------------------------------------------------------------------------------------------------------------
    
    # Get the angle between the left shoulder, elbow and wrist points. 
    left_elbow_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value],
                                      landmarks[mp_pose.PoseLandmark.LEFT_ELBOW.value],
                                      landmarks[mp_pose.PoseLandmark.LEFT_WRIST.value])
    
    # Get the angle between the right shoulder, elbow and wrist points. 
    right_elbow_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value],
                                       landmarks[mp_pose.PoseLandmark.RIGHT_ELBOW.value],
                                       landmarks[mp_pose.PoseLandmark.RIGHT_WRIST.value])   
    
    # Get the angle between the left elbow, shoulder and hip points. 
    left_shoulder_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.LEFT_ELBOW.value],
                                         landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value],
                                         landmarks[mp_pose.PoseLandmark.LEFT_HIP.value])

    # Get the angle between the right hip, shoulder and elbow points. 
    right_shoulder_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.RIGHT_HIP.value],
                                          landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value],
                                          landmarks[mp_pose.PoseLandmark.RIGHT_ELBOW.value])

    # Get the angle between the left hip, knee and ankle points. 
    left_knee_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.LEFT_HIP.value],
                                     landmarks[mp_pose.PoseLandmark.LEFT_KNEE.value],
                                     landmarks[mp_pose.PoseLandmark.LEFT_ANKLE.value])

    # Get the angle between the right hip, knee and ankle points 
    right_knee_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.RIGHT_HIP.value],
                                      landmarks[mp_pose.PoseLandmark.RIGHT_KNEE.value],
                                      landmarks[mp_pose.PoseLandmark.RIGHT_ANKLE.value])
    
    #----------------------------------------------------------------------------------------------------------------
    
    # Check if it is the warrior II pose or the T pose.
    # As for both of them, both arms should be straight and shoulders should be at the specific angle.
    #----------------------------------------------------------------------------------------------------------------
    
    # Check if the both arms are straight.
    if left_elbow_angle > 165 and left_elbow_angle < 195 and right_elbow_angle > 165 and right_elbow_angle < 195:

        # Check if shoulders are at the required angle.
        if left_shoulder_angle > 80 and left_shoulder_angle < 110 and right_shoulder_angle > 80 and right_shoulder_angle < 110:

    # Check if it is the warrior II pose.
    #----------------------------------------------------------------------------------------------------------------

            # Check if one leg is straight.
            if left_knee_angle > 165 and left_knee_angle < 195 or right_knee_angle > 165 and right_knee_angle < 195:

                # Check if the other leg is bent at the required angle.
                if left_knee_angle > 90 and left_knee_angle < 120 or right_knee_angle > 90 and right_knee_angle < 120:

                    # Specify the label of the pose that is Warrior II pose.
                    label = 'Warrior II Pose' 
                        
    #----------------------------------------------------------------------------------------------------------------
    
    # Check if it is the T pose.
    #----------------------------------------------------------------------------------------------------------------
    
            # Check if both legs are straight
            if left_knee_angle > 160 and left_knee_angle < 195 and right_knee_angle > 160 and right_knee_angle < 195:

                # Specify the label of the pose, that is, T pose.
                label = 'T Pose'

    #----------------------------------------------------------------------------------------------------------------
    
    # Check if it is the tree pose.
    #----------------------------------------------------------------------------------------------------------------
    
    # Check if one leg is straight
    if left_knee_angle > 165 and left_knee_angle < 195 or right_knee_angle > 165 and right_knee_angle < 195:

        # Check if the other leg is bent at the required angle.
        if left_knee_angle > 315 and left_knee_angle < 335 or right_knee_angle > 25 and right_knee_angle < 45:

            # Specify the label of the pose that is tree pose.
            label = 'Tree Pose'
                
    #----------------------------------------------------------------------------------------------------------------
    
    # Check if the pose is classified successfully
    if label != 'Unknown Pose':
        
        # Update the color (to green) with which the label will be written on the image.
        color = (0, 255, 0)  
    
    # Write the label on the output image. 
    cv2.putText(output_image, label, (10, 30),cv2.FONT_HERSHEY_PLAIN, 2, color, 2)
    
    # Check if the resultant image is specified to be displayed.
    if display:
    
        # Display the resultant image.
        plt.figure(figsize=[10,10])
        plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
        
    else:
        
        # Return the output image and the classified label.
        return output_image, label

Now we will utilize the function created above to perform pose classification on a few images of people and display the results.

Warrior II Pose

The Warrior II Pose (also known as Virabhadrasana II) is the same pose that the person is making in the image above. It can be classified using the following combination of body part angles:

  • Around 180° at both elbows
  • Around 90° angle at both shoulders
  • Around 180° angle at one knee
  • Around 90° angle at the other knee
# Read a sample image and perform pose classification on it.
image = cv2.imread('media/warriorIIpose.jpg')
output_image, landmarks = detectPose(image, pose, display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)
# Read another sample image and perform pose classification on it.
image = cv2.imread('media/warriorIIpose1.jpg')
output_image, landmarks = detectPose(image, pose, display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)

Tree Pose

Tree Pose (also known as Vrikshasana) is another yoga pose for which the person has to keep one leg straight and bend the other leg at a required angle. The pose can be classified easily using the following combination of body part angles:

  • Around 180° angle at one knee
  • Around 35° (if right knee) or 335° (if left knee) angle at the other knee

Now to understand it better, you should go back to the pose classification function above and review the classification code for this yoga pose.

We will perform pose classification on a few images of people in the tree yoga pose and display the results using the same function we had created above.

# Read a sample image and perform pose classification on it.
image = cv2.imread('media/treepose.jpg')
output_image, landmarks = detectPose(image, mp_pose.Pose(static_image_mode=True,
                                         min_detection_confidence=0.5, model_complexity=0), display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)
# Read another sample image and perform pose classification on it.
image = cv2.imread('media/treepose1.jpg')
output_image, landmarks = detectPose(image, mp_pose.Pose(static_image_mode=True,
                                         min_detection_confidence=0.5, model_complexity=0), display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)
# Read another sample image and perform pose classification on it.
image = cv2.imread('media/treepose2.jpg')
output_image, landmarks = detectPose(image, pose, display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)

T Pose

T Pose (also known as a bind pose or reference pose) is the last pose we are dealing with in this lesson. To make this pose, one has to stand up like a tree with both hands wide open as branches. The following body part angles are required to make this one:

  • Around 180° at both elbows
  • Around 90° angle at both shoulders
  • Around 180° angle at both knees

You can now go back and go through the classification code of the T pose in the pose classification function created above.

Now, let’s test the pose classification function on a few images of the T pose.

# Read another sample image and perform pose classification on it.
image = cv2.imread('media/Tpose.jpg')
output_image, landmarks = detectPose(image, pose, display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)
# Read another sample image and perform pose classification on it.
image = cv2.imread('media/Tpose1.jpg')
output_image, landmarks = detectPose(image, pose, display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)

So the function is working pretty well on all the known poses in images. Let's try it on an unknown pose called the cobra pose (also known as Bhujangasana).

# Read another sample image and perform pose classification on it.
image = cv2.imread('media/cobrapose1.jpg')
output_image, landmarks = detectPose(image, pose, display=False)
if landmarks:
    classifyPose(landmarks, output_image, display=True)

Now, if you want, you can extend the pose classification function to make it capable of identifying more yoga poses, like the one in the image above. The following combination of body part angles can help classify this one (a rough sketch of such a check follows the list below):

  • Around 180° angle at both knees
  • Around 105° (if the person is facing right side) or 240° (if the person is facing left side) angle at both hips
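
To give you a starting point, here is a rough sketch (my own addition, not part of the original function) of how such a check could be added inside classifyPose(). The hip angles are assumed to be calculated between the shoulder, hip, and knee landmarks, and the thresholds are loose ranges around the angles listed above:

    # Get the angle between the left shoulder, hip and knee points.
    left_hip_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value],
                                    landmarks[mp_pose.PoseLandmark.LEFT_HIP.value],
                                    landmarks[mp_pose.PoseLandmark.LEFT_KNEE.value])

    # Get the angle between the right shoulder, hip and knee points.
    right_hip_angle = calculateAngle(landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value],
                                     landmarks[mp_pose.PoseLandmark.RIGHT_HIP.value],
                                     landmarks[mp_pose.PoseLandmark.RIGHT_KNEE.value])

    # Check if both legs are straight.
    if left_knee_angle > 165 and left_knee_angle < 195 and right_knee_angle > 165 and right_knee_angle < 195:

        # Check if both hips are bent at roughly 105 degrees (facing right) or 240 degrees (facing left).
        if (95 < left_hip_angle < 115 and 95 < right_hip_angle < 115) or \
           (230 < left_hip_angle < 250 and 230 < right_hip_angle < 250):

            # Specify the label of the pose, that is, Cobra pose.
            label = 'Cobra Pose'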

Pose Classification On Real-Time Webcam Feed

Now we will test the function created above to perform pose classification on a real-time webcam feed (you can also point the VideoCapture object at a video file, as done in the cell below).

# Initialize the VideoCapture object; pass 0 to read from the webcam or a file path to read from a video on disk.
camera_video = cv2.VideoCapture('sample.mp4')

# Initialize a resizable window.
cv2.namedWindow('Pose Classification', cv2.WINDOW_NORMAL)

# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
    
    # Read a frame.
    ok, frame = camera_video.read()
    
    # Check if frame is not read properly.
    if not ok:
        
        # Continue to the next iteration to read the next frame and ignore the empty camera frame.
        continue
    
    # Flip the frame horizontally for natural (selfie-view) visualization.
    frame = cv2.flip(frame, 1)
    
    # Get the width and height of the frame
    frame_height, frame_width, _ =  frame.shape
    
    # Resize the frame while keeping the aspect ratio.
    frame = cv2.resize(frame, (int(frame_width * (640 / frame_height)), 640))
    
    # Perform Pose landmark detection.
    frame, landmarks = detectPose(frame, pose_video, display=False)
    
    # Check if the landmarks are detected.
    if landmarks:
        
        # Perform the Pose Classification.
        frame, _ = classifyPose(landmarks, frame, display=False)
    
    # Display the frame.
    cv2.imshow('Pose Classification', frame)
    
    # Wait until a key is pressed.
    # Retrieve the ASCII code of the key pressed
    k = cv2.waitKey(1) & 0xFF
    
    # Check if 'ESC' is pressed.
    if(k == 27):
        
        # Break the loop.
        break

# Release the VideoCapture object and close the windows.
camera_video.release()
cv2.destroyAllWindows()

Output:


Summary:

Today, we learned about a very popular vision problem called pose detection. We briefly discussed popular computer vision problems, then we saw how Mediapipe has implemented its pose detection solution and how it uses a 2-step detection + tracking pipeline to speed up the process.

After that, we saw step by step how to do real-time 3D pose detection with Mediapipe on images and on a webcam feed.

Then we learned to calculate angles between different landmarks and then used some heuristics to build a classification system that could determine 3 poses, T-Pose, Tree Pose, and a Warrior II Pose.

Alright, here are some limitations of our pose classification system: it has too many conditions and checks. For our case it's not that complicated, but if you throw in a few more poses this system can easily get confusing and complicated. A much better method is to train an MLP (a simple multi-layer perceptron) with Keras on landmark points from a few target pose pictures and then classify them; a rough sketch of this idea follows below. I'm not sure, but I might create a separate tutorial for that in the future.
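
To illustrate that idea, here is a very rough sketch (my own addition, not a working tutorial) of what such an MLP could look like in Keras. It assumes you have already gathered flattened landmark arrays and integer pose labels; the array shapes and random data below are placeholders only:

import numpy as np
import tensorflow as tf

# Assume 33 landmarks with (x, y, z) each, flattened per sample, and 3 target poses.
num_classes = 3

# Placeholder data just to make the sketch runnable; replace with your real landmarks/labels.
X = np.random.rand(100, 33 * 3).astype('float32')
y = np.random.randint(0, num_classes, size=(100,))

# A simple multi-layer perceptron on top of the landmark coordinates.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(33 * 3,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# Train the classifier on the landmark features.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=16)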

Another issue that I briefly went over is that the pose detection model in Mediapipe can only detect a single person at a time. This is fine for most pose-based applications but can prove to be an issue when you're required to detect more than one person. If you do want to detect more people, then you could try other popular models like PoseNet or OpenPose.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Training an Object Detector with Tensorflow Object Detection API


Tensorflow Object Detection API

This is a really descriptive and interesting tutorial; let me highlight what you will learn about the Tensorflow Object Detection API.

  1. A Crystal Clear step by step tutorial on training a custom object detector.
  2. A method to download videos and create a custom dataset out of that.
  3. How to use the custom trained network inside the OpenCV DNN module so you can get rid of the TensorFlow framework.

Plus there are two things you will receive from the provided source code:

  1. A Jupyter Notebook that automatically downloads and installs all the required things for you so you don’t have to step outside of that notebook.
  2. A Colab version of the notebook that runs out of the box, just run the cells and train your own network.

I will stress again that all of the steps are explained in a neat and digestible way. If you have ever planned to do Object Detection, then this is one tutorial you don't want to miss.

As mentioned, by downloading the Source Code you will get 2 versions of the notebook: a local version and a colab version.

So first we’re going to see a complete end-to-end pipeline for training a custom object detector on our data and then we will use it in the OpenCV DNN module so we can get rid of the heavy Tensorflow framework for deployment. We have already discussed the advantages of using the final trained model in OpenCV instead of Tensorflow in my previous post.

Today’s post is the 3rd tutorial in our 3 part Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with OpenCV, Converting to ONNX, and using it in OpenCV DNN module.
  3. Training a Custom Object Detector with Tensorflow and using it with OpenCV DNN (This Post)

You don't need to read the previous two tutorials to follow along and learn the full pipeline of training a custom object detector with TensorFlow, but they will help when we move to the last part of this tutorial and use the model in the OpenCV DNN module.

What is Tensorflow Object Detection API (TFOD) :

To train our custom Object Detector we will be using the TensorFlow Object Detection API (TFOD API). The Tensorflow Object Detection API is a framework built on top of TensorFlow that makes it easy for you to train your own custom models.

The workflow generally goes like this :

You take a pre-trained model from this model zoo and then fine-tune it for your own task.
Fine-tuning is a transfer learning method that lets you reuse the features the model learned on a different task for your own task. Because of this, you won't need thousands of images to train the network; a few hundred will suffice.
If you're someone who prefers PyTorch instead of Tensorflow, then you may want to look at Detectron 2.

For this tutorial I will be using TensorFlow Object Detection API version 1. If you want to know why we are using version 1 instead of the recently released version 2, then you can read the optional explanation below.

Tensorflow Object Detection API 1

Why we’re using Tensorflow Object Detection API Version 1? (OPTIONAL READ)

IGNORE THIS EXPLANATION IF YOU’RE NOT FAMILIAR WITH TENSORFLOW’S  FROZEN_GRAPHS

Tensorflow Object Detection API v2 comes with a lot of improvements: the new API contains some new State of the Art (SoTA) models and some pretty good changes, including new binaries for train/eval/export that are eager-mode compatible. You can check out this release blog from the Tensorflow Object Detection API developers.

But the thing is, because TF 2 no longer supports sessions, you can't easily export your model to a frozen_inference_graph; furthermore, TensorFlow deprecates the use of frozen_graphs and promotes the saved_model format for future use cases.

For TensorFlow, this is the right move as the saved_model format is an excellent format.

So what’s the issue?

The problem is that OpenCV only works with frozen_inference_graphs and does not support the saved_model format yet, so if your end goal is to deploy the model in OpenCV, then you should use Tensorflow Object Detection API v1. Although you can still generate frozen_graphs with TF 2, those graphs produce errors in OpenCV most of the time. We've only run limited experiments with TF 2, so feel free to carry out your own experiments, but do share if you find something useful.

One great thing about this situation is that the Tensorflow team decided to keep the whole pipeline and code of Tensorflow Object Detection API v2 almost identical to v1, so learning how to use version 1 will also teach you how to use version 2.

Now Let’s start with the code

Code For TF Object Detection Pipeline:


Make sure to download the source code, which also contains the support folder with some helper files that you will need.

Here’s the hierarchy of the source code folder:

│   Colab Notebook Link.txt
│   Custom_Object_Detection.ipynb
│
└───support
    │   create_tf_record.py
    │   frozen_inference_graph.pb
    │   graph_ours.pbtxt
    │   tf_text_graph_common.py
    │   tf_text_graph_faster_rcnn.py
    │  
    │
    ├───labels
    │       _000.xml
    │       _001.xml
    │       _002.xml
    │       ...
    
    ├───test_images
    │       test1.jpg
    │       test2.jpg
    │       test3.png
    │       ...
    

Here’s a description of what these folders & files are:

  • Custom_Object_Detection.ipynb: This is the main notebook which contains all the code.
  • Colab Notebook Link: This text file contains the link for the colab version of the notebook.
  • create_tf_record.py: This file will create tf records from the images and labels.
  • frozen_inference_graph.pb: This is the model we trained, you can try to run this on test images.
  • graph_ours.pbtxt: This is the graph file we generated for OpenCV, you'll learn to generate your own.
  • tf_text_graph_faster_rcnn.py: This file creates the above graph.pbtxt file for OpenCV.
  • tf_text_graph_common.py: This is a helper file used by the tf_text_graph_faster_rcnn.py file.
  • labels: These are .xml labels for each image
  • test_images: These are some sample test images to do inference on.

Note: There are some other folders and files which you will generate along the way, I will explain their use later.

Now, even though I make it really easy, if you still don't want to worry about environment setup and installation, then you can use the Colab version of the notebook that comes with the source code.

The Colab version doesn’t require any Configuration, It’s all set to go. Just run the cells in order. You should also be able to use the Colab GPU to speed up the training process.

The full code can be broken down into the following parts

  • Part 1: Environment Setup
  • Part 2: Installation & TFOD API Setup
  • Part 3: Data Collection & Annotation
  • Part 4: Downloading Model & Configuring it
  • Part 5: Training and Exporting Inference Graph.
  • Part 6: Generating .pbtxt and using the trained model with just OpenCV.

Part 1: Environment Setup:

First, let’s Make sure you have correctly set up your environment.

Since we are going to install TensorFlow version 1.15.0, we should use a virtual environment; you can either install virtualenv or the Anaconda distribution. I'm using Anaconda, so I will start by creating a virtual environment.

Open up the command prompt and do conda create --name tfod1 python==3.7

Tensorflow Object Detection API installation

Now you can move into that environment by activating it:

conda activate tfod1

Make sure there is a (tfod1) at the beginning of each line in your cmd. This means you’re using that environment. Now anything you install will be in that environment and won’t affect your base/root environment.

Tensorflow Object Detection API installation

The first thing you want to do is install Jupyter Notebook in that environment. Otherwise, your environment will use the Jupyter Notebook of the base environment, so do:

pip install jupyter notebook

Now you should go into the directory/folder which I provided you and contains this notebook and open up the command prompt.

First, activate the tfod1 environment and then launch the jupyter notebook by typing jupyter notebook and hitting enter.

This will launch the jupyter notebook in your newly created environment. You can now Open up Custom_Object_Detection Notebook.

Make sure your Notebook is Opened up in the Correct environment

import sys

# Make sure to check you're using your tfod1 environment, you should see that name in the printed output
print(sys.executable)

c:\users\hp-pc\anaconda3\envs\tfod1\python.exe

Part 2: Installation & Tensorflow Object Detection API Setup: 

You can install all the required libraries by running this cell

# If you can't use ! on windows 10 then you should do conda install posix
# Alternatively you can also use % instead of ! in Windows.

!pip install tensorflow==1.15.0
!pip install youtube-dl
!pip install git+https://github.com/TahaAnwar/pafy.git#egg=pafy
!pip install scipy
!pip install labelImg
!pip install opencv-contrib-python
!pip install matplotlib

If you want to install Tensorflow-GPU for version 1 then you can take a look at my tutorial for that here

Note: You would need to change the Cuda Toolkit version and CuDNN version in the above tutorial since you’ll be installing for TF version 1 instead of version 2. You can look up the exact version requirements here

Another Library you will need is pycocotools

# RUN THIS TO INSTALL IN WINDOWS
!pip install pycocotools-windows

Alternatively, You can also use this command to install in windows:

pip install git+https://github.com/philferriere/cocoapi.git#egg=pycocotools^&subdirectory=PythonAPI

# RUN THIS TO INSTALL IN LINUX
pip install "git+https://github.com/waleedka/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"

Alternatively, you can also use this command to install in Linux and osx:

pip install pycocotools

Note: Make sure you have Cython installed first by doing: pip install Cython

Import Libraries

This will also confirm if your installations were successful or not.

import os
import shutil
import math
import datetime

import glob
import urllib
import tarfile
import urllib.request
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
import re


import matplotlib.pyplot as plt
%matplotlib inline

# This will let you download any video from youtube
import pafy

import cv2
import numpy as np

import tensorflow as tf

print("This should be Version 1.15.0, DETECTED VERSION: " + tf.__version__)

This should be Version 1.15.0, DETECTED VERSION: 1.15.0

Clone Tensorflow Object Detection API Model Repository

You need to clone the TF Object Detection API repository, you can either download the zip file and extract it or if you have git installed then you can git clone it.

Option 1: Download with git:

You can run git clone if you have git installed, this is going to take a while, it’s 600 MB+, have a coffee or something.

# Clone the Github Repo in the current directory.
!git clone https://github.com/tensorflow/models.git

Option 2: Download the zip and extract all: (Only do this if you don’t have git)

You can download the zip by clicking here, after downloading make sure to extract the contents of this zip inside the directory containing this notebook. I’ve already provided you the code that automatically downloads and unzips the repo in this directory.

URL = 'https://github.com/tensorflow/models/archive/master.zip'

# Download the zip file and extract it into the current directory (as models-master).
with urlopen(URL) as zip_file:
    with ZipFile(BytesIO(zip_file.read())) as zfile:
        zfile.extractall()
        
# Rename `models-master` directory to `models`
os.rename('models-master', 'models')

The models we’ll be using are in the research directory of the above repo. The research directory contains a collection of research model implementations in TensorFlow 1 or 2 by researchers. There are a total of 4 directories in the above repo, you can learn more about them here.

Install Tensorflow Object Detection API & Compile Protos

Download the Protobuf Compiler:

TFOD contains some files in .proto format; I'll explain more about this format in a later step. For now, you need to download the protobuf compiler from here; make sure to download the correct one based on your system. For example, I downloaded protoc-3.12.4-win64.zip for my 64-bit Windows machine. There are different files for Linux and macOS.

After downloading, unzip the folder, go to its bin directory, and copy the protoc.exe file. Now paste this protoc.exe inside the models/research directory.

The below script does all of this, but you can choose to do it manually if you want. Make sure to change the URL if you’re using a system other than 64-bit windows.

# Set the URL, you can copy/paste your target system's URL here.
URL = 'http://github.com/protocolbuffers/protobuf/releases/download/v3.12.4/protoc-3.12.4-win64.zip'

name = 'proto_file'

# Download and extract the zip file into a folder named proto_file
with urlopen(URL) as zip_file:
    with ZipFile(BytesIO(zip_file.read())) as zfile:
        zfile.extractall('proto_file')
        
# Copy and paste the protoc.exe to 'models/research' directory.
shutil.copy(name + '/bin/protoc.exe', 'models/research/')   

# Delete the protoc folder
shutil.rmtree(name)

Now you can install the object detection API and compile the protos:
The below two operations must be performed in this directory, otherwise it won't work, especially the protoc command.

# Move to models/research directory.
os.chdir('models/research/')

# Compile the protobuf files in the object_detection/protos folder. Now for every .proto file there will be a .py file present there.
!protoc object_detection/protos/*.proto --python_out=.

# Copies the required setup file
!cp object_detection/packages/tf1/setup.py .

# Installs and setsup TF 1 Object Detection API.
!python -m pip install .

# Move up two directories, this will put you back to your original `TF Object Detection v1` directory.
os.chdir('../..')

Note: Since I had already installed pycocotools, after running the line cp object_detection/packages/tf1/setup.py . I edited the setup.py file to remove the pycocotools package from the REQUIRED_PACKAGES list, saved setup.py, and then ran the python -m pip install . command. I did this because I was facing issues installing pycocotools this way, which is why I installed the pycocotools-windows package instead; you probably won't need to do this.

If you wanted to install TFOD API version 2 instead of version 1 then you can just replace tf1 with tf2 in the cp object_detection/packages/tf1/setup.py . command.

You can check your installation of TFOD API by running model_builder_tf1_test.py

# Move to models/research directory.
os.chdir('models/research/')

# Test the installation.
!python object_detection/builders/model_builder_tf1_test.py

# back to the main directory
os.chdir('../..')

Part 3: Data Collection & Annotation:

Now for this tutorial, I’m going to train a detector to detect the faces of Tom & Jerry. I didn’t want to use the common animal datasets etc. So I went with this.

While I was writing the above sentence, I realized I'm still using a cat-and-mouse dataset, albeit an animated one, so I guess it's still a unique dataset.

In this tutorial, I’m not only going to show you how to annotate the data but also show you one approach on how to go about collecting data for a new problem.

So What I’ll be doing is that I’m going to download a video of Tom & Jerry from Youtube and then split the frames of the video to create my dataset and then annotate each of those frames with bounding boxes. Now instead of downloading my Tom & Jerry video you can use any other video and try to detect your own classes.

Alternatively, you can also generate training data from other methods including getting images from Google Images.

To prepare the Data we need to perform these 5 steps:

  • Step 1: Download Youtube Video.
  • Step 2: Split Video Frames and store it.
  • Step 3: Annotate Images with labelImg.
  • Step 4: Create a label Map file.
  • Step 5: Generate TFRecords.

Step 1: Download Youtube Video:

# Define the URL of the video
url = "https://www.youtube.com/watch?v=blWvD93bALE"

# Set the name of the Video 
video_name =  "support/test_images/tomandjerry.mp4"

# Create video object
video = pafy.new(url)

# Get that video in best available resolution
bestResolutionVideo = video.getbest()

# Download the Video
bestResolutionVideo.download(filepath= video_name)

11,311,502.0 Bytes [100.00%] received. Rate: [7788 KB/s]. ETA: [0 secs]

For more options on how you can download the video take a look at the documentation here

Step 2: Split Video Frames and store it:

Now we’re going to split the video frames and store them in a folder. Since most videos have a high FPS (30-60 frames/sec) and we don’t exactly need this many frames for two reasons:

  1. If you take a 30 FPS video, then for each second of the video you will get 30 images, and most of those images won't differ much from each other; there will be a lot of repetition of information.
  2. We're already going to use Transfer Learning with the TFOD API, the benefit of which is that we won't need a lot of images, and this is good since we don't want to annotate thousands of images.

So we can do one of two things: skip frames and save every nth frame, or save a frame every nth second of the video. I'm going with the latter approach, although both are valid (a quick sketch of the frame-skipping variant appears after the code below).

# Define an output directory
output_directory = "training/images"

# Define the time interval after which you'll save each frame.
sec = 1.5

# If the output directory does not exist then create it
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Initialize the video capture object
cap = cv2.VideoCapture(video_name)

# Get the FPS rate of the video, for this video its 25.0
fps = cap.get(cv2.CAP_PROP_FPS)

# Given the FPS rate of the video, calculate the number of frames you will need to skip so that `sec` seconds have passed.
no_of_frames_to_skip = round(sec * fps)

frame_count = 0

while True:

    ret, frame = cap.read()
    
    # Break the loop if the video has ended
    if not ret:
        break

    # Get the Current Frame Number
    frame_Id = int(cap.get(1))

    # Only save the frame when you've skipped the defined number of frames
    if frame_Id % no_of_frames_to_skip == 0:
        
        # Save the frame in the output directory
        #fname = "n_{}.jpg".format(frame_count)
        fname = '_{str_0:0>{str_1}}.jpg'.format(str_0=frame_count, str_1=3)
        cv2.imwrite(os.path.join(output_directory, fname), frame)

        frame_count += 1
        
print('Done Splitting Video, Total Images saved: {}'.format(frame_count))

# Release the capture
cap.release()

Done Splitting Video, Total Images saved: 165
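
As an aside, here is a quick sketch (my own addition) of the first approach mentioned earlier, i.e. simply saving every nth frame; the value of n is an assumption, and it reuses video_name and output_directory from above:

# Save every nth frame instead of sampling by time (n is an assumed value).
n = 30

# Re-open the video and set up counters.
cap = cv2.VideoCapture(video_name)
saved_count = 0
frame_no = 0

while True:

    ret, frame = cap.read()

    # Stop when the video ends.
    if not ret:
        break

    # Save only every nth frame.
    if frame_no % n == 0:
        cv2.imwrite(os.path.join(output_directory, 'alt_{:03d}.jpg'.format(saved_count)), frame)
        saved_count += 1

    frame_no += 1

# Release the capture.
cap.release()

print('Saved {} frames with the frame-skipping approach'.format(saved_count))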

You can go to the directory where the images are saved and manually go through each image and delete the ones where Tom & Jerry are not visible or hardly visible. Although this is not a strict requirement since you can easily skip these images in the annotation step.

Step 3: Annotate Images with labelImg

You can watch this video below to understand how to use labelImg to annotate images and export annotations. You can also take a look at the GitHub repo here.

For the current Tom & Jerry problem, I am providing you with a labels folder that already contains the .xml annotation file for each image. If you want to try a different dataset, then go ahead; just make sure to put the labels of that dataset in the labels folder.

Note: We are not splitting the images into train and validation folders right now because we'll be doing that automatically at the tfrecord creation step. It would still be a good idea to set aside 10% of the data for proper testing/evaluation of the final trained detector, but since my purpose is to make this tutorial as simple as possible, I won't be doing that today; I already have a test folder with 4-5 images which I will evaluate on.

Step 4: Create a label Map file

TensorFlow requires a label map file, which maps each of the class labels to integer values. This label map is used in the training and detection process. This file should be saved in the training directory, which also contains the labels folder.

# You can add more classes by adding another item and giving them an id of 3 and so on.

pbtxt = '''

item {
    id: 1
    name: 'Jerry'
}

item {
    id: 2
    name: 'Tom'
}

'''

with open("training/label_map.pbtxt", "w") as text_file:
    text_file.write(pbtxt)

Step 5: Generate TFrecords

What are TFrecords?

TFRecords are just protocol buffers; they make the data reading/processing pipeline computationally efficient. The only downside is that they are not human-readable.

What are protocol Buffers?

A protocol buffer is a type of serialized structured data. It is more efficient than JSON, XML, pickle, and plain-text storage formats. Google created the Protobuf (protocol buffer) format in 2008 because of this efficiency, and since then it has been widely used by Google and the community. To read protobuf files (.proto files) you first need to compile them with a protobuf compiler, so now you probably understand why we needed to compile those proto files at the beginning.
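
Just to give you a feel for what that looks like in code, here is a tiny illustrative sketch (my own addition, with made-up field names) of serializing a single tf.train.Example into a .record file with TF 1.15:

import tensorflow as tf

# Build an Example protocol buffer with a couple of hypothetical fields.
example = tf.train.Example(features=tf.train.Features(feature={
    'image/filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'_000.jpg'])),
    'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# Serialize the Example and write it to a tfrecord file.
with tf.io.TFRecordWriter('example.record') as writer:
    writer.write(example.SerializeToString())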

Here’s a nice tutorial by Naveen that explains how you can create a tfrecord for different data types and Here’s a more detailed explanation of protocol buffers with an example.

The create_tf_record.py script I'll be using to convert images/labels to tfrecords is taken from TensorFlow's pet example, but I've modified it so that it now accepts the following 5 arguments:

  1. Directory of images
  2. Directory of labels
  3. % of Split of Training data
  4. Path to label_map.pbtxt file
  5. Path to output tfrecord files

It returns a train.record and a val.record, splitting the data into training and validation sets. For this data, I'm using a 70% training and 30% validation split.

# Create the tfrecords directory if it does not exist. This is where the tfrecords will be stored.
tf_records = "training/tfrecords"
if not os.path.exists(tf_records):
    os.mkdir(tf_records)

# We are saving the record files in the folder named tfrecords.
# Change the slashes (i.e. \ or /) according to your OS.
# I'm using my own labels, you can replace them with your labels.
!python support/create_tf_record.py --image_dir training/images --split 0.7 --labels support/labels --output_path training/tfrecords --label_map training/label_map.pbtxt

Done Writing, Saved: training/tfrecords/train.record
Done Writing, Saved: training/tfrecords/val.record

You can ignore these warnings; we already know that we're using the older 1.15 version of the TFOD API, which contains some deprecated functions.

Most of the tfrecord scripts available online first tell you to convert your xml files to csv, then use another script to split the data into training and validation folders, and then yet another script to convert to tfrecords. The script above does all of this in one go.

Part 4: Downloading Model & Configuring it:

You can now go to the Model Zoo, select a model, and download its zip. Now unzip the contents of that folder and put them inside a directory named pretrained_model. The below script does this automatically for a Faster-RCNN-Inception model which is already trained on the COCO dataset. You can change the model name to download a different model.

# Specify pre-trained model name you want to download
MODEL = 'faster_rcnn_inception_v2_coco_2018_01_28'

# Add the zip extension to the model
MODEL_FILE = MODEL + '.tar.gz'

# Define the base URL
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# If the model directory is present then remove it for the new model
model_directory = "pretrained_model"
if os.path.exists(model_directory):
    shutil.rmtree(model_directory )

# Download the pretrained Model
opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)

# Here we are extracting the downloaded file
tar = tarfile.open(MODEL_FILE)
tar.extractall()
tar.close()

# Removing the downloaded zip file.
os.remove(MODEL_FILE)

# Rename model directory to pretrained_model
os.rename(MODEL, model_directory)

# Remove the checkpoint file so the model can be trained
os.remove(model_directory + '/checkpoint')

print('Model Downloaded')

Model Downloaded

Modify the pipeline.config file:

After downloading, you will have a number of files present in the pretrained_model folder; I will explain them later, but for now, let's take a look at the pipeline.config file.

The pipeline.config file defines how the whole training process will take place: which optimizer, loss, learning_rate, and batch_size will be used. Most of these params are already set by default and it's up to you whether you want to change them, but there are some paths in the pipeline.config file that you will need to change so that this model can be trained on our data.

So open up pipeline.config with a text editor like Notepad++ and change these 4 paths (and the number of classes):

  • Change: PATH_TO_BE_CONFIGURED/model.ckpt  to  pretrained_model/model.ckpt
  • Change: PATH_TO_BE_CONFIGURED/mscoco_train.record  to  training/tfrecords/train.record
  • Change: PATH_TO_BE_CONFIGURED/mscoco_val.record   to  training/tfrecords/val.record
  • Change: PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt  to  training/label_map.pbtxt
  • Change: num_classes: 90  to  num_classes: 2

If you’re lazy like me then no prob, below script does all this

# Path of pipeline configuration file
filename = model_directory + '/pipeline.config'

# Open the configuration file and read the whole file
with open(filename) as f:
    s = f.read()
    
# Now find and substitute the source paths with the destination paths.
with open(filename, 'w') as f:
    s = re.sub('PATH_TO_BE_CONFIGURED/model.ckpt', model_directory + '/model.ckpt', s)
    s = re.sub('PATH_TO_BE_CONFIGURED/mscoco_train.record', 'training/tfrecords/train.record', s)
    s = re.sub('PATH_TO_BE_CONFIGURED/mscoco_val.record', 'training/tfrecords/val.record', s)
    s = re.sub('PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt', 'training/label_map.pbtxt', s)
    
    # Since we have 2 classes (Tom, Jerry) so we set this value to 2.
    s = re.sub('num_classes: 90', 'num_classes: 2', s)

    
    # Doing a little correction to avoid an error in training.
    s = re.sub('step: 0', 'step: 1', s)
            
    # I'm also changing the default batch_size of 1 to be 10 for this example
    s = re.sub('batch_size: 1', 'batch_size: 10', s)
    f.write(s)

Notice the correction I did by replacing step: 0 with step: 1. Unfortunately, different models sometimes require such small corrections, but you can easily figure out what exactly needs to be changed by pasting the error generated during training into Google; click on the GitHub issues for that error and you'll find a solution.

Note: These issues seem to be mostly present in TFOD API Version 1

Changing Important Params in Pipeline.config File:

Additionally, I’ve also changed the batch size of the model, just like batch_size, there are lots of important parameters that you would want to tune. I would strongly recommend that you try to change the values according to your problem. Almost always the default values are not optimal for your custom use case. I should tell you that to tune most of these values you need some prior knowledge, make sure to at least change the batch_size according to your system’s memory and learning_rate of the model.

Part 5: Training and Exporting Inference Graph:

You can start training the model by calling the model_main.py script from the object_detection folder; we are giving it the following arguments:

  • num_train_steps: These are the number of times your model weights will be updated using a batch of data.
  • pipeline_config_path: This is the path to your pipeline.config file.
  • model_dir: Path to the output directory where the final checkpoint files will be saved.

Now you can run the below cell to start training, but I would recommend running it on the command line instead; you can just paste this line:

# Start Training
!python models/research/object_detection/model_main.py --pipeline_config_path="pretrained_model/pipeline.config" --model_dir="pretrained_model" --num_train_steps=20000

Note: When you start training you will see a lot of warnings; just ignore them, as TFOD 1 contains a lot of deprecated functions.

Once you start training, the network will take some time to initialize and then the training will start. After every few minutes, you will see a report of loss values and a global loss; the network is learning if the loss is going down. If you're not familiar with object detection jargon like IOU etc., then just make a note of the final global loss after each report.

You ideally want to set num_train_steps to tens of thousands of steps; you can always end training by pressing CTRL + C on the command prompt if the loss has decreased sufficiently. If training is taking place in a jupyter notebook, then you can end it by pressing the Stop button on top.

After training has ended, or you've stopped it, there will be some new files in the pretrained_model folder. Among all these files we will only need the checkpoint (ckpt) files.

If you’re training for 1000s of steps (which is most likely the case) then I would strongly recommend that you don’t use your CPU but utilize a GPU. If you don’t have one then it’s best to use Google Colab’s GPU. I’m already providing you a ready-to-run colab Notebook.

Note: There’s another script for training called train.py, this is an older script where you can see the loss value for each step, if you want to use that script then you can find it at models / research / object_detection / legacy / train.py

You can run this script by doing:

python models/research/object_detection/legacy/train.py --pipeline_config_path="pretrained_model/pipeline.config" --train_dir="pretrained_model"

The best way to monitor training is to use TensorBoard; I will discuss this another time.

Export Frozen Inference Graph:

Now we will use the export_inference_graph.py script to create a frozen_inference_graph from the checkpoint files.

Why are we doing this?

After training, our model is stored in checkpoint format and saved_model format, but in OpenCV we need the model to be in the frozen_inference_graph format. So we need to generate the frozen_inference_graph using the checkpoint files.

What are these checkpoint files?

After every few minutes of training, TensorFlow outputs some checkpoint (ckpt) files. The number on those files represents how many training steps they have gone through. So during the frozen_inference_graph creation, we only take the latest checkpoint file (i.e. the file with the highest number), because this is the one that has gone through the most training steps.

Now every time a checkpoint file is saved, it’s split into 3 parts.

For the initial step these files are:

  • model.ckpt-000.data: This file contains the value of every single variable; it's pretty large.
  • model.ckpt-000.index: This file contains metadata for each tensor, e.g. checksum, auxiliary data, etc.
  • model.ckpt-000.meta: This file stores the graph structure of the model.
# Get all the files present in the pretrained_model directory
lst = os.listdir(model_directory)

# Get the most recent checkpoint file number.
lf = filter(lambda k: 'model.ckpt-' in k, lst)
check = sorted([int(x.split('ckpt-')[1].split('.')[0]) for x in sorted(lf)[:]])[-1]

# Attach that number to model.ckpt- and pass it to the export script.
checkpoint = 'model.ckpt-' + str(check)

# Run the export script, it takes the input_type, the pipeline.config path, the latest checkpoint file name and the output path.
!python models/research/object_detection/export_inference_graph.py --input_type image_tensor --pipeline_config_path pretrained_model/pipeline.config --trained_checkpoint_prefix pretrained_model/$checkpoint --output_directory fine_tuned_model

If you take a look at the fine_tuned_model folder which will be created after running the above command then you’ll find that it contains the same files you got when you downloaded the pre_trained model. This is the final folder.

Now your trained model is in 3 different formats: the saved_model format, the frozen_inference_graph format, and the checkpoint file format. For OpenCV, we only need the frozen inference graph format.

The checkpoint format is ideal for retraining and for getting other sorts of information about the model; for production and serving you will need either the frozen_inference_graph or the saved_model format. It's worth mentioning that both of these files have the .pb extension.

In TF 2, the frozen_inference_graph is deprecated and TF 2 encourages the use of the saved_model format; as said previously, unfortunately we can't use the saved_model format with OpenCV yet.

Run Inference on Trained Model (Bonus Step):

You can optionally run inference using TensorFlow sessions. I’m not going to explain much here since TF sessions are deprecated and our final goal is to use this model with OpenCV’s DNN module.

import cv2
import tensorflow as tf
import matplotlib.pyplot as plt

# Path to the exported frozen inference graph
frozen_graph_path = 'fine_tuned_model/frozen_inference_graph.pb'

# Read the graph.
with tf.gfile.FastGFile(frozen_graph_path, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Session() as sess:
    
    # Set the session's graph as the default graph
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')

    # Read the Image
    img = cv2.imread('support/test_images/test7.jpg')
    
    # Get the rows and cols of the image.
    rows = img.shape[0]
    cols = img.shape[1]
    
    # Resize the image to 300x300, this is the size the model was trained on
    inp = cv2.resize(img, (300, 300))
    
    # Convert OpenCV's BGR image to RGB
    inp = inp[:, :, [2, 1, 0]]  # BGR2RGB

    # Run the model
    out = sess.run([sess.graph.get_tensor_by_name('num_detections:0'),
                    sess.graph.get_tensor_by_name('detection_scores:0'),
                    sess.graph.get_tensor_by_name('detection_boxes:0'),
                    sess.graph.get_tensor_by_name('detection_classes:0')],
                   feed_dict={'image_tensor:0': inp.reshape(1, inp.shape[0], inp.shape[1], 3)})

    # These are the classes which we want to detect
    classes = {1: "Jerry", 2: 'Tom'}
    
    # Get the total number of Detections
    num_detections = int(out[0][0])
    
    # Loop for each detection
    for i in range(num_detections):
        
        # Get the probability of that class
        score = float(out[1][0][i])
        
        # Check if the score of the detection is big enough
        if score > 0.400:
                                
            # Get their Class ID
            classId = int(out[3][0][i])

            # Get the bounding box coordinates of that class
            bbox = [float(v) for v in out[2][0][i]]
            
            # Get the class name
            class_name = classes[classId]
            
            # Get the actual bounding box coordinates
            x = int(bbox[1] * cols)
            y = int(bbox[0] * rows)
            right = int(bbox[3] * cols)
            bottom = int(bbox[2] * rows)
            
            # Show the class name and the confidence
            cv2.putText(img, "{} {:.2f}%".format(class_name, score*100), (x, bottom+30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (255,0,255), 4)
            
            # Draw the bounding box
            cv2.rectangle(img, (x, y), (right, bottom), (125, 255, 51), thickness = 2)

# Show the image with matplotlib           
plt.figure(figsize=(10,10))
plt.imshow(img[:,:,::-1]);
[Output image: Tom and Jerry both detected]

Part 6: Generating .pbtxt and using the trained model with just OpenCV 

6 a) Export graph.pbtxt for the frozen inference graph:

We can use the above generated frozen graph inside the OpenCV DNN module to do detection, but most of the time we need another file called graph.pbtxt. This file contains a text description of the network architecture; it is required by OpenCV to rewire some network layers for optimization purposes.

This graph.pbtxt can be generated by using one of the 4 scripts provided by OpenCV. These scripts are:

  • tf_text_graph_ssd.py
  • tf_text_graph_faster_rcnn.py
  • tf_text_graph_mask_rcnn.py
  • tf_text_graph_efficientdet.py

They can be downloaded here; you will also find more information regarding them on that page.

Since the detection architecture we’re using is Faster-RCNN (you can tell by looking at the name of the downloaded model), we will use tf_text_graph_faster_rcnn.py to generate the pbtxt file. For .pbtxt generation you will need the frozen_inference_graph.pb file and the pipeline.config file.

Note: When you’re done with training you will also see a graph.pbtxt file inside the pretrained_model folder; this graph.pbtxt is different from the one generated by OpenCV’s .pbtxt generator scripts. One major difference is that OpenCV’s graph.pbtxt does not contain the model weights but only the graph description, so it is much smaller in size.

!python support/tf_text_graph_faster_rcnn.py --input "fine_tuned_model/frozen_inference_graph.pb" \
--config "pretrained_model/pipeline.config" --output "support/graph.pbtxt"

Number of classes: 2
Scales: [0.25, 0.5, 1.0, 2.0] Aspect ratios: [0.5, 1.0, 2.0]
Width stride: 16.000000
Height stride: 16.000000
Features stride: 16.000000

For model architectures that are not one of the above 4, you will need to convert TensorFlow’s .pbtxt file to OpenCV’s version yourself. You can find more on how to do that here. But be warned, this conversion is not a smooth process and a lot of low-level issues come up.

6 b) Using the Frozen inference graph along with Pbtxt file in OpenCV:

Now that we have generated the graph.pbtxt file with OpenCV’s tf_text_graph script, we can pass this file to cv2.dnn.readNetFromTensorflow() to initialize the network. All of our training work is done now. Make sure you’re familiar with OpenCV’s DNN module; if not, you can read my previous post on it.

Now we will create the following two functions:

Initialization Function: This function will initialize the network using the .pb and .pbtxt files; it will also set the class labels.

Main Function: This function will contain the rest of the code, from preprocessing to postprocessing; it will also have the option to either return the annotated image or display it with matplotlib.

# We're passing in the paths of the pbtxt file (graph description of the model) and our actual trained model

def initialize(pbtxt = 'support/graph.pbtxt', model = "fine_tuned_model/frozen_inference_graph.pb" ):
    
    # Define global variables
    global net, classes

    # The readNetFromTensorflow function takes both files and initializes the network
    net = cv2.dnn.readNetFromTensorflow(model, pbtxt)

    # Define Class Labels
    classes = {0: "Jerry", 1: 'Tom'}

This is our main function; the comments explain what’s going on.

def detect_object(img, returndata=False, conf = 0.9):
    
    # Get the rows, cols of Image 
    rows, cols, channels = img.shape

    # This is where we pre-process the image, Resize the image and Swap Image Channels
    # We're converting BGR channels to RGB since OpenCV reads in BGR and our model was trained on RGB images
    blob = cv2.dnn.blobFromImage(img, size=(300, 300), swapRB=True)
    
    # Set the blob as input to the network
    net.setInput(blob)

    # Runs a forward pass, this is where the model predicts on the image.
    networkOutputs = net.forward()

    # Loop over the output results
    for detection in networkOutputs[0,0]:
        
        # Get the score for each detection
        score = float(detection[2])
        
        # If the class score is bigger than our threshold
        if score > conf:
            
            # Get the index of the predicted class
            class_index = int(detection[1])
            
            # Use the class index to get the class name i.e. Jerry or Tom
            class_name = classes[class_index]
            
            # Get the bounding box coordinates.
            # Note: the returned coordinates are relative, i.e. in the 0-1 range,
            # so we multiply them by rows and cols to get the real pixel coordinates.
            x1 = int(detection[3] * cols)
            y1 = int(detection[4] * rows)
            x2 = int(detection[5] * cols)
            y2 = int(detection[6] * rows)
            
            # Show the class name and the confidence
            text = "{},  {:.2f}% ".format(class_name, score*100)
            cv2.putText(img, text, (x1, y2+ 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, 
                        (255,0,255), 2)
            
            # Draw the bounding box
            cv2.rectangle(img, (x1, y1), (x2, y2), (125, 255, 51), thickness = 2)

    # Return the annotated image if returndata is True
    if  returndata:
        return img
    
    # Otherwise show the full image.
    else:
        plt.figure(figsize=(10,10))
        plt.imshow(img[:,:,::-1]);plt.axis("off");                      

Note: When you do net.forward() you get an output of shape (1, 1, 100, 7). Since we’re predicting on a single image instead of a batch of images, you get (1, 1) at the start; the remaining (100, 7) means that there are 100 detections for that image and each detection contains 7 properties/variables.

There will be 100 detections for each image; this limit was set in the pipeline.config file, and you can choose to change it.
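If you want to check or change that limit, the relevant fields in a Faster-RCNN pipeline.config are typically max_detections_per_class and max_total_detections (field names assumed from the standard TFOD v1 config format); a quick way to locate them from the notebook is:

!grep -n "max_total_detections\|max_detections_per_class" pretrained_model/pipeline.config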

So here is what these 7 properties correspond to:

  1. The index of the image in the batch (for a single image this is 0)
  2. The index of the detected CLASS
  3. The score/confidence of that CLASS
  4. x1
  5. y1
  6. x2
  7. y2

The last 4 values are the relative bounding box coordinates, used to draw the bounding box of that CLASS object (see the short sketch below).
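To make this layout concrete, here is a minimal sketch (assuming initialize() from the next step has already been called) that runs a forward pass and unpacks the first detection row:

# Forward pass on a single test image
img = cv2.imread('support/test_images/test1.jpg')
blob = cv2.dnn.blobFromImage(img, size=(300, 300), swapRB=True)
net.setInput(blob)
detections = net.forward()   # shape: (1, 1, 100, 7)

# Unpack the 7 properties of the first detection
image_id, class_id, score, x1, y1, x2, y2 = detections[0, 0, 0]
print(class_id, score, x1, y1, x2, y2)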

Initialize the network

You will just need to call this once to initialize the network

# You can initialize the model using our provided trained model
initialize()

# Or use your own trained model
initialize(pbtxt = 'support/graph.pbtxt', model = 'fine_tuned_model/frozen_inference_graph.pb' )

Predict On Images

Now you can use the main function to perform predictions on different images. The images we will predict on are placed inside a folder named test_images; these images were not in the training dataset.

img = cv2.imread('support/test_images/test1.jpg')
detect_object(img)
[Output image: Jerry detected]
img = cv2.imread('support/test_images/test2.jpg')
detect_object(img)
[Output image: Tom detected]
img = cv2.imread('support/test_images/test6.jpg')
detect_object(img)
[Output image: Jerry detected]
img = cv2.imread('support/test_images/test3.png')
detect_object(img)
[Output image: Tom detected]
img = cv2.imread('support/test_images/test7.jpg')
detect_object(img)
[Output image: Tom and Jerry detected]

Summary

Limitations: Our final detector has decent accuracy but it’s not that robust, for 4 reasons:

  1. Transfer learning works best when the dataset you’re fine-tuning on shares some features with the original dataset the model was trained on. Most models are pre-trained on ImageNet, COCO, or PASCAL VOC, which are filled with animals and other real-world images, whereas our dataset consists of cartoon images, which are drastically different from real-world images. We can mitigate this by including more images and training more layers of the model.

  2. The animation of cartoon characters is not consistent; it changes a lot between different movies. So if you train the model on these pictures and then try to detect random Google images of Tom and Jerry, you won’t get good accuracy. We can solve this by including images of these characters from different movies, so the model learns the features that stay the same across movies.

  3. The images generated from the sample video created an imbalanced dataset: there are more Jerry images than Tom images. There are ways to handle this scenario, but try to get a decent balance of images for both classes to get the best results.

  4. The annotation is poor. The annotation I did was just for the sake of making this tutorial; in reality, you want to set a clear outline and standard for how you’ll annotate: are you going to annotate the whole head, are the ears included, is the neck part of it? You need to answer all these questions ahead of time.

I will stress again that if you’re not planning to use OpenCV for the final deployment then use TFOD API version 2, it’s a lot cleaner. However, if the final objective is to use OpenCV at the end, you could get away with TF 2 but it’s a lot of trouble.

Even with TFOD API v1, you can’t be sure that your custom trained model will always load correctly in OpenCV; there are times when you need to manually edit the graph.pbtxt file before you can use the model in OpenCV. If this happens and you’re sure you have done everything correctly, your best bet is to raise an issue here.

Hopefully, OpenCV will catch up and start supporting the TF 2 saved_model format, but it’s gonna take time. If you enjoyed this tutorial then please feel free to comment and I’ll gladly answer your questions.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

