Controlling Subway Surfers Game with Pose Detection using Mediapipe and Python


Watch The Video Here:

In last week’s tutorial, we learned how to work with real-time pose detection and created a pose classification system. In this week’s tutorial, we’ll learn to play a popular game called “Subway Surfers”.

Of course, there’s more to it; this is an AI blog after all.

We will actually be using our body pose to control the game, not keyboard controls. The entire application will work in real-time on your CPU; you don’t even need a depth camera or a Kinect, your webcam will suffice.

Excited yet? Let’s get into it, but before that let me tell you a short story that motivated me to build this application. It starts with me giving a lecture on the importance of physical fitness. I know … I know … how this sounds, but just bear with me for a bit.

Hi all, Taha Anwar here. So here’s the thing: one of the best things I enjoyed in my early teenage years was having a fast metabolism due to my involvement in physical activities. I could eat whatever I wanted, not make a conscious effort to exercise, and still stay fit.

But as I grew older and started spending most of my time in front of a computer, I noticed that I was actually gaining weight. So I could no longer afford the luxury of binging on unhealthy food and skipping workouts.

Now I’m a bit of a foodie, so although I could compromise a bit on how I eat, I still needed to cut weight some other way. I quickly realized that unless I wanted to get obese, I needed to make a conscious effort to work out.

That’s about when I joined a local gym in my area, and guess what? … it didn’t work out ( or I didn’t work out … enough 🙁 ), so I quit after a month.

So what was the reason? Well, I could provide multiple excuses, but to be honest, I was just lazy.

A few months later I joined the gym again, and again I quit after just 2 months.

Now I could have just quit completely, but instead, 8 months back I tried again. This time I even hired a trainer to keep me motivated, and as they say, the 3rd time’s a charm, and luckily it was!

8 months in, I’m still at it. I did see results and lost a couple of kgs, although I haven’t reached my personal target so I’m still working towards it.

If you’re reading this post then you’re probably into computer science just like me, and you most likely spend a lot of time in front of a PC; because of that, your physical and mental fitness can take a toll. And I seriously can’t stress enough how important it is that you take out a couple of hours each week to exercise.

I’m not a fitness guru but I can say working out has many key benefits:

  • Helps you shed excess weight, keeps you physically fit.
  • Gives you mental clarity and improves your work quality.
  • Lots of health benefits.
  • Helps you get a partner, if you’re still single like me … lol

Because of these reasons, even though I have an introverted personality, I consciously take out a couple of hours each week to go to the gym or the park for running.

But here’s the thing, sometimes I wonder: why can’t I combine what I do (working on a PC) with some physical activity, so I could … you know, kill two birds with one stone?

This thought led me to create this post. What I did was create a vision application that allows me to control a very popular game called Subway Surfers via my body movements, by utilizing real-time pose detection.

And so In this tutorial, I’ll show you how to create this application that controls the Subway Surfers game using body gestures and movements so that you can also exercise, code, and have fun at the same time.


How will this Work?

So this game is about a character running from a policeman, dodging different hurdles by jumping, crouching, and moving left and right. So we will need to worry about four controls that are normally triggered using a keyboard:

  • Up arrow key to make the character jump
  • Down arrow key to make the character crouch
  • Left arrow key to move the character to the left
  • Right arrow key to move the character to the right

Using the Pyautogui library, we will automatically trigger the required keypress events, depending upon the body movement of the person that we’ll capture using Mediapipe’s Pose Detection model.

I want the game’s character to:

  • Jump whenever the person controlling the character jumps.
  • Crouch whenever the person controlling the character crouches.
  • Move left whenever the person controlling the character moves to the left side of the screen.
  • Move right whenever the person controlling the character moves to the right side of the screen.

You can also use the techniques you’ll learn in this tutorial to control any other game. The simpler the game, the easier it will be to control. I have actually published two tutorials about game control via body gestures.

Alright now that we have discussed the basic mechanisms for creating this application, let me walk you through the exact step-by-step process I used to create this.

Outline

  1. Step 1: Perform Pose Detection
  2. Step 2: Control Starting Mechanism
  3. Step 3: Control Horizontal Movements
  4. Step 4: Control Vertical Movements
  5. Step 5: Control Keyboard and Mouse with PyautoGUI
  6. Step 6: Build the Final Application

Alright, let’s get started.

Download Code

Import the Libraries

We will start by importing the required libraries.
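For reference, here is a minimal, assumption-based set of imports that covers the code sketched throughout this post:

```python
import cv2                 # OpenCV for image/video handling
import pyautogui           # to trigger keyboard and mouse events
import mediapipe as mp     # mediapipe's pose detection solution
import matplotlib.pyplot as plt
from math import hypot     # used later for the hands-joined distance
```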

Initialize the Pose Detection Model

After that, we will need to initialize the mp.solutions.pose class, call the mp.solutions.pose.Pose() function with the appropriate arguments, and also initialize the mp.solutions.drawing_utils class that is needed to visualize the landmarks after detection.
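For instance, the setup might look like this (the argument values are assumptions in line with the rest of the tutorial):

```python
# Initialize the mediapipe pose class.
mp_pose = mp.solutions.pose

# Set up the Pose function for video.
pose_video = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.5,
                          model_complexity=1)

# Initialize the mediapipe drawing class, needed to visualize the landmarks.
mp_drawing = mp.solutions.drawing_utils
```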

Step 1: Perform Pose Detection

To implement the game control mechanisms, we will need the current pose info of the person controlling the game, as our intention is to control the character with the movement of the person in the frame. We want the game’s character to move left, right, jump and crouch with the identical movements of the person.

So we will create a function detectPose() that will take an image as input and perform pose detection on the person in the image using mediapipe’s pose detection solution to get thirty-three 3D landmarks on the body, and the function will display the results or return them depending upon the passed arguments.


This function is quite similar to the one we had created in the previous post. The only difference is that we are not plotting the pose landmarks in 3D and we are passing a few more optional arguments to the function mp.solutions.drawing_utils.draw_landmarks() to specify the drawing style.

You probably do not want to lose control of the game’s character whenever some other person comes into the frame (and starts controlling the character), so that annoying scenario is already taken care of, as the solution we are using only detects the landmarks of the most prominent person in the image.

So you do not need to worry about losing control as long as you are the most prominent person in the frame as it will automatically ignore the people in the background.
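Here is a sketch of what such a detectPose() helper might look like (the drawing colors, line thicknesses and argument names are assumptions):

```python
def detectPose(image, pose, draw=False, display=False):
    '''Perform pose detection on an image and draw/return the results (a sketch).'''
    # Create a copy of the input image to draw on.
    output_image = image.copy()

    # Convert the image from BGR to RGB, as mediapipe expects RGB input.
    imageRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Perform the pose detection.
    results = pose.process(imageRGB)

    # Draw the detected landmarks on the copy, if requested and a pose was found.
    if results.pose_landmarks and draw:
        mp_drawing.draw_landmarks(
            image=output_image,
            landmark_list=results.pose_landmarks,
            connections=mp_pose.POSE_CONNECTIONS,
            landmark_drawing_spec=mp_drawing.DrawingSpec(color=(255, 255, 255),
                                                         thickness=3, circle_radius=3),
            connection_drawing_spec=mp_drawing.DrawingSpec(color=(49, 125, 237),
                                                           thickness=2, circle_radius=2))

    # Either display the result with matplotlib or return it to the caller.
    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, results
```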

Now we will test the function detectPose() created above to perform pose detection on a sample image and display the results.
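A quick way to do that (the image path is just a placeholder; for still images a Pose object with static_image_mode=True is the better fit):

```python
# Set up a Pose object for still images and try the function on a sample image.
pose_image = mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.5,
                          model_complexity=1)

IMG_PATH = 'media/sample.jpg'   # placeholder path; change it to try other images
image = cv2.imread(IMG_PATH)
detectPose(image, pose_image, draw=True, display=True)
```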


It worked pretty well! If you want, you can test the function on other images too by just changing the value of the variable IMG_PATH in the cell above; it will work fine as long as there is a prominent person in the image.

Step 2: Control Starting Mechanism

In this step, we will implement the game starting mechanism: we want to start the game whenever the most prominent person in the image/frame joins both hands together. So we will create a function checkHandsJoined() that will check whether the hands of the person in an image are joined or not.

The function checkHandsJoined() will take in the results of the pose detection returned by the function detectPose() and will use the coordinates of the LEFT_WRIST and RIGHT_WRIST landmarks, from the list of thirty-three landmarks, to calculate the euclidean distance between the hands of the person.


It will then compare that distance with an appropriate threshold value to check whether the hands of the person in the image/frame are joined or not, and will display or return the results depending upon the passed arguments.
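A sketch of such a function (the 130-pixel threshold and the drawing details are assumptions to tune for your camera setup):

```python
def checkHandsJoined(image, results, draw=False, display=False):
    '''Check whether the person's hands are joined, using the wrist landmarks (a sketch).'''
    height, width, _ = image.shape
    output_image = image.copy()

    # Pixel coordinates of the left and right wrist landmarks.
    left_wrist = (results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST.value].x * width,
                  results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST.value].y * height)
    right_wrist = (results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST.value].x * width,
                   results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST.value].y * height)

    # Euclidean distance between the wrists.
    euclidean_distance = int(hypot(left_wrist[0] - right_wrist[0],
                                   left_wrist[1] - right_wrist[1]))

    # Compare with a threshold (tuned empirically) to decide the hands status.
    if euclidean_distance < 130:
        hand_status, color = 'Hands Joined', (0, 255, 0)
    else:
        hand_status, color = 'Hands Not Joined', (0, 0, 255)

    if draw:
        cv2.putText(output_image, hand_status, (10, 30),
                    cv2.FONT_HERSHEY_PLAIN, 2, color, 3)
        cv2.putText(output_image, f'Distance: {euclidean_distance}', (10, 70),
                    cv2.FONT_HERSHEY_PLAIN, 2, color, 3)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, hand_status
```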

Now we will test the function checkHandsJoined() created above on a real-time webcam feed to check whether it is working as we had expected or not.

Output Video:

Woah! I am stunned; the pose detection solution is best known for its speed, which is reflected in the results, as the distance and the hands status update very fast and are also highly accurate.

Step 3: Control Horizontal Movements

Now comes the implementation of the left and right movement control mechanism of the game’s character: we want to make the game’s character move left and right with the horizontal movements of the person in the image/frame.

So we will create a function checkLeftRight() that will take in the pose detection results returned by the function detectPose() and will use the x-coordinates of the RIGHT_SHOULDER and LEFT_SHOULDER landmarks to determine the horizontal position (Left, Right, or Center) of the person in the frame, after comparing the landmarks with the x-coordinate of the center of the image.

The function will visualize or return the resultant image and the horizontal position of the person depending upon the passed arguments.
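Here is one way such a checkLeftRight() helper could be sketched (the annotation style is an assumption):

```python
def checkLeftRight(image, results, draw=False, display=False):
    '''Determine whether the person is on the Left, Right or Center of the frame (a sketch).'''
    height, width, _ = image.shape
    output_image = image.copy()

    # x-coordinates of the shoulders, scaled to pixel values.
    left_x = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER.value].x * width)
    right_x = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER.value].x * width)

    # Both shoulders on the left half of the frame -> person is on the left.
    if right_x <= width // 2 and left_x <= width // 2:
        horizontal_position = 'Left'
    # Both shoulders on the right half of the frame -> person is on the right.
    elif right_x >= width // 2 and left_x >= width // 2:
        horizontal_position = 'Right'
    # Shoulders on opposite sides of the center line -> person is at the center.
    else:
        horizontal_position = 'Center'

    if draw:
        cv2.putText(output_image, horizontal_position, (5, height - 10),
                    cv2.FONT_HERSHEY_PLAIN, 2, (255, 255, 255), 3)
        cv2.line(output_image, (width // 2, 0), (width // 2, height), (255, 255, 255), 2)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, horizontal_position
```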


Now we will test the function checkLeftRight() created above on a real-time webcam feed and will visualize the results updating in real-time with the horizontal movements.

Output Video:

Cool! The speed and accuracy of this model never fail to impress me.

Step 4: Control Vertical Movements

In this one, we will implement the jump and crouch control mechanism of the game’s character: we want to make the game’s character jump and crouch whenever the person in the image/frame jumps and crouches.

So we will create a function checkJumpCrouch() that will check whether the posture of the person in an image is Jumping, Crouching, or Standing, by utilizing the results of pose detection returned by the function detectPose().

The function checkJumpCrouch() will retrieve the RIGHT_SHOULDER and LEFT_SHOULDER landmarks from the list to calculate the y-coordinate of the midpoint of both shoulders and will determine the posture of the person by doing a comparison with an appropriate threshold value.

The threshold (MID_Y) will be the approximate y-coordinate of the midpoint of both shoulders of the person while in standing posture. It will be calculated before starting the game in Step 6: Build the Final Application and will be passed to the function checkJumpCrouch().

But the issue with this approach is that the midpoint of both shoulders of the person while in standing posture will not always be exactly the same, as it will vary when the person moves closer to or farther from the camera.

To tackle this issue we will add and subtract a margin to the threshold to get an upper and lower bound as shown in the image below.

pose detection threshold
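Here is a sketch of what the checkJumpCrouch() helper could look like (the default MID_Y value and the 15/100-pixel margins are assumptions to tune):

```python
def checkJumpCrouch(image, results, MID_Y=250, draw=False, display=False):
    '''Classify the posture as Jumping, Crouching or Standing (a sketch).'''
    height, width, _ = image.shape
    output_image = image.copy()

    # y-coordinates of the shoulders, scaled to pixel values.
    left_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER.value].y * height)
    right_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER.value].y * height)

    # Midpoint of both shoulders.
    actual_mid_y = abs(left_y + right_y) // 2

    # Upper and lower bounds around the standing-posture midpoint.
    lower_bound = MID_Y - 15
    upper_bound = MID_Y + 100

    if actual_mid_y < lower_bound:
        posture = 'Jumping'
    elif actual_mid_y > upper_bound:
        posture = 'Crouching'
    else:
        posture = 'Standing'

    if draw:
        cv2.putText(output_image, posture, (5, height - 50),
                    cv2.FONT_HERSHEY_PLAIN, 2, (255, 255, 255), 3)
        cv2.line(output_image, (0, MID_Y), (width, MID_Y), (255, 255, 255), 2)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, posture
```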

Now we will test the function checkJumpCrouch() created above on the real-time webcam feed and will visualize the resultant frames. For testing purposes, we will be using a default value of the threshold, which you can manually tune according to your height if you want.

Output Video:

Great! When I lower my shoulders a certain distance below the horizontal line (threshold), the result is Crouching; whenever my shoulders are near the horizontal line (i.e., between the upper and lower bounds), the result is Standing; and when my shoulders are a certain distance above the horizontal line, the result is Jumping.

Step 5: Control Keyboard and Mouse with PyautoGUI

The Subway Surfers character wouldn’t be able to move left, right, jump, or crouch unless we provide it with the required keyboard inputs. Now that we have the functions checkHandsJoined(), checkLeftRight(), and checkJumpCrouch(), we need to figure out a way to trigger the required keyboard keypress events depending upon the output of the functions created above.

This is where the PyAutoGUI API shines. It allows you to easily control mouse and keyboard events through scripts. To get an idea of PyAutoGUI’s capabilities, you can check this video in which a bot is playing the game Sushi Go Round.

To run the cells in this step, it is not recommended to use the keyboard shortcut (Shift + Enter), as the cells with keypress events will behave differently when those events are combined with the Shift and Enter keys. You can either use the menubar (Cell >> Run Cell) or the toolbar (▶️ Run) to run the cells.

Now let’s see how simple it is to trigger the up arrow keypress event using pyautogui.
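For example:

```python
# Press (and release) the up arrow key once.
pyautogui.press('up')
```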

Similarly, we can trigger the down arrow or any other keypress event by replacing the argument with that key name (the argument should be a string). You can click here to see the list of valid arguments.

To press multiple keys, we can pass a list of strings (key names) to the pyautogui.press() function.

Or to press the same key multiple times, we can pass a value (number of times we want to press the key) to the argument presses in the pyautogui.press() function.
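Both variations look like this:

```python
# Press a sequence of keys one after the other.
pyautogui.press(['down', 'left', 'right'])

# Press the same key multiple times using the presses argument.
pyautogui.press('up', presses=3)
```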

This function presses the key(s) down and then releases the key(s) automatically. We can also control the keypress event and the key release event individually by using the functions:

  • pyautogui.keyDown(key): Presses and holds down the specified key.
  • pyautogui.keyUp(key): Releases the specified key.

So with the help of these functions, keys can be pressed for a longer period. For example, in the cell below we will hold down the shift key and press the enter key two times to run the two cells below this one, and then we will release the shift key.

Now we will hold down the shift key and press the tab key and then we will release the shift key. This will switch the tab of your browser so make sure to have multiple tabs before running the cell below.
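A sketch of both examples described above:

```python
# Hold down shift, press enter twice (running the next two notebook cells),
# then release shift.
pyautogui.keyDown('shift')
pyautogui.press('enter', presses=2)
pyautogui.keyUp('shift')

# Hold down shift, press tab (switching the browser tab), then release shift.
pyautogui.keyDown('shift')
pyautogui.press('tab')
pyautogui.keyUp('shift')
```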

To trigger mouse keypress events, we can use the pyautogui.click() function, and to specify the mouse button that we want to press, we can pass the value left, middle, or right to the argument button.

We can also move the mouse cursor to a specific position on the screen by specifying the x and y-coordinate values to the arguments x and y respectively.
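For example:

```python
# Click the left mouse button at the current cursor position.
pyautogui.click(button='left')

# Move the cursor to a specific screen position, then right-click there.
pyautogui.moveTo(x=500, y=300)
pyautogui.click(x=500, y=300, button='right')
```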

Step 6: Build the Final Application

In the final step, we will have to combine all the components to build the final application.

We will use the outputs of the functions created above, checkHandsJoined() (to start the game), checkLeftRight() (to control horizontal movements) and checkJumpCrouch() (to control vertical movements), to trigger the relevant keyboard and mouse events and control the game’s character with our body movements.

Now we will run the cell below and click here to play the game in our browser using our body gestures and movements.
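To give a sense of how the pieces fit together, here is a condensed sketch of the kind of main loop that final cell might contain. The webcam resolution, the click coordinates used to focus the game window, and the simplified start handling are all assumptions; the actual notebook cell handles a few more details (such as resuming after a game over).

```python
camera_video = cv2.VideoCapture(0)
camera_video.set(3, 1280)
camera_video.set(4, 960)

game_started = False   # becomes True once the player joins both hands
x_pos_index = 1        # 0 = left lane, 1 = center, 2 = right lane
y_pos_index = 1        # 0 = jumping, 1 = standing, 2 = crouching
MID_Y = None           # shoulders' midpoint captured when the game starts

while camera_video.isOpened():
    ok, frame = camera_video.read()
    if not ok:
        continue

    frame = cv2.flip(frame, 1)
    frame, results = detectPose(frame, pose_video, draw=game_started)

    if results.pose_landmarks:
        if game_started:
            # Horizontal movement: press left/right when the lane changes.
            frame, horizontal_position = checkLeftRight(frame, results, draw=True)
            if (horizontal_position == 'Left' and x_pos_index != 0) or \
               (horizontal_position == 'Center' and x_pos_index == 2):
                pyautogui.press('left')
                x_pos_index -= 1
            elif (horizontal_position == 'Right' and x_pos_index != 2) or \
                 (horizontal_position == 'Center' and x_pos_index == 0):
                pyautogui.press('right')
                x_pos_index += 1

            # Vertical movement: press up/down on jump/crouch transitions.
            if MID_Y:
                frame, posture = checkJumpCrouch(frame, results, MID_Y, draw=True)
                if posture == 'Jumping' and y_pos_index == 1:
                    pyautogui.press('up')
                    y_pos_index = 0
                elif posture == 'Crouching' and y_pos_index == 1:
                    pyautogui.press('down')
                    y_pos_index = 2
                elif posture == 'Standing' and y_pos_index != 1:
                    y_pos_index = 1
        else:
            # Wait for the player to join both hands to start the game.
            frame, hand_status = checkHandsJoined(frame, results, draw=True)
            if hand_status == 'Hands Joined':
                game_started = True
                # Capture the standing-posture midpoint of the shoulders.
                height = frame.shape[0]
                left_y = int(results.pose_landmarks.landmark[
                    mp_pose.PoseLandmark.LEFT_SHOULDER.value].y * height)
                right_y = int(results.pose_landmarks.landmark[
                    mp_pose.PoseLandmark.RIGHT_SHOULDER.value].y * height)
                MID_Y = abs(left_y + right_y) // 2
                # Click once to focus/start the game window (coordinates are a guess).
                pyautogui.click(x=1300, y=800, button='left')

    cv2.imshow('Subway Surfers with Pose Detection', frame)
    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

camera_video.release()
cv2.destroyAllWindows()
```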

Output Video:

While building big applications like this one, I always divide the application into smaller components and then, in the end, integrate all those components to make the final application.

This makes it really easy to learn and understand how everything comes together to build up the full application.

Join My Course: Computer Vision For Building Cutting Edge Applications

The only course out there that goes beyond basic AI applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, and hand and body gestures. Don’t miss your chance to level up and take your career to new heights.

You’ll Learn about:

  • Creating GUI interfaces for Python AI scripts.
  • Creating .exe DL applications
  • Using a Physics library in Python & integrating it with AI
  • Advanced Image Processing Skills
  • Advanced Gesture Recognition with Mediapipe
  • Task Automation with AI & CV
  • Training an SVM Machine Learning Model.
  • Creating & Cleaning an ML dataset from scratch.
  • Training DL models & how to use CNNs & LSTMs.
  • Creating 10 Advanced AI/CV Applications
  • & More

Whether you’re a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect & build complex, real-world, and thrilling AI applications.

Summary:

In this tutorial, we learned to perform pose detection on the most prominent person in the frame/image to get thirty-three 3D landmarks, then used those landmarks to extract useful info about the body movements of the person (horizontal position, i.e. left, center, or right, and posture, i.e. jumping, standing, or crouching), and then used that info to control a simple game.

Another thing we have learned is how to automatically trigger the mouse and keyboard events programmatically using the Pyautogui library.

Now one drawback of controlling the game with body movements is that the game becomes much harder compared to controlling it via keyboard presses. 

But our aim, to make exercise fun and to learn to control Human-Computer Interaction (HCI) based games using AI, is achieved. Now if you want, you can extend this application further to control a much more complex application.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Vehicle Detection with OpenCV using Contours + Background Subtraction (Pt:4)


Watch the Full Video Here:

Vehicle detection has been a challenging part of building intelligent traffic management systems. Such systems are critical for addressing the ever-increasing number of vehicles on road networks that cannot keep up with the pace of increasing traffic. Today many methods that deal with this problem use either traditional computer vision or complex deep learning models.

Popular computer vision techniques include vehicle detection using optical flow, but in this tutorial, we are going to perform vehicle detection using another traditional computer vision technique that utilizes background subtraction and contour detection to detect vehicles. This means you won’t have to spend hundreds of hours in data collection or annotation for building deep learning models, which can be tedious, to say the least. Not to mention, the computation power required to train the models.

This post is the fourth and final part of our Contour Detection 101 series. All 4 posts in the series are titled as:

  1. Contour Detection 101: The Basics  
  2. Contour Detection 101: Contour Manipulation
  3. Contour Detection 101: Contour Analysis 
  4. Vehicle Detection with OpenCV using Contours + Background Subtraction (This Post)

So if you are new to the series and unfamiliar with contour detection, make sure you check them out!

In part 1 of the series, we learned the basics, how to detect and draw the contours, in part 2 we learned to do some contour manipulations and in the third part, we analyzed the detected contours for their properties to perform tasks like object detection. Combining these techniques with background subtraction will enable us to build a useful application that detects vehicles on a road. And not just that but you can use the same principles that you learn in this tutorial to create other motion detection applications.

So let’s dive into how vehicle detection with background subtraction works.

Import the Libraries

Let’s First start by importing the libraries.
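A minimal, assumption-based set of imports for the code below:

```python
import cv2
import numpy as np
```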

Car Detection using Background Subtraction

Background subtraction is a simple yet effective technique to extract objects from an image/video. Consider a highway on which cars are moving, and you want to extract each car. One easy way is to take a picture of the highway with the cars (called the foreground image) while also having a saved image in which the highway does not contain any cars (the background image); you subtract the background image from the foreground to get the segmented mask of the cars and then use that mask to extract the cars.

But in many cases you don’t have a clear background image; an example of this can be a highway that is always busy, or maybe a walking destination that is always crowded. So in those cases, you can subtract the background by other means. For example, in the case of a video you can detect the movement of the objects, so the objects that move can be treated as the foreground and the parts that remain static as the background.

Several algorithms have been invented for this purpose. OpenCV has implemented a few such algorithms which are very easy to use. Let’s see one of them.

BackgroundSubtractorMOG2

BackgroundSubtractorMOG2 is a background/foreground segmentation algorithm based on two papers by Z. Zivkovic, “Improved adaptive Gaussian mixture model for background subtraction” (IEEE 2004) and “Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction” (Elsevier BV 2006). One important feature of this algorithm is that it provides better adaptability to varying scenes due to illumination changes, which saves you from having to worry about maintaining a fixed background image. Let’s see how it works.

Function Syntax:

object = cv2.createBackgroundSubtractorMOG2(history, varThreshold, detectShadows)

Parameters:

  • history (optional) – It is the length of the history. Its default value is 500.
  • varThreshold (optional) – It is the threshold on the squared distance between the pixel and the model to decide whether a pixel is well described by the background model. It does not affect the background update and its default value is 16.
  • detectShadows (optional) – It is a boolean that determines whether the algorithm will detect and mark shadows or not. It marks shadows in gray color. Its default value is True. It decreases the speed a bit, so if you do not need this feature, set the parameter to false.

Returns:

  • object – It is the MOG2 Background Subtractor.
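To make the usage concrete, here is a minimal sketch of applying the MOG2 subtractor to a video and extracting the foreground with the resulting mask (the video path is a placeholder):

```python
video = cv2.VideoCapture('media/traffic.mp4')

# Create the MOG2 background subtractor (shadows will be marked in gray).
backgroundObject = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Apply the subtractor to get the foreground mask for this frame.
    fgmask = backgroundObject.apply(frame)

    # Extract the foreground part of the frame using the mask.
    foreground = cv2.bitwise_and(frame, frame, mask=fgmask)

    # Show the mask, the original frame and the extracted foreground side by side.
    stacked = cv2.hconcat([cv2.cvtColor(fgmask, cv2.COLOR_GRAY2BGR), frame, foreground])
    cv2.imshow('Mask | Original | Foreground', stacked)

    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

video.release()
cv2.destroyAllWindows()
```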

Output:

In the output, the middle frame is the original video; on the left we have the background subtraction result with shadows, while on the right we have the foreground part produced using the background subtraction mask.

Creating the Vehicle Detection Application

Alright once we have our background subtraction method ready, we can build our final application!

Here’s a breakdown of the steps we need to perform for the complete background subtraction based vehicle detection (a sketch of the full loop follows the steps below).

1) Start by loading the video using the function cv2.VideoCapture() and create a background subtractor object using the function cv2.createBackgroundSubtractorMOG2().

2) Then we will use the backgroundsubtractor.apply() method to get the segmented masks for the frames of the video after reading the frames one by one using the function cv2.VideoCapture.read().

3) Next, we will apply thresholding on the mask using the function cv2.threshold() to get rid of shadows and then perform Erosion and Dilation to improve the mask further using the functions cv2.erode() and cv2.dilate().

4) Then we will use the function cv2.findContours() to detect the contours on the mask image and convert the contour coordinates into bounding box coordinates for each car in the frame using the function cv2.boundingRect(). We will also check the area of the contour using cv2.contourArea() to make sure it is greater than a threshold for a car contour.

5) After that we will use the functions cv2.rectangle() and cv2.putText() to draw and label the bounding boxes on each frame and extract the foreground part of the video with the help of the segmented mask using the function cv2.bitwise_and().
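Here is a sketch of the full loop following those five steps. The video path, the threshold of 250 used to drop the gray shadow pixels, the kernel size, and the contour-area cutoff are assumptions to tune for your footage (the contour call below assumes OpenCV 4.x, which returns two values):

```python
video = cv2.VideoCapture('media/traffic.mp4')

# Step 1: create the background subtractor object.
backgroundObject = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
kernel = np.ones((3, 3), np.uint8)

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Step 2: get the segmented mask for the current frame.
    fgmask = backgroundObject.apply(frame)

    # Step 3: threshold out the gray shadow pixels, then erode and dilate.
    _, fgmask = cv2.threshold(fgmask, 250, 255, cv2.THRESH_BINARY)
    fgmask = cv2.erode(fgmask, kernel, iterations=1)
    fgmask = cv2.dilate(fgmask, kernel, iterations=2)

    # Step 4: detect contours on the mask and keep only the large ones.
    contours, _ = cv2.findContours(fgmask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    frameCopy = frame.copy()
    for contour in contours:
        if cv2.contourArea(contour) > 400:
            x, y, w, h = cv2.boundingRect(contour)
            # Step 5: draw and label the bounding box of each detected car.
            cv2.rectangle(frameCopy, (x, y), (x + w, y + h), (0, 0, 255), 2)
            cv2.putText(frameCopy, 'Car Detected', (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1, cv2.LINE_AA)

    # Step 5 (continued): extract the foreground part of the frame.
    foreground = cv2.bitwise_and(frame, frame, mask=fgmask)

    cv2.imshow('Detections', frameCopy)
    cv2.imshow('Foreground', foreground)
    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

video.release()
cv2.destroyAllWindows()
```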

Output:

This seems to have worked out well, that too without having to train large-scale Deep learning models!

There are many other background subtraction algorithms in OpenCV that you can use. Check out here and here for further details about them.

Summary

Vehicle Detection is a popular computer vision problem. This post explored how traditional machine vision tools can still be utilized to build applications that can effectively deal with modern vision challenges.

We used a popular background/foreground segmentation technique called background subtraction to isolate our regions of interest from the image.    

We also saw how contour detection can prove useful when dealing with vision problems, along with the pre-processing and post-processing that can be used to filter out noise in the detected contours.

Although these techniques can be robust, they are not as generalizable as Deep learning models so it’s important to put more focus on deployment conditions and possible variations when building vision applications with such techniques.

This post concludes the four-part series on contour detection. If you enjoyed this post and followed the rest of the series do let me know in the comments and you can also support me and the Bleed AI team on patreon here.

If you need 1 on 1 coaching in AI/computer vision regarding your project or your career, then you can reach out to me personally here.


Hire Us

Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies

Real-Time 3D Pose Detection & Pose Classification with Mediapipe and Python


In this tutorial, we’ll learn how to do real-time 3D pose detection using the mediapipe library in python. After that, we’ll calculate angles between body joints and combine them with some heuristics to create a pose classification system. 

All of this will work on real-time camera feed using your CPU as well as on images. See results below.

pose detection

The code is really simple; for a detailed code explanation do also check out the YouTube tutorial, although this blog post will suffice to get the code up and running in no time.

This post can be split into 3 parts:

Part 1 (a): Introduction to Pose Detection

Part 1 (b): Mediapipe’s Pose Detection Implementation

Part 2: Using Pose Detection in images and on videos

Part 3: Pose Classification with Angle Heuristics

Part 1 (a): Introduction to Pose Detection:

Pose Detection or Pose Estimation is a very popular problem in computer vision; in fact, it belongs to a broader class of computer vision problems called keypoint estimation. Today we’ll learn to do pose detection, where we’ll try to localize 33 key body landmarks on a person, e.g. elbows, knees, ankles, etc. See the image below:

Some interesting applications of pose detection are:

  • Full body Gesture Control to control anything from video games (e.g. kinect) to physical appliances, robots etc. Check this.
  • Full body Sign Language Recognition. Check this.
  • Creating Fitness / exercise / dance monitoring applications. Check this.
  • Creating Augmented reality applications that overlay virtual clothes or other accessories over someone’s body. Check this.

Now, these are just some interesting things you can make using pose detection, as you can see it’s a really interesting problem.

And that’s not it; there are other types of keypoint detection problems too, e.g. facial landmark detection, hand landmark detection, etc.

We will actually learn to do both of the above in the upcoming tutorials.

Key point detection in turn belongs to a major computer vision branch called Image recognition, other broad classes of vision that belong in this branch are Classification, Detection, and Segmentation.

Here’s a very generic definition of each class.

  • In classification we try to classify whole images or videos as belonging to a certain class.
  • In Detection we try to classify and localize objects or classes of interest.
  • In Segmentation, we try to extract/segment or find the exact boundary/outline of our target object/class.
  • In Keypoint Detection, we try to localize predefined points/landmarks.

It should be noted that each of the major categories above has subcategories or different types. A few weeks ago I wrote a post on selfie segmentation using mediapipe where I talked about various segmentation types, so be sure to read that post.

If you’re new to computer vision and just exploring the waters, check this page from paperswithcode; it lists a lot of subcategories of the above major categories. Now don’t be confused by the categorization that paperswithcode has done; personally speaking, I don’t agree with the way they have sorted subcategories with applications, and there are some other issues. The takeaway is that there are a lot of variations in computer vision problems, but the 4 categories I’ve listed above are some major ones.

Part 1 (b): Mediapipe’s Pose Detection Implementation:

Here’s a brief introduction to Mediapipe;

 “Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media & It was built by Google”

Not only is this tool backed by Google, but models in Mediapipe are actively used in Google products. So you can expect nothing less than state-of-the-art performance from this library.

Now MediaPipe’s pose detection is a state-of-the-art solution for high-fidelity (i.e. high-quality) and low-latency (i.e. damn fast) detection of 33 3D landmarks on a person in real-time video feeds on low-end devices, i.e. phones, laptops, etc.

Alright, so what makes this pose detection model from Mediapipe so fast?

They are actually using a very successful deep learning recipe that is creating a 2 step detector, where you combine a computationally expensive object detector with a lightweight object tracker.

Here’s how this works:

You run the detector on the first frame of the video to localize the person and provide a bounding box around them; after that, the tracker takes over and predicts the landmark points inside that bounding box ROI. The tracker continues to run on subsequent frames using the previous frame’s ROI and only calls the detection model again when it fails to track the person with high confidence.

Their model works best if the person is standing 2-4 meters away from the camera and one major limitation of their model is that this approach only works for single-person pose detection, it’s not applicable for multi-person detection.

Mediapipe actually trained 3 models, with different tradeoffs between speed and performance. You’ll be able to use all 3 of them with mediapipe.

Method            Latency (Pixel 3, TFLite GPU)    Latency (MacBook Pro, 15-inch 2017)
BlazePose.Heavy   53 ms                            38 ms
BlazePose.Full    25 ms                            27 ms
BlazePose.Lite    20 ms                            25 ms

The detector used in pose detection is inspired by Mediapipe’s lightweight BlazeFace model; you can read this paper. For the landmark model used in pose detection, you can read this paper for more details, or read Google’s blog on it.

Here are the 33 landmarks that this model detects:

Alright now that we have covered some basic theory and implementation details, let’s get into the code.

Download Code

Part 2: Using Pose Detection in images and on videos

Import the Libraries

Let’s start by importing the required libraries.
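A minimal, assumption-based list of imports for the code in this post:

```python
import cv2
import mediapipe as mp
import matplotlib.pyplot as plt
from math import degrees, atan2   # used later for the angle calculation
```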

Initialize the Pose Detection Model

The first thing that we need to do is initialize the pose class using the mp.solutions.pose syntax and then we will call the setup function mp.solutions.pose.Pose() with the arguments:

  • static_image_mode – It is a boolean value; if set to False, the detector is only invoked as needed, that is, in the very first frame or when the tracker loses track. If set to True, the person detector is invoked on every input image. So you should probably set this value to True when working with a bunch of unrelated images, not videos. Its default value is False.
  • min_detection_confidence – It is the minimum detection confidence in the range (0.0, 1.0) required to consider the person-detection model’s prediction correct. Its default value is 0.5, meaning that if the detector has a prediction confidence of 50% or greater, it is considered a positive detection.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) required to consider the landmark-tracking model’s tracked pose landmarks valid. If the confidence is less than the set value then the detector is invoked again in the next frame/image, so increasing its value increases the robustness, but also increases the latency. Its default value is 0.5.
  • model_complexity – It is the complexity of the pose landmark model. As there are three different models to choose from, the possible values are 0, 1, or 2. The higher the value, the more accurate the results, but at the expense of higher latency. Its default value is 1.
  • smooth_landmarks – It is a boolean value; if set to True, pose landmarks across different frames are filtered to reduce noise, but it only works when static_image_mode is also set to False. Its default value is True.

Then we will also initialize mp.solutions.drawing_utils class that will allow us to visualize the landmarks after detection, instead of using this, you can also use OpenCV to visualize the landmarks.
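A sketch of that setup, using example argument values in line with the descriptions above (the confidence value is an assumption; model_complexity=2 corresponds to the heavy model whose download message is shown below):

```python
# Initialize the mediapipe pose class.
mp_pose = mp.solutions.pose

# Set up the Pose function for still images, using the heavy landmark model.
pose = mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.3,
                    model_complexity=2)

# Initialize the drawing class used to visualize the detected landmarks.
mp_drawing = mp.solutions.drawing_utils
```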

Downloading model to C:\ProgramData\Anaconda3\lib\site-packages\mediapipe/modules/pose_landmark/pose_landmark_heavy.tflite

Read an Image

Now we will read a sample image using the function cv2.imread() and then display that image using the matplotlib library.
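A small sketch of that step (the image path is a placeholder):

```python
# Read a sample image and display it with matplotlib (BGR -> RGB for display).
sample_img = cv2.imread('media/sample.jpg')
plt.figure(figsize=[10, 10])
plt.title('Sample Image')
plt.axis('off')
plt.imshow(sample_img[:, :, ::-1])
plt.show()
```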

Perform Pose Detection

Now we will pass the image to the pose detection machine learning pipeline by using the function mp.solutions.pose.Pose().process(). But the pipeline expects the input images in RGB color format so first we will have to convert the sample image from BGR to RGB format using the function cv2.cvtColor() as OpenCV reads images in BGR format (instead of RGB).

After performing the pose detection, we will get a list of thirty-three landmarks representing the body joint locations of the prominent person in the image. Each landmark has:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y: It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z: It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with midpoint of hips being the origin, so the smaller the value of z, the closer the landmark is to the camera.
  • visibility: It is a value with range [0.0, 1.0] representing the possibility of the landmark being visible (not occluded) in the image. This is a useful variable when deciding if you want to show a particular joint because it might be occluded or partially visible in the image.

After performing the pose detection on the sample image above, we will display the first two landmarks from the list, so that you get a better idea of the output of the model.
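A sketch of how those two landmarks could be printed (the numbers shown below come from the original run, so your exact values will differ):

```python
# Perform pose detection after converting the sample image from BGR to RGB.
results = pose.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

# Print the first two (normalized) landmarks by name.
if results.pose_landmarks:
    for i in range(2):
        print(f'{mp_pose.PoseLandmark(i).name}:\n{results.pose_landmarks.landmark[i]}')
```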

NOSE:
x: 0.4321258
y: 0.28087094
z: -0.67494285
visibility: 0.99999905

LEFT_EYE_INNER:
x: 0.44070682
y: 0.2621727
z: -0.6380733
visibility: 0.99999845

Now we will convert the two normalized landmarks displayed above into their original scale by using the width and height of the image.
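One way to do that (z is scaled by the image width here, matching the convention of the numbers shown below):

```python
# Convert the first two normalized landmarks to the image's pixel scale.
image_height, image_width, _ = sample_img.shape
if results.pose_landmarks:
    for i in range(2):
        landmark = results.pose_landmarks.landmark[i]
        print(f'{mp_pose.PoseLandmark(i).name}:')
        print(f'x: {landmark.x * image_width}')
        print(f'y: {landmark.y * image_height}')
        print(f'z: {landmark.z * image_width}')
        print(f'visibility: {landmark.visibility}\n')
```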

NOSE:
x: 310.69845509529114
y: 303.340619802475
z: -485.28390991687775
visibility: 0.9999990463256836

LEFT_EYE_INNER:
x: 316.86820307374
y: 283.1465148925781
z: -458.774720788002
visibility: 0.9999984502792358

Now we will draw the detected landmarks on the sample image using the function mp.solutions.drawing_utils.draw_landmarks() and display the resultant image using the matplotlib library.

Now we will go a step further and visualize the landmarks in three-dimensions (3D) using the function mp.solutions.drawing_utils.plot_landmarks(). We will need the POSE_WORLD_LANDMARKS that is another list of pose landmarks in world coordinates that has the 3D coordinates in meters with the origin at the center between the hips of the person.
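A sketch of both visualizations:

```python
# Draw the detected landmarks on a copy of the sample image.
img_copy = sample_img.copy()
if results.pose_landmarks:
    mp_drawing.draw_landmarks(image=img_copy,
                              landmark_list=results.pose_landmarks,
                              connections=mp_pose.POSE_CONNECTIONS)
    plt.figure(figsize=[10, 10])
    plt.title('Output')
    plt.axis('off')
    plt.imshow(img_copy[:, :, ::-1])
    plt.show()

# Plot the world landmarks (3D, in meters, with the hips' midpoint as origin).
mp_drawing.plot_landmarks(results.pose_world_landmarks, mp_pose.POSE_CONNECTIONS)
```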


Note: This is actually a neat hack by mediapipe. The coordinates returned are not true 3D measurements, but setting the hip midpoint as the origin allows us to measure the relative distance of the other points from the hips, and since this distance increases or decreases depending on whether you’re closer to or farther from the camera, it gives us a sense of the depth of each landmark point.

Create a Pose Detection Function

Now we will put all this together to create a function that will perform pose detection on an image and visualize the results or return the results depending upon the passed arguments.

Now we will utilize the function created above to perform pose detection on a few sample images and display the results.

Pose Detection On Real-Time Webcam Feed/Video

The results on the images were pretty good, now we will try the function on a real-time webcam feed and a video. Depending upon whether you want to run pose detection on a video stored in the disk or on the webcam feed, you can comment and uncomment the initialization code of the VideoCapture object accordingly.
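A sketch of that loop, inlining the detection and drawing calls (paths are placeholders; uncomment the VideoCapture line you need):

```python
# Set up the Pose function for videos.
pose_video = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.5,
                          model_complexity=1)

video = cv2.VideoCapture(0)                      # webcam feed
# video = cv2.VideoCapture('media/running.mp4')  # or a video stored on disk

while video.isOpened():
    ok, frame = video.read()
    if not ok:
        break

    frame = cv2.flip(frame, 1)   # mirror the frame for a selfie-style view
    results = pose_video.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                  mp_pose.POSE_CONNECTIONS)

    cv2.imshow('Pose Detection', frame)
    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

video.release()
cv2.destroyAllWindows()
```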

Output:


Cool! so it works great on the videos too. The model is pretty fast and accurate.

Part 3: Pose Classification with Angle Heuristics

We have learned to perform pose detection, now we will level up our game by also classifying different yoga poses using the calculated angles of various joints. We will first detect the pose landmarks and then use them to compute angles between joints and depending upon those angles we will recognize the yoga pose of the prominent person in an image.


But this approach does have a drawback that limits its use to a controlled environment: the calculated angles vary with the angle between the person and the camera. So the person needs to be facing the camera straight on to get the best results.

Create a Function to Calculate Angle between Landmarks

Now we will create a function that will be capable of calculating angles between three landmarks. The angle between landmarks? Do not get confused, as this is the same as calculating the angle between two lines.

The first point (landmark) is considered as the starting point of the first line, the second point (landmark) is considered as the ending point of the first line and the starting point of the second line as well, and the third point (landmark) is considered as the ending point of the second line.
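A sketch of how such a helper could be written with math.atan2 (the landmark arguments are assumed to be (x, y, z) tuples; only x and y are used):

```python
def calculateAngle(landmark1, landmark2, landmark3):
    '''Calculate the angle (in degrees) at landmark2, formed by the lines
    landmark1-landmark2 and landmark2-landmark3 (a sketch).'''
    x1, y1, _ = landmark1
    x2, y2, _ = landmark2
    x3, y3, _ = landmark3

    # Angle between the two lines, measured at the middle landmark.
    angle = degrees(atan2(y3 - y2, x3 - x2) - atan2(y1 - y2, x1 - x2))

    # Keep the angle in the range 0-360.
    if angle < 0:
        angle += 360

    return angle
```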


Now we will test the function created above to calculate the angle between three landmarks with dummy values.
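For instance, with three dummy (x, y, z) points roughly along a shallow “V” (illustrative values that produce an angle close to the one printed below):

```python
# Calculate the angle at the middle landmark for three dummy points.
angle = calculateAngle((558, 326, 0), (642, 333, 0), (718, 321, 0))
print(f'The calculated angle is {angle}')
```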

The calculated angle is 166.26373169437744

Create a Function to Perform Pose Classification

Now we will create a function that will be capable of classifying different yoga poses using the calculated angles of various joints (a sketch of such a function follows the list below). The function will be capable of identifying the following yoga poses:

  • Warrior II Pose
  • T Pose
  • Tree Pose
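As a rough illustration of how such a function could be structured, here is a sketch of a classifyPose() helper. It assumes landmarks is a list of (x, y, z) coordinates in mediapipe’s landmark order (for example, the pixel-scale list a detectPose()-style helper could return), and the tolerance bands around the angle values described in the sections below are assumptions you would tune.

```python
def classifyPose(landmarks, output_image, display=False):
    '''Classify T Pose, Warrior II Pose and Tree Pose from joint angles (a sketch).'''
    label, color = 'Unknown Pose', (0, 0, 255)
    lm = mp_pose.PoseLandmark

    # Elbow angles (shoulder -> elbow -> wrist).
    left_elbow = calculateAngle(landmarks[lm.LEFT_SHOULDER.value],
                                landmarks[lm.LEFT_ELBOW.value],
                                landmarks[lm.LEFT_WRIST.value])
    right_elbow = calculateAngle(landmarks[lm.RIGHT_SHOULDER.value],
                                 landmarks[lm.RIGHT_ELBOW.value],
                                 landmarks[lm.RIGHT_WRIST.value])

    # Shoulder angles (elbow -> shoulder -> hip).
    left_shoulder = calculateAngle(landmarks[lm.LEFT_ELBOW.value],
                                   landmarks[lm.LEFT_SHOULDER.value],
                                   landmarks[lm.LEFT_HIP.value])
    right_shoulder = calculateAngle(landmarks[lm.RIGHT_HIP.value],
                                    landmarks[lm.RIGHT_SHOULDER.value],
                                    landmarks[lm.RIGHT_ELBOW.value])

    # Knee angles (hip -> knee -> ankle).
    left_knee = calculateAngle(landmarks[lm.LEFT_HIP.value],
                               landmarks[lm.LEFT_KNEE.value],
                               landmarks[lm.LEFT_ANKLE.value])
    right_knee = calculateAngle(landmarks[lm.RIGHT_HIP.value],
                                landmarks[lm.RIGHT_KNEE.value],
                                landmarks[lm.RIGHT_ANKLE.value])

    arms_straight = 165 < left_elbow < 195 and 165 < right_elbow < 195     # ~180 degrees
    arms_raised = 80 < left_shoulder < 110 and 80 < right_shoulder < 110   # ~90 degrees

    if arms_straight and arms_raised:
        # T Pose: both knees straight (~180 degrees).
        if 160 < left_knee < 195 and 160 < right_knee < 195:
            label = 'T Pose'
        # Warrior II Pose: one knee straight (~180 degrees), the other bent (~90 degrees).
        elif (160 < left_knee < 195 and 90 < right_knee < 120) or \
             (160 < right_knee < 195 and 90 < left_knee < 120):
            label = 'Warrior II Pose'

    # Tree Pose: one knee straight, the other folded (~35 or ~335 degrees).
    if (160 < left_knee < 195 and (25 < right_knee < 45 or 315 < right_knee < 345)) or \
       (160 < right_knee < 195 and (25 < left_knee < 45 or 315 < left_knee < 345)):
        label = 'Tree Pose'

    if label != 'Unknown Pose':
        color = (0, 255, 0)
    cv2.putText(output_image, label, (10, 30), cv2.FONT_HERSHEY_PLAIN, 2, color, 2)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, label
```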

Now we will utilize the function created above to perform pose classification on a few images of people and display the results.

Warrior II Pose

The Warrior II Pose (also known as Virabhadrasana II) is the same pose that the person is making in the image above. It can be classified using the following combination of body part angles:

  • Around 180° at both elbows
  • Around 90° angle at both shoulders
  • Around 180° angle at one knee
  • Around 90° angle at the other knee

Tree Pose

Tree Pose (also known as Vrikshasana) is another yoga pose for which the person has to keep one leg straight and bend the other leg at a required angle. The pose can be classified easily using the following combination of body part angles:

  • Around 180° angle at one knee
  • Around 35° (if right knee) or 335° (if left knee) angle at the other knee

Now to understand it better, you should go back to the pose classification function above to overview the classification code of this yoga pose.

We will perform pose classification on a few images of people in the tree yoga pose and display the results using the same function we had created above.

T Pose

T Pose (also known as a bind pose or reference pose) is the last pose we are dealing with in this lesson. To make this pose, one has to stand up like a tree with both hands wide open as branches. The following body part angles are required to make this one:

  • Around 180° at both elbows
  • Around 90° angle at both shoulders
  • Around 180° angle at both knees

You can now go back to go through the classification code of this T pose in the pose classification function created above.

Now, let’s test the pose classification function on a few images of the T pose.

So the function is working pretty well on all the known poses on images; let’s try it on an unknown pose called the cobra pose (also known as Bhujangasana).

Now if you want you can extend the pose classification function to make it capable of identifying more yoga poses like the one in the image above. The following combination of body part angles can help classify this one:

  • Around 180° angle at both knees
  • Around 105° (if the person is facing right side) or 240° (if the person is facing left side) angle at both hips

Pose Classification On Real-Time Webcam Feed

Now we will test the function created above to perform the pose classification on a real-time webcam feed.

Output:


Summary:

Today, we learned about a very popular vision problem called pose detection. We briefly discussed popular computer vision problems then we saw how mediapipe has implemented its pose detection solution and how it used a 2 step detection + tracking pipeline to speed up the process.

After that, we saw step by step how to do real-time 3d pose detection with mediapipe on images and on webcam.

Then we learned to calculate angles between different landmarks and then used some heuristics to build a classification system that could determine 3 poses, T-Pose, Tree Pose, and a Warrior II Pose.

Alright, here are some limitations of our pose classification system: it has too many conditions and checks. For our case it’s not that complicated, but if you throw in a few more poses this system can easily get confusing and complicated. A much better method is to train an MLP (a simple multi-layer perceptron) using Keras on landmark points from a few target pose pictures and then classify them. I’m not sure, but I might create a separate tutorial for that in the future.

Another issue that I briefly went over was that the pose detection model in mediapipe is only able to detect a single person at a time, now this is fine for most pose-based applications but can prove to be an issue where you’re required to detect more than one person. If you do want to detect more people then you could try other popular models like PoseNet or OpenPose.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Building 4 Applications Using Real-Time Selfie Segmentation in Python


Watch Video Here

 

Note: Video Tutorial for this post will soon be released on Bleed AI’s YouTube Channel

In this tutorial, you’ll learn how to do Real-Time Selfie Segmentation using Mediapipe in Python and then build the following 4 applications.

  1. Background Removal/Replacement
  2. Background Blur
  3. Background Desaturation
  4. Convert Image to Transparent PNG

And not only will these applications work on images but I’ll show you how to apply these to your real-time webcam feed running on a CPU.

Selfie Segmentation
Figure 1

Also, the model that we’ll use is almost the same one that Google Hangouts is currently using to segment people. So yes, we’re going to be learning a state-of-the-art approach for segmentation.

And on top of that, the code for building all 4 applications will be ridiculously simple.

Interested yet? Then keep reading this full post.

In the first part of this post, we’ll understand the problem of image segmentation and its types, then we’ll understand what selfie segmentation is. After that, we’ll take a look at Mediapipe and how to do selfie segmentation with it. And finally how to build all those 4 applications.

What is Image Segmentation?

If you’re somewhat familiar with computer vision basics then you might be familiar with image segmentation, a very popular problem in Computer Vision.

Just like in an object detection task, where you localize objects in the image and draw boxes around them, in a segmentation task you’re almost doing the same thing, but instead of drawing a bounding box around each object, you’re trying to segment or draw out the exact boundary of each target object.

Figure 2: A segmentation model is trying to segment out the busses in the image above

In other words, in segmentation, you’re trying to divide the image into groups of pixels based on some specific criteria.

So an image segmentation algorithm will take an input image and output groups of pixels, each group will belong to some class. Normally this output is actually an image mask where each pixel consists of a single number indicating the class it belongs to.

Now the task of image segmentation can be divided into several categories, let’s understand each of them.

  1. Semantic Segmentation.
  2. Instance Segmentation
  3. Panoptic Segmentation
  4. Saliency Detection.

What is Semantic Segmentation?

In this type of segmentation, our task is to assign a class label (pedestrian, car, road, tree etc.)  to every pixel in the image.

Figure 3

As you can see all the objects in the image, including the buildings, sky, sidewalk are labeled by a certain color indicating that they belong to a certain class e.g all cars are labeled blue, people are labeled red, and so on.

It’s worth noting that although we can extract any individual class (e.g. we can extract all cars by looking for blue pixels), we cannot distinguish between different instances of the same class, e.g. you can’t reliably say which blue pixel belongs to which car.

What is Instance Segmentation?

Another common category of segmentation is called Instance Segmentation. Here the goal is not to label all pixels in the image but to label only selected classes that the model was trained on (e.g. cars, pedestrians, etc.).

Figure 4

As you can see in the image, the algorithm ignored the roads, sky, buildings etc. so here we’re only interested in labeling specific classes.

One other major difference in this approach is that we’re also differentiating between different instances of the same class, i.e. you can tell which pixel belongs to which instance of a class (for example, which car), and so on.

What is Panoptic Segmentation?

If you’re a curious cat like me, you might wonder, well isn’t there an approach that,

A) Labels all pixels in the image like semantic segmentation.

B) And also differentiates between instances of the same class like instance segmentation.

Well, Yes there is! And it’s called Panoptic Segmentation. Where not only every pixel is assigned a class but we can also differentiate between different instances of the same class, i.e. we can tell which pixel belongs to which car.

Figure 5

This type of segmentation is the combination of both instance and semantic segmentation.

What is Saliency Detection?

Don’t be confused by the word “Detection” here; although Saliency Detection is not generally considered one of the core segmentation methods, it’s still essentially a major segmentation technique.

So here the goal is to segment out the most salient/prominent (things that stand out ) features in the image.

Figure 6

And this is done regardless of the class of the object. Here’s another example.

Figure 7

As you can see the most obvious object in the above image is the cat, which is exactly what’s being segmented out here.

So in saliency detection, we’re trying to segment out the most standout features in the image.


Selfie Segmentation:

Alright now that we have understood the fundamental segmentation techniques out there, let’s try to understand what selfie segmentation is.

Figure 8

Well, obviously it’s a no brainer, it’s a segmentation technique that segments out people in images.

Figure 9

You might think, how is this different from semantic or instance Segmentation?

Well, to put it simply, you can consider selfie segmentation as a sort of a mix between semantic segmentation and Saliency detection.

What do I mean by that?

Take a look at the example output of Selfie segmentation on two images below.

Figure 8

In the first image (top) the segmentation is done perfectly, as every person is on a similar scale and prominent in the image, whereas in the second image (bottom) the woman is prominent and is segmented out correctly while her colleagues in the background are not segmented properly.

This is why the technique is called selfie segmentation, it tries to segment out prominent people in the image, ideally everyone to be segmented should be on a similar scale in the image.

This is why I said that this technique is sort of a mix between saliency detection and semantic segmentation.

Now, you might think why do we even need to use another segmentation technique, why not just segment people using semantic or instance segmentation methods.

Well, Actually we could do that. Models like Mask-RCNN, DeepLabv3, and others are really good at segmenting people.

But here’s the problem.

Although these models provide state-of-the-art results, they are actually really slow; they aren’t a good fit when it comes to real-time applications, especially on CPUs.

This is why the selfie segmentation model that we’ll use today is specifically designed to segment people and also run at real-time speed on CPUs and other low-end hardware. It’s built on a slight modification of the MobileNetV3 model, which itself contains clever algorithmic innovations for maximum speed and performance gains. To understand more about the algorithmic advances in this model, you can read Google AI’s blog post on it.

So what are the use cases for Selfie Segmentation?

The most popular use case for this problem is Video Conferencing. In fact, Google Hangouts is using approximately the same model that we’re going to learn to use today.

Figure 9

You can read the Google AI Blog release about this here.

Besides Video Conferencing, there are several other use cases for this model that we’re going to explore today.

MediaPipe:

Mediapipe is a cross-platform tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media.

This is the tool that we’ll be using today in order to use the selfie segmentation model.  In future tutorials I’ll also be covering the usage of a few other models and make interesting applications out of them. So Stay tuned for those blog posts at Bleed AI.

Alright Now let’s start with the Code!


Selfie Segmentation Code:

To get started with Mediapipe, you first need to run the following command to install it

pip install mediapipe

Import the Libraries

Now let’s start by importing the required libraries.
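For reference, a minimal, assumption-based set of imports for the applications below:

```python
import cv2
import numpy as np
import mediapipe as mp
import matplotlib.pyplot as plt
```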


Initialize the Selfie Segmentation Model

The first thing that you need to do is initialize the selfie segmentation class using mp.solutions.selfie_segmentation and then call the setup function .SelfieSegmentation(). There are two models for segmentation in mediapipe: by passing in 0 you will be using the general model, i.e. the input is resized to 256x256x3 (height, width, channels), and by passing 1 you will be using the landscape model, i.e. the input is resized to 144x256x3 (height, width, channels).

You should select the type of model by taking into account the aspect ratio of the original image, although the landscape model is a bit faster. These models automatically resize the input image before passing it through the network, and the size of the output image representing the segmentation mask for both models will be the same as the resized input, that is, 256x256x1 or 144x256x1.
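A minimal sketch of that initialization (passing 0 selects the general model described above):

```python
# Initialize mediapipe's selfie segmentation class.
mp_selfie_segmentation = mp.solutions.selfie_segmentation

# Set up the segmentation model (0 = general model, 1 = landscape model).
segment = mp_selfie_segmentation.SelfieSegmentation(model_selection=0)
```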



Read an Image

Now let’s read a sample image using the function cv2.imread() and display the image using the matplotlib library.


Application 1: Remove/Replace Background

We will start by learning to use selfie segmentation to change the background of images. But first, we will have to convert the image into RGB format, as the MediaPipe library expects images in this format, while the function cv2.imread() reads images in BGR format; we will use the function cv2.cvtColor() to do this conversion.

Then we will pass the image to the MediaPipe Segmentation function which will perform the segmentation process and will return a probability map with pixel values near 1 for the indexes where the person is located in the image and pixel values near 0 for the background.

Notice that we have some gray areas in the map; this signifies that there are areas where the model was not sure whether it was the background or the person. So now what we need to do is some thresholding: set all pixels above a certain confidence to white and all other pixels to black.

So in this step, we’re going to threshold the map above to get a binary black-and-white mask with a pixel value of 1 for the indexes where the person is located and 0 for the background.

Now we will use the numpy.where() function to create a new image that takes the pixel values from the original sample image at the indexes where the mask image has the value 1 (white areas) and replaces the areas where the mask has the value 0 (black areas) with 255, to give a white background to the object of the sample image. Right now we’re just adding a white (255) background, but later on we’ll add a separate image as the background.

But to create the required output image we will first have to convert the mask image (one channel) into a three-channel image using the function numpy.dstack(), as the function numpy.where() needs all of its inputs to have an equal number of channels.

Now, instead of having a white background, if you want to add another background image, you just need to replace 255 with a background image in the np.where() function.
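Putting those steps together, a minimal sketch could look like this (the image paths and the 0.9 threshold are assumptions):

```python
sample_img = cv2.imread('media/sample.jpg')
background_img = cv2.imread('media/background.jpg')

# The model expects RGB input, so convert from OpenCV's BGR format.
result = segment.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

# Threshold the probability map into a binary mask and stack it into three
# channels, since numpy.where() needs all inputs to have the same channels.
binary_mask = result.segmentation_mask > 0.9
binary_mask_3 = np.dstack((binary_mask, binary_mask, binary_mask))

# White background:
output_white = np.where(binary_mask_3, sample_img, 255)

# Or replace the background with another image (resized to match the sample).
background_img = cv2.resize(background_img, (sample_img.shape[1], sample_img.shape[0]))
output_replaced = np.where(binary_mask_3, sample_img, background_img)

plt.figure(figsize=[10, 10])
plt.imshow(output_replaced[:, :, ::-1])
plt.axis('off')
plt.show()
```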


Create a Background Modification Function

Now we will create a function that will use selfie segmentation to modify the background of an image depending upon the passed arguments (a sketch of such a function follows the list below). The following are the modifications the function will be capable of:

  • Change Background: The function will replace the background of the image with a different provided background image OR it will make the background white for the cases when a separate background image is not provided.
  • Blur Background: The function will segment out the prominent person and then blur out the background.
  • Desaturate Background: The function will desaturate (convert to grayscale) the background of the image, giving the image a very interesting effect.
  • Transparent Background: The function will make the background of the image transparent.
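Here is one way such a function could be sketched. The argument names, the default segmentation threshold of 0.9 and the Gaussian blur kernel size are assumptions, and a background image passed in is expected to already match the input image’s size.

```python
def modifyBackground(image, background_image=255, blur=95, threshold=0.9,
                     display=True, method='changeBackground'):
    '''Modify the background of an image using selfie segmentation (a sketch).'''
    # Get the segmentation probability map (the model expects RGB input).
    result = segment.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # Threshold the map into a binary, three-channel mask of the person.
    binary_mask = result.segmentation_mask > threshold
    binary_mask_3 = np.dstack((binary_mask, binary_mask, binary_mask))

    if method == 'changeBackground':
        # Replace the background with the provided image (or white by default).
        output_image = np.where(binary_mask_3, image, background_image)
    elif method == 'blurBackground':
        # Keep the person sharp and blur everything else.
        blurred = cv2.GaussianBlur(image, (blur, blur), 0)
        output_image = np.where(binary_mask_3, image, blurred)
    elif method == 'desatureBackground':
        # Convert the background to grayscale (stacked back to three channels).
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        output_image = np.where(binary_mask_3, image, np.dstack((gray, gray, gray)))
    elif method == 'transparentBackground':
        # Use the segmentation mask as an alpha channel.
        alpha = (result.segmentation_mask * 255).astype(np.uint8)
        output_image = np.dstack((image, alpha))
    else:
        return

    if display:
        plt.figure(figsize=[10, 10])
        if method == 'transparentBackground':
            plt.imshow(output_image[:, :, [2, 1, 0, 3]])   # BGRA -> RGBA
        else:
            plt.imshow(output_image[:, :, ::-1])           # BGR -> RGB
        plt.axis('off')
        plt.show()
    else:
        return output_image, binary_mask_3
```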

Now we will utilize the function created above with the argument method='changeBackground' to change the backgrounds of a few sample images and check the results.


Change Background On Real-Time Web-cam Feed

The results on the images look great, but how will the function we created above fare when applied to our real-time webcam feed? Well, let’s check it out. In the code below, we will swap out different background images by pressing the key b on the keyboard.
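A sketch of what that loop could look like, assuming a few background images saved on disk (the paths are placeholders):

```python
camera_video = cv2.VideoCapture(0)

# Load a few background images and start with the first one.
backgrounds = [cv2.imread(f'media/background{i}.jpg') for i in range(1, 4)]
bg_index = 0

while camera_video.isOpened():
    ok, frame = camera_video.read()
    if not ok:
        continue

    frame = cv2.flip(frame, 1)
    background = cv2.resize(backgrounds[bg_index], (frame.shape[1], frame.shape[0]))

    output_frame, _ = modifyBackground(frame, background_image=background,
                                       display=False, method='changeBackground')
    cv2.imshow('Video', output_frame)

    k = cv2.waitKey(1) & 0xFF
    if k == ord('b'):                      # press 'b' to swap the background
        bg_index = (bg_index + 1) % len(backgrounds)
    elif k == 27:                          # press Esc to quit
        break

camera_video.release()
cv2.destroyAllWindows()
```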

Output:


Woah, that was cool! Not only are the results great, but the model is also pretty fast.


Video on Video Background Replacement:

Let’s take this one step further and instead of changing the background by an image, let’s replace it with a video loop.

Output:

That was pretty interesting, now that you’ve learned how to segment the background successfully it’s time to make use of this skill and create some other exciting applications out of it.


Application 2: Apply Background Blur

Now this application will actually save you a lot of money.

How?

Well, remember those expensive DSLR or mirrorless cameras that blur out the background? Today you’ll learn to achieve the same effect, in fact even better, by just using your webcam.

So now we will use the function created above to segment out the prominent person and then blur out the background.

All we need to do is blur the original image using cv2.GaussianBlur() and then, instead of replacing the background with a new image (like we did in the previous application), replace it with this blurred version of the image. This way the segmented person will retain its original form while the rest of the image will be blurred out.

Now let’s call the function with the argument method='blurBackground' over some samples. You can control the amount of blur by adjusting the blur variable.


Background Blur On Video

Now we will utilize the function created above in a real-time webcam feed where we will be able to blur the background.

Output:


Application 3: Desaturate Background

Now we will use the function created above to desaturate (convert to grayscale) the background of the image. Again the only new thing that we’re doing here is just replacing the black parts of the segmented mask with the grayscale version of the original image.

We will have to pass the argument method='desatureBackground' this time, to desaturate the backgrounds of a few sample images.


Background Desaturation On Video

Now we will utilize the function created above in a real-time webcam feed where we will be able to desaturate the background of the video.

Output:


Application 4: Convert an Image to have a Transparent Background

Now we will use the function created above to segment out the prominent person and then make the background of the image transparent and after that we will store the resultant image into the disk using the function cv2.imwrite().

To create an image with a transparent background (four-channel image) we will need to add another channel called alpha channel to the original image, this channel is a mask which decides which part of the image needs to be transparent and can have values from 0 (black) to 255 (white) which determine the level of visibility. Black (0) acts as the transparent area and white (255) acts as the visible area.

So we just need to add the segmentation mask to the original image.

We will have to pass the argument method='transparentBackground' to the function to get an image with transparent background.
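For example (the paths are placeholders), the call and the save step might look like this:

```python
image = cv2.imread('media/sample.jpg')
output_image, _ = modifyBackground(image, display=False,
                                   method='transparentBackground')

# Save as PNG so the alpha channel (transparency) is preserved.
cv2.imwrite('media/output.png', output_image)
```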

You can go to the location where the image is saved, open it up with an image viewer and you’ll see that the background is transparent.

Further Resources

Note: These models work best for the scenarios where the person is close (< 2m) to the camera.


Bleed AI Needs Your Support!

Hi Everyone, Taha Anwar (Founder Bleed AI) here. If my blog posts or videos have helped you in any way in your Computer Vision/AI/ML/DL Learning journey then remember you can help us out too.

Bleed AI patreon.

Publishing free, high-quality Computer Vision tutorials for you guys, so that you can build projects, land your dream job, or maybe build a startup, is our core mission at Bleed AI. But every single post takes a lot of effort and man-hours, and in order to keep publishing free high-end tutorials, my team and I need your support on Patreon; plus you will get some extra perks too.



Summary:

Alright, So today we did a lot!

We understood the basic terminology regarding different segmentation techniques; in summary:

  • Image Segmentation: The task of dividing pixels into groups of pixels based on some criteria
  • Semantic Segmentation: In this type we assign a class label to every pixel in the image.
  • Instance Segmentation: Here we assign a class label to only selective classes in the image.
  • Panoptic Segmentation: This approach combines both semantic and instance segmentation.
  • Saliency Detection: Here we’re just interested in segmenting prominent objects in the image regardless of the class.
  • Selfie Segmentation: Here we want to segment prominent people in the image.

We also learned that Mediapipe is an awesome tool to use various ML models in real-time. Then we learned how to perform selfie segmentation with this tool and build 4 different useful applications from it. These applications were:

  • How to remove/replace backgrounds in images & videos.
  • How to desaturate the background to make the person pop out in an image or a video.
  • How to blur out the background.
  • How to give an image a transparent background and save it.

This was my first Mediapipe tutorial and I’m planning to write a tutorial on a few other models too. If you enjoyed this tutorial then do let me know in the comments! You’ll definitely get a reply from me  


Hire Us

Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies


Things You Must Know About OpenCV, Revealing my Best Tips from Years of Experience


Watch the Full Video Here:

Today’s video tutorial is the one I wish I had access to when I was starting out in OpenCV. In this video, I reveal some very interesting information about OpenCV, including great tips on where to find the right resources and tutorials for the library.

I’ll start by briefly going over the history of OpenCV and then talk about other exciting topics.

Some of the things I will go through in this video

👉How to navigate the opencv docs to find what you’re looking for.
👉How to get details regarding any OpenCV function.
👉The differences between the C++ and python version of OpenCV and which one you should work with.
👉Pip installation of OpenCV vs Source installation.
👉Where to ask questions regarding OpenCV when you’re stuck.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI