In last week's tutorial, we learned how to work with real-time pose detection and created a pose classification system. In this week's tutorial, we'll learn to play a popular game called "Subway Surfers".
Of course, there's more to it; this is an AI blog, after all.
We will actually be using our body pose to control the game instead of keyboard controls, and the entire application will run in real-time on your CPU. You don't even need a depth camera or a Kinect; your webcam will suffice.
Excited yet? Let's get into it, but before that, let me tell you a short story that motivated me to build this application. It starts with me giving a lecture on the importance of physical fitness. I know … I know … how this sounds, but just bear with me for a bit.
Hi all, Taha Anwar here. So here's the thing: one of the things I enjoyed most in my early teenage years was having a fast metabolism thanks to my involvement in physical activities. I could eat whatever I wanted, make no conscious effort to exercise, and still stay fit.
But as I grew older and started spending most of my time in front of a computer, I noticed that I was actually gaining weight. I could no longer afford the luxury of binging on unhealthy food and skipping workouts.
Now, I'm a bit of a foodie, so although I could compromise a bit on how I eat, I still needed to cut weight some other way. I quickly realized that unless I wanted to get obese, I needed to make a conscious effort to work out.
That's about when I joined a local gym in my area, and guess what? … it didn't work out (or I didn't work out … enough 🙁), so I quit after a month.
So what was the reason? … Well, I could offer multiple excuses, but to be honest, I was just lazy.
A few months later I joined the gym again, and again I quit, this time after just 2 months.
Now I could have given up completely, but instead, 8 months back, I tried again. This time I even hired a trainer to keep me motivated, and as they say, the third time's a charm, and luckily it was!
8 months in, I'm still at it. I did see results and lost a couple of kgs, and although I haven't reached my personal target yet, I'm still working towards it.
If you're reading this post, then you're probably into computer science just like me, which means you most likely spend a lot of time in front of a PC, and because of that, your physical and mental fitness can take a toll. I seriously can't stress enough how important it is to set aside a couple of hours each week to exercise.
I’m not a fitness guru but I can say working out has many key benefits:
Helps you shed excess weight and keeps you physically fit.
Gives you mental clarity and improves your work quality.
Lots of other health benefits.
Helps you get a partner, if you’re still single like me … lol
For these reasons, even though I have an introverted personality, I consciously set aside a couple of hours each week to go to the gym or to the park for a run.
But here's the thing: sometimes I wonder why I can't combine what I do (working on a PC) with some physical activity, so I could … you know, kill two birds with one stone.
That thought led me to create this post. What I did was build a vision application that allows me to control a very popular game called Subway Surfers via my body movements by utilizing real-time pose detection.
And so in this tutorial, I'll show you how to create this application that controls the Subway Surfers game using body gestures and movements, so that you too can exercise, code, and have fun at the same time.
How will this Work?
The game is about a character running from a policeman while dodging different hurdles by jumping, crouching, and moving left and right. So we only need to worry about four controls, which are normally triggered from the keyboard:
Up arrow key to make the character jump
Down arrow key to make the character crouch
Left arrow key to move the character to left
Right arrow key to move the character to right.
Using the PyAutoGUI library, we will automatically trigger the required keypress events, depending upon the body movements of the person, which we'll capture using Mediapipe's Pose Detection model. (A minimal sketch of this mapping follows the list below.)
I want the game’s character to:
Jump whenever the person controlling the character jumps.
Crouch whenever the person controlling the character crouches.
Move left whenever the person controlling the character moves to the left side of the screen.
Move right whenever the person controlling the character moves to the right on the screen.
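To make that mapping concrete, here is a minimal sketch of the idea. The movement labels and the triggerKeypress() helper are hypothetical placeholders; the final application in Step 6 derives these states from the pose landmarks and presses the keys directly.
import pyautogui

# Hypothetical mapping from a detected body movement label to the key the game expects.
MOVEMENT_TO_KEY = {
    'Jumping': 'up',
    'Crouching': 'down',
    'Left': 'left',
    'Right': 'right',
}

def triggerKeypress(movement):
    '''Press the arrow key associated with the detected movement, if there is one.'''
    key = MOVEMENT_TO_KEY.get(movement)
    if key:
        pyautogui.press(key)

# Example: a frame in which the person has jumped.
triggerKeypress('Jumping')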
You can also use the techniques you’ll learn in this tutorial to control any other game. The simpler the game, the easier it will be to control. I have actually published two tutorials about game control via body gestures.
Alright now that we have discussed the basic mechanisms for creating this application, let me walk you through the exact step-by-step process I used to create this.
Outline
Step 1: Perform Pose Detection
Step 2: Control Starting Mechanism
Step 3: Control Horizontal Movements
Step 4: Control Vertical Movements
Step 5: Control Keyboard and Mouse with PyAutoGUI
Step 6: Build the Final Application
Alright, let’s get started.
Download Code
Import the Libraries
We will start by importing the required libraries.
import cv2
import pyautogui
from time import time
from math import hypot
import mediapipe as mp
import matplotlib.pyplot as plt
Initialize the Pose Detection Model
After that, we will initialize the mp.solutions.pose class, call the mp.solutions.pose.Pose() function with appropriate arguments, and also initialize the mp.solutions.drawing_utils class, which is needed to visualize the landmarks after detection.
# Initialize mediapipe pose class.
mp_pose = mp.solutions.pose
# Setup the Pose function for images.
pose_image = mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.5, model_complexity=1)
# Setup the Pose function for videos.
pose_video = mp_pose.Pose(static_image_mode=False, model_complexity=1, min_detection_confidence=0.7,
min_tracking_confidence=0.7)
# Initialize mediapipe drawing class.
mp_drawing = mp.solutions.drawing_utils
Step 1: Perform Pose Detection
To implement the game control mechanisms, we will need the current pose info of the person controlling the game, as our intention is to control the character with the movement of the person in the frame. We want the game’s character to move left, right, jump and crouch with the identical movements of the person.
So we will create a function detectPose() that will take an image as input and perform pose detection on the person in the image using the mediapipe’s pose detection solution to get thirty-three 3D landmarks on the body and the function will display the results or return them depending upon the passed arguments.
This function is quite similar to the one we had created in the previous post. The only difference is that we are not plotting the pose landmarks in 3D and we are passing a few more optional arguments to the function mp.solutions.drawing_utils.draw_landmarks() to specify the drawing style.
You probably do not want to lose control of the game’s character whenever some other person comes into the frame (and starts controlling the character), so that annoying scenario is already taken care of, as the solution we are using only detects the landmarks of the most prominent person in the image.
So you do not need to worry about losing control as long as you are the most prominent person in the frame as it will automatically ignore the people in the background.
def detectPose(image, pose, draw=False, display=False):
'''
This function performs the pose detection on the most prominent person in an image.
Args:
image: The input image with a prominent person whose pose landmarks needs to be detected.
pose: The pose function required to perform the pose detection.
draw: A boolean value that is if set to true the function draw pose landmarks on the output image.
display: A boolean value that is if set to true the function displays the original input image, and the
resultant image and returns nothing.
Returns:
output_image: The input image with the detected pose landmarks drawn if it was specified.
results: The output of the pose landmarks detection on the input image.
'''
# Create a copy of the input image.
output_image = image.copy()
# Convert the image from BGR into RGB format.
imageRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Perform the Pose Detection.
results = pose.process(imageRGB)
# Check if any landmarks are detected and are specified to be drawn.
if results.pose_landmarks and draw:
# Draw Pose Landmarks on the output image.
mp_drawing.draw_landmarks(image=output_image, landmark_list=results.pose_landmarks,
connections=mp_pose.POSE_CONNECTIONS,
landmark_drawing_spec=mp_drawing.DrawingSpec(color=(255,255,255),
thickness=3, circle_radius=3),
connection_drawing_spec=mp_drawing.DrawingSpec(color=(49,125,237),
thickness=2, circle_radius=2))
# Check if the original input image and the resultant image are specified to be displayed.
if display:
# Display the original input image and the resultant image.
plt.figure(figsize=[22,22])
plt.subplot(121);plt.imshow(image[:,:,::-1]);plt.title("Original Image");plt.axis('off');
plt.subplot(122);plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise
else:
# Return the output image and the results of pose landmarks detection.
return output_image, results
Now we will test the function detectPose() created above to perform pose detection on a sample image and display the results.
# Read a sample image and perform pose landmarks detection on it.
IMG_PATH = 'media/sample.jpg'
image = cv2.imread(IMG_PATH)
detectPose(image, pose_image, draw=True, display=True)
It worked pretty well! If you want, you can test the function on other images too by simply changing the value of the variable IMG_PATH in the cell above; it will work fine as long as there is a prominent person in the image.
Step 2: Control Starting Mechanism
In this step, we will implement the game starting mechanism. What we want is to start the game whenever the most prominent person in the image/frame joins both hands together. So we will create a function checkHandsJoined() that will check whether the hands of the person in an image are joined or not.
The function checkHandsJoined() will take in the results of the pose detection returned by the function detectPose() and will use the LEFT_WRIST and RIGHT_WRIST landmark coordinates from the list of thirty-three landmarks to calculate the Euclidean distance between the hands of the person.
It will then compare this distance with an appropriate threshold to check whether the hands of the person in the image/frame are joined or not, and will display or return the results depending upon the passed arguments.
def checkHandsJoined(image, results, draw=False, display=False):
'''
This function checks whether the hands of the person are joined or not in an image.
Args:
image: The input image with a prominent person whose hands status (joined or not) needs to be classified.
results: The output of the pose landmarks detection on the input image.
draw: A boolean value that is if set to true the function writes the hands status & distance on the output image.
display: A boolean value that is if set to true the function displays the resultant image and returns nothing.
Returns:
output_image: The same input image but with the classified hands status written, if it was specified.
hand_status: The classified status of the hands whether they are joined or not.
'''
# Get the height and width of the input image.
height, width, _ = image.shape
# Create a copy of the input image to write the hands status label on.
output_image = image.copy()
# Get the left wrist landmark x and y coordinates.
left_wrist_landmark = (results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST].x * width,
results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST].y * height)
# Get the right wrist landmark x and y coordinates.
right_wrist_landmark = (results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST].x * width,
results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST].y * height)
# Calculate the euclidean distance between the left and right wrist.
euclidean_distance = int(hypot(left_wrist_landmark[0] - right_wrist_landmark[0],
left_wrist_landmark[1] - right_wrist_landmark[1]))
# Compare the distance between the wrists with an appropriate threshold to check if both hands are joined.
if euclidean_distance < 130:
# Set the hands status to joined.
hand_status = 'Hands Joined'
# Set the color value to green.
color = (0, 255, 0)
# Otherwise.
else:
# Set the hands status to not joined.
hand_status = 'Hands Not Joined'
# Set the color value to red.
color = (0, 0, 255)
# Check if the Hands Joined status and hands distance are specified to be written on the output image.
if draw:
# Write the classified hands status on the image.
cv2.putText(output_image, hand_status, (10, 30), cv2.FONT_HERSHEY_PLAIN, 2, color, 3)
# Write the distance between the wrists on the image.
cv2.putText(output_image, f'Distance: {euclidean_distance}', (10, 70),
cv2.FONT_HERSHEY_PLAIN, 2, color, 3)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[10,10])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise
else:
# Return the output image and the classified hands status indicating whether the hands are joined or not.
return output_image, hand_status
Now we will test the function checkHandsJoined() created above on a real-time webcam feed to check whether it is working as we had expected or not.
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(1)
camera_video.set(3,1280)
camera_video.set(4,960)
# Create named window for resizing purposes.
cv2.namedWindow('Hands Joined?', cv2.WINDOW_NORMAL)
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Get the height and width of the frame of the webcam video.
frame_height, frame_width, _ = frame.shape
# Perform the pose detection on the frame.
frame, results = detectPose(frame, pose_video, draw=True)
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:
# Check if the left and right hands are joined.
frame, _ = checkHandsJoined(frame, results, draw=True)
# Display the frame.
cv2.imshow('Hands Joined?', frame)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
Woah! I am stunned. This pose detection solution is best known for its speed, and that shows in the results: the distance and the hands status update very quickly and are also highly accurate.
Step 3: Control Horizontal Movements
Now comes the implementation of the left and right movement control mechanism of the game's character. What we want is to make the game's character move left and right with the horizontal movements of the person in the image/frame.
So we will create a function checkLeftRight() that will take in the pose detection results returned by the function detectPose() and will use the x-coordinates of the RIGHT_SHOULDER and LEFT_SHOULDER landmarks to determine the horizontal position (Left, Right or Center) in the frame after comparing the landmarks with the x-coordinate of the center of the image.
The function will visualize or return the resultant image and the horizontal position of the person depending upon the passed arguments.
def checkLeftRight(image, results, draw=False, display=False):
'''
This function finds the horizontal position (left, center, right) of the person in an image.
Args:
image: The input image with a prominent person whose the horizontal position needs to be found.
results: The output of the pose landmarks detection on the input image.
draw: A boolean value that is if set to true the function writes the horizontal position on the output image.
display: A boolean value that is if set to true the function displays the resultant image and returns nothing.
Returns:
output_image: The same input image but with the horizontal position written, if it was specified.
horizontal_position: The horizontal position (left, center, right) of the person in the input image.
'''
# Declare a variable to store the horizontal position (left, center, right) of the person.
horizontal_position = None
# Get the height and width of the image.
height, width, _ = image.shape
# Create a copy of the input image to write the horizontal position on.
output_image = image.copy()
# Retrieve the x-coordinate of the left shoulder landmark.
left_x = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER].x * width)
# Retrieve the x-coordinate of the right shoulder landmark.
right_x = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER].x * width)
# Check if the person is at left, that is when both shoulder landmarks' x-coordinates
# are less than or equal to the x-coordinate of the center of the image.
if (right_x <= width//2 and left_x <= width//2):
# Set the person's position to left.
horizontal_position = 'Left'
# Check if the person is at right, that is when both shoulder landmarks' x-coordinates
# are greater than or equal to the x-coordinate of the center of the image.
elif (right_x >= width//2 and left_x >= width//2):
# Set the person's position to right.
horizontal_position = 'Right'
# Check if the person is at center, that is when the right shoulder landmark's x-coordinate is greater than or equal to,
# and the left shoulder landmark's x-coordinate is less than or equal to, the x-coordinate of the center of the image.
elif (right_x >= width//2 and left_x <= width//2):
# Set the person's position to center.
horizontal_position = 'Center'
# Check if the person's horizontal position and a line at the center of the image is specified to be drawn.
if draw:
# Write the horizontal position of the person on the image.
cv2.putText(output_image, horizontal_position, (5, height - 10), cv2.FONT_HERSHEY_PLAIN, 2, (255, 255, 255), 3)
# Draw a line at the center of the image.
cv2.line(output_image, (width//2, 0), (width//2, height), (255, 255, 255), 2)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[10,10])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise
else:
# Return the output image and the person's horizontal position.
return output_image, horizontal_position
Now we will test the function checkLeftRight() created above on a real-time webcam feed and will visualize the results updating in real-time with the horizontal movements.
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(1)
camera_video.set(3,1280)
camera_video.set(4,960)
# Create named window for resizing purposes.
cv2.namedWindow('Horizontal Movements', cv2.WINDOW_NORMAL)
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Get the height and width of the frame of the webcam video.
frame_height, frame_width, _ = frame.shape
# Perform the pose detection on the frame.
frame, results = detectPose(frame, pose_video, draw=True)
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:
# Check the horizontal position of the person in the frame.
frame, _ = checkLeftRight(frame, results, draw=True)
# Display the frame.
cv2.imshow('Horizontal Movements', frame)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
Cool! The speed and accuracy of this model never fail to impress me.
Step 4: Control Vertical Movements
In this step, we will implement the jump and crouch control mechanism of the game's character. What we want is to make the game's character jump and crouch whenever the person in the image/frame jumps or crouches.
So we will create a function checkJumpCrouch() that will check whether the posture of the person in an image is Jumping, Crouching or Standing by utilizing the results of pose detection by the function detectPose().
The function checkJumpCrouch() will retrieve the RIGHT_SHOULDER and LEFT_SHOULDER landmarks from the list to calculate the y-coordinate of the midpoint of both shoulders and will determine the posture of the person by doing a comparison with an appropriate threshold value.
The threshold (MID_Y) will be the approximate y-coordinate of the midpoint of both shoulders of the person while in standing posture. It will be calculated just before starting the game in Step 6: Build the Final Application and will be passed to the function checkJumpCrouch().
But the issue with this approach is that the midpoint of both shoulders of the person in standing posture will not always be exactly the same, as it will vary when the person moves closer to or further from the camera.
To tackle this issue, we will add a margin above and below the threshold to get an upper and a lower bound (in the code below, 15 pixels above and 100 pixels below), as shown in the image below.
def checkJumpCrouch(image, results, MID_Y=250, draw=False, display=False):
'''
This function checks the posture (Jumping, Crouching or Standing) of the person in an image.
Args:
image: The input image with a prominent person whose the posture needs to be checked.
results: The output of the pose landmarks detection on the input image.
MID_Y: The initial center y-coordinate of both shoulder landmarks of the person, recorded while starting
the game. This gives an idea of the person's height when standing straight.
draw: A boolean value that is if set to true the function writes the posture on the output image.
display: A boolean value that is if set to true the function displays the resultant image and returns nothing.
Returns:
output_image: The input image with the person's posture written, if it was specified.
posture: The posture (Jumping, Crouching or Standing) of the person in an image.
'''
# Get the height and width of the image.
height, width, _ = image.shape
# Create a copy of the input image to write the posture label on.
output_image = image.copy()
# Retrieve the y-coordinate of the left shoulder landmark.
left_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER].y * height)
# Retrieve the y-coordinate of the right shoulder landmark.
right_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER].y * height)
# Calculate the y-coordinate of the mid-point of both shoulders.
actual_mid_y = abs(right_y + left_y) // 2
# Calculate the upper and lower bounds of the threshold.
lower_bound = MID_Y-15
upper_bound = MID_Y+100
# Check if the person has jumped that is when the y-coordinate of the mid-point
# of both shoulders is less than the lower bound.
if (actual_mid_y < lower_bound):
# Set the posture to jumping.
posture = 'Jumping'
# Check if the person has crouched that is when the y-coordinate of the mid-point
# of both shoulders is greater than the upper bound.
elif (actual_mid_y > upper_bound):
# Set the posture to crouching.
posture = 'Crouching'
# Otherwise the person is standing and the y-coordinate of the mid-point
# of both shoulders is between the upper and lower bounds.
else:
# Set the posture to Standing straight.
posture = 'Standing'
# Check if the posture and a horizontal line at the threshold is specified to be drawn.
if draw:
# Write the posture of the person on the image.
cv2.putText(output_image, posture, (5, height - 50), cv2.FONT_HERSHEY_PLAIN, 2, (255, 255, 255), 3)
# Draw a line at the initial center y-coordinate of the person (threshold).
cv2.line(output_image, (0, MID_Y),(width, MID_Y),(255, 255, 255), 2)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[10,10])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise
else:
# Return the output image and posture indicating whether the person is standing straight or has jumped, or crouched.
return output_image, posture
Now we will test the function checkJumpCrouch() created above on the real-time webcam feed and will visualize the resultant frames. For testing purposes, we will use a default value for the threshold, which you can tune manually according to your height if you want.
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(1)
camera_video.set(3,1280)
camera_video.set(4,960)
# Create named window for resizing purposes.
cv2.namedWindow('Vertical Movements', cv2.WINDOW_NORMAL)
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Get the height and width of the frame of the webcam video.
frame_height, frame_width, _ = frame.shape
# Perform the pose detection on the frame.
frame, results = detectPose(frame, pose_video, draw=True)
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:
# Check the posture (jumping, crouching or standing) of the person in the frame.
frame, _ = checkJumpCrouch(frame, results, draw=True)
# Display the frame.
cv2.imshow('Vertical Movements', frame)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
Great! When I lower my shoulders a certain distance below the horizontal line (threshold), the result is Crouching; whenever my shoulders are near the horizontal line (i.e., between the upper and lower bounds), the result is Standing; and when my shoulders rise a certain distance above the horizontal line, the result is Jumping.
Step 5: Control Keyboard and Mouse with PyAutoGUI
The Subway Surfers character wouldn’t be able to move left, right, jump or crouch unless we provide it the required keyboard inputs. Now that we have the functions checkHandsJoined(), checkLeftRight() and checkJumpCrouch(), we need to figure out a way to trigger the required keyboard keypress events, depending upon the output of the functions created above.
This is where the PyAutoGUI library shines. It allows you to easily control mouse and keyboard events through scripts. To get an idea of PyAutoGUI's capabilities, you can check this video in which a bot is playing the game Sushi Go Round.
To run the cells in this step, it is not recommended to use the keyboard shortcut (Shift + Enter), as the cells with keypress events will behave differently when those events are combined with the Shift and Enter keys. You can either use the menu bar (Cell >> Run Cell) or the toolbar (▶️ Run) to run the cells.
Now let’s see how simple it is to trigger the up arrow keypress event using pyautogui.
# Press the up key.
pyautogui.press(keys='up')
Similarly, we can trigger the down arrow or any other keypress event by replacing the argument with that key name (the argument should be a string). You can click here to see the list of valid arguments.
# Press the down key.
pyautogui.press(keys='down')
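If you are unsure which key names are valid, PyAutoGUI also exposes them programmatically (a quick check, assuming a standard PyAutoGUI installation):
# Print a few of the key names accepted by pyautogui.press(), keyDown() and keyUp().
print(pyautogui.KEYBOARD_KEYS[:15])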
To press multiple keys, we can pass a list of strings (key names) to the pyautogui.press() function.
# Press the up (4 times) and down (1 time) key.
pyautogui.press(keys=['up', 'up', 'up', 'up', 'down'])
Or to press the same key multiple times, we can pass a value (number of times we want to press the key) to the argument presses in the pyautogui.press() function.
# Press the down key 4 times.
pyautogui.press(keys='down', presses=4)
This function presses the key(s) down and then releases up the key(s) automatically. We can also control this keypress event and key release event individually by using the functions:
pyautogui.keyDown(key): Presses and holds down the specified key.
pyautogui.keyUp(key): Releases up the specified key.
So with the help of these functions, keys can be held down for a longer period. In the cell below, we will hold down the Shift key, press the Enter key twice to run the two cells below this one, and then release the Shift key.
# Hold down the shift key.
pyautogui.keyDown(key='shift')
# Press the enter key two times.
pyautogui.press(keys='enter', presses=2)
# Release the shift key.
pyautogui.keyUp(key='shift')
# This cell will run automatically due to keypress events in the previous cell.
print('Hello!')
# This cell will also run automatically due to those keypress events.
print('Happy Learning!')
Now we will hold down the ctrl key and press the tab key, and then we will release the ctrl key. This will switch the tab of your browser, so make sure to have multiple tabs open before running the cell below.
# Hold down the ctrl key.
pyautogui.keyDown(key='ctrl')
# Press the tab key.
pyautogui.press(keys='tab')
# Release the ctrl key.
pyautogui.keyUp(key='ctrl')
To trigger the mouse key press events, we can use pyautogui.click() function and to specify the mouse button that we want to press, we can pass the values left, middle, or right to the argument button.
# Press the mouse right button. It will open up the menu.
pyautogui.click(button='right')
We can also move the mouse cursor to a specific position on the screen by specifying the x and y-coordinate values to the arguments x and y respectively.
# Move to 1300, 800, then click the right mouse button
pyautogui.click(x=1300, y=800, button='right')
Step 6: Build the Final Application
In the final step, we will have to combine all the components to build the final application.
We will use the outputs of the functions created above checkHandsJoined() (to start the game), checkLeftRight() (control horizontal movements) and checkJumpCrouch() (control vertical movements) to trigger the relevant keyboard and mouse events and control the game’s character with our body movements.
Now we will run the cell below and click here to play the game in our browser using our body gestures and movements.
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(0)
camera_video.set(3,1280)
camera_video.set(4,960)
# Create named window for resizing purposes.
cv2.namedWindow('Subway Surfers with Pose Detection', cv2.WINDOW_NORMAL)
# Initialize a variable to store the time of the previous frame.
time1 = 0
# Initialize a variable to store the state of the game (started or not).
game_started = False
# Initialize a variable to store the index of the current horizontal position of the person.
# At Start the character is at center so the index is 1 and it can move left (value 0) and right (value 2).
x_pos_index = 1
# Initialize a variable to store the index of the current vertical posture of the person.
# At Start the person is standing so the index is 1 and he can crouch (value 0) and jump (value 2).
y_pos_index = 1
# Declare a variable to store the initial y-coordinate of the mid-point of both shoulders of the person.
MID_Y = None
# Initialize a counter to store count of the number of consecutive frames with person's hands joined.
counter = 0
# Initialize the number of consecutive frames for which we want to check if the person's hands are joined before starting the game.
num_of_frames = 10
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Get the height and width of the frame of the webcam video.
frame_height, frame_width, _ = frame.shape
# Perform the pose detection on the frame.
frame, results = detectPose(frame, pose_video, draw=game_started)
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:
# Check if the game has started
if game_started:
# Commands to control the horizontal movements of the character.
#--------------------------------------------------------------------------------------------------------------
# Get horizontal position of the person in the frame.
frame, horizontal_position = checkLeftRight(frame, results, draw=True)
# Check if the person has moved to left from center or to center from right.
if (horizontal_position=='Left' and x_pos_index!=0) or (horizontal_position=='Center' and x_pos_index==2):
# Press the left arrow key.
pyautogui.press('left')
# Update the horizontal position index of the character.
x_pos_index -= 1
# Check if the person has moved to Right from center or to center from left.
elif (horizontal_position=='Right' and x_pos_index!=2) or (horizontal_position=='Center' and x_pos_index==0):
# Press the right arrow key.
pyautogui.press('right')
# Update the horizontal position index of the character.
x_pos_index += 1
#--------------------------------------------------------------------------------------------------------------
# Otherwise if the game has not started
else:
# Write the text representing the way to start the game on the frame.
cv2.putText(frame, 'JOIN BOTH HANDS TO START THE GAME.', (5, frame_height - 10), cv2.FONT_HERSHEY_PLAIN,
2, (0, 255, 0), 3)
# Command to Start or resume the game.
#------------------------------------------------------------------------------------------------------------------
# Check if the left and right hands are joined.
if checkHandsJoined(frame, results)[1] == 'Hands Joined':
# Increment the count of consecutive frames with +ve condition.
counter += 1
# Check if the counter is equal to the required number of consecutive frames.
if counter == num_of_frames:
# Command to Start the game first time.
#----------------------------------------------------------------------------------------------------------
# Check if the game has not started yet.
if not(game_started):
# Update the value of the variable that stores the game state.
game_started = True
# Retrieve the y-coordinate of the left shoulder landmark.
left_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER].y * frame_height)
# Retrieve the y-coordinate of the right shoulder landmark.
right_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER].y * frame_height)
# Calculate the initial y-coordinate of the mid-point of both shoulders of the person.
MID_Y = abs(right_y + left_y) // 2
# Move to 1300, 800, then click the left mouse button to start the game.
pyautogui.click(x=1300, y=800, button='left')
#----------------------------------------------------------------------------------------------------------
# Command to resume the game after death of the character.
#----------------------------------------------------------------------------------------------------------
# Otherwise if the game has started.
else:
# Press the space key.
pyautogui.press('space')
#----------------------------------------------------------------------------------------------------------
# Update the counter value to zero.
counter = 0
# Otherwise if the left and right hands are not joined.
else:
# Update the counter value to zero.
counter = 0
#------------------------------------------------------------------------------------------------------------------
# Commands to control the vertical movements of the character.
#------------------------------------------------------------------------------------------------------------------
# Check if the initial y-coordinate of the mid-point of both shoulders of the person has a value.
if MID_Y:
# Get posture (jumping, crouching or standing) of the person in the frame.
frame, posture = checkJumpCrouch(frame, results, MID_Y, draw=True)
# Check if the person has jumped.
if posture == 'Jumping' and y_pos_index == 1:
# Press the up arrow key
pyautogui.press('up')
# Update the vertical position index of the character.
y_pos_index += 1
# Check if the person has crouched.
elif posture == 'Crouching' and y_pos_index == 1:
# Press the down arrow key
pyautogui.press('down')
# Update the vertical position index of the character.
y_pos_index -= 1
# Check if the person has stood.
elif posture == 'Standing' and y_pos_index != 1:
# Update the vertical position index of the character.
y_pos_index = 1
#------------------------------------------------------------------------------------------------------------------
# Otherwise if the pose landmarks in the frame are not detected.
else:
# Update the counter value to zero.
counter = 0
# Calculate the frames updates in one second
#----------------------------------------------------------------------------------------------------------------------
# Set the time for this frame to the current time.
time2 = time()
# Check if the difference between the previous and this frame time > 0 to avoid division by zero.
if (time2 - time1) > 0:
# Calculate the number of frames per second.
frames_per_second = 1.0 / (time2 - time1)
# Write the calculated number of frames per second on the frame.
cv2.putText(frame, 'FPS: {}'.format(int(frames_per_second)), (10, 30),cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)
# Update the previous frame time to this frame time.
# As this frame will become previous frame in next iteration.
time1 = time2
#----------------------------------------------------------------------------------------------------------------------
# Display the frame.
cv2.imshow('Subway Surfers with Pose Detection', frame)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
While building big applications like this one, I always divide the application into smaller components and then, in the end, integrate all those components to make the final application.
This makes it really easy to learn and understand how everything comes together to build up the full application.
Join My Course Computer Vision For Building Cutting Edge Applications Course
The only course out there that goes beyond basic AI applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, and hand and body gestures. Don't miss your chance to level up and take your career to new heights.
You’ll Learn about:
Creating GUI interfaces for Python AI scripts.
Creating .exe DL applications
Using a Physics library in Python & integrating it with AI
Advanced Image Processing Skills
Advanced Gesture Recognition with Mediapipe
Task Automation with AI & CV
Training an SVM machine learning model.
Creating & cleaning an ML dataset from scratch.
Training DL models & how to use CNNs & LSTMs.
Creating 10 advanced AI/CV applications
& More
Whether you're a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect and build complex, real-world, and thrilling AI applications.
In this tutorial, we learned to perform pose detection on the most prominent person in the frame/image to get thirty-three 3D landmarks, then used those landmarks to extract useful information about the person's body movements (horizontal position, i.e., left, center or right, and posture, i.e., jumping, standing or crouching), and finally used that information to control a simple game.
Another thing we learned is how to trigger mouse and keyboard events programmatically using the PyAutoGUI library.
Now one drawback of controlling the game with body movements is that the game becomes much harder compared to controlling it via keyboard presses.
But our aim to make the exercise fun and learn to control Human-Computer Interaction (HCI) based games using AI is achieved. Now if you want, you can extend this application further to control a much more complex application.
You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.
Ready to seriously dive into State of the Art AI & Computer Vision? Then Sign up for these premium Courses by Bleed AI
In this tutorial, we’ll learn how to do real-time 3D hands landmarks detection using the Mediapipe library in python. After that, we’ll learn to perform hands type classification (i.e. is it a left or right hand) and then draw the bounding boxes around the hands by retrieving the required coordinates from the detected landmarks.
Below are the results on a few sample images, and this will work on real-time camera feed or recorded videos as well.
And last but not least, we will also create a customized landmarks annotation of the hands utilizing the depth (distance from the camera) of the hands, as Mediapipe's solution provides three-dimensional landmarks.
The annotation provided by Mediapipe allows users to annotate the images and videos with a single line of code but it does not utilize the depth and draws fixed-sized lines and circles on the hands.
But in our customized annotation, the thickness of the lines (connections) and circles (landmarks) for each hand will vary in real-time by using the depth of each hand, with the decrease in the distance from the camera (depth), the size of the annotation increases.
The motivation behind this was that the size of the hand in the image/frame increases when the hand is moved closer to the camera, so using a fixed size annotation for a hand that varies in size was hard to digest for me. You can see the comparison of Mediapipe’s annotation and our customized annotation below.
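To preview the core idea before we get to the actual implementation, here is a minimal sketch of one way such depth-aware drawing could work. The scaling heuristic below (using the spread of the z-coordinates as a rough proxy for closeness) and its constants are my own assumptions for illustration, not necessarily the exact approach used later in the post:
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

def drawDepthScaledLandmarks(image, hand_landmarks):
    '''Draw one hand's landmarks with a thickness that grows as the hand gets closer to the camera.'''
    # The z-coordinates are relative to the wrist; their spread roughly grows as the hand fills more of the frame.
    depth_proxy = np.mean([abs(landmark.z) for landmark in hand_landmarks.landmark])
    # Map the proxy to a thickness value; the factor and the bounds here are arbitrary illustrative choices.
    thickness = int(np.clip(depth_proxy * 200, 2, 12))
    # Draw the landmarks and connections with the depth-dependent thickness and circle radius.
    mp_drawing.draw_landmarks(image=image, landmark_list=hand_landmarks,
                              connections=mp_hands.HAND_CONNECTIONS,
                              landmark_drawing_spec=mp_drawing.DrawingSpec(color=(0, 255, 0), thickness=thickness,
                                                                           circle_radius=thickness),
                              connection_drawing_spec=mp_drawing.DrawingSpec(color=(255, 255, 255),
                                                                             thickness=thickness))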
The code for all this is very easy and is explained in the simplest manner possible.
Now before diving further, you can go and watch the youtube tutorial for the detailed explanation, although this blog post alone can also serve the purpose.
In this one, we will learn to localize twenty-one key landmarks on hand(s) e.g. wrist, thumb and fingertips, etc. See the image below:
It is an important and popular pre-processing task in many vision applications, as it allows us to localize and recognize the shape and motion of hands, which opens the door to a ton of feasible applications like:
Augmented Reality Applications that overlay Digital Content and Information over Reality like a Virtual Painter.
Sign Language Recognition.
Hand Gesture Controlled Drones and Robots or any other Physical Appliances.
Using landmark detection is also a great way to interact with any Human-Computer Interaction (HCI) based system as the human hand is capable of making multiple gestures without much effort.
Some other types of keypoint detection problems are facial landmark detection, pose detection, etc.
I have already made a tutorial on pose detection and will explain facial landmark detection in an upcoming tutorial.
Part 1 (b): Mediapipe’s Hands Landmarks Detection Implementation
Here’s a brief introduction to Mediapipe;
“Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media & It was built by Google”
All the models in Mediapipe have state-of-the-art performance and are highly optimized as well and are used in a lot of well-known products out there.
It provides a robust solution capable of predicting twenty-one 3D landmarks on a hand in real-time with high accuracy and speed as well, even on low-end devices i.e. phones, laptops, etc., which makes it stand out from the other solutions out there.
Now you may be thinking: what makes it so fast?
Actually, they have combined a computationally costly object detector with an object tracker that requires a lot less processing. So for videos, a tracker is used instead of invoking the object detector on every frame, which is what makes this solution so fast.
The detector is only invoked as needed, that is in the very first frame or when the tracker loses track of any of the hands. The detector localizes the hand in the image/frame and outputs the bounding box coordinates around the hand.
Then the region of interest (ROI) is cropped from the image/frame using the bounding box coordinates and after that, the cropped image/frame is used by the hand landmark model to predict the landmarks within the ROI.
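To make that control flow easier to picture, here is a toy sketch of the detect-then-track strategy. This is purely illustrative; the helper classes and functions below are stand-ins I made up and are not part of Mediapipe's actual API or internals:
from dataclasses import dataclass
from typing import List

@dataclass
class HandROI:
    box: tuple            # (x, y, w, h) region of interest around one hand.
    confidence: float = 1.0

def run_palm_detector(frame) -> List[HandROI]:
    # Stand-in for the (comparatively expensive) palm detection model.
    return [HandROI(box=(0, 0, 100, 100))]

def run_landmark_model(frame, roi: HandROI) -> HandROI:
    # Stand-in for the landmark model; in reality it also refines the ROI for the next frame.
    return HandROI(box=roi.box, confidence=0.9)

def process_video(frames, min_tracking_confidence=0.5):
    tracked: List[HandROI] = []
    for frame in frames:
        # The detector runs only on the very first frame or after tracking has been lost.
        if not tracked:
            tracked = run_palm_detector(frame)
        # The landmark model runs on every frame, but only inside the tracked regions.
        refined = [run_landmark_model(frame, roi) for roi in tracked]
        # If any hand drops below the tracking confidence, fall back to detection on the next frame.
        tracked = refined if all(r.confidence >= min_tracking_confidence for r in refined) else []
        yield refined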
A palm detector is used because detecting palms and fists (i.e., rigid objects) is a comparatively simpler task than detecting whole hands with articulated fingers. Palms can be modeled with square bounding boxes, which reduces the number of anchors (predefined bounding boxes of a certain height and width), and since palms are smaller objects, there is less chance of them being occluded in two-hand cases like handshakes. Hands, in comparison, lack easily distinguishable patterns and are often occluded.
The image below shows the twenty-one hand landmarks, that this solution detects, along with their indexes.
They manually annotated around 30K real-world images with 3D coordinates by using images depth maps and also rendered a high-quality synthetic hand model over various backgrounds and then mapped the model with different backgrounds to the corresponding 3D coordinates.
So they kind of automated the whole annotation process. The image below contains the aligned hands cropped images and the rendered synthetic hand images with ground truth annotation.
Alright, now that we have covered the required basic theory and implementation details, let's dive into the code.
Download Code:
Import the Libraries
We will start by importing the required libraries.
import cv2
import numpy as np
from time import time
import mediapipe as mp
import matplotlib.pyplot as plt
Part 2: Using Hands Landmarks Detection on images and videos
Initialize the Hands Landmarks Detection Model
To use the Mediapipe’s hands solution, we first have to initialize the hands class using the mp.solutions.hands syntax and then we will have to call the function mp.solutions.hands.Hands() with the arguments explained below:
static_image_mode – It is a boolean value that is if set to False, the solution treats the input images as a video stream. It will try to detect hands in the first input images, and upon a successful detection further localizes the hand landmarks. In subsequent images, once all max_num_hands hands are detected and the corresponding hand landmarks are localized, it simply tracks those landmarks without invoking another detection until it loses track of any of the hands. This reduces latency and is ideal for processing video frames. If set to True, hand detection runs on every input image, ideal for processing a batch of static, possibly unrelated, images. Its default value is False.
max_num_hands – It is the maximum number of hands to detect. Its default value is 2.
min_detection_confidence – It is the minimum detection confidence ([0.0, 1.0]) required to consider the palm-detection model’s prediction correct. Its default value is 0.5 which means that all the detections with prediction confidence less than 50% are ignored by default.
min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) required to consider the landmark-tracking model’s tracked hands landmarks valid. If the confidence is less than this argument value then the detector is invoked again in the next frame/image, so increasing its value increases the robustness, but also increases the latency. Its default value is 0.5.
Then we will also need to initialize the mp.solutions.drawing_utils class that is very useful to visualize the landmarks on the images/frames.
# Initialize the mediapipe hands class.
mp_hands = mp.solutions.hands
# Set up the Hands function.
hands = mp_hands.Hands(static_image_mode=True, max_num_hands=2, min_detection_confidence=0.3)
# Initialize the mediapipe drawing class.
mp_drawing = mp.solutions.drawing_utils
Read an Image
Now we will use the function cv2.imread() to read a sample image and then display it using the matplotlib library.
# Read an image from the specified path.
sample_img = cv2.imread('media/sample.jpg')
# Specify a size of the figure.
plt.figure(figsize = [10, 10])
# Display the sample image, also convert BGR to RGB for display.
plt.title("Sample Image");plt.axis('off');plt.imshow(sample_img[:,:,::-1]);plt.show()
Perform Hands Landmarks Detection
Now we will pass the image to the hand’s landmarks detection machine learning pipeline by using the function mp.solutions.hands.Hands().process(). But first, we will have to convert the image from BGR to RGB format using the function cv2.cvtColor() as OpenCV reads images in BGR format and the ml pipeline expects the input images to be in RGB color format.
The machine learning pipeline outputs a list of twenty-one landmarks of the prominent hands in the image. Each landmark has:
x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
y: It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
z: It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the wrist being the origin, so the smaller the value the closer the landmark is to the camera.
To get more intuition, we will display the first two landmarks of each hand. The ML pipeline outputs an object that has an attribute multi_hand_landmarks, which contains the found landmark coordinates of each hand as an element of a list.
Note: The z-coordinate is just the relative distance of the landmark from the wrist, and this distance increases and decreases depending upon the distance from the camera, which is why it represents the depth of each landmark point.
# Perform hands landmarks detection after converting the image into RGB format.
results = hands.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))
# Check if landmarks are found.
if results.multi_hand_landmarks:
# Iterate over the found hands.
for hand_no, hand_landmarks in enumerate(results.multi_hand_landmarks):
print(f'HAND NUMBER: {hand_no+1}')
print('-----------------------')
# Iterate two times as we only want to display first two landmarks of each hand.
for i in range(2):
# Display the found normalized landmarks.
print(f'{mp_hands.HandLandmark(i).name}:')
print(f'{hand_landmarks.landmark[mp_hands.HandLandmark(i).value]}')
As you can see, the landmarks are normalized to a specific scale, so now we will convert them back to their original scale by using the width and height of the sample image and display them.
# Retrieve the height and width of the sample image.
image_height, image_width, _ = sample_img.shape
# Check if landmarks are found.
if results.multi_hand_landmarks:
# Iterate over the found hands.
for hand_no, hand_landmarks in enumerate(results.multi_hand_landmarks):
print(f'HAND NUMBER: {hand_no+1}')
print('-----------------------')
# Iterate two times as we only want to display first two landmark of each hand.
for i in range(2):
# Display the found landmarks after converting them into their original scale.
print(f'{mp_hands.HandLandmark(i).name}:')
print(f'x: {hand_landmarks.landmark[mp_hands.HandLandmark(i).value].x * image_width}')
print(f'y: {hand_landmarks.landmark[mp_hands.HandLandmark(i).value].y * image_height}')
print(f'z: {hand_landmarks.landmark[mp_hands.HandLandmark(i).value].z * image_width}\n')
Now we will draw the detected landmarks on a copy of the sample image using the function mp.solutions.drawing_utils.draw_landmarks() from the class mp.solutions.drawing_utils, we had initialized earlier and will display the resultant image.
# Create a copy of the sample image to draw landmarks on.
img_copy = sample_img.copy()
# Check if landmarks are found.
if results.multi_hand_landmarks:
# Iterate over the found hands.
for hand_no, hand_landmarks in enumerate(results.multi_hand_landmarks):
# Draw the hand landmarks on the copy of the sample image.
mp_drawing.draw_landmarks(image = img_copy, landmark_list = hand_landmarks,
connections = mp_hands.HAND_CONNECTIONS)
# Specify a size of the figure.
fig = plt.figure(figsize = [10, 10])
# Display the resultant image with the landmarks drawn, also convert BGR to RGB for display.
plt.title("Resultant Image");plt.axis('off');plt.imshow(img_copy[:,:,::-1]);plt.show()
Part 3: Hands Classification (i.e., Left or Right)
Create a Hands Landmarks Detection Function
Now we will put all this together to create a function that will perform hands landmarks detection on an image and will visualize the resultant image along with the original image or return the resultant image along with the output of the model depending upon the passed arguments.
def detectHandsLandmarks(image, hands, display = True):
'''
This function performs hands landmarks detection on an image.
Args:
image: The input image with prominent hand(s) whose landmarks needs to be detected.
hands: The hands function required to perform the hands landmarks detection.
display: A boolean value that is if set to true the function displays the original input image, and the output
image with hands landmarks drawn and returns nothing.
Returns:
output_image: The input image with the detected hands landmarks drawn.
results: The output of the hands landmarks detection on the input image.
'''
# Create a copy of the input image to draw landmarks on.
output_image = image.copy()
# Convert the image from BGR into RGB format.
imgRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Perform the Hands Landmarks Detection.
results = hands.process(imgRGB)
# Check if landmarks are found.
if results.multi_hand_landmarks:
# Iterate over the found hands.
for hand_landmarks in results.multi_hand_landmarks:
# Draw the hand landmarks on the copy of the input image.
mp_drawing.draw_landmarks(image = output_image, landmark_list = hand_landmarks,
connections = mp_hands.HAND_CONNECTIONS)
# Check if the original input image and the output image are specified to be displayed.
if display:
# Display the original input image and the output image.
plt.figure(figsize=[15,15])
plt.subplot(121);plt.imshow(image[:,:,::-1]);plt.title("Original Image");plt.axis('off');
plt.subplot(122);plt.imshow(output_image[:,:,::-1]);plt.title("Output");plt.axis('off');
# Otherwise
else:
# Return the output image and results of hands landmarks detection.
return output_image, results
Now we will utilize the function created above to perform hands landmarks detection on a few sample images and display the results.
# Read another sample image and perform hands landmarks detection on it.
image = cv2.imread('media/sample1.jpg')
detectHandsLandmarks(image, hands, display=True)
# Read another sample image and perform hands landmarks detection on it.
image = cv2.imread('media/sample2.jpg')
detectHandsLandmarks(image, hands, display=True)
# Read another sample image and perform hands landmarks detection on it.
image = cv2.imread('media/sample3.jpg')
detectHandsLandmarks(image, hands, display=True)
Hands Landmarks Detection on Real-Time Webcam Feed
The results on the images were excellent, but now the real test begins, we will try the function on a real-time webcam feed. We will also calculate and display the number of frames being updated in one second to get an idea of whether this solution can work in real-time on a CPU or not. As that is the only thing that differentiates it from the other solutions out there.
# Setup Hands function for video.
hands_video = mp_hands.Hands(static_image_mode=False, max_num_hands=2,
min_detection_confidence=0.7, min_tracking_confidence=0.4)
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(1)
camera_video.set(3,1280)
camera_video.set(4,960)
# Create named window for resizing purposes.
cv2.namedWindow('Hands Landmarks Detection', cv2.WINDOW_NORMAL)
# Initialize a variable to store the time of the previous frame.
time1 = 0
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Perform Hands landmarks detection.
frame, _ = detectHandsLandmarks(frame, hands_video, display=False)
# Set the time for this frame to the current time.
time2 = time()
# Check if the difference between the previous and this frame time > 0 to avoid division by zero.
if (time2 - time1) > 0:
# Calculate the number of frames per second.
frames_per_second = 1.0 / (time2 - time1)
# Write the calculated number of frames per second on the frame.
cv2.putText(frame, 'FPS: {}'.format(int(frames_per_second)), (10, 30),cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)
# Update the previous frame time to this frame time.
# As this frame will become previous frame in next iteration.
time1 = time2
# Display the frame.
cv2.imshow('Hands Landmarks Detection', frame)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output
Woah! That was impressive; not only was it fast, but the results were quite accurate too.
Create a Hand Type Classification Function
Now we will create a function that will perform hand type classification (i.e., whether it is a left or a right hand). The output object of the hands landmarks detector contains another attribute, multi_handedness, that stores a score (the probability of the predicted label being correct) and a label (either "Left" or "Right") for each detected hand.
While determining the label, the model assumes that the input image is mirrored (i.e., flipped horizontally). So the classification was already performed during the hands landmarks detection, and now we only need to access the information stored in the multi_handedness attribute.
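Before wrapping this in a full function, here is a minimal sketch (assuming results holds the output of one of the detectHandsLandmarks() calls above) of how the label and its score can be read directly from the multi_handedness attribute:
# Minimal sketch (assumes `results` comes from an earlier detectHandsLandmarks() call).
# Each entry of multi_handedness holds a predicted label and its probability score.
for hand_info in results.multi_handedness or []:
    print(hand_info.classification[0].label, round(hand_info.classification[0].score, 2))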
def getHandType(image, results, draw=True, display = True):
'''
This function performs hands type (left or right) classification on hands.
Args:
image: The image of the hands that needs to be classified, with the hands landmarks detection already performed.
results: The output of the hands landmarks detection performed on the image in which hands types needs
to be classified.
draw: A boolean value that is if set to true the function writes the hand type label on the output image.
display: A boolean value that is if set to true the function displays the output image and returns nothing.
Returns:
output_image: The image of the hands with the classified hand type label written if it was specified.
hands_status: A dictionary containing classification info of both hands.
'''
# Create a copy of the input image to write hand type label on.
output_image = image.copy()
# Initialize a dictionary to store the classification info of both hands.
hands_status = {'Right': False, 'Left': False, 'Right_index' : None, 'Left_index': None}
# Iterate over the found hands in the image.
for hand_index, hand_info in enumerate(results.multi_handedness):
# Retrieve the label of the found hand.
hand_type = hand_info.classification[0].label
# Update the status of the found hand.
hands_status[hand_type] = True
# Update the index of the found hand.
hands_status[hand_type + '_index'] = hand_index
# Check if the hand type label is specified to be written.
if draw:
# Write the hand type on the output image.
cv2.putText(output_image, hand_type + ' Hand Detected', (10, (hand_index+1) * 30),cv2.FONT_HERSHEY_PLAIN,
2, (0,255,0), 2)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[10,10])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise
else:
# Return the output image and the hands status dictionary that contains classification info.
return output_image, hands_status
Now we will utilize the function created above to perform hand type classification on a few sample images and display the results.
# Read a sample image with one hand and perform hand type classification on it after flipping it horizontally.
image = cv2.imread('media/sample5.jpg')
flipped_image = cv2.flip(image, 1)
_, results = detectHandsLandmarks(flipped_image, hands, display=False)
if results.multi_hand_landmarks:
getHandType(image, results)
# Read a sample image with one hand and perform hand type classification on it after flipping it horizontally.
image = cv2.imread('media/sample6.jpg')
flipped_image = cv2.flip(image, 1)
_, results = detectHandsLandmarks(flipped_image, hands, display=False)
if results.multi_hand_landmarks:
getHandType(image, results)
# Read a sample image with one hand and perform hand type classification on it after flipping it horizontally.
image = cv2.imread('media/sample7.jpg')
flipped_image = cv2.flip(image, 1)
_, results = detectHandsLandmarks(flipped_image, hands, display=False)
if results.multi_hand_landmarks:
getHandType(image, results)
Cool! It worked perfectly on each of the sample images.
Part 4 (a): Draw Bounding Boxes around the Hands
Create a Function to Draw Bounding Boxes
Now we will create a function that will draw bounding boxes around the hands and write their classified types near them. We will first convert the normalized landmarks back to their original scale by using the width and height of the image. We will then get the bounding box coordinates ((x1,y1), (x2, y2)) for each hand.
Top Left Coordinate:
x1 – the smallest x-coordinate in the list of the found landmarks of the hand.
y1 – the smallest y-coordinate in the list of the found landmarks of the hand.
Bottom Right Coordinate:
x2 – the largest x-coordinate in the list of the found landmarks of the hand.
y2 – the largest y-coordinate in the list of the found landmarks of the hand.
Then we will draw the bounding boxes around the hands using the found coordinates and the specified padding, and write the classified type of each hand near its box. After that, we will either display the resultant image or return it, depending upon the passed arguments.
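To make the min/max idea concrete, here is a tiny toy illustration with made-up pixel coordinates (not real hand landmarks):
# Toy illustration (made-up pixel coordinates): the bounding box is just the min/max of the
# landmark coordinates, expanded by the padding amount.
toy_landmarks = np.array([[210, 140], [250, 130], [305, 190], [260, 260]])
padd_amount = 10
x1, y1 = toy_landmarks.min(axis=0) - padd_amount   # top-left corner
x2, y2 = toy_landmarks.max(axis=0) + padd_amount   # bottom-right corner
print((x1, y1), (x2, y2))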
def drawBoundingBoxes(image, results, hand_status, padd_amount = 10, draw=True, display=True):
'''
This function draws bounding boxes around the hands and writes their classified types near them.
Args:
image: The image of the hands on which the bounding boxes around the hands needs to be drawn and the
classified hands types labels needs to be written.
results: The output of the hands landmarks detection performed on the image on which the bounding boxes needs
to be drawn.
hand_status: The dictionary containing the classification info of both hands.
padd_amount: The value that specifies the space inside the bounding box between the hand and the box's borders.
draw: A boolean value that is if set to true the function draws bounding boxes and write their classified
types on the output image.
display: A boolean value that is if set to true the function displays the output image and returns nothing.
Returns:
output_image: The image of the hands with the bounding boxes drawn and hands classified types written if it
was specified.
output_landmarks: The dictionary that stores both (left and right) hands landmarks as different elements.
'''
# Create a copy of the input image to draw bounding boxes on and write hands types labels.
output_image = image.copy()
# Initialize a dictionary to store both (left and right) hands landmarks as different elements.
output_landmarks = {}
# Get the height and width of the input image.
height, width, _ = image.shape
# Iterate over the found hands.
for hand_index, hand_landmarks in enumerate(results.multi_hand_landmarks):
# Initialize a list to store the detected landmarks of the hand.
landmarks = []
# Iterate over the detected landmarks of the hand.
for landmark in hand_landmarks.landmark:
# Append the landmark into the list.
landmarks.append((int(landmark.x * width), int(landmark.y * height),
(landmark.z * width)))
# Get all the x-coordinate values from the found landmarks of the hand.
x_coordinates = np.array(landmarks)[:,0]
# Get all the y-coordinate values from the found landmarks of the hand.
y_coordinates = np.array(landmarks)[:,1]
# Get the bounding box coordinates for the hand with the specified padding.
x1 = int(np.min(x_coordinates) - padd_amount)
y1 = int(np.min(y_coordinates) - padd_amount)
x2 = int(np.max(x_coordinates) + padd_amount)
y2 = int(np.max(y_coordinates) + padd_amount)
# Initialize a variable to store the label of the hand.
label = "Unknown"
# Check if the hand we are iterating upon is the right one.
if hand_status['Right_index'] == hand_index:
# Update the label and store the landmarks of the hand in the dictionary.
label = 'Right Hand'
output_landmarks['Right'] = landmarks
# Check if the hand we are iterating upon is the left one.
elif hand_status['Left_index'] == hand_index:
# Update the label and store the landmarks of the hand in the dictionary.
label = 'Left Hand'
output_landmarks['Left'] = landmarks
# Check if the bounding box and the classified label is specified to be written.
if draw:
# Draw the bounding box around the hand on the output image.
cv2.rectangle(output_image, (x1, y1), (x2, y2), (155, 0, 255), 3, cv2.LINE_8)
# Write the classified label of the hand below the bounding box drawn.
cv2.putText(output_image, label, (x1, y2+25), cv2.FONT_HERSHEY_COMPLEX, 0.7, (20,255,155), 1, cv2.LINE_AA)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[10,10])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise
else:
# Return the output image and the landmarks dictionary.
return output_image, output_landmarks
Now we will utilize the function created above to perform hand type classification and draw bounding boxes around the hands on a real-time webcam feed.
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(1)
# Initialize a resizable window.
cv2.namedWindow('Hands Landmarks Detection', cv2.WINDOW_NORMAL)
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Perform Hands landmarks detection.
frame, results = detectHandsLandmarks(frame, hands_video, display=False)
# Check if landmarks are found in the frame.
if results.multi_hand_landmarks:
# Perform hand(s) type (left or right) classification.
_, hands_status = getHandType(frame.copy(), results, draw=False, display=False)
# Draw bounding boxes around the detected hands and write their classified types near them.
frame, _ = drawBoundingBoxes(frame, results, hands_status, display=False)
# Display the frame.
cv2.imshow('Hands Landmarks Detection', frame)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output
Great! The classification, along with localization, works pretty accurately on a real-time webcam feed too.
Part 4 (b): Draw Customized Landmarks Annotation
Create a Function to Draw Customized Landmarks Annotation
Now we will create a function that will draw customized landmarks annotation of the hands. What we are doing differently is utilizing the depth (z-coordinate) values to increase and decrease the size of the lines and circles, whereas Mediapipe's annotation uses fixed sizes. As we have learned, the z-coordinate represents the landmark depth, so the smaller the value, the closer the landmark is to the camera.
We calculate the average depth of all the landmarks of a hand and increase the thickness of that hand's annotation circles and lines as the average depth decreases. This means the closer the hand is to the camera, the bigger the annotation will be, so the annotation size adjusts with the apparent size of the hand, as the small check below illustrates.
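As a quick sanity check of that rule, here is a toy run (hypothetical average depth values) of the same thickness formula used in the function below:
# Toy check (hypothetical average depth values): the more negative the average depth
# (i.e., the closer the hand), the thicker the line segments and circles become.
for average_depth in [-200, -50, 10]:
    line_thickness = int(np.ceil(0.1*abs(average_depth))) + 2 if average_depth < 0 else 2
    circle_thickness = line_thickness + 1 if average_depth < 0 else 3
    print(average_depth, line_thickness, circle_thickness)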
def customLandmarksAnnotation(image, landmark_dict):
'''
This function draws customized landmarks annotation utilizing the z-coordinate (depth) values of the hands.
Args:
image: The image of the hands on which customized landmarks annotation of the hands needs to be drawn.
landmark_dict: The dictionary that stores the hand(s) landmarks as different elements with keys as hand
types(i.e., left and right).
Returns:
output_image: The image of the hands with the customized annotation drawn.
depth: A dictionary that contains the average depth of all landmarks of the hand(s) in the image.
'''
# Create a copy of the input image to draw annotation on.
output_image = image.copy()
# Initialize a dictionary to store the average depth of all landmarks of hand(s).
depth = {}
# Initialize a list with the arrays of indexes of the landmarks that will make the required
# line segments to draw on the hand.
segments = [np.arange(0,5), np.arange(5,9) , np.arange(9,13), np.arange(13, 17), np.arange(17, 21),
np.arange(5,18,4), np.array([0,5]), np.array([0,17])]
# Iterate over the landmarks dictionary.
for hand_type, hand_landmarks in landmark_dict.items():
# Get all the z-coordinates (depth) of the landmarks of the hand.
depth_values = np.array(hand_landmarks)[:,-1]
# Calculate the average depth of the hand.
average_depth = int(sum(depth_values) / len(depth_values))
# Get all the x-coordinates of the landmarks of the hand.
x_values = np.array(hand_landmarks)[:,0]
# Get all the y-coordinates of the landmarks of the hand.
y_values = np.array(hand_landmarks)[:,1]
# Initialize a list to store the arrays of x and y coordinates of the line segments for the hand.
line_segments = []
# Iterate over the arrays of indexes of the landmarks that will make the required line segments.
for segment_indexes in segments:
# Get an array of a line segment coordinates of the hand.
line_segment = np.array([[int(x_values[index]), int(y_values[index])] for index in segment_indexes])
# Append the line segment coordinates into the list.
line_segments.append(line_segment)
# Check if the average depth of the hand is less than 0.
if average_depth < 0:
# Set the thickness of the line segments of the hand accordingly to the average depth.
line_thickness = int(np.ceil(0.1*abs(average_depth))) + 2
# Set the thickness of the circles of the hand landmarks accordingly to the average depth.
circle_thickness = int(np.ceil(0.1*abs(average_depth))) + 3
# Otherwise.
else:
# Set the thickness of the line segments of the hand to 2 (i.e. the minimum thickness we are specifying).
line_thickness = 2
# Set the thickness of the circles to 3 (i.e. the minimum thickness)
circle_thickness = 3
# Draw the line segments on the hand.
cv2.polylines(output_image, line_segments, False, (100,250,55), line_thickness)
# Write the average depth of the hand on the output image.
cv2.putText(output_image,'Depth: {}'.format(average_depth),(10,30), cv2.FONT_HERSHEY_COMPLEX, 1, (20,25,255), 1,
cv2.LINE_AA)
# Iterate over the x and y coordinates of the hand landmarks.
for x, y in zip(x_values, y_values):
# Draw a circle on the x and y coordinate of the hand.
cv2.circle(output_image,(int(x), int(y)), circle_thickness, (55,55,250), -1)
# Store the calculated average depth in the dictionary.
depth[hand_type] = average_depth
# Return the output image and the average depth dictionary of the hand(s).
return output_image, depth
Mediapipe’s Annotation vs Our Customized Annotation on Real-Time Webcam Feed
Now we will utilize the function created above to draw the customized annotation on a real-time webcam feed and stack it with the results of Mediapipe’s annotation to visualize the difference.
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(1)
# Initialize a resizable window.
cv2.namedWindow('Hands Landmarks Detection', cv2.WINDOW_NORMAL)
# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():
# Read a frame.
ok, frame = camera_video.read()
# Check if frame is not read properly then continue to the next iteration to read the next frame.
if not ok:
continue
# Flip the frame horizontally for natural (selfie-view) visualization.
frame = cv2.flip(frame, 1)
# Perform Hands landmarks detection.
annotated_frame, results = detectHandsLandmarks(frame, hands_video, display=False)
# Check if landmarks are found in the frame.
if results.multi_hand_landmarks:
# Perform hand(s) type (left or right) classification.
_, hands_status = getHandType(frame.copy(), results, draw=False, display=False)
# Get the landmarks dictionary that stores each hand landmarks as different elements.
frame, landmark_dict = drawBoundingBoxes(frame, results, hands_status, draw=False, display=False)
# Draw customized landmarks annotation utilizing the z-coordinate (depth) values of the hand(s).
custom_ann_frame, _ = customLandmarksAnnotation(frame, landmark_dict)
# Stack the frame annotated using mediapipe with the customized one.
final_output = np.hstack((annotated_frame, custom_ann_frame))
# Otherwise.
else:
# Stack the same frame twice.
final_output = np.hstack((frame, frame))
# Display the stacked frame.
cv2.imshow('Hands Landmarks Detection', final_output)
# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed and break the loop.
if(k == 27):
break
# Release the VideoCapture Object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output
As expected, the results were remarkable! The thickness of the annotation circles and the lines of each hand increased with the decrease in the distance, so the hack that Mediapipe uses to calculate the depth works pretty well and is also computationally very reasonable.
Join My Course Computer Vision For Building Cutting Edge Applications Course
The only course out there that goes beyond basic AI applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, and hand and body gestures. Don't miss your chance to level up and take your career to new heights.
You’ll Learn about:
Creating GUI interfaces for python AI scripts.
Creating .exe DL applications
Using a Physics library in Python & integrating it with AI
Advanced Image Processing Skills
Advanced Gesture Recognition with Mediapipe
Task Automation with AI & CV
Training an SVM Machine Learning Model.
Creating & Cleaning an ML dataset from scratch.
Training DL models & how to use CNNs & LSTMs.
Creating 10 Advanced AI/CV Applications
& More
Whether you’re a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect and build complex, real-world, and thrilling AI applications.
In this tutorial, we have learned about a very popular and useful computer vision problem called hand landmarks detection. First, we covered what exactly this is, along with its applications, and then we moved to the implementation details that Mediapipe has used to provide the solution.
Also, we learned how it used a detection/tracker pipeline to provide the speed for which it stands out. After that, we performed 3D hands landmarks detection using Mediapipe’s solution on images and a real-time webcam feed.
Then we learned to classify hands as left or right and draw bounding boxes around them and after that, we learned to draw customized landmarks annotation utilizing the z-coordinate (depth) values of the hands.
Now, a drawback of using this hand landmarks detection system is that you have to specify the maximum number of hands possible in the image/frame beforehand, and the computationally expensive detector is invoked on every frame until the number of detected hands becomes equal to that provided maximum. One simple way to reduce this cost is sketched below.
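For example, a minimal sketch (reusing the mp_hands class initialized earlier) that limits the search to a single hand:
# Minimal sketch: if your application only ever needs one hand, lowering max_num_hands to 1
# keeps the expensive detector from re-running on every frame in search of a second hand.
hands_one = mp_hands.Hands(static_image_mode=False, max_num_hands=1,
                           min_detection_confidence=0.7, min_tracking_confidence=0.4)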
Another limitation is that the z-coordinate is just the relative distance of the landmark from the wrist, and that distance can also vary depending upon the posture of the hand, i.e., whether the hand is closed or wide open. So it does not truly represent the depth, but it is still a great hack for estimating depth from 2D images without using a depth camera; the quick check below illustrates the point.
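A small sketch of that relativity (assuming results from one of the earlier detection cells): printing the wrist landmark's z value shows it stays close to zero, since the wrist is the origin of the depth measurement.
# Quick check (assumes `results` from an earlier detection cell): the wrist landmark (index 0)
# has a z value of roughly zero, because depth is measured relative to the wrist.
for hand_landmarks in results.multi_hand_landmarks or []:
    print('Wrist z:', round(hand_landmarks.landmark[0].z, 4))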
You can reach out to me personally for a 1-on-1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.
In this tutorial, we will learn how to manipulate facial expressions and create a DeepFake video out of a static image using the famous First-Order Motion Model. Yes, you heard that right, we just need a single 2D image of a person to create the DeepFake video.
Excited yet? Not that much? Well, what if I tell you that the whole tutorial actually runs on Google Colab, so you don’t need to worry about installation or GPUs; everything is already configured.
And you know what the best part is?
Utilizing the Colab notebook that you will get in this tutorial, you can generate DeepFakes in a matter of seconds. Yes, seconds; not weeks, not days, not hours, but seconds.
What is a DeepFake?
The term DeepFake is a combination of two words: Deep refers to the technology responsible for generating the content, known as deep learning, and Fake refers to the falsified content. The technology generates synthetic media by either replacing or synthesizing new content (which can be video or even audio).
Below you can see the results on a few sample images:
This feels like putting your own words in a person’s mouth but on a whole new level.
Also, you may have noticed, in the results above, that we are generating the output video utilizing the whole frame/image, not just on the face ROI that people normally do.
First-Order Motion Model
We will be using the aforementioned First-Order Motion Model, so let’s start by understanding what it is and how it works.
The term First-Order Motion refers to a change in luminance over space and time, and the first-order motion model utilizes this change to capture motion in the source video (also known as the driving video).
The framework is composed of two main components: motion estimation (which predicts a dense motion field) and image generation (which predicts the resultant video). You don’t have to worry about the technical details of these modules to use this model. If you are not a computer vision practitioner, you should skip the paragraph below.
The Motion Extractor module uses an unsupervised keypoint detector to get the relevant keypoints from the source image and a driving video frame. The local affine transformation is calculated with respect to the frame from the driving video. A Dense Motion Network then generates an occlusion map and a dense optical flow, which are fed into the Generator Module alongside the source image. The Generator Module generates the output frame, which replicates the relevant motion from the driving video’s frame onto the source image.
This approach can also be used to manipulate faces, human bodies, and even animated characters, given that the model is trained on a set of videos of similar object categories.
Now that we have gone through the prerequisite theory and implementation details of the approach we will be using, let’s dive into the code.
Step 1.1: Clone the required repositories
# Discard the output of this cell.
%%capture
# Clone the First Order Motion Model Github Repository.
!git clone https://github.com/AliaksandrSiarohin/first-order-model
# Change Current Working Directory to "first-order-model".
%cd first-order-model
# Clone the Face Alignment Repository.
!git clone https://github.com/1adrianb/face-alignment
# Change Current Working Directory to "face-alignment".
%cd face-alignment
Step 1.2: Install the required Modules
Install helper modules that are required to perform the necessary pre- and post-processing.
# Discard the output of this cell.
%%capture
# Install the modules required to use the Face Alignment module.
!pip install -r requirements.txt
# Install the Face Alignment module.
!python setup.py install
# Install the mediapipe library.
!pip install mediapipe
# Move one Directory back, i.e., to first-order-model Directory.
%cd ..
Import the required libraries.
import os
import cv2
import mediapipe as mp
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import demo
import imageio
import warnings
warnings.filterwarnings("ignore")
import requests
from skimage.transform import resize
from skimage import img_as_ubyte
from google.colab import files
from IPython.display import display, Javascript,HTML
from google.colab.output import eval_js
from base64 import b64encode, b64decode
Step 2: Prepare a driving video
In this step, we will create a driving video and will make it ready to be passed into the model.
Step 2.1: Record a video from the webcam
Create a function record_video() that can access the webcam utilizing JavaScript.
Remember that Colab is a web IDE that runs entirely in the cloud, which is why JavaScript is needed to access the system’s webcam.
def record_video(filename = 'Video.mp4'):
'''
This function will record a video, by accessing the Webcam using the javascript and store it into a Video file.
Args:
filename: It is the name by which recorded video will be saved. Its default value is 'Video.mp4'.
'''
# Java Script Code for accessing the Webcam and Recording the Video.
js=Javascript("""
async function recordVideo() {
// Create a div. It is a division or a section in an HTML document.
// This div will contain the buttons and the video.
const div = document.createElement('div');
// Create a start recording button.
const capture = document.createElement('button');
// Create a stop recording button.
const stopCapture = document.createElement("button");
// Set the text content, background color and foreground color of the button.
capture.textContent = "Start Recording";
capture.style.background = "orange";
capture.style.color = "white";
// Set the text content, background color and foreground color of the button.
stopCapture.textContent = "Recording";
stopCapture.style.background = "red";
stopCapture.style.color = "white";
// Append the start recording button into the div.
div.appendChild(capture);
// Create a video element.
const video = document.createElement('video');
video.style.display = 'block';
// Prompt the user for permission to use a media input.
const stream = await navigator.mediaDevices.getUserMedia({audio:true, video: true});
// Create a MediaRecorder Object.
let recorder = new MediaRecorder(stream, { mimeType: "video/webm" });
// Append the div into the document.
document.body.appendChild(div);
// Append the video into the div.
div.appendChild(video);
// Set the video source.
video.srcObject = stream;
// Mute the video.
video.muted = true;
// Play the video.
await video.play();
// Set height of the output.
google.colab.output.setIframeHeight(document.documentElement.scrollHeight, true);
// Wait until the video recording button is pressed.
await new Promise((resolve) => {
capture.onclick = resolve;
});
// Start recording the video.
recorder.start();
// Replace the start recording button with the stop recording button.
capture.replaceWith(stopCapture);
// Stop recording automatically after 11 seconds.
setTimeout(()=>{recorder.stop();}, 11000);
// Get the recording.
let recData = await new Promise((resolve) => recorder.ondataavailable = resolve);
let arrBuff = await recData.data.arrayBuffer();
// Stop the stream.
stream.getVideoTracks()[0].stop();
// Remove the div.
div.remove();
// Convert the recording into a binaryString.
let binaryString = "";
let bytes = new Uint8Array(arrBuff);
bytes.forEach((byte) => {
binaryString += String.fromCharCode(byte);
})
// Return the results.
return btoa(binaryString);
}
""")
# Create a try block.
try:
# Execute the javascript code and display the webcam results.
display(js)
data=eval_js('recordVideo({})')
# Decode the recorded data.
binary=b64decode(data)
# Write the video file on the disk.
with open(filename,"wb") as video_file:
video_file.write(binary)
# Display the success message.
print(f"Saved recorded video at: {filename}")
# Handle the exceptions.
except Exception as err:
print(str(err))
Now utilize the record_video() function created above to record a video. Click the recording button, and the browser will ask for permission to access the webcam and microphone (if you have not already allowed these). After allowing, the video will start recording and will be saved to the disk after a few seconds. Please make sure to have neutral facial expressions at the start of the video to get the best DeepFake results.
You can also use a pre-recorded video if you want, by skipping this step and saving that pre-recorded video at the video_path, as sketched below.
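A hedged sketch of that alternative (my addition, using Colab's files.upload() and os, both imported above); run it instead of the recording cell that follows:
# Optional alternative (not part of the original flow): upload a pre-recorded clip from your
# machine and save it as 'Video.mp4' so the rest of the pipeline can find it.
uploaded = files.upload()                          # pick a local .mp4 in the upload dialog
os.rename(list(uploaded.keys())[0], 'Video.mp4')   # rename it to the path the tutorial expects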
# Specify the width at which the video will be displayed.
video_width = 300
# Specify the path of the video.
video_path = 'Video.mp4'
# Record the video.
record_video(video_path)
# Read the Video file.
video_file = open(video_path, "r+b").read()
# Display the Recorded Video, using HTML.
video_url = f"data:video/mp4;base64,{b64encode(video_file).decode()}"
HTML(f"""<video width={video_width} controls><source src="{video_url}"></video>""")
The video is saved, but the issue is that it is just a set of frames with no FPS and duration information, which can cause issues later on. So, before proceeding further, resolve this by re-encoding the video with the FFMPEG command below.
# Discard the output of this cell.
%%capture
# Check if the source video already exists.
if os.path.exists('source_video.mp4'):
# Remove the video.
os.remove('source_video.mp4')
# Set the FPS=23 of the Video.mp4 and save it with the name source_video.mp4.
!ffmpeg -i Video.mp4 -filter:v fps=23 source_video.mp4
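As an optional sanity check (my addition, not part of the original tutorial), you can read the FPS back with OpenCV to confirm the re-encode worked:
# Optional sanity check: confirm the re-encoded video now reports an FPS value and a frame count.
check = cv2.VideoCapture('source_video.mp4')
print('FPS:', check.get(cv2.CAP_PROP_FPS))
print('Frames:', int(check.get(cv2.CAP_PROP_FRAME_COUNT)))
check.release()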
Step 2.2: Crop the face from the recorded video
Crop the face from the video by utilizing the crop-video.py script provided in the First-Order-Model repository.
The script will generate an FFMPEG command that we can use to align and crop the face region of interest and resize it to 256x256. Note that it does not print any FFMPEG command if it fails to detect a face in the video.
# Generate the `FFMPEG` to crop the face from the video.
!python crop-video.py --inp source_video.mp4
Utilize the FFMPEG command generated by the crop-video.py script to create the desired video.
# Discard the output of this cell.
%%capture
# Check if the face video already exists.
if os.path.exists('crop.mp4'):
# Remove the video.
os.remove('crop.mp4')
# Crop the face from the video and resize it to 256x256.
!ffmpeg -i source_video.mp4 -ss 0.0 -t 6.913043478260869 -filter:v "crop=866:866:595:166, scale=256:256" crop.mp4
Now that the cropped face video is stored on the disk, display it to make sure that we have extracted exactly what we desired.
# Read the Cropped Video file.
video_file = open('crop.mp4', "r+b").read()
# Display the Cropped Video, using HTML.
video_url = f"data:video/mp4;base64,{b64encode(video_file).decode()}"
HTML(f"""<video width={video_width} controls><source src="{video_url}"></video>""")
Perfect! The driving video looks good. Now we can start working on a source image.
Step 3: Prepare a source Image
In this step, we will make the source Image ready to be passed into the model.
Download the Image
Download the image that we want to pass to the First-Order Motion Model utilizing the wget command.
# Discard the output of this cell.
%%capture
# Specify the path of the images directory.
IMAGES_DIR = 'media'
# Check if the images directory does not already exist.
if not os.path.exists(os.getcwd()+"/"+IMAGES_DIR):
# Download the images directory.
!wget -O {IMAGES_DIR + '.zip'} 'https://drive.google.com/uc?export=download&id=18t14YLm0nDc7USp550pIjslcZ3g5ZJ0t'
# Extract the compressed directory.
!unzip {os.getcwd() + "/" + IMAGES_DIR + '.zip'}
Load the Image
Read the image using the function cv2.imread() and display it utilizing the matplotlib library.
Note: In case you want to use a different source image, make sure to use an image of a person with neutral expressions to get the best results.
%matplotlib inline
# Specify the source image name.
image_name = 'imran.jpeg'
# Read the source image.
source_image = cv2.imread(os.path.join(os.getcwd(), IMAGES_DIR , image_name))
# Resize the image to make its width 720, while keeping its aspect ratio constant.
source_image = cv2.resize(source_image, dsize=(720, int((720/source_image.shape[1])*source_image.shape[0])))
# Display the image.
plt.imshow(source_image[:,:,::-1]);plt.title("Source Image");plt.axis("off");plt.show()
Step 3.1: Detect the face
Similar to the driving video, we can’t pass the whole source image into the First-Order Motion Model; we have to crop the face from the image and then pass the face image into the model. For this, we will need a face detector to get the face bounding box coordinates, and we will utilize Mediapipe’s Face Detection solution.
Initialize the Mediapipe Face Detection Model
To use the Mediapipe’s Face Detection solution, initialize the face detection class using the syntax mp.solutions.face_detection, and then call the function mp.solutions.face_detection.FaceDetection() with the arguments explained below:
model_selection – It is an integer index ( i.e., 0 or 1 ). When set to 0, a short-range model is selected that works best for faces within 2 meters from the camera, and when set to 1, a full-range model is selected that works best for faces within 5 meters. Its default value is 0.
min_detection_confidence – It is the minimum detection confidence, in the range [0.0, 1.0], required to consider the face detection model’s prediction successful. Its default value is 0.5 (i.e., 50%), which means that all detections with prediction confidence less than 0.5 are ignored by default.
# Initialize the mediapipe face detection class.
mp_face_detection = mp.solutions.face_detection
# Setup the face detection function.
face_detection = mp_face_detection.FaceDetection(model_selection=0, min_detection_confidence=0.5)
Create a function to detect face
Create a function detect_face() that will utilize the Mediapipe’s Face Detection Solution to detect a face in an image and will return the bounding box coordinates of the detected face.
To perform the face detection, pass the image (in RGB format) into the loaded face detection model by using the function mp.solutions.face_detection.FaceDetection().process(). The output object returned will have an attribute detections that contains a list of a bounding box and six key points for each face in the image.
Note that the bounding boxes are composed of xmin and width (both normalized to [0.0, 1.0] by the image width) and ymin and height (both normalized to [0.0, 1.0] by the image height). Ignore the face key points for now as we are only interested in the bounding box coordinates.
After performing the detection, convert the bounding box coordinates back to their original scale utilizing the image width and height. Also draw the bounding box on a copy of the source image using the function cv2.rectangle().
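As a quick toy illustration (made-up relative values, not a real detection), the de-normalization is just a multiply-and-cast:
# Toy de-normalization example (hypothetical relative bounding box values, not a real detection).
image_width, image_height = 1280, 720
rel_xmin, rel_ymin, rel_width, rel_height = 0.42, 0.31, 0.18, 0.27
xmin, ymin = int(rel_xmin * image_width), int(rel_ymin * image_height)
box_width, box_height = int(rel_width * image_width), int(rel_height * image_height)
print(xmin, ymin, box_width, box_height)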
def detect_face(image, face_detection, draw=False, display=True):
'''
This function performs face detection, converts the bounding box coordinates back to their original scale,
and returns the coordinates.
Args:
image: The input image of the person's face whose face needs to be detected.
face_detection: The Mediapipe's face detection function required to perform the face detection.
draw: A boolean value that is if set to true the function draws the face bounding box on the output image.
display: A boolean value that is if set to true the function displays the output image with
the face bounding box drawn and returns nothing.
Returns:
face_bbox: A tuple (xmin, ymin, box_width, box_height) containing the face bounding box coordinates.
'''
# Get the height and width of the input image.
image_height, image_width, _ = image.shape
# Create a copy of the input image to draw a face bounding box.
output_image = image.copy()
# Convert the image from BGR into RGB format.
imgRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Perform the face detection on the image.
face_detection_results = face_detection.process(imgRGB)
# Initialize a tuple to store the face bounding box coordinates.
face_bbox = ()
# Check if the face(s) in the image are found.
if face_detection_results.detections:
# Iterate over the found faces.
for face_no, face in enumerate(face_detection_results.detections):
# Get the bounding box coordinates and convert them back to their original scale.
xmin = int(face.location_data.relative_bounding_box.xmin * image_width)
ymin = int(face.location_data.relative_bounding_box.ymin * image_height)
box_width = int(face.location_data.relative_bounding_box.width * image_width)
box_height = int(face.location_data.relative_bounding_box.height * image_height)
# Update the bounding box tuple values.
face_bbox = (xmin, ymin, box_width, box_height)
# Check if the face bounding box is specified to be drawn.
if draw:
# Draw the face bounding box on the output image.
cv2.rectangle(output_image, (xmin, ymin), (xmin+box_width, ymin+box_height), (0, 0, 255), 2)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[15,15])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise.
else:
# Return the face bounding box coordinates.
return face_bbox
Utilize the detect_face() function created above to detect the face in the source image and display the results.
# Perform face detection on the image.
detect_face(source_image, face_detection, draw=True, display=True)
Nice! Face detection is working perfectly.
Step 3.2: Align and crop the face
Another very important preprocessing step is face alignment on the source image. Make sure that the face is properly aligned in the source image; otherwise, the model can generate weird/funny output results.
To align the face in the source image, first detect the 468 facial landmarks using Mediapipe’s Face Mesh solution, then extract the eyes center and nose tip landmarks to calculate the face orientation, and finally rotate the image accordingly to align the face.
Initialize the Face Landmarks Detection Model
To use the Mediapipe’s Face Mesh solution, initialize the face mesh class using the syntax mp.solutions.face_mesh and call the function mp.solutions.face_mesh.FaceMesh() with the arguments explained below:
static_image_mode – It is a boolean value that is if set to False, the solution treats the input images as a video stream. It will try to detect faces in the first input images, and upon a successful detection further localizes the face landmarks. In subsequent images, once all max_num_faces faces are detected and the corresponding face landmarks are localized, it simply tracks those landmarks without invoking another detection until it loses track of any of the faces. This reduces latency and is ideal for processing video frames. If set to True, face detection runs on every input image, ideal for processing a batch of static, possibly unrelated, images. Its default value is False.
max_num_faces – It is the maximum number of faces to detect. Its default value is 1.
refine_landmarks – It is a boolean value that is if set to True, the solution further refines the landmark coordinates around the eyes and lips, and outputs additional landmarks around the irises by applying the Attention Mesh Model. Its default value is False.
min_detection_confidence – It is the minimum detection confidence ([0.0, 1.0]) required to consider the face-detection model’s prediction correct. Its default value is 0.5 which means that all the detections with prediction confidence less than 50% are ignored by default.
min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) from the landmark-tracking model for the face landmarks to be considered tracked successfully, or otherwise face detection will be invoked automatically on the next input image, so increasing its value increases the robustness, but also increases the latency. It is ignored if static_image_mode is True, where face detection simply runs on every image. Its default value is 0.5.
We will be working with images only, so we will have to set the static_image_mode to True. We will also define the eyes and nose landmarks indexes that are required to extract the eyes and nose landmarks.
# Initialize the mediapipe face mesh class.
mp_face_mesh = mp.solutions.face_mesh
# Set up the face landmarks function for images.
face_mesh = mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=2,
refine_landmarks=True, min_detection_confidence=0.5)
# Specify the nose and eyes indexes.
NOSE = 2
LEFT_EYE = [362, 263] # [right_landmark left_landmark]
RIGHT_EYE = [33, 133] # [right_landmark left_landmark]
Create a function to extract eyes and nose landmarks
Create a function extract_landmarks() that will utilize the Mediapipe’s Face Mesh Solution to detect the 468 Facial Landmarks and then extract the left and right eyes corner landmarks and the nose tip landmark.
To perform the face(s) landmarks detection, pass the image to the face landmarks detection machine learning pipeline by using the function mp.solutions.face_mesh.FaceMesh().process(). But first, convert the image from BGR to RGB format using the function cv2.cvtColor(), as OpenCV reads images in BGR format and the ML pipeline expects the input images to be in RGB color format.
The machine learning pipeline outputs an object that has an attribute multi_face_landmarks that contains the 468 3D facial landmarks for each detected face in the image. Each landmark has:
x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the center of the head being the origin, and the smaller the value is, the closer the landmark is to the camera.
After performing face landmarks detection on the image, convert the landmarks’ x and y coordinates back to their original scale utilizing the image width and height and then extract the required landmarks utilizing the indexes we had specified earlier. Also draw the extracted landmarks on a copy of the source image using the function cv2.circle(), just for visualization purposes.
def extract_landmarks(image, face_mesh, draw=False, display=True):
'''
This function performs face landmarks detection, converts the landmarks x and y coordinates back to their original scale,
and extracts left and right eyes corner landmarks and the nose tip landmark.
Args:
image: The input image of the person's face whose facial landmarks needs to be extracted.
face_mesh: The Mediapipe's face landmarks detection function required to perform the landmarks detection.
draw: A boolean value that is if set to true the function draws the extracted landmarks on the output image.
display: A boolean value that is if set to true the function displays the output image with
the extracted landmarks drawn and returns nothing.
Returns:
extracted_landmarks: A list containing the left and right eyes corner landmarks and the nose tip landmark.
'''
# Get the height and width of the input image.
height, width, _ = image.shape
# Initialize an array to store the face landmarks.
face_landmarks = np.array([])
# Create a copy of the input image to draw facial landmarks.
output_image = image.copy()
# Convert the image from BGR into RGB format.
imgRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Perform the facial landmarks detection on the image.
results = face_mesh.process(imgRGB)
# Check if facial landmarks are found.
if results.multi_face_landmarks:
# Iterate over the found faces.
for face in results.multi_face_landmarks:
# Convert the Face landmarks x and y coordinates into their original scale,
# And store them into a numpy array.
# For simplicity, we are only storing face landmarks of a single face,
# you can extend it to work with multiple faces if you want.
face_landmarks = np.array([(landmark.x*width, landmark.y*height)
for landmark in face.landmark], dtype=np.int32)
# Extract the right eye landmarks.
right_eye_landmarks = [face_landmarks[RIGHT_EYE[0]], face_landmarks[RIGHT_EYE[1]]]
# Extract the left eye landmarks.
left_eye_landmarks = [face_landmarks[LEFT_EYE[0]], face_landmarks[LEFT_EYE[1]]]
# Extract the nose tip landmark.
nose_landmarks = face_landmarks[NOSE]
# Initialize a list to store the extracted landmarks
extracted_landmarks = [nose_landmarks, left_eye_landmarks, right_eye_landmarks]
# Check if extracted landmarks are specified to be drawn.
if draw:
# Draw the left eye extracted landmarks.
cv2.circle(output_image, tuple(left_eye_landmarks[0]), 3, (0, 0, 255), -1)
cv2.circle(output_image, tuple(left_eye_landmarks[1]), 3, (255, 0, 0), -1)
# Draw the right eye extracted landmarks.
cv2.circle(output_image, tuple(right_eye_landmarks[0]), 3, (0, 0, 255), -1)
cv2.circle(output_image, tuple(right_eye_landmarks[1]), 3, (255, 0, 0), -1)
# Draw the nose landmark.
cv2.circle(output_image, tuple(nose_landmarks), 3, (255, 0, 0), -1)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[15,15])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise.
else:
# Return the extracted landmarks.
return extracted_landmarks
Now we will utilize the extract_landmarks() function created above to detect and extract the eyes and nose landmarks and visualize the results.
# Extract the left and right eyes corner landmarks and the nose tip landmark.
extract_landmarks(source_image, face_mesh, draw=True, display=True)
Cool! It is accurately extracting the required landmarks.
Create a function to calculate eyes center
Create a function calculate_eyes_center() that will find the left and right eyes center landmarks by utilizing the eyes corner landmarks that we had extracted in the extract_landmarks() function created above.
def calculate_eyes_center(image, extracted_landmarks, draw=False, display=False):
'''
This function calculates the center landmarks of the left and right eye.
Args:
image: The input image of the person's face whose eyes center landmarks needs to be calculated.
extracted_landmarks: A list containing the left and right eyes corner landmarks and the nose tip landmark.
draw: A boolean value that is if set to true the function draws the eyes center and nose tip
landmarks on the output image.
display: A boolean value that is if set to true the function displays the output image with the
landmarks drawn and returns nothing.
Returns:
landmarks: A list containing the left and right eyes center landmarks and the nose tip landmark.
'''
# Create a copy of the input image to draw landmarks.
output_image = image.copy()
# Get the nose tip landmark.
nose_landmark = extracted_landmarks[0]
# Calculate the center landmarks of the left and right eye.
left_eye_center = np.mean(extracted_landmarks[1], axis=0, dtype=np.int32)
right_eye_center = np.mean(extracted_landmarks[2], axis=0, dtype=np.int32)
# Initialize a list to store the left and right eyes center landmarks and the nose tip landmark.
landmarks = [nose_landmark, left_eye_center, right_eye_center]
# Check if the landmarks are specified to be drawn.
if draw:
# Draw the center landmarks of the left and right eye.
cv2.circle(output_image, tuple(left_eye_center), 3, (0, 0, 255), -1)
cv2.circle(output_image, tuple(right_eye_center), 3, (0, 0, 255), -1)
# Draw the nose tip landmark.
cv2.circle(output_image, tuple(nose_landmark), 3, (0, 0, 255), -1)
# Check if the output image is specified to be displayed.
if display:
# Display the output image.
plt.figure(figsize=[15,15])
plt.imshow(output_image[:,:,::-1]);plt.title("Output Image");plt.axis('off');
# Otherwise.
else:
# Return the left and right eyes center landmarks and the nose tip landmark.
return landmarks
Use the extract_landmarks() and calculate_eyes_center() functions to calculate the center landmarks of the left and right eyes on the source image.
# Get the left and right eyes center landmarks and the nose tip landmark.
extracted_landmarks = extract_landmarks(source_image, face_mesh, draw=False, display=False)
calculate_eyes_center(source_image, extracted_landmarks, draw=True, display=True)
Working perfectly fine!
Create a function to rotate images
Create a function rotate_image() that will simply rotate an image in a counter-clockwise direction with a specific angle without losing any portion of the image.
def rotate_image(image, angle, display=True):
'''
This function rotates an image in counter-clockwise direction with a specific angle.
Args:
image: The input image that needs to be rotated.
angle: It is the angle (in degrees) with which the image needs to be rotated. -ve values can rotate clockwise.
display: A boolean value that is if set to true the function displays the original input image,
and the output rotated image and returns nothing.
Returns:
rotated_image: The image rotated in counter-clockwise direction with the specified angle.
'''
# Get the height and width of the input image.
image_height, image_width, _ = image.shape
# Get the center coordinate x and y values of the image.
(center_x, center_y) = (image_width / 2, image_height / 2)
# Get the rotation matrix to rotate the image with the specified angle at the same scale.
rotation_matrix = cv2.getRotationMatrix2D(center=(center_x, center_y), angle=angle, scale=1.0)
# Compute the new height and width of the image.
new_height = int((image_height * np.abs(rotation_matrix[0, 0])) +
(image_width * np.abs(rotation_matrix[0, 1])))
new_width = int((image_height * np.abs(rotation_matrix[0, 1])) +
(image_width * np.abs(rotation_matrix[0, 0])))
# Adjust the rotation matrix accordingly to the new height and width.
rotation_matrix[0, 2] += (new_width / 2) - center_x
rotation_matrix[1, 2] += (new_height / 2) - center_y
# Perform the actual rotation on the image.
rotated_image = cv2.warpAffine(image.copy(), rotation_matrix, (new_width, new_height))
# Check if the original input image and the output image are specified to be displayed.
if display:
# Display the original input image and the output image.
plt.figure(figsize=[15,15])
plt.subplot(121);plt.imshow(image[:,:,::-1]);plt.title("Original Image");plt.axis('off');
plt.subplot(122);plt.imshow(rotated_image[:,:,::-1]);plt.title(f"Rotated Image angle:{angle}");plt.axis('off');
# Otherwise.
else:
# Return the rotated image.
return rotated_image
Utilize the rotate_image() function to rotate the source image at an angle of 45 degrees.
# Rotate the source image with an angle of 45 degrees.
rotated_img = rotate_image(source_image, 45, display= True)
Rotation looks good, but rotating the image by a random angle will not do us any good; we need the actual face angle.
Create a function to find the face orientation
Create a function calculate_face_angle() that will find the face orientation, and then we will rotate the image accordingly utilizing the function rotate_image() created above, to appropriately align the face in the source image.
To find the face angle, first get the eyes and nose landmarks using the extract_landmarks() function, then pass these landmarks to the calculate_eyes_center() function to get the eyes center landmarks. Utilizing the eyes center landmarks, calculate the midpoint of the eyes, i.e., the center of the forehead. Then use the detect_face() function created in the previous step to get the face bounding box coordinates and utilize those coordinates to find the center_pred point, i.e., the midpoint of the bounding box’s top-left and top-right corners.
And then finally, find the distances between the nose, center_of_forehead, and center_pred points, as shown in the gif above, and calculate the face angle utilizing the cosine law, as illustrated in the small numeric sketch below.
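To make the cosine law concrete, here is a tiny numeric sketch with hypothetical points (not taken from the actual image), computing the angle at the nose from the three pairwise distances:
# Numeric sketch of the cosine law with hypothetical points: the angle at the nose N, between
# the line to the center of the forehead F and the line to center_pred P, is
# arccos((|NF|^2 + |NP|^2 - |FP|^2) / (2 * |NF| * |NP|)).
N, F, P = np.array([100, 200]), np.array([100, 150]), np.array([110, 140])
length_line1 = np.linalg.norm(F - N)   # nose to center of forehead
length_line2 = np.linalg.norm(P - N)   # nose to center_pred
length_line3 = np.linalg.norm(P - F)   # center of forehead to center_pred
cos_a = (length_line1**2 + length_line2**2 - length_line3**2) / (2 * length_line1 * length_line2)
print('Angle:', round(np.degrees(np.arccos(cos_a)), 2))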
def calculate_face_angle(image, face_mesh, face_detection):
'''
This function calculates the face orientation in an image.
Args:
image: The input image of the person whose face angle needs to be calculated.
face_mesh: The Mediapipe's face landmarks detection function required to perform the landmarks detection.
face_detection: The Mediapipe's face detection function required to perform the face detection.
Returns:
angle: The calculated face angle in degrees.
'''
# Create a helper function to find distance between two points.
def calculate_distance(point1, point2):
'''
This function calculates euclidean distance between two points.
Args:
point1: A tuple containing the x and y coordinates of the first point.
point2: A tuple containing the x and y coordinates of the second point.
Returns:
distance: The distance calculated between the two points.
'''
# Calculate euclidean distance between the two points.
distance = np.sqrt((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2)
# Return the calculated distance.
return distance
# Extract the left and right eyes corner landmarks and the nose tip landmark.
nose_and_eyes_landmarks = extract_landmarks(image, face_mesh, draw=False, display=False)
# Get the center of each eye, from Eyes Landmarks.
nose, left_eye_center, right_eye_center = calculate_eyes_center(image, nose_and_eyes_landmarks, draw=False, display=False)
# Calculate the midpoint of the eye center landmarks i.e., the center of the forehead.
center_of_forehead = ((left_eye_center[0] + right_eye_center[0]) // 2,
(left_eye_center[1] + right_eye_center[1]) // 2,)
# Get the face bounding box coordinates.
xmin, ymin, box_width, box_height = detect_face(image, face_detection, display=False)
# Get the mid-point of the bounding box top-right and top_left coordinate.
center_pred = int(xmin + (box_width//2)), ymin
# Find the distance between forehead and nose.
length_line1 = calculate_distance(center_of_forehead, nose)
# Find the distance between center_pred and nose.
length_line2 = calculate_distance(center_pred, nose)
# Find the distance between center_pred and center_of_forehead.
length_line3 = calculate_distance(center_pred, center_of_forehead)
# Use the cosine law to find the cos A.
cos_a = -(length_line3 ** 2 - length_line2 ** 2 - length_line1 ** 2) / (2 * length_line2 * length_line1)
# Get the inverse of the cosine function.
angle = np.arccos(cos_a)
# Set the nose tip landmark as the origin.
origin_x, origin_y = nose
# Get the center of forehead x and y coordinates.
point_x, point_y = center_of_forehead
# Rotate the x and y coordinates w.r.t the origin with the found angle.
rotated_x = int(origin_x + np.cos(angle) * (point_x - origin_x) - np.sin(angle) * (point_y - origin_y))
rotated_y = int(origin_y + np.sin(angle) * (point_x - origin_x) + np.cos(angle) * (point_y - origin_y))
# Initialize a tuple to store the rotated points.
rotated_point = rotated_x, rotated_y
# Do some mathematics to find a few numbers that will help us determine whether the angle has to be positive or negative.
c1 = ((center_of_forehead[0] - nose[0]) * (rotated_point[1] - nose[1]) - (center_of_forehead[1] - nose[1]) *
(rotated_point[0] - nose[0]))
c2 = ((center_pred[0] - center_of_forehead[0]) * (rotated_point[1] - center_of_forehead[1]) -
(center_pred[1] - center_of_forehead[1]) * (rotated_point[0] - center_of_forehead[0]))
c3 = ((nose[0] - center_pred[0]) * (rotated_point[1] - center_pred[1]) -
(nose[1] - center_pred[1]) * (rotated_point[0] - center_pred[0]))
# Check if the angle needs to be negative.
if (c1 < 0 and c2 < 0 and c3 < 0) or (c1 > 0 and c2 > 0 and c3 > 0):
# Make the angle -ve, and convert it into degrees.
angle = np.degrees(-angle)
# Otherwise.
else:
# Convert the angle into degrees.
angle = np.degrees(angle)
# Return the angle.
return angle
Utilize the calculate_face_angle() function created above to find the face angle of the source image and display it.
# Calculate the face angle.
face_angle = calculate_face_angle(source_image, face_mesh, face_detection)
print(f'Face Angle: {face_angle}')
Face Angle: -8.50144759667417
Now that we have the face angle, we can move on to aligning the face in the source image.
Create a Function to Align the Face and Crop the Face Region
Create a function align_crop_face() that will first utilize the function calculate_face_angle() to get the face angle, then rotate the image accordingly utilizing the rotate_image() function and finally crop the face from the image utilizing the face bounding box coordinates (after scaling) returned by the detect_face() function. In the end, it will also resize the face image to the size 256x256 that is required by the First-Order Motion Model.
def align_crop_face(image, face_mesh, face_detection, face_scale_factor=1, display=True):
'''
This function aligns and crops the face and then resizes it to 256x256 dimensions.
Args:
image: The input image of the person whose face needs to be aligned and cropped.
face_mesh: The Mediapipe's face landmarks detection function required to perform the landmarks detection.
face_detection: The Mediapipe's face detection function required to perform the face detection.
face_scale_factor: The factor to scale up or down the face bounding box coordinates.
display: A boolean value that is if set to true the function displays the original input
image, rotated image and the face roi image.
Returns:
face_roi: A copy of the aligned face roi of the input image.
face_angle: The calculated face angle in degrees.
face_bbox: A tuple (xmin, ymin, xmax, ymax) containing the face bounding box coordinates.
'''
# Get the height and width of the input image.
image_height, image_width, _ = image.shape
# Get the angle of the face in the input image.
face_angle = calculate_face_angle(image, face_mesh, face_detection)
# Rotate the input image with the face angle.
rotated_image = rotate_image(image, face_angle, display=False)
# Perform face detection on the image.
face_bbox = detect_face(rotated_image, face_detection, display=False)
# Check if the face was detected in the image.
if len(face_bbox) > 0:
# Get the face bounding box coordinates.
xmin, ymin, box_width, box_height = face_bbox
# Calculate the bottom right coordinate values of the face bounding box.
xmax = xmin + box_width
ymax = ymin + box_height
# Get the face scale value according to the bounding box height.
face_scale = int((box_height * face_scale_factor))
# Add padding to the face bounding box.
xmin = xmin - face_scale//2 if xmin - face_scale//2 > 0 else 0
ymin = ymin - int(face_scale*1.8) if ymin - int(face_scale*1.8) > 0 else 0
xmax = xmax + face_scale//2 if xmax + face_scale//2 < image_width else image_width
ymax = ymax + int(face_scale/1.8) if ymax + int(face_scale/1.8) < image_height else image_height
# Update the face bounding box tuple.
face_bbox = (xmin, ymin, xmax, ymax)
# Crop the face from the image.
face_roi = rotated_image[ymin: ymax, xmin : xmax]
# Resize the face region to 256x256 dimensions.
face_roi = cv2.resize(face_roi, (256, 256), interpolation=cv2.INTER_AREA)
# Save the image on the disk.
cv2.imwrite('source_image.jpg', face_roi)
# Check if the original input image, rotated image and the face roi image are specified to be displayed.
if display:
# Display the original input image, rotated image and the face roi image.
plt.figure(figsize=[15,15])
plt.subplot(131);plt.imshow(image[:,:,::-1]);plt.title("Original Image");plt.axis('off');
plt.subplot(132);plt.imshow(rotated_image[:,:,::-1]);plt.title(f"Rotated Image angle: {round(face_angle, 2)}");plt.axis('off');
plt.subplot(133);plt.imshow(face_roi[:,:,::-1]);plt.title(f"Face ROI");plt.axis('off');
# Return the face roi, the face angle and the face bounding box.
return face_roi, face_angle, face_bbox
Use the function align_crop_face() on the source image and visualize the results.
Make sure that the whole face is present in the cropped face ROI. Adjust the face_scale_factor value if you are testing this Colab on a different source image: increase it if part of the face is being cut off, and decrease it if the face ROI image contains too much background.
# Perform face alignment and crop the face.
face_roi, face_angle, face_bbox = align_crop_face(source_image, face_mesh, face_detection,
face_scale_factor=0.3, display=True)
I must say it's looking good! All the preprocessing steps went as we intended. But one post-processing step still remains, to be applied after generating the output from the First-Order Motion Model.
Remember that later on, we will have to embed the manipulated face back into the source image, so a function to restore the source image’s original state after embedding the output is also required.
Create a function to restore the original source image
So now we will create a function restore_source_image() that will undo the rotation we applied to the image and remove the black borders that appeared after the rotation.
def restore_source_image(rotated_image, rotation_angle, image_size, display=True):
'''
This function undoes the rotation and removes the black borders of an image.
Args:
rotated_image: The rotated image which needs to be restored.
rotation_angle: The angle with which the image was rotated.
image_size: A tuple containing the original height and width of the image.
display: A boolean value that is if set to true the function displays the original
input image, and the output image and returns nothing.
Returns:
output_image: The rotated image after being restored to its original state.
'''
# Get the height and width of the image.
height, width = image_size
# Undo the rotation of the image by rotating again with a -ve angle.
output_image = rotate_image(rotated_image, -rotation_angle, display=False)
# Find the center of the image.
center_x = output_image.shape[1] // 2
center_y = output_image.shape[0] // 2
# Crop the undo_rotation image, and remove the black borders.
output_image = output_image[center_y - height//2 : center_y + height//2,
center_x - width//2 : center_x + width//2]
# Check if the original input image and the output image are specified to be displayed.
if display:
# Display the original input image and the output image.
plt.figure(figsize=[15,15])
plt.subplot(121);plt.imshow(rotated_image[:,:,::-1]);plt.title("Rotated Image");plt.axis('off');
plt.subplot(122);plt.imshow(output_image[:,:,::-1]);plt.title(f"Restored Image");plt.axis('off');
# Otherwise.
else:
# Return the output image.
return output_image
Utilize the calculate_face_angle() and rotate_image() functions to create a rotated image and then check whether restore_source_image() can restore the image's original state by undoing the rotation and removing the black borders from the image.
# Calculate the face angle and rotate the image with the face angle.
face_angle = calculate_face_angle(source_image, face_mesh, face_detection)
rotated_image = rotate_image(source_image, face_angle, display=False)
# Restore the rotated image.
restore_source_image(rotated_image, face_angle, image_size=source_image.shape[:2], display=True)
Step 4: Create the DeepFake
Now that the source image and the driving video are ready, in this step we will create the DeepFake video.
Step 4.1: Download the First-Order Motion Model
Now we will download the required pre-trained network from the Yandex Disk models page. There are multiple checkpoints available there, but since we are only interested in face manipulation, we will only download the vox-adv-cpk.pth.tar file.
# Specify the name of the file.
filename ='vox-adv-cpk.pth.tar'
# Download the pre-trained network.
download = requests.get(requests.get('https://cloud-api.yandex.net/v1/disk/public/resources/download?public_key=https://yadi.sk/d/lEw8uRm140L_eQ&path=/' + filename).json().get('href'))
# Open the file and write the downloaded content.
with open(filename, 'wb') as checkpoint:
checkpoint.write(download.content)
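The Yandex Disk download can occasionally fail or return an error page instead of the checkpoint, so a quick sanity check (not part of the original notebook, and reusing the os module imported earlier) can save some head-scratching later:
# Confirm that the request succeeded and that the checkpoint is actually on disk.
assert download.status_code == 200, 'Checkpoint download failed, try re-running this cell.'
print(f'{filename}: {os.path.getsize(filename) / 1e6:.1f} MB on disk')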
Create a function to display the results
Create a function display_results() that will concatenate the source image, driving video, and the generated video together and will show the results.
def display_results(source_image, driving_video, generated_video=None):
'''
This function stacks and displays the source image, driving video, and generated video together.
Args:
source_image: The source image (contains facial appearance info) that is used to create the deepfake video.
driving_video: The driving video (contains facial motion info) that is used to create the deepfake video.
generated_video: The deepfake video generated by combining the source image and the driving video.
Returns:
resultant_video: A stacked video containing the source image, driving video, and the generated video.
'''
# Create a figure.
fig = plt.figure(figsize=(8 + 4 * (generated_video is not None), 6))
# Create a list to store the frames of the resultant_video.
frames = []
# Iterate the number of times equal to the number of frames in the driving video.
for i in range(len(driving_video)):
# Create a list to store the stack elements.
stack = [source_image]
# Append the driving video into the stack.
stack.append(driving_video[i])
# Check if a valid generated video is passed.
if generated_video is not None:
# Append the generated video into the stack.
stack.append(generated_video[i])
# Concatenate all the elements in the stack.
stacked_image = plt.imshow(np.concatenate(stack, axis=1), animated=True)
# Turn off the axis.
plt.axis('off')
# Append the image into the list.
frames.append([stacked_image])
# Create the stacked video.
resultant_video = animation.ArtistAnimation(fig, frames, interval=50, repeat_delay=1000)
# Close the figure window.
plt.close()
# Return the results.
return resultant_video
Step 4.2: Load source image and driving video (Face cropped)
Load the pre-processed source image and the driving video and then display them utilizing the display_results() function created above.
# Ignore the warnings.
warnings.filterwarnings("ignore")
# Load the Source Image and the driving video.
source_image = imageio.imread('source_image.jpg')
driving_video = imageio.mimread('crop.mp4')
# Resize the Source Image and the driving video to 256x256.
source_image = resize(source_image, (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3] for frame in driving_video]
# Display the Source Image and the driving video.
HTML(display_results(source_image, driving_video).to_html5_video())
Step 4.3: Generate the video
Now that everything is ready, utilize the demo.py script imported earlier to finally generate the DeepFake video. First, load the checkpoint file downloaded above along with the configuration file from the cloned First-Order-Model repository, then generate the video using the demo.make_animation() function and display the results with the display_results() function.
# Load the pre-trained (check points) network and config file.
generator, kp_detector = demo.load_checkpoints(config_path='config/vox-256.yaml',
checkpoint_path='vox-adv-cpk.pth.tar')
# Create the deepfake video.
predictions = demo.make_animation(source_image, driving_video, generator, kp_detector, relative=True)
# Read the driving video to get details like FPS, duration, etc.
reader = imageio.get_reader('crop.mp4')
# Get the Frame Per Second (fps) information.
fps = reader.get_meta_data()['fps']
# Save the generated video to the disk.
imageio.mimsave('results.mp4', [img_as_ubyte(frame) for frame in predictions], fps=fps)
# Display the source image, driving video and the generated video.
HTML(display_results(source_image, driving_video, predictions).to_html5_video())
Step 4.4: Embed the manipulated face into the source image
Create a function embed_face() that will simply insert the manipulated face in the generated video back to the source image.
def embed_face(source_image, source_image_data, generated_video_path, debugging=False):
'''
This function inserts the manipulated face in the generated video back to the source image.
Args:
source_image: The original source image from which the face was cropped.
source_image_data: A list containing the information required to embed the face back to the source image.
generated_video_path: The path where the video generated by the model is stored.
debugging: A boolean value that is if set to True, the intermediate steps are displayed.
Returns:
output_video_path: The path where the output video is stored.
'''
# Resize the image to make its width 720, while keeping its aspect ratio constant.
source_image = cv2.resize(source_image, dsize=(720, int((720/source_image.shape[1])*source_image.shape[0])))
# Get the height and width of the image.
height, width, _ = source_image.shape
# Get the face coordinates in the original image and calculate the face angle.
(xmin, ymin, xmax, ymax), face_angle = source_image_data
# Rotate the source image with the face angle.
rotated_image = rotate_image(source_image, face_angle, display=False)
# Get the height and width of the rotated image.
rotated_height, rotated_width, _ = rotated_image.shape
# Create a black image with size equal to the rotated image.
mask = np.zeros(shape=(rotated_height, rotated_width), dtype=np.uint8)
# Get the width and height of the face bounding box.
bbox_width, bbox_height = xmax-xmin, ymax-ymin
# Calculate the center coordinate of the face bounding box.
center_x, center_y = xmin+(bbox_width//2), ymin+(bbox_height//2)
# Initialize a variable to store the weight.
weight = 1
# Get the approximate width and height of the face in the bounding box.
roi_width = int(bbox_width/1.3)
roi_height = int(bbox_height/1.2)
# Draw a white filled rectangle at the center of the face bounding box on the mask image.
mask = cv2.rectangle(mask, (center_x-(roi_width//2), center_y-(roi_height//2)),
(center_x+(roi_width//2), center_y+(roi_height//2)),
(255*weight), thickness=-1)
# Iterate while the roi size is still smaller than the face bounding box.
while roi_width<bbox_width and roi_height<bbox_height:
# Draw a gray rectangle around the face rectangle on the mask image.
# This will help in blending the face roi in the source image.
mask = cv2.rectangle(mask, (center_x-(roi_width//2), center_y-(roi_height//2)),
(center_x+(roi_width//2), center_y+(roi_height//2)),
(255*weight), thickness=int(roi_height/40))
# Check if the roi width is less than the face bounding box width.
if roi_width<bbox_width:
# Increment the roi width.
roi_width+=bbox_width//40
# Check if the roi height is less than the face bounding box height.
if roi_height<bbox_height:
# Increment the roi height.
roi_height+=bbox_height//40
# Decrement the weight.
weight-=0.1
# Draw a rectangle at the edge of the face bounding box.
mask = cv2.rectangle(mask, (center_x-(roi_width//2), center_y-(roi_height//2)),
(center_x+(roi_width//2), center_y+(roi_height//2)),
(255*weight), thickness=int(roi_height/40))
# Load the generated video file.
video_reader = cv2.VideoCapture(generated_video_path)
# Define the Codec for Video Writer.
fourcc = cv2.VideoWriter_fourcc(*"XVID")
# Specify the path to store the final video.
output_video_path = "final_video.mp4"
# Initialize the video writer.
video_writer = cv2.VideoWriter(output_video_path, fourcc, 24, (1280, int((1280/width)*height)))
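# Note: the writer is initialized at a fixed 24 FPS here; the final frame rate is set later when the video is re-encoded with ffmpeg.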
# Merge three copies of the mask to make it a three-channel image, and normalize it to the 0-1 range.
mask = cv2.merge((mask, mask, mask)).astype(float)/255
# Iterate over the frames of the generated video.
while video_reader.isOpened():
# Read a frame.
ok, frame = video_reader.read()
# Check if the frame is not read properly then break the loop.
if not ok:
break
# Resize the frame to match the size of the cropped (face) region.
frame = cv2.resize(frame, dsize=(xmax-xmin, ymax-ymin))
# Create a copy of the rotated image.
rotated_frame = rotated_image.copy()
# Embed the face from the generated video into the rotated source image.
rotated_frame[ymin: ymax, xmin : xmax] = frame
# Blend the edges of the image.
output_image = (((1-mask)) * rotated_image.astype(float)) + (rotated_frame.astype(float) * (mask))
# Undo the rotation and remove the black borders.
output_image = restore_source_image(output_image.astype(np.uint8), face_angle, image_size=source_image.shape[:2],
display=False)
# Resize the image to make its width 1280, while keeping its aspect ratio constant.
output_image = cv2.resize(output_image, dsize=(1280, int((1280/width)*height)))
# Write the frame.
video_writer.write(output_image)
# Check if debugging is enabled.
if debugging:
# Display the intermediate steps.
plt.figure(figsize=[15,15])
plt.subplot(121);plt.imshow(mask, cmap='gray');plt.title("Mask Image");plt.axis('off');
plt.subplot(122);plt.imshow(output_image[:,:,::-1]);plt.title(f"Output Image");plt.axis('off');
break
# Release the video writer, video reader and close all the windows.
video_writer.release()
video_reader.release()
cv2.destroyAllWindows()
# Return the output video path.
return output_video_path
Now let’s utilize the function embed_face() to insert the manipulated face into the source image.
# Discard the output of this cell.
%%capture
# Embed the face into the source image.
video_path = embed_face(cv2.imread(os.path.join(os.getcwd(), IMAGES_DIR , image_name)),
source_image_data=[face_bbox, face_angle], generated_video_path="results.mp4",
debugging=False)
# Check if the video with the FPS already exists.
if os.path.exists('final_video_with_fps.mp4'):
# Remove the video.
os.remove('final_video_with_fps.mp4')
# Add FPS information to the video.
!ffmpeg -i {video_path} -filter:v fps=fps=23 final_video_with_fps.mp4
The video is now stored on the disk, so we can display it to see what the result looks like at this stage (the audio will be added in the next step).
# Load the video.
video = open("final_video_with_fps.mp4", "rb").read()
# Decode the video.
data_url = "data:video/mp4;base64," + b64encode(video).decode()
# Display the video.
HTML(f"""<video width=400 controls><source src="{data_url}" type="video/mp4"></video>""")
Step 5: Add Audio (of the Driving Video) to the DeepFake Output Video
In the last step, first copy the audio from the driving video into the generated video and then download the video on the disk.
# Discard the output of this cell.
%%capture
# Check if the video with the audio already exists.
if os.path.exists('result_with_audio.mp4'):
# Remove the video.
os.remove('result_with_audio.mp4')
# Copy audio from the driving video into the generated video.
!ffmpeg -i crop.mp4 -i final_video_with_fps.mp4 -c copy -map 1:v:0 -map 0:a:0 -shortest result_with_audio.mp4
# Download the video.
files.download('result_with_audio.mp4')
The video should have started downloading on your system.
Bonus: Generate more examples
Now let’s try to generate more videos with different source images.
# Discard the output of this cell.
%%capture
# Specify the path of the source image.
image_path = 'elon.jpeg' # face_scale_factor=0.45
# image_path = 'drstrange.jpeg' # face_scale_factor= 0.55
# image_path = 'johnny.jpeg' # face_scale_factor=0.7
# image_path = 'mark.jpeg' # face_scale_factor=0.55
# Read another source image.
source_image = cv2.imread(os.path.join(os.getcwd(), IMAGES_DIR , image_path))
# Resize the image to make its width 720, while keeping its aspect ratio constant.
source_image = cv2.resize(source_image, dsize=(720, int((720/source_image.shape[1])*source_image.shape[0])))
# Perform face alignment and crop the face.
face_roi, angle, bbox = align_crop_face(source_image, face_mesh, face_detection,
face_scale_factor=0.45, display=False)
# Resize the Source Image to 256x256.
face_roi = resize(face_roi, (256, 256))[..., :3]
# Create the deepfake video.
predictions = demo.make_animation(face_roi[:,:,::-1], driving_video, generator, kp_detector, relative=True)
# Read the driving video to get details like FPS, duration, etc., and get the frames per second (FPS) information.
reader = imageio.get_reader('crop.mp4')
fps = reader.get_meta_data()['fps']
# Save the generated video to the disk.
imageio.mimsave('generated_results.mp4', [img_as_ubyte(frame) for frame in predictions], fps=fps)
# Embed the face into the source image.
video_path = embed_face(source_image, source_image_data=[bbox, angle], generated_video_path="generated_results.mp4")
# Check if the video with the FPS already exists.
if os.path.exists('final_video_with_fps.mp4'):
# Remove the video.
os.remove('final_video_with_fps.mp4')
# Add FPS information to the video.
!ffmpeg -i {video_path} -filter:v fps=fps=23 final_video_with_fps.mp4
# Check if the video with the audio already exists.
if os.path.exists('result_with_audio.mp4'):
# Remove the video.
os.remove('result_with_audio.mp4')
# Copy audio from the driving video into the generated video.
!ffmpeg -i crop.mp4 -i final_video_with_fps.mp4 -c copy -map 1:v:0 -map 0:a:0 -shortest result_with_audio.mp4
# Download the video.
files.download('result_with_audio.mp4')
# Load the video.
video = open("result_with_audio.mp4", "rb").read()
# Decode the video.
data_url = "data:video/mp4;base64," + b64encode(video).decode()
# Display the video.
HTML(f"""<video width=400 controls><source src="{data_url}" type="video/mp4"></video>""")
And here are a few more results on different sample images:
After Johnny Depp, comes Mark Zuckerberg sponsoring Bleed AI.
And last but not least, of course, comes someone from the Marvel Universe, yes it’s Dr. Strange himself asking you to visit Bleed AI.
You can now share these videos that you have generated on social media. Make sure that you mention that it is a DeepFake video in the post’s caption.
Conclusion
One current limitation of this approach shows up when the person moves around too much in the driving video. In that case the final result will look bad, because we only get the face-ROI video from the First-Order Motion Model and then embed that face video back into the source image using image processing techniques; we cannot move the person's body in the source image while the face moves in the generated face-ROI video. So for driving videos in which the person moves a lot, you can either skip the face-embedding part (a minimal sketch of that shortcut is shown below) or train a First-Order Motion Model to manipulate the whole body instead of just the face; I might cover that in a future post.
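Here is a minimal sketch of that shortcut: it keeps only the model's face-ROI output and copies the driving video's audio over it, reusing the predictions, fps, and crop.mp4 from Step 4. The file names face_only.mp4 and face_only_with_audio.mp4 are just placeholders chosen for this example:
# Save only the model's face-ROI output, skipping the embed_face() step.
imageio.mimsave('face_only.mp4', [img_as_ubyte(frame) for frame in predictions], fps=fps)
# Remove any previous output with the same (placeholder) name.
if os.path.exists('face_only_with_audio.mp4'):
    os.remove('face_only_with_audio.mp4')
# Copy the audio from the driving video into the face-only video.
!ffmpeg -i crop.mp4 -i face_only.mp4 -c copy -map 1:v:0 -map 0:a:0 -shortest face_only_with_audio.mp4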
A Message on Deepfakes by Taha
These days, it's not difficult to create a DeepFake video. As you can see, anyone with access to the Colab repo (provided when you download the code) can generate deepfakes in minutes.
Now, although these fakes look fairly realistic, you should still be able to easily tell a fake manipulation from a real one. This is because the model is designed particularly for fast inference; there are other approaches that can take hours or days to render deepfakes, and those are very hard to tell apart from real ones.
The model I used today is not new; it has already been out there for a few years (fun fact: we actually started working on this blog post in the middle of last year, so it got delayed by more than a year). Anyway, the point is that deepfake technology is evolving fast, and that leads to two things:
1) Easier accessibility: More and more high-level tools are coming out, which lowers the barrier to entry, so more non-technical people can use these tools to generate deepfakes. I'm sure you know of some mobile apps that let ordinary users generate these.
2) Better algorithms: The algorithms are getting better and better, to the point where you will have a hard time distinguishing a deepfake from a real video. Today, professional deepfake creators actually export the output of a deepfake model to a video editor and remove or correct the bad frames so that people cannot easily figure out it's a fake. And that makes sense: if the model generates a 10-second clip at 30 FPS, not all 300 output frames are going to be perfect.
Obviously, deepfake tech has many harmful uses; it has been used to generate fake news, spread propaganda, and create pornography. But it also has creative use cases in the entertainment industry (check out Wombo) and in the content industry; just check out the amazing work synthesia.io is doing and how it has helped people and companies.
One thing you might wonder is: in these times, how should you equip yourself to spot deepfakes?
Well, there are certainly some things you can do to better prepare yourself. For one, you can learn a thing or two about digital forensics and how to spot fakes from anomalies, pixel manipulations, metadata, and so on; a small example of the metadata angle is shown below.
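As a tiny example of that metadata angle, command-line tools such as ffprobe (which ships with ffmpeg) can dump a file's container metadata, encoder tags, and stream details, which sometimes reveal that a clip was re-encoded or assembled by an editing tool (suspect_video.mp4 is just a placeholder name here):
# Inspect the container and stream metadata of a video file.
!ffprobe -hide_banner -show_format -show_streams suspect_video.mp4
Metadata alone is never conclusive either way, but it is a quick first check.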
Even as a non-technical consumer, you can do a lot to distinguish a fake video from a real one by fact-checking and finding the original source of the video. For example, if you find your country's president talking about starting a nuclear war with North Korea on some random person's Twitter, then it's probably fake, no matter how real the scene looks. An excellent resource to learn about fact-checking is the YouTube series Navigating Digital Information by CrashCourse. Do check it out.
Hire Us
Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies
Join My Course: Computer Vision For Building Cutting Edge Applications
The only course out there that goes beyond basic AI Applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, hand and body gestures. Don’t miss your chance to level up and take your career to new heights
You’ll Learn about:
Creating GUI interfaces for Python AI scripts.
Creating .exe DL applications
Using a Physics library in Python & integrating it with AI
Advanced Image Processing Skills
Advanced Gesture Recognition with Mediapipe
Task Automation with AI & CV
Training an SVM Machine Learning Model.
Creating & Cleaning an ML dataset from scratch.
Training DL models & how to use CNNs & LSTMs.
Creating 10 Advanced AI/CV Applications
& More
Whether you're a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect & build complex, real-world, and thrilling AI applications.