In last week’s tutorial, we learned how to work with real-time pose detection and created a pose classification system. In this week’s tutorial, we’ll learn to play the popular game “Subway Surfers”.
Of course, there’s more to it; this is an AI blog, after all.
We will actually be using our body pose to control the game instead of keyboard controls. The entire application will run in real-time on your CPU; you don’t even need a depth camera or a Kinect, your webcam will suffice.
Excited yet? Let’s get into it. But before that, let me tell you a short story that motivated me to build this application. It starts with me giving a lecture on the importance of physical fitness. I know … I know … how this sounds, but just bear with me for a bit.
Hi all, Taha Awnar here. So here’s the thing: one of the things I enjoyed most in my early teenage years was having a fast metabolism, thanks to my involvement in physical activities. I could eat whatever I wanted without making a conscious effort to exercise, and still stay fit.
But as I grew older and started spending most of my time in front of a computer, I noticed that I was gaining weight. No longer could I afford the luxury of binging on unhealthy food and skipping workouts.
Now, I’m a bit of a foodie, so although I could compromise a bit on what I eat, I still needed to cut weight some other way. I quickly realized that unless I wanted to become obese, I needed to make a conscious effort to work out.
That’s about when I joined a local gym in my area, and guess what? … it didn’t work out (or I didn’t work out … enough 🙁), so I quit after a month.
So what was the reason? … Well, I could offer multiple excuses, but to be honest, I was just lazy.
A few months later I joined the gym again, and again I quit, this time after just 2 months.
I could have given up completely, but instead, 8 months back, I tried again. This time I even hired a trainer to keep me motivated, and as they say, the third time’s a charm, and luckily it was!
8 months in, I’m still at it. I did see results and lost a couple of kgs, although I haven’t reached my personal target so I’m still working towards it.
If you’re reading this post, then you’re probably into computer science just like me, which most likely means you spend a lot of time in front of a PC, and because of that, your physical and mental fitness can take a toll. I seriously can’t stress enough how important it is that you take out a couple of hours each week to exercise.
I’m not a fitness guru but I can say working out has many key benefits:
Helps you shed excess weight, keeps you physically fit.
Gives you mental clarity and improves your work quality.
Lots of health benefits.
Helps you get a partner, if you’re still single like me … lol
Because of these reasons, even though I have an introverted personality, I consciously take out a couple of hours each week to go to the gym or the park for running.
But here’s the thing: sometimes I wonder why I can’t combine what I do (working on a PC) with some physical activity, so I could … you know, kill two birds with one stone.
This thought led me to create this post: I built a vision application that lets me control the very popular game Subway Surfers with my body movements by utilizing real-time pose detection.
And so In this tutorial, I’ll show you how to create this application that controls the Subway Surfers game using body gestures and movements so that you can also exercise, code, and have fun at the same time.
How will this Work?
The game features a character running from a policeman, dodging hurdles by jumping, crouching, and moving left and right. So we only need to worry about four controls, which are normally mapped to the keyboard:
Up arrow key to make the character jump
Down arrow key to make the character crouch
Left arrow key to move the character to the left
Right arrow key to move the character to the right
Using the Pyautogui library, we will automatically trigger the required keypress events, depending upon the body movement of the person that we’ll capture using Mediapipe’s Pose Detection model.
I want the game’s character to:
Jump whenever the person controlling the character jumps.
Crouch whenever the person controlling the character crouches.
Move left whenever the person controlling the character moves to the left side of the screen.
Move right whenever the person controlling the character moves to the right on the screen.
You can also use the techniques you’ll learn in this tutorial to control any other game. The simpler the game, the easier it will be to control. I have actually published two tutorials about game control via body gestures.
Alright now that we have discussed the basic mechanisms for creating this application, let me walk you through the exact step-by-step process I used to create this.
Outline
Step 1: Perform Pose Detection
Step 2: Control Starting Mechanism
Step 3: Control Horizontal Movements
Step 4: Control Vertical Movements
Step 5: Control Keyboard and Mouse with PyautoGUI
Step 6: Build the Final Application
Alright, let’s get started.
Download Code
Import the Libraries
We will start by importing the required libraries.
Python
import cv2
import pyautogui
from time import time
from math import hypot
import mediapipe as mp
import matplotlib.pyplot as plt
Initialize the Pose Detection Model
After that, we will initialize the mp.solutions.pose class, call the mp.solutions.pose.Pose() function with appropriate arguments, and also initialize the mp.solutions.drawing_utils class that is needed to visualize the landmarks after detection.
To implement the game control mechanisms, we will need the current pose info of the person controlling the game, as our intention is to control the character with the movement of the person in the frame. We want the game’s character to move left, right, jump and crouch with the identical movements of the person.
So we will create a function detectPose() that takes an image as input and performs pose detection on the person in the image using MediaPipe’s pose detection solution to get thirty-three 3D landmarks on the body. The function will display or return the results depending upon the passed arguments.
This function is quite similar to the one we had created in the previous post. The only difference is that we are not plotting the pose landmarks in 3D and we are passing a few more optional arguments to the function mp.solutions.drawing_utils.draw_landmarks() to specify the drawing style.
You probably do not want to lose control of the game’s character whenever some other person comes into the frame (and starts controlling the character), so that annoying scenario is already taken care of, as the solution we are using only detects the landmarks of the most prominent person in the image.
So you do not need to worry about losing control as long as you are the most prominent person in the frame as it will automatically ignore the people in the background.
It worked pretty well! If you want, you can test the function on other images too by changing the value of the variable IMG_PATH in the cell above. It will work fine as long as there is a prominent person in the image.
Step 2: Control Starting Mechanism
In this step, we will implement the game-starting mechanism. We want to start the game whenever the most prominent person in the image/frame joins both hands together. So we will create a function checkHandsJoined() that will check whether the hands of the person in an image are joined or not.
The function checkHandsJoined() will take in the results of the pose detection returned by the function detectPose() and will use the LEFT_WRIST and RIGHT_WRIST landmark coordinates from the list of thirty-three landmarks to calculate the Euclidean distance between the hands of the person.
It will then compare that distance with an appropriate threshold to check whether the hands of the person in the image/frame are joined or not, and will display or return the results depending upon the passed arguments.
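To make the distance check concrete, here is a minimal sketch of that comparison, assuming the wrist landmarks have already been converted to pixel coordinates. The function name, coordinates, and threshold value below are illustrative, not the exact ones from the final code:

```python
from math import hypot

def hands_joined(left_wrist, right_wrist, threshold=130):
    # Euclidean distance between the two wrist points (in pixels).
    distance = int(hypot(left_wrist[0] - right_wrist[0],
                         left_wrist[1] - right_wrist[1]))

    # Hands count as joined when the wrists are closer than the threshold.
    return distance < threshold

# Hypothetical pixel coordinates for a quick check:
print(hands_joined((640, 500), (700, 510)))   # wrists close together -> True
print(hands_joined((300, 500), (900, 510)))   # wrists far apart -> False
```

The threshold is in pixels, so in practice it depends on how far the person stands from the camera; tune it on your own webcam feed.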
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:

    # Check if the left and right hands are joined.
    frame, _ = checkHandsJoined(frame, results, draw=True)

# Display the frame.
cv2.imshow('Hands Joined?', frame)

# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF

# Check if 'ESC' is pressed and break the loop.
if (k == 27):
    break

# Release the VideoCapture object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
Woah! I am stunned. The pose detection solution is best known for its speed, which is reflected in the results: the distance and the hands status update very fast and are also highly accurate.
Step 3: Control Horizontal Movements
Now comes the implementation of the left and right movement control mechanism of the game’s character. We want to make the game’s character move left and right with the horizontal movements of the person in the image/frame.
So we will create a function checkLeftRight() that will take in the pose detection results returned by the function detectPose() and will use the x-coordinates of the RIGHT_SHOULDER and LEFT_SHOULDER landmarks to determine the horizontal position (Left, Right, or Center) of the person, by comparing the landmarks with the x-coordinate of the center of the frame.
The function will visualize or return the resultant image and the horizontal position of the person depending upon the passed arguments.
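The core of that comparison can be sketched as follows. This is a simplified stand-in (the function name and example values are illustrative), assuming both shoulder x-coordinates are already in pixels:

```python
def horizontal_position(left_shoulder_x, right_shoulder_x, frame_width):
    # x-coordinate of the vertical center line of the frame.
    mid_x = frame_width // 2

    # Both shoulders on the left half of the frame -> person moved left.
    if left_shoulder_x < mid_x and right_shoulder_x < mid_x:
        return 'Left'

    # Both shoulders on the right half of the frame -> person moved right.
    if left_shoulder_x > mid_x and right_shoulder_x > mid_x:
        return 'Right'

    # The center line passes between the shoulders -> person is at the center.
    return 'Center'

# Hypothetical shoulder positions in a 1280px-wide frame:
print(horizontal_position(500, 600, 1280))   # both left of 640 -> Left
print(horizontal_position(600, 700, 1280))   # straddling 640 -> Center
```

Requiring both shoulders to cross the center line (rather than just one) avoids flickering between positions when the person stands near the middle.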
# Return the output image and the person's horizontal position.
returnoutput_image,horizontal_position
Now we will test the function checkLeftRight() created above on a real-time webcam feed and will visualize the results updating in real-time with the horizontal movements.
Python
# Initialize the VideoCapture object to read from the webcam.
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:

    # Check the horizontal position of the person in the frame.
    frame, _ = checkLeftRight(frame, results, draw=True)

# Display the frame.
cv2.imshow('Horizontal Movements', frame)

# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF

# Check if 'ESC' is pressed and break the loop.
if (k == 27):
    break

# Release the VideoCapture object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
Cool! The speed and accuracy of this model never fail to impress me.
Step 4: Control Vertical Movements
In this step, we will implement the jump and crouch control mechanism of the game’s character. We want to make the game’s character jump and crouch whenever the person in the image/frame jumps or crouches.
So we will create a function checkJumpCrouch() that will check whether the posture of the person in an image is Jumping, Crouching or Standing by utilizing the results of pose detection by the function detectPose().
The function checkJumpCrouch() will retrieve the RIGHT_SHOULDER and LEFT_SHOULDER landmarks from the list to calculate the y-coordinate of the midpoint of both shoulders and will determine the posture of the person by doing a comparison with an appropriate threshold value.
The threshold (MID_Y) will be the approximate y-coordinate of the midpoint of both shoulders of the person while in a standing posture. It will be calculated before starting the game in Step 6: Build the Final Application and will be passed to the function checkJumpCrouch().
But the issue with this approach is that the midpoint of both shoulders of the person in a standing posture will not always be exactly the same, as it will vary when the person moves closer to or further from the camera.
To tackle this issue we will add and subtract a margin to the threshold to get an upper and lower bound as shown in the image below.
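That upper/lower-bound check can be sketched like this (the margin value is an illustrative guess, not the one computed in the final application); remember that in image coordinates y increases downward:

```python
def check_posture(shoulders_mid_y, mid_y, margin=15):
    # Bounds around the standing baseline; y grows downward in image coordinates.
    lower_bound = mid_y - margin   # above this line counts as a jump
    upper_bound = mid_y + margin   # below this line counts as a crouch

    if shoulders_mid_y < lower_bound:
        return 'Jumping'
    elif shoulders_mid_y > upper_bound:
        return 'Crouching'
    else:
        return 'Standing'

# With a hypothetical standing baseline of y = 300:
print(check_posture(250, 300))   # shoulders well above the line -> Jumping
print(check_posture(350, 300))   # shoulders well below the line -> Crouching
print(check_posture(305, 300))   # within the margin -> Standing
```

The margin absorbs small changes in the shoulders' midpoint as the person shifts slightly closer to or further from the camera, so only a deliberate jump or crouch triggers a posture change.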
# Return the output image and posture indicating whether the person is standing straight or has jumped, or crouched.
return output_image, posture
Now we will test the function checkJumpCrouch() created above on the real-time webcam feed and visualize the resultant frames. For testing purposes, we will be using a default value of the threshold, which you can tune manually according to your height.
Python
# Initialize the VideoCapture object to read from the webcam.
# Check if the pose landmarks in the frame are detected.
if results.pose_landmarks:

    # Check the posture (jumping, crouching or standing) of the person in the frame.
    frame, _ = checkJumpCrouch(frame, results, draw=True)

# Display the frame.
cv2.imshow('Vertical Movements', frame)

# Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
k = cv2.waitKey(1) & 0xFF

# Check if 'ESC' is pressed and break the loop.
if (k == 27):
    break

# Release the VideoCapture object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
Great! When I lower my shoulders a certain distance below the horizontal line (threshold), the result is Crouching; whenever my shoulders are near the horizontal line (i.e., between the upper and lower bounds), the result is Standing; and when my shoulders are a certain distance above the horizontal line, the result is Jumping.
Step 5: Control Keyboard and Mouse with PyautoGUI
The Subway Surfers character wouldn’t be able to move left, right, jump or crouch unless we provide it the required keyboard inputs. Now that we have the functions checkHandsJoined(), checkLeftRight() and checkJumpCrouch(), we need to figure out a way to trigger the required keyboard keypress events, depending upon the output of the functions created above.
This is where the PyAutoGUI API shines. It allows you to easily control mouse and keyboard events through scripts. To get an idea of PyAutoGUI’s capabilities, you can check this video in which a bot plays the game Sushi Go Round.
To run the cells in this step, it is not recommended to use the keyboard shortcut (Shift + Enter), as the cells with keypress events will behave differently when those events are combined with the Shift and Enter keys. You can either use the menubar (Cell>>Run Cell) or the toolbar (▶️Run) to run the cells.
Now let’s see how simple it is to trigger the up arrow keypress event using pyautogui.
Python
# Press the up key.
pyautogui.press(keys='up')
Similarly, we can trigger the down arrow or any other keypress event by replacing the argument with that key name (the argument should be a string). You can click here to see the list of valid arguments.
Python
# Press the down key.
pyautogui.press(keys='down')
To press multiple keys, we can pass a list of strings (key names) to the pyautogui.press() function.
Or to press the same key multiple times, we can pass a value (number of times we want to press the key) to the argument presses in the pyautogui.press() function.
Python
# Press the down key 4 times.
pyautogui.press(keys='down', presses=4)
This function presses the key(s) down and then releases them automatically. We can also control the key press and key release events individually by using the functions:
pyautogui.keyDown(key): Presses and holds down the specified key.
pyautogui.keyUp(key): Releases the specified key.
So with the help of these functions, keys can be held down for a longer period. In the cell below, we will hold down the shift key, press the enter key two times (to run the two cells below this one), and then release the shift key.
Python
# Hold down the shift key.
pyautogui.keyDown(key='shift')
# Press the enter key two times.
pyautogui.press(keys='enter', presses=2)
# Release the shift key.
pyautogui.keyUp(key='shift')
Python
# This cell will run automatically due to keypress events in the previous cell.
print('Hello!')
Python
# This cell will also run automatically due to those keypress events.
print('Happy Learning!')
Now we will hold down the ctrl key and press the tab key, and then release the ctrl key. This will switch the tab of your browser, so make sure to have multiple tabs open before running the cell below.
Python
# Hold down the ctrl key.
pyautogui.keyDown(key='ctrl')

# Press the tab key.
pyautogui.press(keys='tab')

# Release the ctrl key.
pyautogui.keyUp(key='ctrl')
To trigger mouse press events, we can use the pyautogui.click() function, and to specify the mouse button that we want to press, we can pass the value left, middle, or right to the argument button.
Python
# Press the mouse right button. It will open up the menu.
pyautogui.click(button='right')
We can also move the mouse cursor to a specific position on the screen by specifying the x and y-coordinate values to the arguments x and y respectively.
Python
# Move to 1300, 800, then click the right mouse button
pyautogui.click(x=1300, y=800, button='right')
Step 6: Build the Final Application
In the final step, we will have to combine all the components to build the final application.
We will use the outputs of the functions created above checkHandsJoined() (to start the game), checkLeftRight() (control horizontal movements) and checkJumpCrouch() (control vertical movements) to trigger the relevant keyboard and mouse events and control the game’s character with our body movements.
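One way to picture the glue logic: the application keeps an index of the lane the character is currently in, and converts a change in the detected horizontal position into the arrow-key presses needed to reach the matching lane. The helper below is an illustrative sketch of that idea (the real cell calls pyautogui.press() directly rather than returning a list):

```python
def lane_keypresses(horizontal_position, x_pos_index):
    # Map the detected position to a target lane index (0 = left, 1 = center, 2 = right).
    target = {'Left': 0, 'Center': 1, 'Right': 2}[horizontal_position]

    presses = []

    # Press 'right' until the character reaches the target lane from the left.
    while x_pos_index < target:
        presses.append('right')
        x_pos_index += 1

    # Press 'left' until the character reaches the target lane from the right.
    while x_pos_index > target:
        presses.append('left')
        x_pos_index -= 1

    return presses, x_pos_index

# The person steps from the center to the right side of the frame:
print(lane_keypresses('Right', 1))   # (['right'], 2)
print(lane_keypresses('Left', 2))    # (['left', 'left'], 0)
```

Tracking the current lane index is what lets the application press a key only on a *change* of position, instead of spamming keypresses on every frame.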
Now we will run the cell below and click here to play the game in our browser using our body gestures and movements.
Python
# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(0)
camera_video.set(3, 1280)
camera_video.set(4, 960)

# Create a named window for resizing purposes.
cv2.namedWindow('Subway Surfers with Pose Detection', cv2.WINDOW_NORMAL)

# Initialize a variable to store the time of the previous frame.
time1 = 0

# Initialize a variable to store the state of the game (started or not).
game_started = False

# Initialize a variable to store the index of the current horizontal position of the person.
# At start the character is at the center, so the index is 1, and it can move left (value 0) and right (value 2).
x_pos_index = 1

# Initialize a variable to store the index of the current vertical posture of the person.
# At start the person is standing, so the index is 1, and he can crouch (value 0) and jump (value 2).
y_pos_index = 1

# Declare a variable to store the initial y-coordinate of the mid-point of both shoulders of the person.
MID_Y = None

# Initialize a counter to store the count of consecutive frames with the person's hands joined.
counter = 0

# Initialize the number of consecutive frames on which we want to check if the person's hands are joined before starting the game.
num_of_frames = 10

# Iterate until the webcam is accessed successfully.
while camera_video.isOpened():

    # Read a frame.
    ok, frame = camera_video.read()

    # Check if the frame is not read properly then continue to the next iteration to read the next frame.
    if not ok:
        continue

    # Flip the frame horizontally for natural (selfie-view) visualization.
    frame = cv2.flip(frame, 1)

    # Display the frame.
    cv2.imshow('Subway Surfers with Pose Detection', frame)

    # Wait for 1ms. If a key is pressed, retrieve the ASCII code of the key.
    k = cv2.waitKey(1) & 0xFF

    # Check if 'ESC' is pressed and break the loop.
    if (k == 27):
        break

# Release the VideoCapture object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output Video:
While building big applications like this one, I always divide the application into smaller components and then, in the end, integrate all those components to make the final application.
This makes it really easy to learn and understand how everything comes together to build up the full application.
Join My Course Computer Vision For Building Cutting Edge Applications Course
The only course out there that goes beyond basic AI applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, and hand and body gestures. Don’t miss your chance to level up and take your career to new heights.
You’ll Learn about:
Creating GUI interfaces for Python AI scripts
Creating .exe DL applications
Using a Physics library in Python & integrating it with AI
Advanced image processing skills
Advanced gesture recognition with Mediapipe
Task automation with AI & CV
Training an SVM machine learning model
Creating & cleaning an ML dataset from scratch
Training DL models & how to use CNNs & LSTMs
Creating 10 advanced AI/CV applications
& More
Whether you’re a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect and build complex, real-world, and thrilling AI applications.
In this tutorial, we learned to perform pose detection on the most prominent person in the frame/image to get thirty-three 3D landmarks, then used those landmarks to extract useful information about the person’s body movements (horizontal position, i.e., left, center, or right, and posture, i.e., jumping, standing, or crouching), and finally used that information to control a simple game.
Another thing we learned is how to programmatically trigger mouse and keyboard events using the Pyautogui library.
Now one drawback of controlling the game with body movements is that the game becomes much harder compared to controlling it via keyboard presses.
But our aim, to make exercise fun and learn to control Human-Computer Interaction (HCI) based games using AI, is achieved. Now, if you want, you can extend this application further to control a much more complex application.
You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.
Ready to seriously dive into state-of-the-art AI & computer vision? Then sign up for these premium courses by Bleed AI.
Vehicle detection has been a challenging part of building intelligent traffic management systems. Such systems are critical for addressing the ever-increasing number of vehicles on road networks that cannot keep up with the pace of increasing traffic. Today many methods that deal with this problem use either traditional computer vision or complex deep learning models.
Popular computer vision techniques include vehicle detection using optical flow, but in this tutorial, we are going to perform vehicle detection using another traditional computer vision technique that utilizes background subtraction and contour detection to detect vehicles. This means you won’t have to spend hundreds of hours in data collection or annotation for building deep learning models, which can be tedious, to say the least. Not to mention, the computation power required to train the models.
This post is the fourth and final part of our Contour Detection 101 series. All 4 posts in the series are titled as:
Vehicle Detection with OpenCV using Contours + Background Subtraction (This Post)
So if you are new to the series and unfamiliar with contour detection, make sure you check them out!
In part 1 of the series, we learned the basics, how to detect and draw the contours, in part 2 we learned to do some contour manipulations and in the third part, we analyzed the detected contours for their properties to perform tasks like object detection. Combining these techniques with background subtraction will enable us to build a useful application that detects vehicles on a road. And not just that but you can use the same principles that you learn in this tutorial to create other motion detection applications.
So let’s dive into how vehicle detection with background subtraction works.
Background subtraction is a simple yet effective technique to extract objects from an image/video. Consider a highway on which cars are moving, and you want to extract each car. One easy way is to take a picture of the highway with the cars (called the foreground image) while also having a saved image of the highway without any cars (the background image). You subtract the background image from the foreground to get the segmented mask of the cars, and then use that mask to extract the cars.
But in many cases, you don’t have a clear background image; an example of this can be a highway that is always busy, or a walking destination that is always crowded. In those cases, you can subtract the background by other means; for example, in a video you can detect movement, so the objects that move can be treated as foreground and the parts that remain static can be treated as background.
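The clean-background case can be demonstrated with plain NumPy. This toy example (a hypothetical 4×4 grayscale “highway”) only illustrates the subtract-and-threshold idea, not OpenCV’s actual implementation:

```python
import numpy as np

def subtract_background(background, frame, threshold=25):
    # Absolute per-pixel difference; cast to a signed type to avoid uint8 wrap-around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))

    # Pixels that changed more than the threshold belong to the foreground.
    return (diff > threshold).astype(np.uint8) * 255

# A static dark background with one bright "car" pixel appearing in the frame.
background = np.zeros((4, 4), dtype=np.uint8)
frame = background.copy()
frame[1, 2] = 200

mask = subtract_background(background, frame)
print(mask[1, 2], mask[0, 0])   # 255 0 -> only the changed pixel is foreground
```

The threshold makes the mask robust to small pixel-value fluctuations (sensor noise, lighting flicker) that would otherwise register as foreground.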
Several algorithms have been invented for this purpose. OpenCV has implemented a few of them, which are very easy to use. Let’s see one of them: cv2.createBackgroundSubtractorMOG2(), which takes the following arguments.
history (optional) – It is the length of the history. Its default value is 500.
varThreshold (optional) – It is the threshold on the squared distance between the pixel and the model to decide whether a pixel is well described by the background model. It does not affect the background update and its default value is 16.
detectShadows (optional) – It is a boolean that determines whether the algorithm will detect and mark shadows or not. It marks shadows in gray color. Its default value is True. It decreases the speed a bit, so if you do not need this feature, set the parameter to false.
Returns:
object – It is the MOG2 Background Subtractor.
Python
# Load a video.
cap = cv2.VideoCapture('media/videos/vtest.avi')

# You can optionally work on the live webcam feed instead.
# cap = cv2.VideoCapture(0)

# Create the background subtractor object; you can choose to detect shadows or not (if True, they will be shown in gray).
backgroundObject = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
The second frame is the original video, on the left we have the background subtraction result with shadows, while on the right we have the foreground part produced using the background subtraction mask.
Creating the Vehicle Detection Application
Alright once we have our background subtraction method ready, we can build our final application!
Here’s the breakdown of the steps we need to perform the complete background Subtraction based contour detection.
1) Start by loading the video using the function cv2.VideoCapture() and create a background subtractor object using the function cv2.createBackgroundSubtractorMOG2().
2) Then read the video frame by frame in a loop and apply the background subtractor object on each frame using its apply() method to get the segmented foreground mask for that frame.
3) Next, we will apply thresholding on the mask using the function cv2.threshold() to get rid of shadows and then perform Erosion and Dilation to improve the mask further using the functions cv2.erode() and cv2.dilate().
4) Then we will use the function cv2.findContours() to detect the contours on the mask image and convert the contour coordinates into bounding box coordinates for each car in the frame using the function cv2.boundingRect(). We will also check the area of the contour using cv2.contourArea() to make sure it is greater than a threshold for a car contour.
5) After that we will use the functions cv2.rectangle() and cv2.putText() to draw and label the bounding boxes on each frame and extract the foreground part of the video with the help of the segmented mask using the function cv2.bitwise_and().
# Display the stacked image with an appropriate title.
cv2.imshow('Original Frame, Extracted Foreground and Detected Cars', cv2.resize(stacked, None, fx=0.5, fy=0.5))

# cv2.imshow('Initial Mask', initialMask)
# cv2.imshow('Noisy Mask', noisymask)
# cv2.imshow('Clean Mask', fgmask)

# Wait until a key is pressed.
# Retrieve the ASCII code of the key pressed.
k = cv2.waitKey(1) & 0xff

# Check if 'q' key is pressed.
if k == ord('q'):

    # Break the loop.
    break

# Release the VideoCapture object.
video.release()

# Close the windows.
cv2.destroyAllWindows()
Output:
This seems to have worked out well, that too without having to train large-scale Deep learning models!
There are many other background subtraction algorithms in OpenCV that you can use. Check out here and here for further details about them.
Summary
Vehicle Detection is a popular computer vision problem. This post explored how traditional machine vision tools can still be utilized to build applications that can effectively deal with modern vision challenges.
We used a popular background/foreground segmentation technique called background subtraction to isolate our regions of interest from the image.
We also saw how contour detection can prove useful when dealing with vision problems, and how pre-processing and post-processing can be used to filter out the noise in the detected contours.
Although these techniques can be robust, they are not as generalizable as deep learning models, so it’s important to pay close attention to deployment conditions and possible variations when building vision applications with such techniques.
This post concludes the four-part series on contour detection. If you enjoyed this post and followed the rest of the series do let me know in the comments and you can also support me and the Bleed AI team on patreon here.
If you need 1 on 1 coaching in AI/computer vision regarding your project or your career, then you can reach out to me personally here.
Hire Us
Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies
In this tutorial, we’ll learn how to do real-time 3D pose detection using the mediapipe library in python. After that, we’ll calculate angles between body joints and combine them with some heuristics to create a pose classification system.
All of this will work on real-time camera feed using your CPU as well as on images. See results below.
The code is really simple, for detailed code explanation do also check out the YouTube tutorial, although this blog post will suffice enough to get the code up and running in no time.
Pose Detection, or Pose Estimation, is a very popular problem in computer vision; in fact, it belongs to a broader class of computer vision problems called keypoint estimation. Today we’ll learn to do pose detection, where we’ll try to localize 33 key body landmarks on a person, e.g. elbows, knees, ankles, etc. See the image below:
Some interesting applications of pose detection are:
Full body Gesture Control to control anything from video games (e.g. kinect) to physical appliances, robots etc. Check this.
Creating Augmented reality applications that overlay virtual clothes or other accessories over someone’s body. Check this.
Now, these are just some interesting things you can make using pose detection, as you can see it’s a really interesting problem.
And that’s not it; there are other types of keypoint detection problems too, e.g. facial landmark detection, hand landmark detection, etc.
We will actually learn to do both of the above in the upcoming tutorials.
Key point detection in turn belongs to a major computer vision branch called Image recognition, other broad classes of vision that belong in this branch are Classification, Detection, and Segmentation.
Here’s a very generic definition of each class.
In Classification, we try to classify whole images or videos as belonging to a certain class.
In Detection, we try to classify and localize objects or classes of interest.
In Segmentation, we try to extract/segment or find the exact boundary/outline of our target object/class.
In Keypoint Detection, we try to localize predefined points/landmarks.
If you’re new to computer vision and just exploring the waters, check this page from paperswithcode; it lists a lot of subcategories under the major categories above. Now don’t be confused by the categorization that paperswithcode has done; personally speaking, I don’t agree with the way they have sorted subcategories with applications, and there are some other issues. The takeaway is that there are a lot of variations in computer vision problems, but the 4 categories I’ve listed above are some major ones.
Part 1 (b): Mediapipe’s Pose Detection Implementation:
Here’s a brief introduction to Mediapipe;
“Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media & It was built by Google”
Not only is this tool backed by Google, but models in Mediapipe are actively used in Google products. So you can expect nothing less than state-of-the-art performance from this library.
Now MediaPipe’s pose detection is a state-of-the-art solution for high-fidelity (i.e. high-quality) and low-latency (i.e. damn fast) detection of 33 3D landmarks on a person in real-time video feeds on low-end devices, i.e. phones, laptops, etc.
Alright, so what makes this pose detection model from Mediapipe so fast?
They are actually using a very successful deep learning recipe: a 2-step detector that combines a computationally expensive object detector with a lightweight object tracker.
Here’s how this works:
You run the detector on the first frame of the video to localize the person and provide a bounding box around them. After that, the tracker takes over and predicts the landmark points inside that bounding box ROI. The tracker continues to run on subsequent frames using the previous frame’s ROI, and only calls the detection model again when it fails to track the person with high confidence.
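The detect-then-track recipe above can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not Mediapipe’s actual implementation; `detect` and `track` are hypothetical stand-ins for the two models:

```python
def run_pipeline(frames, detect, track, min_confidence=0.5):
    """Run the expensive detector only when the cheap tracker loses the person."""
    roi = None
    results = []
    for frame in frames:
        if roi is None:
            roi = detect(frame)                    # expensive: full-frame person detection
        landmarks, confidence = track(frame, roi)  # cheap: landmarks inside the ROI
        results.append(landmarks)
        if confidence < min_confidence:
            roi = None                             # lost the person; re-detect next frame
        else:
            roi = landmarks                        # reuse this result as the next frame's ROI
    return results

# Toy demo with stub models: tracking fails on the frame labelled 'bad'.
detector_calls = []
def detect(frame):
    detector_calls.append(frame)
    return 'roi'
def track(frame, roi):
    return 'landmarks', (0.1 if frame == 'bad' else 0.9)

run_pipeline(['f1', 'f2', 'bad', 'f4'], detect, track)
print(detector_calls)  # ['f1', 'f4'] -- the detector only ran twice over four frames
```

This is exactly why the pipeline is fast: the heavy detector runs rarely, while the light tracker does the per-frame work.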
Their model works best if the person is standing 2-4 meters away from the camera and one major limitation of their model is that this approach only works for single-person pose detection, it’s not applicable for multi-person detection.
Mediapipe actually trained 3 models, with different tradeoffs between speed and performance. You’ll be able to use all 3 of them with mediapipe.
The detector used in pose detection is inspired by Mediapipe’s lightweight BlazeFace model; you can read this paper. For the landmark model used in pose detection, you can read this paper for more details, or read Google’s blog on it.
Here are the 33 landmarks that this model detects:
Alright now that we have covered some basic theory and implementation details, let’s get into the code.
Download Code
Part 2: Using Pose Detection in images and on videos
Import the Libraries
Let’s start by importing the required libraries.
Python

import math
import cv2
import numpy as np
from time import time
import mediapipe as mp
import matplotlib.pyplot as plt
Initialize the Pose Detection Model
The first thing that we need to do is initialize the pose class using the mp.solutions.pose syntax and then we will call the setup function mp.solutions.pose.Pose() with the arguments:
static_image_mode – It is a boolean value; if set to False, the detector is only invoked when needed, that is, on the very first frame or when the tracker loses track. If set to True, the person detector is invoked on every input image. So you should set this value to True when working with a bunch of unrelated images rather than videos. Its default value is False.
min_detection_confidence – It is the minimum detection confidence with range (0.0 , 1.0) required to consider the person-detection model’s prediction correct. Its default value is 0.5. This means if the detector has a prediction confidence of greater or equal to 50% then it will be considered as a positive detection.
min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) required to consider the landmark-tracking model’s tracked pose landmarks valid. If the confidence is less than the set value then the detector is invoked again in the next frame/image, so increasing its value increases the robustness, but also increases the latency. Its default value is 0.5.
model_complexity – It is the complexity of the pose landmark model. As there are three different models to choose from so the possible values are 0, 1, or 2. The higher the value, the more accurate the results are, but at the expense of higher latency. Its default value is 1.
smooth_landmarks – It is a boolean value; if set to True, pose landmarks across different frames are filtered to reduce noise. It only works when static_image_mode is also set to False. Its default value is True.
Then we will also initialize mp.solutions.drawing_utils class that will allow us to visualize the landmarks after detection, instead of using this, you can also use OpenCV to visualize the landmarks.
Now we will pass the image to the pose detection machine learning pipeline by using the function mp.solutions.pose.Pose().process(). But the pipeline expects the input images in RGB color format so first we will have to convert the sample image from BGR to RGB format using the function cv2.cvtColor() as OpenCV reads images in BGR format (instead of RGB).
After performing the pose detection, we will get a list of thirty-three landmarks representing the body joint locations of the prominent person in the image. Each landmark has:
x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the midpoint of the hips being the origin, so the smaller the value of z, the closer the landmark is to the camera.
visibility – It is a value with range [0.0, 1.0] representing the possibility of the landmark being visible (not occluded) in the image. This is useful when deciding whether to show a particular joint, as it might be occluded or only partially visible in the image.
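Since x and y are normalized, converting a landmark to pixel coordinates is just a multiplication by the image dimensions. A minimal sketch; the bare `(x, y)` arguments here are a stand-in for the fields of a Mediapipe landmark object:

```python
def to_pixel_coords(x, y, frame_width, frame_height):
    """Convert normalized landmark coordinates into integer pixel coordinates."""
    return int(x * frame_width), int(y * frame_height)

# A landmark at the horizontal center, a quarter of the way down a 640x480 frame:
print(to_pixel_coords(0.5, 0.25, 640, 480))  # (320, 120)
```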
After performing the pose detection on the sample image above, we will display the first two landmarks from the list, so that you get a better idea of the output of the model.
Python

# Perform pose detection after converting the image into RGB format.
Now we will draw the detected landmarks on the sample image using the function mp.solutions.drawing_utils.draw_landmarks() and display the resultant image using the matplotlib library.
Python

# Create a copy of the sample image to draw landmarks on.
Now we will go a step further and visualize the landmarks in three dimensions (3D) using the function mp.solutions.drawing_utils.plot_landmarks(). We will need the POSE_WORLD_LANDMARKS, which is another list of pose landmarks in world coordinates, with 3D coordinates in meters and the origin at the center between the hips of the person.
Note: This is actually a neat hack by Mediapipe. The coordinates returned are not truly 3D, but setting the hip landmark as the origin allows us to measure the relative distance of the other points from the hip. Since this distance increases or decreases depending on whether you’re closer to or farther from the camera, it gives us a sense of the depth of each landmark point.
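To make the hip-origin depth idea concrete, here is a toy example; the z values below are made up purely for illustration. With the hips at z = 0, a smaller z means the landmark is closer to the camera, so sorting by z orders landmarks from nearest to farthest:

```python
# Hypothetical z values relative to the hip midpoint (which sits at z = 0).
landmark_depths = {'nose': -0.35, 'left_wrist': -0.10, 'hips': 0.0, 'right_ankle': 0.20}

# Smaller z = closer to the camera.
nearest_to_farthest = sorted(landmark_depths, key=landmark_depths.get)
print(nearest_to_farthest)  # ['nose', 'left_wrist', 'hips', 'right_ankle']
```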
Create a Pose Detection Function
Now we will put all this together to create a function that will perform pose detection on an image and visualize the results or return the results depending upon the passed arguments.
Python

def detectPose(image, pose, display=True):
    '''
    This function performs pose detection on an image.
    Args:
        image: The input image with a prominent person whose pose landmarks need to be detected.
        pose: The pose setup function required to perform the pose detection.
        display: A boolean value; if set to True, the function displays the original input image, the resultant image,
                 and the pose landmarks in a 3D plot, and returns nothing.
    Returns:
        output_image: The input image with the detected pose landmarks drawn.
        landmarks: A list of detected landmarks converted into their original scale.
    '''

    # Create a copy of the input image.
    output_image = image.copy()

    # Convert the image from BGR into RGB format.
    imageRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Perform the Pose Detection.
    results = pose.process(imageRGB)

    # Retrieve the height and width of the input image.
    height, width, _ = image.shape

    # Initialize a list to store the detected landmarks.
    landmarks = []

    # Check if any landmarks are detected.
    if results.pose_landmarks:

        # Draw the pose landmarks on the output image.
        mp.solutions.drawing_utils.draw_landmarks(image=output_image, landmark_list=results.pose_landmarks,
                                                  connections=mp.solutions.pose.POSE_CONNECTIONS)

        # Iterate over the detected landmarks.
        for landmark in results.pose_landmarks.landmark:

            # Append the landmark into the list after converting it into its original scale.
            landmarks.append((int(landmark.x * width), int(landmark.y * height),
                              (landmark.z * width)))

    # Check if the original input image and the resultant image are specified to be displayed.
    if display:

        # Display the original input image and the resultant image.
        plt.figure(figsize=[22, 22])
        plt.subplot(121); plt.imshow(image[:, :, ::-1]); plt.title("Original Image"); plt.axis('off')
        plt.subplot(122); plt.imshow(output_image[:, :, ::-1]); plt.title("Output Image"); plt.axis('off')

        # Also plot the pose landmarks in 3D.
        mp.solutions.drawing_utils.plot_landmarks(results.pose_world_landmarks, mp.solutions.pose.POSE_CONNECTIONS)

    # Otherwise
    else:

        # Return the output image and the found landmarks.
        return output_image, landmarks
Now we will utilize the function created above to perform pose detection on a few sample images and display the results.
Python

# Read another sample image and perform pose detection on it.
image = cv2.imread('media/sample1.jpg')
detectPose(image, pose, display=True)

Python

# Read another sample image and perform pose detection on it.
image = cv2.imread('media/sample2.jpg')
detectPose(image, pose, display=True)

Python

# Read another sample image and perform pose detection on it.
image = cv2.imread('media/sample3.jpg')
detectPose(image, pose, display=True)
Pose Detection On Real-Time Webcam Feed/Video
The results on the images were pretty good. Now we will try the function on a real-time webcam feed and a video. Depending on whether you want to run pose detection on a video stored on disk or on the webcam feed, you can comment and uncomment the initialization code of the VideoCapture object accordingly.
# Initialize the VideoCapture object to read from the webcam (or pass a video path instead of 0).
video = cv2.VideoCapture(0)

# Initialize a variable to store the time of the previous frame.
time1 = 0

# Iterate until the video is accessed successfully.
while video.isOpened():

    # Read a frame; stop if it could not be read.
    ok, frame = video.read()
    if not ok:
        break

    # Perform pose detection on the frame.
    frame, _ = detectPose(frame, pose, display=False)

    # Get this frame's time and write the FPS on the frame.
    time2 = time()
    if (time2 - time1) > 0:
        cv2.putText(frame, 'FPS: {}'.format(int(1.0 / (time2 - time1))), (10, 30),
                    cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)

    # Update the previous frame time to this frame time.
    # As this frame will become previous frame in next iteration.
    time1 = time2

    # Display the frame.
    cv2.imshow('Pose Detection', frame)

    # Wait until a key is pressed.
    # Retrieve the ASCII code of the key pressed.
    k = cv2.waitKey(1) & 0xFF

    # Check if 'ESC' is pressed.
    if (k == 27):

        # Break the loop.
        break

# Release the VideoCapture object.
video.release()

# Close the windows.
cv2.destroyAllWindows()
Output:
Cool! so it works great on the videos too. The model is pretty fast and accurate.
Part 3: Pose Classification with Angle Heuristics
We have learned to perform pose detection, now we will level up our game by also classifying different yoga poses using the calculated angles of various joints. We will first detect the pose landmarks and then use them to compute angles between joints and depending upon those angles we will recognize the yoga pose of the prominent person in an image.
But this approach does have a drawback that limits its use to a controlled environment: the calculated angles vary with the angle between the person and the camera. So the person needs to face the camera straight on to get the best results.
Create a Function to Calculate Angle between Landmarks
Now we will create a function that will be capable of calculating angles between three landmarks. The angle between landmarks? Do not get confused, as this is the same as calculating the angle between two lines.
The first point (landmark) is considered as the starting point of the first line, the second point (landmark) is considered as the ending point of the first line and the starting point of the second line as well, and the third point (landmark) is considered as the ending point of the second line.
Python

def calculateAngle(landmark1, landmark2, landmark3):
    '''
    This function calculates angle between three different landmarks.
    Args:
        landmark1: The first landmark containing the x, y and z coordinates.
        landmark2: The second landmark containing the x, y and z coordinates.
        landmark3: The third landmark containing the x, y and z coordinates.
    Returns:
        angle: The calculated angle between the three landmarks.
    '''

    # Get the required landmarks coordinates.
    x1, y1, _ = landmark1
    x2, y2, _ = landmark2
    x3, y3, _ = landmark3

    # Calculate the angle between the three points.
    angle = math.degrees(math.atan2(y3 - y2, x3 - x2) - math.atan2(y1 - y2, x1 - x2))

    # Check if the angle is less than zero.
    if angle < 0:

        # Add 360 to the found angle.
        angle += 360

    # Return the calculated angle.
    return angle
Now we will create a function that will be capable of classifying different yoga poses using the calculated angles of various joints. The function will be capable of identifying the following yoga poses:
# Return the output image and the classified label.
return output_image, label
Now we will utilize the function created above to perform pose classification on a few images of people and display the results.
Warrior II Pose
The Warrior II Pose (also known as Virabhadrasana II) is the same pose that the person is making in the image above. It can be classified using the following combination of body part angles:
Around 180° at both elbows
Around 90° angle at both shoulders
Around 180° angle at one knee
Around 90° angle at the other knee
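As a toy illustration of how such angle heuristics translate into code, here is a hedged sketch. The joint names, the dict input, and the 30° tolerance are all assumptions for illustration, not the exact classification code from this tutorial:

```python
def looks_like_warrior_ii(angles, tolerance=30):
    """Check the Warrior II angle heuristics on a dict of joint angles (in degrees)."""
    def near(angle, target):
        return abs(angle - target) <= tolerance

    straight_arms = near(angles['left_elbow'], 180) and near(angles['right_elbow'], 180)
    raised_arms = near(angles['left_shoulder'], 90) and near(angles['right_shoulder'], 90)
    # One knee straight (~180°) and the other bent (~90°), on either side.
    legs = ((near(angles['left_knee'], 180) and near(angles['right_knee'], 90)) or
            (near(angles['right_knee'], 180) and near(angles['left_knee'], 90)))
    return straight_arms and raised_arms and legs

sample = {'left_elbow': 175, 'right_elbow': 185, 'left_shoulder': 95,
          'right_shoulder': 88, 'left_knee': 178, 'right_knee': 93}
print(looks_like_warrior_ii(sample))  # True
```

Every pose in this tutorial follows the same pattern: compute the joint angles with calculateAngle(), then check them against a combination of target angles like the one above.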
Python

# Read a sample image and perform pose classification on it.
Tree Pose (also known as Vrikshasana) is another yoga pose for which the person has to keep one leg straight and bend the other leg at a required angle. The pose can be classified easily using the following combination of body part angles:
Around 180° angle at one knee
Around 35° (if right knee) or 335° (if left knee) angle at the other knee
Now to understand it better, you should go back to the pose classification function above to overview the classification code of this yoga pose.
We will perform pose classification on a few images of people in the tree yoga pose and display the results using the same function we had created above.
Python

# Read a sample image and perform pose classification on it.
T Pose (also known as a bind pose or reference pose) is the last pose we are dealing with in this lesson. To make this pose, one has to stand up like a tree with both hands wide open as branches. The following body part angles are required to make this one:
Around 180° at both elbows
Around 90° angle at both shoulders
Around 180° angle at both knees
You can now go back to go through the classification code of this T pose in the pose classification function created above.
Now, let’s test the pose classification function on a few images of the T pose.
Python

# Read another sample image and perform pose classification on it.
Now if you want you can extend the pose classification function to make it capable of identifying more yoga poses like the one in the image above. The following combination of body part angles can help classify this one:
Around 180° angle at both knees
Around 105° (if the person is facing right side) or 240° (if the person is facing left side) angle at both hips
Pose Classification On Real-Time Webcam Feed
Now we will test the function created above to perform the pose classification on a real-time webcam feed.
Python

# Initialize the VideoCapture object to read from the webcam.

# Release the VideoCapture object and close the windows.
camera_video.release()
cv2.destroyAllWindows()
Output:
Summary:
Today, we learned about a very popular vision problem called pose detection. We briefly discussed popular computer vision problems, then we saw how Mediapipe has implemented its pose detection solution and how it uses a 2-step detection + tracking pipeline to speed up the process.
After that, we saw step by step how to do real-time 3D pose detection with Mediapipe on images and on a webcam feed.
Then we learned to calculate angles between different landmarks and then used some heuristics to build a classification system that could determine 3 poses, T-Pose, Tree Pose, and a Warrior II Pose.
Alright, here are some limitations of our pose classification system: it has too many conditions and checks. For our case it’s not that complicated, but if you throw in a few more poses this system can easily get confusing and complicated. A much better method is to train an MLP (a simple multi-layer perceptron) using Keras on landmark points from a few target pose pictures and then classify with it. I might create a separate tutorial for that in the future.
Another issue that I briefly went over is that the pose detection model in Mediapipe can only detect a single person at a time. This is fine for most pose-based applications, but it can prove to be a problem where you’re required to detect more than one person. If you do want to detect more people, you could try other popular models like PoseNet or OpenPose.
In this tutorial, you’ll learn how to do Real-Time Selfie Segmentation using Mediapipe in Python and then build the following 4 applications.
Background Removal/Replacement
Background Blur
Background Desaturation
Convert Image to Transparent PNG
And not only will these applications work on images but I’ll show you how to apply these to your real-time webcam feed running on a CPU.
Also, the model that we’ll use is almost the same one that Google Hangouts is currently using to segment people, So Yes! We’re going to be learning a State of the Art approach for segmentation.
And on top of that, the code for building all 4 applications will be ridiculously simple.
Interested yet? Then keep reading this full post.
In the first part of this post, we’ll understand the problem of image segmentation and its types, then we’ll understand what selfie segmentation is. After that, we’ll take a look at Mediapipe and how to do selfie segmentation with it. And finally, how to build all those 4 applications.
What is Image Segmentation?
If you’re somewhat familiar with computer vision basics then you might be familiar with image segmentation, a very popular problem in Computer Vision.
Just like in an object detection task where you localize objects in the image and draw boxes around it, in a segmentation task, you’re almost doing the same thing, but here instead of drawing a bounding box around each object, you’re trying to segment or draw out the exact boundary of each target Object.
In other words, in segmentation, you’re trying to divide the image into groups of pixels based on some specific criteria.
So an image segmentation algorithm will take an input image and output groups of pixels, each group will belong to some class. Normally this output is actually an image mask where each pixel consists of a single number indicating the class it belongs to.
Now the task of image segmentation can be divided into several categories, let’s understand each of them.
Semantic Segmentation
Instance Segmentation
Panoptic Segmentation
Saliency Detection
What is Semantic Segmentation?
In this type of segmentation, our task is to assign a class label (pedestrian, car, road, tree etc.) to every pixel in the image.
As you can see, all the objects in the image, including the buildings, sky, and sidewalk, are labeled with a certain color indicating that they belong to a certain class, e.g. all cars are labeled blue, people are labeled red, and so on.
It’s worth noting that although we can extract any individual class (e.g. we can extract all cars by looking for blue pixels), we cannot distinguish between different instances of the same class; e.g. you can’t reliably say which blue pixel belongs to which car.
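To make this concrete, a semantic mask is just a 2D array of class IDs, so extracting one class is a single boolean comparison. A tiny toy example; the label numbers below are made up for illustration:

```python
import numpy as np

# Toy semantic mask: 0 = background, 1 = car, 2 = person.
semantic_mask = np.array([[0, 1, 1],
                          [2, 2, 0]])

# Extract every "car" pixel. Note that we cannot tell whether the two
# car pixels belong to one car or to two different cars.
car_pixels = (semantic_mask == 1)
print(int(car_pixels.sum()))  # 2
```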
What is Instance Segmentation?
Another common category of segmentation is called Instance Segmentation. Here the goal is not to label all pixels in the image but only the pixels of some selective classes the model was trained on (e.g. cars, pedestrians, etc.).
As you can see in the image, the algorithm ignored the roads, sky, buildings etc. so here we’re only interested in labeling specific classes.
One other major difference in this approach is that we’re also differentiating between different instances of the same class, i.e. you can tell which pixel belongs to which instance.
What is Panoptic Segmentation?
If you’re a curious cat like me, you might wonder, well isn’t there an approach that,
A) Labels all pixels in the image like semantic segmentation.
B) And also differentiates between instances of the same class like instance segmentation.
Well, yes there is! And it’s called Panoptic Segmentation, where not only is every pixel assigned a class but we can also differentiate between different instances of the same class, i.e. we can tell which pixel belongs to which car.
This type of segmentation is the combination of both instance and semantic segmentation.
What is Saliency Detection?
Don’t be confused by the word “Detection” here; although saliency detection is not generally considered one of the core segmentation methods, it’s still essentially a major segmentation technique.
So here the goal is to segment out the most salient/prominent (things that stand out) features in the image.
And this is done regardless of the class of the object. Here’s another example.
As you can see the most obvious object in the above image is the cat, which is exactly what’s being segmented out here.
So in saliency detection, we’re trying to segment out the most standout features in the image.
Selfie Segmentation:
Alright now that we have understood the fundamental segmentation techniques out there, let’s try to understand what selfie segmentation is.
Well, obviously it’s a no brainer, it’s a segmentation technique that segments out people in images.
You might think, how is this different from semantic or instance Segmentation?
Well, to put it simply, you can consider selfie segmentation as a sort of a mix between semantic segmentation and Saliency detection.
What do I mean by that?
Take a look at the example output of Selfie segmentation on two images below.
In the first image (top) the segmentation is done perfectly, as every person is on a similar scale and prominent in the image, whereas in the second image (bottom) the woman is prominent and is segmented out correctly while her colleagues in the background are not segmented properly.
This is why the technique is called selfie segmentation, it tries to segment out prominent people in the image, ideally everyone to be segmented should be on a similar scale in the image.
This is why I said that this technique is sort of a mix between saliency detection and semantic segmentation.
Now, you might think: why do we even need another segmentation technique? Why not just segment people using semantic or instance segmentation methods?
Well, Actually we could do that. Models like Mask-RCNN, DeepLabv3, and others are really good at segmenting people.
But here’s the problem.
Although these models provide state-of-the-art results, they are really slow; they aren’t a good fit when it comes to real-time applications, especially on CPUs.
This is why the selfie segmentation model that we’ll use today is specifically designed to segment people and to run at real-time speed on a CPU and other low-end hardware. It’s built on a slight modification of the MobileNetV3 model and contains clever algorithmic innovations for maximum speed and performance gains. To understand more about these algorithmic advances, you can read Google AI’s blog post on this model.
So what are the use cases for Selfie Segmentation?
The most popular use case for this problem is Video Conferencing. In fact, Google Hangouts is using approximately the same model that we’re going to learn to use today.
Besides Video Conferencing, there are several other use cases for this model that we’re going to explore today.
MediaPipe:
Mediapipe is a cross-platform tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media.
This is the tool that we’ll be using today in order to use the selfie segmentation model. In future tutorials I’ll also be covering the usage of a few other models and make interesting applications out of them. So Stay tuned for those blog posts at Bleed AI.
Alright Now let’s start with the Code!
Selfie Segmentation Code:
To get started with Mediapipe, you first need to run the following command to install it:

pip install mediapipe
Now let’s start by importing the required libraries.
Python

import os
import cv2
import numpy as np
import mediapipe as mp
import matplotlib.pyplot as plt
from time import time
Initialize the Selfie Segmentation Model
The first thing that you need to do is initialize the selfie segmentation class using mp.solutions.selfie_segmentation and then call the setup function .SelfieSegmentation(). There are two models for segmentation in Mediapipe: by passing in 0 you will be using the general model, i.e. the input is resized to 256x256x3 (height, width, channels), and by passing 1 you will be using the landscape model, i.e. the input is resized to 144x256x3 (height, width, channels).
You should select the type of model by taking into account the aspect ratio of the original image, although the landscape model is a bit faster. These models automatically resize the input image before passing it through the network, and the output image representing the segmentation mask will have the same size as the resized input, that is, 256x256x1 or 144x256x1.
We will start by learning to use selfie segmentation to change the background of images. But first, we will have to convert the image into RGB format, as the MediaPipe library expects images in this format while the function cv2.imread() reads images in BGR format; we will use the function cv2.cvtColor() to do this conversion.
Then we will pass the image to the MediaPipe Segmentation function which will perform the segmentation process and will return a probability map with pixel values near 1 for the indexes where the person is located in the image and pixel values near 0 for the background.
Python

# Convert the sample image from BGR to RGB format.
Notice that we have some gray areas in the map; this signifies areas where the model was not sure whether it was looking at the background or the person. So now what we need to do is some thresholding: set all pixels above a certain confidence to white and all other pixels to black.
So in this step, we’re going to threshold the mask above to get a binary black-and-white mask with a pixel value of 1 for the indexes where the person is located and 0 for the background.
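The thresholding step is a single comparison on the probability map. Here is a tiny numpy sketch on a made-up 2x2 probability map; the values are invented for illustration, but the 0.9 threshold mirrors the code below:

```python
import numpy as np

# Made-up probability map: values near 1 = person, near 0 = background.
segmentation_mask = np.array([[0.98, 0.95],
                              [0.40, 0.02]])

# Pixels above 90% confidence become True (person), the rest False (background).
binary_mask = segmentation_mask > 0.9
print(binary_mask)
# [[ True  True]
#  [False False]]
```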
Python

# Get a binary mask having pixel value 1 for the person and 0 for the background.
# Pixel values greater than the threshold value 0.9 (90% confidence) will become 1 and the remaining will become 0.
binary_mask = result.segmentation_mask > 0.9

# Display the original sample image and the binary mask with appropriate titles.
Now we will use the numpy.where() function to create a new image that takes the pixel values from the original sample image at the indexes where the mask has value 1 (white areas) and replaces the areas where the mask has value 0 (black areas) with 255, giving the sample image a white background. Right now we’re just adding a white (255) background, but later on we’ll add a separate image as the background.
But to create the required output image, we will first have to convert the one-channel mask into a three-channel image using the function numpy.dstack(), as numpy.where() needs all its images to have an equal number of channels.
Python

# Stack the same mask three times to make it a three channel image.
binary_mask_3 = np.dstack((binary_mask, binary_mask, binary_mask))

# Create an output image with the pixel values from the original sample image where the mask has value 1,
# and the value 255 (white) where the mask has value 0.
output_image = np.where(binary_mask_3, sample_img, 255)
Now, instead of having a white background, if you want to add another background image, you just need to replace 255 with a background image in the np.where function.
Python

# Read a background image from the specified path.
bg_img = cv2.imread('media/background.jpg')

# Resize the background image to the size of the sample image.
bg_img = cv2.resize(bg_img, (sample_img.shape[1], sample_img.shape[0]))

# Create an output image with the pixel values from the original sample image at the indexes where the mask have
# value 1 and replace the other pixel values (where mask have zero) with the new background image.
output_image = np.where(binary_mask_3, sample_img, bg_img)
Now we will create a function that will use selfie segmentation to modify the background of an image depending upon the passed arguments. The following are the modifications that the function will be capable of:
Change Background: The function will replace the background of the image with a different provided background image OR it will make the background white for the cases when a separate background image is not provided.
Blur Background: The function will segment out the prominent person and then blur out the background.
Desaturate Background: The function will desaturate (convert to grayscale) the background of the image, giving the image a very interesting effect.
Transparent Background: The function will make the background of the image transparent.
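The four modes above are all just different choices for what to put behind the segmented person, so the core of such a function boils down to one np.where with a different “background” per mode. Here is a numpy-only sketch of that idea; the function name, signature, and the crude mean-color “blur” are assumptions for illustration (the tutorial’s actual function runs the Mediapipe model itself and uses cv2.GaussianBlur for blurring):

```python
import numpy as np

def modify_background_sketch(image, binary_mask, method='changeBackground', background=None):
    """Composite the masked person over a background chosen by `method`."""
    mask3 = np.dstack((binary_mask, binary_mask, binary_mask))
    if method == 'changeBackground':
        # Use the provided background image, or plain white when none is given.
        bg = background if background is not None else np.full_like(image, 255)
    elif method == 'desatureBackground':
        # Grayscale version of the image, repeated across the three channels.
        gray = image.mean(axis=2).astype(image.dtype)
        bg = np.dstack((gray, gray, gray))
    elif method == 'blurBackground':
        # Crude stand-in for a real blur: the image's mean color everywhere.
        bg = np.zeros_like(image) + image.mean(axis=(0, 1)).astype(image.dtype)
    else:
        raise ValueError('unknown method: ' + method)
    # Keep the person's pixels, replace everything else with the chosen background.
    return np.where(mask3, image, bg)
```

A transparent background would instead attach the (scaled) mask as a fourth alpha channel rather than compositing over a background.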
Now we will utilize the function created above with the argument method='changeBackground' to change the backgrounds of a few sample images and check the results.
The results on the images look great, but how will the function we created above fare on our real-time webcam feed? Well, let’s check it out. In the code below, we will swap between different background images by pressing the ‘b’ key on the keyboard.
Python

# Initialize the VideoCapture object to read from the webcam.
camera_video = cv2.VideoCapture(0)

# Set width of the frames in the video stream.
camera_video.set(3, 1280)

# Set height of the frames in the video stream.
camera_video.set(4, 720)

# Initialize a list to store the background images.
background_images = []

# Specify the path of the folder which contains the background images.
background_folder = 'media/backgroundimages/'

# Iterate over the images in the background folder.
for img_path in os.listdir(background_folder):

    # Read the image and append it into the list.
    background_images.append(cv2.imread(background_folder + img_path))

# Update the previous frame time to this frame time.
# As this frame will become previous frame in next iteration.
time1 = time2

# Display the frame with changed background.
cv2.imshow('Video', output_frame)

# Wait until a key is pressed.
# Retrieve the ASCII code of the key pressed.
k = cv2.waitKey(1) & 0xFF

# Check if 'ESC' is pressed.
if (k == 27):

    # Break the loop.
    break

# Release the VideoCapture Object.
camera_video.release()

# Close the windows.
cv2.destroyAllWindows()
Output:
That was pretty interesting. Now that you’ve learned how to segment the background successfully, it’s time to make use of this skill and create some other exciting applications out of it.
Application 2: Apply Background Blur
Now this application will actually save you a lot of money.
How?
Well, remember those expensive DSLR or mirrorless cameras that blur out the background? Today you’ll learn to achieve the same effect, in fact even better, by just using your webcam.
So now we will use the function created above to segment out the prominent person and then blur out the background.
All we need to do is blur the original image using cv2.GaussianBlur() and then, instead of replacing the background with a new image (like we did in the previous application), replace it with this blurred version of the image. This way the segmented person retains its original form while the rest of the image is blurred out.
Now let’s call the function with the argument method='blurBackground' over some samples. You can control the amount of blur by controling the blur variable.
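Under the hood, the compositing step described above boils down to picking, per pixel, either the original frame or its blurred copy based on the segmentation mask. Here is a minimal NumPy sketch with synthetic stand-in arrays (in the real application the mask comes from MediaPipe and the blurred frame from cv2.GaussianBlur):

```python
import numpy as np

# Synthetic stand-ins: a 4x4 "frame" and a binary segmentation mask
# where 1.0 marks the person.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
blurred = np.full((4, 4, 3), 50, dtype=np.uint8)  # pretend-blurred frame
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0  # "person" region

# Composite: keep the original pixels where the mask is confident,
# and take the blurred frame everywhere else.
condition = np.stack((mask,) * 3, axis=-1) > 0.5
output = np.where(condition, frame, blurred)
```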
Python
# Read another sample image and blur the background.
image2 = cv2.imread('media/sample2.jpg')
modifyBackground(image2, method='blurBackground')
Python
# Read another sample image and blur the background.
image3 = cv2.imread('media/sample.jpg')
modifyBackground(image3, method='blurBackground')
Python
# Read another sample image and blur the background
# Update the previous frame time to this frame time.
# As this frame will become the previous frame in the next iteration.
time1 = time2
# Display the frame with the blurred background.
cv2.imshow('Video', output_frame)
# Wait until a key is pressed.
# Retrieve the ASCII code of the key pressed.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed.
if k == 27:
    # Break the loop.
    break
# Release the VideoCapture Object.
camera_video.release()
# Close the windows.
cv2.destroyAllWindows()
Output:
Application 3: Desaturate Background
Now we will use the function created above to desaturate (convert to grayscale) the background of the image. Again the only new thing that we’re doing here is just replacing the black parts of the segmented mask with the grayscale version of the original image.
We will have to pass the argument method='desatureBackground' this time to desaturate the backgrounds of a few sample images.
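The core of this effect is again a per-pixel choice, this time between the original frame and its grayscale version. A minimal NumPy sketch with synthetic data (MediaPipe would supply the real mask, and cv2.cvtColor the real grayscale conversion):

```python
import numpy as np

# Synthetic BGR frame (reddish) and a binary "person" mask.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[..., 2] = 180  # red channel in BGR order
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0

# Grayscale version of the frame (standard BGR luminance weights),
# replicated to 3 channels so the shapes match np.where's operands.
gray = (0.114 * frame[..., 0] + 0.587 * frame[..., 1]
        + 0.299 * frame[..., 2]).astype(np.uint8)
gray_3ch = np.stack((gray,) * 3, axis=-1)

# Keep original colors on the person, grayscale everywhere else.
condition = np.stack((mask,) * 3, axis=-1) > 0.5
output = np.where(condition, frame, gray_3ch)
```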
Python
# Read a sample image and apply the desaturation effect.
# Update the previous frame time to this frame time.
# As this frame will become the previous frame in the next iteration.
time1 = time2
# Display the frame with the desaturated background.
cv2.imshow('Video', output_frame)
# Wait until a key is pressed.
# Retrieve the ASCII code of the key pressed.
k = cv2.waitKey(1) & 0xFF
# Check if 'ESC' is pressed.
if k == 27:
    # Break the loop.
    break
# Release the VideoCapture Object.
camera_video.release()
# Close the windows.
cv2.destroyAllWindows()
Output:
Application 4: Convert an Image to have a Transparent Background
Now we will use the function created above to segment out the prominent person, make the background of the image transparent, and then save the resultant image to disk using the function cv2.imwrite().
To create an image with a transparent background (a four-channel image), we need to add another channel, called the alpha channel, to the original image. This channel acts as a mask that decides which parts of the image are transparent; it can have values from 0 to 255, which determine the level of visibility, with 0 (black) acting as the transparent area and 255 (white) as the fully visible area.
So we just need to add the segmentation mask to the original image.
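A minimal NumPy sketch of that step, with synthetic stand-ins for the image and the segmentation mask:

```python
import numpy as np

# Synthetic 3-channel image and a single-channel mask standing in for
# the segmentation output (255 = visible person, 0 = transparent).
image = np.full((4, 4, 3), 120, dtype=np.uint8)
alpha = np.zeros((4, 4), dtype=np.uint8)
alpha[1:3, 1:3] = 255  # "person" region stays fully visible

# Stack the mask as a fourth (alpha) channel to get a BGRA image.
bgra = np.dstack((image, alpha))
```

cv2.imwrite('output.png', bgra) would then save it; note that the output format must support transparency (PNG does, JPEG does not).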
We will have to pass the argument method='transparentBackground' to the function to get an image with transparent background.
Python
# Specify the path of a sample image.
img_path = 'media/sample.jpg'
# Read the input image from the specified path.
image = cv2.imread(img_path)
# Make the background of the sample image transparent.
modifyBackground(image, method='transparentBackground')
Note: These models work best for the scenarios where the person is close (< 2m) to the camera.
Bleed AI Needs Your Support!
Hi Everyone, Taha Anwar (Founder, Bleed AI) here. If my blog posts or videos have helped you in any way on your Computer Vision/AI/ML/DL learning journey, then remember that you can help us out too.
Publishing free, high-quality Computer Vision tutorials so that you can build projects, land your dream job, or maybe even build a startup is our core mission at Bleed AI. But every single post takes a lot of effort and man-hours, and in order to keep publishing free high-end tutorials, my team and I need your support on Patreon; plus, you will get some extra perks too.
Summary:
Alright, So today we did a lot!
We understood the basic terminology regarding different segmentation techniques. In summary:
Image Segmentation: The task of dividing pixels into groups of pixels based on some criteria
Semantic Segmentation: In this type we assign a class label to every pixel in the image.
Instance Segmentation: Here we segment each object instance separately, so different instances of the same class get distinct labels.
Panoptic Segmentation: This approach combines both semantic and instance segmentation.
Saliency Detection: Here we’re just interested in segmenting prominent objects in the image regardless of the class.
Selfie Segmentation: Here we want to segment prominent people in the image.
We also learned that Mediapipe is an awesome tool for running various ML models in real-time. Then we learned how to perform selfie segmentation with this tool and built 4 different useful applications with it. These applications were:
How to remove/replace backgrounds in images & videos.
How to desaturate the background to make the person pop out in an image or a video.
How to blur out the background.
How to give an image a transparent background and save it.
This was my first Mediapipe tutorial, and I’m planning to write tutorials on a few other models too. If you enjoyed this tutorial, then do let me know in the comments! You’ll definitely get a reply from me.
Hire Us
Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies
Today’s video tutorial is the one I wish I had access to when I was starting out with OpenCV. In this video, I reveal some very interesting information about OpenCV, including great tips on where to find the right resources and tutorials for the library.
I’ll start by briefly going over the history of OpenCV and then talk about other exciting topics.
Some of the things I will go through in this video:
👉 How to navigate the OpenCV docs to find what you’re looking for.
👉 How to get details regarding any OpenCV function.
👉 The differences between the C++ and Python versions of OpenCV, and which one you should work with.
👉 Pip installation of OpenCV vs. source installation.
👉 Where to ask questions regarding OpenCV when you’re stuck.
You can reach out to me personally for a 1-on-1 consultation session in AI/Computer Vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.
Ready to seriously dive into State of the Art AI & Computer Vision? Then Sign up for these premium Courses by Bleed AI