Controlling Subway Surfers Game with Pose Detection using Mediapipe and Python


Watch The Video Here:

In last week’s tutorial, we learned how to work with real-time pose detection and created a pose classification system. In this week’s tutorial, we’ll learn to play a popular game called “Subway Surfers”.

Of course, there’s more to it; this is an AI blog after all.

We will actually be using our body pose to control the game, not keyboard controls. The entire application will work in real-time on your CPU; you don’t even need a depth camera or a Kinect, your webcam will suffice.

Excited yet? Let’s get into it, but before that let me tell you a short story that motivated me to build this application. It starts with me giving a lecture on the importance of physical fitness. I know … I know … how this sounds, but just bear with me for a bit.

Hi all, Taha Anwar here. So here’s the thing: one of the best things I enjoyed in my early teenage years was having a fast metabolism due to my involvement in physical activities. I could eat whatever I wanted, not make a conscious effort to exercise, and still stay fit.

But as I grew older and started spending most of my time in front of a computer, I noticed that I was actually gaining weight. So I could no longer afford the luxury of binging on unhealthy food and skipping workouts.

Now I’m a bit of a foodie, so although I could compromise a bit on how I eat, I still needed to cut weight some other way. I quickly realized that unless I wanted to get obese, I needed to make a conscious effort to work out.

That’s about when I joined a local gym in my area, and guess what? … it didn’t work out ( or I didn’t work out … enough 🙁 ), so I quit after a month.

So what was the reason? Well, I could provide multiple excuses, but to be honest, I was just lazy.

A few months later I joined the gym again, and again I quit after just 2 months.

Now I could have just quit completely, but instead, 8 months back I tried again. This time I even hired a trainer to keep me motivated, and as they say, the 3rd time’s a charm, and luckily it was!

8 months in, I’m still at it. I did see results and lost a couple of kgs, although I haven’t reached my personal target so I’m still working towards it.

If you’re reading this post then you’re probably into computer science just like me, and you most likely spend a lot of time in front of a PC; because of that, your physical and mental fitness can take a toll. And I seriously can’t stress enough how important it is that you take out a couple of hours each week to exercise.

I’m not a fitness guru but I can say working out has many key benefits:

  • Helps you shed excess weight, keeps you physically fit.
  • Gives you mental clarity and improves your work quality.
  • Lots of health benefits.
  • Helps you get a partner, if you’re still single like me … lol

Because of these reasons, even though I have an introverted personality, I consciously take out a couple of hours each week to go to the gym or the park for running.

But here’s the thing, sometimes I wonder: why can’t I combine what I do (working on a PC) with some physical activity, so I could … you know, kill two birds with one stone?

This thought led me to create this post. What I did was create a vision application that allows me to control a very popular game called Subway Surfers via my body movements, by utilizing real-time pose detection.

And so In this tutorial, I’ll show you how to create this application that controls the Subway Surfers game using body gestures and movements so that you can also exercise, code, and have fun at the same time.


How will this Work?

So this game is about a character running from a policeman, dodging different hurdles by jumping, crouching, and moving left and right. So we will need to worry about four controls that are normally triggered using a keyboard:

  • Up arrow key to make the character jump
  • Down arrow key to make the character crouch
  • Left arrow key to move the character to the left
  • Right arrow key to move the character to the right

Using the Pyautogui library, we will automatically trigger the required keypress events, depending upon the body movement of the person that we’ll capture using Mediapipe’s Pose Detection model.

I want the game’s character to:

  • Jump whenever the person controlling the character jumps.
  • Crouch whenever the person controlling the character crouches.
  • Move left whenever the person controlling the character moves to the left side of the screen.
  • Move right whenever the person controlling the character moves to the right side of the screen.

You can also use the techniques you’ll learn in this tutorial to control any other game. The simpler the game, the easier it will be to control. I have actually published two tutorials about game control via body gestures.

Alright now that we have discussed the basic mechanisms for creating this application, let me walk you through the exact step-by-step process I used to create this.

Outline

  1. Step 1: Perform Pose Detection
  2. Step 2: Control Starting Mechanism
  3. Step 3: Control Horizontal Movements
  4. Step 4: Control Vertical Movements
  5. Step 5: Control Keyboard and Mouse with PyautoGUI
  6. Step 6: Build the Final Application

Alright, let’s get started.

Download Code

Import the Libraries

We will start by importing the required libraries.
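For reference, here is a minimal, assumption-based set of imports that covers the code sketched throughout this post:

```python
import cv2                 # OpenCV for image/video handling
import pyautogui           # to trigger keyboard and mouse events
import mediapipe as mp     # mediapipe's pose detection solution
import matplotlib.pyplot as plt
from math import hypot     # used later for the hands-joined distance
```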

Initialize the Pose Detection Model

After that, we will need to initialize the mp.solutions.pose class, call the mp.solutions.pose.Pose() function with the appropriate arguments, and also initialize the mp.solutions.drawing_utils class that is needed to visualize the landmarks after detection.
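For instance, the setup might look like this (the argument values are assumptions in line with the rest of the tutorial):

```python
# Initialize the mediapipe pose class.
mp_pose = mp.solutions.pose

# Set up the Pose function for video.
pose_video = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.5,
                          model_complexity=1)

# Initialize the mediapipe drawing class, needed to visualize the landmarks.
mp_drawing = mp.solutions.drawing_utils
```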

Step 1: Perform Pose Detection

To implement the game control mechanisms, we will need the current pose info of the person controlling the game, as our intention is to control the character with the movement of the person in the frame. We want the game’s character to move left, right, jump and crouch with the identical movements of the person.

So we will create a function detectPose() that will take an image as input and perform pose detection on the person in the image using mediapipe’s pose detection solution to get thirty-three 3D landmarks on the body, and the function will display the results or return them depending upon the passed arguments.


This function is quite similar to the one we had created in the previous post. The only difference is that we are not plotting the pose landmarks in 3D and we are passing a few more optional arguments to the function mp.solutions.drawing_utils.draw_landmarks() to specify the drawing style.

You probably do not want to lose control of the game’s character whenever some other person comes into the frame (and starts controlling the character), so that annoying scenario is already taken care of, as the solution we are using only detects the landmarks of the most prominent person in the image.

So you do not need to worry about losing control as long as you are the most prominent person in the frame as it will automatically ignore the people in the background.
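Here is a sketch of what such a detectPose() helper might look like (the drawing colors, line thicknesses and argument names are assumptions):

```python
def detectPose(image, pose, draw=False, display=False):
    '''Perform pose detection on an image and draw/return the results (a sketch).'''
    # Create a copy of the input image to draw on.
    output_image = image.copy()

    # Convert the image from BGR to RGB, as mediapipe expects RGB input.
    imageRGB = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Perform the pose detection.
    results = pose.process(imageRGB)

    # Draw the detected landmarks on the copy, if requested and a pose was found.
    if results.pose_landmarks and draw:
        mp_drawing.draw_landmarks(
            image=output_image,
            landmark_list=results.pose_landmarks,
            connections=mp_pose.POSE_CONNECTIONS,
            landmark_drawing_spec=mp_drawing.DrawingSpec(color=(255, 255, 255),
                                                         thickness=3, circle_radius=3),
            connection_drawing_spec=mp_drawing.DrawingSpec(color=(49, 125, 237),
                                                           thickness=2, circle_radius=2))

    # Either display the result with matplotlib or return it to the caller.
    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, results
```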

Now we will test the function detectPose() created above to perform pose detection on a sample image and display the results.
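A quick way to do that (the image path is just a placeholder; for still images a Pose object with static_image_mode=True is the better fit):

```python
# Set up a Pose object for still images and try the function on a sample image.
pose_image = mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.5,
                          model_complexity=1)

IMG_PATH = 'media/sample.jpg'   # placeholder path; change it to try other images
image = cv2.imread(IMG_PATH)
detectPose(image, pose_image, draw=True, display=True)
```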


It worked pretty well! If you want, you can test the function on other images too by just changing the value of the variable IMG_PATH in the cell above; it will work fine as long as there is a prominent person in the image.

Step 2: Control Starting Mechanism

In this step, we will implement the game starting mechanism: we want to start the game whenever the most prominent person in the image/frame joins both hands together. So we will create a function checkHandsJoined() that will check whether the hands of the person in an image are joined or not.

The function checkHandsJoined() will take in the results of the pose detection returned by the function detectPose() and will use the coordinates of the LEFT_WRIST and RIGHT_WRIST landmarks, from the list of thirty-three landmarks, to calculate the euclidean distance between the hands of the person.


It will then compare that distance with an appropriate threshold value to check whether the hands of the person in the image/frame are joined or not, and will display or return the results depending upon the passed arguments.
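A sketch of such a function (the 130-pixel threshold and the drawing details are assumptions to tune for your camera setup):

```python
def checkHandsJoined(image, results, draw=False, display=False):
    '''Check whether the person's hands are joined, using the wrist landmarks (a sketch).'''
    height, width, _ = image.shape
    output_image = image.copy()

    # Pixel coordinates of the left and right wrist landmarks.
    left_wrist = (results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST.value].x * width,
                  results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_WRIST.value].y * height)
    right_wrist = (results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST.value].x * width,
                   results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST.value].y * height)

    # Euclidean distance between the wrists.
    euclidean_distance = int(hypot(left_wrist[0] - right_wrist[0],
                                   left_wrist[1] - right_wrist[1]))

    # Compare with a threshold (tuned empirically) to decide the hands status.
    if euclidean_distance < 130:
        hand_status, color = 'Hands Joined', (0, 255, 0)
    else:
        hand_status, color = 'Hands Not Joined', (0, 0, 255)

    if draw:
        cv2.putText(output_image, hand_status, (10, 30),
                    cv2.FONT_HERSHEY_PLAIN, 2, color, 3)
        cv2.putText(output_image, f'Distance: {euclidean_distance}', (10, 70),
                    cv2.FONT_HERSHEY_PLAIN, 2, color, 3)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, hand_status
```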

Now we will test the function checkHandsJoined() created above on a real-time webcam feed to check whether it is working as we had expected or not.

Output Video:

Woah! I am stunned; the pose detection solution is best known for its speed, which is reflected in the results, as the distance and the hands status update very fast and are also highly accurate.

Step 3: Control Horizontal Movements

Now comes the implementation of the left and right movement control mechanism of the game’s character: we want to make the game’s character move left and right with the horizontal movements of the person in the image/frame.

So we will create a function checkLeftRight() that will take in the pose detection results returned by the function detectPose() and will use the x-coordinates of the RIGHT_SHOULDER and LEFT_SHOULDER landmarks to determine the horizontal position (Left, Right, or Center) of the person in the frame, after comparing the landmarks with the x-coordinate of the center of the image.

The function will visualize or return the resultant image and the horizontal position of the person depending upon the passed arguments.
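Here is one way such a checkLeftRight() helper could be sketched (the annotation style is an assumption):

```python
def checkLeftRight(image, results, draw=False, display=False):
    '''Determine whether the person is on the Left, Right or Center of the frame (a sketch).'''
    height, width, _ = image.shape
    output_image = image.copy()

    # x-coordinates of the shoulders, scaled to pixel values.
    left_x = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER.value].x * width)
    right_x = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER.value].x * width)

    # Both shoulders on the left half of the frame -> person is on the left.
    if right_x <= width // 2 and left_x <= width // 2:
        horizontal_position = 'Left'
    # Both shoulders on the right half of the frame -> person is on the right.
    elif right_x >= width // 2 and left_x >= width // 2:
        horizontal_position = 'Right'
    # Shoulders on opposite sides of the center line -> person is at the center.
    else:
        horizontal_position = 'Center'

    if draw:
        cv2.putText(output_image, horizontal_position, (5, height - 10),
                    cv2.FONT_HERSHEY_PLAIN, 2, (255, 255, 255), 3)
        cv2.line(output_image, (width // 2, 0), (width // 2, height), (255, 255, 255), 2)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, horizontal_position
```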


Now we will test the function checkLeftRight() created above on a real-time webcam feed and will visualize the results updating in real-time with the horizontal movements.

Output Video:

Cool! The speed and accuracy of this model never fail to impress me.

Step 4: Control Vertical Movements

In this one, we will implement the jump and crouch control mechanism of the game’s character: we want to make the game’s character jump and crouch whenever the person in the image/frame jumps and crouches.

So we will create a function checkJumpCrouch() that will check whether the posture of the person in an image is Jumping, Crouching, or Standing, by utilizing the results of pose detection returned by the function detectPose().

The function checkJumpCrouch() will retrieve the RIGHT_SHOULDER and LEFT_SHOULDER landmarks from the list to calculate the y-coordinate of the midpoint of both shoulders and will determine the posture of the person by doing a comparison with an appropriate threshold value.

The threshold (MID_Y) will be the approximate y-coordinate of the midpoint of both shoulders of the person while in standing posture. It will be calculated before starting the game in Step 6: Build the Final Application and will be passed to the function checkJumpCrouch().

But the issue with this approach is that the midpoint of both shoulders of the person while in standing posture will not always be exactly the same, as it will vary when the person moves closer to or farther from the camera.

To tackle this issue we will add and subtract a margin to the threshold to get an upper and lower bound as shown in the image below.

pose detection threshold
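Here is a sketch of what the checkJumpCrouch() helper could look like (the default MID_Y value and the 15/100-pixel margins are assumptions to tune):

```python
def checkJumpCrouch(image, results, MID_Y=250, draw=False, display=False):
    '''Classify the posture as Jumping, Crouching or Standing (a sketch).'''
    height, width, _ = image.shape
    output_image = image.copy()

    # y-coordinates of the shoulders, scaled to pixel values.
    left_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER.value].y * height)
    right_y = int(results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER.value].y * height)

    # Midpoint of both shoulders.
    actual_mid_y = abs(left_y + right_y) // 2

    # Upper and lower bounds around the standing-posture midpoint.
    lower_bound = MID_Y - 15
    upper_bound = MID_Y + 100

    if actual_mid_y < lower_bound:
        posture = 'Jumping'
    elif actual_mid_y > upper_bound:
        posture = 'Crouching'
    else:
        posture = 'Standing'

    if draw:
        cv2.putText(output_image, posture, (5, height - 50),
                    cv2.FONT_HERSHEY_PLAIN, 2, (255, 255, 255), 3)
        cv2.line(output_image, (0, MID_Y), (width, MID_Y), (255, 255, 255), 2)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, posture
```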

Now we will test the function checkJumpCrouch() created above on the real-time webcam feed and will visualize the resultant frames. For testing purposes, we will be using a default value of the threshold, which you can manually tune according to your height if you want.

Output Video:

Great! When I lower my shoulders a certain distance below the horizontal line (threshold), the result is Crouching; whenever my shoulders are near the horizontal line (i.e., between the upper and lower bounds), the result is Standing; and when my shoulders are a certain distance above the horizontal line, the result is Jumping.

Step 5: Control Keyboard and Mouse with PyautoGUI

The Subway Surfers character wouldn’t be able to move left, right, jump, or crouch unless we provide it with the required keyboard inputs. Now that we have the functions checkHandsJoined(), checkLeftRight(), and checkJumpCrouch(), we need to figure out a way to trigger the required keyboard keypress events depending upon the output of the functions created above.

This is where the PyAutoGUI API shines. It allows you to easily control mouse and keyboard events through scripts. To get an idea of PyAutoGUI’s capabilities, you can check this video in which a bot is playing the game Sushi Go Round.

To run the cells in this step, it is not recommended to use the keyboard shortcut (Shift + Enter), as the cells with keypress events will behave differently when those events are combined with the Shift and Enter keys. You can either use the menubar (Cell >> Run Cell) or the toolbar (▶️ Run) to run the cells.

Now let’s see how simple it is to trigger the up arrow keypress event using pyautogui.
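For example:

```python
# Press (and release) the up arrow key once.
pyautogui.press('up')
```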

Similarly, we can trigger the down arrow or any other keypress event by replacing the argument with that key name (the argument should be a string). You can click here to see the list of valid arguments.

To press multiple keys, we can pass a list of strings (key names) to the pyautogui.press() function.

Or to press the same key multiple times, we can pass a value (number of times we want to press the key) to the argument presses in the pyautogui.press() function.
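Both variations look like this:

```python
# Press a sequence of keys one after the other.
pyautogui.press(['down', 'left', 'right'])

# Press the same key multiple times using the presses argument.
pyautogui.press('up', presses=3)
```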

This function presses the key(s) down and then releases the key(s) automatically. We can also control the keypress event and the key release event individually by using the functions:

  • pyautogui.keyDown(key): Presses and holds down the specified key.
  • pyautogui.keyUp(key): Releases the specified key.

So with the help of these functions, keys can be pressed for a longer period. For example, in the cell below we will hold down the shift key and press the enter key two times to run the two cells below this one, and then we will release the shift key.

Now we will hold down the shift key and press the tab key and then we will release the shift key. This will switch the tab of your browser so make sure to have multiple tabs before running the cell below.
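A sketch of both examples described above:

```python
# Hold down shift, press enter twice (running the next two notebook cells),
# then release shift.
pyautogui.keyDown('shift')
pyautogui.press('enter', presses=2)
pyautogui.keyUp('shift')

# Hold down shift, press tab (switching the browser tab), then release shift.
pyautogui.keyDown('shift')
pyautogui.press('tab')
pyautogui.keyUp('shift')
```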

To trigger mouse keypress events, we can use the pyautogui.click() function, and to specify the mouse button that we want to press, we can pass the value left, middle, or right to the argument button.

We can also move the mouse cursor to a specific position on the screen by specifying the x and y-coordinate values to the arguments x and y respectively.
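For example:

```python
# Click the left mouse button at the current cursor position.
pyautogui.click(button='left')

# Move the cursor to a specific screen position, then right-click there.
pyautogui.moveTo(x=500, y=300)
pyautogui.click(x=500, y=300, button='right')
```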

Step 6: Build the Final Application

In the final step, we will have to combine all the components to build the final application.

We will use the outputs of the functions created above, checkHandsJoined() (to start the game), checkLeftRight() (to control horizontal movements) and checkJumpCrouch() (to control vertical movements), to trigger the relevant keyboard and mouse events and control the game’s character with our body movements.

Now we will run the cell below and click here to play the game in our browser using our body gestures and movements.
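To give a sense of how the pieces fit together, here is a condensed sketch of the kind of main loop that final cell might contain. The webcam resolution, the click coordinates used to focus the game window, and the simplified start handling are all assumptions; the actual notebook cell handles a few more details (such as resuming after a game over).

```python
camera_video = cv2.VideoCapture(0)
camera_video.set(3, 1280)
camera_video.set(4, 960)

game_started = False   # becomes True once the player joins both hands
x_pos_index = 1        # 0 = left lane, 1 = center, 2 = right lane
y_pos_index = 1        # 0 = jumping, 1 = standing, 2 = crouching
MID_Y = None           # shoulders' midpoint captured when the game starts

while camera_video.isOpened():
    ok, frame = camera_video.read()
    if not ok:
        continue

    frame = cv2.flip(frame, 1)
    frame, results = detectPose(frame, pose_video, draw=game_started)

    if results.pose_landmarks:
        if game_started:
            # Horizontal movement: press left/right when the lane changes.
            frame, horizontal_position = checkLeftRight(frame, results, draw=True)
            if (horizontal_position == 'Left' and x_pos_index != 0) or \
               (horizontal_position == 'Center' and x_pos_index == 2):
                pyautogui.press('left')
                x_pos_index -= 1
            elif (horizontal_position == 'Right' and x_pos_index != 2) or \
                 (horizontal_position == 'Center' and x_pos_index == 0):
                pyautogui.press('right')
                x_pos_index += 1

            # Vertical movement: press up/down on jump/crouch transitions.
            if MID_Y:
                frame, posture = checkJumpCrouch(frame, results, MID_Y, draw=True)
                if posture == 'Jumping' and y_pos_index == 1:
                    pyautogui.press('up')
                    y_pos_index = 0
                elif posture == 'Crouching' and y_pos_index == 1:
                    pyautogui.press('down')
                    y_pos_index = 2
                elif posture == 'Standing' and y_pos_index != 1:
                    y_pos_index = 1
        else:
            # Wait for the player to join both hands to start the game.
            frame, hand_status = checkHandsJoined(frame, results, draw=True)
            if hand_status == 'Hands Joined':
                game_started = True
                # Capture the standing-posture midpoint of the shoulders.
                height = frame.shape[0]
                left_y = int(results.pose_landmarks.landmark[
                    mp_pose.PoseLandmark.LEFT_SHOULDER.value].y * height)
                right_y = int(results.pose_landmarks.landmark[
                    mp_pose.PoseLandmark.RIGHT_SHOULDER.value].y * height)
                MID_Y = abs(left_y + right_y) // 2
                # Click once to focus/start the game window (coordinates are a guess).
                pyautogui.click(x=1300, y=800, button='left')

    cv2.imshow('Subway Surfers with Pose Detection', frame)
    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

camera_video.release()
cv2.destroyAllWindows()
```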

Output Video:

While building big applications like this one, I always divide the application into smaller components and then, in the end, integrate all those components to make the final application.

This makes it really easy to learn and understand how everything comes together to build up the full application.

Join My Course: Computer Vision For Building Cutting Edge Applications

The only course out there that goes beyond basic AI applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, and hand and body gestures. Don’t miss your chance to level up and take your career to new heights.

You’ll Learn about:

  • Creating GUI interfaces for Python AI scripts.
  • Creating .exe DL applications
  • Using a Physics library in Python & integrating it with AI
  • Advanced Image Processing Skills
  • Advanced Gesture Recognition with Mediapipe
  • Task Automation with AI & CV
  • Training an SVM Machine Learning Model.
  • Creating & Cleaning an ML dataset from scratch.
  • Training DL models & how to use CNNs & LSTMs.
  • Creating 10 Advanced AI/CV Applications
  • & More

Whether you’re a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect & build complex, real-world, and thrilling AI applications.

Summary:

In this tutorial, we learned to perform pose detection on the most prominent person in the frame/image to get thirty-three 3D landmarks, then used those landmarks to extract useful info about the body movements of the person (horizontal position, i.e. left, center, or right, and posture, i.e. jumping, standing, or crouching), and then used that info to control a simple game.

Another thing we have learned is how to automatically trigger the mouse and keyboard events programmatically using the Pyautogui library.

Now one drawback of controlling the game with body movements is that the game becomes much harder compared to controlling it via keyboard presses. 

But our aim, to make exercise fun and to learn to control Human-Computer Interaction (HCI) based games using AI, is achieved. Now if you want, you can extend this application further to control a much more complex application.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Vehicle Detection with OpenCV using Contours + Background Subtraction (Pt:4)


Watch the Full Video Here:

Vehicle detection has been a challenging part of building intelligent traffic management systems. Such systems are critical for addressing the ever-increasing number of vehicles on road networks that cannot keep up with the pace of increasing traffic. Today many methods that deal with this problem use either traditional computer vision or complex deep learning models.

Popular computer vision techniques include vehicle detection using optical flow, but in this tutorial, we are going to perform vehicle detection using another traditional computer vision technique that utilizes background subtraction and contour detection to detect vehicles. This means you won’t have to spend hundreds of hours in data collection or annotation for building deep learning models, which can be tedious, to say the least. Not to mention, the computation power required to train the models.

This post is the fourth and final part of our Contour Detection 101 series. All 4 posts in the series are titled as:

  1. Contour Detection 101: The Basics  
  2. Contour Detection 101: Contour Manipulation
  3. Contour Detection 101: Contour Analysis 
  4. Vehicle Detection with OpenCV using Contours + Background Subtraction (This Post)

So if you are new to the series and unfamiliar with contour detection, make sure you check them out!

In part 1 of the series, we learned the basics, how to detect and draw the contours, in part 2 we learned to do some contour manipulations and in the third part, we analyzed the detected contours for their properties to perform tasks like object detection. Combining these techniques with background subtraction will enable us to build a useful application that detects vehicles on a road. And not just that but you can use the same principles that you learn in this tutorial to create other motion detection applications.

So let’s dive into how vehicle detection with background subtraction works.

Import the Libraries

Let’s First start by importing the libraries.
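A minimal, assumption-based set of imports for the code below:

```python
import cv2
import numpy as np
```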

Car Detection using Background Subtraction

Background subtraction is a simple yet effective technique to extract objects from an image/video. Consider a highway on which cars are moving, and you want to extract each car. One easy way is to take a picture of the highway with the cars (called the foreground image) while also having a saved image in which the highway does not contain any cars (the background image); you subtract the background image from the foreground to get the segmented mask of the cars and then use that mask to extract the cars.

But in many cases you don’t have a clear background image; an example of this can be a highway that is always busy, or maybe a walking destination that is always crowded. So in those cases, you can subtract the background by other means. For example, in the case of a video you can detect the movement of the objects, so the objects that move can be treated as the foreground and the parts that remain static as the background.

Several algorithms have been invented for this purpose. OpenCV has implemented a few such algorithms which are very easy to use. Let’s see one of them.

BackgroundSubtractorMOG2

BackgroundSubtractorMOG2 is a background/foreground segmentation algorithm based on two papers by Z. Zivkovic, “Improved adaptive Gaussian mixture model for background subtraction” (IEEE 2004) and “Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction” (Elsevier BV 2006). One important feature of this algorithm is that it provides better adaptability to varying scenes due to illumination changes, which saves you from having to worry about maintaining a fixed background image. Let’s see how it works.

Function Syntax:

object = cv2.createBackgroundSubtractorMOG2(history, varThreshold, detectShadows)

Parameters:

  • history (optional) – It is the length of the history. Its default value is 500.
  • varThreshold (optional) – It is the threshold on the squared distance between the pixel and the model to decide whether a pixel is well described by the background model. It does not affect the background update and its default value is 16.
  • detectShadows (optional) – It is a boolean that determines whether the algorithm will detect and mark shadows or not. It marks shadows in gray color. Its default value is True. It decreases the speed a bit, so if you do not need this feature, set the parameter to false.

Returns:

  • object – It is the MOG2 Background Subtractor.
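To make the usage concrete, here is a minimal sketch of applying the MOG2 subtractor to a video and extracting the foreground with the resulting mask (the video path is a placeholder):

```python
video = cv2.VideoCapture('media/traffic.mp4')

# Create the MOG2 background subtractor (shadows will be marked in gray).
backgroundObject = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Apply the subtractor to get the foreground mask for this frame.
    fgmask = backgroundObject.apply(frame)

    # Extract the foreground part of the frame using the mask.
    foreground = cv2.bitwise_and(frame, frame, mask=fgmask)

    # Show the mask, the original frame and the extracted foreground side by side.
    stacked = cv2.hconcat([cv2.cvtColor(fgmask, cv2.COLOR_GRAY2BGR), frame, foreground])
    cv2.imshow('Mask | Original | Foreground', stacked)

    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

video.release()
cv2.destroyAllWindows()
```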

Output:

In the output, the middle frame is the original video; on the left we have the background subtraction result with shadows, while on the right we have the foreground part produced using the background subtraction mask.

Creating the Vehicle Detection Application

Alright once we have our background subtraction method ready, we can build our final application!

Here’s a breakdown of the steps we need to perform for the complete background subtraction based vehicle detection (a sketch of the full loop follows the steps below).

1) Start by loading the video using the function cv2.VideoCapture() and create a background subtractor object using the function cv2.createBackgroundSubtractorMOG2().

2) Then we will use the backgroundsubtractor.apply() method to get the segmented masks for the frames of the video after reading the frames one by one using the function cv2.VideoCapture.read().

3) Next, we will apply thresholding on the mask using the function cv2.threshold() to get rid of shadows and then perform Erosion and Dilation to improve the mask further using the functions cv2.erode() and cv2.dilate().

4) Then we will use the function cv2.findContours() to detect the contours on the mask image and convert the contour coordinates into bounding box coordinates for each car in the frame using the function cv2.boundingRect(). We will also check the area of the contour using cv2.contourArea() to make sure it is greater than a threshold for a car contour.

5) After that we will use the functions cv2.rectangle() and cv2.putText() to draw and label the bounding boxes on each frame and extract the foreground part of the video with the help of the segmented mask using the function cv2.bitwise_and().
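Here is a sketch of the full loop following those five steps. The video path, the threshold of 250 used to drop the gray shadow pixels, the kernel size, and the contour-area cutoff are assumptions to tune for your footage (the contour call below assumes OpenCV 4.x, which returns two values):

```python
video = cv2.VideoCapture('media/traffic.mp4')

# Step 1: create the background subtractor object.
backgroundObject = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
kernel = np.ones((3, 3), np.uint8)

while True:
    ret, frame = video.read()
    if not ret:
        break

    # Step 2: get the segmented mask for the current frame.
    fgmask = backgroundObject.apply(frame)

    # Step 3: threshold out the gray shadow pixels, then erode and dilate.
    _, fgmask = cv2.threshold(fgmask, 250, 255, cv2.THRESH_BINARY)
    fgmask = cv2.erode(fgmask, kernel, iterations=1)
    fgmask = cv2.dilate(fgmask, kernel, iterations=2)

    # Step 4: detect contours on the mask and keep only the large ones.
    contours, _ = cv2.findContours(fgmask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    frameCopy = frame.copy()
    for contour in contours:
        if cv2.contourArea(contour) > 400:
            x, y, w, h = cv2.boundingRect(contour)
            # Step 5: draw and label the bounding box of each detected car.
            cv2.rectangle(frameCopy, (x, y), (x + w, y + h), (0, 0, 255), 2)
            cv2.putText(frameCopy, 'Car Detected', (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1, cv2.LINE_AA)

    # Step 5 (continued): extract the foreground part of the frame.
    foreground = cv2.bitwise_and(frame, frame, mask=fgmask)

    cv2.imshow('Detections', frameCopy)
    cv2.imshow('Foreground', foreground)
    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

video.release()
cv2.destroyAllWindows()
```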

Output:

This seems to have worked out well, that too without having to train large-scale Deep learning models!

There are many other background subtraction algorithms in OpenCV that you can use. Check out here and here for further details about them.

Summary

Vehicle Detection is a popular computer vision problem. This post explored how traditional machine vision tools can still be utilized to build applications that can effectively deal with modern vision challenges.

We used a popular background/foreground segmentation technique called background subtraction to isolate our regions of interest from the image.    

We also saw how contour detection can prove useful when dealing with vision problems, along with the pre-processing and post-processing that can be used to filter out noise in the detected contours.

Although these techniques can be robust, they are not as generalizable as Deep learning models so it’s important to put more focus on deployment conditions and possible variations when building vision applications with such techniques.

This post concludes the four-part series on contour detection. If you enjoyed this post and followed the rest of the series do let me know in the comments and you can also support me and the Bleed AI team on patreon here.

If you need 1 on 1 coaching in AI/computer vision regarding your project or your career, then you can reach out to me personally here.


Hire Us

Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies

Real-Time 3D Pose Detection & Pose Classification with Mediapipe and Python


In this tutorial, we’ll learn how to do real-time 3D pose detection using the mediapipe library in python. After that, we’ll calculate angles between body joints and combine them with some heuristics to create a pose classification system. 

All of this will work on real-time camera feed using your CPU as well as on images. See results below.

pose detection

The code is really simple; for a detailed code explanation do also check out the YouTube tutorial, although this blog post will suffice to get the code up and running in no time.

This post can be split into 3 parts:

Part 1 (a): Introduction to Pose Detection

Part 1 (b): Mediapipe’s Pose Detection Implementation

Part 2: Using Pose Detection in images and on videos

Part 3: Pose Classification with Angle Heuristics

Part 1 (a): Introduction to Pose Detection:

Pose Detection or Pose Estimation is a very popular problem in computer vision; in fact, it belongs to a broader class of computer vision problems called keypoint estimation. Today we’ll learn to do pose detection, where we’ll try to localize 33 key body landmarks on a person, e.g. elbows, knees, ankles, etc. See the image below:

Some interesting applications of pose detection are:

  • Full body Gesture Control to control anything from video games (e.g. kinect) to physical appliances, robots etc. Check this.
  • Full body Sign Language Recognition. Check this.
  • Creating Fitness / exercise / dance monitoring applications. Check this.
  • Creating Augmented reality applications that overlay virtual clothes or other accessories over someone’s body. Check this.

Now, these are just some interesting things you can make using pose detection, as you can see it’s a really interesting problem.

And that’s not it; there are other types of keypoint detection problems too, e.g. facial landmark detection, hand landmark detection, etc.

We will actually learn to do both of the above in the upcoming tutorials.

Key point detection in turn belongs to a major computer vision branch called Image recognition, other broad classes of vision that belong in this branch are Classification, Detection, and Segmentation.

Here’s a very generic definition of each class.

  • In classification we try to classify whole images or videos as belonging to a certain class.
  • In Detection we try to classify and localize objects or classes of interest.
  • In Segmentation, we try to extract/segment or find the exact boundary/outline of our target object/class.
  • In Keypoint Detection, we try to localize predefined points/landmarks.

It should be noted that each of the major categories above has subcategories or different types. A few weeks ago I wrote a post on selfie segmentation using mediapipe where I talked about various segmentation types, so be sure to read that post.

If you’re new to computer vision and just exploring the waters, check this page from paperswithcode; it lists a lot of subcategories of the above major categories. Now don’t be confused by the categorization that paperswithcode has done; personally speaking, I don’t agree with the way they have sorted subcategories with applications, and there are some other issues. The takeaway is that there are a lot of variations in computer vision problems, but the 4 categories I’ve listed above are some major ones.

Part 1 (b): Mediapipe’s Pose Detection Implementation:

Here’s a brief introduction to Mediapipe;

 “Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media & It was built by Google”

Not only is this tool backed by Google, but models in Mediapipe are actively used in Google products. So you can expect nothing less than state-of-the-art performance from this library.

Now MediaPipe’s pose detection is a state-of-the-art solution for high-fidelity (i.e. high-quality) and low-latency (i.e. damn fast) detection of 33 3D landmarks on a person in real-time video feeds on low-end devices, i.e. phones, laptops, etc.

Alright, so what makes this pose detection model from Mediapipe so fast?

They are actually using a very successful deep learning recipe that is creating a 2 step detector, where you combine a computationally expensive object detector with a lightweight object tracker.

Here’s how this works:

You run the detector on the first frame of the video to localize the person and provide a bounding box around them; after that, the tracker takes over and predicts the landmark points inside that bounding box ROI. The tracker continues to run on subsequent frames using the previous frame’s ROI and only calls the detection model again when it fails to track the person with high confidence.

Their model works best if the person is standing 2-4 meters away from the camera and one major limitation of their model is that this approach only works for single-person pose detection, it’s not applicable for multi-person detection.

Mediapipe actually trained 3 models, with different tradeoffs between speed and performance. You’ll be able to use all 3 of them with mediapipe.

Method            Latency (Pixel 3, TFLite GPU)    Latency (MacBook Pro, 15-inch 2017)
BlazePose.Heavy   53 ms                            38 ms
BlazePose.Full    25 ms                            27 ms
BlazePose.Lite    20 ms                            25 ms

The detector used in pose detection is inspired by Mediapipe’s lightweight BlazeFace model; you can read this paper. For the landmark model used in pose detection, you can read this paper for more details, or read Google’s blog on it.

Here are the 33 landmarks that this model detects:

Alright now that we have covered some basic theory and implementation details, let’s get into the code.

Download Code

Part 2: Using Pose Detection in images and on videos

Import the Libraries

Let’s start by importing the required libraries.
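A minimal, assumption-based list of imports for the code in this post:

```python
import cv2
import mediapipe as mp
import matplotlib.pyplot as plt
from math import degrees, atan2   # used later for the angle calculation
```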

Initialize the Pose Detection Model

The first thing that we need to do is initialize the pose class using the mp.solutions.pose syntax and then we will call the setup function mp.solutions.pose.Pose() with the arguments:

  • static_image_mode – It is a boolean value; if set to False, the detector is only invoked as needed, that is, in the very first frame or when the tracker loses track. If set to True, the person detector is invoked on every input image. So you should probably set this value to True when working with a bunch of unrelated images, not videos. Its default value is False.
  • min_detection_confidence – It is the minimum detection confidence in the range (0.0, 1.0) required to consider the person-detection model’s prediction correct. Its default value is 0.5, meaning that if the detector has a prediction confidence of 50% or greater, it is considered a positive detection.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) required to consider the landmark-tracking model’s tracked pose landmarks valid. If the confidence is less than the set value then the detector is invoked again in the next frame/image, so increasing its value increases the robustness, but also increases the latency. Its default value is 0.5.
  • model_complexity – It is the complexity of the pose landmark model. As there are three different models to choose from, the possible values are 0, 1, or 2. The higher the value, the more accurate the results, but at the expense of higher latency. Its default value is 1.
  • smooth_landmarks – It is a boolean value; if set to True, pose landmarks across different frames are filtered to reduce noise, but it only works when static_image_mode is also set to False. Its default value is True.

Then we will also initialize mp.solutions.drawing_utils class that will allow us to visualize the landmarks after detection, instead of using this, you can also use OpenCV to visualize the landmarks.
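A sketch of that setup, using example argument values in line with the descriptions above (the confidence value is an assumption; model_complexity=2 corresponds to the heavy model whose download message is shown below):

```python
# Initialize the mediapipe pose class.
mp_pose = mp.solutions.pose

# Set up the Pose function for still images, using the heavy landmark model.
pose = mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.3,
                    model_complexity=2)

# Initialize the drawing class used to visualize the detected landmarks.
mp_drawing = mp.solutions.drawing_utils
```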

Downloading model to C:\ProgramData\Anaconda3\lib\site-packages\mediapipe/modules/pose_landmark/pose_landmark_heavy.tflite

Read an Image

Now we will read a sample image using the function cv2.imread() and then display that image using the matplotlib library.
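A small sketch of that step (the image path is a placeholder):

```python
# Read a sample image and display it with matplotlib (BGR -> RGB for display).
sample_img = cv2.imread('media/sample.jpg')
plt.figure(figsize=[10, 10])
plt.title('Sample Image')
plt.axis('off')
plt.imshow(sample_img[:, :, ::-1])
plt.show()
```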

Perform Pose Detection

Now we will pass the image to the pose detection machine learning pipeline by using the function mp.solutions.pose.Pose().process(). But the pipeline expects the input images in RGB color format so first we will have to convert the sample image from BGR to RGB format using the function cv2.cvtColor() as OpenCV reads images in BGR format (instead of RGB).

After performing the pose detection, we will get a list of thirty-three landmarks representing the body joint locations of the prominent person in the image. Each landmark has:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y: It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z: It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with midpoint of hips being the origin, so the smaller the value of z, the closer the landmark is to the camera.
  • visibility: It is a value with range [0.0, 1.0] representing the possibility of the landmark being visible (not occluded) in the image. This is a useful variable when deciding if you want to show a particular joint because it might be occluded or partially visible in the image.

After performing the pose detection on the sample image above, we will display the first two landmarks from the list, so that you get a better idea of the output of the model.
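A sketch of how those two landmarks could be printed (the numbers shown below come from the original run, so your exact values will differ):

```python
# Perform pose detection after converting the sample image from BGR to RGB.
results = pose.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

# Print the first two (normalized) landmarks by name.
if results.pose_landmarks:
    for i in range(2):
        print(f'{mp_pose.PoseLandmark(i).name}:\n{results.pose_landmarks.landmark[i]}')
```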

NOSE:
x: 0.4321258
y: 0.28087094
z: -0.67494285
visibility: 0.99999905

LEFT_EYE_INNER:
x: 0.44070682
y: 0.2621727
z: -0.6380733
visibility: 0.99999845

Now we will convert the two normalized landmarks displayed above into their original scale by using the width and height of the image.
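One way to do that (z is scaled by the image width here, matching the convention of the numbers shown below):

```python
# Convert the first two normalized landmarks to the image's pixel scale.
image_height, image_width, _ = sample_img.shape
if results.pose_landmarks:
    for i in range(2):
        landmark = results.pose_landmarks.landmark[i]
        print(f'{mp_pose.PoseLandmark(i).name}:')
        print(f'x: {landmark.x * image_width}')
        print(f'y: {landmark.y * image_height}')
        print(f'z: {landmark.z * image_width}')
        print(f'visibility: {landmark.visibility}\n')
```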

NOSE:
x: 310.69845509529114
y: 303.340619802475
z: -485.28390991687775
visibility: 0.9999990463256836

LEFT_EYE_INNER:
x: 316.86820307374
y: 283.1465148925781
z: -458.774720788002
visibility: 0.9999984502792358

Now we will draw the detected landmarks on the sample image using the function mp.solutions.drawing_utils.draw_landmarks() and display the resultant image using the matplotlib library.

Now we will go a step further and visualize the landmarks in three-dimensions (3D) using the function mp.solutions.drawing_utils.plot_landmarks(). We will need the POSE_WORLD_LANDMARKS that is another list of pose landmarks in world coordinates that has the 3D coordinates in meters with the origin at the center between the hips of the person.
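A sketch of both visualizations:

```python
# Draw the detected landmarks on a copy of the sample image.
img_copy = sample_img.copy()
if results.pose_landmarks:
    mp_drawing.draw_landmarks(image=img_copy,
                              landmark_list=results.pose_landmarks,
                              connections=mp_pose.POSE_CONNECTIONS)
    plt.figure(figsize=[10, 10])
    plt.title('Output')
    plt.axis('off')
    plt.imshow(img_copy[:, :, ::-1])
    plt.show()

# Plot the world landmarks (3D, in meters, with the hips' midpoint as origin).
mp_drawing.plot_landmarks(results.pose_world_landmarks, mp_pose.POSE_CONNECTIONS)
```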


Note: This is actually a neat hack by mediapipe. The coordinates returned are not true 3D measurements, but setting the hip midpoint as the origin allows us to measure the relative distance of the other points from the hips, and since this distance increases or decreases depending on whether you’re closer to or farther from the camera, it gives us a sense of the depth of each landmark point.

Create a Pose Detection Function

Now we will put all this together to create a function that will perform pose detection on an image and visualize the results or return the results depending upon the passed arguments.

Now we will utilize the function created above to perform pose detection on a few sample images and display the results.

Pose Detection On Real-Time Webcam Feed/Video

The results on the images were pretty good, now we will try the function on a real-time webcam feed and a video. Depending upon whether you want to run pose detection on a video stored in the disk or on the webcam feed, you can comment and uncomment the initialization code of the VideoCapture object accordingly.
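A sketch of that loop, inlining the detection and drawing calls (paths are placeholders; uncomment the VideoCapture line you need):

```python
# Set up the Pose function for videos.
pose_video = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.5,
                          model_complexity=1)

video = cv2.VideoCapture(0)                      # webcam feed
# video = cv2.VideoCapture('media/running.mp4')  # or a video stored on disk

while video.isOpened():
    ok, frame = video.read()
    if not ok:
        break

    frame = cv2.flip(frame, 1)   # mirror the frame for a selfie-style view
    results = pose_video.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                  mp_pose.POSE_CONNECTIONS)

    cv2.imshow('Pose Detection', frame)
    if cv2.waitKey(1) & 0xFF == 27:   # press Esc to quit
        break

video.release()
cv2.destroyAllWindows()
```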

Output:


Cool! so it works great on the videos too. The model is pretty fast and accurate.

Part 3: Pose Classification with Angle Heuristics

We have learned to perform pose detection, now we will level up our game by also classifying different yoga poses using the calculated angles of various joints. We will first detect the pose landmarks and then use them to compute angles between joints and depending upon those angles we will recognize the yoga pose of the prominent person in an image.


But this approach does have a drawback that limits its use to a controlled environment: the calculated angles vary with the angle between the person and the camera. So the person needs to be facing the camera straight on to get the best results.

Create a Function to Calculate Angle between Landmarks

Now we will create a function that will be capable of calculating angles between three landmarks. The angle between landmarks? Do not get confused, as this is the same as calculating the angle between two lines.

The first point (landmark) is considered as the starting point of the first line, the second point (landmark) is considered as the ending point of the first line and the starting point of the second line as well, and the third point (landmark) is considered as the ending point of the second line.
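A sketch of how such a helper could be written with math.atan2 (the landmark arguments are assumed to be (x, y, z) tuples; only x and y are used):

```python
def calculateAngle(landmark1, landmark2, landmark3):
    '''Calculate the angle (in degrees) at landmark2, formed by the lines
    landmark1-landmark2 and landmark2-landmark3 (a sketch).'''
    x1, y1, _ = landmark1
    x2, y2, _ = landmark2
    x3, y3, _ = landmark3

    # Angle between the two lines, measured at the middle landmark.
    angle = degrees(atan2(y3 - y2, x3 - x2) - atan2(y1 - y2, x1 - x2))

    # Keep the angle in the range 0-360.
    if angle < 0:
        angle += 360

    return angle
```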


Now we will test the function created above to calculate the angle between three landmarks with dummy values.
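For instance, with three dummy (x, y, z) points roughly along a shallow “V” (illustrative values that produce an angle close to the one printed below):

```python
# Calculate the angle at the middle landmark for three dummy points.
angle = calculateAngle((558, 326, 0), (642, 333, 0), (718, 321, 0))
print(f'The calculated angle is {angle}')
```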

The calculated angle is 166.26373169437744

Create a Function to Perform Pose Classification

Now we will create a function that will be capable of classifying different yoga poses using the calculated angles of various joints (a sketch of such a function follows the list below). The function will be capable of identifying the following yoga poses:

  • Warrior II Pose
  • T Pose
  • Tree Pose
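As a rough illustration of how such a function could be structured, here is a sketch of a classifyPose() helper. It assumes landmarks is a list of (x, y, z) coordinates in mediapipe’s landmark order (for example, the pixel-scale list a detectPose()-style helper could return), and the tolerance bands around the angle values described in the sections below are assumptions you would tune.

```python
def classifyPose(landmarks, output_image, display=False):
    '''Classify T Pose, Warrior II Pose and Tree Pose from joint angles (a sketch).'''
    label, color = 'Unknown Pose', (0, 0, 255)
    lm = mp_pose.PoseLandmark

    # Elbow angles (shoulder -> elbow -> wrist).
    left_elbow = calculateAngle(landmarks[lm.LEFT_SHOULDER.value],
                                landmarks[lm.LEFT_ELBOW.value],
                                landmarks[lm.LEFT_WRIST.value])
    right_elbow = calculateAngle(landmarks[lm.RIGHT_SHOULDER.value],
                                 landmarks[lm.RIGHT_ELBOW.value],
                                 landmarks[lm.RIGHT_WRIST.value])

    # Shoulder angles (elbow -> shoulder -> hip).
    left_shoulder = calculateAngle(landmarks[lm.LEFT_ELBOW.value],
                                   landmarks[lm.LEFT_SHOULDER.value],
                                   landmarks[lm.LEFT_HIP.value])
    right_shoulder = calculateAngle(landmarks[lm.RIGHT_HIP.value],
                                    landmarks[lm.RIGHT_SHOULDER.value],
                                    landmarks[lm.RIGHT_ELBOW.value])

    # Knee angles (hip -> knee -> ankle).
    left_knee = calculateAngle(landmarks[lm.LEFT_HIP.value],
                               landmarks[lm.LEFT_KNEE.value],
                               landmarks[lm.LEFT_ANKLE.value])
    right_knee = calculateAngle(landmarks[lm.RIGHT_HIP.value],
                                landmarks[lm.RIGHT_KNEE.value],
                                landmarks[lm.RIGHT_ANKLE.value])

    arms_straight = 165 < left_elbow < 195 and 165 < right_elbow < 195     # ~180 degrees
    arms_raised = 80 < left_shoulder < 110 and 80 < right_shoulder < 110   # ~90 degrees

    if arms_straight and arms_raised:
        # T Pose: both knees straight (~180 degrees).
        if 160 < left_knee < 195 and 160 < right_knee < 195:
            label = 'T Pose'
        # Warrior II Pose: one knee straight (~180 degrees), the other bent (~90 degrees).
        elif (160 < left_knee < 195 and 90 < right_knee < 120) or \
             (160 < right_knee < 195 and 90 < left_knee < 120):
            label = 'Warrior II Pose'

    # Tree Pose: one knee straight, the other folded (~35 or ~335 degrees).
    if (160 < left_knee < 195 and (25 < right_knee < 45 or 315 < right_knee < 345)) or \
       (160 < right_knee < 195 and (25 < left_knee < 45 or 315 < left_knee < 345)):
        label = 'Tree Pose'

    if label != 'Unknown Pose':
        color = (0, 255, 0)
    cv2.putText(output_image, label, (10, 30), cv2.FONT_HERSHEY_PLAIN, 2, color, 2)

    if display:
        plt.figure(figsize=[10, 10])
        plt.imshow(output_image[:, :, ::-1])
        plt.axis('off')
        plt.show()
    else:
        return output_image, label
```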

Now we will utilize the function created above to perform pose classification on a few images of people and display the results.

Warrior II Pose

The Warrior II Pose (also known as Virabhadrasana II) is the same pose that the person is making in the image above. It can be classified using the following combination of body part angles:

  • Around 180° at both elbows
  • Around 90° angle at both shoulders
  • Around 180° angle at one knee
  • Around 90° angle at the other knee

Tree Pose

Tree Pose (also known as Vrikshasana) is another yoga pose for which the person has to keep one leg straight and bend the other leg at a required angle. The pose can be classified easily using the following combination of body part angles:

  • Around 180° angle at one knee
  • Around 35° (if right knee) or 335° (if left knee) angle at the other knee

Now to understand it better, you should go back to the pose classification function above to overview the classification code of this yoga pose.

We will perform pose classification on a few images of people in the tree yoga pose and display the results using the same function we had created above.

T Pose

T Pose (also known as a bind pose or reference pose) is the last pose we are dealing with in this lesson. To make this pose, one has to stand up like a tree with both hands wide open as branches. The following body part angles are required to make this one:

  • Around 180° at both elbows
  • Around 90° angle at both shoulders
  • Around 180° angle at both knees

You can now go back to go through the classification code of this T pose in the pose classification function created above.

Now, let’s test the pose classification function on a few images of the T pose.

So the function is working pretty well on all the known poses on images; let’s try it on an unknown pose called the cobra pose (also known as Bhujangasana).

Now if you want you can extend the pose classification function to make it capable of identifying more yoga poses like the one in the image above. The following combination of body part angles can help classify this one:

  • Around 180° angle at both knees
  • Around 105° (if the person is facing right side) or 240° (if the person is facing left side) angle at both hips

Pose Classification On Real-Time Webcam Feed

Now we will test the function created above to perform the pose classification on a real-time webcam feed.

Output:


Summary:

Today, we learned about a very popular vision problem called pose detection. We briefly discussed popular computer vision problems then we saw how mediapipe has implemented its pose detection solution and how it used a 2 step detection + tracking pipeline to speed up the process.

After that, we saw step by step how to do real-time 3d pose detection with mediapipe on images and on webcam.

Then we learned to calculate angles between different landmarks and then used some heuristics to build a classification system that could determine 3 poses, T-Pose, Tree Pose, and a Warrior II Pose.

Alright, here are some limitations of our pose classification system: it has too many conditions and checks. For our case it’s not that complicated, but if you throw in a few more poses this system can easily get confusing and complicated. A much better method is to train an MLP (a simple multi-layer perceptron) using Keras on landmark points from a few target pose pictures and then classify them. I’m not sure, but I might create a separate tutorial for that in the future.

Another issue that I briefly went over was that the pose detection model in mediapipe is only able to detect a single person at a time, now this is fine for most pose-based applications but can prove to be an issue where you’re required to detect more than one person. If you do want to detect more people then you could try other popular models like PoseNet or OpenPose.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Building 4 Applications Using Real-Time Selfie Segmentation in Python


Watch Video Here

 

Note: Video Tutorial for this post will soon be released on Bleed AI’s YouTube Channel

In this tutorial, you’ll learn how to do Real-Time Selfie Segmentation using Mediapipe in Python and then build the following 4 applications.

  1. Background Removal/Replacement
  2. Background Blur
  3. Background Desaturation
  4. Convert Image to Transparent PNG

And not only will these applications work on images but I’ll show you how to apply these to your real-time webcam feed running on a CPU.

Selfie Segmentation
Figure 1

Also, the model that we’ll use is almost the same one that Google Hangouts is currently using to segment people. So yes, we’re going to be learning a state-of-the-art approach for segmentation.

And on top of that, the code for building all 4 applications will be ridiculously simple.

Interested yet? Then keep reading this full post.

In the first part of this post, we’ll understand the problem of image segmentation and its types, then we’ll understand what selfie segmentation is. After that, we’ll take a look at Mediapipe and how to do selfie segmentation with it. And finally how to build all those 4 applications.

What is Image Segmentation?

If you’re somewhat familiar with computer vision basics then you might be familiar with image segmentation, a very popular problem in Computer Vision.

Just like in an object detection task, where you localize objects in the image and draw boxes around them, in a segmentation task you’re almost doing the same thing, but instead of drawing a bounding box around each object, you’re trying to segment or draw out the exact boundary of each target object.

Figure 2: A segmentation model is trying to segment out the busses in the image above

In other words, in segmentation, you’re trying to divide the image into groups of pixels based on some specific criteria.

So an image segmentation algorithm will take an input image and output groups of pixels, each group will belong to some class. Normally this output is actually an image mask where each pixel consists of a single number indicating the class it belongs to.

Now the task of image segmentation can be divided into several categories, let’s understand each of them.

  1. Semantic Segmentation.
  2. Instance Segmentation
  3. Panoptic Segmentation
  4. Saliency Detection.

What is Semantic Segmentation?

In this type of segmentation, our task is to assign a class label (pedestrian, car, road, tree etc.)  to every pixel in the image.

Figure 3

As you can see all the objects in the image, including the buildings, sky, sidewalk are labeled by a certain color indicating that they belong to a certain class e.g all cars are labeled blue, people are labeled red, and so on.

It’s worth noting that although we can extract any individual class (e.g. we can extract all cars by looking for blue pixels), we cannot distinguish between different instances of the same class, e.g. you can’t reliably say which blue pixel belongs to which car.

What is Instance Segmentation?

Another common category of segmentation is called Instance Segmentation. Here the goal is not to label all pixels in the image but to label only selected classes that the model was trained on (e.g. cars, pedestrians, etc.).

Figure 4

As you can see in the image, the algorithm ignored the roads, sky, buildings etc. so here we’re only interested in labeling specific classes.

One other major difference in this approach is that we’re also differentiating between different instances of the same class, i.e. you can tell which pixel belongs to which instance of a class (for example, which car), and so on.

What is Panoptic Segmentation?

If you’re a curious cat like me, you might wonder, well isn’t there an approach that,

A) Labels all pixels in the image like semantic segmentation.

B) And also differentiates between instances of the same class like instance segmentation.

Well, Yes there is! And it’s called Panoptic Segmentation. Where not only every pixel is assigned a class but we can also differentiate between different instances of the same class, i.e. we can tell which pixel belongs to which car.

Figure 5

This type of segmentation is the combination of both instance and semantic segmentation.

What is Saliency Detection?

Don’t be confused by the word “Detection” here; although Saliency Detection is not generally considered one of the core segmentation methods, it’s still essentially a major segmentation technique.

So here the goal is to segment out the most salient/prominent (things that stand out ) features in the image.

Figure 6

And this is done regardless of the class of the object. Here’s another example.

Figure 7

As you can see the most obvious object in the above image is the cat, which is exactly what’s being segmented out here.

So in saliency detection, we’re trying to segment out the most standout features in the image.


Selfie Segmentation:

Alright now that we have understood the fundamental segmentation techniques out there, let’s try to understand what selfie segmentation is.

Figure 8

Well, obviously it’s a no brainer, it’s a segmentation technique that segments out people in images.

Figure 9

You might think, how is this different from semantic or instance Segmentation?

Well, to put it simply, you can consider selfie segmentation as a sort of a mix between semantic segmentation and Saliency detection.

What do I mean by that?

Take a look at the example output of Selfie segmentation on two images below.

Figure 8

In the first image (top) the segmentation is done perfectly, as every person is on a similar scale and prominent in the image, whereas in the second image (bottom) the woman is prominent and is segmented out correctly while her colleagues in the background are not segmented properly.

This is why the technique is called selfie segmentation, it tries to segment out prominent people in the image, ideally everyone to be segmented should be on a similar scale in the image.

This is why I said that this technique is sort of a mix between saliency detection and semantic segmentation.

Now, you might think why do we even need to use another segmentation technique, why not just segment people using semantic or instance segmentation methods.

Well, Actually we could do that. Models like Mask-RCNN, DeepLabv3, and others are really good at segmenting people.

But here’s the problem.

Although these models provide state-of-the-art results, they are actually really slow; they aren’t a good fit when it comes to real-time applications, especially on CPUs.

This is why the selfie segmentation model that we’ll use today is specifically designed to segment people and also run at real-time speed on CPUs and other low-end hardware. It’s built on a slight modification of the MobileNetV3 model, which itself contains clever algorithmic innovations for maximum speed and performance gains. To understand more about the algorithmic advances in this model, you can read Google AI’s blog post on it.

So what are the use cases for Selfie Segmentation?

The most popular use case for this problem is Video Conferencing. In fact, Google Hangouts is using approximately the same model that we’re going to learn to use today.

Figure 9

You can read the Google AI Blog release about this here.

Besides Video Conferencing, there are several other use cases for this model that we’re going to explore today.

MediaPipe:

Mediapipe is a cross-platform tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media.

This is the tool that we’ll be using today in order to use the selfie segmentation model.  In future tutorials I’ll also be covering the usage of a few other models and make interesting applications out of them. So Stay tuned for those blog posts at Bleed AI.

Alright Now let’s start with the Code!


Selfie Segmentation Code:

To get started with Mediapipe, you first need to run the following command to install it

pip install mediapipe

Import the Libraries

Now let’s start by importing the required libraries.
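For reference, a minimal, assumption-based set of imports for the applications below:

```python
import cv2
import numpy as np
import mediapipe as mp
import matplotlib.pyplot as plt
```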


Initialize the Selfie Segmentation Model

The first thing that you need to do is initialize the selfie segmentation class using mp.solutions.selfie_segmentation and then call the setup function .SelfieSegmentation(). There are two models for segmentation in mediapipe: by passing in 0 you will be using the general model, i.e. the input is resized to 256x256x3 (height, width, channels), and by passing 1 you will be using the landscape model, i.e. the input is resized to 144x256x3 (height, width, channels).

You should select the type of model by taking into account the aspect ratio of the original image, although the landscape model is a bit faster. These models automatically resize the input image before passing it through the network, and the size of the output image representing the segmentation mask for both models will be the same as the resized input, that is, 256x256x1 or 144x256x1.
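A minimal sketch of that initialization (passing 0 selects the general model described above):

```python
# Initialize mediapipe's selfie segmentation class.
mp_selfie_segmentation = mp.solutions.selfie_segmentation

# Set up the segmentation model (0 = general model, 1 = landscape model).
segment = mp_selfie_segmentation.SelfieSegmentation(model_selection=0)
```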



Read an Image

Now let’s read a sample image using the function cv2.imread() and display the image using the matplotlib library.


Application 1: Remove/Replace Background

We will start by learning to use selfie segmentation to change the background of images. But first, we will have to convert the image into RGB format, as the MediaPipe library expects images in this format, while the function cv2.imread() reads images in BGR format; we will use the function cv2.cvtColor() to do this conversion.

Then we will pass the image to the MediaPipe Segmentation function which will perform the segmentation process and will return a probability map with pixel values near 1 for the indexes where the person is located in the image and pixel values near 0 for the background.

Notice that we have some gray areas in the map; this signifies that there are areas where the model was not sure whether it was the background or the person. So now what we need to do is some thresholding: set all pixels above a certain confidence to white and all other pixels to black.

So in this step, we’re going to threshold the map above to get a binary black-and-white mask with a pixel value of 1 for the indexes where the person is located and 0 for the background.

Now we will use the numpy.where() function to create a new image that takes the pixel values from the original sample image at the indexes where the mask image has the value 1 (white areas) and replaces the areas where the mask has the value 0 (black areas) with 255, to give a white background to the object of the sample image. Right now we’re just adding a white (255) background, but later on we’ll add a separate image as the background.

But to create the required output image we will first have to convert the mask image (one channel) into a three-channel image using the function numpy.dstack(), as the function numpy.where() needs all of its inputs to have an equal number of channels.

Now, instead of having a white background, if you want to add another background image, you just need to replace 255 with a background image in the np.where() function.
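Putting those steps together, a minimal sketch could look like this (the image paths and the 0.9 threshold are assumptions):

```python
sample_img = cv2.imread('media/sample.jpg')
background_img = cv2.imread('media/background.jpg')

# The model expects RGB input, so convert from OpenCV's BGR format.
result = segment.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

# Threshold the probability map into a binary mask and stack it into three
# channels, since numpy.where() needs all inputs to have the same channels.
binary_mask = result.segmentation_mask > 0.9
binary_mask_3 = np.dstack((binary_mask, binary_mask, binary_mask))

# White background:
output_white = np.where(binary_mask_3, sample_img, 255)

# Or replace the background with another image (resized to match the sample).
background_img = cv2.resize(background_img, (sample_img.shape[1], sample_img.shape[0]))
output_replaced = np.where(binary_mask_3, sample_img, background_img)

plt.figure(figsize=[10, 10])
plt.imshow(output_replaced[:, :, ::-1])
plt.axis('off')
plt.show()
```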


Create a Background Modification Function

Now we will create a function that will use selfie segmentation to modify the background of an image depending upon the passed arguments (a sketch of such a function follows the list below). The following are the modifications the function will be capable of:

  • Change Background: The function will replace the background of the image with a different provided background image OR it will make the background white for the cases when a separate background image is not provided.
  • Blur Background: The function will segment out the prominent person and then blur out the background.
  • Desaturate Background: The function will desaturate (convert to grayscale) the background of the image, giving the image a very interesting effect.
  • Transparent Background: The function will make the background of the image transparent.
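Here is one way such a function could be sketched. The argument names, the default segmentation threshold of 0.9 and the Gaussian blur kernel size are assumptions, and a background image passed in is expected to already match the input image’s size.

```python
def modifyBackground(image, background_image=255, blur=95, threshold=0.9,
                     display=True, method='changeBackground'):
    '''Modify the background of an image using selfie segmentation (a sketch).'''
    # Get the segmentation probability map (the model expects RGB input).
    result = segment.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # Threshold the map into a binary, three-channel mask of the person.
    binary_mask = result.segmentation_mask > threshold
    binary_mask_3 = np.dstack((binary_mask, binary_mask, binary_mask))

    if method == 'changeBackground':
        # Replace the background with the provided image (or white by default).
        output_image = np.where(binary_mask_3, image, background_image)
    elif method == 'blurBackground':
        # Keep the person sharp and blur everything else.
        blurred = cv2.GaussianBlur(image, (blur, blur), 0)
        output_image = np.where(binary_mask_3, image, blurred)
    elif method == 'desatureBackground':
        # Convert the background to grayscale (stacked back to three channels).
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        output_image = np.where(binary_mask_3, image, np.dstack((gray, gray, gray)))
    elif method == 'transparentBackground':
        # Use the segmentation mask as an alpha channel.
        alpha = (result.segmentation_mask * 255).astype(np.uint8)
        output_image = np.dstack((image, alpha))
    else:
        return

    if display:
        plt.figure(figsize=[10, 10])
        if method == 'transparentBackground':
            plt.imshow(output_image[:, :, [2, 1, 0, 3]])   # BGRA -> RGBA
        else:
            plt.imshow(output_image[:, :, ::-1])           # BGR -> RGB
        plt.axis('off')
        plt.show()
    else:
        return output_image, binary_mask_3
```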

Now we will utilize the function created above with the argument method='changeBackground' to change the backgrounds of a few sample images and check the results.


Change Background On Real-Time Web-cam Feed

The results on the images look great, but how will the function we created above fare when applied to our real-time webcam feed? Well, let’s check it out. In the code below, we will swap out different background images by pressing the key b on the keyboard.
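A sketch of what that loop could look like, assuming a few background images saved on disk (the paths are placeholders):

```python
camera_video = cv2.VideoCapture(0)

# Load a few background images and start with the first one.
backgrounds = [cv2.imread(f'media/background{i}.jpg') for i in range(1, 4)]
bg_index = 0

while camera_video.isOpened():
    ok, frame = camera_video.read()
    if not ok:
        continue

    frame = cv2.flip(frame, 1)
    background = cv2.resize(backgrounds[bg_index], (frame.shape[1], frame.shape[0]))

    output_frame, _ = modifyBackground(frame, background_image=background,
                                       display=False, method='changeBackground')
    cv2.imshow('Video', output_frame)

    k = cv2.waitKey(1) & 0xFF
    if k == ord('b'):                      # press 'b' to swap the background
        bg_index = (bg_index + 1) % len(backgrounds)
    elif k == 27:                          # press Esc to quit
        break

camera_video.release()
cv2.destroyAllWindows()
```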

Output:


Woah, that was cool! Not only are the results great, but the model is also pretty fast.


Video on Video Background Replacement:

Let’s take this one step further and instead of changing the background by an image, let’s replace it with a video loop.

Output:

That was pretty interesting, now that you’ve learned how to segment the background successfully it’s time to make use of this skill and create some other exciting applications out of it.


Application 2: Apply Background Blur

Now this application will actually save you a lot of money.

How?

Well, remember those expensive DSLR or mirrorless cameras that blur out the background? Today you’ll learn to achieve the same effect, in fact even better, by just using your webcam.

So now we will use the function created above to segment out the prominent person and then blur out the background.

All we need to do is blur the original image using cv2.GaussianBlur() and then, instead of replacing the background with a new image (like we did in the previous application), replace it with this blurred version of the image. This way the segmented person will retain its original form while the rest of the image will be blurred out.

Now let’s call the function with the argument method='blurBackground' over some samples. You can control the amount of blur by adjusting the blur variable.


Background Blur On Video

Now we will utilize the function created above in a real-time webcam feed where we will be able to blur the background.

Output:


Application 3: Desaturate Background

Now we will use the function created above to desaturate (convert to grayscale) the background of the image. Again the only new thing that we’re doing here is just replacing the black parts of the segmented mask with the grayscale version of the original image.

We will have to pass the argument method='desatureBackground' this time, to desaturate the backgrounds of a few sample images.


Background Desaturation On Video

Now we will utilize the function created above in a real-time webcam feed where we will be able to desaturate the background of the video.

Output:


Application 4: Convert an Image to have a Transparent Background

Now we will use the function created above to segment out the prominent person and then make the background of the image transparent and after that we will store the resultant image into the disk using the function cv2.imwrite().

To create an image with a transparent background (four-channel image) we will need to add another channel called alpha channel to the original image, this channel is a mask which decides which part of the image needs to be transparent and can have values from 0 (black) to 255 (white) which determine the level of visibility. Black (0) acts as the transparent area and white (255) acts as the visible area.

So we just need to add the segmentation mask to the original image.

We will have to pass the argument method='transparentBackground' to the function to get an image with transparent background.
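For example (the paths are placeholders), the call and the save step might look like this:

```python
image = cv2.imread('media/sample.jpg')
output_image, _ = modifyBackground(image, display=False,
                                   method='transparentBackground')

# Save as PNG so the alpha channel (transparency) is preserved.
cv2.imwrite('media/output.png', output_image)
```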

You can go to the location where the image is saved, open it up with an image viewer and you’ll see that the background is transparent.

Further Resources

Note: These models work best for the scenarios where the person is close (< 2m) to the camera.


Bleed AI Needs Your Support!

Hi Everyone, Taha Anwar (Founder Bleed AI) here. If my blog posts or videos have helped you in any way in your Computer Vision/AI/ML/DL Learning journey then remember you can help us out too.

Bleed AI patreon.

Publishing free, high-quality Computer Vision tutorials for you guys, so that you can build projects, land your dream job, or maybe build a startup, is our core mission at Bleed AI. But every single post takes a lot of effort and man-hours, and in order to keep publishing free high-end tutorials, my team and I need your support on Patreon; plus you will get some extra perks too.



Summary:

Alright, So today we did a lot!

We understood the basic terminology regarding different segmentation techniques; in summary:

  • Image Segmentation: The task of dividing pixels into groups of pixels based on some criteria
  • Semantic Segmentation: In this type we assign a class label to every pixel in the image.
  • Instance Segmentation: Here we assign a class label to only selective classes in the image.
  • Panoptic Segmentation: This approach combines both semantic and instance segmentation.
  • Saliency Detection: Here we’re just interested in segmenting prominent objects in the image regardless of the class.
  • Selfie Segmentation: Here we want to segment prominent people in the image.

We also learned that Mediapipe is an awesome tool to use various ML models in real-time. Then we learned how to perform selfie segmentation with this tool and build 4 different useful applications from it. These applications were:

  • How to remove/replace backgrounds in images & videos.
  • How to desaturate the background to make the person pop out in an image or a video.
  • How to blur out the background.
  • How to give an image a transparent background and save it.

This was my first Mediapipe tutorial and I’m planning to write a tutorial on a few other models too. If you enjoyed this tutorial then do let me know in the comments! You’ll definitely get a reply from me  


Hire Us

Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies


Things You Must Know About OpenCV, Revealing my Best Tips from Years of Experience


Watch the Full Video Here:

Today’s video tutorial is the one I wish I had access to when I was starting out in OpenCV. In this video, I reveal some very interesting information about OpenCV, including great tips on where to find the right resources and tutorials for the library.

I’ll start by briefly going over the history of OpenCV and then talk about other exciting topics.

Some of the things I will go through in this video

👉How to navigate the opencv docs to find what you’re looking for.
👉How to get details regarding any OpenCV function.
👉The differences between the C++ and python version of OpenCV and which one you should work with.
👉Pip installation of OpenCV vs Source installation.
👉Where to ask questions regarding OpenCV when you’re stuck.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI