Super Resolution, Going from 3x to 8x Resolution in OpenCV


A few weeks ago I published a tutorial on doing Super-resolution with OpenCV using the DNN module.

I would recommend that you go over that tutorial first, but you can still easily follow along with this one. For those of you who don’t know what super resolution is, here is an explanation.

Super resolution can be defined as the class of algorithms that upscale an image without losing quality, meaning you take a low-resolution image, say 224×224, and upscale it to a high-resolution version, say 1792×1792 (an 8x resolution), without any loss in quality. How cool is that?

Anyway, that is super resolution. So how is it different from the normal resizing you do?

When you normally resize or upscale an image you use Nearest Neighbor Interpolation. This just means you expand the pixels of the original image and then fill the gaps by copying the values of the nearest neighboring pixels.

The result is a pixelated version of the image.

There are better interpolation methods for resizing, like bilinear or bicubic interpolation, which take a weighted average of neighboring pixels instead of just copying them.

Still the results are blurry and not great.
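
For a quick feel of the difference, here's a minimal sketch comparing nearest neighbor and bicubic upscaling with cv2.resize (the image path is just an example):

import cv2

# Load a low-resolution image (example path).
img = cv2.imread("Media/nature.jpg")

# Upscale 4x with nearest neighbor interpolation (pixelated result).
nearest = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_NEAREST)

# Upscale 4x with bicubic interpolation (smoother, but still blurry).
bicubic = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

cv2.imwrite("nearest_4x.png", nearest)
cv2.imwrite("bicubic_4x.png", bicubic)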

Super resolution methods enhance/enlarge the image without loss of quality. Again, for more details on the theory behind super resolution methods, I would recommend that you read my Super Resolution with OpenCV tutorial.

In that tutorial I describe several architectural improvements that SR networks have gone through over the years.

But unfortunately, in that tutorial I only showed you a single SR model. It was good, but it only did 3x resolution, and it came from the 2016 paper “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network”.

That all changes now: in this tutorial we will work with multiple models, including ones that do 8x resolution.

Today, we won’t be using the DNN module directly. We could, but for the super resolution problem OpenCV comes with a special module called dnn_superres which is designed to work with 4 different powerful super resolution networks. One of the best things about this module is that it does the required pre and post processing internally, so with only a few lines of code you can do super resolution.

The 4 models we are going to use are:

  • EDSR: Enhanced Deep Residual Network from the paper Enhanced Deep Residual Networks for Single Image Super-Resolution (CVPR 2017) by Bee Lim et al.

  • ESPCN: Efficient Subpixel Convolutional Network from the paper Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (CVPR 2016) by Wenzhe Shi et al.

  • FSRCNN: Fast Super-Resolution Convolutional Neural Networks from the paper Accelerating the Super-Resolution Convolutional Neural Network (ECCV 2016) by Chao Dong et al.

  • LapSRN: Laplacian Pyramid Super-Resolution Network from the paper Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution (CVPR 2017) by Wei-Sheng Lai et al.

Here are the papers for the models and some extra resources.

Make sure to download the zip folder from the download code section above. If you click the Download models link, you will see that each model comes in different versions like 3x, 4x, etc., meaning the model can perform 3x resolution, 4x resolution of the image, and so on. The download zip that I provide contains only a single version of each of the 4 models above.

Feel free to test out other models by downloading them. The model files should be present in your working directory if you want to use them with the dnn_superres module.

Now, the inclusion of this super easy to use dnn_superres module is the result of the work of 2 developers, Xavier Weber and Fanny Monori, who developed it as part of their GSoC (Google Summer of Code) projects. GSoC 2019 also made NVIDIA GPU support possible.

It’s always amazing to see how a summer project for students by google brings forward some great developers making awesome contributions to the largest Computer Vision library out there.

The dnn_superres module in OpenCV was included in version 4.1.2 for C++, but the Python wrappers were added in version 4.3 about a month back, so you have to make sure that you have OpenCV version 4.3 installed. And of course, since this module is part of the contrib modules, make sure you have also installed the OpenCV contrib package.

[UPDATE 7/8/2020, OPENCV 4.3 IS NOW PIP INSTALLABLE]

Note: You can’t install OpenCV 4.3 by doing a pip install, as the latest version of opencv-contrib-python on pip is still 4.2.0.34.

The PyPI version of OpenCV is maintained by just one person, Olli-Pekka Heinisuo (username skvark), who updates the PyPI OpenCV packages in his free time. Currently he’s facing a compiling issue, which is why the 4.3 version has not come out as of 7-15-2020. But from what I have read, he will be building the .whl files for the 4.3 version soon; it may be out this month. If that happens, I’ll update this post.

So right now the only way you will be able to use this module is if you have installed OpenCV 4.3 from Source. If you haven’t done that then you can easily follow my installation tutorial.

I should also take this moment to highlight that you should not always rely on OpenCV’s PyPI package. No doubt skvark has been doing a tremendous job maintaining OpenCV’s PyPI repo, but this issue tells you that you can’t rely on a single developer’s free time to update the library for production use cases; learn to install the official library from source. Still, pip install opencv-contrib-python is a huge blessing for people starting out or in the early stages of learning OpenCV, so hats off to skvark.
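
Once you do have a suitable build installed, a quick sanity check like the following confirms that the Python wrappers actually expose the module:

import cv2

# Should print 4.3.0 or higher for the dnn_superres python wrappers to exist.
print(cv2.__version__)

# This import only succeeds if the contrib dnn_superres module is available.
from cv2 import dnn_superres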

As you might have noticed, among the 4 models above we have already learned to use ESPCN in the previous tutorial. We will use it again, but this time with the dnn_superres module.

Super Resolution with dnn_superres Code


Directory Hierarchy

After downloading the zip folder, unzip it and you will have the following directory structure.

This is what our directory structure looks like: it has a Jupyter notebook, a Media folder with images, and a models folder containing all 4 models.

Super resolution_Going_from_3x_to_8x_Resolution
    │   Super resolution_Going_from_3x_to_8x_Resolution.ipynb
    │
    ├───Media
    │       bird2.JPG
    │       butterfly.JPG
    │       demo1.mp4
    │       fman1.JPG
    │       manh.JPG
    │       nature.JPG
    │       road.jpg
    │
    ├───models
    │       EDSR_x4.pb
    │       ESPCN_x4.pb
    │       FSRCNN_x3.pb
    │       LapSRN_x8.pb
    │
    └───outputs
            enhanced_bird2.jpg
            enhanced_butterfly.jpg
            enhanced_fman.jpg
            enhanced_manh.jpg
            enhanced_road.jpg
            fman_enhanced.jpg
            testoutput.png

You can now run the notebook Super resolution_Going_from_3x_to_8x_Resolution.ipynb and start executing each cell as follows.

Import Libraries

Start by Importing the required libraries.

import cv2
import numpy as np
import matplotlib.pyplot as plt
import os
import time

from cv2 import dnn_superres


Initialize the Super Resolution Object

First, you have to instantiate the super resolution object using the following command.

# Create an SR object
sr = dnn_superres.DnnSuperResImpl_create()


Read Image

We will start by reading and displaying a sample image. We will be running the EDSR model (with 4x scale) to upscale this image.

# Read the image
image = cv2.imread("Media/nature.jpg")

# Display image
plt.figure(figsize=[12,12])
plt.imshow(image[:,:,::-1], interpolation = 'bicubic');plt.axis('off');


Extracting Model Name & Scale

In the next few steps, we will be using a setModel() function in which we pass the model’s name and its scale. We could type these in manually, but all this information is already present in the model’s path, so we just need to extract the model’s name and scale using some simple text processing.

# Define model path, if you want to use a different model then just change this path.
model_path = "models/EDSR_x4.pb"

# Extract model name, get the text between '/' and '_'
model_name = model_path.split('/')[1].split('_')[0].lower()

# Extract model scale
model_scale = int(model_path.split('/')[1].split('_')[1].split('.')[0][1])

# Display the name and scale
print("model name: "+ model_name)
print("model scale: " + str(model_scale))

model name: edsr
model scale: 4
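
Side note: splitting on '/' assumes forward slashes in the path. If you build paths with os.path.join on Windows, a slightly more portable sketch would be:

import os

model_path = "models/EDSR_x4.pb"

# "EDSR_x4" -> name "edsr" and scale 4, regardless of the path separator used.
file_name = os.path.splitext(os.path.basename(model_path))[0]
model_name = file_name.split('_')[0].lower()
model_scale = int(file_name.split('_')[1][1:])

print(model_name, model_scale)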


Reading the model

Finally we will read the model; this is where all the required weights of the model get loaded. This is equivalent to the DNN module’s readNet() function.

# Read the desired model
sr.readModel(model_path)


Setting Model Name & Scale

Here we are setting the name and scale of the model which we extracted above.

Why do we need to do that ?

Remember when I said that this module does not require us to do preprocessing or postprocessing because it does that internally? Well, in order to set up the correct pre and post-processing pipelines, the module needs to know which model we will be using and at what scale (2x, 3x, 4x, etc.).

# Set the desired model and scale to get correct pre-processing and post-processing
sr.setModel(model_name, model_scale)


Running the Network

This is where all the magic happens. In this line a forward pass of the network is performed along with required pre and post-processing. We are also making note of the time taken as this information will tell us if the model can be run in real-time or not.

As you can see it takes a lot of time, in fact, EDSR is the most expensive model out of the four in terms of computation.

It should be noted that the larger your input image’s resolution is, the more time this step is going to take.

%%time
# Upscale the image
Final_Img = sr.upsample(image)

Wall time: 45.1 s

Check the Shapes

We’re also checking the shapes of the original image and the super resolution image. As you can see the model upscaled the image by 4 times.

print('Shape of Original Image: {} , Shape of Super Resolution Image: {}'.format(image.shape, Final_Img.shape))

Shape of Original Image: (262, 347, 3) , Shape of Super Resolution Image: (1200, 1200, 3)


Comparing the Original Image & Result

Finally we will display the original image along with its super resolution version. Observe the difference in Quality.

# Display Image
plt.figure(figsize=[23,23])
plt.subplot(2,1,1);plt.imshow(image[:,:,::-1], interpolation = 'bicubic');plt.title("Original Image");plt.axis("off");
plt.subplot(2,1,2);plt.imshow(Final_Img[:,:,::-1], interpolation = 'bicubic');
plt.title("SR Model: {}, Scale: {}x ".format(model_name.upper(),model_scale)); plt.axis("off");


Save the High Resolution Image

Although you can see the improvement in quality, you still can’t observe the true difference with matplotlib, so it’s recommended that you save the SR image to disk and then look at it.

# Save the image
cv2.imwrite("outputs/testoutput.png", Final_Img);


Creating Functions

Now that we have seen a step-by-step implementation of the whole pipeline, we’ll create the following 2 Python functions so we can use different models on different images by just calling a function and passing some parameters.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.

Main Function: This function will contain the rest of the code. It will also have the option to either return the image or display it with matplotlib. We can also use this function to process a real-time video.

Initialization Function

def init_super(model, base_path='models'):
    
    # Define global variable
    global sr, model_name, model_scale
    
    # Create an SR object
    sr = dnn_superres.DnnSuperResImpl_create()
    
    # Define model path
    model_path = os.path.join(base_path , model +".pb")
    
    # Extract model name from model path
    model_name = model.split('_')[0].lower()
    
    # Extract model scale from model path
    model_scale = int(model.split("_")[1][1])
        
    # Read the desired model
    sr.readModel(model_path)
    
    sr.setModel(model_name, model_scale)


Main Function

Set returndata = True when you just want the image back. This is usually done when I’m working with videos. I’ve also added a few more optional parameters to the function.

print_shape: This variable decides whether to print out the shape of the model’s output.

name: This is the name under which the image will be saved to disk.

save_img: This variable decides whether to save the image to disk or not.

def super_res(image, returndata=False, save_img=True, name='test.png', print_shape=True):
    
    # Upscale the image
    Final_Img = sr.upsample(image)
    
    if  returndata:
        return Final_Img
    
    else:
        
        if print_shape:
            print('Shape of Original Image: {} , Shape of Super Resolution Image: {}'.format(image.shape, Final_Img.shape))
            
            
        if save_img:
            cv2.imwrite("outputs/" + name, Final_Img)
        
        
        plt.figure(figsize=[25,25])
        plt.subplot(2,1,1);plt.imshow(image[:,:,::-1], interpolation = 'bicubic');plt.title("Original Image");plt.axis("off");
        plt.subplot(2,1,2);plt.imshow(Final_Img[:,:,::-1], interpolation = 'bicubic');
        plt.title("SR Model: {}, Scale: {}x ".format(model_name.upper(), model_scale)); plt.axis("off");

Now that we have created the initialization function and the main function, let’s use all 4 models on different examples.

The function above displays the original image along with the SR Image.

Initialize Enhanced Deep Residual Network (EDSR, 4x Resolution)

init_super("EDSR_x4")

Run the network

%%time
image = cv2.imread("Media/bird2.jpg")
super_res(image, name= 'enhanced_bird2.jpg')

Shape of Original Image: (221, 283, 3) , Shape of Super Resolution Image: (884, 1132, 3)
Wall time: 43.1 s

Initialize Efficient Subpixel Convolutional Network (ESPCN, 4x Resolution)

init_super("ESPCN_x4")

Run the network

%%time
image = cv2.imread("Media/road.jpg")
super_res(image, name='enhanced_road.jpg')

Shape of Original Image: (256, 256, 3) , Shape of Super Resolution Image: (1024, 1024, 3)
Wall time: 295 ms

Initialize Fast Super-Resolution Convolutional Neural Networks (FSRCNN, 3x Resolution)

init_super("FSRCNN_x3")

Run the network

%%time
image = cv2.imread("Media/manh.jpg")
super_res(image, name = 'enhanced_manh.jpg')

Shape of Original Image: (232, 270, 3) , Shape of Super Resolution Image: (696, 810, 3)
Wall time: 253 ms

Initialize Laplacian Pyramid Super-Resolution Network (LapSRN, 8x Resolution)

init_super("LapSRN_x8")

Run the network

%%time
image = cv2.imread("Media/butterfly.jpg")
super_res(image, name='enhanced_butterfly.jpg')

Shape of Original Image: (302, 357, 3) , Shape of Super Resolution Image: (2416, 2856, 3)
Wall time: 26 s


Applying Super Resolution on Video

Lastly, I’m also providing the code to run super resolution on videos. The example video I’ve used isn’t great, but it’s the only one I tested on, primarily because I’m mostly interested in doing super resolution on images, as this is where most of my use cases lie. Feel free to test out different models on a real-time feed.

Tip: You might also want to save the high-res video to disk using the VideoWriter class; a small sketch of that follows after the code below.

# Set the fps counter to 0
fps=0

# Initialize the network.
init_super("ESPCN_x4")

# Initialize the VideoCapture object with the video.
cap = cv2.VideoCapture('media/demo1.mp4')


while(True):    
    
    # Note the starting time for fps calculation.
    start_time = time.time()

    # Read frame by frame.
    ret,frame=cap.read() 
    
    # Break the loop if the video ends.
    if not ret:
        break
    
    # Perform SR with returndata = True.       
    image = super_res(frame, returndata=True)
    
    # Put the value of FPS on the video.
    cv2.putText(image, 'FPS: {:.2f}'.format(fps), (10, 20), cv2.FONT_HERSHEY_SIMPLEX,0.8, (255, 20, 55), 1)

    # Show the current frame.
    cv2.imshow("Super Resolution", image)
    
    # Wait 1 ms and calculate the fps.
    k = cv2.waitKey(1)
    fps= (1.0 / (time.time() - start_time))
    
    # If the user presses the `q` button then break the loop.
    if k == ord('q'):
        break

# Release the camera and destroy all the windows.
cap.release() 
cv2.destroyAllWindows() 
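
Following the tip above, here's a minimal sketch of writing the upscaled frames to disk with the VideoWriter class. It reuses init_super and super_res from the cells above; the output path, codec, and FPS are just example choices, and the output size must match the upscaled frame size.

# Assumes init_super(...) has already been called, e.g. init_super("ESPCN_x4").
cap = cv2.VideoCapture('media/demo1.mp4')

# Grab one frame to figure out the upscaled output size.
ret, frame = cap.read()
sr_frame = super_res(frame, returndata=True)
height, width = sr_frame.shape[:2]

# Create the writer with an example codec and FPS.
out = cv2.VideoWriter('outputs/enhanced_demo1.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 30, (width, height))
out.write(sr_frame)

# Upscale and write the remaining frames.
while True:
    ret, frame = cap.read()
    if not ret:
        break
    out.write(super_res(frame, returndata=True))

cap.release()
out.release()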


Conclusion

Here’s a chart for benchmarks using a 768×512 image with 4x resolution on an Intel i7-9700K CPU for all models.

The benchmark shows PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index measure) scores; these scores measure how good the super res network’s output is.

The best performing model is EDSR, but it also has the slowest inference time; the rest of the models can work in real time.

For detailed benchmarks you can see this page. Also make sure to check the official OpenCV contrib page on the dnn_superres module.

If you thought upscaling to 8x resolution was cool, then take a guess at the scaling ability of the current state-of-the-art algorithm in super resolution.

Believe it or not, the state of the art in SR can actually do 64x resolution…yes, 64x, that wasn’t a typo.

In fact, the model that does 64x was published just last month; here’s the paper for that model, here’s the GitHub repo, and here is a ready-to-run Colab notebook to test out the code. Also, here’s a video demo of it. It’s pretty rare that such good stuff is easily accessible to programmers just a month after publication, so make sure to check it out.

The model is far too complex to explain in this post, but the authors took a totally different approach: instead of using supervised learning they used self-supervised learning (which seems to be on the rise).

You’ll come across many Computer Vision courses out there, but nothing beats a 1 on 1 video call support from an expert in the field. Plus there is a plethora of subfields and tons of courses on AI and computer vision out there, you need someone to lay out a step-by-step learning path customized to your needs. This is where I come in, whether you need monthly support or just want to have a one-time chat with me, I’ve got you covered. Check all the coaching details and packages here

Ready? Get Started on 1×1 Coaching here.

Summary: 

In today’s tutorial we learned to use 4 different architectures to do Super resolution going from 3x to 8x resolution. 

Since the library handles the preprocessing and postprocessing, the code for all the models was almost the same and pretty short.

As I mentioned earlier, I only showed you the results of a single version of each model; you should go ahead and try the other versions of each model.

These models have been trained using the DIV2K, BSDS, and General100 datasets, which contain images of diverse objects, but the best results from a super resolution model are obtained by training it for a domain-specific task. For example, if you want the SR model to perform best on pedestrians, then your dataset should consist mostly of pedestrian images. The best part about training SR networks is that you don’t need to spend hours doing manual annotation; you can just resize the images and you’re all set.

I would also raise a concern regarding these models: we must be careful using SR networks. For example, consider this scenario:

 You caught an image of a thief stealing your mail on your low res front door cam, the image looks blurry and you can’t make out who’s in the image.

Now you being a Computer Vision enthusiast thought of running a super res network to get a clearer picture.
After running the network, you get a much clearer image and you can almost swear that it’s Joe from the next block.

The same Joe that you thought was a friend of yours.

The same Joe that made different poses to help you create a pedestrian dataset for that SR network you’re using right now.

How could Joe do this?

Now you feel betrayed but yet you feel really Smart, you solved a crime with AI right?

You Start STORMING to Joe’s house to confront him with PROOF.

Now hold on! … like really hold on.

Don’t do that, seriously don’t do that.

Why did I go on a rant like that?

Well, to be honest, back when I initially learned about SR networks that’s almost exactly what I thought I would do: solve crimes with AI by doing just that (I know, it was a ridiculous idea). But I soon realized that SR networks only learn to hallucinate detail based on the data they were trained on; they can’t reconstruct a face they’ve never seen with 100% accuracy. They’re still pretty useful, but you have to use this technology carefully.

I hope you enjoyed this tutorial, feel free to comment below and I’ll gladly reply.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

A Crash Course with Dlib Library, 101 to Mastery



This tutorial will serve as a crash course on the dlib library. Dlib is another powerful computer vision library out there; it is not as extensive as OpenCV, but there is still a lot you can do with it.

This crash course assumes you’re somewhat familiar with OpenCV; if not, I’ve also published a crash course on OpenCV. Make sure to download the Dlib Resource Guide above, which includes all the important links in this post.

Side Note: I missed publishing a tutorial last week as I tested positive for covid and was ill. Still not 100% but getting better 🙂

The dlib library is created and maintained by Davis King. It’s a C++ toolkit containing machine learning and computer vision algorithms for a number of important tasks, including facial landmark detection, deep metric learning, object tracking, and more. It also has a Python API.

Note: It’s worth noting that the main strength of dlib is in numerical optimization, but today I’m only going to focus on applications; you can look at optimization examples here.

It’s a popular library that is used by people in both industry and academia in a wide range of domains including robotics, embedded devices, and other areas.

I plan to cover most of the prominent features and algorithms present in dlib, so this blog post alone can give you the best overview of the library and its functionality. Now, this is a big statement: if I had to explain most of dlib’s features in a single place, I would probably be writing a book or making a course on it, but instead I plan to explain it all in this post.

So how am I going to accomplish that?

So here’s the thing: I’m not going to write and explain the code for each algorithm in dlib, because I don’t want to write a blog post several thousand words long, and also because almost all of the features of dlib have already been explained pretty well in several posts on the internet.

So if everything is already out there, then why the heck am I trying to make a crash course out of it?

So here’s the real added value of this crash course:

In this post, I will connect all the best and most important tutorials on different aspects of dlib in a nice hierarchical order. This will not only serve as a golden dlib 101-to-mastery post for people just starting out with dlib, but will also serve as a well-structured reference guide for dlib users.

The post is split into various sections. In each section, I will briefly explain a useful algorithm or technique present in dlib; if that explanation intrigues you and you feel that you need to explore that particular algorithm further, each section also provides links to high-quality tutorials that go in-depth about that topic. The links are mostly from PyImageSearch and LearnOpenCV, as these are golden sites when it comes to computer vision tutorials.

When learning some topic, ideally we prefer these two things:

  • A Collection of all the useful material regarding the topic presented at one place in a nice and neat hierarchical order.
  • Each material presented and delivered in a high-quality format preferably by an author who knows how to teach it the right way.

In this post, I’ve made sure both of these points are true, all the information is presented in a nice order and the posts that I link to will be of high quality. Other than that I will also try to include other extra resources where I feel necessary. 


Here’s the outline for this crash course:

Installation:

The easiest way to install dlib library is to do:

pip install dlib

This will only work if you have Visual Studio (i.e. you need a C++ compiler) and CMake installed as dlib will build and compile first before installing. If you don’t have these then you can use my OpenCV’s source installation tutorial to install these two things.

If you don’t want to bother installing these, then here’s what you can do: if you have a Python version greater than 3.6, create a virtual environment for Python 3.6 using Anaconda or virtualenv.

After creating a python 3.6 environment you can do:

pip install dlib==19.8.1

This will let you directly install pre-built binaries of dlib but this currently only works with python 3.6 and below.

Extra Resources on dlib:

Installing dlib in Mac, Raspi & Ubuntu.

Face Detection:

Now that we have installed dlib, let’s start with face detection.

Why face detection ?

Well, most of the interesting use cases of dlib in computer vision involve faces, like facial landmark detection, face recognition, etc., so before we can detect facial landmarks, we need to detect faces in the image.

Dlib not only comes with a face detector, it actually comes with 2 of them. If you’re a computer vision practitioner then you’re most likely familiar with the old Haar cascade based face detector. Although this face detector is quite popular, it’s almost 2 decades old and not very effective when it comes to different orientations of the face.

Dlib comes with 2 face detection algorithms that are way more effective than the haar cascade based detectors.

These 2 detectors are:

  • HOG (histogram of oriented gradients) based detector: This detector uses HOG features with a Support Vector Machine. It’s slower than Haar cascades, but it’s more accurate and able to handle different orientations.
  • CNN based detector: This is a really accurate deep learning based detector, but it’s extremely slow on a CPU; you should only use it if you’ve compiled dlib with GPU support.

You can learn more about these detectors here. Other than that, I published a library called bleedfacedetector which lets you use both of these detectors with just a few lines of code through the same interface; the library also has 2 other face detectors, including the Haar cascade one. You can look at bleedfacedetector here.
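
For quick reference, here's a minimal sketch of running both detectors directly with dlib (the CNN weights file mmod_human_face_detector.dat has to be downloaded separately from the dlib model zoo):

import cv2
import dlib

image = cv2.imread("face.jpg")            # example image path
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# HOG + SVM based detector (ships with dlib).
hog_detector = dlib.get_frontal_face_detector()
hog_faces = hog_detector(rgb, 1)          # 1 = upsample the image once

# CNN based detector (needs the pre-trained weights file).
cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
cnn_faces = cnn_detector(rgb, 1)

for rect in hog_faces:
    print(rect.left(), rect.top(), rect.right(), rect.bottom())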

Extra Resources:

Here’s a tutorial on different Face detection methods including the dlib ones.


Facial Landmark Detection:

Now that we have learned how to detect faces in images, we will learn the most common use case of dlib, which is facial landmark detection. With this method you will be able to detect key landmarks/features of the face, like the eyes, lips, etc.

The detection of these features will allow you to do a lot of things, like tracking the movement of the eyes and lips to determine a person’s facial expression, controlling a virtual avatar with your facial expressions, understanding the 3d facial pose of a person, virtual makeovers, face swapping, morphing, etc.

Remember those smart Snapchat overlays which trigger based on the facial movement, like that tongue that pops out when you open your mouth, well you can also make that using facial landmarks.

So it suffices to say that facial landmark detection has a lot of interesting applications.

The landmark detector in dlib is based on the paper “One Millisecond Face Alignment with an Ensemble of Regression Trees”. It’s robust enough to correctly detect landmarks across different facial orientations and expressions, and it easily runs in real time.

The detector returns 68 important landmarks, which can be seen in the image below.

The 68 human face landmark points.

You can read a detailed tutorial on Facial Landmark detection here.

After reading the above tutorial, the next step is to learn to manipulate the ROIs of these landmarks so you can modify or extract individual features like the eyes, nose, lips, etc. You can learn that by reading this tutorial.

After you have gone through both of the above tutorials, you’re ready to run the landmark detector in real time; if you’re still confused about the exact process, then take a look at this tutorial.
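
As a reference while you go through those tutorials, here's a minimal sketch of running the 68-point predictor (shape_predictor_68_face_landmarks.dat is dlib's pre-trained model, downloadable from the dlib website):

import cv2
import dlib

image = cv2.imread("face.jpg")            # example image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

for rect in detector(gray, 1):
    # Predict the 68 landmarks inside the detected face rectangle.
    shape = predictor(gray, rect)
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]

    # Draw each landmark point.
    for (x, y) in points:
        cv2.circle(image, (x, y), 2, (0, 255, 0), -1)

cv2.imwrite("landmarks.jpg", image)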

Extra Resources:

Here’s another great tutorial on Facial Landmark Detection.

Facial Landmark Detection Applications (Blink, yawn, smile detection & Snapchat filters):

After you’re fully comfortable working with facial landmarks that’s when the fun starts. Now you’re ready to make some exciting applications, you can start by making a blink detection system by going through the tutorial here. 

The main idea of a blink detection system is really simple: you just look at 2 vertical landmark points of the eye and take the distance between these points; if the distance is too small (below some threshold), then that means the eye is closed.

Of course, for a robust estimate you won’t just settle for the distance between two points; rather, you will take a smart average of several distances. One smart approach is to calculate a metric called the Eye Aspect Ratio (EAR) for each eye. This metric was introduced in a paper called “Real-Time Eye Blink Detection using Facial Landmarks”.

This will allow you to utilize all 6 x,y landmark points of the eyes returned by dlib, and this way you can accurately tell if there was a blink or not.

Here’s the equation to calculate the EAR.
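
Since the formula is easiest to read as code, here's a small sketch of it: the EAR is the sum of the two vertical eye distances divided by twice the horizontal distance, with the six landmarks ordered p1 to p6 as in the paper.

import numpy as np

def eye_aspect_ratio(eye):
    # eye: NumPy array of shape (6, 2) holding the (x, y) points p1..p6 of one eye.
    vertical_1 = np.linalg.norm(eye[1] - eye[5])   # ||p2 - p6||
    vertical_2 = np.linalg.norm(eye[2] - eye[4])   # ||p3 - p5||
    horizontal = np.linalg.norm(eye[0] - eye[3])   # ||p1 - p4||
    return (vertical_1 + vertical_2) / (2.0 * horizontal)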

The full implementation details are explained in the tutorial linked above.

You can also easily extend the above method to create a drowsiness detector that alerts drivers when they feel drowsy; this can be done by monitoring how long the eyes stay closed. It’s a really simple extension of the above, has real-world applications, and could be used to save lives. Here’s a tutorial that explains how to build a drowsiness detection system step by step.

Interestingly you can take the same blink detection approach above and apply it to lips instead of the eyes, and create a smile detector. Yeah, the only thing you would need to change would be the x,y point coordinates (replace eye points with lip points), the EAR equation (use trial and error or intuition to change this), and the threshold.

A few years back I created this smile camera application with only a few lines of code; it takes a picture when you smile. You can easily create that by modifying the above tutorial.

What more can you create with this ?

How about a yawn detector, or a detector that tells whether the user’s mouth is open or not? You can do this by slightly modifying the above approach: you will be using the same lip (x, y) landmark points, and the only difference will be how you calculate the distance between the points.

Here’s a cool application I built a while back: it’s the infamous Google dino game, controlled by me opening and closing my mouth.

The only drawback of the above application is that I can’t munch food while playing this game.

Taking the same concepts above you can create interesting snapchat overlay triggers. 

Here’s an eye bulge and fire throw filter I created that triggers when I glare or open my mouth.

Similarly you can create lots of cool things using the facial landmarks.

Facial Alignment & Filter Orientation Correction:

Doing a bit of math with the facial landmarks will allow you to do facial alignment correction. Facial alignment allows you to correctly orient a rotated face.

Why is facial alignment important?

One of the most important use cases for facial alignment is face recognition; many classical face recognition algorithms perform better if the face is oriented correctly before inference is performed on it.

Here’s a full tutorial on facial Alignment.

One other useful thing concerning facial alignment is that you can actually extract the angle of the rotated face. This is pretty useful when you’re working on an augmented reality filter application, as it allows you to rotate the filters according to the orientation of the face.

Here’s an application I built that does that. 
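
The core of that angle estimation is tiny: take the centers of the two eyes from the landmarks and measure the angle of the line joining them. A minimal sketch:

import numpy as np

def face_rotation_angle(left_eye_pts, right_eye_pts):
    # left_eye_pts / right_eye_pts: NumPy arrays of (x, y) landmark points of each eye.
    left_center = left_eye_pts.mean(axis=0)
    right_center = right_eye_pts.mean(axis=0)

    # Angle (in degrees) of the line joining the two eye centers.
    dy = right_center[1] - left_center[1]
    dx = right_center[0] - left_center[0]
    return np.degrees(np.arctan2(dy, dx))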

Head Pose Estimation:

A problem similar to facial alignment correction is head pose estimation. In this technique, instead of determining the 2d head rotation, you extract the full 3d head pose orientation. This is particularly useful when you’re working on an augmented reality application, like overlaying a 3d mask on the face: you will only be able to correctly render the 3d object on the face if you know the face’s 3d orientation.

Here’s a great tutorial that teaches you head pose estimation in great detail.
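
If you want a feel for the recipe before reading the tutorial, the usual approach pairs a handful of 2D landmarks with a generic 3D face model and calls cv2.solvePnP. A rough sketch, where the 3D model points and the camera approximation follow the convention commonly used in such tutorials:

import cv2
import numpy as np

# Generic 3D model points: nose tip, chin, eye corners, mouth corners.
model_points = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye left corner
    (225.0, 170.0, -135.0),    # right eye right corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0)    # right mouth corner
])

def estimate_head_pose(image_points, image_size):
    # image_points: the corresponding six 2D landmarks as a (6, 2) float array.
    h, w = image_size
    focal_length = w  # rough approximation of the focal length
    camera_matrix = np.array([[focal_length, 0, w / 2],
                              [0, focal_length, h / 2],
                              [0, 0, 1]], dtype="double")
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion

    success, rotation_vec, translation_vec = cv2.solvePnP(
        model_points, image_points, camera_matrix, dist_coeffs)
    return rotation_vec, translation_vec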



Single & Multi-Object Tracking with Dlib:

Landmark detection is not all dlib has to offer; there are other useful techniques, like a correlation tracking algorithm for object tracking, that come packed with dlib.

The tracker is based on Danelljan et al.’s 2014 paper, Accurate Scale Estimation for Robust Visual Tracking.

This tracker works well with changes in translation and scale and it works in real time.

Object Detection VS  Object Tracking:

If you’re just starting out on your computer vision journey and are confused about object detection vs tracking, then understand that in object detection you try to find instances of the target object in the whole image, and you perform this detection in each frame of the video. There can be multiple instances of the same object, and you’ll detect all of them with no differentiation between those object instances.

What I’m trying to say is that a single image or frame of a video can contain multiple objects of the same class, e.g. multiple cats can be present in the same image, and the object detector will see them all as the same thing, `CAT`, with no difference between the individual cats throughout the video.

Whereas an Object Tracking algorithm will track each cat separately in each frame and will recognize each cat by a unique ID throughout the video. 

You can read this tutorial that goes over Dlib correlation tracker.

After reading the above tutorial you can go ahead and read this tutorial for using the correlation tracker to track multiple objects.
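
The dlib API for this is pleasantly small. Here's a minimal single-object sketch (the initial bounding box coordinates and the video path are placeholders you would replace with your own):

import cv2
import dlib

cap = cv2.VideoCapture("video.mp4")       # example video path
ret, frame = cap.read()
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

# Start tracking inside an initial bounding box (placeholder coordinates).
tracker = dlib.correlation_tracker()
tracker.start_track(rgb, dlib.rectangle(100, 100, 200, 200))

while True:
    ret, frame = cap.read()
    if not ret:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Update the tracker and read back the new position.
    tracker.update(rgb)
    pos = tracker.get_position()
    cv2.rectangle(frame, (int(pos.left()), int(pos.top())),
                  (int(pos.right()), int(pos.bottom())), (0, 255, 0), 2)
    cv2.imshow("Tracking", frame)
    if cv2.waitKey(1) == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()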



Face Swapping, Averaging & Morphing:

Here’s a series of cool facial manipulations you can do by utilizing facial landmarks and some other techniques.

Face Morphing:

What you see in the above video is called facial morphing. I’m sure you have seen such effects in other apps and movies. This effect is a lot more than a simple image pixel blending or transition.

To have a morph effect like the above, you need to do image alignment, establish pixel correspondences using facial landmark detection and more.

Here’s a nice tutorial that teaches you face morphing step by step.

By understanding and utilizing facial morphing techniques you can even do morphing between dissimilar objects like a face to a lion.

Face Swapping:

After you’ve understood face morphing, another really interesting thing you can do is face swapping, where you take a source face and put it over a destination face, like putting Modi’s face over Musharaf’s above.

The techniques underlying face swapping are pretty similar to the ones used in face morphing, so there is not much new here.

The way this swapping is done makes the results look real and freakishly weird. See how everything from lighting to skin tone is matched.

Here’s a full tutorial on face swapping.

Tip: If you want to make the above code work in real-time then you would need to replace the seamless cloning function with some other faster cloning method, the results won’t be as good but it’ll work in real-time.

Alternative Tutorial:
Switching Eds with Python

Note: Although this technique gives excellent results, the state of the art in face swapping is achieved by deep learning based methods (deepfakes, FaceApp, etc.).

Face Averaging:

Average face of: Aiman Khan, Ayeza Khan, Mahira Khan, Mehwish Hayat, Saba Qamar & Syra Yousuf 

Similar to above methods there’s also Face averaging where you smartly average several faces together utilizing facial landmarks.

The face image you see above is the average face I created using 6 different Pakistani female celebrities.

Personally speaking out of all the applications here I find face averaging the least useful or fun. But Satya has written a really interesting Tutorial on face averaging here that is worth a read.

Face Recognition:

It should not come as a surprise that dlib also has a face recognition pipeline. Not only that, but the face recognition implementation is a really robust one: it’s a modified version of ResNet-34, based on the paper “Deep Residual Learning for Image Recognition” by He et al., and it achieves an accuracy of 99.38% on the Labeled Faces in the Wild (LFW) benchmark. The network was trained on a dataset of roughly 3 million face images.

The model was trained using deep metric learning and for each face, it learned to output a 128-dimensional vector. This vector encodes all the important information about the face. This vector is also called a face embedding.

First, you will store some face embeddings of target faces and then you will test on different new face images. Meaning you will extract embedding from test images and compare it with the saved embeddings of the target faces.

If two vectors are similar (i.e. the euclidean distance between them is small) then it’s said to be a match. This way you can make thousands of matches pretty fast. The approach is really accurate and works in real-time.

Dlib’s implementation of face recognition can be found here, but I would recommend that you use the face_recognition library to do face recognition. This library uses dlib internally and makes the code a lot simpler.

You can follow this nice tutorial on doing face recognition with face_recognition library.
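
For reference, here's roughly what that workflow looks like with the face_recognition library (a minimal sketch, with example image paths):

import face_recognition

# Compute the 128-d embedding of a known face.
known_image = face_recognition.load_image_file("joe.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]

# Compute embeddings for all faces found in a new image.
test_image = face_recognition.load_image_file("group_photo.jpg")
test_encodings = face_recognition.face_encodings(test_image)

# A small euclidean distance (below the default 0.6 tolerance) counts as a match.
for encoding in test_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    distance = face_recognition.face_distance([known_encoding], encoding)[0]
    print(match, distance)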

Extra resources:

An Excellent Guide on face recognition by Adam Geitgey.


Face Clustering:

Image Credit: Dlib Blog

Consider this: you went to a museum with a number of friends, and all of them asked you to take their pictures in front of several monuments/statues, so each of your friends ended up with several images of themselves taken by you.

Now after the trip, all your friends ask for their pictures, now you don’t want to send each of them your whole folder. So what can you do here?

Fortunately, face clustering can help you out here, this method will allow you to make clusters of images of each unique individual.

Consider another use case: You want to quickly build a face recognition dataset for 10 office people that reside in a single room. Instead of taking manual face samples of each person, you instead record a short video of everyone together in the room, you then use a face detector to extract all the faces in each frame, and then you can use a face clustering algorithm to sort all those faces into clusters/folders. Later on, you just need to name these folders and your dataset is ready.

Clustering is a useful unsupervised problem and has many more use cases.
Face clustering is built on top of face recognition so once you’ve understood the recognition part this is easy.

You can follow this tutorial to perform face clustering.
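
Under the hood, dlib does this with its chinese_whispers_clustering function applied to the 128-d face descriptors. A rough sketch of the core steps (the model files and image paths are examples; both models are downloadable from the dlib website):

import dlib

detector = dlib.get_frontal_face_detector()
sp = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

# Compute a 128-d descriptor for every face found in the images.
descriptors = []
for path in ["img1.jpg", "img2.jpg", "img3.jpg"]:   # example image paths
    img = dlib.load_rgb_image(path)
    for rect in detector(img, 1):
        shape = sp(img, rect)
        descriptors.append(facerec.compute_face_descriptor(img, shape))

# Cluster the descriptors; faces with the same label belong to the same person.
labels = dlib.chinese_whispers_clustering(descriptors, 0.5)
print("Found {} unique people".format(len(set(labels))))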

Training a Custom Landmark Predictor:

Just like dlib’s facial landmark detector, you can train your own custom landmark detector, also called a shape predictor. You aren’t restricted to facial landmarks; you can go ahead and train a landmark detector for almost anything: the body joints of a person, key points of a particular object, etc.

As long as you can get sufficient annotated data for the key points, you can use dlib to train a landmark detector on it.

Here’s a tutorial that teaches you how to train a custom Landmark detector.

After going through the above tutorial, you may want to learn how to further optimize your trained model in terms of model size, accuracy, and speed. 

There are multiple hyperparameters that you can tune to get better performance; here’s a tutorial that lets you automate the tuning process, and also take a look at this one too.
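
For a sense of scale, the actual training call in dlib is only a few lines once you have an annotated XML file of images and landmark points (the file names here are examples); the options object is where the hyperparameters mentioned above live:

import dlib

options = dlib.shape_predictor_training_options()
options.tree_depth = 4             # smaller = smaller and faster model, but less accurate
options.nu = 0.1                   # regularization strength
options.cascade_depth = 15
options.oversampling_amount = 10   # data augmentation via random deformations
options.be_verbose = True

# training.xml lists the images and their annotated landmark points.
dlib.train_shape_predictor("training.xml", "my_predictor.dat", options)

# Optionally measure the mean landmark error on an annotated set.
print(dlib.test_shape_predictor("training.xml", "my_predictor.dat"))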

Extra Resources:

Here’s another tutorial on training a shape predictor.

Training a Custom Object Detector:

Just like a custom landmark detector, you can train a custom object detector with dlib. Dlib uses Histogram of Oriented Gradients (HOG) features with a Support Vector Machine (SVM) classifier; combine this with sliding windows and image pyramids and you’ve got yourself an object detector. The only limitation is that you can train it to detect only a single object class at a time.

The object detection approach in dlib is based on the same series of steps used in the sliding window based object detector first published by Dalal and Triggs in their 2005 paper, Histograms of Oriented Gradients for Human Detection.

HOG + SVM based detectors are among the strongest non deep learning based approaches for object detection. Here’s a hand detector I built using this approach a few years back.

I didn’t even annotate or collect training data for my hands; instead I made a sliding window application that automatically collected pictures of my hand as the window moved across the screen while I kept my hand inside the bounding box.

Afterward, I took this hand detector and created a video game car controller, so I was literally steering the video game car with my hands. To be honest, that wasn’t a pleasant experience; my hand was sore afterwards. Making something cool is not hard, but it takes a whole lot of effort to make a practical VR or AR based application.

Here’s Dlib Code for Training an Object Detector and here’s a blog post that teaches you how to do that.
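
The dlib training call itself is similarly compact. A minimal sketch, assuming you've annotated your bounding boxes into an imglab-style XML file (file names are examples):

import dlib

options = dlib.simple_object_detector_training_options()
options.add_left_right_image_flips = True   # augment with mirrored copies
options.C = 5                               # SVM regularization parameter
options.be_verbose = True

# training.xml contains the images and their bounding box annotations.
dlib.train_simple_object_detector("training.xml", "detector.svm", options)

# Load the trained detector and run it on an image.
detector = dlib.simple_object_detector("detector.svm")
img = dlib.load_rgb_image("test.jpg")       # example image path
for box in detector(img):
    print(box.left(), box.top(), box.right(), box.bottom())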

Extra Resources:


Here’s another Tutorial on training the detector.



Dlib Optimizations For Faster & Better Performance:

Here’s a bunch of techniques and tutorials that will help you get the most out of dlib’s landmark detection.

Using A Faster Landmark Detector:

Besides the 68 point landmark detector, dlib also has a 5 point landmark detector that is 10 times smaller and about 10% faster than the 68 point one. If you need more speed and the 5 landmark points are all you need, then you should opt for this detector; from what I’ve seen it’s also somewhat more efficient than the 68 point detector.

Here’s a tutorial that explains how to use this faster landmark detector.

Speeding Up the Detection Pipeline:

There are a bunch of tips and techniques that you can use to get faster detection speeds. The landmark detector itself is really fast; it’s the rest of the pipeline that takes up most of the time. Some tricks you can use to increase speed are:

Skip Frames:

If you’re reading from a high-fps camera then it won’t hurt to perform detection on every other frame; this will effectively double your speed.

Reduce image Size: 

If you’re using HOG + sliding window based detection or a Haar cascade + sliding window based one, then the face detection speed depends on the size of the image. So one smart thing you can do is reduce the image size before face detection and then rescale the detected coordinates to the original image later, as shown in the sketch below.

Both of the above techniques and some others are explained in this tutorial.
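
Here's a minimal sketch of the resize trick mentioned above: detect on a smaller copy, then scale the boxes back up to the original image.

import cv2
import dlib

detector = dlib.get_frontal_face_detector()
image = cv2.imread("frame.jpg")             # example image path

# Detect on a half-size copy to speed up the HOG detector.
scale = 0.5
small = cv2.resize(image, None, fx=scale, fy=scale)
rects = detector(cv2.cvtColor(small, cv2.COLOR_BGR2RGB), 0)

# Rescale the detected boxes back to the original image coordinates.
for r in rects:
    left, top = int(r.left() / scale), int(r.top() / scale)
    right, bottom = int(r.right() / scale), int(r.bottom() / scale)
    cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)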

Tip: The biggest bottleneck you’ll face in the landmark detection pipeline is the HOG based face detector in dlib which is pretty slow. You can replace this with haar cascades or the SSD based face detector for faster performance.

Summary:

Let’s wrap up, in this tutorial we went over a number of algorithms and techniques in dlib.

We started with installation, moved on to face detection and landmark prediction, and learned to build a number of applications using landmark detection. We also looked at other techniques like correlation tracking and facial recognition.

We also learned that you can train your own landmark detectors and object detectors with dlib.

At the end we learned some nice optimizations that we can do with our landmark predictor. 

Extra Resources:

Final Tip: I know most of you won’t be able to go over all the tutorials linked here in a single day, so I would recommend that you bookmark this page and tackle a single problem at a time. Only when you’ve understood a certain technique should you move on to the next.

It goes without saying that Dlib is a must learn tool for serious computer vision practitioners out there.

I hope you enjoyed this tutorial and found it useful. If you have any questions feel free to ask them in the comments and I’ll happily address it.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

(LearnOpenCV) Creating a Virtual Pen And Eraser with OpenCV


Wouldn’t it be cool if you could just wave a pen in the air to draw something virtually, and it actually got drawn on the screen? It would be even more interesting if we didn’t use any special hardware to achieve this: just plain, simple computer vision would do. In fact, we wouldn’t even need machine learning or deep learning to achieve it.

Here’s a demo of the application that we will build.

(Urdu/Hindi ) Learn how to make an ML classifier without programming or installing anything.


Teachable Machine Version 1 (Google AI Experiments) 

In this video lesson I’ll teach you how to create an image classifier without actually coding. For this I’m using Teachable Machine version 1, which is part of Google AI Experiments. This application allows you to create image classifiers and introduces computer vision to newcomers in a really fun and exciting way. I also go into the technical workings of this application, so people who already have some fundamental knowledge about building classifiers can benefit from it.

Here’s the link to access this amazing tool. This is version 1; version 2 has also been released, which deals with pose and voice recognition and even lets you export the models.

I’m offering a premium 3-month comprehensive state-of-the-art course in Computer Vision & Image Processing with Python (Urdu/Hindi). This course is a must-take if you’re planning to start a career in computer vision & artificial intelligence; the only prerequisite is some programming experience in any language.

This course covers the foundations of image processing and computer vision: you learn from the ground up what an image is and how to manipulate it at the lowest level, and then you gradually build up from there, learning other foundational techniques along with their theory and how to use them effectively.


(Urdu/Hindi ) Learn How to Convert Images to Song Lyrics With Artificial Intelligence

(Urdu/Hindi ) Learn How to Convert Images to Song Lyrics With Artificial Intelligence

Giorgio Cam (Google AI Experiments) 

This video covers Giorgio Cam, which is part of Google AI Experiments. In it I explain how you can convert images of objects into actual song lyrics using Giorgio Cam. Basically, it performs image recognition to recognize the contents of the image and then uses speech synthesis to turn the recognition results into a lyrical sentence.

A lot more is happening in the background, and I also go into the technical workings of this application so you can build something on top of it, or at least be inspired to build something cool.

Here’s the link for the Application.


