Reading Video Sources in OpenCV: IP Camera, Webcam, Videos & GIFS

Processing videos is one of the most common tasks in OpenCV, and many people already know how to leverage the VideoCapture function to read from a live camera or a video saved on disk.

But here's some food for thought: did you know that you can also read other video sources, e.g., a live feed from an IP camera (or your phone's camera), or even GIFs?

Yes, you'll learn all about reading these sources with VideoCapture in today's tutorial, and I'll also cover some very useful additional things, like getting and setting different video properties (height, width, frame count, FPS, etc.), manually changing the current frame position to repeatedly display the same video, and capturing different key events.

This will be an excellent tutorial to help you properly get started with video processing in OpenCV. 

Alright, let’s first rewind a bit and go back to the basics, What is a video? 

Well, it is just a sequence of multiple still images (a.k.a. frames) that are updated really fast, creating the appearance of motion. Below you can see a combination of different still images of some guy (you know who xD) dancing.

And how fast these still images are updated is measured by a metric called Frames Per Second (FPS). Different videos have different FPS, and the higher the FPS, the smoother the video. Below you can see a visualization of the smoothness in the motion of the higher-FPS balls. The ball moving at 120 FPS has the smoothest motion, although it's hard to tell the difference between the 60 FPS and the 120 FPS balls.

Note: Consider each ball as a separate video clip.

So, a 5-second video with 15 Frames Per Second (FPS) will have a total of 75 (i.e., 15*5) frames in the whole video, with each frame staying on screen for about 67 milliseconds (1000/15), while a 5-second video with 30 FPS will have 150 (i.e., 30*5) frames, with each frame staying on screen for about 33 milliseconds (1000/30).

So a 30 FPS video will display the same frame (still image) for only about 33 milliseconds, while a 15 FPS video will display the same frame for about 67 milliseconds (a longer period), which makes the motion jerkier and slower, and in extreme cases (< 10 FPS) may turn a video into a slideshow.

Other than FPS, there are some other properties too which determine the quality of a video, like its resolution (i.e., width x height) and bitrate (i.e., the amount of information in a given unit of time). The higher the resolution and bitrate of a video, the better its quality.

This tutorial also has a video version that you can go and watch for a detailed explanation, although this blog post alone can also suffice.

Alright, now that we have gone through the required basic theoretical details about videos and their properties, without further ado, let's get started with the code.

Import the Libraries

We will start by importing the required libraries.
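A minimal sketch of the imports; only OpenCV itself is strictly needed for this part:

import cv2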

Loading a Video

To read a video, first, we will have to initialize the video capture object by using the function cv2.VideoCapture().

Function Syntax:
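In sketch form, the call looks like this (only one of filename or index is passed, depending on the source; both parameters are described below):

video_reader = cv2.VideoCapture(filename_or_index, apiPreference)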

Parameters:

  • filename – It can be:
    1. Name of a video file (e.g., video.avi)
    2. or an image sequence (e.g., img_%02d.jpg, which will read samples like img_00.jpg, img_01.jpg, img_02.jpg, ...)
    3. or the URL of a video stream (e.g., protocol://host:port/script_name?script_params|auth). You can refer to the documentation of the source stream to know the right URL scheme.
  • index – It is the id of a video capturing device to open. To open the default camera using the default backend, you can just pass 0. In case of multiple cameras connected to the computer, you can select the second camera by passing 1, the third camera by passing 2, and so on.
  • apiPreference – It is the preferred capture API backend to use. Can be used to enforce a specific reader implementation if multiple are available: e.g. cv2.CAP_FFMPEG or cv2.CAP_IMAGES or cv2.CAP_DSHOW. Its default value is cv2.CAP_ANY. Check cv2.VideoCaptureAPIs for details.

Returns:

  • video_reader – It is the video loaded from the source specified.

So, simply put, the cv2.VideoCapture() function opens up a webcam, a video file/image sequence, or an IP video stream for video capturing with the given API preference. After initializing the object, we will use the .isOpened() method to check if the video is accessed successfully. It returns True for success and False for failure.
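Here is a minimal sketch of initializing the capture object from the different sources discussed above; the file paths and the stream URL are illustrative placeholders, not real ones:

# Read a video file from disk (the path is illustrative).
video_reader = cv2.VideoCapture('media/video.mp4')

# Or read from the default webcam.
# video_reader = cv2.VideoCapture(0)

# Or read a GIF, or a live IP camera / phone camera stream (URL is hypothetical).
# video_reader = cv2.VideoCapture('media/sample.gif')
# video_reader = cv2.VideoCapture('http://192.168.1.101:8080/video')

# Check if the video source was accessed successfully.
if not video_reader.isOpened():
    print('Failed to access the video source')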

Reading a Frame

If the video is accessed successfully, then the next step will be to read the frames of the video one by one which can be done using the function .read().

Function Syntax:

ret, frame = cv2.VideoCapture.read()

Returns:

  • ret – It is a boolean value i.e., True if the frame is read successfully otherwise False.
  • frame – It is a frame/image of our video.

Note: Every time we run the .read() function, it gives us a new frame, i.e., the next frame of the video, so we can put .read() in a loop to read all the frames. The ret value is really important in such scenarios, since after the last frame has been read from the video, ret will be False, indicating that the video has ended.

Get and Set Properties of the Video

Now that we know how to read a video, let's see how to get and set different properties of a video using the functions:
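In sketch form, for a capture object video_reader created as above:

value = video_reader.get(propId)
success = video_reader.set(propId, new_value)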

Here, propId is the Property ID and new_value is the value we want to set for the property.

Property ID | Enumerator                 | Property
0           | cv2.CAP_PROP_POS_MSEC      | Current position of the video in milliseconds.
1           | cv2.CAP_PROP_POS_FRAMES    | 0-based index of the frame to be decoded/captured next.
3           | cv2.CAP_PROP_FRAME_WIDTH   | Width of the frames in the video stream.
4           | cv2.CAP_PROP_FRAME_HEIGHT  | Height of the frames in the video stream.
5           | cv2.CAP_PROP_FPS           | Frame rate of the video.
7           | cv2.CAP_PROP_FRAME_COUNT   | Number of frames in the video.

I have only mentioned the most commonly used properties with their Property ID and Enumerator. You can check cv2.VideoCaptureProperties for the remaining ones. Now we will try to get the width, height, frame rate, and the number of frames of the loaded video using the .get() function.
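A sketch of how this can look; the printed values below correspond to the sample video used in this tutorial:

width = video_reader.get(cv2.CAP_PROP_FRAME_WIDTH)
height = video_reader.get(cv2.CAP_PROP_FRAME_HEIGHT)
fps = video_reader.get(cv2.CAP_PROP_FPS)
frame_count = video_reader.get(cv2.CAP_PROP_FRAME_COUNT)

print(f'Width of the video: {width}')
print(f'Height of the video: {height}')
print(f'Frame rate of the video: {int(fps)}')
print(f'Total number of frames of the video: {int(frame_count)}')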

Width of the video: 1280.0

Height of the video: 720.0

Frame rate of the video: 29

Total number of frames of the video: 166

Now we will use the .set() function to set a new height and width for the loaded video. The function .set() returns False if the video property is not settable, which can happen when the resolution you are trying to set is not supported by your webcam or by the video you are working on. In some cases, the backend instead sets the nearest supported resolution; for example, if I try to set the resolution to 500x500, it may fail and set the resolution to something else, like 720x480, which is supported by my webcam.
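A sketch of attempting to set a Full HD resolution and checking the return values, which produces the output below for the video used here:

if not video_reader.set(cv2.CAP_PROP_FRAME_WIDTH, 1920):
    print('Failed to set the width!')

if not video_reader.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080):
    print('Failed to set the height!')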

Failed to set the width!

Failed to set the height!

So we cannot set the width and height to 1920x1080 for the video we are working on. An easy solution to this type of issue is to use the cv2.resize() function on each frame of the video, but it is a slightly less efficient approach.

Now we will put all this in a loop and read and display all the frames sequentially in a window using the function cv2.imshow(), which will look like we are playing a video, but we will be just displaying frames one after the other. We will use the function cv2.waitKey(milliseconds) to wait for one millisecond before updating a frame with the next one.

We will use the functions .get() and .set() to keep restarting the video every time we reach the last frame, until the key q is pressed or the close (X) button of the opened window is clicked. And finally, in the end, we will release the loaded video using the function cv2.VideoCapture.release() and destroy all of the opened HighGUI windows using cv2.destroyAllWindows().

You can increase the delay specified in cv2.waitKey(delay) to be higher than 1 ms to control the frames per second.
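Putting it all together, a minimal sketch of the playback loop could look like this; the window name is arbitrary, and the restart check applies to file sources:

window_name = 'Video'
cv2.namedWindow(window_name)

while video_reader.isOpened():

    # Restart the video if the last frame has been reached (for file sources).
    if video_reader.get(cv2.CAP_PROP_POS_FRAMES) == video_reader.get(cv2.CAP_PROP_FRAME_COUNT):
        video_reader.set(cv2.CAP_PROP_POS_FRAMES, 0)

    ret, frame = video_reader.read()
    if not ret:
        break

    cv2.imshow(window_name, frame)

    # Wait 1 ms for a key event; increase the delay to slow down the playback.
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    # Stop if the window was closed with the X button.
    if cv2.getWindowProperty(window_name, cv2.WND_PROP_VISIBLE) < 1:
        break

video_reader.release()
cv2.destroyAllWindows()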

Join My Course Computer Vision For Building Cutting Edge Applications Course

The only course out there that goes beyond basic AI applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, and hand and body gestures. Don't miss your chance to level up and take your career to new heights.

You'll Learn about:

  • Creating GUI interfaces for Python AI scripts.
  • Creating .exe DL applications.
  • Using a physics library in Python & integrating it with AI.
  • Advanced image processing skills.
  • Advanced gesture recognition with Mediapipe.
  • Task automation with AI & CV.
  • Training an SVM machine learning model.
  • Creating & cleaning an ML dataset from scratch.
  • Training DL models & how to use CNNs & LSTMs.
  • Creating 10 advanced AI/CV applications.
  • & more.

Whether you're a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you how to architect & build complex, real-world, and thrilling AI applications.

Summary

In this tutorial, we learned what exactly videos are, how to read them from sources like an IP camera, a webcam, video files & GIFs, and how to display them frame by frame in a similar way an image is displayed. We also learned about the different properties of videos and how to get and set them in OpenCV.

These basic concepts we learned today are essential for many in-demand Computer Vision applications such as intelligent video analytics systems for intruder detection and much more.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Facial Landmark Detection with Mediapipe & Creating Animated Snapchat filters

In this tutorial, we'll learn to perform real-time multi-face detection followed by 3D face landmarks detection using the Mediapipe library in Python on 2D images/videos, without using any dedicated depth sensor. After that, we will learn to build a facial expression recognizer that tells you if the person's eyes or mouth are open or closed.

Below you can see the facial expression recognizer in action, on a few sample images:


And then, in the end, we will see how we can combine what we've learned to create animated Snapchat-like 2D filters and overlay them over the faces in images and videos. The filters will trigger in real-time for videos, based on the facial expressions of the person. Below you can see the results on a sample video.

Everything that we will build will work on images, real-time camera feed, and recorded videos as well, and the code is very neatly structured and explained in the simplest manner possible.

This tutorial also has a video version that you can go and watch for a detailed explanation, although this blog post alone can also suffice.

This post can be split into 4 parts:

Part 1 (a): Introduction to Face Landmarks Detection

Part 1 (b): Mediapipe’s Face Landmarks Detection Implementation

Part 2: Face Landmarks Detection on images and videos

Part 3: Face Expression Recognition

Part 4: Snapchat Filter Controlled by Facial Expressions

Part 1 (a): Introduction to Face Landmarks Detection

Facial landmark detection/estimation is the process of detecting and tracking face key landmarks (points that represent important regions of the face, e.g., the center of the eye, the tip of the nose, etc.) in images and videos. It allows you to localize the face features and identify the shape and orientation of the face.

It also fits into the keypoint estimation category that I explained in detail a few weeks ago in the Real-Time 3D Pose Detection & Pose Classification with Mediapipe and Python post, so make sure to check that one out too.

In this tutorial, we will learn to detect 468 facial landmarks. Below are the results of the landmarks detector we will use.

It is a must-learn task for every vision practitioner, as it is used as a pre-processing step in many vision applications.

Some other types of keypoint estimation tasks are hand landmark detection, pose detection, etc.

I have already made tutorials (Hands Landmarks Detection, Pose Detection) on both of them.

Part 1 (b): Mediapipe’s Face Landmarks Detection Implementation

Here's a brief introduction to Mediapipe:

"Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It's designed primarily for facilitating the use of ML in streaming media, and it was built by Google."

All the solutions provided by Mediapipe are state-of-the-art in terms of speed and accuracy and are used in a lot of well-known applications.

The facial landmarks detection solution provided by Mediapipe is capable of detecting 468 3D facial landmarks from a 2D image/video. It is pretty fast and highly accurate, and it even works fine for occluded faces, in varying lighting conditions, and for faces of various orientations and sizes, in real-time, even on low-end devices like mobile phones and Raspberry Pi.

The landmarks detector's remarkable speed distinguishes it from the other solutions out there. The reason this solution is so fast is that it uses a 2-step detection approach, combining a face detector with a comparatively less computationally expensive tracker, so that for videos, the tracker can be used instead of invoking the face detector at every frame. Let's dive further into the details.

The machine learning pipeline of the Mediapipe’s solution contains two different models that work together:

  1. A face detector that operates on the full image and locates the faces in the image.
  2. A face landmarks detector that operates only on those face locations and predicts the 3D facial landmarks. 

So the landmarks detector gets an accurately cropped face ROI which makes it capable of precisely working on scaled, rotated, and translated faces without needing data augmentation techniques.

In addition, the faces can also be located based on the face landmarks identified in the previous frame, so the face detector is only invoked as needed, that is in the very first frame or when the tracker loses track of any of the faces.  

They have utilized transfer learning and used both synthetic rendered and annotated real-world data to get a model capable of predicting 3D landmark coordinates. Another approach could be to train a model to predict a 2D heatmap for each landmark, but that would increase the computational cost, as there are so many points.

Alright, now that we have gone through the required basic theory and implementation details of the solution provided by Mediapipe, without further ado, let's get started with the code.


Part 2: Face Landmarks Detection on images and videos

Import the Libraries

Let’s start by importing the required libraries.
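A minimal sketch of the imports used throughout this part:

import cv2
import itertools
import numpy as np
import mediapipe as mp
import matplotlib.pyplot as plt
from time import time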

As mentioned, Mediapipe's face landmarks detection solution internally uses a face detector to get the required regions of interest (faces) from the image. So before going to the facial landmarks detection, let's briefly discuss that face detector first, as Mediapipe also allows you to use it separately.

Face Detection

Mediapipe's face detection solution is based on the BlazeFace face detector, which uses a very lightweight and highly accurate feature extraction network, inspired by and modified from MobileNetV1/V2, together with a detection method similar to the Single Shot MultiBox Detector (SSD). It is capable of running at a speed of 200-1000+ FPS on flagship devices. For more info, you can check the resources here.

Initialize the Mediapipe Face Detection Model

To use the Mediapipe’s Face Detection solution, we will first have to initialize the face detection class using the syntax mp.solutions.face_detection, and then we will have to call the function mp.solutions.face_detection.FaceDetection() with the arguments explained below:

  • model_selection – It is an integer index ( i.e., 0 or 1 ). When set to 0, a short-range model is selected that works best for faces within 2 meters from the camera, and when set to 1, a full-range model is selected that works best for faces within 5 meters. Its default value is 0.
  • min_detection_confidence – It is the minimum detection confidence between ([0.0, 1.0]) required to consider the face-detection model’s prediction successful. Its default value is 0.5 ( i.e., 50% ) which means that all the detections with prediction confidence less than 0.5 are ignored by default.

We will also have to initialize the drawing class using the syntax mp.solutions.drawing_utils which is used to visualize the detection results on the images/frames.
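A sketch of both initializations, using the default argument values discussed above:

mp_face_detection = mp.solutions.face_detection

# Set up the face detection model (short-range model, 50% minimum confidence).
face_detection = mp_face_detection.FaceDetection(model_selection=0,
                                                 min_detection_confidence=0.5)

# Set up the drawing utilities for visualizing the results.
mp_drawing = mp.solutions.drawing_utils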

Read an Image

Now we will use the function cv2.imread() to read a sample image and then display the image using the matplotlib library, after converting it into RGB from BGR format.
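A sketch, with an illustrative image path:

# Read the sample image in BGR format (the path is illustrative).
sample_img = cv2.imread('media/sample.jpg')

# Convert BGR to RGB and display it with matplotlib.
plt.figure(figsize=[10, 10])
plt.imshow(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()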

Perform Face Detection

Now, to perform the detection on the sample image, we will have to pass the image (in RGB format) into the loaded model by using the function mp.solutions.face_detection.FaceDetection().process(), and we will get an object with an attribute detections that contains a bounding box and six key points for each face in the image. The six key points are on the:

  1. Right Eye
  2. Left Eye
  3. Nose Tip
  4. Mouth Center
  5. Right Ear Tragion
  6. Left Ear Tragion

After performing the detection, we will display the bounding box coordinates and only the first two key points of each detected face in the image, so that you get more intuition about the format of the output.
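A sketch of this step; get_key_point() and the FaceKeyPoint enum come from Mediapipe's face detection module, and printing the protobuf fields produces the output shown below:

face_detection_results = face_detection.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

if face_detection_results.detections:
    for face_no, face in enumerate(face_detection_results.detections):
        print(f'FACE NUMBER: {face_no + 1}')
        print('-----------------------------')
        print(f'FACE CONFIDENCE: {round(face.score[0], 2)}')
        print('FACE BOUNDING BOX:')
        print(face.location_data.relative_bounding_box)
        # Print only the first two key points (right eye and left eye).
        for i in range(2):
            key_point = mp_face_detection.get_key_point(face, mp_face_detection.FaceKeyPoint(i))
            print(f'{mp_face_detection.FaceKeyPoint(i).name}:')
            print(key_point)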

FACE NUMBER: 1

—————————–

FACE CONFIDENCE: 0.98

FACE BOUNDING BOX:

xmin: 0.39702364802360535

ymin: 0.2762746810913086

width: 0.16100731492042542

height: 0.24132275581359863

RIGHT_EYE:

x: 0.4368540048599243

y: 0.3198586106300354

LEFT_EYE:

x: 0.5112437605857849

y: 0.3565130829811096

Note: The bounding boxes are composed of xmin and width (both normalized to [0.0, 1.0] by the image width) and ymin and height (both normalized to [0.0, 1.0] by the image height). Each keypoint is composed of x and y, which are normalized to [0.0, 1.0] by the image width and height respectively.

Now we will draw the detected bounding box(es) and the key points on a copy of the sample image using the function mp.solutions.drawing_utils.draw_detection() from the drawing class we had initialized earlier, and display the resultant image using the matplotlib library.
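A sketch of the drawing step:

img_copy = sample_img.copy()

if face_detection_results.detections:
    for face in face_detection_results.detections:
        mp_drawing.draw_detection(image=img_copy, detection=face)

plt.figure(figsize=[10, 10])
plt.imshow(cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()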

Note: Although the detector detects faces quite accurately, it fails to precisely detect facial key points (landmarks) in some scenarios (e.g., for non-frontal, rotated, or occluded faces), which is why we need Mediapipe's face landmarks detection solution for creating the Snapchat filter that is our main goal.

Face Landmarks Detection

Now, let's move to the facial landmarks detection. We will start by initializing the face landmarks detection model.

Initialize the Mediapipe Face Landmarks Detection Model

To initialize the Mediapipe’s face landmarks detection model, we will have to initialize the face mesh class using the syntax mp.solutions.face_mesh and then we will have to call the function mp.solutions.face_mesh.FaceMesh() with the arguments explained below:

  • static_image_mode – It is a boolean value that is if set to False, the solution treats the input images as a video stream. It will try to detect faces in the first input images, and upon a successful detection further localizes the face landmarks. In subsequent images, once all max_num_faces faces are detected and the corresponding face landmarks are localized, it simply tracks those landmarks without invoking another detection until it loses track of any of the faces. This reduces latency and is ideal for processing video frames. If set to True, face detection runs on every input image, ideal for processing a batch of static, possibly unrelated, images. Its default value is False.
  • max_num_faces – It is the maximum number of faces to detect. Its default value is 1.
  • min_detection_confidence – It is the minimum detection confidence ([0.0, 1.0]) required to consider the face-detection model’s prediction correct. Its default value is 0.5 which means that all the detections with prediction confidence less than 50% are ignored by default.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) from the landmark-tracking model for the face landmarks to be considered tracked successfully, or otherwise face detection will be invoked automatically on the next input image, so increasing its value increases the robustness, but also increases the latency. It is ignored if static_image_mode is True, where face detection simply runs on every image. Its default value is 0.5.

After that, we will initialize the mp.solutions.drawing_styles class that will allow us to get different provided drawing styles of the landmarks on the images/frames.
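A sketch of the initializations, creating one model for static images and one for videos (which we will need later for the webcam feed); the exact argument values are judgment calls, not prescribed settings:

mp_face_mesh = mp.solutions.face_mesh

# Model for static images.
face_mesh_images = mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=2,
                                         min_detection_confidence=0.5)

# Model for video frames (detection + tracking).
face_mesh_videos = mp_face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1,
                                         min_detection_confidence=0.5,
                                         min_tracking_confidence=0.3)

# Drawing styles for visualizing the landmarks.
mp_drawing_styles = mp.solutions.drawing_styles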

Perform Face Landmarks Detection

Now, to perform the landmarks detection, we will pass the image (in RGB format) to the face landmarks detection machine learning pipeline by using the function mp.solutions.face_mesh.FaceMesh().process() and get a list of 468 facial landmarks for each detected face in the image. Each landmark will have:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the center of the head being the origin, and the smaller the value is, the closer the landmark is to the camera.

We will display only two landmarks of each eye to get an intuition about the format of the output. The ML pipeline outputs an object that has an attribute multi_face_landmarks, which contains the found landmarks coordinates of each face as an element of a list.
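A sketch of this step; the eye landmark indexes are recovered from the FACEMESH_LEFT_EYE / FACEMESH_RIGHT_EYE frozensets, which contain connection tuples, hence the itertools.chain:

face_mesh_results = face_mesh_images.process(cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB))

LEFT_EYE_INDEXES = list(set(itertools.chain(*mp_face_mesh.FACEMESH_LEFT_EYE)))
RIGHT_EYE_INDEXES = list(set(itertools.chain(*mp_face_mesh.FACEMESH_RIGHT_EYE)))

if face_mesh_results.multi_face_landmarks:
    for face_no, face_landmarks in enumerate(face_mesh_results.multi_face_landmarks):
        print(f'FACE NUMBER: {face_no + 1}')
        print('-----------------------------')
        print('LEFT EYE LANDMARKS:')
        for index in LEFT_EYE_INDEXES[:2]:
            print(face_landmarks.landmark[index])
        print('RIGHT EYE LANDMARKS:')
        for index in RIGHT_EYE_INDEXES[:2]:
            print(face_landmarks.landmark[index])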

FACE NUMBER: 1

—————————–

LEFT EYE LANDMARKS:

x: 0.49975821375846863
y: 0.3340317904949188
z: -0.0035526191350072622

x: 0.505615234375
y: 0.33464953303337097
z: -0.005253124982118607

RIGHT EYE LANDMARKS:

x: 0.4383838176727295
y: 0.2998684346675873
z: -0.0014895268250256777

x: 0.430422842502594
y: 0.30033284425735474
z: 0.006082724779844284

Note: The z-coordinate is just the relative distance of the landmark from the center of the head. This distance increases and decreases depending upon the distance from the camera, which is why it represents the depth of each landmark point.

Now we will draw the detected landmarks on a copy of the sample image using the function mp.solutions.drawing_utils.draw_landmarks() from the drawing class we had initialized earlier, and display the resultant image. The function mp.solutions.drawing_utils.draw_landmarks() can take the following arguments (a usage sketch follows the list).

  • image – It is the image in RGB format on which the landmarks are to be drawn.
  • landmark_list – It is the normalized landmark list that is to be drawn on the image.
  • connections – It is the list of landmark index tuples that specifies which landmarks are to be connected in the drawing. The provided options are: mp_face_mesh.FACEMESH_FACE_OVAL, mp_face_mesh.FACEMESH_LEFT_EYE, mp_face_mesh.FACEMESH_LEFT_EYEBROW, mp_face_mesh.FACEMESH_LIPS, mp_face_mesh.FACEMESH_RIGHT_EYE, mp_face_mesh.FACEMESH_RIGHT_EYEBROW, mp_face_mesh.FACEMESH_TESSELATION, and mp_face_mesh.FACEMESH_CONTOURS.
  • landmark_drawing_spec – It specifies the landmarks' drawing settings such as color, line thickness, and circle radius. It can be set equal to a mp.solutions.drawing_utils.DrawingSpec(color, thickness, circle_radius) object.
  • connection_drawing_spec – It specifies the connections' drawing settings such as color and line thickness. It can be either a mp.solutions.drawing_utils.DrawingSpec object or a function from the class mp.solutions.drawing_styles; the currently provided options for the face mesh are get_default_face_mesh_contours_style() and get_default_face_mesh_tesselation_style().
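A usage sketch, drawing the tesselation and the contours with the provided default styles:

img_copy = sample_img.copy()

if face_mesh_results.multi_face_landmarks:
    for face_landmarks in face_mesh_results.multi_face_landmarks:
        mp_drawing.draw_landmarks(image=img_copy, landmark_list=face_landmarks,
                                  connections=mp_face_mesh.FACEMESH_TESSELATION,
                                  landmark_drawing_spec=None,
                                  connection_drawing_spec=mp_drawing_styles.get_default_face_mesh_tesselation_style())
        mp_drawing.draw_landmarks(image=img_copy, landmark_list=face_landmarks,
                                  connections=mp_face_mesh.FACEMESH_CONTOURS,
                                  landmark_drawing_spec=None,
                                  connection_drawing_spec=mp_drawing_styles.get_default_face_mesh_contours_style())

plt.figure(figsize=[10, 10])
plt.imshow(cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()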

Create a Face Landmarks Detection Function

Now we will put all this together to create a function detectFacialLandmarks() that will perform face landmarks detection on an image and will either visualize the resultant image along with the original image, or return the resultant image along with the output of the model, depending upon the passed arguments.
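A minimal sketch of how such a function could be structured; the exact drawing and display choices are judgment calls:

def detectFacialLandmarks(image, face_mesh, display=True):
    '''Perform face landmarks detection on an image and draw/return the results.'''
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    output_image = image.copy()
    if results.multi_face_landmarks:
        for face_landmarks in results.multi_face_landmarks:
            mp_drawing.draw_landmarks(image=output_image, landmark_list=face_landmarks,
                                      connections=mp_face_mesh.FACEMESH_TESSELATION,
                                      landmark_drawing_spec=None,
                                      connection_drawing_spec=mp_drawing_styles.get_default_face_mesh_tesselation_style())
            mp_drawing.draw_landmarks(image=output_image, landmark_list=face_landmarks,
                                      connections=mp_face_mesh.FACEMESH_CONTOURS,
                                      landmark_drawing_spec=None,
                                      connection_drawing_spec=mp_drawing_styles.get_default_face_mesh_contours_style())
    if display:
        # Show the original and annotated images side by side.
        plt.figure(figsize=[15, 15])
        plt.subplot(121); plt.imshow(image[:, :, ::-1]); plt.title('Original Image'); plt.axis('off')
        plt.subplot(122); plt.imshow(output_image[:, :, ::-1]); plt.title('Output'); plt.axis('off')
        plt.show()
    else:
        # Return the annotated image and the raw model output.
        return output_image, results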

Now we will utilize the function detectFacialLandmarks() created above to perform face landmarks detection on a few sample images and display the results.

Face Landmarks Detection on Real-Time Webcam Feed

The results on the images were remarkable, but now we will try the function on a real-time webcam feed. We will also calculate and display the number of frames being updated in one second to get an idea of whether this solution can work in real-time on a CPU or not.
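A sketch of the webcam loop with a simple FPS counter; the Esc exit key and window name are arbitrary choices:

camera_video = cv2.VideoCapture(0)
time1 = 0

while camera_video.isOpened():
    ok, frame = camera_video.read()
    if not ok:
        continue

    # Flip horizontally for a natural selfie view.
    frame = cv2.flip(frame, 1)
    frame, _ = detectFacialLandmarks(frame, face_mesh_videos, display=False)

    # Frames updated per second, from the time between loop iterations.
    time2 = time()
    if (time2 - time1) > 0:
        cv2.putText(frame, f'FPS: {int(1.0 / (time2 - time1))}', (10, 30),
                    cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)
    time1 = time2

    cv2.imshow('Face Landmarks Detection', frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc
        break

camera_video.release()
cv2.destroyAllWindows()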

Output

Impressive! The solution is fast as well as accurate.

Face Expression Recognition

Now that we have the detected landmarks, we will use them to recognize the facial expressions of people in the images/videos using classical techniques. Our recognizer will be capable of identifying the following facial expressions:

  • Eyes Opened or Closed 😳 (can be used to check drowsiness, wink or shock expression)
  • Mouth Opened or Closed 😱 (can be used to check yawning)

For the sake of simplicity, we are only limiting this to two expressions. But if you want, you can easily extend this application to identify more facial expressions just by adding more conditional statements, or maybe by merging these two conditions. For example, eyes and mouth both wide open can represent a surprise expression.

Create a Function to Calculate Size of a Face Part

First, we will create a function getSize() that will utilize the detected landmarks to calculate the size of a face part. All we need is a way to isolate the landmarks of the face part, and luckily that can easily be done using the frozenset objects (attributes of the mp.solutions.face_mesh class), which contain the required indexes:

  • mp_face_mesh.FACEMESH_FACE_OVAL contains indexes of face outline.
  • mp_face_mesh.FACEMESH_LIPS contains indexes of lips.
  • mp_face_mesh.FACEMESH_LEFT_EYE contains indexes of left eye.
  • mp_face_mesh.FACEMESH_RIGHT_EYE contains indexes of right eye.
  • mp_face_mesh.FACEMESH_LEFT_EYEBROW contains indexes of left eyebrow.
  • mp_face_mesh.FACEMESH_RIGHT_EYEBROW contains indexes of right eyebrow.

After retrieving the landmarks of the face part, we will simply pass them to the function cv2.boundingRect() to get the width and height of the face part. The function cv2.boundingRect(landmarks) returns the coordinates (x, y, width, height) of a bounding box enclosing the object (the face part), given the landmarks, but we will only need the height and width of the bounding box.
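A sketch of getSize() under these assumptions:

def getSize(image, face_landmarks, INDEXES):
    '''Calculate the width and height of a face part from its landmarks.'''
    image_height, image_width, _ = image.shape
    # Flatten the frozenset of connection tuples into landmark indexes.
    INDEXES_LIST = list(itertools.chain(*INDEXES))
    landmarks = []
    for index in INDEXES_LIST:
        # Convert the normalized coordinates into pixel coordinates.
        landmarks.append([int(face_landmarks.landmark[index].x * image_width),
                          int(face_landmarks.landmark[index].y * image_height)])
    landmarks = np.array(landmarks, dtype=np.int32)
    # Only the width and height of the enclosing rectangle are needed.
    _, _, width, height = cv2.boundingRect(landmarks)
    return width, height, landmarks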

Now we will create a function isOpen() that will utilize the getSize() function we created above to check whether a face part (e.g., the mouth or an eye) of a person is open or closed.

Hint: The height of an opened mouth or eye will be greater than the height of a closed mouth or eye.
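A sketch of isOpen(); comparing the part's height against the face height with a tunable threshold is one simple way to decide the status, and the default threshold value is a judgment call:

def isOpen(image, face_mesh_results, face_part, threshold=5):
    '''Return the OPEN/CLOSE status of a face part for every detected face.'''
    if face_part == 'MOUTH':
        INDEXES = mp_face_mesh.FACEMESH_LIPS
    elif face_part == 'LEFT EYE':
        INDEXES = mp_face_mesh.FACEMESH_LEFT_EYE
    elif face_part == 'RIGHT EYE':
        INDEXES = mp_face_mesh.FACEMESH_RIGHT_EYE
    else:
        return

    statuses = []
    for face_landmarks in face_mesh_results.multi_face_landmarks:
        _, part_height, _ = getSize(image, face_landmarks, INDEXES)
        _, face_height, _ = getSize(image, face_landmarks, mp_face_mesh.FACEMESH_FACE_OVAL)
        # The part counts as open if its height is a big enough fraction of the face height.
        statuses.append('OPEN' if (part_height / face_height) * 100 > threshold else 'CLOSE')
    return statuses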

Now we will utilize the function isOpen() created above to check the mouth and eyes status on a few sample images and display the results.

As expected, the results are fascinating!

Snapchat Filter Controlled by Facial Expressions

Now that we have the face expression recognizer, let's start building a Snapchat filter on top of it that will be triggered based on the facial expressions of the person in real-time.

Currently, our face expression recognizer can check whether the eyes and mouth are open 😯 or not 😌, so to get the most out of it, we can overlay scalable eyes 👀 images on top of the eyes of the user when their eyes are open, and a video of fire 🔥 coming out of the mouth of the user when the mouth is open.

Create a Function to Overlay the Image Filters

Now we will create a function overlay() that will apply the filters on top of the eyes and mouth of a person in images/videos, utilizing the facial landmarks to locate the face parts. It will also resize the filter images according to the size of the face part on which they are to be overlaid.
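A sketch of overlay(); it assumes the filter images have a black background, and the scale factor and the try/except guard against border cases are judgment calls:

def overlay(image, filter_img, face_landmarks, INDEXES):
    '''Overlay a filter image over the face part located by the given landmark indexes.'''
    annotated_image = image.copy()
    try:
        filter_img_height, filter_img_width, _ = filter_img.shape
        # Size and landmarks of the face part the filter goes on.
        _, face_part_height, landmarks = getSize(image, face_landmarks, INDEXES)
        # Resize the filter, keeping its aspect ratio (2.5x is an arbitrary scale).
        required_height = int(face_part_height * 2.5)
        resized_filter = cv2.resize(filter_img,
                                    (int(filter_img_width * (required_height / filter_img_height)),
                                     required_height))
        filter_height, filter_width, _ = resized_filter.shape
        # Mask that is white where the filter is (near) black.
        _, filter_img_mask = cv2.threshold(cv2.cvtColor(resized_filter, cv2.COLOR_BGR2GRAY),
                                           25, 255, cv2.THRESH_BINARY_INV)
        # Center the filter on the face part.
        center = landmarks.mean(axis=0).astype(int)
        location = (int(center[0] - filter_width / 2), int(center[1] - filter_height / 2))
        ROI = image[location[1]: location[1] + filter_height,
                    location[0]: location[0] + filter_width]
        # Keep the ROI pixels where the filter is black, then add the filter on top.
        resultant_image = cv2.bitwise_and(ROI, ROI, mask=filter_img_mask)
        resultant_image = cv2.add(resultant_image, resized_filter)
        annotated_image[location[1]: location[1] + filter_height,
                        location[0]: location[0] + filter_width] = resultant_image
    except Exception:
        # Skip the overlay if the filter does not fit inside the frame.
        pass
    return annotated_image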

Snapchat Filter on Real-Time Webcam Feed

Now we will utilize the function overlay() created above to apply filters based on the facial expressions, that we will recognize utilizing the function isOpen() on a real-time webcam feed.
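A sketch of the final loop; the filter asset paths and the threshold values are illustrative, not prescribed:

camera_video = cv2.VideoCapture(0)

# Filter assets (paths are illustrative).
left_eye_img = cv2.imread('media/left_eye.png')
right_eye_img = cv2.imread('media/right_eye.png')
fire_video = cv2.VideoCapture('media/fire.mp4')

while camera_video.isOpened():
    ok, frame = camera_video.read()
    if not ok:
        continue
    frame = cv2.flip(frame, 1)

    _, face_mesh_results = detectFacialLandmarks(frame, face_mesh_videos, display=False)
    if face_mesh_results.multi_face_landmarks:
        mouth_status = isOpen(frame, face_mesh_results, 'MOUTH', threshold=15)
        left_eye_status = isOpen(frame, face_mesh_results, 'LEFT EYE', threshold=4.5)
        right_eye_status = isOpen(frame, face_mesh_results, 'RIGHT EYE', threshold=4.5)

        for face_num, face_landmarks in enumerate(face_mesh_results.multi_face_landmarks):
            if left_eye_status[face_num] == 'OPEN':
                frame = overlay(frame, left_eye_img, face_landmarks, mp_face_mesh.FACEMESH_LEFT_EYE)
            if right_eye_status[face_num] == 'OPEN':
                frame = overlay(frame, right_eye_img, face_landmarks, mp_face_mesh.FACEMESH_RIGHT_EYE)
            if mouth_status[face_num] == 'OPEN':
                ok_fire, fire_frame = fire_video.read()
                if not ok_fire:  # Loop the fire animation.
                    fire_video.set(cv2.CAP_PROP_POS_FRAMES, 0)
                    _, fire_frame = fire_video.read()
                frame = overlay(frame, fire_frame, face_landmarks, mp_face_mesh.FACEMESH_LIPS)

    cv2.imshow('Snapchat Filter', frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to exit
        break

camera_video.release()
fire_video.release()
cv2.destroyAllWindows()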

Output

Cool! I am impressed by the results. Now, if you want, you can extend the application and add more filters like glasses, noses, and ears, and use some other facial expressions to trigger those filters.

Summary:

Today, in this tutorial, we learned about a very common computer vision task called Face landmarks detection. First, we covered what exactly it is, along with its applications, and then we moved to the implementation details of the solution provided by Mediapipe and how it uses a 2-step (detection + tracking) pipeline to speed up the process.

After that, we performed multi-face detection and 3D face landmarks detection using Mediapipe’s solutions on images and real-time webcam feed. 

Then we learned to recognize the facial expressions in the images/videos utilizing the face landmarks and after that, we learned to apply face filters, which were dynamically controlled by the facial expressions in the images/videos.

Alright, here are a few limitations of our application that you should know about. The face expression recognizer we created is really basic; to recognize dedicated expressions like shock or surprise, you should train a DL model on top of these landmarks.

Another current limitation is that the face filters are not currently being rotated with the rotations of the faces in the images/videos. This can be overcome simply by calculating the face angle and rotating the filter images with the face angle. I am planning to cover this and a lot more in my upcoming course mentioned above.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Automatically Generating Squid Game Memes Using OpenCV & Python

In this tutorial, you will learn to create a Python + OpenCV script that will generate the Squid Game memes automatically without using photoshop or other editors.

If you're not living in the Stone Age, then I'm willing to bet you must have witnessed the hype of Netflix's latest hit TV show, Squid Game. Nowadays every other post on the internet is about it, and it feels like a storm that has taken over the internet. Now, if you haven't watched the show already, I definitely recommend you check it out! Otherwise, society may not accept you 😂 … just kidding!

Also, I’m not going to be revealing any spoilers for the show, so don’t worry 🙂.

So in the last couple of weeks, I’ve been seeing a lot of memes related to this show, and have found some of the memes absolutely hilarious like this one:

You need context to get this, but as promised, I won't be giving any spoilers. Just to summarize: the characters had to carve out shapes from the candy above, and the more intricate the shape, the harder the challenge was. Now people online have been replacing the original umbrella with all sorts of things.

And I thought, why not embed the Bleed AI logo here using Photoshop and post it on my Facebook page? But then I got an even better idea: why not create a Python script capable of generating a new meme automatically, given this meme template and any logo. Something like this:

And I ended up creating this tutorial that will teach you to automatically generate these Squid Game memes in a step-by-step manner with each step explained in detail using just OpenCV and Python. 

So to start learning, keep reading 😏.

Import the Libraries

We will start by importing the required libraries.
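A minimal sketch of the imports for this tutorial:

import cv2
import numpy as np
import matplotlib.pyplot as plt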

Read an Image

Now we will use the function cv2.imread() to read a sample image and then display the image using the matplotlib library, after converting it into RGB from BGR format.
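A sketch, with an illustrative path to the meme template:

# Read the meme template image (the path is illustrative).
image = cv2.imread('media/squid_game_template.jpg')

plt.figure(figsize=[10, 10])
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()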

Retrieve the Candy ROI

Now we will simply crop the candy ROI from the input image we read and then display the ROI using the matplotlib library.
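A sketch using array slicing; the crop coordinates are hypothetical and depend on the template image you use:

# Crop the candy region with array slicing (coordinates are hypothetical).
candy_roi = image[80:500, 350:720]

plt.imshow(cv2.cvtColor(candy_roi, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()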

Remove the Umbrella Design from the Candy

Now that we have the required ROI, we will smoothen out the umbrella design from it using cv2.medianBlur() function. For this, we will perform:

  • Canny Edge Detection to detect the umbrella design regions, using the function cv2.Canny().
  • Dilation to increase size of the detected design edges, using the function cv2.dilate().

This will give us a mask image of the ROI, with pixel values of 255 at the indexes where the umbrella design is present and 0 at the remaining indexes, which we will utilize to smoothen out only the exact regions of the candy ROI where the umbrella design is present. So we will get rid of the umbrella design while retaining the candy texture.
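A sketch of this step; the Canny thresholds, kernel size, and blur aperture are judgment calls:

# Detect the edges of the umbrella design and grow them into a mask.
edges = cv2.Canny(cv2.cvtColor(candy_roi, cv2.COLOR_BGR2GRAY), 100, 200)
design_mask = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=2)

# Median-blur the whole ROI, then copy the blurred pixels back only where the design is.
blurred_roi = cv2.medianBlur(candy_roi, 21)
cleared_roi = candy_roi.copy()
cleared_roi[design_mask == 255] = blurred_roi[design_mask == 255]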

After clearing the previous design from the candy, our next step will be to embed a new one on the candy to create the meme we want.

Read and Preprocess the Design Image

For this purpose, we will first have to load the new design image from the disk and perform the required preprocessing on it. We will perform:

  • Resizing the design image to an appropriate size, using the function cv2.resize()
  • Canny Edge Detection on the resized image, to get the design edges, using the function cv2.Canny().
  • Dilation to increase size of the detected design edges, using the function cv2.dilate().
  • Median Blur to smoothen the detected design edges, using the function cv2.medianBlur().

This will give us a preprocessed mask of the design image, which we need to recreate that original umbrella-like carved effect on the candy.
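A sketch of the preprocessing; the logo path, target size, and filter parameters are illustrative:

# Load the new design (e.g., a logo) as a grayscale image (the path is illustrative).
design = cv2.imread('media/bleedai_logo.png', cv2.IMREAD_GRAYSCALE)

# Resize it to fit the candy (the size is hypothetical).
design = cv2.resize(design, (200, 200))

# Edge detection, dilation, and median blur to get a soft design mask.
design_edges = cv2.Canny(design, 100, 200)
design_edges = cv2.dilate(design_edges, np.ones((3, 3), np.uint8), iterations=1)
new_design_mask = cv2.medianBlur(design_edges, 5)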

Embed the new Design Image

Now we will overlay this preprocessed design over the region of interest of the cleared candy image. For this, we will first retrieve the ROI using the array slicing technique, and then we will modify the ROI by replacing some pixels values with the processed design pixel values, utilizing the mask of the design to find the indexes of the pixels to replace. And then, we will use the function cv2.addWeighted() to perform the weighted addition between the modified and the original ROI to get a transparency effect for the new design.

Note: The processed design is a one-channel image, so we will have to convert it into a three-channel image by merging that one-channel image three times using the function cv2.merge(), to overlay it over the three-channel candy image.
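A sketch of the embedding; the placement offsets and blending weights are judgment calls:

# Slice out the region of the cleared candy where the design goes (offsets are hypothetical).
roi = cleared_roi[100:300, 90:290]
modified_roi = roi.copy()

# Merge the one-channel mask into three channels, as explained in the note above.
design_bgr = cv2.merge((new_design_mask, new_design_mask, new_design_mask))

# Replace the pixels where the design mask is set.
modified_roi[new_design_mask > 0] = design_bgr[new_design_mask > 0]

# Weighted addition for a semi-transparent, carved look.
blended_roi = cv2.addWeighted(modified_roi, 0.6, roi, 0.4, 0)
cleared_roi[100:300, 90:290] = blended_roi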

Display and Save the Output Image

Now we will put together all of the resultant ROIs to get the output meme image, then save it to disk using the cv2.imwrite() function, and display it using the matplotlib library, after converting it into RGB from BGR format.
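A sketch of the final step, reusing the hypothetical crop coordinates from earlier:

# Put the modified candy ROI back into the template (same hypothetical coordinates).
output_image = image.copy()
output_image[80:500, 350:720] = cleared_roi

# Save the meme to disk (the path is illustrative).
cv2.imwrite('media/output_meme.png', output_image)

plt.figure(figsize=[10, 10])
plt.imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()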

Looks cool, right? With this, we have completed the script to automatically generate squid game dalgona candy memes for any design we want.

Summary

In this tutorial, we learned to automatically generate Squid Game memes using just OpenCV and Python, and while doing so, we learned a couple of useful image processing techniques like Canny edge detection, dilation, and median blurring. Now you can try to improve the output further by tuning the parameters if you want.

Or you can try to generate a different meme using the concepts you have learned in this tutorial and share the results with me. It is always exciting to see you guys build on top of what you learn here at Bleed AI, so make sure to post the links to your memes in the comments.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

Different Branches of Machine Learning | Artificial Intelligence Part 3/4 (Episode 5 | CVFE)

In the previous episode of the Computer Vision For Everyone (CVFE) course, we discussed the history of AI in detail, covering almost all major events since 1950, along with the winters AI faced and their causes. I also explained what exactly the terms AI, machine learning, and deep learning mean, in the simplest manner possible.

Now today in this episode, we’ll go a little deeper into machine learning and take a look at different branches of machine learning in detail with their examples.

This is the 3rd part of our 4-part series on AI. I have witnessed many experienced practitioners who have been working in the field for years but do not know the basic fundamentals of AI, which is quite surprising, as a solid foundation in the theoretical concepts of AI/ML plays a major role in working with AI/ML algorithms efficiently.

So through this series of tutorials, I'm trying to provide a thorough understanding of the Artificial Intelligence field for everyone, with an increase in technicality and depth in each subsequent tutorial.

Alright, so without further ado, let’s get started.

Machine learning can be further divided into three different branches, i.e., Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Other than these three, there are some hybrid branches too, but we'll learn about them in the next episode.

For now let’s look at each of these three core ML branches, one by one.

  • Supervised Learning.
  • Unsupervised Learning.
  • Reinforcement Learning.

Supervised Learning

Supervised Learning is the most common branch of machine learning, in fact, most of the applications you see these days are examples of supervised learning.

For example, a House Price Prediction System is a popular supervised machine learning problem, where a Machine Learning model predicts the price of a house by looking at some features of the house like house area, the number of bedrooms it has and its location, etc.

Also, it is worth noting that when a Machine Learning model predicts a number, then it’s also called a Regression Problem and it has many types. For example, localizing an object in images/videos using an object detector is also a regression problem, as in this scenario the output i.e., the coordinates (x1, y1, x2, and y2) of a bounding box enclosing the object are numbers.

Another example for Supervised Learning would be a machine learning model looking at an image or a video and predicting a category/label of the object in it.

And whenever a machine learning model predicts a class label that is normally based on some features of the object in the image/video, the process is called a Classification Task or Problem. So both Classification & Regression fall in supervised learning.

But what exactly is this Supervised Learning? We have looked at its examples, but how do we define it? Well, it's pretty simple.

In supervised learning, you first have to label all the training examples. For example, if you're doing something like cat & dog classification, you'll first label all training images or videos with either cat or dog. Then you feed all the training examples to the machine learning model, and the model trains or learns from these examples.

And after it has been trained, we can then show the model some test images or videos that it hasn’t seen before to get the predictions on the test examples and evaluate the model’s performance by verifying the results.

This whole process is called Supervised Machine Learning. Now let's check its definition in technical terms.

In supervised learning, we take features (x), which can be anything from pixel values to extracted house features, and map them to an output (y), which can be anything from labels like cat/dog to a regression number like a house price.

Each such x and y is an input-output pair, and with an increase in the training examples, these input-output pairs also increase, making the machine learning model (whose job is to learn this input-output relationship during the training process) more accurate.

So essentially, when we train a model, ideally it learns a function capable of mapping any unseen input example to an appropriate output. And this is supervised learning. Although supervised learning is responsible for most of the AI applications we see today, the biggest issue with this approach is that it takes a lot of time and human effort to create the required input-output pairs for training the model.

So for example, if you had 10,000 images of cats and dogs then you’ll first have to go and label each with either a cat or a dog label, which is a very time-consuming and tedious process.

Unsupervised Learning

Let’s take a look at another machine learning approach called Unsupervised Learning where you don’t have to label anything.

So you have an input (x) but don’t have to map it to output (y), the goal of the machine learning model here is to learn the internal structures, distributions, or patterns in the data.

But how is this useful? Well, let’s discuss Clustering to find out, which is a type of unsupervised learning problem.

Suppose you have lots of unlabeled images of 3 simple shapes like circles, rectangles, and triangles, and all these images are mixed up. So what you can do is show all these examples to an unsupervised machine learning model. 

The model will learn the common patterns and will group the images based on similarity. For example, if just one feature or pattern, i.e., the number of corners, is considered, then the model will cluster the images into 3 different groups: circles, triangles, and rectangles.

And immediately you'll recognize the actual classes and label these three clusters, which saves the effort of labeling each image separately. But this is a very basic example, and it isn't always this simple. Suppose that instead of shapes you had 3 classes of animals: cats, dogs, and reptiles.

Then ideally, the clustering algorithm should give you 3 clusters of images, with each cluster having images of only one class. But this doesn't happen in reality, because clustering based just on raw pixels is not meaningful; the algorithm may cluster images with similar backgrounds or some other property.

So what we can do here is extract some meaningful features and then cluster data based on those features. And in the end, you can use some metrics to determine if the clusters generated by the algorithm are meaningful or not.

Clustering is popularly used in the e-commerce Industry to cluster customers into different segments like frequent buyers, or people who purchase during Sales, etc.

This helps a lot in designing customized marketing campaigns. Another type of Unsupervised problem is called Association.

In this technique, we analyze data and discover rules that describe groups of data, for example, we can find patterns like if a certain data group contains Feature A, then there is a high probability it will contain Feature B too. 

So association models help in associating one variable with a data group. Let's check an example: if we train an association algorithm on customer purchases, it may tell us things like "customers who bought item A also bought items B and C". So if a buyer buys a fan, he may see some excellent recommendations, like a rope xD.


So when you see recommendations in online stores while shopping, it happens due to association algorithms running in the background on your data.

Reinforcement Learning

Alright, we have looked at Supervised Learning & Unsupervised Learning. Now let’s talk about Reinforcement Learning which is something totally different.

Now before we get into reinforcement learning, I first want to discuss the necessity for it. Consider this: if you wanted to train an AI to walk, what you could do is attach a ton of sensors to someone's legs and capture things like angular velocity, acceleration, muscle tension, and whatnot, then feed all these data points to a supervised algorithm and try to train it so it learns to walk.

But here's the thing: this approach will not prove to be very effective, because it's really hard to describe how to walk, or which particular features to capture or study in order to learn to walk.

So a much better approach would be learning to walk by trial and error, and this is what Reinforcement Learning is. It is used whenever we're faced with a problem that is hard to describe. Google's DeepMind got some really interesting results when they trained an AI to walk using reinforcement learning.

In Reinforcement learning, you have an agent, which has to interact with some given environment in order to reach its goal.

Consider the example of a self-driving car, where the agent is the car and the environment can be the roads, people, or any obstacles that the car has to deal with. The objective of this agent i.e., a car is to reach its goal or destination while avoiding any obstacles in the way.

Now what happens during the training phase is that the agent tries to reach the goal by taking actions, these actions are like moving the car forward, backward, taking turns, slowing down, etc. 

And the environment has a state that changes: cars can move towards the agent, an obstacle might block the agent, or anything else can happen in the environment.

As the agent gets closer and closer to the goal, it gets rewarded; this way, the agent knows that the actions it took were correct.

And similarly, if the agent makes mistakes it’s punished with a penalty and this tells the agent that the actions it took were bad.

This whole process is repeated in a loop over and over during the training until the agent learns to avoid mistakes and reach the goal using an effective approach.

Also, when it comes to AI playing games, reinforcement learning is the go-to approach. In fact, DeepMind's popular 2016 victory against the world Go champion (AlphaGo) was built on deep reinforcement learning.

Summary

In this episode of CVFE, we learned about the three primary paradigms in machine learning, i.e., Supervised Learning, Unsupervised Learning, and Reinforcement Learning, in depth with examples.

Now you have learned the pros and cons of all three and the approach that you should use totally depends on the problem that you are trying to solve. If you are still confused about the approach best suited for your project you can ask me in the comments section.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

With this, I conclude this episode. In the next and final part of this series, I'll go deeper and discuss the hybrid fields of AI, applied fields, AI industries, and AI applications, and finally we'll connect everything we have discussed and show you how it all relates.
Share the post with your colleagues if you have found it useful. Also, make sure to check out part 1 and part 2 of the series, and subscribe to the Bleed AI YouTube channel to be notified when new videos are released.

History of AI, Rise Of Machine Learning and Deep Learning | Artificial Intelligence Part 2/4 (Episode 4 | CVFE)

In the previous episode of the Computer Vision For Everyone (CVFE) course, we discussed a high-level introduction to AI and its categories i.e., ANI(Artificial Narrow Intelligence), AGI(Artificial General Intelligence), ASI(Artificial Super Intelligence) in detail.

Now in this tutorial, we’ll see the evolution of AI throughout time and finally understand what popular terms like machine learning and deep learning actually mean and how they came about. Even if you already know these things, I would still advise you to stick around as this tutorial is actually packed with a lot of other exciting stuff too.

This episode of the CVFE course is the 2nd part of our 4-part series on AI. Throughout the series, my focus is on giving you a thorough understanding of the Artificial Intelligence field across 4 different tutorials; with each tutorial, we dive deeper and get more technical.

I'll start by discussing some exciting historical details about how AI emerged, and I'll keep it simple. Up till 1949, there wasn't much work on intelligent machines; yes, there were some key events, like the formulation of Bayes' theorem in 1763 or the demonstration of the first chess-playing machine by Leonardo Torres y Quevedo in 1914.

But the first major interest in AI developed, or the first AI boom started, in the 1950s, so let's start from there. Now, I can't cover every important event in AI, but we will go over some major ones. So let's get started.

In 1950, Alan Turing published "Computing Machinery and Intelligence", in which he proposed "The Imitation Game", which later became known as the famous "Turing Test".

This was a test of a machine's ability to exhibit intelligent behavior like a human. If a human evaluator cannot differentiate between a machine and a human in a conversation, then that machine is said to have passed the Turing Test.

There's also a great movie built around Alan Turing and the Turing Test, named The Imitation Game, which I definitely recommend you check out.

In 1955, the term "Artificial Intelligence" was coined by John McCarthy and some others. It was then further described in a workshop in 1956, which is generally considered the birthdate of AI.

In December 1956, Herbert Simon and Allen Newell developed the Logic Theorist, which was the first AI program.

In 1957, Frank Rosenblatt developed the Perceptron, the most basic version of an artificial neural network; by the way, an extension of this algorithm alone would later give rise to the field of deep learning.

In 1958, Lisp was developed by John McCarthy and became the most popular programming language used in AI research.

In 1959, Arthur Samuel coined the term "Machine Learning", defining it as "the field of study that gives computers the ability to learn without being explicitly programmed".

Alright, at this point I should probably explain what machine learning is, as the definition above is a little confusing. But first, let's understand what traditional or classical AI is.

In traditional AI, programmers code a lot of instructions into a machine about the task it needs to perform. So in general, you can define AI as "a branch of computer science that focuses on creating intelligent systems which exhibit intellectual, human-like behavior."

Or another way to say this is; “Any program which resembles or mimics some form of human intelligence is AI.”

But this is traditional AI, not machine learning. Now you may be thinking: what's the problem? Why do we even need machine learning when we can manually instruct machines to exhibit human-like behavior?

Well, traditional AI itself is great, and it provided a lot of applications in the initial years of AI, but when we started to move towards more complex applications (like self-driving cars), traditional rule-based AI just didn't cut it.

Consider, for example, that you instruct a self-driving car to drive when it sees a green light and stop when it sees a pedestrian. What will happen if both events happen at the same time?

Although this is a really simple case and can be solved by checking both conditions, what if the pedestrian is Donald Trump, should you still stop? Or just drive through him.

Anyways pun aside, this should give you a brief idea about how such a simple application can quickly become complex with the increase in the number of variables and you can’t expect programmers to handle and code conditions for all types of future events.

So what’s the best approach? 

Well, how about an approach in which we show a machine lots of examples of some object. And after the machine has learned how the object looks, we show it images of the same objects it has never seen and check if it can recognize the object or not.

Similarly, showing a self-driving car thousands and thousands of hours of driving data makes it learn to drive. This is machine learning, and it's also how we humans learn: by watching and observing things and people around us.

So in simple words: "Machine learning is just a subset of AI that consists of all those algorithms and techniques that can learn from data; in essence, these algorithms give computers the capability to learn without being explicitly programmed."

Alright, now let’s move on with our timeline.

In 1961, the first industrial robot, Unimate, started working on an assembly line in a General Motors plant in New Jersey.

In 1965, Herbert Simon predicted that "within twenty years machines will be capable of doing any work a man can do." Needless to say, it didn't turn out that well; it's 2021 and we're still a long way from there. Also in 1965, ELIZA, the first AI chatbot, which could carry conversations in English on any topic, was invented.

In 1966, Shakey, the first general-purpose mobile robot was created.


In 1969 … so what happened in 1969? Is it the moon landing? No, no, no, something significantly more important happened xD. In 1969, the famous backpropagation algorithm was described by Arthur Bryson and Yu-Chi Ho; this is the same algorithm that has tremendously contributed to the success of the deep learning applications we see today.

Around the same time, Marvin Minsky said: "In from three to eight years we will have a machine with the general intelligence of an average human being." Hmm 🤔… I'm loving the confidence the AI researchers had in the last century, props for that. Anyways, needless to say, that did not happen.

After the 50s and 60s, two decades of AI hype, the field of AI saw its first winter: a period in which funding for AI research and development is cut down.

It all started in 1973, with James Lighthill’s report to the British Science Research Council on the state of AI research. In summary, the report concluded that the promises made by the field of AI had not been delivered, and that most of the techniques and algorithms only worked well on toy problems and fell flat in real-world scenarios. This report led to drastic cuts in AI funding.

After the effects of the first AI winter faded, a new AI era emerged, and this time people were more application-focused. In 1979, the Stanford Cart successfully crossed a chair-filled room without human intervention in about five hours, becoming one of the earliest examples of an autonomous vehicle.

In 1981, the Japanese Ministry of International Trade and Industry invested $400 million in the Fifth Generation Computer Project, which aimed to develop computers that could carry on conversations, translate languages, interpret pictures, and reason like human beings.

In 1986, the first driverless car, a Mercedes-Benz van equipped with cameras and sensors, was built at Bundeswehr University in Munich under the direction of Ernst Dickmanns; it drove at up to 55 mph on empty streets.

At this point I should mention that in 1984, a panel called “The Dark Age of AI” was held, where Marvin Minsky and others warned of a coming AI winter, predicting an imminent burst of the AI bubble. The burst did happen three years later, in 1987, and it again led to a reduction in AI investment and research funding.

This was the second AI winter, and it went on for 6 years. Still, some researchers kept working in the field. For example, in 1989, Yann LeCun and other researchers at AT&T Bell Labs successfully applied the backpropagation algorithm to a multi-layer Convolutional Neural Network called LeNet, which could recognize handwritten ZIP codes.

This was the first practical demonstration of deep learning, although the term ‘Deep Learning’ was popularized later, around 2006, by Geoffrey Hinton. Speaking of deep learning, let’s understand what it is.

So, remember when I explained that machine learning is a set of algorithms that learn from data? Well, among those machine learning algorithms there is one called the “Perceptron”, the simplest form of an artificial neural network, which is inspired by the workings of our brain. A perceptron contains a single layer, and this layer contains nodes called neurons.

Each neuron can remember information about the data as it passes through it.

So the greater the number of neurons, the greater the ability of the network to remember the data. Similarly, you can also add more layers to the network to increase its learning ability; each new layer can extract more information, or features, from the input data.

Not only that, but each new layer builds on the knowledge learned by the previous layers. This way, if you’re trying to build a network that can recognize cats, the earlier layers will learn to recognize low-level features like edges and corners, while the later layers will learn high-level concepts like whiskers, ears, and a cat’s tail.

A network composed of multiple layers like this is called a deep neural network, and whenever you’re using Deep Neural Networks, or DNNs for short, you’re doing Deep Learning.
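If you’d like to see what these “layers of neurons” look like in code, below is a tiny sketch of a forward pass through a feed-forward network written in plain NumPy. The layer sizes and random weights here are purely illustrative assumptions; a real network would learn its weights from data (e.g., via backpropagation) rather than using random ones.

```python
# A toy forward pass through a 2-hidden-layer feed-forward network.
# Layer sizes and random weights are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # A common activation function: keeps positives, zeroes out negatives.
    return np.maximum(0, x)

# Input: a single sample with 4 features.
x = rng.random(4)

# Each layer is just a weight matrix plus a bias vector;
# more neurons simply means a bigger matrix.
W1, b1 = rng.random((8, 4)), rng.random(8)  # hidden layer 1: 8 neurons
W2, b2 = rng.random((6, 8)), rng.random(6)  # hidden layer 2: 6 neurons
W3, b3 = rng.random((2, 6)), rng.random(2)  # output layer: 2 scores

# Each layer transforms the output of the previous one,
# building on the features the earlier layers extracted.
h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)
out = W3 @ h2 + b3
print(out)  # two raw output scores
```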

The example I just showed you was of a feed-forward network, but there are lots of other types of neural networks, like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and many others.

Alright, here’s a great definition of deep learning by Yoshua Bengio, one of the pioneers of modern AI. I’ve modified this definition to make it simpler.

“Deep learning is a collection of methods or models that learn hierarchies of features. At each subsequent layer in the model some features are learned, and the knowledge gained in lower-level layers is used by higher-level layers to learn/build abstract, high-level concepts. This way, the model can learn features from raw data at multiple levels of abstraction, without depending on human-crafted features.”

If this definition sounds complicated, I would recommend reading it again; it’s describing the same hierarchical learning system I just explained.

Coming back to the definition, notice the last part, where I mentioned that we don’t need human-crafted features; this is the main advantage of deep learning over machine learning.

In machine learning, human engineers often need to do something called feature engineering, i.e., manually designing informative inputs, to make it easier for the model to learn, but in deep learning you don’t need to do that.
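As a tiny illustration of what feature engineering can look like (the features and the fake image below are made-up assumptions, not something from this tutorial):

```python
# A made-up example of manual feature engineering: instead of feeding
# raw pixels to a model, an engineer hand-designs summary statistics
# they believe are informative.
import numpy as np

def extract_features(image):
    # image: a grayscale image as a 2D NumPy array of floats.
    brightness = image.mean()                      # overall brightness
    contrast = image.std()                         # rough contrast measure
    edges = np.abs(np.diff(image, axis=1)).mean()  # horizontal edge strength
    return [brightness, contrast, edges]

fake_image = np.random.rand(64, 64)
print(extract_features(fake_image))
```

A deep learning model would instead consume the raw pixel values directly and learn features like these (and far better ones) on its own.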

Another major advantage of deep learning is that as the amount of data increases, deep learning models keep getting better and better, whereas in machine learning the performance plateaus after a certain point. This is because most machine learning models are not complex enough to utilize and learn from all that data.

Alright, so below is an illustration of how AI, Machine Learning, and Deep Learning are related.

Even though deep learning held great promise, it didn’t take off in the 1990s. This is because, at the time, we didn’t have much data, GPUs were not powerful enough, and the models and algorithms themselves had some limitations.

Now let’s continue with our timeline.

In October 1996, Taha Anwar was born xD… Well, you never know, I might create or do something, man.

Anyways, let’s move on.

In 1997, the second AI winter ended and progress in AI started again. Sepp Hochreiter and Jürgen Schmidhuber proposed the Long Short-Term Memory (LSTM) model, a very popular type of neural network used to learn from sequences of data.

In the same year, Deep Blue became the first computer chess-playing program to beat a reigning world chess champion, Garry Kasparov.

In 1998, Yann LeCun and Yoshua Bengio published influential papers on applying neural networks to handwriting recognition and on optimizing backpropagation.

In 2000, MIT Ph.D. student Cynthia Breazeal developed Kismet, a robot structured like a human face, with eyes, lips, and everything, which could recognize and simulate emotions.

In the same year, Honda introduced ASIMO, the first humanoid robot able to walk as fast as a human, delivering trays to customers in a restaurant setting.

In 2005, Stanley became the first autonomous vehicle to win the DARPA Grand Challenge; this event greatly fueled interest in self-driving cars.

In 2007, Fei-Fei Li and colleagues at Princeton University started assembling ImageNet, the world’s largest database of annotated images. In 2010, the ImageNet Large Scale Visual Recognition Challenge, an annual AI object recognition competition, was launched. And in 2011, Watson, a natural language bot created by IBM, defeated two Jeopardy! champions.

In the same year, Apple released Siri, a virtual assistant capable of answering questions in natural language.

Now, let’s discuss the ImageNet challenge again. This competition ran from 2010 until 2017 and was responsible for some great architectural innovations in modern AI algorithms.

Perhaps the most revolutionary year for this competition, and a landmark year in AI, was 2012, when a team under Geoffrey Hinton presented AlexNet (a type of Convolutional Neural Network) at the competition.

This deep neural network was cooked up just right by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. The timing was perfect: in 2012 we finally had all the required ingredients to make deep learning work.

We had the required data (ImageNet, with millions of high-resolution images), the computation power (2012 offered plenty of powerful GPUs), and we had also made tremendous strides in the architectural improvement of neural networks.

And when they combined all these elements at the right time, AlexNet was born.

A network that achieved roughly a 16% error rate in the ImageNet competition, almost 10 percentage points better than the previous year’s best of around 26%.

This was a huge milestone. In the following year, nearly all winning entries were using deep learning models, and finally, Deep Learning had taken off.

What followed in the years after was innovation upon innovation in AI using deep learning approaches, and not only in research: we saw AI being successfully applied to almost every other industry.

Every year, billions of dollars are pumped into AI by investors, hundreds of promising new AI startups appear, and thousands of AI papers are published.

And a lot of the initial success of modern AI can be attributed to three people, also known as the pioneers of modern AI: Yann LeCun, Geoffrey Hinton, and Yoshua Bengio.

Summary

In this episode of the CVFE course, we discussed the history of AI, how it became one of the most promising fields, the winters it faced in the past, and what exactly terms like Machine Learning and Deep Learning mean.

Now, one question you might have is: will there be a third AI winter? And to be honest, the answer is no!

In 2016, DeepMind’s AlphaGo defeated the world Go champion, a very difficult feat. In 2019, OpenAI Five beat the world champions at Dota 2, a game that requires a lot of skill to master.

In 2020, language models like OpenAI’s GPT-3 stunned the world with their abilities.

So no, the next AI winter is not coming anytime soon; AI is seeing its best years. Back in 2014, a chatbot named Eugene Goostman was even claimed to have passed the Turing Test by convincing 33% of the judges that it was a 13-year-old Ukrainian boy.

How cool, and equally frightening, is that?

With this, I conclude part 2. In the next episode of this series, I’ll go into more detail and discuss the different branches of machine learning.

In case you have any questions, please feel free to ask in the comment section and share the post with your colleagues if you have found it useful.

You can reach out to me personally for a 1 on 1 consultation session in AI/computer vision regarding your project. Our talented team of vision engineers will help you every step of the way. Get on a call with me directly here.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI


Make sure to check out part 1 of the series and Subscribe to the Bleed AI YouTube channel to be notified when new videos are released.