Generating DeepFakes from a Single Image in Minutes

Generating DeepFakes from a Single Image in Minutes

In this tutorial, we will learn how to manipulate facial expressions and create a DeepFake video out of a static image using the famous First-Order Motion Model. Yes, you heard that right, we just need a single 2D image of a person to create the DeepFake video.

Excited yet? … not that much ? ..  well what if I tell you, the whole tutorial is actually on Google Colab, so you don’t need to worry about installation or GPUs to run, everything is configured.

And you know what the best part is?

Utilizing the colab that you will get in this tutorial, you can generate deepfakes in a matter of seconds, yes seconds, not weeks, not days, not hours but seconds.

What is a DeepFake?

The term DeepFake is a combination of two words; Deep refers to the technology responsible for generating DeepFake content, known as Deep learning, and Fake refers to the falsified content. The technology generates synthetic media, to create falsified content, which can be done by either replacing or synthesizing the new content (can be a video or even audio).

Below you can see the results on a few sample images:

This feels like putting your own words in a person’s mouth but on a whole new level.

Also, you may have noticed, in the results above, that we are generating the output video utilizing the whole frame/image, not just on the face ROI that people normally do.

First-Order Motion Model

We will be using the aforementioned First-Order Motion Model, so let’s start by understanding what it is and how it works?

The term First-Order Motion refers to a change in luminance over space and time, and the first-order motion model utilizes this change to capture motion in the source video (also known as the driving video). 

The framework is composed of two main components: motion estimation (which predicts a dense motion field) and image generation (which predicts the resultant video). You don’t have to worry about the technical details of these modules to use this model. If you are not a computer vision practitioner, you should skip the paragraph below.

The Motion Extractor module uses the unsupervised key point detector to get the relevant key points from the source image and a driving video frame. The local affine transformation is calculated concerning the frame from the driving video. A Dense Motion Network then generates an occlusion map and a dense optical flow, which is fed into the Generator Module alongside the source image. The Generator Module generates the output frame, which is a replica of the relevant motion from the driving video’s frame onto the source image.

This approach can also be used to manipulate faces, human bodies, and even animated characters, given that the model is trained on a set of videos of similar object categories.

Now that we have gone through the prerequisite theory and implementation details of the approach we will be using, let’s dive into the code.

Download code:


Alright, let’s get started.

Step 1: Setup the environment

In the first step, we will set up an environment that is required to use the First-Order Motion model.

Step 1.1: Clone the repositories

Clone the official First-Order-Model repository.

Step 1.2: Install the required Modules

Install helper modules that are required to perform the necessary pre- and post-processing.

Import the required libraries.

Step 2: Prepare a driving video

In this step, we will create a driving video and will make it ready to be passed into the model.

Step 2.1: Record a video from the webcam

Create a function record_video() that can access the webcam utilizing JavaScript.

Remember that Colab is a web IDE that runs entirely on the cloud, so that’s why JavaScript is needed to access the system Webcam.

Now utilize the record_video() function created above, to record a video. Click the recording button, and then the browser will ask for user permission to access the webcam and microphone (if you have not allowed these by default) after allowing, the video will start recording and will be saved into the disk after a few seconds. Please make sure to have neutral facial expressions at the start of the video to get the best Deep Fake results.

You can also use a pre-recorded video if you want, by skipping this step and saving that pre-recorded video at the video_path.

The video is saved, but the issue is that the video is just a set of frames with no FPS and Duration information, and this can cause issues later on, so now, before proceeding further, resolve the issue by utilizing the FFMPEG command.

Step 2.2: Crop the face from the recorded video

Crop the face from the video by utilizing the script provided in the First-Order-Model repository.

The Script will generate a FFMPEG Command that we can use to align and crop the face region of interest after resizing it to 256x256Note that it does not print any FFMPEG Command if it fails to detect the face in the video.

Utilize the FFMPEG command generated by the script to create the desired video.

Now that the cropped face video is stored in the disk, display it to make sure that we have extracted exactly what we desired.

Perfect! The driving video looks good. Now we can start working on a source image.

Step 3: Prepare a source Image

In this step, we will make the source Image ready to be passed into the model.

Download the Image

Download the image that we want to pass to the First-Order Motion Model utilizing the wget command.

Load the Image

Read the image using the function cv2.imread() and display it utilizing the matplotlib library.

Note: In case you want to use a different source image, make sure to use an image of a person with neutral expressions to get the best results.

Step 3.1: Detect the face

Similar to the driving video, we can’t pass the whole source image into the First-Order Motion Model, we have to crop the face from the image and then pass the face image into the model. For this we will need a Face Detector to get the Face Bounding Box coordinates and we will utilize the Mediapipe’s Face Detection Solution.

Initialize the Mediapipe Face Detection Model

To use the Mediapipe’s Face Detection solution, initialize the face detection class using the syntax, and then call the function with the arguments explained below:

  • model_selection – It is an integer index ( i.e., 0 or 1 ). When set to 0, a short-range model is selected that works best for faces within 2 meters from the camera, and when set to 1, a full-range model is selected that works best for faces within 5 meters. Its default value is 0.
  • min_detection_confidence – It is the minimum detection confidence between ([0.0, 1.0]) required to consider the face-detection model’s prediction successful. Its default value is 0.5 ( i.e., 50% ) which means that all the detections with prediction confidence less than 0.5 are ignored by default.

Create a function to detect face

Create a function detect_face() that will utilize the Mediapipe’s Face Detection Solution to detect a face in an image and will return the bounding box coordinates of the detected face.

To perform the face detection, pass the image (in RGB format) into the loaded face detection model by using the function The output object returned will have an attribute detections that contains a list of a bounding box and six key points for each face in the image.

Note that the bounding boxes are composed of xmin and width (both normalized to [0.0, 1.0] by the image width) and ymin and height (both normalized to [0.0, 1.0] by the image height). Ignore the face key points for now as we are only interested in the bounding box coordinates.

After performing the detection, convert the bounding box coordinates back to their original scale utilizing the image width and height. Also draw the bounding box on a copy of the source image using the function cv2.rectangle().

Utilize the detect_face() function created above to detect the face in the source image and display the results.

Nice! face detection is working perfectly.

Step 3.2: Align and crop the face

Another very important preprocessing step is the Face Alignment on the source image. Make sure that the face is properly aligned in the source image otherwise the model can generate weird/funny output results.

To align the face in the source image, first detect the 468 facial landmarks using Mediapipe’s Face Mesh Solution, then extract the eyes center and nose tip landmarks to calculate the face orientation and then finally rotate the image accordingly to align the face.

Initialize the Face Landmarks Detection Model

To use the Mediapipe’s Face Mesh solution, initialize the face mesh class using the syntax and call the function with the arguments explained below:

  • static_image_mode – It is a boolean value that is if set to False, the solution treats the input images as a video stream. It will try to detect faces in the first input images, and upon a successful detection further localizes the face landmarks. In subsequent images, once all max_num_faces faces are detected and the corresponding face landmarks are localized, it simply tracks those landmarks without invoking another detection until it loses track of any of the faces. This reduces latency and is ideal for processing video frames. If set to True, face detection runs on every input image, ideal for processing a batch of static, possibly unrelated, images. Its default value is False.
  • max_num_faces – It is the maximum number of faces to detect. Its default value is 1.
  • refine_landmarks – It is a boolean value that is if set to True, the solution further refines the landmark coordinates around the eyes and lips, and outputs additional landmarks around the irises by applying the Attention Mesh Model. Its default value is False.
  • min_detection_confidence – It is the minimum detection confidence ([0.0, 1.0]) required to consider the face-detection model’s prediction correct. Its default value is 0.5 which means that all the detections with prediction confidence less than 50% are ignored by default.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) from the landmark-tracking model for the face landmarks to be considered tracked successfully, or otherwise face detection will be invoked automatically on the next input image, so increasing its value increases the robustness, but also increases the latency. It is ignored if static_image_mode is True, where face detection simply runs on every image. Its default value is 0.5.

We will be working with images only, so we will have to set the static_image_mode to True. We will also define the eyes and nose landmarks indexes that are required to extract the eyes and nose landmarks.

Create a function to extract eyes and nose landmarks

Create a function extract_landmarks() that will utilize the Mediapipe’s Face Mesh Solution to detect the 468 Facial Landmarks and then extract the left and right eyes corner landmarks and the nose tip landmark.

To perform the Face(s) landmarks detection, pass the image to the face’s landmarks detection machine learning pipeline by using the function But first, convert the image from BGR to RGB format using the function cv2.cvtColor() as OpenCV reads images in BGR format and the ml pipeline expects the input images to be in RGB color format.

The machine learning pipeline outputs an object that has an attribute multi_face_landmarks that contains the 468 3D facial landmarks for each detected face in the image. Each landmark has:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the center of the head being the origin, and the smaller the value is, the closer the landmark is to the camera.

After performing face landmarks detection on the image, convert the landmarks’ x and y coordinates back to their original scale utilizing the image width and height and then extract the required landmarks utilizing the indexes we had specified earlier. Also draw the extracted landmarks on a copy of the source image using the function, just for visualization purposes.

Now we will utilize the extract_landmarks() function created above to detect and extract the eyes and nose landmarks and visualize the results.

Cool! it is accurately extracting the required landmarks.

Create a function to calculate eyes center

Create a function calculate_eyes_center() that will find the left and right eyes center landmarks by utilizing the eyes corner landmarks that we had extracted in the extract_landmarks() function created above.

Use the extracted_landmarks() and the calculate_eyes_center() function to calculate the central landmarks of the left and right eyes on the source image.

Working perfectly fine!

Create a function to rotate images

Create a function rotate_image() that will simply rotate an image in a counter-clockwise direction with a specific angle without losing any portion of the image.

Utilize the rotate_image() function to rotate the source image at an angle of 45 degrees.

Rotation looks good, but rotating the image with a random angle will not bring us any good.

Create a function to find the face orientation

Create a function calculate_face_angle() that will find the face orientation, and then we will rotate the image accordingly utilizing the function rotate_image() created above, to appropriately align the face in the source image.

To find the face angle, first get the eyes and nose landmarks using the extract_landmarks() function then we will pass these landmarks to the calculate_eyes_center() function to get the eyes center landmarks, then utilizing the eyes center landmarks we will calculate the midpoint of the eyes i.e., the center of the forehead. And we will use the detect_face() function created in the previous step, to get the face bounding box coordinates and then utilize those coordinates to find the center_pred point i.e., the mid-point of the bounding box top-right and top_left coordinate.

And then finally, find the distance between the nosecenter_of_forehead and center_pred landmarks as shown in the gif above to calculate the face angle utilizing the famous cosine-law.

Utilize the calculate_face_angle() function created above the find the face angle of the source image and display it.

Face Angle: -8.50144759667417

Now that we have the face angle, we can move on to aligning the face in the source image.

Create a Function to Align the Face and Crop the Face Region

Create a function align_crop_face() that will first utilize the function calculate_face_angle() to get the face angle, then rotate the image accordingly utilizing the rotate_image() function and finally crop the face from the image utilizing the face bounding box coordinates (after scaling) returned by the detect_face() function. In the end, it will also resize the face image to the size 256x256 that is required by the First-Order Motion Model.

Use the function align_crop_face() on the source image and visualize the results.

Make sure that the whole face is present in the cropped face ROI results. Increase/decrease the face_scale_factor value if you are testing this colab on a different source image. Increase the value if the face is being cropped in the source image and decrease the value if the face ROI image contains too much background.

I must say its looking good! all the preprocessing steps went as we intended. But now comes a post-processing step, after generating the output from the First-Order Motion Model.

Remember that later on, we will have to embed the manipulated face back into the source image, so a function to restore the source image’s original state after embedding the output is also required.

Create a function to restore the original source image

So now we will create a function restore_source_image() that will undo the rotation we had applied on the image and will remove the black borders which appeared after the rotation.

Utilize the calculate_face_angle() and rotate_image() function to create a rotated image and then check if the restore_source_image() can restore the images original state by undoing the rotation and removing the black borders from image.

Step 4: Create the DeepFake

Now that the source image and driving video is ready, so now in this step, we will create a DeepFake video.

Step 4.1: Download the First-Order Motion Model

Now we will download the required pre-trained network from the Yandex Disk Models. We have multiple options there, but since we are only interested in face manipulation, we will only download the vox-adv-cpk.pth.tar file.

Create a function to display the results

Create a function display_results() that will concatenate the source image, driving video, and the generated video together and will show the results.

Step 4.2: Load source image and driving video (Face cropped)

Load the pre-processed source image and the driving video and then display them utilizing the display_results() function created above.

Step 4.3: Generate the video

Now that everything is ready, utilize the script that was imported earlier to finally generate the DeepFake video. First load the model file that was downloaded earlier along with the configuration file that was available in the First-Order-Model repository that was cloned. And then generate the video utilizing the demo.make_animation() function and display the results utilizing the display_results() function.

Step 4.4: Embed the manipulated face into the source image

Create a function embed_face() that will simply insert the manipulated face in the generated video back to the source image.

Now let’s utilize the function embed_face() to insert the manipulated face into the source image.

The video is now stored on the disk, so now we can display it to see what the final result looks like.

Step 5: Add Audio (of the Driving Video) to the DeepFake Output Video

In the last step, first copy the audio from the driving video into the generated video and then download the video on the disk.

The video should have started downloading in your system.

Bonus: Generate more examples

Now let’s try to generate more videos with different source images.

And here are a few more results on different sample images:

After Johnny Depp, comes Mark Zuckerberg sponsoring Bleed AI.

And last but not least, of course, comes someone from the Marvel Universe, yes it’s Dr. Strange himself asking you to visit Bleed AI.

You can now share these videos that you have generated on social media. Make sure that you mention that it is a DeepFake video in the post’s caption.


One of the current limitations of the approach we are using is when the person is moving too much in the driving video. The final results will be terrible because we are only getting the face ROI video from the First-Order Motion Model and then embedding the face video into the source image using image processing techniques. We can’t move the body of the person in the source image if the face is moving in the generated face ROI video. So for the driving videos in which the person is moving too much, you can skip the face embedding part or just train a First-Order Motion Model to manipulate the whole body instead of just the face, I might cover that in a future post.

A Message on Deepfakes by Taha

These days, It’s not a difficult job to create a  DeepFake video, as you can see, anyone with access to the colab repo (provided when you download the code) can generate deepfakes in minutes.

Now these fakes are although realistic but you should be easily be able to tell between fake manipulation and real ones, this is because the model is particularly designed for faster interference, there are other approaches where it can take hours or days to render deepfakes but those are very hard to tell from real ones.

The model I used today, is not new but it’s already been out there for a few years (Fun fact: we were actually working on this blogpost since mid of last year so yeah this got delayed  for more than a year) Anyways, the point is, the deepfake technology is fast evolving and leads to two things,

1) Easier accessibility: More and more high-level tools and coming which makes the barrier to entry easier and more non-technical people can use these tools to generate deepfakes, I’m sure you know some mobile apps that let common generate these.

2) Algorithms: algorithms are getting better and better such that, you’re going to find a lot of difficulty in identifying a deepfake vs a real video. Today, professional deepfake creators actually export the output of a deepfake model to a video editor and get rid of bad frames or correct them so people are not able to easily figure out if it’s a  fake and it makes sense if the model generates a 10 sec (30fps) frames then not all 300 outputs are going to be perfect.

Obviously, deepfake tech has many harmful effects, it has been used to generate fake news, spread propaganda, and create pornography but it also has its creative use cases in the entertainment industry (check wombo) and in the content industry, just check out the amazing work is doing and how it had helped people and companies.

One thing you might wonder is that in these times, how should you equip yourself to spot deepfakes?

Well, there are certainly some things you can do to better prepare yourself, for one, you can learn a thing or two about digital forensics and how you can spot the fakes from anomalies, pixel manipulations, metadata, etc.

Even as a non-tech consumer you can do a lot in identifying a fake from a real video by fact-checking and finding the original source of the video. For e.g. if you find your country’s president talking about starting a nuclear war with North Korea on some random person’s Twitter, then it’s probably fake no matter how real the scene looks. An excellent resource to learn about fact-checking is this youtube series called Navigating Digital Information by Crashcourse. Do check it out.


Hire Us

Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies


Join My Course Computer Vision For Building Cutting Edge Applications Course

The only course out there that goes beyond basic AI Applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, hand and body gestures. Don’t miss your chance to level up and take your career to new heights

You’ll Learn about:

  • Creating GUI interfaces for python AI scripts.
  • Creating .exe DL applications
  • Using a Physics library in Python & integrating it with AI
  • Advance Image Processing Skills
  • Advance Gesture Recognition with Mediapipe
  • Task Automation with AI & CV
  • Training an SVM machine Learning Model.
  • Creating & Cleaning an ML dataset from scratch.
  • Training DL models & how to use CNN’s & LSTMS.
  • Creating 10 Advance AI/CV Applications
  • & More

Whether you’re a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you, how to Architect & Build complex, real world and thrilling AI applications

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

A 9000 Feet Overview of Entire AI Field + Semi & Self Supervised Learning | Episode 6

A 9000 Feet Overview of Entire AI Field + Semi & Self Supervised Learning | Episode 6

Watch Video Here

In the previous episode of the Computer Vision For Everyone (CVFE) course, we discussed different branches of machine learning in detail with examples. Now in today’s episode, we’ll further dive in, by learning about some interesting hybrid branches of AI.

We’ll also learn about AI industries, AI applications, applied AI fields, and a lot more, including how everything is connected with each other. Believe me, this is one tutorial that will tie a lot of AI Concepts together that you’ve heard out there, you don’t want to skip it.

By the way, this is the final part of Artificial Intelligence 4 levels of explanation. All the four posts are titled as:

This tutorial is built on top of the previous ones so make sure to go over those parts first if you haven’t already, especially the last one in which I had covered the core branches of machine learning.  If you already know about a high-level overview of supervised, unsupervised, and reinforcement learning then you’re all good.

Alright, so without further ado, let’s get into it.

We have already learned about Core ML branches, Supervised Learning, Unsupervised Learning, and Reinforcement Learning, so now it’s time to explore hybrid branches, which use a mix of techniques from these three core branches. The two most useful hybrid fields are; Semi-Supervised Learning and Self-Supervised Learning. And both of these hybrid fields actually fall in a category of Machine Learning called Weak SupervisionDon’t worry I’ll explain all the terms.

The aim of hybrid fields like Semi-Supervised and Self-Supervised learning is to come up with approaches that bypass the time-consuming manual data labeling process involved in Supervised Learning.

So here’s the thing supervised learning is the most popular category of machine learning and it has the most applications in the industry and In today’s era where an everyday people are uploading images, text, blogposts in huge quantities, we’re at a point where we could train supervised models for almost anything with reasonable accuracy but here’s the issue, even though we have lots and lots of data, it’s actually very costly and time-consuming to label all of it. 

So what we need to do is somehow use methods that are as effective as supervised learning but don’t require us, humans, to label all the data. This is where these hybrid fields come up, and almost all of these are essentially trying to solve the same problem.

There are some other approaches out there as well, like the Multi-Instance Learning and some others that also, but we won’t be going over those in this tutorial as Semi-Supervised and Self-Supervised Learning are more frequently used than the other approaches.

Semi-Supervised Learning

Now let’s first talk about Semi-Supervised Learning. This type of learning approach lies in between Supervised Learning and Unsupervised Learning as in this approach, some of the data is labeled but most of it is still unlabelled.

Unlike supervised or unsupervised learning, semi-supervised learning is not a full-fledged branch of ML rather it’s just an approach, where you use a combination of supervised and unsupervised learning techniques together.

Let’s try to understand this approach with the help of an example; suppose you have a large dataset with 3 classes, cats, dogs, and reptiles. First, you label a portion of this dataset, and train a supervised model on this small labeled dataset.

After training, you can test this model on the labeled dataset and then use the output predictions from this model as labels for the unlabeled examples.

And then after performing prediction on all the unlabeled examples and generating the labels for the whole dataset, you can train the final model on the complete dataset.

Awesome right? With this trick, we’re cutting down the data annotation effort by 10x or more. And we’re still training a good mode.

But there is one thing that I left out, since the initial model was trained on a tiny portion of the original dataset it wouldn’t be that accurate in predicting new samples. So when you’re using the predictions of this model to label the unlabelled portion of the data, an additional step that you can take is to ignore predictions that have low confidence or confidence below a certain threshold.

This way you can perform multiple passes of predicting and training until your model is confident in predicting most of the examples. This additional step will help you avoid lots of mislabeled examples.

Note, what I’ve just explained is just one Semi-Supervised Learning approach and there are other variations of it as well.

It’s called semi-supervised since you’re using both labeled data and unlabeled data and this approach is often used when labeling all of the data is too expensive or time-consuming. For example, If you’re trying to label medical images then it’s really expensive to hire lots of doctors to label thousands of images, so this is where semi-supervised learning would help.

When you search on google for something, google uses a semi-supervised learning approach to determine the relevant web pages to show you based on your query.

Self-Supervised Learning

Alright now let’s talk about the Self-Supervised Learning, a hybrid field that has gotten a lot of recognition in the last few years, as mentioned above, it is also a type of a weak supervision technique and it also lies somewhere in between unsupervised and supervised learning.

Self-supervised learning is inspired by how we humans as babies pick things up and build up complex relations between objects without supervision, for example, a child can understand how far an object is by using the object’s size, or tell if a certain object has left the scene or not and we do all this without any external information or instruction.

Supervised AI algorithms today are nowhere close to this level of generalization and complex relation mapping of objects. But still, maybe we can try to build systems that can first learn patterns in the data like unsupervised learning and then understand relations between different parts of input data and then somehow use that information to label the input data and then train on that labeled data just like supervised learning.

This in summary is Self-Supervised Learning, where the whole intention is to somehow automatically label the training data by finding and exploiting relations or correlations between different parts of the input data, this way we don’t have to rely on human annotations. For example, in this paper, the authors successfully applied Self-Supervised Learning and used the motion segmentation technique to estimate the relative depth of scenes, and no human annotations were needed.

Now let’s try to understand this with the help of an example; Suppose you’re trying to train an object detector to detect zebras. Here are the steps you will follow; First, you will take the unlabeled dataset and create a pretext task so the model can learn relations in the data.

A very basic pretext task could be that you take each image and randomly crop out a segment from the image and then ask the network to fill this gap. The network will try to fill this gap, you will then compare the network’s result with the original cropped segment and determine how wrong the prediction was, and relay the feedback back to the network.

This whole process will repeat over and over again until the network learns to fill the gaps properly, which would mean the network has learned how a zebra looks like. Then in the second step; just like in semi-supervised learning, you will label a very small portion of the dataset with annotations and train the previous zebra model to learn to predict bounding boxes.

Since this model already knows how a zebra looks like, and what body parts it consists of, it can now easily learn to localize it with very few training examples.

This was a very basic example of a self-supervised learning pipeline and the pretext cropping task I mentioned was very basic, in reality, the pretext task for computer vision used in self-supervised learning is more complex.

Also If you know about Transfer Learning then you might wonder why not instead of using a pretext task, we instead use transfer learning. Now that could work but there are a lot of times when the problem we’re trying to solve is a lot different than the tasks that existing models were trained on and so in those cases transfer learning doesn’t work as efficiently with limited labeled data.

I should also mention that although self-supervised learning has been successfully used in language-based tasks, it’s still in the adoption and development stage in Computer vision tasks. This is because, unlike text, it’s really hard to predict uncertainty in images, the output is not discrete and there are countless possibilities meaning there is not just one right answer. To learn more about these challenges, watch Yan Lecun’s ICLR presentation on self-supervised learning.

2 years back, Google published the SimCLR network in which they demonstrated an excellent self-supervised learning framework for image data. I would strongly recommend reading this excellent blog post in order to learn more on this topic. There are some very intuitive findings in this article that I can’t cover here.

Besides Weak Supervision techniques, there are a few other methods like Transfer Learning and Active Learning. All of the techniques aim to partially or completely automate or reduce the data labeling or annotation process.

And this is a very active area of research these days, weak supervision techniques are closing the performance gap between them and supervised techniques. In the coming years, I expect to see wide adoption of Weak supervision and other similar techniques where manual data labeling is either no longer required or just minimally involved.

In Fact here’s what Yan LeCun, one of the pioneers of modern AI says:

“If artificial intelligence is a cake, self-supervised learning is the bulk of the cake,” “The next revolution in AI will not be supervised, nor purely reinforced”

Alright now let’s talk about Applied Fields of AI, AI industries, applications, and also let’s recap and summarize the entire field of AI and along with some very common issues.

So, here’s the thing … You might have read or heard these phrases.

Branches of AI, sub-branches of AI, Fields of AI, Subfields of AI, Domains of AI, or Subdomains of AI, Applications of AI,  Industries of AI, AI paradigms.

Sometimes these phrases are accompanied by words like Applied AI Branches or Major AI Branches etc. And here’s the issue, I’ve seen numerous blog posts and people that used these phrases interchangeably. And I might be slightly guilty of that too. But the thing is, there is no strong consensus on what is major, applied branches, or sub Fields of AI. It’s a huge clutter of terminology out there.

In Fact, I actually googled some of these phrases and clicked to see images. But believe me, it was an abomination, to say the least.

I mean the way people had done categorization of AI Branches was an absolute mess. I mean seriously, the way people had mixed up AI applications with AI industries with AI branches …. it was just chaos… I’m not lying when I say I got a headache watching those graphs.

So here’s what I’m gonna do! I’m going to try to draw an abstract overview of the complete field of AI along with branches, subfields, applications, industries, and other things in this episode.

Complete Overview of AI Field

Now what I’m going to show you is just my personal overview and understanding of the AI field, and it can change as I continue to learn so I don’t expect everyone to agree with this categorization.

One final note, before we start: If you haven’t subscribed then please do so now. I’m planning to release more such tutorials and by subscribing you will get an email every time we release a tutorial.

Alright, now let’s summarize the entire field of Artificial Intelligence. First off, We have Artificial Intelligence, I’m talking about Weak AI  Or ANI (Artificial Narrow Intelligence), since we have made no real progress in AGI or ASI, we won’t be talking about that.

Inside AI, there is a subdomain called Machine Learning, now the area besides Machine learning is called Classical AI, this consists of rule-based Symbolic AI, Fuzzy logic, statistical techniques, and other classical methods. The domain of Machine learning itself consists of a set of algorithms that can learn from the data, these are SVM, Random Forest, KNN, etc.

Inside machine learning is a subfield called Deep Learning, which is mostly concerned with Hierarchical learning algorithms called Deep Neural Networks. Now there are many types of Neural nets, e.g. Convolutional networks, LSTM, etc. And each type consists of many architectures which also have many variations.

Now machine learning (Including Deep learning) has 3 core branches or approaches, Supervised Learning, Unsupervised Learning, and Reinforcement Learning, we also have some hybrid branches which combine supervised and unsupervised methods. All of these can be categorized as Weak Supervision methods.

Now when studying machine learning, you might also come across learning approaches like Transfer Learning, Active Learning, and others. These are not broad fields but just learning techniques used in specific circumstances.

Alright now let’s take a look at some applied fields of AI, now there is no strong consensus but according to me there are 4 Applied Fields of AI; Computer Vision, Natural Language Processing, Speech, and Numerical Analysis. All 4 of these Applied fields use algorithms from either Classical AI, Machine Learning, or Deep Learning.

Let’s further look into these fields, Computer Vision can be split into 2 categories, Image Processing where we manipulate, process, or transform images. And Recognition, where we analyze content in images and make sense out of it. A lot of the time when people are talking about computer vision they are only referring to the recognition part.

Natural Language Processing can be broadly split into 2 parts; Natural Language Understanding; where you try to make sense of the textual data, interpret it, and understand its true meaning. And Natural Language Generation; where you try to generate meaningful text.

Btw the task of  Language translation like in Google Translate uses both NLU & NLG

Speech can also be divided into 2 categories, Speech Recognition or Speech to text (STT); where you try to build systems that can understand speech and correctly predict the right text for it, and Speech Generation or text-to-speech (TTS); where you try to build systems able to generate realistic human-like speech.

And Finally Numerical Analytics; where you analyze numerical data to either gain meaningful insights or do predictive modeling, meaning you train models to learn from data and make useful predictions based on it.

Now I’m calling this numerical analytics but you can also call this Data Analytics or Data Science. I avoided the word “data” because Image, Text, and Speech are also data types.

And if you think about it, even data types like images, and text are converted to numbers at the end but, right now I’m defining numerical analytics as the field that analyzes numerical data other than these three data types.

Now since I work in Computer Vision, let me expand the computer vision field a bit.

So both of these categories (Image Processing and Recognition) can be further split into two types; Classical vision techniques and Modern vision techniques.

The only difference between the two types is that modern vision techniques use only Deep Learning based methods whereas Classical vision does not. So for example, Classical Image Processing can be things like image resizing, converting an image to grayscale, Canny edge detection, etc.

And Modern Image Processing can be things like Image Colorization via deep learning etc.

Classical Recognition can be things like: Face Detection with Haar cascades, and Histogram based Object detection.

And Modern Recognition can be things like Image Classification, Object Detection using neural networks, etc.

So these were Applied Fields of AI, Alright now let’s take a look at some Applied SubFields of AI. I’m defining Applied subfields as those fields that are built around certain specialized topics of any of the 4 applied fields I’ve mentioned.

For example, Extended Reality is an applied subfield of AI built around a particular set of computer vision algorithms. It consists of Virtual Reality;

Augmented Reality;

and Mixed Reality;

You can even consider Extended Reality as a subdomain of Computer Vision. It’s worth mentioning that most of the computer vision techniques used in Extended reality itself fall in another domain of Computer Vision called Geometric Computer Vision, these algorithms deal with geometric relations between the 3D world and its projection into a 2D image.

There are many applied AI Subfields, another example of this would be Expert Systems which is an AI system that emulates the decision-making ability of a human expert.

So consider a Medical Diagnostic app that can take pictures of your skin and then a computer vision algorithm evaluates the picture to determine if you have any skin diseases.

Now, this system is performing a task that a dermatologist (skin expert) does, so it’s an example of an Expert system.

Rule-based Expert Systems became really popular in the 1980s and were considered a major feat in AI. These systems had two parts, a knowledge base, (A database containing all the facts provided by a human expert) and an inference engine that used the knowledge base and the observations from the user to give out results.

Although these types of expert systems are still used today, they have serious limitations. Now the example of the Expert system I just gave is from the Healthcare Industry and Expert systems can be found in other industries too.

Speaking of industries, let’s talk about AI applications used in industries. So these days AI is used in almost any industry you can think of, some popular categories are Automotive, Finance, Healthcare, Robotics, and others.

Within each Industry, you will find AI applications like self-driving cars, fraud detection, etc. All these applications are using methods & techniques from one of the 4 Applied AI Fields.

There are many applications that fail in multiple industries, for example, a humanoid robot built for amusement will fall in robotics and the entertainment industry. While the Self Driving car technologies fall into the transportation and automotive industry.

Also, an industry may split into subcategories. For example, Digital Media can be split into social media, streaming media, and other niche industries. By the way, most media sites use Recommendation Systems, which is yet another applied AI subdomain.

Join My Course Computer Vision For Building Cutting Edge Applications Course

The only course out there that goes beyond basic AI Applications and teaches you how to create next-level apps that utilize physics, deep learning, classical image processing, hand and body gestures. Don’t miss your chance to level up and take your career to new heights

You’ll Learn about:

  • Creating GUI interfaces for python AI scripts.
  • Creating .exe DL applications
  • Using a Physics library in Python & integrating it with AI
  • Advance Image Processing Skills
  • Advance Gesture Recognition with Mediapipe
  • Task Automation with AI & CV
  • Training an SVM machine Learning Model.
  • Creating & Cleaning an ML dataset from scratch.
  • Training DL models & how to use CNN’s & LSTMS.
  • Creating 10 Advance AI/CV Applications
  • & More

Whether you’re a seasoned AI professional or someone just looking to start out in AI, this is the course that will teach you, how to Architect & Build complex, real world and thrilling AI ap


Alright, so this was a high-level overview of the complete field of AI. Not everyone would agree with this categorization, but this categorization is necessary when you’re deciding which area of AI to focus on and how all the fields are connected to each other, and personally, I think this is one of the simplest and most intuitive abstract overviews of the AI field that you’ll find out there. Obviously, It was not meant to cover everything, but a high-level overview of the field.

This Concludes the 4th and final part of our Artificial Intelligence – 4 levels Explanation series. If you enjoyed this episode of computer vision for everyone then do subscribe to the Bleed AI YouTube channel and share it with your colleagues. Thank you.


Hire Us

Let our team of expert engineers and managers build your next big project using Bleeding Edge AI Tools & Technologies


Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI