Exploring Diffusion Networks

Let's delve into the underlying concepts of diffusion networks and provide a practical explanation on how such a model can be trained and used to generate novel images.

10/06/2024

The emergence of DALL-E [3], Stable Diffusion [4] and other similar models in recent years has taken the world by storm with their unparalleled image synthesis capabilities, surpassing the quality produced by traditional methods such as Generative Adversarial Networks (GANs), Normalizing Flows, and Variational Autoencoders (VAEs). DALL-E’s innovative approach to image synthesis has placed it at the forefront of the field, captivating the mainstream consciousness and setting a new standard for what is possible in this area of artificial intelligence.

In this article, we will delve into the underlying concepts of diffusion networks and provide a practical explanation on how such a model can be trained and used to generate novel images. By the end of this article, you will have a clear understanding of the intuition behind diffusion networks and a step-by-step procedure for using them for image generation.

Introducing Diffusion Models

Diffusion models are a type of probabilistic generative model [6] that learn to reverse a gradual noising process in order to generate new, high-quality data that resembles the data they were trained on. This approach is useful for tasks such as denoising and data generation, as the model can be trained to preserve the underlying structure of the data while removing or reducing unwanted noise.

Diffusion models can be understood as a type of latent variable model, where the term “latent” refers to the presence of a hidden continuous feature space. The mapping from the original image to this latent space is achieved through a Markov Chain, which is essentially a series of T timesteps that add random noise to the image at each step. This process follows the Markov property, meaning that the next time step is only dependent on the previous one. These diffusion models are a way to incorporate uncertainty into the mapping from the original image to the latent space, leading to more robust and flexible models.

Diffusion models are made up of two main processes:

The forward process: It starts with the original image x₀ and, at every step, some noise is added to it until the image turns into pure noise.
The parametrized backward process: Here the model is tasked with predicting the noise that was added to the image between steps. It is called parametrized because we use a neural network for this part. If we succeed in training a network to predict the noise, we can use it to generate new images from pure noise by predicting and removing this noise, iterating step by step until x₀ is reached.

It is important to note that, unlike in VAEs, the input image and the latent variable have the same dimensions, which makes the U-Net architecture a fitting choice for the noise predictor.

Forward diffusion process

The main objective of the forward process is to add noise to the image. Following the formulation in [2], this can be written mathematically as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

where x₀ is the initial input and, because this is a Markov chain, each subsequent sample depends only on the one that came before it.

The Gaussian noise added between each step can be written as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

βₜ is called the variance schedule and it controls the amount of noise we add to the image at each timestep. As t grows so does βₜ, meaning the 1 − βₜ term decreases and the mean of the sample is scaled down a little more at every step until it approaches 0. I is the identity matrix, so the same variance βₜ is applied independently to every pixel. This variance is fixed by the schedule. Although there are variations of diffusion networks where the model learns the variance of the noise as well, for the sake of simplicity we stick to a version where it is fixed and only the mean is learnt.

Choosing a suitable βₜ is called noise scheduling, and it is very important as we want to set it in a way that the image at time T just about reaches a 0 mean, 1 variance Gaussian. If this happens too early or too much information remains at step T, training can fail, and our model will not converge properly.
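As a rough illustration of this check, the sketch below builds a linear schedule and tracks the running product of (1 − βₜ), which is the fraction of the original signal's variance that survives up to step t. The function names are hypothetical, and the endpoints 1e-4 and 0.02 with T = 1000 are the values reported in [2], used here only as an assumed example. Only the Python standard library is used so the snippet runs anywhere; a real implementation would normally use tensors.

```python
def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed endpoints; requires T >= 2)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * t for t in range(T)]

def cumulative_alpha(betas):
    """Running product of (1 - beta_s): the signal fraction left after each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)

# A healthy schedule starts near 1 (almost no noise) and ends near 0 (almost pure noise).
print(f"signal fraction after step 1: {alpha_bars[0]:.4f}")
print(f"signal fraction after step T: {alpha_bars[-1]:.2e}")
```

If the final value is not close to 0, too much of the image remains at step T; if it collapses to 0 long before T, the later steps contribute nothing, which matches the failure modes described above.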

There are many different strategies for noise scheduling [1] such as:

changing the noise schedule functions: linear, cosine, cosine (logSNR), sigmoid, sigmoid (logSNR), quadratic.

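adjusting the input scaling factor b in the following equation for the sample, written in the notation of [1], where γ(t) is the noise schedule function and ϵ is standard Gaussian noise:

$$x_t = \sqrt{\gamma(t)}\,b\,x_0 + \sqrt{1-\gamma(t)}\,\epsilon$$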

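Lastly, even though we talk about iteratively noising images in the forward process, it is important to point out that because the sum of independent Gaussians is still a Gaussian, we can actually sample the noised image at timestep t immediately, without having to go through all the previous steps:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right)$$

Thanks to the reparameterization trick, in another form this can be written as:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

In this case, we need to pre-calculate ᾱₜ, the so-called cumulative variance schedule, based on the following equations:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$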
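A minimal sketch of this one-shot sampling is shown below. It assumes a linear schedule with the endpoints from [2]; `make_alpha_bars` and `q_sample` are hypothetical helper names, and the "image" is a flat list of pixel values so that only the standard library is needed.

```python
import math
import random

def make_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule (T >= 2)."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0, returning the noisy sample and the noise used."""
    a_bar = alpha_bars[t - 1]                      # t is 1-indexed, the list is 0-indexed
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return xt, eps

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [1.0] * 4096                                  # toy image: every pixel equals 1

x_early, _ = q_sample(x0, 1, alpha_bars, rng)      # barely noised
x_late, _ = q_sample(x0, 1000, alpha_bars, rng)    # close to pure noise

print("mean after 1 step:    ", round(mean(x_early), 3))
print("mean after 1000 steps:", round(mean(x_late), 3))
```

No loop over the intermediate timesteps is needed: each call costs the same regardless of t, which is what makes training on randomly chosen timesteps cheap.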

Backward denoising process

The backward step of the process is used in two different ways: to train the model and to generate new images from noise (this is referred to as sampling).

During training, we don’t iterate over every timestep. Instead, we choose a random t, sample some noise ϵ, and use it in the forward process that we already discussed to get the noisy sample xₜ. We then feed this noisy image, together with t, into our neural network, which predicts the noise that was used to generate it. The loss is the L2 distance between the predicted and the real noise.
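The training step described above can be sketched as follows. This is only an illustration under stated assumptions: `training_loss` is a hypothetical name, `predict_noise` stands in for the U-Net, the schedule is the linear one from [2], and the loss is averaged over pixels (mean squared error). Two stand-in predictors show the two extremes: one that always outputs zeros, and an "oracle" that cheats by knowing x₀.

```python
import math
import random

def training_loss(x0, alpha_bars, predict_noise, rng):
    """One Monte Carlo estimate of the noise-prediction loss for a single image."""
    t = rng.randint(1, len(alpha_bars))            # random timestep in 1..T (inclusive)
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]        # the real noise
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    eps_pred = predict_noise(xt, t)                # stand-in for the U-Net
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

# Assumed linear schedule with T = 1000.
T = 1000
alpha_bars, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bars.append(running)

x0 = [0.5] * 1024                                  # toy flattened image

def zero_predictor(xt, t):
    """An 'untrained' model that always predicts no noise."""
    return [0.0] * len(xt)

def oracle_predictor(xt, t):
    """Cheats by using x0 to recover the exact noise; a real model never sees x0."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar) for x, c in zip(xt, x0)]

loss_untrained = training_loss(x0, alpha_bars, zero_predictor, random.Random(0))
loss_oracle = training_loss(x0, alpha_bars, oracle_predictor, random.Random(0))
print("zero predictor loss:  ", round(loss_untrained, 3))
print("oracle predictor loss:", loss_oracle)
```

In a real training loop, the loss would be computed on batches of images and backpropagated through the network; the sketch only shows how the target and the prediction are paired.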

When we want to create a completely new image from noise, we must start from x_T (pure noise) and iterate over every timestep in reverse. At each step we pass the current noisy image xₜ into the trained network, which predicts the noise ϵ. From this prediction we compute the portion of noise attributed to the current timestep and subtract it from xₜ to get xₜ₋₁, repeating until we reach x₀.

Architecture

As we already stated, a U-Net [5] is well suited to predicting the noise from an input image, as it has the same input and output dimensions. It has been an important component not just in image segmentation tasks but also in image synthesis with GANs and their conditional variants.

The main feature of a U-Net is its hierarchical nature, in which the input images go through a set of down-sampling layers where, at every step, the image loses spatial resolution but gains feature channels. At the bottom, the data reaches a bottleneck, which is followed by a series of up-sampling layers that increase the image size but reduce the depth of the data. Furthermore, skip connections link each down-sampling module to the up-sampling module of the same size. In addition to these basic features, more recent networks also use attention layers, as these can help the network focus on the more interesting parts of the image.

The Math

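The reverse process can be described mathematically with the following equations, written in the notation of [2]:

$$p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$$

We start from pure noise with variance 1 and mean 0. The following formula describes how the model learns the probability density of the previous timestep given the current one.

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

The mean of this density pθ is determined by the Gaussian noise that the model predicts in the image: to get the previous step xₜ₋₁, we need to remove this noise from xₜ.

The complete formula for calculating xₜ₋₁:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

Here αₜ = 1 − βₜ, ᾱₜ is the product of all αₛ up to step t, and ϵθ is the output of the U-Net, which tries to predict the ϵ that we used in the forward process to generate xₜ. The σₜz term is additional Gaussian noise that we generate and inject into the update to avoid collapsing into local minima; at the final step (t = 1) it is left out. This idea is related to stochastic gradient Langevin dynamics, which builds on Langevin dynamics, a concept from physics for the statistical modelling of molecular systems.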
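The sampling loop can be sketched as below, assuming the update rule from [2] with σₜ² = βₜ (one of the choices reported there) and no added noise at the final step. `sample` and `cumulative_alpha` are hypothetical names, and `predict_noise` stands in for the trained U-Net. To make the behaviour checkable without a trained network, the demo uses an oracle predictor for a "dataset" that contains a single constant image.

```python
import math
import random

def cumulative_alpha(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s = 1..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def sample(predict_noise, dim, betas, rng):
    """Run the reverse chain from x_T ~ N(0, I) down to x_0."""
    alpha_bars = cumulative_alpha(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]      # x_T: pure noise
    for t in range(len(betas), 0, -1):                 # t = T, ..., 1
        beta, a_bar = betas[t - 1], alpha_bars[t - 1]
        eps_pred = predict_noise(x, t)
        sigma = math.sqrt(beta) if t > 1 else 0.0      # no extra noise on the last step
        x = [
            (xi - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
            + sigma * rng.gauss(0.0, 1.0)
            for xi, e in zip(x, eps_pred)
        ]
    return x

# Assumed short linear schedule, only to keep the demo fast.
T = 50
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = cumulative_alpha(betas)
target = 0.25                                          # every pixel of the only "training image"

def oracle_predictor(xt, t):
    """Exact noise for a one-image dataset; a trained U-Net would approximate this."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * target) / math.sqrt(1.0 - a_bar) for x in xt]

generated = sample(oracle_predictor, dim=8, betas=betas, rng=random.Random(0))
print([round(v, 6) for v in generated])
```

With this oracle the chain ends, up to floating-point error, at the single training image, which is a convenient sanity check for the update rule before plugging in a real network.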

Timestep Encoding

Timestep Encoding is also an important topic to touch on. This is because we use the same U-Net across all timesteps, meaning weights are shared, so we need to somehow tell our model at which timestep it’s currently working. To do this, we typically use something called positional embedding, which is used to encode discrete positional information along with our data. The purpose of positional embedding is to assign a unique vector for every index.

Encoding for the k-th object in a sequence of length L can be done using the following equations:

$$P(k, 2i) = \sin\left(\frac{k}{n^{2i/d}}\right)$$

$$P(k, 2i+1) = \cos\left(\frac{k}{n^{2i/d}}\right)$$

The explanation of the different variables is as follows:

k: position of an object in the input sequence, 0 ≤ k < L
d: dimension of the output embedding space
P(k, j): position function that maps position k to column j of the encoding matrix
n: user-defined scalar, usually 10000
i: index used to address the columns of the encoding matrix, 0 ≤ i < d/2

In this case, even columns of the encoding matrix are encoded with the sine wave and odd columns with the cosine wave.

The advantages of encoding positions like this are as follows:

Both sine and cosine take values in [-1, 1], meaning the entries of the encoding matrix are always bounded.
As the sinusoid for each position is different, each k is encoded to a unique vector.
In the case of natural language processing tasks, this enables the measurement of similarity between each position for the relative position encoding of words.

If we plot the embedding matrix with the y axis as k and the x axis as i, each row of the plot corresponds to one embedding vector: the embedding for position k=0 is simply the first row, k=1 the second, and so on.
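A small sketch of this encoding, assuming the standard sinusoidal formulation with sine on even columns and cosine on odd columns, could look like this. The function name `positional_embedding` is hypothetical, and d is assumed to be even.

```python
import math

def positional_embedding(k, d, n=10000.0):
    """Sinusoidal embedding of position (or timestep) k as a vector of length d."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    vec = [0.0] * d
    for i in range(d // 2):
        angle = k / n ** (2 * i / d)
        vec[2 * i] = math.sin(angle)        # even column: sine
        vec[2 * i + 1] = math.cos(angle)    # odd column: cosine
    return vec

# Each row is the embedding of one position, matching the rows of the plot described above.
for k in range(4):
    print(k, [round(v, 3) for v in positional_embedding(k, 8)])
```

For diffusion models, k is simply the timestep t, so every timestep gets its own vector that can be handed to the network.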

The way these embeddings are used in the U-Net is that, for a given timestep, we generate the positional embedding vector based on the formulas presented above and send it through a dense layer to achieve the same feature depth as the feature maps at the current level. Finally, these vectors are broadcast to each x and y position and simply added to the feature maps during both down-sampling and up-sampling.
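The projection and broadcast step might be sketched as follows. This is an assumption-laden toy: the feature map is a nested list of shape [channels][height][width], the dense layer uses random weights instead of learned ones, and `dense` and `add_time_embedding` are hypothetical names. Real implementations do the same thing with tensor broadcasting.

```python
import random

def dense(vec, weights, bias):
    """One fully connected layer; weights has shape [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def add_time_embedding(feature_map, emb, weights, bias):
    """Project emb to one value per channel and add it at every (y, x) position."""
    per_channel = dense(emb, weights, bias)
    return [
        [[pixel + shift for pixel in row] for row in channel]
        for channel, shift in zip(feature_map, per_channel)
    ]

rng = random.Random(0)
channels, height, width, emb_dim = 3, 2, 2, 4
feature_map = [
    [[rng.random() for _ in range(width)] for _ in range(height)]
    for _ in range(channels)
]
emb = [0.0, 1.0, 0.0, 1.0]                  # e.g. a sinusoidal embedding of timestep 0
weights = [[rng.gauss(0.0, 1.0) for _ in range(emb_dim)] for _ in range(channels)]
bias = [0.0] * channels

out = add_time_embedding(feature_map, emb, weights, bias)
print(len(out), len(out[0]), len(out[0][0]))  # the shape is unchanged
```

Because the shift depends only on the channel and the timestep, every spatial position in a channel receives the same offset, which is how the network is told which timestep it is working on.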

Loss metric

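The researchers in [2] used the variational lower bound to optimize their model, akin to VAEs. The first loss term they derived for the model compared the true posterior mean with the predicted mean of the sample, where C is a constant that does not depend on the model parameters:

$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\rVert^2\right] + C$$

Afterwards, they reparametrized the Gaussian term to predict ϵ from the input xₜ at timestep t, because xₜ is available as input during training time.

We know that, with αₜ = 1 − βₜ and ᾱₜ the cumulative product of the αₜ:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right), \qquad \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

If we substitute these into the previous formula:

$$L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\rVert^2\right]$$

Lastly, [2] found that training worked better with a simplified objective in which the weighting term is removed, which leaves the following final formula:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\rVert^2\right]$$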

This means that in order to train the U-Net, the loss we have to calculate is simply the squared error between the predicted noise ϵθ, which is the direct output of the model and the real noise ϵ that we sampled when generating xₜ.

While this result might seem quite straightforward, the mathematical derivations behind it are far from simple. In fact, they can be quite complex so for those who are interested in delving deeper into the math side of things, the website provides a detailed explanation of the process.

Our Opinion

Here at TechnoLynx we encounter a wide variety of image processing problems that can be solved with anything ranging from traditional algorithms to complex neural networks. Diffusion networks will complete our machine learning toolset that we can leverage to generate images at a quality that is unparalleled by any other network we’ve encountered. While it might not be as fast in terms of inference as a GAN, the diffusion training process is stable and not prone to the same pitfalls as adversarial models. This means quicker, safer results and a more robust development process for our team. We at TechnoLynx are certainly looking forward to seeing what the future holds for this technology.

This has been an interesting high-level look at the model, but we are not done yet. Our next plan is to investigate how such a model can be converted to Core ML using coremltools for Apple devices. If you liked this article and would like to learn more, please follow us on Medium here and be sure to check out our website.

References

[1] Chen, T. (2023). On the Importance of Noise Scheduling for Diffusion Models
[2] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models
[3] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., . . . Sutskever, I. (2021). Zero-Shot Text-to-Image Generation
[4] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models
[5] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation
[6] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., . . . Yang, M.-H. (2022). Diffusion Models: A Comprehensive Survey of Methods and Applications
