
Why Do We Need “Distributed Training” in Deep Learning? 🤔

Ruhaan838
5 min read · Dec 13, 2024

💪 Motivation

In the age of the internet we have access to far more data than ever before, and training AI models on that data calls for larger, more complex models. If your dataset and model are big enough, you need very powerful hardware to train them. To solve this problem, you can either build a single more powerful device or use “distributed training” across many devices.

👋 Introduction

So, what is Distributed Training? How does it work?

What is Distributed training?

Imagine you have a huge amount of data, for example the entire internet, and you have to train your LLM on it. Training this model on a single GPU is very difficult. Some of the challenges you may face:

  1. The model may not fit on a single GPU. (Large model)
  2. Because the dataset is large you want a bigger batch size, but the GPU has limited memory, so you can’t. (Large dataset)
  3. The model takes too much time to train because of the large dataset, probably months or years. (Small device)

How does it work?

You can scale your training process in two ways:

  1. Vertical scaling
  2. Horizontal scaling

1. Vertical scaling

Vertical scaling is simple: if you have a GPU with 12 GB of memory, you upgrade to one with 32 GB. Quite simply, there is no need to change our original code.

Vertical scaling — (Image by Author)

2. Horizontal scaling

Horizontal scaling comes in two types, and it requires some code changes:

  1. Data parallelism
  2. Model parallelism
Horizontal scaling — (Image by Author)

So, rather than changing the configuration of a single device, we build an ecosystem of several similar devices and train the model across them.

📊 Data Parallelism vs 🖥️ Model Parallelism

1. Data Parallelism

Data parallelism applies when your model fits into a single GPU: we replicate the model and distribute the data across multiple GPUs, with each GPU processing its own subset of the dataset and performing the forward and backward passes.

Data Parallelism — (Image by Author)

As we can see in the image, we have 4 GPUs with the same configuration, and the model fits on one GPU, so we divide the dataset into sub-datasets, one per GPU. During the backward pass, the gradients are synchronized across the GPUs.
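As a rough idea of how the dataset gets split, here is a minimal PyTorch sketch, assuming a process group has already been initialized with torch.distributed.init_process_group (the dataset shape and batch size are made-up placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset: 10,000 random samples with 32 features and 10 classes.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))

# DistributedSampler gives every GPU (every "rank") a disjoint shard of the
# indices, so the 4 GPUs in the figure each see their own sub-dataset.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```

Each GPU then iterates over its own loader, which is exactly the “divide the dataset into sub-datasets” step in the figure.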

⚙️ Performing Distributed Data Parallel (DDP)

Step 1: Initialize the model parameters on the main GPU

We randomly initialize the model parameters on the main GPU, where our code runs. Then we send those initial parameter values to the other GPUs, so every replica starts from the same weights.

Step 1: Initialize the parameters and send them to the other GPUs
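A minimal sketch of this step with torch.distributed, assuming the process group is already initialized and the main GPU is rank 0 (the helper name is mine, not from the article):

```python
import torch
import torch.distributed as dist

def broadcast_initial_parameters(model: torch.nn.Module, src_rank: int = 0):
    # Copy the randomly initialized weights from the main GPU (rank 0) to every
    # other GPU, so all replicas start from identical parameters.
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)
```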

Step 2: Calculate the local gradients on each GPU

Each GPU runs the forward and backward pass on its own sub-batch and accumulates its local gradient.

Step 2: Local gradient calculation
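In code, this step is just an ordinary forward and backward pass on the local sub-batch; a hedged sketch, where `model`, `loss_fn`, and the batch are assumed to already live on this GPU:

```python
def compute_local_gradients(model, loss_fn, inputs, targets):
    outputs = model(inputs)           # forward pass on this GPU's sub-batch
    loss = loss_fn(outputs, targets)
    loss.backward()                   # each param.grad now holds the LOCAL gradient
    return loss.item()
```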

Step 3: Accumulate the local gradients on the main GPU

Now we sum the local gradients on the main GPU. Next, we distribute this summed gradient back to all GPUs.

Step 3: Sum all the gradients
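A minimal sketch of this step. In practice the “sum on the main GPU, then redistribute” pattern is usually done in a single collective call, all_reduce; dividing by the world size averages the gradients so the effective learning rate matches single-GPU training:

```python
import torch.distributed as dist

def synchronize_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient across all GPUs and share the result with everyone.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size   # average instead of a raw sum
```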

Step 4: Update the parameters and reset the gradients

Last but not least, each GPU updates the parameters using the synchronized gradients and then sets the gradients back to zero for the next iteration.

Step 4: Update the parameters and set the gradients to zero
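Since every GPU now holds the same synchronized gradients, the final step is an ordinary optimizer update followed by clearing the gradients (a sketch, using any standard PyTorch optimizer):

```python
def update_and_reset(optimizer):
    optimizer.step()                        # identical update on every replica
    optimizer.zero_grad(set_to_none=True)   # reset gradients for the next batch
```

In practice, torch.nn.parallel.DistributedDataParallel wraps your model and takes care of these four steps for you, overlapping the gradient synchronization with the backward pass.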

There are two main communication patterns for sharing gradients between devices: point-to-point communication and collective communication primitives such as broadcast. These primitives are what allow the different devices to talk to each other.

  • Point-to-point: one device sends its gradients directly to a single other device.
  • Collective communication (e.g. broadcast): all devices take part in one operation, for example the main GPU broadcasting the summed gradients to every other GPU at once.
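Here is a rough sketch of the two styles with torch.distributed, assuming an initialized process group; `grad` is a gradient tensor on the current GPU and the function names are mine:

```python
import torch
import torch.distributed as dist

def share_point_to_point(grad: torch.Tensor, rank: int):
    # Point-to-point: an explicit send/recv between one pair of ranks.
    if rank == 1:
        dist.send(grad, dst=0)              # worker sends its gradient to the main GPU
    elif rank == 0:
        incoming = torch.empty_like(grad)
        dist.recv(incoming, src=1)          # main GPU receives and accumulates it
        grad += incoming

def share_collective(grad: torch.Tensor):
    # Collective: every rank calls the same operation; here the main GPU
    # (rank 0) broadcasts the final gradient to all other GPUs at once.
    dist.broadcast(grad, src=0)
```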

2. Model Parallelism

Model parallelism is quite different from data parallelism. In data parallelism, the model fits on one GPU. But what if we can’t load the entire model onto a single GPU?
This is where model parallelism helps: it allows us to distribute the model itself across multiple GPUs.

Model Parallelism — (Image by Author)

So, model parallelism is useful when your model is too large for a single GPU. In that case, we split the model across multiple GPUs: GPUs 1, 2, 3, and 4 hold layers 1, 2, 3, and 4 of the model. After distributing the layers, the forward and backward passes run layer by layer, with activations and gradients passed between GPUs in a synchronized manner.
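Here is a minimal sketch of that idea with two GPUs instead of four (the layer sizes and device names are placeholders, not from the article):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0 ...
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ... and the second half lives on GPU 1.
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))   # run the first layers on GPU 0
        x = self.part2(x.to("cuda:1"))   # move the activations over, finish on GPU 1
        return x
```

The activation tensors are copied from one GPU to the next at the split point, which is the synchronization the forward and backward passes need.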

📈 Advantages and 📉Disadvantages

Here are some advantages and disadvantages of distributed training; data parallelism and model parallelism each come with their own trade-offs as well.

Advantages of Distributed Training

  • If your training task is too expensive for a single GPU, distributed training solves that problem.
  • It allows you to train large models on large datasets.
  • It helps you achieve better results, because you can afford to experiment with and tune the hyperparameters.

Disadvantages of Distributed Training

  • If one machine fails and we don’t handle that failure, we can lose the whole training run.
  • Data parallelism and model parallelism can be complicated to code by hand (much less so now, thanks to PyTorch).
  • Sharing the gradients between devices takes time, and this communication overhead slows down training.

🙇🏻‍♂️Conclusion

At small scales, “distributed training” may not be useful, but to train ever larger models it becomes necessary: we need a mechanism that lets us train a model within the limited memory of each device. On top of that, we can apply “quantization”, which helps a lot with optimizing the model.

🙏🏻 Special thanks to Umar Jamil, without whose video this blog would never have been possible.
