[Source Code Analysis] PyTorch Distributed (12): DistributedDataParallel Forward Pass

Contents
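- 0x00 Abstract
- 0x01 Overall logic
- 0x02 The Python world
- 0x03 The C++ world
  - 3.1 Preparing for the forward pass
  - 3.2 Rebuilding buckets
    - 3.2.1 Computing bucket sizes
    - 3.2.2 Synchronizing bucket indices
    - 3.2.3 Initializing buckets
  - 3.3 Preparing for the backward pass
    - 3.3.1 Reset
    - 3.3.2 Finding unused parameters
- 0xFF References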

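0x00 Abstract
The previous articles introduced how the Reducer is built and several of its key scenarios. This article analyzes how the Reducer supports the forward pass.
Other articles in this series:
Automatic Differentiation, a Powerful Tool for Deep Learning (1)
Automatic Differentiation, a Powerful Tool for Deep Learning (2)
[Source Code Analysis] Automatic Differentiation, a Powerful Tool for Deep Learning (3): Walking Through Examples
[Source Code Analysis] How PyTorch Implements Forward Propagation (1): Basic Classes (Part 1)
[Source Code Analysis] How PyTorch Implements Forward Propagation (2): Basic Classes (Part 2)
[Source Code Analysis] How PyTorch Implements Forward Propagation (3): Concrete Implementation
[Source Code Analysis] How PyTorch Implements Backward Propagation (1): Invoking the Engine
[Source Code Analysis] How PyTorch Implements Backward Propagation (2): Static Structure of the Engine
[Source Code Analysis] How PyTorch Implements Backward Propagation (3): Dynamic Logic of the Engine
[Source Code Analysis] How PyTorch Implements Backward Propagation (4): The Concrete Algorithm
[Source Code Analysis] PyTorch Distributed (1): History and Overview
[Source Code Analysis] PyTorch Distributed (2): DataParallel (Part 1)
[Source Code Analysis] PyTorch Distributed (3): DataParallel (Part 2)
[Source Code Analysis] PyTorch Distributed (4): Basic Concepts of Distributed Applications
[Source Code Analysis] PyTorch Distributed (5): DistributedDataParallel Overview and How to Use It
[Source Code Analysis] PyTorch Distributed (6): DistributedDataParallel, Initialization and Store
[Source Code Analysis] PyTorch Distributed (7): DistributedDataParallel, Process Groups
[Source Code Analysis] PyTorch Distributed (8): DistributedDataParallel, the Paper
[Source Code Analysis] PyTorch Distributed (9): DistributedDataParallel, Initialization
[Source Code Analysis] PyTorch Distributed (10): DistributedDataParallel, Static Architecture of the Reducer
[Source Code Analysis] PyTorch Distributed (11): DistributedDataParallel, Building the Reducer and the Join Operation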
0x01 Overall logic
Once again we turn to the paper for the overall DDP logic:

[Figure: overall DDP logic, from the DDP paper]

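The overall strategy for the forward pass is then as follows:
Forward Pass:

Each process reads its own training data; DistributedSampler ensures that each process reads different data.
DDP takes the input and passes it to the local model.
The model runs the forward computation and the result is stored in out. At this point all computation happens inside each process (CUDA device).
If find_unused_parameters is set to True, DDP analyzes the output of the local model: it traverses the computation graph starting from out and marks unused parameters as ready. Because the computation graph may change between iterations, the traversal is repeated in every iteration.
- This mode allows running backward on a subgraph of the model. DDP finds out which parameters take part in the backward pass by traversing the autograd graph from the model output out and marking all unused parameters as ready for reduction.
- During the backward pass the Reducer still reduces all buckets, and while doing so it waits only for parameters that are not yet ready. Marking a parameter gradient as ready does not help DDP skip buckets, but it prevents DDP from waiting forever for gradients that will never arrive.
- Note that traversing the autograd graph introduces extra overhead, so applications should set find_unused_parameters to True only when necessary.
Return out. Unlike DP, DDP does not need to gather the model outputs to the rank 0 process.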
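As a rough illustration of the first step, the sketch below shows one way a sampler can hand disjoint indices to each rank. It is a simplified stand-in, not the DistributedSampler implementation: shard_indices is a hypothetical helper, shuffling is left out, and it assumes the dataset has at least as many samples as there are ranks.

```python
def shard_indices(dataset_size, world_size, rank):
    """Return the sample indices that one rank would read (no shuffling)."""
    indices = list(range(dataset_size))
    # Pad by wrapping around so that every rank gets the same number of samples.
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total = per_rank * world_size
    indices += indices[: total - dataset_size]
    # Strided slice: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for rank in range(3):
        print(rank, shard_indices(10, 3, rank))
```

With 10 samples and 3 ranks, every rank receives 4 indices, and only the two padded samples appear on more than one rank.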

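0x02 The Python world
We again start from the Python code, which lives in torch/nn/parallel/distributed.py.
Leaving the join-related parts aside and focusing on the main body, the forward method works as follows:

Save the thread-local state.
If gradients are enabled and backward gradient synchronization is required, call reducer.prepare_for_forward to prepare for the forward pass.
If ddp_join_enabled is configured, do the corresponding handling.
Before the forward pass, call _rebuild_buckets to rebuild the buckets.
- Inside _rebuild_buckets, new buckets may be allocated before the old buckets are freed.
- To save peak memory, call _rebuild_buckets before peak memory usage rises during the forward computation.
If forward synchronization is required (require_forward_param_sync), call _sync_params to synchronize before the forward computation.
Run the forward pass.
If backward gradients need to be synchronized, call prepare_for_backward.
- When the DDP argument find_unused_parameters is True, a traversal is started at the end of forward to find every parameter that was not used and set it to ready ahead of time. This lets backward run on a subgraph, at the cost of some extra time.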
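The ordering of these steps can be summarized with a toy sketch. StubReducer and toy_forward are hypothetical names introduced only for illustration: the stub records which reducer methods are called and in what order, while join handling and _sync_params are left out.

```python
class StubReducer:
    """Hypothetical stand-in for the C++ Reducer: it only records calls."""

    def __init__(self):
        self.calls = []

    def prepare_for_forward(self):
        self.calls.append("prepare_for_forward")

    def rebuild_buckets(self):
        self.calls.append("rebuild_buckets")
        return False  # pretend nothing needed rebuilding

    def prepare_for_backward(self, outputs):
        self.calls.append("prepare_for_backward(%d)" % len(outputs))


def toy_forward(reducer, module, x, grad_enabled=True,
                require_backward_grad_sync=True, find_unused_parameters=False):
    """Mirror the order of the main steps in forward (join handling omitted)."""
    needs_sync = grad_enabled and require_backward_grad_sync
    if needs_sync:
        reducer.prepare_for_forward()
    if grad_enabled:
        reducer.rebuild_buckets()
    out = module(x)
    reducer.calls.append("module_forward")
    if needs_sync:
        # Outputs are handed over only when unused parameters must be searched.
        reducer.prepare_for_backward([out] if find_unused_parameters else [])
    return out


if __name__ == "__main__":
    reducer = StubReducer()
    toy_forward(reducer, lambda v: v * 2, 3, find_unused_parameters=True)
    print(reducer.calls)
```

When require_backward_grad_sync is False, neither prepare_for_forward nor prepare_for_backward appears in the recorded calls. This corresponds to the two guarded branches in the real method.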

The actual code is as follows:

def forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function("DistributedDataParallel.forward"):
        # Save the thread-local state.
        self.reducer.save_thread_local_state()
        # If gradients are enabled and gradient sync is required, let the
        # reducer prepare for the forward pass.
        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            self.logger.set_runtime_stats_and_log()
            self.num_iterations += 1
            self.reducer.prepare_for_forward()
        # If ddp_join_enabled is configured, do the corresponding handling.
        if self.ddp_uneven_inputs_config.ddp_join_enabled:
            ones = torch.ones(1, device=self.device)
            work = dist.all_reduce(ones, group=self.process_group, async_op=True)
            if self.ddp_uneven_inputs_config.ddp_join_throw_on_early_termination:
                # Active ranks schedule an allreduce with zeros, inactive
                # ranks schedule them with 1. If the result != 0 it
                # indicates at least one rank has terminated and we should
                # throw.
                zeros = torch.zeros(1, device=self.device)
                dist.all_reduce(zeros, group=self.process_group)
                should_throw_stop_iteration = zeros.item()
                if should_throw_stop_iteration:
                    raise RuntimeError(
                        "Detected at least one rank that exhausted inputs. Throwing across all ranks."
                    )
            else:
                self.reducer._set_forward_pass_work_handle(  # used by join
                    work,
                    self.ddp_uneven_inputs_config.ddp_join_divide_by_initial_world_size,
                )

        # Calling _rebuild_buckets before forward computation,
        # It may allocate new buckets before deallocating old buckets
        # inside _rebuild_buckets. To save peak memory usage,
        # call _rebuild_buckets before the peak memory usage increases
        # during forward computation.
        # This should be called only once during whole training period.
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
            logging.info("Reducer buckets have been rebuilt in this iteration.")

        # Synchronize before the forward computation if required.
        if self.require_forward_param_sync:
            self._sync_params()

        if self.ddp_uneven_inputs_config.ddp_join_enabled:
            # Notify joined ranks whether they should sync in backwards pass or not.
            self._check_global_requires_backward_grad_sync(is_joined_rank=False)

        # Run the forward pass.
        if self.device_ids:
            # device_ids is set: move inputs to device_ids[0] first.
            inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
            output = self.module(*inputs[0], **kwargs[0])
        else:
            output = self.module(*inputs, **kwargs)

        # If backward gradients need to be synchronized, call prepare_for_backward.
        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            # With find_unused_parameters=True, a traversal at the end of forward
            # marks all unused parameters as ready ahead of time, so backward can
            # run on a subgraph, at the cost of some extra time.
            self.require_forward_param_sync = True
            # We'll return the output object verbatim since it is a freeform
            # object. We need to find any tensors in this object, though,
            # because we need to figure out which parameters were used during
            # this forward pass, to ensure we short circuit reduction for any
            # unused parameters. Only if `find_unused_parameters` is set.
            if self.find_unused_parameters and not self.static_graph:
                # Do not need to populate this for static graph.
                self.reducer.prepare_for_backward(list(_find_tensors(output)))
            else:
                self.reducer.prepare_for_backward([])
        else:
            self.require_forward_param_sync = False

    # TODO. Right now we add this sink for static_graph training only. once
    # this feature is stable, we will add this sink for all cases. E.g.
    # This sink can help capture more accurate backward start time as well.
    if self.static_graph and self.num_iterations == 1:
        # Need to grab list of tensors from user output in order to pass
        # to custom autograd function.
        output_tensor_list, treespec = tree_flatten(output)
        passthrough_tensor_list = _DDPSink.apply(
            self.reducer, *output_tensor_list
        )
        # Reconstruct output data structure.
        output = tree_unflatten(passthrough_tensor_list, treespec)
    return output

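Here, _sync_params performs the synchronization, delegating the work to _distributed_broadcast_coalesced. In the code shown, what gets broadcast is the module buffers, taken from an authoritative rank.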

def _sync_params(self):
    with torch.no_grad():
        # module buffer sync
        if self.will_sync_module_buffers():
            # Synchronize buffers across processes.
            # If we are running DDP with the join manager, we have to agree
            # upon a rank to sync module buffers from, since rank 0 may
            # already have been joined and have stale module buffers.
            if self.ddp_uneven_inputs_config.ddp_join_enabled:
                authoritative_rank = self._find_common_rank(
                    self._distributed_rank, True
                )
            else:
                # The process with rank 0 is considered the authoritative copy.
                authoritative_rank = 0
            self._distributed_broadcast_coalesced(
                self.modules_buffers[0],
                self.broadcast_bucket_size,
                authoritative_rank,
            )

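0x03 The C++ world
Next we enter the C++ world to see how it supports the forward pass. It breaks down into three parts: preparing for the forward pass, rebuilding buckets, and preparing for the backward pass.
3.1 Preparing for the forward pass
This increments num_iterations_ and, if runtime statistics are being collected, records the start time of the forward computation.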

void Reducer::prepare_for_forward() {
  std::lock_guard<std::mutex> lock(mutex_);
  num_iterations_++; // incremented here
  if (should_collect_runtime_stats()) {
    record_forward_compute_start_time();
  }
}

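3.2 Rebuilding buckets
Next the buckets are rebuilt, in these steps:

Configure the size limits.
Compute the bucket assignment by size.
Synchronize the bucket indices.
Initialize the buckets.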

bool Reducer::rebuild_buckets() {
  // Ensure reduction for previous backwards pass is finished. If user's model
  // has unused parameters for example, this will raise an error recommending to
  // run with find_unused_parameters=True, instead of the size mismatch
  // exception below.
  std::lock_guard<std::mutex> lock(mutex_);
  ensure_prior_reduction_finished();
  if (!should_rebuild_buckets() || rebuilt_params_.empty()) {
    return false;
  }

  std::vector<std::vector<size_t>> rebuilt_bucket_indices;
  // Configure the size limits.
  std::vector<size_t> bucket_size_limits;
  bucket_size_limits.push_back(kDefaultFirstBucketBytes);
  bucket_size_limits.push_back(bucket_bytes_cap_);
  // Compute the bucket assignment by size.
  rebuilt_bucket_indices = compute_bucket_assignment_by_size(
      rebuilt_params_,
      bucket_size_limits,
      expect_sparse_gradients_[0],
      rebuilt_param_indices_);

  // For rebuilt bucket indices, it needs to be synced across all ranks.
  // Broadcast the newly rebuilt bucket indices from rank 0 in default.
  // After syncing up rebuilt bucket indices, initialize buckets for reducer.
  sync_bucket_indices(rebuilt_bucket_indices); // synchronize bucket indices

  has_rebuilt_bucket_ = true;
  rebuilt_params_.clear();
  rebuilt_param_indices_.clear();

  initialize_buckets(std::move(rebuilt_bucket_indices)); // initialize buckets
  return true;
}

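Let us look at the rebuild in detail.
3.2.1 Computing bucket sizes
First, the key structures inside compute_bucket_assignment_by_size are shown below. BucketAccumulator can be thought of as the actual bucket.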

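struct BucketAccumulator {
  std::vector<size_t> indices; // bucket contents: a list of tensor indices
  size_t size = 0;             // bucket size in bytes, e.g. several MB
}; // the logical content of one bucket

// Keep vector of indices and size accumulator by tensor type and device.
std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
    buckets; // all buckets; each actual bucket is a BucketAccumulator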

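Second, here is the logic of compute_bucket_assignment_by_size:

Create the return value result and reserve space for it based on the number of input tensors.
Create buckets, the collection of all buckets; each actual bucket is a BucketAccumulator.
Iterate over all input tensors. For each tensor:
- If tensor_indices was given, take the tensor's index from it; otherwise use the loop position.
- If a sparse gradient is expected for this tensor, give the tensor a bucket of its own, because it cannot be grouped with other tensors.
- Build the bucket key from the tensor's type and device.
- Use the key to find the corresponding bucket, a BucketAccumulator.
- Append the new tensor's index to that bucket's indices, which is a list of tensor indices.
- Increase the bucket's size accordingly.
- If necessary, initialize the size-limit iterator for this key to the first limit.
- If the bucket size has reached the current size limit, the bucket is full and a new one should be started. Logically a new bucket is started, but the same map entry keeps being used: the type and device are unchanged, so accumulation continues under the same key, while the old indices have already been moved into result, which effectively empties the bucket.
  - Append the bucket contents to result; that is, once the bucket is large enough, its indices are moved into result.
  - Reset the bucket with BucketAccumulator(). Because bucket is a reference, the assignment empties the existing entry in place: the entry keeps being used, but its previous indices now live in result.
  - Advance to the next size limit for this type and device, if there is one.
Append the indices of the remaining buckets to result; some buckets were already moved into result earlier.
Sort result:
- If tensor_indices is not empty, the tensors are already in gradient-ready order, so no sorting is needed.
- If tensor_indices is empty, sort by the smallest tensor index in each bucket. This assumes that the order of the tensors is the order in which they are used (or, equivalently, the reverse of the order in which their gradients are produced). The sort ensures that buckets become ready in consecutive order.
- Note that the sort here is ascending; the order is reversed only when the Reducer is created: list(reversed(bucket_indices)).

Also note that tensors is parameters[0] from the Python code, and parameters[0] follows the order returned by parameters(), so DDP ultimately launches AllReduce in roughly the reverse order of model.parameters().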

std::vector<std::vector<size_t>> compute_bucket_assignment_by_size(
    const std::vector<at::Tensor>& tensors,
    const std::vector<size_t>& bucket_size_limits, // bucket size limits
    const std::vector<bool>& expect_sparse_gradient,
    const std::vector<int64_t>& tensor_indices) { // not passed at initialization time
  // Either expect_sparse_gradient is not specified or it has as many elements
  // as the vector with tensors.
  TORCH_INTERNAL_ASSERT(
      expect_sparse_gradient.empty() ||
      (tensors.size() == expect_sparse_gradient.size()));
  TORCH_INTERNAL_ASSERT(tensors.size() > 0);

  std::vector<std::vector<size_t>> result;
  result.reserve(tensors.size()); // reserve space

  // Keep iterator into the size_limit vector by tensor type and device.
  // This is done so that we can use the consecutive bucket limits per type.
  std::unordered_map<
      BucketKey,
      std::vector<size_t>::const_iterator,
      c10::hash<BucketKey>>
      bucket_size_limit_iterators;

  // Local accumulator type for a single bucket.
  struct BucketAccumulator {
    std::vector<size_t> indices; // bucket contents: a list of tensor indices
    size_t size = 0;             // bucket size in bytes, e.g. several MB
  }; // the logical content of one bucket

  // Keep vector of indices and size accumulator by tensor type and device.
  std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
      buckets; // all buckets; each actual bucket is a BucketAccumulator

  for (size_t i = 0; i < tensors.size(); i++) { // iterate over all input tensors
    const auto& tensor = tensors[i];
    TORCH_CHECK(!tensor.is_sparse(), "No support for sparse tensors.");

    // when tensor_indices is empty, the index of tensors[i] assigned to
    // bucket is i, otherwise the tensor index is tensor_indices[i].
    auto tensor_index = i; // default: indices count up from 0
    if (!tensor_indices.empty()) {
      tensor_index = tensor_indices[i]; // use the supplied index if there is one
    }

    // If we expect a sparse gradient to be produced for this tensor, it cannot
    // be grouped together with other gradients and gets its own bucket.
    if (!expect_sparse_gradient.empty() &&
        expect_sparse_gradient[tensor_index]) {
      result.push_back({tensor_index});
      continue;
    }

    auto key = BucketKey(tensor.scalar_type(), tensor.device()); // build the bucket key
    auto& bucket = buckets[key]; // find the bucket (a BucketAccumulator) for this key
    bucket.indices.push_back(tensor_index); // append the tensor index to the bucket
    bucket.size += tensor.numel() * tensor.element_size(); // grow the bucket size

    // Initialize bucket size limit iterator if necessary.
    if (bucket_size_limit_iterators.count(key) == 0) {
      bucket_size_limit_iterators[key] = bucket_size_limits.begin();
    }

    // bucket_size_limit_iterator walks the list of size limits, i.e.
    // [_DEFAULT_FIRST_BUCKET_BYTES, int(bucket_cap_mb * 1024 * 1024)]
    auto& bucket_size_limit_iterator = bucket_size_limit_iterators[key];
    const auto bucket_size_limit = *bucket_size_limit_iterator; // current limit
    if (bucket.size >= bucket_size_limit) {
      // The bucket has reached its limit. Move its indices into result and
      // keep accumulating under the same key (same type and device).
      result.emplace_back(std::move(bucket.indices));
      // bucket is a reference, so this assignment empties the map entry in place.
      bucket = BucketAccumulator();

      // Advance to the next bucket size limit for this type/device.
      auto next = bucket_size_limit_iterator + 1;
      if (next != bucket_size_limits.end()) {
        bucket_size_limit_iterator = next;
      }
    }
  }

  // Add remaining buckets (some were already moved into result above).
  for (auto& it : buckets) {
    auto& bucket = it.second;
    if (!bucket.indices.empty()) {
      result.emplace_back(std::move(bucket.indices));
    }
  }

  // If tensor_indices is not empty, the order of the tensors is in the gradient
  // ready order, so no need to sort.
  // If tensor_indices is empty, sort resulting buckets by the minimum tensor
  // index they include. We assume that the order of the tensors is the order in
  // which they are used (or the reverse order in which their gradients are
  // produced). This sorting step ensures that the buckets are ready in
  // consecutive order.
  // Note: this is an ascending sort; the order is reversed only when the
  // Reducer is created: list(reversed(bucket_indices))
  if (tensor_indices.empty()) {
    std::sort(
        result.begin(),
        result.end(),
        [](const std::vector<size_t>& a, const std::vector<size_t>& b) {
          // Compare two buckets by the smallest index each one contains.
          const auto amin = std::min_element(a.begin(), a.end());
          const auto bmin = std::min_element(b.begin(), b.end());
          return *amin < *bmin;
        });
  }

  return result;
}

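The final result looks like this: each vector corresponds to one bucket and holds tensor indices, and the buckets are sorted in ascending order. Model parameters are assigned to buckets in (roughly) the reverse order of Model.parameters() for the given model. The reverse order is used because DDP expects gradients to become ready in approximately that order during the backward pass.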

[Figure: result is a list of buckets; each bucket is a vector of tensor indices]
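To make the assignment logic concrete, here is a minimal pure-Python sketch of the same idea. FakeTensor and assign_buckets are hypothetical names used only for illustration, tensors are reduced to the three fields the algorithm reads, and the sparse-gradient branch is left out.

```python
from collections import namedtuple

# Hypothetical stand-in for a tensor: only the fields the algorithm reads.
FakeTensor = namedtuple("FakeTensor", ["dtype", "device", "nbytes"])


def assign_buckets(tensors, size_limits, tensor_indices=None):
    """Group tensor indices into buckets per (dtype, device), capped by size."""
    result = []
    open_buckets = {}  # (dtype, device) -> [indices, accumulated bytes]
    limit_pos = {}     # (dtype, device) -> position in size_limits
    for i, t in enumerate(tensors):
        index = tensor_indices[i] if tensor_indices else i
        key = (t.dtype, t.device)
        bucket = open_buckets.setdefault(key, [[], 0])
        bucket[0].append(index)
        bucket[1] += t.nbytes
        pos = limit_pos.setdefault(key, 0)
        if bucket[1] >= size_limits[pos]:
            result.append(bucket[0])     # the full bucket moves into result
            open_buckets[key] = [[], 0]  # start a fresh bucket for this key
            # Advance to the next limit, staying on the last one at the end.
            limit_pos[key] = min(pos + 1, len(size_limits) - 1)
    # Flush buckets that never reached their limit.
    result.extend(b[0] for b in open_buckets.values() if b[0])
    if not tensor_indices:
        result.sort(key=min)  # order buckets by their smallest index
    return result


if __name__ == "__main__":
    params = [
        FakeTensor("float32", "cuda:0", 4),
        FakeTensor("float32", "cuda:0", 4),
        FakeTensor("float16", "cuda:0", 2),
        FakeTensor("float32", "cuda:0", 8),
        FakeTensor("float32", "cuda:0", 4),
    ]
    print(assign_buckets(params, size_limits=[8, 12]))
```

With limits of 8 and then 12 bytes, the first float32 bucket closes after two tensors and the second after reaching 12 bytes. The float16 tensor lands in a bucket of its own because its key differs.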

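3.2.2 Synchronizing bucket indices
Once the assignment has been computed, sync_bucket_indices synchronizes the bucket indices. Its logic is:

Iterate over the buckets and record the size of each one in bucket_sizes.
Configure the TensorOptions.
Put the bucket indices and the number of buckets into indices_tensor. The tensor is read and written through a PyTorch accessor: an accessor behaves like a tensor, but it hard-codes the dimension count and dtype as template parameters, which makes element access efficient.
Because a ProcessGroup such as NCCL only supports operations on device tensors, copy indices_tensor into indices_tensor_device.
Broadcast indices_tensor_device.
Broadcast the bucket sizes in the same way.
After the broadcasts, iterate over the buckets and use the num_buckets, bucket_sizes_tensor and indices_tensor received from rank 0 to update the bucket_indices argument.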

void Reducer::sync_bucket_indices(
    std::vector<std::vector<size_t>>& bucket_indices) {
  auto num_buckets = bucket_indices.size();
  std::vector<size_t> bucket_sizes;
  bucket_sizes.reserve(num_buckets);
  int64_t total_size = 0;
  // Iterate over the buckets and record each bucket's size in bucket_sizes.
  for (size_t i = 0; i < num_buckets; i++) {
    auto bucket_size = bucket_indices.at(i).size();
    bucket_sizes.push_back(bucket_size);
    total_size += bucket_size;
  }

  // Configure the TensorOptions.
  at::TensorOptions options;
  options = options.dtype(at::kInt);
  options = options.device(replicas_[0][0].device());

  // Group indices and num_bucket together into indices_tensor
  // Broadcast this tensor first, as its size is equal among all processes
  // The tensor is accessed through an accessor, which hard-codes the number
  // of dimensions and the dtype as template parameters for efficient access.
  auto indices_tensor = at::empty({total_size + 1}, at::kInt);
  auto indices_accessor = indices_tensor.accessor<int, 1>();
  auto indices_accessor_Index = 0;
  for (size_t i = 0; i < num_buckets; i++) {
    const auto& bucket_size = bucket_indices.at(i).size();
    for (size_t j = 0; j < bucket_size; j++) {
      indices_accessor[indices_accessor_Index++] = bucket_indices[i][j];
    }
  }
  indices_accessor[indices_accessor_Index] = num_buckets;

  // Copy CPU tensor to device tensor, as the process_group_ could be NCCL and
  // it can only broadcast device tensors.
  auto indices_tensor_device = at::empty({total_size + 1}, options);
  indices_tensor_device.copy_(indices_tensor, /*non_blocking=*/true);
  std::vector<at::Tensor> indices_tensor_list = {indices_tensor_device};
  // Broadcast indices_tensor_device.
  process_group_->broadcast(indices_tensor_list)->wait();
  indices_tensor.copy_(indices_tensor_list.front(), /*non_blocking=*/false);

  // Update num_buckets after receiving it from rank 0
  num_buckets = indices_accessor[indices_accessor_Index];

  // Broadcast bucket_sizes
  auto bucket_sizes_tensor = at::empty({(int64_t)num_buckets}, at::kInt);
  auto bucket_sizes_accessor = bucket_sizes_tensor.accessor<int, 1>();
  for (size_t i = 0; i < num_buckets; i++) {
    // For rank != 0, it is possible that local num buckets bucket_sizes.size()
    // is smaller than broadcasted num_buckets
    bucket_sizes_accessor[i] =
        bucket_sizes.at(std::min(i, (bucket_sizes.size() - 1)));
  }
  auto bucket_sizes_tensor_device = at::empty({(int64_t)num_buckets}, options);
  bucket_sizes_tensor_device.copy_(bucket_sizes_tensor, /*non_blocking=*/true);
  std::vector<at::Tensor> bucket_sizes_tensor_list = {
      bucket_sizes_tensor_device};
  process_group_->broadcast(bucket_sizes_tensor_list)->wait();
  bucket_sizes_tensor.copy_(
      bucket_sizes_tensor_list.front(), /*non_blocking=*/false);

  // Clear bucket_indices first, and then update bucket_indices using received
  // num_buckets, bucket_sizes_tensor and indices_tensor from rank 0
  bucket_indices.clear();
  bucket_indices.reserve(num_buckets);
  indices_accessor_Index = 0;
  for (size_t i = 0; i < num_buckets; i++) {
    const auto& bucket_size = bucket_sizes_accessor[i];
    std::vector<size_t> bucket;
    bucket.reserve(bucket_size);
    for (size_t j = 0; j < bucket_size; j++) {
      bucket.push_back(indices_accessor[indices_accessor_Index++]);
    }
    bucket_indices.emplace_back(std::move(bucket));
  }
}
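The encoding used by sync_bucket_indices can be sketched in plain Python, with lists standing in for the two broadcast tensors. pack and unpack are hypothetical helper names, and no communication actually happens here; the point is only how the nested bucket list survives a round trip through two flat integer payloads.

```python
def pack(bucket_indices):
    """Flatten buckets into the two integer payloads that get broadcast."""
    flat = [idx for bucket in bucket_indices for idx in bucket]
    flat.append(len(bucket_indices))  # the last slot carries num_buckets
    sizes = [len(bucket) for bucket in bucket_indices]
    return flat, sizes


def unpack(flat, sizes):
    """Rebuild the nested bucket list from the payloads sent by rank 0."""
    num_buckets = flat[-1]
    buckets, pos = [], 0
    for size in sizes[:num_buckets]:
        buckets.append(flat[pos:pos + size])
        pos += size
    return buckets


if __name__ == "__main__":
    flat, sizes = pack([[0, 1], [2], [3, 4]])
    print(flat, sizes)
    print(unpack(flat, sizes))
```

The flat payload has the same length on every rank (total number of indices plus one), which is why the C++ code broadcasts it first. The length of the sizes payload depends on num_buckets, which a non-zero rank only learns after that first broadcast.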

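3.2.3 Initializing buckets
After synchronization the buckets are initialized. This code was analyzed in a previous article, so it is omitted here.
3.3 Preparing for the backward pass
After the forward pass completes, prepare_for_backward is called to prepare for the backward pass.
It roughly consists of two steps: resetting, and finding unused parameters.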

void Reducer::prepare_for_backward(
    const std::vector<torch::autograd::Variable>& outputs) {
  std::lock_guard<std::mutex> lock(mutex_);

  // Record the start time.
  cpu_timer_.backward_compute_start_time = current_time_in_nanos();
  if (should_collect_runtime_stats()) {
    record_backward_compute_start_time();
  }

  // Reset accounting.
  expect_autograd_hooks_ = true;

  reset_bucket_counting();

  // Reset unused parameter accounting.
  has_marked_unused_parameters_ = false;
  // Reset per iteration marked ready parameters.
  perIterationReadyParams_.clear();

  // If static graph is not set, search graph to detect unused parameters.
  // When static graph is set, unused_parameters_ will be detected and will
  // not change after 1st iteration.
  // If static_graph_ = false and find_unused_parameters_ is false,
  // we assume that autograd hooks for ALL variables will be called,
  // and we don't have to search the autograd graph for presence of these hooks.
  if (dynamic_graph_find_unused()) {
    unused_parameters_.clear();
    search_unused_parameters(outputs); // find the unused parameters
  }
}

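3.3.1 Reset
This iterates over the buckets and, for each bucket, resets the pending count of each of its replicas. A replica's pending count is the number of variables that replica holds in this bucket.
For a static graph, numGradHooksTriggeredMapPerIteration_ is reset as well.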

void Reducer::reset_bucket_counting() {
  next_bucket_ = 0;
  // Reset num_buckets_ready_ at the beginning of backward computation
  // in each iteration.
  num_buckets_ready_ = 0;

  for (auto& bucket : buckets_) { // iterate over the buckets
    for (auto& replica : bucket.replicas) {
      // A replica's pending count is the number of variables it holds in this bucket.
      replica.pending = replica.variables.size();
    }
    // A bucket's pending count is the number of model replicas.
    bucket.pending = bucket.replicas.size();
  }

  if (static_graph_) {
    // Reset numGradHooksTriggeredMapPerIteration_.
    numGradHooksTriggeredMapPerIteration_ = numGradHooksTriggeredMap_;
  }
}
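The role of the pending counters can be illustrated with a small sketch. ToyBucket and mark_variable_ready are hypothetical simplifications assuming a single replica per bucket: the counter is reset to the number of variables, each ready gradient decrements it, and the bucket is only reported as ready when it reaches zero.

```python
class ToyBucket:
    """Hypothetical single-replica bucket that only tracks a pending counter."""

    def __init__(self, variable_indices):
        self.variables = list(variable_indices)
        self.pending = 0


def reset_bucket_counting(buckets):
    # Before backward: every variable in every bucket is still outstanding.
    for bucket in buckets:
        bucket.pending = len(bucket.variables)


def mark_variable_ready(buckets, bucket_id, ready_log):
    # One gradient arrived; the bucket is ready once nothing is outstanding.
    bucket = buckets[bucket_id]
    bucket.pending -= 1
    if bucket.pending == 0:
        ready_log.append(bucket_id)


if __name__ == "__main__":
    buckets = [ToyBucket([0, 1]), ToyBucket([2])]
    reset_bucket_counting(buckets)
    ready = []
    mark_variable_ready(buckets, 1, ready)  # bucket 1 has one variable
    mark_variable_ready(buckets, 0, ready)  # bucket 0 still waits for one more
    print(ready, buckets[0].pending)
```

If the second variable of bucket 0 never produces a gradient, its counter stays at 1 and the bucket is never reported as ready. That is the situation find_unused_parameters is meant to avoid.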

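3.3.2 Finding unused parameters
search_unused_parameters implements the search for unused parameters.
First consider the Reducer member variable find_unused_parameters_. If find_unused_parameters_ is set to true, then at the end of the forward pass DDP traverses the autograd graph backwards from the given outputs to find every parameter that was not used, and marks each of them as ready.
DDP holds a pointer to the gradient accumulation function of every parameter. Parameters whose accumulation functions do not show up in the autograd graph are marked ready for reduction as soon as the first autograd hook is called.
This is not done immediately because the model output may be ignored; reductions should only start once torch.autograd.backward() is called.
As you can see, this is expensive. Why do it? Because the dynamic computation graph can change.

During training, an iteration may use only a subgraph of the model, and because PyTorch builds the graph dynamically, that subgraph can change between iterations. In other words, some parameters may be skipped in the next training iteration.
At the same time, all parameters were assigned to buckets up front, and the hook only triggers communication once an entire bucket is ready (that is, pending == 0). If unused parameters were not marked as ready, communication for their bucket could never proceed.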

// Traverse the autograd graph starting at the specified output.
// All parameters for which we have a pointer to their gradient accumulation
// functions, but don't show up in the autograd graph will be marked ready for
// for reduction as soon as the first autograd hook is called. This is not
// done immediately because the model output may be ignored, and we only
// want to start performing reductions on `torch.autograd.backward()`.
void Reducer::search_unused_parameters(
    const std::vector<torch::autograd::Variable>& outputs) {
  std::unordered_set<torch::autograd::Node*> seen;
  std::vector<torch::autograd::Node*> queue;

  RECORD_FUNCTION(
      "torch.distributed.ddp.reducer::search_unused_parameters",
      std::vector<c10::IValue>());

  // Seed queue with the grad functions of all outputs.
  for (const auto& output : outputs) {
    const auto& grad_fn = output.grad_fn();
    if (grad_fn) {
      queue.push_back(grad_fn.get());
    }
  }

  // Traverse the autograd graph starting at the specified output.
  // For each function taken from the queue, follow its outgoing edges in the
  // backward graph and enqueue every node not seen before. In the end, seen
  // holds every node reachable from the outputs.
  while (!queue.empty()) {
    auto fn = queue.back();
    queue.pop_back();
    for (const auto& edge : fn->next_edges()) {
      if (auto next_ptr = edge.function.get()) {
        const bool was_inserted = seen.insert(next_ptr).second;
        if (was_inserted) {
          queue.push_back(next_ptr);
        }
      }
    }
  }

  // Find accumulator functions that don't show up in this graph.
  // gradAccToVariableMap_ holds every variable that needs to be reduced; any
  // accumulator missing from seen belongs to an unused parameter.
  for (const auto& it : gradAccToVariableMap_) {
    // If the accumulator function is present in the graph, we know
    // a gradient will be computed for the corresponding parameter.
    if (seen.count(it.first) == 0) {
      unused_parameters_.push_back(it.second);
    }
  }

  // Warn user about unnecessary perf hit if all parameters were used in
  // forward.
  if (unused_parameters_.empty()) {
    TORCH_WARN_ONCE(
        "find_unused_parameters=True was specified in DDP constructor, "
        "but did not find any unused parameters in the forward pass. This flag "
        "results in an extra traversal of the autograd graph every iteration, "
        " which can adversely affect performance. If your model indeed never "
        "has any unused parameters in the forward pass, consider turning this "
        "flag off. Note that this warning may be a false positive if your model "
        "has flow control causing later iterations to have unused parameters.");
  }
}
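The traversal above can be mirrored on a toy graph. In this sketch the node names, the next_edges dictionary and the find_unused helper are all hypothetical; the backward graph is a plain mapping from a node to its successors, and accumulators maps each gradient accumulator node to the parameter it belongs to.

```python
def find_unused(output_nodes, next_edges, accumulators):
    """Return parameters whose accumulator is unreachable from the outputs.

    next_edges: node -> list of successor nodes in the backward graph.
    accumulators: accumulator node -> parameter name.
    """
    seen = set()
    stack = list(output_nodes)  # seeded with the grad functions of the outputs
    while stack:
        node = stack.pop()
        for nxt in next_edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    # Accumulators that never showed up belong to unused parameters.
    return [param for acc, param in accumulators.items() if acc not in seen]


if __name__ == "__main__":
    next_edges = {
        "loss_fn": ["mul"],
        "mul": ["acc_w1", "add"],
        "add": ["acc_b1"],
    }
    accumulators = {"acc_w1": "w1", "acc_b1": "b1", "acc_w2": "w2"}
    print(find_unused(["loss_fn"], next_edges, accumulators))
```

Here w2 never takes part in the forward computation, so its accumulator is not reachable from loss_fn and it is the only parameter reported as unused.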

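At this point the forward pass is finished, and we have the following:

The parameters that need gradients have been assigned to buckets.
The buckets have been rebuilt.
The forward pass has completed.
Starting from the given outputs, the autograd graph has been traversed to find all unused parameters, so that they can be marked as ready.

The next article analyzes the backward pass.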
0xFF References
pytorch distributed series 3: what does torch.utils.data.distributed.DistributedSampler do during distributed training?
pytorch distributed series 1: understanding the environment variables related to torch.distributed.launch
pytorch distributed series 2: how does DistributedDataParallel synchronize?
A personal summary of pytorch (distributed) data parallel practice: DataParallel/DistributedDataParallel
Pytorch's nn.DataParallel
https://discuss.pytorch.org/t/dataparallel-imbalanced-memory-usage/22551/20
https://pytorch.org/docs/stable/distributed.html
PyTorch source code walkthrough: a look at distributed training
Hands-on tutorial: the C++ implementation of PyTorch AutoGrad
PYTORCH automatic differentiation (1)
How does PyTorch accelerate data parallel training? Distributed secrets revealed
pytorch distributed training (2: init_process_group)
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
https://pytorch.org/docs/master/notes/ddp.html
https://pytorch.org/tutorials/intermediate/dist_tuto.html
PyTorch source code walkthrough of DP & DDP: model parallelism and distributed training
Parameters and buffers in Pytorch models