haiscale.ddp

DistributedDataParallel

Distributed data-parallel wrapper built on hfreduce

class haiscale.ddp.DistributedDataParallel(module, device_ids=None, broadcast_buffers=True, process_group=None, find_unused_parameters=False)

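Distributed data-parallel wrapper built on hfreduce.

It uses hfreduce, the multi-GPU communication library developed in-house by High-Flyer AI (幻方 AI), and can replace PyTorch DDP to speed up training. Usage is the same as torch.nn.parallel.DistributedDataParallel.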

Parameters

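module (torch.nn.Module) – The PyTorch model to wrap.
device_ids (list) – Ids of the GPUs the model resides on. If None, the value returned by torch.cuda.current_device() is used. Default: None.
broadcast_buffers (bool) – Whether to broadcast the buffers on rank 0 to all other ranks before forward. Default: True.
process_group (ProcessGroup) – The ProcessGroup object to use. If None, the default process group is used.
find_unused_parameters (bool) – Whether to traverse the computation graph to find parameters that do not take part in backward. It needs to be True in a few cases (for example, when training a GAN). Default: False.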

Note

An error may be reported when the process exits. Calling model.reducer.stop() before exiting resolves it.
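One way to make sure the reducer is always stopped is to wrap the training call in try/finally. The following is a minimal sketch, not part of haiscale: the helper name train_then_stop and the callback train_fn are hypothetical, and the only library call it relies on is the model.reducer.stop() mentioned in the note.

```python
def train_then_stop(ddp_model, train_fn):
    """Run train_fn(ddp_model), then stop the reducer even if training raises.

    ddp_model: a haiscale.ddp.DistributedDataParallel instance; this helper
               only assumes that it exposes reducer.stop().
    train_fn:  hypothetical user callback that contains the training loop.
    """
    try:
        return train_fn(ddp_model)
    finally:
        # Stop the reducer before the process exits, as the note suggests.
        ddp_model.reducer.stop()
```

Calling train_then_stop(model, train) at the end of the script keeps the cleanup in one place, so an exception inside the training loop does not skip it.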

Examples:

import torch
from haiscale.ddp import DistributedDataParallel

# model, local_rank, dataloader, loss_fn and epochs are defined elsewhere
model = DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(epochs):
    for step, (x, y) in enumerate(dataloader):
        optimizer.zero_grad()
        output = model(x)
        loss_fn(output, y).backward()  # (prediction, target) order, as in torch.nn losses
        optimizer.step()

no_sync()

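There are two situations in which the gradients do not need to be all-reduced:

torch.is_grad_enabled() returns False: no gradients are produced during forward, and usually no backward is run either.
Several forward-backward passes are run and their gradients are accumulated: only the last backward needs to synchronize the gradients.

This method mainly targets the second situation. Inside it, neither gradients nor buffers are synchronized.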

Example:

import haiscale.ddp

ddp = haiscale.ddp.DistributedDataParallel(model, ...)
with ddp.no_sync():
    for input in inputs:
        ddp(input).backward()  # no synchronization, accumulate grads
ddp(another_input).backward()  # synchronize grads
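The same pattern can be folded into a complete gradient-accumulation step that also handles the optimizer. The following is a rough sketch under stated assumptions: accumulation_step and micro_batches are hypothetical names, the loss is assumed to take (prediction, target), and dividing the loss by the number of micro-batches is a design choice (averaging rather than summing), not something haiscale requires. The only DDP method it uses is no_sync().

```python
def accumulation_step(ddp_model, optimizer, loss_fn, micro_batches):
    """Run one optimizer step over several micro-batches, syncing grads once."""
    micro_batches = list(micro_batches)
    if not micro_batches:
        raise ValueError("micro_batches must not be empty")

    # Scale each loss so the accumulated gradient is an average, not a sum.
    scale = 1.0 / len(micro_batches)
    *head, (last_x, last_y) = micro_batches

    optimizer.zero_grad()
    with ddp_model.no_sync():
        for x, y in head:
            # Inside no_sync(): gradients are only accumulated locally.
            (loss_fn(ddp_model(x), y) * scale).backward()

    # Outside no_sync(): this final backward synchronizes the gradients.
    (loss_fn(ddp_model(last_x), last_y) * scale).backward()
    optimizer.step()
```

With N micro-batches this performs N - 1 local backward passes and a single synchronized one, so communication happens once per optimizer step.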