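Problem:
When accelerate starts the rendezvous (rdzv), the default timeout is 900 seconds (or possibly 1800?), which is too short when some nodes of a multi-node job are still pending and the others have to wait for them.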
Solution:
accelerate launch [Other params] --rdzv_conf timeout=86400 [Other params]
That is all that is needed; the value is in seconds, so 86400 means 24 hours.
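As a rough sketch of how the timeout value might be derived and passed, the snippet below converts a waiting budget in hours to seconds and assembles the launch command as a string. The 24-hour budget, the node count, and the script name train.py are placeholders chosen for illustration, not values from any real setup; the command is only printed, not executed.

```shell
# Sketch only: WAIT_HOURS, the node count, and train.py are illustrative placeholders.
WAIT_HOURS=24
RDZV_TIMEOUT=$((WAIT_HOURS * 3600))   # hours -> seconds

# Assemble the launch command; --rdzv_conf takes key=value pairs.
CMD="accelerate launch --num_machines 2 --rdzv_conf timeout=${RDZV_TIMEOUT} train.py"

# Print the command for inspection instead of running it.
echo "$CMD"
```

Choosing the timeout from the longest queue wait you expect, rather than hard-coding a number, makes it easier to adjust per cluster.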
References:
https://github.com/huggingface/accelerate/blob/main/docs/source/basic_tutorials/launch.md
https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/launch.py#L333