the spark of OPT-175B
some recent miscellaneous notes
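infra was largely decoupled from the internal production system
the focus is not on how infra helps with automatic recovery (that is only mentioned briefly); the focus is on hyperparameter tuning during training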
famous uncorrectable ECC error
we just restart the run
try to make the run stable (stable in the mathematical, i.e. numerical, sense)
FP16
Lost GPU
CUDA errors
Job hanging
NCCL error
Job Slowdown
High DRAM correctable errors etc.
blob storage issues
when we are training these models, we kind of just stare at tensorboard all day
in general, a mixture of hardware issues and training issues such as numerical convergence
~30 days spent changing hyperparameters to try to get through
56 days, 53-54 restarts; OPT-175B survived 143K steps
Andrej Karpathy
LLM
Llama-2-70B
fp16: 2 bytes per parameter, 70B parameters
2 bytes * 70B = 140B bytes = 140 * 1,000,000,000 bytes = 140,000,000,000 bytes = 140 gigabytes (unit ladder: bytes, KB, MB, GB)
140GB
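The size arithmetic above can be checked with a short Python sketch. The helper name `model_size_bytes` is made up for illustration, and the calculation only counts the weights (no optimizer state, activations, or file-format overhead).

```python
def model_size_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Raw weight size in bytes; fp16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param


size = model_size_bytes(70_000_000_000)  # 70B parameters in fp16
print(size)          # 140000000000 bytes
print(size / 1e9)    # 140.0 decimal gigabytes (1 GB = 10^9 bytes)
print(size / 2**30)  # about 130.4 if counted in binary GiB (2^30 bytes) instead
```

The 140 GB figure uses decimal gigabytes; tools that report GiB will show a smaller number for the same file.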
tokenize
encoder: converts a string into integer codes
decoder: converts integer codes back into a string
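A minimal sketch of the encoder/decoder idea, assuming a toy character-level vocabulary built from a single sample string. Real LLM tokenizers work on subword units (for example byte-pair encoding) rather than single characters, and the names `stoi`, `itos`, `encode`, and `decode` are illustrative choices, not taken from any particular library.

```python
text = "hello world"

# Vocabulary: every distinct character in the sample text, in sorted order.
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # string -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> string


def encode(s: str) -> list[int]:
    """Encoder: map each character of the string to its integer code."""
    return [stoi[ch] for ch in s]


def decode(ids: list[int]) -> str:
    """Decoder: map integer codes back to characters and join them."""
    return "".join(itos[i] for i in ids)


ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

Encoding followed by decoding returns the original string, as long as every character is in the vocabulary; an unseen character would raise a `KeyError` in this sketch.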