THP 会让 os 申请连续的内存空间大小, 但如果申请不到, 则 os 会开始 compact, reclaim or page out other pages;
That process is expensive and could cause latency spikes (up to seconds)
cat /proc/buddyinfo
Each column represents the number of pages of a certain order which are
available. In this case, there are 0 chunks of 2^0PAGE_SIZE available in
ZONE_DMA, 4 chunks of 2^1PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc…
a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators.
few-shot
TPU v4 Pods
Pipelining is typically used with DCN
word to vector, This vector represents the word’s meaning and context within the given language
This means that the output of a layer is added to the initial input, allowing the model to learn to only make small changes to the input
The decoder’s job is to produce the English sentence based on both the original French sentence and the bits of the English sentence it has generated so far.
Input Embedding: Just as with the Encoder, the input to the Decoder (which is the target sequence during training) is first embedded into continuous vectors.
It’s important to note that this masking is only applied during training. During inference, the decoder can attend to all words in the target sequence, including future words.
To summarize, the Decoder in the Transformer architecture processes its input through self-attention, cross-attention with the Encoder’s output, and position-wise Feed-Forward networks, repeatedly for each stacked block, culminating in a final output sequence after the softmax operation.
The nvml.h file is a direct copy of nvml.h from the NVIDIA driver. Since the NVML API is guaranteed to be backwards compatible, we should strive to keep this always up to date with the latest.
~/projects/go-nvml ❯ nvidia-smi Sun Dec 17 18:57:57 2023 +---------------------------------------------------------------------------------------+ | NVIDIA-SMI 545.29.04 Driver Version: 546.17 CUDA Version: 12.3 | |-----------------------------------------+----------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+======================+======================| | 0 NVIDIA GeForce RTX 3070 Ti On | 00000000:06:00.0 On | N/A | | 0% 33C P8 13W / 290W | 1139MiB / 8192MiB | 1% Default | | | | N/A | +-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=======================================================================================| | 0 N/A N/A 23 G /Xwayland N/A | +---------------------------------------------------------------------------------------+
funcgetNvidiaDeviceCount() { ret := nvml.Init() if ret != nvml.SUCCESS { log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret)) } count, ret := nvml.DeviceGetCount() if ret != nvml.SUCCESS { log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret)) } fmt.Printf("%d\n", count) }
-Wl,-z,lazy: The -Wl,-z,lazy flag in the gcc command is a linker option used to instruct the linker to utilize lazy binding for dynamic libraries during the linking process.
When a program uses shared libraries (dynamic libraries), such as .so files in Linux, the linking process involves resolving symbols (functions or global variables) from these libraries. Lazy binding delays the resolution of these symbols until they are actually referenced during the program’s execution, rather than resolving all symbols at startup.
Lazy binding delays the resolution of these symbols until they are actually referenced during the program’s execution, rather than resolving all symbols at startup.
-Wl,-z,now: When you compile a program using gcc with the -Wl,-z,now flag, it influences how the dynamic linker behaves at runtime, particularly when the program is executed and loaded into memory. This flag impacts the linking stage, ensuring that symbols from shared libraries are resolved and bound immediately during the linking phase.
During the binary’s execution, when shared libraries are loaded, immediate binding might help in reducing the overhead associated with symbol resolution at runtime because the symbols are already resolved and bound during the linking process.
In summary, the -Wl,-z,now flag influences the behavior of the linker while creating the binary, affecting how symbol resolution occurs when the binary is loaded and executed, potentially impacting the startup performance by pre-resolving symbols.