Diffusion Language Models combining deep narrow networks, Canon layers (depthwise causal convolutions), and WSD (Warmup-Stable-Decay) training.
CPU demo of dhara-250M in its three modes: autoregressive (AR), diffusion, and self-speculative (self-spec) decoding.
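
Two of the ingredients named above, the depthwise causal convolution behind Canon layers and the Warmup-Stable-Decay schedule, can be illustrated in a few lines of plain Python. This is only a sketch and not the dhara-250M implementation: the function names `wsd_lr` and `causal_depthwise_conv`, the kernel layout, the linear warmup and decay shapes, and every default value (`peak_lr`, `warmup_frac`, `decay_frac`, `min_lr`) are assumptions made for illustration.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.1,
           decay_frac=0.2, min_lr=0.0):
    """Learning rate at `step` (0-indexed) under a Warmup-Stable-Decay schedule.

    Assumed shape: linear warmup to peak_lr, a constant plateau, then a
    linear decay towards min_lr over the final decay_frac of training.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr towards min_lr.
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress


def causal_depthwise_conv(x, kernels):
    """Depthwise causal 1-D convolution over a sequence.

    x:       list of T time steps, each a list of C channel values.
    kernels: list of C kernels (one per channel); kernels[c][j] weights
             the value j steps in the past, so j = 0 is the current token.

    "Depthwise" means each channel is filtered independently; "causal"
    means the output at time t only reads inputs at times <= t.
    """
    num_steps = len(x)
    num_channels = len(kernels)
    out = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for c in range(num_channels):
            acc = 0.0
            for j, weight in enumerate(kernels[c]):
                if t - j >= 0:  # positions before the sequence start count as zero
                    acc += weight * x[t - j][c]
            out[t][c] = acc
    return out


if __name__ == "__main__":
    # Learning rate at a few points of a hypothetical 100-step run.
    for s in (0, 9, 50, 80, 99):
        print(s, wsd_lr(s, total_steps=100))

    # Three tokens, two channels, kernel width 2.
    seq = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    print(causal_depthwise_conv(seq, [[1.0, 1.0], [0.5, 0.0]]))
```

Because the convolution is causal, changing a later token never alters the outputs at earlier positions. That property is what an autoregressive decoding mode requires of any layer that mixes information across positions.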