In masked diffusion LMs, we sample a noise level t ∈ (0,1]. Each non-special token is independently replaced with [MASK] with probability t. The model is then trained to denoise the corrupted sequence back to the original.
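Written out in the document's notation (the per-position superscript i is this sketch's own convention, not from the source), the corruption factorizes over positions:

q(xₜ | x₀) = ∏ᵢ q(xₜⁱ | x₀ⁱ),  where  q(xₜⁱ = [MASK]) = t  and  q(xₜⁱ = x₀ⁱ) = 1 − t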
Sample t = 0.5 (medium noise)
t = 0.5 # noise level sampled from uniform or cosine schedule
# For each real content token (positions 1–5):
# mask it with probability t = 0.5
# Special tokens [BOS], [EOS], [PAD] are NEVER masked.
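As a concrete sketch in PyTorch (the token IDs, content-token values, and tensor names below are illustrative assumptions, not from the source):

import torch

# Hypothetical vocabulary IDs, for illustration only.
BOS_ID, EOS_ID, PAD_ID, MASK_ID = 0, 1, 2, 3

# [BOS] I love deep learn ##ing [EOS] [PAD]; content-token IDs are made up.
x0 = torch.tensor([BOS_ID, 10, 11, 12, 13, 14, EOS_ID, PAD_ID])

t = 1.0 - torch.rand(())  # uniform on (0, 1]; a cosine schedule would reshape this draw
special = (x0 == BOS_ID) | (x0 == EOS_ID) | (x0 == PAD_ID)  # positions that are never masked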
Mask sampling (one possible outcome)
uniform_samples = [—, 0.31, 0.78, 0.44, 0.62, 0.19, —, —]
mask_applied    = [—, 0.31<0.5, 0.78<0.5, 0.44<0.5, 0.62<0.5, 0.19<0.5, —, —]
= [—, True, False, True, False, True, —, —]
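In code, this step is one comparison, zeroed out at special positions (continuing the sketch above):

u = torch.rand(x0.shape)        # one uniform draw per position, as in the row above
mask_bool = (u < t) & ~special  # True exactly where a content token gets masked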
Original vs corrupted sequence
x₀ (original)    = [BOS,    I,      love,   deep,   learn,  ##ing,  EOS,    PAD]
xₜ (model input) = [BOS,    [MASK], love,   [MASK], learn,  [MASK], EOS,    PAD]
Mask boolean vector (which positions contribute to the loss)
mask_bool = [False, True, False, True, False, True, False, False]
#            BOS    I     love   deep  learn ##ing EOS    PAD
#                   ↑            ↑            ↑
#            masked tokens → loss computed only here
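Building the corrupted input from the mask is then a single masked_fill (same illustrative names as before):

x_t = x0.masked_fill(mask_bool, MASK_ID)
# → [BOS, MASK, love, MASK, learn, MASK, EOS, PAD] for the outcome above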
The model sees xₜ (with [MASK] tokens) and must predict the original token at every masked position. This is exactly BERT's MLM objective, except that the masked fraction is controlled by the diffusion timestep t rather than being fixed (BERT uses 15%).
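A sketch of the resulting loss, assuming a model that returns per-position logits of shape (seq_len, vocab_size); `model` here is a stand-in, and many masked diffusion objectives additionally weight this term by a function of t (e.g. 1/t), which is omitted:

import torch.nn.functional as F

logits = model(x_t)      # (seq_len, vocab_size); `model` is assumed, not defined here
loss = F.cross_entropy(
    logits[mask_bool],   # predictions at masked positions only
    x0[mask_bool],       # the original tokens to recover
)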