A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time, Gemini Diffusion creates whole blocks of text by refining random noise step-by-step.

I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018. The first thought I had was, “Can we finetune a BERT-like model to do text generation?” I decided to try a quick proof of concept out of curiosity.

The original Transformer architecture, introduced in 2017, was an encoder-decoder model. In 2018, researchers realized that the encoder and decoder components of the model could be separated (with the advent of BERT and GPT), and two distinct families of models were created:

Encoder models used masked language modeling (MLM) as a training objective: randomly mask out a subset of tokens of each input and train the encoder to reconstruct the missing tokens (fill in the blanks). The model sees the entire (partially masked) context at once and learns bidirectional representations. This architecture excelled at tasks requiring a full‐sentence (or paragraph) representation (e.g., classification and retrieval).
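As a concrete illustration of the fill-in-the-blank objective, here is a minimal example using the Hugging Face fill-mask pipeline with roberta-base (the model we finetune later); the example sentence is just my illustration, not something from the original setup:

from transformers import pipeline

# RoBERTa's mask token is "<mask>" (BERT uses "[MASK]").
unmasker = pipeline("fill-mask", model="roberta-base")

# The encoder sees the whole partially masked sentence at once and returns
# the most likely fillers for the masked position.
for prediction in unmasker("The capital of France is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))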

Decoder models used next‐token prediction as a training objective: at each position $t$, predict the token at position $t + 1$ given all tokens up to $t$ as context. Only the left context is used to predict future values (unidirectional). This architecture excelled at generative tasks where you produce text one token at a time, such as open‐ended generation, summarization, and translation.
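In symbols (standard notation, not taken from the post), the decoder objective maximizes the left-to-right factorization of the sequence likelihood:

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right), \qquad \mathcal{L}_{\text{AR}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)$$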

Originally, BERT saw immediate use in tasks such as classification, whereas GPT-style models didn’t become popular until later (due to their initially limited capabilities). Eventually, the generation capabilities of autoregressive (decoder) transformers vastly improved. The general training objective of “next token prediction” supports a much larger space of use cases than the encoder-only objective.

Diffusion models were first popularized in image generation, where they gradually add Gaussian noise to an image (forward process) and then train a neural network to iteratively denoise it (reverse process). A high-level summary of continuous diffusion with images: start from a clean image, corrupt it with small amounts of Gaussian noise over many timesteps until it is essentially pure noise, train a network to undo one noising step at a time, and generate new images by starting from random noise and repeatedly applying the learned denoiser.
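In standard DDPM notation (added here for reference; this is textbook material, not something from the post), the forward process adds Gaussian noise according to a variance schedule $\beta_t$, which also gives a closed form for jumping straight from the clean image $x_0$ to any noise level $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$$

The reverse (generative) process trains a network to approximate $p_\theta(x_{t-1} \mid x_t)$ and runs it from $t = T$ down to $t = 1$.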

Applying this idea to language means we need a way to add noise to text and then remove it in stages. The simplest way to do this is a masking-based noise process: in the forward process, replace a random, growing fraction of the tokens with a special mask token (at the final step the sequence is fully masked); in the reverse process, the model predicts the original tokens at the masked positions, and a portion of the sequence is unmasked at each step.
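A minimal sketch of the forward (noising) step in code; the function and variable names here are my own illustration, not code from the post:

import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float) -> torch.Tensor:
    # Forward process: independently replace each token with the mask token
    # with probability mask_prob (mask_prob = 1.0 means the sequence is pure "noise").
    noisy = input_ids.clone()
    mask = torch.rand(noisy.shape) < mask_prob
    noisy[mask] = mask_token_id
    return noisy

Running this at increasing values of mask_prob produces the sequence of progressively noisier versions of the text that the reverse process learns to undo.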

In this discrete text diffusion framework, the model learns a likelihood bound on the data distribution by optimizing a sum of denoising losses over all timesteps, rather than a single MLM objective at a fixed mask probability.
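Concretely, in the LLaDA-style formulation the masking rate $t$ is sampled uniformly from $(0, 1]$, each token of the clean sequence $x_0$ is masked independently with probability $t$ to give $x_t$, and the loss is (roughly) a reweighted MLM loss computed only on the masked positions, which upper-bounds the negative log-likelihood:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^{\,i} = \text{[MASK]}\right]\log p_\theta\!\left(x_0^{\,i}\mid x_t\right)\right]$$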

As we can see, BERT’s masked language modeling objective is the same training objective as text diffusion, just restricted to a single fixed masking rate (BERT masks roughly 15% of tokens). By introducing variable masking rates (from 0 to 1) and a scheduled sequence of denoising steps (inspired by diffusion theory), we can transform BERT’s masked language modeling objective into a full generative procedure.
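Stated in the notation above (my own restatement, not from the post): dropping the average over $t$ and fixing the masking rate recovers ordinary MLM, up to the constant $1/t$ weighting:

$$\mathcal{L}_{\text{MLM}}(\theta) = -\,\mathbb{E}_{x_0,\,x_t}\left[\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^{\,i} = \text{[MASK]}\right]\log p_\theta\!\left(x_0^{\,i}\mid x_t\right)\right], \qquad t \approx 0.15 \text{ fixed}$$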

In 2019, RoBERTa was released. It was essentially an enhancement of the original BERT model, with better hyperparameters, more training data, and a simpler training objective (MLM only, with next sentence prediction removed).

Here we use the Hugging Face transformers and datasets libraries to pull in the original RoBERTa weights and tokenizer, and the Trainer class to easily finetune the model on the WikiText dataset. The main training code (full code here) is shown at the end of this post.

Currently we have 10 diffusion steps, so for each batch we randomly sample a masking probability $p$ from mask_probs (1.0, 0.9, 0.8, …, 0.1) and mask that fraction of the tokens. The custom diffusion_collator function (see code here) samples one mask probability $p$ per batch and replaces each token with the <mask> token with probability $p$.

To be able to condition the generation on a “prompt”, we currently never mask the first 16 tokens. That means that during training, the model always sees the first 16 tokens unmasked, as context for reconstructing the rest.

For inference, we start with an input tensor of size 256 (since we are generating blocks of 256 tokens). The first 16 positions hold the token ids of the prompt, and the remaining 240 positions are all <mask> tokens. We then iterate through the denoising schedule: at each step we run the model to predict every masked token, keep the predictions, and re-mask a progressively smaller fraction of the generated region before the next step. The process looks like this:
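(What follows is a minimal sketch of that loop, reconstructed from the description above; the function name diffusion_generate, the greedy argmax decoding, and the exact re-masking schedule are my own choices, with MAX_LEN = 256 and PREFIX_LEN = 16 as described in the post.)

import torch

@torch.no_grad()
def diffusion_generate(model, tokenizer, prompt_ids,
                       mask_probs=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0),
                       max_len=256, prefix_len=16):
    model.eval()

    # Start with the prompt followed by an all-<mask> "noise" block.
    input_ids = torch.full((1, max_len), tokenizer.mask_token_id)
    input_ids[0, :prefix_len] = prompt_ids[:prefix_len]

    for mask_prob in mask_probs:
        # Predict every position from the current (partially masked) sequence.
        logits = model(input_ids=input_ids).logits
        predictions = logits.argmax(dim=-1)

        # Fill in the predictions everywhere except the fixed prompt prefix.
        input_ids[0, prefix_len:] = predictions[0, prefix_len:]

        # Re-mask a shrinking fraction of the generated region for the next step.
        if mask_prob > 0:
            remask = torch.rand(max_len - prefix_len) < mask_prob
            input_ids[0, prefix_len:][remask] = tokenizer.mask_token_id

    return tokenizer.decode(input_ids[0, prefix_len:], skip_special_tokens=True)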

Here is an example output generation of the fine-tuned model after training on an H200 for 30 minutes (the first line is the initial prompt):

The output looks surprisingly coherent! Most of the quirks present are actually just artifacts of WikiText’s formatting (extra spaces around punctuation, and hyphens being rendered as @-@).
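If you want cleaner samples, a small post-processing step can undo most of these artifacts (a sketch; which markers to strip, beyond the @-@ mentioned above, is my assumption about WikiText’s raw formatting):

import re

def clean_wikitext(text: str) -> str:
    # Undo WikiText's "@-@"-style markers (similar markers are used for decimals and thousands separators).
    text = re.sub(r"\s*@([-.,])@\s*", r"\1", text)
    # Remove the extra space inserted before closing punctuation.
    text = re.sub(r"\s+([.,;:!?)])", r"\1", text)
    # Remove the extra space inserted after an opening parenthesis.
    text = re.sub(r"\(\s+", "(", text)
    return text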

Comparing against GPT-2, we see that GPT-2’s output is more coherent and slightly faster to generate (~9 seconds vs. ~13), but I’m pleasantly surprised with how good my simple implementation was. It is a good proof of concept, and with newer approaches like AR-Diffusion and Skip-Step Diffusion (and a more optimized implementation), the quality and speed could be drastically improved.

We’ve seen that masked language models like RoBERTa, originally designed for fill-in-the-blank tasks, can be repurposed into fully generative engines by interpreting variable-rate masking as a discrete diffusion process. By gradually corrupting text with mask tokens and training the model to iteratively denoise at increasing mask intensities, we effectively turn the standard MLM objective into a step-by-step generation procedure.

Even without architectural changes, a fine-tuned RoBERTa can produce surprisingly coherent passages, validating the core idea that text diffusion is essentially just a generalization of classical masked language modeling.

The main fine-tuning code, in simplified form, looks like this:

import random
import torch
from datasets import load_dataset
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

MAX_LEN = 256     # we generate blocks of 256 tokens
PREFIX_LEN = 16   # the first 16 tokens are never masked (prompt context)

# Load and tokenize the dataset and instantiate the model
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

tokenized = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, max_length=MAX_LEN, padding="max_length"
    ),
    batched=True,
    remove_columns=["text"],
)

# Create the training args and Trainer instance
# (NUM_EPOCHS and BATCH_SIZE are set in the full script)
training_args = TrainingArguments(
    output_dir="finetuned-roberta-diffusion",
    overwrite_output_dir=True,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    save_strategy="epoch",
    save_total_limit=1,
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=diffusion_collator,  # custom implementation (see below)
    tokenizer=tokenizer,
)

# Train & save
trainer.train()
trainer.save_model("finetuned-roberta-diffusion")

Simplified code for the diffusion_collator looks like:

def diffusion_collator(examples):
    batch = tokenizer.pad(examples, return_tensors="pt")
    # Labels are the original token ids; un-masked positions are set to -100
    # below so the loss is only computed on masked positions.
    labels = batch.input_ids.clone()
    # Randomly select masking probability for this batch
    mask_prob = random.choice([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
    # Never mask the first PREFIX_LEN tokens (preserved context)
    maskable_positions = batch.input_ids[:, PREFIX_LEN:]
    # Create random mask for the chosen probability and apply it
    mask = torch.rand(maskable_positions.shape) < mask_prob
    maskable_positions[mask] = tokenizer.mask_token_id
    labels[:, :PREFIX_LEN] = -100
    labels[:, PREFIX_LEN:][~mask] = -100
    batch["labels"] = labels
    return batch

NOTE: After I wrote the article I stumbled upon the paper DiffusionBERT, which does essentially the same thing but with more rigorous testing! Check it out if this post interested you.
