A while back, Google DeepMind unveiled Gemini Diffusion, an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time, Gemini Diffusion creates whole blocks of text by refining random noise step-by-step.

I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018. The first thought I had was, “Can we finetune a BERT-like model to do text generation?” I decided to try a quick proof of concept out of curiosity.

The original Transformer architecture, introduced in 2017, was an encoder-decoder model. In 2018, researchers realized that the encoder and decoder components of the model could be separated (with the advent of BERT and GPT), and two distinct families of models were created:

Encoder models used masked language modeling (MLM) as a training objective: randomly mask out a subset of tokens of each input and train the encoder to reconstruct the missing tokens (fill in the blanks). The model sees the entire (partially masked) context at once and learns bidirectional representations. This architecture excelled at tasks requiring a full‐sentence (or paragraph) representation (e.g., classification and retrieval).
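As a concrete illustration of the fill-in-the-blank objective, here is a minimal example using the Hugging Face fill-mask pipeline with roberta-base (the model we finetune later); the example sentence is just my illustration, not something from the original setup:

from transformers import pipeline

# RoBERTa's mask token is "<mask>" (BERT uses "[MASK]").
unmasker = pipeline("fill-mask", model="roberta-base")

# The encoder sees the whole partially masked sentence at once and returns
# the most likely fillers for the masked position.
for prediction in unmasker("The capital of France is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))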

Decoder models used next‐token prediction as a training objective: at each position $t$, predict the token at position $t + 1$ given all tokens up to $t$ as context. Only the left context is used to predict future values (unidirectional). This architecture excelled at generative tasks where you produce text one token at a time, such as open‐ended generation, summarization, and translation.
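In symbols (standard notation, not taken from the post), the decoder objective maximizes the left-to-right factorization of the sequence likelihood:

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right), \qquad \mathcal{L}_{\text{AR}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)$$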

Originally, BERT saw immediate use in tasks such as classification, whereas GPT-style models didn’t become popular until later (due to their initially limited capabilities). Eventually, the generation capabilities of autoregressive (decoder) transformers vastly improved. The general training objective of “next token prediction” supports a much larger space of use cases than the encoder-only objective.

Diffusion models were first popularized in image generation, where they gradually add Gaussian noise to an image (forward process) and then train a neural network to iteratively denoise it (reverse process). A high-level summary of continuous diffusion with images: start from a clean image, corrupt it with small amounts of Gaussian noise over many timesteps until it is essentially pure noise, train a network to undo one noising step at a time, and generate new images by starting from random noise and repeatedly applying the learned denoiser.
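In standard DDPM notation (added here for reference; this is textbook material, not something from the post), the forward process adds Gaussian noise according to a variance schedule $\beta_t$, which also gives a closed form for jumping straight from the clean image $x_0$ to any noise level $t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$$

The reverse (generative) process trains a network to approximate $p_\theta(x_{t-1} \mid x_t)$ and runs it from $t = T$ down to $t = 1$.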

Applying this idea to language means we need a way to add noise to text and then remove it in stages. The simplest way to do this is a masking-based noise process: in the forward process, replace a random, growing fraction of the tokens with a special mask token (at the final step the sequence is fully masked); in the reverse process, the model predicts the original tokens at the masked positions, and a portion of the sequence is unmasked at each step.
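A minimal sketch of the forward (noising) step in code; the function and variable names here are my own illustration, not code from the post:

import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float) -> torch.Tensor:
    # Forward process: independently replace each token with the mask token
    # with probability mask_prob (mask_prob = 1.0 means the sequence is pure "noise").
    noisy = input_ids.clone()
    mask = torch.rand(noisy.shape) < mask_prob
    noisy[mask] = mask_token_id
    return noisy

Running this at increasing values of mask_prob produces the sequence of progressively noisier versions of the text that the reverse process learns to undo.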

In this discrete text diffusion framework, the model learns a likelihood bound on the data distribution by optimizing a sum of denoising losses over all timesteps, rather than a single MLM objective at a fixed mask probability.
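Concretely, in the LLaDA-style formulation the masking rate $t$ is sampled uniformly from $(0, 1]$, each token of the clean sequence $x_0$ is masked independently with probability $t$ to give $x_t$, and the loss is (roughly) a reweighted MLM loss computed only on the masked positions, which upper-bounds the negative log-likelihood:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^{\,i} = \text{[MASK]}\right]\log p_\theta\!\left(x_0^{\,i}\mid x_t\right)\right]$$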

As we can see, BERT’s masked language modeling objective is the same training objective as text diffusion, just restricted to a single fixed masking rate (BERT masks roughly 15% of tokens). By introducing variable masking rates (from 0 to 1) and a scheduled sequence of denoising steps (inspired by diffusion theory), we can transform BERT’s masked language modeling objective into a full generative procedure.
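Stated in the notation above (my own restatement, not from the post): dropping the average over $t$ and fixing the masking rate recovers ordinary MLM, up to the constant $1/t$ weighting:

$$\mathcal{L}_{\text{MLM}}(\theta) = -\,\mathbb{E}_{x_0,\,x_t}\left[\sum_{i=1}^{L}\mathbf{1}\!\left[x_t^{\,i} = \text{[MASK]}\right]\log p_\theta\!\left(x_0^{\,i}\mid x_t\right)\right], \qquad t \approx 0.15 \text{ fixed}$$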

In 2019, RoBERTa was released. It was essentially an enhancement of the original BERT model, with better hyperparameters, more training data, and a simpler training objective (MLM only, with next sentence prediction removed).

Here we use the Hugging Face transformers and datasets libraries to pull in the original RoBERTa weights and tokenizer, and the Trainer class to easily finetune the model on the WikiText dataset. The main training code (full code here) is shown at the end of this post.

Currently we have 10 diffusion steps, so for each batch we randomly sample a masking probability $p$ from mask_probs (1.0, 0.9, 0.8, …, 0.1) and mask that fraction of the tokens. The custom diffusion_collator function (see code here) samples one mask probability $p$ per batch and replaces each token with the <mask> token with probability $p$.

To be able to condition the generation on a “prompt”, we currently never mask the first 16 tokens. That means that during training, the model always sees the first 16 tokens unmasked, as context for reconstructing the rest.

For inference, we start with an input tensor of size 256 (since we are generating blocks of 256 tokens). The first 16 positions hold the token ids of the prompt, and the remaining 240 positions are all <mask> tokens. We then iterate through the denoising schedule: at each step we run the model to predict every masked token, keep the predictions, and re-mask a progressively smaller fraction of the generated region before the next step. The process looks like this:
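(What follows is a minimal sketch of that loop, reconstructed from the description above; the function name diffusion_generate, the greedy argmax decoding, and the exact re-masking schedule are my own choices, with MAX_LEN = 256 and PREFIX_LEN = 16 as described in the post.)

import torch

@torch.no_grad()
def diffusion_generate(model, tokenizer, prompt_ids,
                       mask_probs=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0),
                       max_len=256, prefix_len=16):
    model.eval()

    # Start with the prompt followed by an all-<mask> "noise" block.
    input_ids = torch.full((1, max_len), tokenizer.mask_token_id)
    input_ids[0, :prefix_len] = prompt_ids[:prefix_len]

    for mask_prob in mask_probs:
        # Predict every position from the current (partially masked) sequence.
        logits = model(input_ids=input_ids).logits
        predictions = logits.argmax(dim=-1)

        # Fill in the predictions everywhere except the fixed prompt prefix.
        input_ids[0, prefix_len:] = predictions[0, prefix_len:]

        # Re-mask a shrinking fraction of the generated region for the next step.
        if mask_prob > 0:
            remask = torch.rand(max_len - prefix_len) < mask_prob
            input_ids[0, prefix_len:][remask] = tokenizer.mask_token_id

    return tokenizer.decode(input_ids[0, prefix_len:], skip_special_tokens=True)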

Here is an example output generation of the fine-tuned model after training on an H200 for 30 minutes (the first line is the initial prompt):

The output looks surprisingly coherent! Most of the quirks present are actually just artifacts of WikiText’s formatting (extra spaces around punctuation, and hyphens being rendered as @-@).
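If you want cleaner samples, a small post-processing step can undo most of these artifacts (a sketch; which markers to strip, beyond the @-@ mentioned above, is my assumption about WikiText’s raw formatting):

import re

def clean_wikitext(text: str) -> str:
    # Undo WikiText's "@-@"-style markers (similar markers are used for decimals and thousands separators).
    text = re.sub(r"\s*@([-.,])@\s*", r"\1", text)
    # Remove the extra space inserted before closing punctuation.
    text = re.sub(r"\s+([.,;:!?)])", r"\1", text)
    # Remove the extra space inserted after an opening parenthesis.
    text = re.sub(r"\(\s+", "(", text)
    return text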

Comparing against GPT-2, we see that GPT-2’s output is more coherent and slightly faster to generate (~9 seconds vs. ~13), but I’m pleasantly surprised with how good my simple implementation was. It is a good proof of concept, and with newer approaches like AR-Diffusion and Skip-Step Diffusion (and a more optimized implementation), the quality and speed could be drastically improved.

We’ve seen that masked language models like RoBERTa, originally designed for fill-in-the-blank tasks, can be repurposed into fully generative engines by interpreting variable-rate masking as a discrete diffusion process. By gradually corrupting text with mask tokens and training the model to iteratively denoise at increasing mask intensities, we effectively turn the standard MLM objective into a step-by-step generation procedure.

Even without architectural changes, a fine-tuned RoBERTa can produce surprisingly coherent passages, validating the core idea that text diffusion is essentially just a generalization of classical masked language modeling.

The main fine-tuning code, in simplified form, looks like this:

import random
import torch
from datasets import load_dataset
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

MAX_LEN = 256     # we generate blocks of 256 tokens
PREFIX_LEN = 16   # the first 16 tokens are never masked (prompt context)

# Load and tokenize the dataset and instantiate the model
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

tokenized = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, max_length=MAX_LEN, padding="max_length"
    ),
    batched=True,
    remove_columns=["text"],
)

# Create the training args and Trainer instance
# (NUM_EPOCHS and BATCH_SIZE are set in the full script)
training_args = TrainingArguments(
    output_dir="finetuned-roberta-diffusion",
    overwrite_output_dir=True,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    save_strategy="epoch",
    save_total_limit=1,
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=diffusion_collator,  # custom implementation (see below)
    tokenizer=tokenizer,
)

# Train & save
trainer.train()
trainer.save_model("finetuned-roberta-diffusion")

Simplified code for the diffusion_collator looks like:

def diffusion_collator(examples):
    batch = tokenizer.pad(examples, return_tensors="pt")
    # Labels are the original token ids; un-masked positions are set to -100
    # below so the loss is only computed on masked positions.
    labels = batch.input_ids.clone()
    # Randomly select masking probability for this batch
    mask_prob = random.choice([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
    # Never mask the first PREFIX_LEN tokens (preserved context)
    maskable_positions = batch.input_ids[:, PREFIX_LEN:]
    # Create random mask for the chosen probability and apply it
    mask = torch.rand(maskable_positions.shape) < mask_prob
    maskable_positions[mask] = tokenizer.mask_token_id
    labels[:, :PREFIX_LEN] = -100
    labels[:, PREFIX_LEN:][~mask] = -100
    batch["labels"] = labels
    return batch

NOTE: After I wrote the article I stumbled upon the paper DiffusionBERT, which does essentially the same thing but with more rigorous testing! Check it out if this post interested you.
