We propose Latent Thought Models (LTMs), a novel class of language models that incorporate explicit latent thought vectors to guide autoregressive generation. Through dual-rate optimization and inference-time computation, LTMs achieve superior efficiency and emergent reasoning capabilities at significantly smaller scales than traditional models.
Think about how you write. Before putting pen to paper (or fingers to keyboard), your mind forms an abstract understanding of what you want to express. You might think about the main themes, the emotional tone, or the logical structure. Only then do you translate these abstract thoughts into concrete words.
Current language models like GPT work differently. They generate text token by token, word by word, without any explicit higher-level plan or abstract representation of what they are about to say. It’s like speaking without thinking—impressive, but ultimately limited.
We propose Latent Thought Models (LTMs)—a new class of language models that explicitly learn to form abstract “thoughts” before generating text. These thoughts are represented as latent vectors that capture the essence of what the model wants to express, before translating it into actual words.
Imagine an AI that works more like a human writer: one that first forms abstract ideas about what it wants to say, and only then translates those ideas into words.
Our LTMs do exactly this through a two-stage process:
\[
\text{Abstract Thoughts } (\mathbf{z}) \rightarrow \text{Concrete Words } (\mathbf{x})
\]

The latent thought vectors $\mathbf{z}$ serve as a compressed, abstract representation that guides the generation of each token in the sequence.
Instead of having just one set of thoughts, our models use layered thought vectors—different abstract representations for different layers of the neural network. This creates a hierarchy of abstraction:
We assume $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_L)$, where $\mathbf{z}_l$ consists of thought vectors cross-attending to layer $l$ of the Transformer decoder.
The key component is a thought-conditioned autoregressive generator $p_{\beta}(\mathbf{x} \mid \mathbf{z})$:

\[
p_{\beta}(\mathbf{x} \mid \mathbf{z}) = \prod_{t} p_{\beta}(x_t \mid \mathbf{x}_{<t}, \mathbf{z})
\]
Unlike standard autoregressive models that only condition on previous tokens, our model incorporates the thought vectors $\mathbf{z}$ at each generation step through cross-attention.
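To make this conditioning concrete, here is a minimal PyTorch sketch (not the authors' implementation) of a decoder block in which token states causally self-attend and additionally cross-attend to the layer-specific thought vectors $\mathbf{z}_l$. The class name, argument names, and sizes are illustrative assumptions.

```python
# Minimal sketch: a decoder block conditioned on layer-specific thought vectors z_l.
import torch
import torch.nn as nn

class ThoughtConditionedBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, h: torch.Tensor, z_l: torch.Tensor) -> torch.Tensor:
        # h:   (batch, seq_len, d_model)   token hidden states
        # z_l: (batch, n_thought, d_model) thought vectors for this layer
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        # Causal self-attention over previously generated tokens.
        hn = self.ln1(h)
        a, _ = self.self_attn(hn, hn, hn, attn_mask=causal)
        h = h + a
        # Cross-attention: every position also attends to the thought vectors z_l.
        c, _ = self.cross_attn(self.ln2(h), z_l, z_l)
        h = h + c
        return h + self.mlp(self.ln3(h))
```

Because $\mathbf{z}_l$ differs per layer, lower blocks can attend to different abstractions than higher ones, which is what produces the hierarchy of abstraction described above.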
Our training process mirrors human learning with a dual-rate optimization: a fast inner loop infers the sequence-specific latent thought vectors (the local posterior parameters) for each text sample, while a slow outer loop updates the global decoder parameters $\beta$ shared across all samples.
This reflects the declarative-procedural framework from cognitive science—our latent thoughts act like declarative memory while the text generator represents procedural knowledge.
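A minimal sketch of this dual-rate loop is below, with assumed step counts and learning rates; the hypothetical `elbo_fn` stands in for the evidence lower bound defined in the formulation section further down, and `decoder` is the thought-conditioned generator with global parameters $\beta$.

```python
# Sketch of dual-rate optimization: T_fast gradient steps on the per-sequence
# posterior parameters (mu, log_var), then one slow step on the shared decoder.
import torch

def dual_rate_step(decoder, slow_opt, x, elbo_fn, z_shape,
                   t_fast: int = 16, lr_fast: float = 0.3):
    # Fast (local) variables: fresh posterior parameters for this sequence.
    mu = torch.zeros(z_shape, requires_grad=True)
    log_var = torch.zeros(z_shape, requires_grad=True)
    fast_opt = torch.optim.Adam([mu, log_var], lr=lr_fast)

    # Fast loop: infer this sequence's latent thoughts by maximizing the ELBO
    # with respect to (mu, log_var) only.
    for _ in range(t_fast):
        fast_opt.zero_grad()
        (-elbo_fn(decoder, x, mu, log_var)).backward()
        fast_opt.step()

    # Slow step: one update of the global decoder parameters beta,
    # using the freshly inferred posterior.
    slow_opt.zero_grad()
    loss = -elbo_fn(decoder, x, mu, log_var)
    loss.backward()
    slow_opt.step()
    return loss.item()
```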
While traditional language models scale along two main axes (model size and training data), LTMs introduce a third crucial dimension: inference steps. More thinking time leads to better performance—you can trade off model size for more deliberate reasoning.
Something remarkable happens with LTMs: they develop few-shot learning abilities (like GPT-3’s in-context learning) but with dramatically fewer parameters. Our smallest model achieves this with just 38M parameters—a fraction of what’s typically needed.
The results, presented in the experiments below, speak for themselves.
We formulate LTMs within the classical variational Bayes framework. The model assumes latent thought vectors $\mathbf{z}$ follow a prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ and generate text $\mathbf{x}$ via a Transformer decoder.
We introduce a sequence-specific variational posterior $q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and maximize the evidence lower bound (ELBO):

\[
\mathrm{ELBO}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2, \beta) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\beta}(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\!\left(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)
\]
Crucially, $(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ are local parameters specific to each sequence, while $\beta$ represents global parameters shared across all samples.
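To ground the notation, here is a minimal PyTorch sketch of a single-sample ELBO estimate using the reparameterization trick. The interface assumed for `decoder` (mapping tokens and latents to next-token logits) is an assumption of this sketch, not an API from the paper.

```python
# Single-sample ELBO: E_q[log p_beta(x|z)] - KL(q(z|x) || N(0, I)).
import torch
import torch.nn.functional as F

def elbo(decoder, x, mu, log_var):
    # x:           (batch, seq_len) token ids
    # mu, log_var: parameters of the diagonal-Gaussian posterior q(z|x)
    # decoder:     assumed to map (x, z) -> next-token logits (batch, seq_len, vocab)

    # Reparameterized sample z = mu + sigma * eps, eps ~ N(0, I).
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)

    # Reconstruction term: log p_beta(x | z), summed over next-token predictions.
    logits = decoder(x, z)
    log_px = -F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                              x[:, 1:].reshape(-1), reduction="sum")

    # Analytic KL between N(mu, sigma^2) and the standard-normal prior.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var)
    return log_px - kl
```

This `elbo` has the same signature as the `elbo_fn` placeholder used in the dual-rate sketch above: the fast loop optimizes it over $(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$, the slow loop over $\beta$.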
LTMs introduce a distinct computational cost: inference-time computation stemming from the fast learning of latent thought vectors. This occurs in both training and testing.
For a model with $L$ layers, $N_{\mathbf{z}}$ latent vectors per layer, and $T_{\text{fast}}$ inference steps, the computational complexity scales as:

\[
\mathcal{O}\!\left(T_{\text{fast}} \cdot L \cdot (N^2 H + N N_{\mathbf{z}} H + N H^2)\right)
\]

where $N$ is the sequence length and $H$ is the hidden dimension. When $T_{\text{fast}} \gg 1$, the inference computation dominates, making thinking time the primary computational factor.
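The scaling is easy to see with a back-of-the-envelope sketch of the expression above, reading the three terms as self-attention, cross-attention to the thought vectors, and dense projections (a common interpretation, stated here as an assumption); constant factors are ignored and all sizes are illustrative.

```python
# Rough cost model following O(T_fast * L * (N^2 H + N N_z H + N H^2)).
def inference_flops(t_fast: int, n_layers: int, seq_len: int,
                    n_thought: int, hidden: int) -> int:
    per_layer = (seq_len ** 2 * hidden           # causal self-attention
                 + seq_len * n_thought * hidden  # cross-attention to thought vectors
                 + seq_len * hidden ** 2)        # feed-forward / projections
    return t_fast * n_layers * per_layer

# Doubling the number of fast steps doubles inference cost, independent of model size.
base = inference_flops(t_fast=16, n_layers=12, seq_len=1024, n_thought=48, hidden=768)
more_thinking = inference_flops(t_fast=32, n_layers=12, seq_len=1024, n_thought=48, hidden=768)
assert more_thinking == 2 * base
```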
We conducted extensive experiments at GPT-2 scale using the OpenWebText dataset. Our results demonstrate:
Zero-shot Language Modeling Perplexity (lower is better; “≤” denotes a variational upper bound from the ELBO):
| Model | Parameters | Training FLOPs/token | PTB | WikiText | LM1B |
|---|---|---|---|---|---|
| GPT-2-Large | 762M | 5.32G | 161.33 | 30.09 | 45.61 |
| LTM-Medium | 51M | 5.52G | ≤32.06 | ≤17.39 | ≤25.16 |
| LTM-Large | 76M | 32.2G | ≤4.43 | ≤3.66 | ≤3.92 |
Text Generation Quality:
| Model | Sampling | MAUVE ↑ |
|---|---|---|
| GPT-2-Medium | Nucleus-0.95 | 0.955 |
| GPT-2-Medium | Multinomial | 0.802 |
| LTM-Large | Multinomial | 0.974 |
| LTM-Large | Greedy | 0.972 |
We investigated how semantic information is distributed across layers through progressive reconstruction experiments. The results reveal that LTMs process information hierarchically:
This demonstrates distinctive “synthesis layers” that integrate information from earlier representations.
Our approach connects to a deep idea in cognitive science: that thinking happens in an internal “language of thought” that’s distinct from the language we speak. The latent thought vectors can be seen as “words” in this internal cognitive language.
Perhaps most importantly, LTMs demonstrate that thinking time can be as valuable as model size or training data, opening up new possibilities for trading inference-time computation against parameters and data.
This work opens several exciting directions, and we acknowledge important areas for future work:

- Reward Models in Latent Space: incorporating verifier models $p_\gamma(r \mid \mathbf{z})$ to guide optimization in latent space for reasoning tasks (a speculative sketch follows this list)
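As a purely speculative sketch of this direction (nothing here is implemented in the paper), a verifier could be folded into the fast inference loop by optimizing the latent thoughts against both the generator likelihood and the reward model. Both `log_likelihood` and `reward_model` are hypothetical callables, and all hyperparameters are assumptions.

```python
# Speculative sketch: reward-guided fast inference in latent space.
# log_likelihood(x, z) ~ log p_beta(x | z); reward_model(z) ~ log p_gamma(r | z).
import torch

def reward_guided_inference(log_likelihood, reward_model, x, z_init,
                            steps: int = 16, lr: float = 0.1, reward_weight: float = 1.0):
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Keep z likely under the generator while scoring high under the verifier.
        objective = log_likelihood(x, z) + reward_weight * reward_model(z)
        (-objective).backward()
        opt.step()
    return z.detach()
```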
Latent Thought Models represent a fundamental shift in how we think about language generation. Instead of immediate word-by-word generation, they introduce a more human-like process of abstract thinking followed by linguistic expression.
The key insight is simple but profound: giving AI systems explicit space to think makes them more efficient, more capable, and more aligned with how humans actually process language.
As we continue to push the boundaries of AI capabilities, approaches like LTMs suggest that the future lies not just in bigger models or more data, but in architectures that more closely mirror the sophisticated cognitive processes that make human intelligence so remarkable.