Latent Thought Models
with Variational Bayes Inference-Time Computation

We propose Latent Thought Models (LTMs), a novel class of language models that incorporate explicit latent thought vectors to guide autoregressive generation. Through dual-rate optimization and inference-time computation, LTMs achieve superior efficiency and emergent reasoning capabilities at significantly smaller scales than traditional models.

The Problem with Current AI

Think about how you write. Before putting pen to paper (or fingers to keyboard), your mind forms an abstract understanding of what you want to express. You might think about the main themes, the emotional tone, or the logical structure. Only then do you translate these abstract thoughts into concrete words.

Current language models like GPT work differently. They generate text token by token, word by word, without any explicit higher-level planning or abstract representation. It is like speaking without thinking: impressive, but ultimately limited.

Key Insight: Traditional language models lack explicit mechanisms for abstract reasoning and planning, limiting their ability to perform complex cognitive tasks efficiently.

Our Solution: Latent Thought Models

We propose Latent Thought Models (LTMs)—a new class of language models that explicitly learn to form abstract “thoughts” before generating text. These thoughts are represented as latent vectors that capture the essence of what the model wants to express, before translating it into actual words.

Core Architecture

Imagine an AI that works more like a human writer:

  1. Think first: Form abstract thoughts about the content, themes, and structure
  2. Write second: Use these thoughts to guide the actual text generation

Our LTMs do exactly this through a two-stage process:

\[\text{Abstract Thoughts } (\mathbf{z}) \rightarrow \text{Concrete Words } (\mathbf{x})\]

The latent thought vectors $\mathbf{z}$ serve as a compressed, abstract representation that guides the generation of each token in the sequence.

LTM Architecture: Latent thought vectors $\mathbf{z}$ are sampled from a prior distribution and guide autoregressive generation through cross-attention at each layer.

Layered Thought Vectors

Instead of having just one set of thoughts, our models use layered thought vectors—different abstract representations for different layers of the neural network. This creates a hierarchy of abstraction:

We assume $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_L)$, where $\mathbf{z}_l$ consists of thought vectors cross-attending to layer $l$ of the Transformer decoder.
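
To make the layered design concrete, here is a minimal PyTorch-style sketch of a single decoder block that cross-attends to its own set of thought vectors $\mathbf{z}_l$. The module layout, dimensions, and pre-norm arrangement are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ThoughtConditionedBlock(nn.Module):
    """One Transformer decoder layer l that cross-attends to its latent thoughts z_l.

    Illustrative sketch: hidden size, head count, and the pre-norm layout
    are assumptions, not the authors' exact code.
    """

    def __init__(self, hidden_dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.norm3 = nn.LayerNorm(hidden_dim)

    def forward(self, h, z_l, causal_mask):
        # Causal self-attention over previously generated tokens x^(<n).
        a, _ = self.self_attn(self.norm1(h), self.norm1(h), self.norm1(h),
                              attn_mask=causal_mask)
        h = h + a
        # Cross-attention: token states query the layer-specific thoughts z_l.
        c, _ = self.cross_attn(self.norm2(h), z_l, z_l)
        h = h + c
        return h + self.mlp(self.norm3(h))

# A full LTM decoder stacks L such blocks, giving layer l its own z_l of
# shape (batch, N_z, hidden_dim), i.e. z = (z_1, ..., z_L).
```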

Thought-Guided Generation

The key component is a thought-conditioned autoregressive generator $p_{\beta}(\mathbf{x}|\mathbf{z})$:
$$p_{\beta}(\mathbf{x}|\mathbf{z}) = \prod_{n=1}^N p_{\beta}(x^{(n)}|\mathbf{z}, \mathbf{x}^{(<n)})$$

Unlike standard autoregressive models that only condition on previous tokens, our model incorporates the thought vectors $\mathbf{z}$ at each generation step through cross-attention.
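
In code, this factorization amounts to summing per-token log-probabilities from a decoder that sees both $\mathbf{z}$ and the token prefix. The sketch below assumes a hypothetical `decoder(tokens, z)` callable that returns next-token logits; only the bookkeeping around it is shown.

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(decoder, tokens: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """log p_beta(x | z) = sum_n log p_beta(x^(n) | z, x^(<n)).

    `decoder` is an assumed callable returning logits of shape
    (batch, seq_len - 1, vocab) given the token prefix and the thoughts z.
    """
    logits = decoder(tokens[:, :-1], z)            # predict token n from x^(<n) and z
    log_probs = F.log_softmax(logits, dim=-1)      # (batch, seq_len - 1, vocab)
    targets = tokens[:, 1:]                        # the tokens being predicted
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum(dim=-1)                    # one log-likelihood per sequence
```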

Dual-Rate Learning Algorithm

Our training process mirrors human learning with a dual-rate optimization:

Algorithm: Fast-Slow Learning of LTMs

For each training batch:

  1. **Fast Learning (inference-time computation)**:
     - Initialize variational parameters $(\boldsymbol{\mu}_i, \boldsymbol{\sigma}^2_i)$ for each sequence
     - For $t = 1$ to $T_{\text{fast}}$ steps:
       - Sample $\mathbf{z} \sim q_{\boldsymbol{\mu}_i, \boldsymbol{\sigma}^2_i}(\mathbf{z}|\mathbf{x}_i)$
       - Compute the ELBO: $\mathcal{L}_i = \mathbb{E}_{q}[\log p_{\beta}(\mathbf{x}_i|\mathbf{z})] - \text{KL}(q(\mathbf{z}|\mathbf{x}_i) || p(\mathbf{z}))$
       - Update $(\boldsymbol{\mu}_i, \boldsymbol{\sigma}^2_i)$ with a high learning rate (0.3)
  2. **Slow Learning**:
     - Update the global decoder parameters $\beta$ with a low learning rate (0.0004)

This reflects the declarative-procedural framework from cognitive science—our latent thoughts act like declarative memory while the text generator represents procedural knowledge.
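
One way to read the dual-rate scheme is as two nested optimizers: a high-learning-rate loop over the per-sequence variational parameters and a single low-learning-rate step on the shared decoder. The sketch below uses the learning rates quoted above (0.3 and 0.0004), but the optimizer choice and the `elbo_fn` helper are assumptions, not the authors' exact code.

```python
import torch

def fast_slow_step(decoder, slow_opt, x, elbo_fn, n_latent, latent_dim,
                   t_fast: int = 16, lr_fast: float = 0.3):
    """One dual-rate update for a batch of sequences x.

    `elbo_fn(decoder, x, mu, log_var)` is an assumed helper that samples
    z ~ q via the reparameterization trick and returns the ELBO; `slow_opt`
    is a persistent optimizer over the decoder parameters beta.
    """
    # Local variational parameters (mu, sigma^2), one set per sequence,
    # re-initialized for every batch; log-variance parameterization.
    mu = torch.zeros(x.size(0), n_latent, latent_dim, requires_grad=True)
    log_var = torch.zeros(x.size(0), n_latent, latent_dim, requires_grad=True)
    fast_opt = torch.optim.Adam([mu, log_var], lr=lr_fast)

    # Fast learning (inference-time computation): T_fast gradient steps
    # on the local parameters only, with a high learning rate.
    for _ in range(t_fast):
        fast_opt.zero_grad()
        (-elbo_fn(decoder, x, mu, log_var)).backward()
        fast_opt.step()

    # Slow learning: one gradient step on the global decoder parameters
    # beta, with a low learning rate.
    slow_opt.zero_grad()
    (-elbo_fn(decoder, x, mu, log_var)).backward()
    slow_opt.step()
```

Here `slow_opt` would be created once outside the training loop, e.g. `torch.optim.Adam(decoder.parameters(), lr=4e-4)`, so that its state persists across batches while the local parameters are rebuilt every time.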

Demo

Demo 1

INPUT PROMPT:
"The future of artificial intelligence will fundamentally change how we..."

[Interactive demo: select any combination of layer groups to see how the corresponding latent thought vectors shape the generated continuation.]

Demo 2

INPUT PROMPT:
"The future of artificial intelligence will fundamentally change how we..."

[Interactive demo: adjust the active layer range (Layers 1-12) to see how different layer combinations affect text generation.]

Layer capabilities overview:

  - Layers 1-3: basic patterns & syntax
  - Layers 4-6: semantic relationships
  - Layers 7-9: abstract reasoning
  - Layers 10-12: synthesis & coherence

Key Insights and Breakthroughs

New Scaling Dimensions

While traditional language models scale along two main axes (model size and training data), LTMs introduce a third crucial dimension: inference steps. More thinking time leads to better performance—you can trade off model size for more deliberate reasoning.

Scaling behaviors over training tokens and compute. We plot the performance of LTM training runs ($N_{\mathbf{z}} = 24$) across inference steps ($T_{\text{fast}}$ from 16 to 64) and model sizes (38M to 76M). Models with more inference steps demonstrate improved sample efficiency and become compute-efficient beyond certain training compute thresholds.

Emergent Few-Shot Learning at Small Scale

Something remarkable happens with LTMs: they develop few-shot learning abilities (like GPT-3’s in-context learning) but with dramatically fewer parameters. Our smallest model achieves this with just 38M parameters—a fraction of what’s typically needed.

Arithmetic reasoning on GSM8K: LTMs with few-shot demonstrations outperform much larger GPT-2 models across various settings.

Superior Efficiency

The results speak for themselves: LTMs match or beat much larger GPT-2 baselines on language modeling and text generation while using a fraction of the parameters (see the experimental results below).

Technical Deep Dive

Model Formulation

We formulate LTMs within the classical variational Bayes framework. The model assumes latent thought vectors $\mathbf{z}$ follow a prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ and generate text $\mathbf{x}$ via a Transformer decoder.

We introduce a sequence-specific variational posterior $q(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and maximize the evidence lower bound (ELBO):
$$\mathcal{L}(\beta, \boldsymbol{\mu}, \boldsymbol{\sigma}^2) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p_{\beta}(\mathbf{x}|\mathbf{z})] - \text{KL}(q(\mathbf{z}|\mathbf{x})||p(\mathbf{z}))$$

Crucially, $(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ are local parameters specific to each sequence, while $\beta$ represents global parameters shared across all samples.
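
With a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form, so a single reparameterized sample suffices for the reconstruction term. A minimal sketch, assuming a `log_likelihood_fn(x, z)` helper that returns $\log p_{\beta}(\mathbf{x}|\mathbf{z})$ per sequence:

```python
import torch

def elbo(log_likelihood_fn, x, mu, log_var):
    """Evidence lower bound for one batch, with a one-sample reconstruction estimate.

    `log_likelihood_fn(x, z)` is an assumed callable returning log p_beta(x | z)
    per sequence; mu and log_var parameterize q(z | x) = N(mu, diag(sigma^2)).
    """
    # Reparameterization trick: z = mu + sigma * eps keeps gradients flowing to (mu, sigma).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    recon = log_likelihood_fn(x, z)                # 1-sample estimate of E_q[log p(x | z)]

    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var)
    kl = kl.flatten(start_dim=1).sum(dim=-1)

    return (recon - kl).mean()                     # average over the batch
```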

Inference-Time Computation

LTMs introduce a distinct computational cost: inference-time computation stemming from the fast learning of latent thought vectors. This occurs in both training and testing.

For a model with $L$ layers, $N_{\mathbf{z}}$ latent vectors per layer, $T_{\text{fast}}$ inference steps, sequence length $N$, and hidden dimension $H$, the computational complexity scales as:

\[\mathcal{O}(T_{\text{fast}} \cdot L \cdot (N^2H + NN_{\mathbf{z}}H + NH^2))\]

When $T_{\text{fast}} \gg 1$, the inference computation dominates, making thinking time the primary computational factor.
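
A quick back-of-the-envelope evaluation of this formula shows how directly $T_{\text{fast}}$ scales the cost; the sequence length, hidden size, and layer count below are illustrative values, not the paper's configurations.

```python
def inference_flops(t_fast, num_layers, seq_len, n_latent, hidden):
    """Order-of-magnitude cost from the complexity formula above.

    Terms: self-attention (N^2 H), cross-attention to the thoughts (N * N_z * H),
    and the token-wise projections / MLP (N H^2). Constant factors are dropped.
    """
    per_layer = seq_len**2 * hidden + seq_len * n_latent * hidden + seq_len * hidden**2
    return t_fast * num_layers * per_layer

# Illustrative numbers: quadrupling T_fast quadruples inference-time compute,
# independent of model size.
base = inference_flops(t_fast=16, num_layers=12, seq_len=1024, n_latent=24, hidden=512)
more = inference_flops(t_fast=64, num_layers=12, seq_len=1024, n_latent=24, hidden=512)
print(f"T_fast=16: {base:.2e}  T_fast=64: {more:.2e}  ratio: {more / base:.1f}x")
```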

Experimental Results

We conducted extensive experiments at GPT-2 scale using the OpenWebText dataset. Our results demonstrate:

Zero-shot Language Modeling Performance:

| Model | Parameters | Training FLOPs/token | PTB | WikiText | LM1B |
|---|---|---|---|---|---|
| GPT-2-Large | 762M | 5.32G | 161.33 | 30.09 | 45.61 |
| LTM-Medium | 51M | 5.52G | ≤32.06 | ≤17.39 | ≤25.16 |
| LTM-Large | 76M | 32.2G | 4.43 | 3.66 | 3.92 |

Text Generation Quality:

| Model | Sampling | MAUVE ↑ |
|---|---|---|
| GPT-2-Medium | Nucleus-0.95 | 0.955 |
| GPT-2-Medium | Multinomial | 0.802 |
| LTM-Large | Multinomial | 0.974 |
| LTM-Large | Greedy | 0.972 |

Probing the Latent Thoughts

We investigated how semantic information is distributed across layers through progressive reconstruction experiments. The results reveal that LTMs process information hierarchically, with distinctive "synthesis layers" that integrate information from earlier representations.

What This Means for AI

The Language of Thought Hypothesis

Our approach connects to a deep idea in cognitive science: that thinking happens in an internal “language of thought” that’s distinct from the language we speak. The latent thought vectors can be seen as “words” in this internal cognitive language.

Inference-Time Computation as a New Paradigm

Perhaps most importantly, LTMs demonstrate that thinking time can be as valuable as model size or training data: inference-time computation becomes a dimension you can scale in place of parameters, opening the door to smaller but more deliberate models.

Looking Forward

This work opens several exciting directions:

  1. Structured Prior Models: Moving beyond simple Gaussian priors to more sophisticated reasoning structures
  2. Reward-Guided Thinking: Using reward models to guide the thinking process toward better outcomes
  3. Hierarchical Abstraction: Developing even more sophisticated multi-level thought representations

Current Limitations

We acknowledge important areas for future work, most notably the additional inference-time computation that LTMs require during both training and testing.

Conclusion

Latent Thought Models represent a fundamental shift in how we think about language generation. Instead of immediate word-by-word generation, they introduce a more human-like process of abstract thinking followed by linguistic expression.

The key insight is simple but profound: giving AI systems explicit space to think makes them more efficient, more capable, and more aligned with how humans actually process language.

As we continue to push the boundaries of AI capabilities, approaches like LTMs suggest that the future lies not just in bigger models or more data, but in architectures that more closely mirror the sophisticated cognitive processes that make human intelligence so remarkable.
