Issue 01: key ML trends & patterns for 2024
At Google, I publish a periodic, internal newsletter on the state of applied AI, with a particular focus on automation, productivity, and knowledge work. I’ve decided to experiment with externalizing a version of the letter (less anything confidential) with the hope that some amount of what’s being written would be useful to an even broader audience. If nothing else, increased accountability to publish regularly is a plus for the author! Perspectives are mine and mine alone.
This late “welcome to 2024” issue marks my return from paternity leave.
A brief personal note:
Last November, on Thanksgiving Day, we welcomed a little one to earth. Watching life come online is a beautiful and miraculous unfolding. It leads me to think more deeply about the mind, and how it emerges from the brain and body. There are striking parallels between this natural wonder and the artificial wonder of modern AI. Across both domains, I am in awe, and eagerly anticipating signs of emergence.
During “time off”, I was able to steal a few hours to delve into a handful of deep learning trends and techniques that had caught my interest but were out of reach during regular times. I’m reminded how rewarding it is to have time to go deep, and I plan to write some thoughts on how the new crop of AI tools enables it through unprecedented, self-directed means.
Now onto the meat of it, 4 trends and patterns in ML/AI worth watching in 2024:
LLMs & Symbolic Systems
Models with memory
Interpreting LLMs
The “Mixture of Experts” architecture
LLMs & Symbolic Systems
For those using AI in their work, combining the world knowledge and creativity of language models with provably correct symbolic systems, such as compilers or calendars, is an obvious choice for practical AI applications. Symbolic systems excel at processing logical expressions and reasoning tasks, providing a structured, rule-based approach to problem-solving. Language models, on the other hand, shine at text generation, creativity, and producing diverse forms of content, thanks to their fluency with natural language. A prime example of this potent combination is AlphaGeometry, a system from DeepMind that integrates a symbolic reasoning framework with a neural language model to tackle complex geometry problems. The beauty in it, I think, is that the language model proposes candidate steps toward a solution, while the symbolic component rigorously checks those proposals to arrive at a correct one. This collaborative effort allows AlphaGeometry to solve challenges that would stump either component on its own.
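To make the propose-and-verify pattern concrete, here is a minimal sketch in Python. It is not how AlphaGeometry is implemented: the "language model" is a hypothetical stand-in that simply guesses integers at random, and the "symbolic system" is an exact arithmetic check on a toy equation. The names `propose_candidates`, `verify`, and `propose_and_verify` are my own illustrative assumptions.

```python
import random
from typing import List, Optional


def propose_candidates(rng: random.Random, low: int, high: int, n: int) -> List[int]:
    """Hypothetical stand-in for a language model: cheap, creative, unverified guesses."""
    return [rng.randint(low, high) for _ in range(n)]


def verify(candidate: int) -> bool:
    """Stand-in for a symbolic system: an exact check with no approximation.

    Toy rule (an assumption for illustration): x must satisfy x^2 - 5x + 6 = 0.
    """
    return candidate * candidate - 5 * candidate + 6 == 0


def propose_and_verify(rng: random.Random, rounds: int = 50, per_round: int = 8) -> Optional[int]:
    """Loop: the proposer suggests, the verifier accepts or rejects."""
    for _ in range(rounds):
        for candidate in propose_candidates(rng, -10, 10, per_round):
            if verify(candidate):
                return candidate  # only verified answers ever leave the loop
    return None  # the proposer never found anything the verifier accepted


result = propose_and_verify(random.Random(0))
print(result)
```

The design point is the division of labor: the proposer is allowed to be wrong most of the time, because nothing it suggests is trusted until the exact check passes.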
Why it matters:
From the introduction of the Wolfram Alpha plugin to the fully integrated code interpreter in ChatGPT, the evolutionary path has been clear: what will 'feel like' weak AGI is likely to be an assemblage of systems dedicated to specific tasks (not entirely unlike the brain).
The next tier of frontier models, complete with System 2-like reasoning, episodic memory, deeply integrated symbolic systems, and broad tool use, might just be the droids we’ve been looking for.
Models with memory
If you’ve been following the state space “S papers”, you’ll have surely heard about the recent Mamba architecture (its name is a meme from the snake-like sound of successive S’s). Selective State Space Models (SSMs) present a potentially groundbreaking development in network architecture, particularly for the next wave of models with stateful memory, thanks to an innovative approach to sequence modeling. Where current models struggle to process long sequences efficiently, Mamba introduces a selective state space method whose cost grows linearly with sequence length. This is a significant advancement over Transformer models, whose self-attention typically grows quadratically with sequence length. Mamba's ability to adjust its state updates dynamically based on the input sequence lets it choose what to keep and what to forget, allowing for more effective and efficient memory usage. By addressing the fundamental issue of context compression in sequence modeling, Mamba sets the stage for more sophisticated, memory-efficient AI models capable of handling complex, long-range tasks.
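For intuition on why this is linear-time, here is a toy sketch of a selective state space recurrence in plain Python. It is a heavily simplified illustration under my own assumptions, not Mamba's actual implementation: the input is a scalar per step, the state matrix is diagonal, and the parameter names (`a`, `w_delta`, `w_b`, `w_c`) are hypothetical. The "selective" part is that the step size and the input/output projections are computed from the current input rather than being fixed.

```python
import math
from typing import List, Sequence


def softplus(z: float) -> float:
    """Smooth, always-positive function used here to produce a step size."""
    return z if z > 30.0 else math.log1p(math.exp(z))


def selective_scan(
    xs: Sequence[float],
    a: Sequence[float],
    w_delta: float,
    w_b: Sequence[float],
    w_c: Sequence[float],
) -> List[float]:
    """One pass over the sequence with a fixed-size state: O(len(xs)) time.

    a       : diagonal of the state matrix (negative values decay the state)
    w_delta : makes the step size depend on the current input
    w_b/w_c : make the input and output projections depend on the current input
    """
    h = [0.0] * len(a)  # the state never grows with sequence length
    ys: List[float] = []
    for x in xs:
        delta = softplus(w_delta * x)           # input-dependent step size
        for i in range(len(a)):
            a_bar = math.exp(delta * a[i])      # how much old state to keep
            b_bar = delta * (w_b[i] * x)        # how strongly to write the input
            h[i] = a_bar * h[i] + b_bar * x
        ys.append(sum((w_c[i] * x) * h[i] for i in range(len(a))))
    return ys


ys = selective_scan(
    xs=[1.0, 0.0, 0.0, 2.0],
    a=[-1.0, -0.1],
    w_delta=1.0,
    w_b=[1.0, 0.5],
    w_c=[1.0, 1.0],
)
print(ys)
```

Contrast this with attention, where every new token is compared against all previous tokens: here each step touches only the fixed-size state `h`, which is the compressed "memory" of everything seen so far.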
Why it matters:
Perhaps the most glaring limitation of current AI products is their memory capacity. While some efforts have been made to emulate episodic memory, like the 'town of agents' project from Stanford, none have fundamentally overcome the limitations of current model architectures. A breakthrough in this area could unleash a wave of new possibilities, from AI with (optional) comprehensive memory to enhanced long-range capabilities vital for scientific research.
It's important to remember that the gains predicted by scaling laws don't depend on major advancements in core architecture. This suggests that we might experience even more rapid improvements, not just through increased compute and better data but also through algorithmic innovations.
Interpreting LLMs
Just as brains have regions dedicated to certain functions, so might language models. Mechanistic Interpretability (a mouthful) is a nascent-ish research domain aimed at demystifying the complex inner workings of deep neural networks. One early and compelling concept from this domain is Universality: the idea that certain features, patterns, or circuits within neural networks are common across different models, particularly those trained on similar tasks or domains. It suggests that the effort invested in understanding one model can provide a foothold in understanding subsequent models, making the daunting task of interpretability more manageable and scalable. Universality posits that repeating structures or behaviors exist in neural networks and transcend individual models. This is akin to finding commonalities in biological systems across different species, which can significantly aid in generalizing findings and methodologies.
Why it matters:
More predictable and transparent models will lead to improved safety and alignment. Safety and alignment are good for business. Research in the area is heating up, with work such as Patchscopes from Google Research.
Although it's still early days, companies are emerging that specialize in offering interpretability-based optimizations for model selection and routing.
The “Mixture of Experts” architecture
Dating back to early rumors of GPT-4 being a variant of the “Mixture of Experts” (MoE) architecture, interest in the paradigm has surged with the release of Mistral’s Mixtral 8x7B open-weight model. This approach integrates numerous specialized neural networks, or 'experts', into a transformer, each adept at handling different kinds of input, with a router sending each token to only a few of them. MoEs are notable for their efficient pretraining and faster inference compared to traditional dense models of similar total size. Because only a fraction of the parameters is active for any given token, these 'sparse' models can scale up with reduced computational demands. However, MoEs are not without challenges: they grapple with fine-tuning instabilities and training complexities, arising from the delicate balance needed in expert load management and effective token routing.
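The routing idea can be sketched in a few lines of plain Python. This is a toy illustration under my own assumptions rather than Mixtral's actual code: the experts are simple scaling functions, the router is a single linear layer, and the names `moe_layer`, `make_expert`, and `router_weights` are hypothetical. The point is that only `top_k` experts run per input, however many experts exist.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale: float) -> Callable[[Vector], Vector]:
    """Toy expert: a real one would be a feed-forward network."""
    return lambda v: [scale * t for t in v]


def moe_layer(
    x: Vector,
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Tuple[Vector, List[int]]:
    # Router: one score per expert (a dot product with the input).
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in router_weights]
    # Keep only the top_k highest-scoring experts; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        y = experts[i](x)
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, chosen


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
output, chosen = moe_layer([1.0, 0.0], router_weights, experts, top_k=2)
print(chosen, output)
```

The load-balancing challenge mentioned above shows up even in this toy: if the router always prefers the same two experts, the others never get used, which is why real MoE training typically adds terms that encourage the router to spread tokens across experts.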
Why it matters:
MoEs begin to address a fundamental limitation in the AI field: the trade-off between model size and manageable compute requirements. Put another way, a small startup founded just last year can train a sparse model on par with leading frontier models, for a fraction of the price.
As with the aforementioned State Space Models, MoEs introduce yet another fundamental paradigm that could impact scaling laws in new and exciting ways.


Looking forward to following along!
I'm a design technologist who follows algorithmic design innovations closely. I'm struck by the lack of the word human(s) or user(s) in this article. It's frankly a bit scary when a leader with a background in design at a company the size of Google doesn't talk about humans or users. (I get that your algorithmic scientists have abstracted everything and everyone to tokens.) Does applied AI work at Google involve humans? Was it just this post in particular that was particularly heavy on intelligence and the brain? Will future posts cover the people and communities that are affected and possibly harmed by the scale of the machines you're imagining deploying to humans?