Building a GENERAL AI agent with reinforcement learning

Published 2024-03-20
Dr. Minqi Jiang and Dr. Marc Rigter explain an innovative new method for making agents' intelligence more general-purpose: training them to learn many worlds before their usual goal-directed training (reinforcement learning).

Their new paper is called "Reward-free curricula for training robust world models" arxiv.org/pdf/2306.09205.pdf
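
For a concrete picture, here is a minimal, hypothetical sketch of a reward-free curriculum over environments: data is collected preferentially from whichever environment the world model is currently most uncertain about (highest prediction error), with no task reward involved. The class and interfaces below (RewardFreeCurriculum, prediction_error, train) are illustrative assumptions, not the paper's implementation; see the WAKER discussion at 01:15:21 for the actual algorithm.

```python
import random

class RewardFreeCurriculum:
    """Toy curriculum: collect data where the world model is most uncertain.

    Assumed interfaces (not the paper's API): `world_model.prediction_error(traj)`
    returns a scalar error for a trajectory, and `world_model.train(traj)` updates
    the model; `envs` is a list of environment instances.
    """

    def __init__(self, envs, world_model, exploration_eps=0.2):
        self.envs = envs
        self.world_model = world_model
        self.exploration_eps = exploration_eps
        self.error_estimates = [0.0] * len(envs)  # running error per environment

    def select_env(self):
        # Occasionally sample uniformly so every environment's estimate stays fresh.
        if random.random() < self.exploration_eps:
            return random.randrange(len(self.envs))
        # Otherwise pick the environment where the model currently has most to learn.
        return max(range(len(self.envs)), key=lambda i: self.error_estimates[i])

    def step(self, collect_trajectory):
        idx = self.select_env()
        traj = collect_trajectory(self.envs[idx])          # reward-free rollout
        error = self.world_model.prediction_error(traj)    # model's own error signal
        # Exponential moving average keeps a smoothed per-environment estimate.
        self.error_estimates[idx] = 0.9 * self.error_estimates[idx] + 0.1 * error
        self.world_model.train(traj)
        return idx, error
```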

twitter.com/MinqiJiang
twitter.com/MarcRigter

Interviewer: Dr. Tim Scarfe

Please support us on Patreon. Tim is now doing MLST full-time and taking a massive financial hit. If you love MLST and want it to continue, please show your support! In return you get early access to shows, plus a private Discord and networking: patreon.com/mlst

We are also looking for show sponsors; if you are interested, please get in touch at mlstreettalk at gmail.

MLST Discord: discord.gg/machine-learning-street-talk-mlst-93735…

00:00:00 - Intro
00:01:05 - Model-based Setting
00:02:41 - Similar to POET Paper
00:05:27 - Minimax Regret
00:07:21 - Why Explicitly Model the World?
00:12:47 - Minimax Regret Continued
00:18:17 - Why Would It Converge
00:20:36 - Latent Dynamics Model
00:24:34 - MDPs
00:27:11 - Latent
00:29:53 - Intelligence is Specialised / Overfitting / Sim2real
00:39:39 - Open-Endedness
00:44:38 - Creativity
00:48:06 - Intrinsic Motivation
00:51:12 - Deception / Stanley
00:53:56 - Sutton / Reward is Enough
01:00:43 - Are LLMs Just Model Retrievers?
01:03:14 - Do LLMs Model the World?
01:09:49 - Dreamer and Plan to Explore
01:13:14 - Synthetic Data
01:15:21 - WAKER Paper Algorithm
01:21:24 - Emergent Curriculum
01:31:16 - Even Current AI is Externalised/Mimetic
01:36:39 - Brain Drain Academia
01:40:10 - Bitter Lesson / Do We Need Computation
01:44:31 - The Need for Modelling Dynamics
01:47:48 - Need for Memetic Systems
01:50:14 - Results of the Paper and OOD Motifs
01:55:47 - Interface Between Humans and ML

Comments
  • @Ben_D.
    I love the long format and high level context. Excellent.
  • @CharlesVanNoland
    This is awesome. Thanks Tim! "If we just take a bunch of images and try and directly predict images, that's quite a hard problem, to just predict straight in image space. So the most common thing to do is kind of take your previous sequence of images and try and get a compressed representation of the history of images, in the latent state, and then predict the dynamics in the latent state." "There could be a lot of spurious features, or a lot of additional information, that you could be expending lots of compute and gradient updates just to learn those patterns when they don't actually impact the ultimate transition dynamics or reward dynamics that you need to learn in order to do well in that environment." (A minimal code sketch of this latent-dynamics setup appears after the comments.)
  • @MartinLaskowski
    I really value the effort you put into production detail on the show. Makes absorbing complex things feel natural
  • @agenticmark
    We can derive a reward network based on user responses, the same way we do for analytics. If the user "kills" the agent, it wasn't performing. If the model's work was put back into a target Q-network, we could use that to adjust the reward policy network to give it RLHF effects. I have done this with some Atari and board games, where the specific reward function was used to train the foundation model, then later fine-tuned without that reward function, switching instead to the Q-network for rewards (a rough sketch of this idea appears after the comments). These guys were two of your best guests, after the Anthropic boys
  • @ehfik
    great guests, good interview, interesting propositions! MLST is the best!
  • @diga4696
    Amazing guests!!! Thank you so much. Human modalities, when symbolically reduced and quantized into language and subsequently distilled through a layered attention mechanism, represent a sophisticated attempt to model complexity. This process is not about harboring regret but rather acknowledges that regret is merely one aspect of the broader concept of free energy orthogonality. Such endeavors underscore our drive to understand reality, challenging the notion that we might be living in a simulation by demonstrating the depth and nuance of human perception and cognition.
  • @Dan-hw9iu
    Superb interview, Tim. This is among your best. I was amused by the researchers hoping/expecting that future progress will require more sophisticated models in lieu of simply more compute; I would probably believe this too, if my career depended on it! But I suspect that we'll discover the opposite: the Bitter Lesson was a harbinger for the Bitter End. Human-level AGI needed no conceptual revolutions or paradigm shifts, just boosting parameters -- intellectual complexity doggedly follows from system complexity. More bit? More flip? More It. And why should we have expected a more romantic story? Using a dead simple objective function, Mother Nature marinated apes in a savanna for a while and out popped rocket ships. Total accident. No reasoning system needed. But if we intentionally drive purpose-built systems toward a mental phenomenon like intelligence, approximately along a provably optimal learning path, for millions of FLOP-years...we humans will additionally need a satisfying cognitive model to succeed? I'm slightly skeptical. The power of transformers was largely due to vast extra compute (massive training parallelism) that they unlocked. And what were the biggest advancements since their inception? Flash attention? That's approximating more intensive compute. RAG? Cached compute. Quantization? Trading accuracy for compute. Et cetera. If the past predicts the future, then we should expect progress via incremental improvements in compute (training more efficiently, on more data, with better hardware, for longer). We're essentially getting incredible mileage out of an algorithm from the '60s. Things like JEPA are wonderful contributions to that lineage. But if anyone's expecting some fundamentally new approach to reach human-level AGI, then I have a bitter pill for them to swallow...
  • @NextGenart99
    Seemingly straightforward, yet profoundly insightful.
  • @sai4007
    One important thing which world models bring in over a simple forward dynamics model is learning to infer latent Markovian belief-state representations from observations through probabilistic filtering. This distinguishes latent-state world models from normal MBRL! Partial observability is handled systematically by models like Dreamer, which use a recurrent variational inference objective, along with a Markovian assumption on the latent states, to learn variational encoders that infer latent Markovian belief states (a minimal filtering sketch appears after the comments).
  • @codybattery8370
    What would be the disadvantage of having a policy try to maximize the positive delta of the world model's predictions, i.e. picking the actions from which it predicts the most can be learned? (A sketch and one known disadvantage appear after the comments.)
  • @olegt3978
    Amazing. We are on the highway to AGI in 2027-2030
  • @XOPOIIIO
    I've missed it: why exactly would it explore the world? What is the reward function?
  • @drlordbasil
    Do you think it would help to take current datasets and essentially synthesize docstrings, but for real-world prompt/assistant sets? This could add a level of reasoning if it adds extra context to current online data, gradually improving contextual understanding of the reasoning that happens between the user and the agent: explaining what the user's intents are, noting the misconceptions/pros/cons of the assistant's response, and adding steps that could have improved the assistant's responses by having another AI respond as a critic.
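
Regarding the quote in @CharlesVanNoland's comment about predicting dynamics in a compressed latent space rather than directly in image space: below is a minimal PyTorch-style sketch under simplifying assumptions of my own (a deterministic encoder and latent transition; models like Dreamer use recurrent, stochastic variants and also reconstruct observations).

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Encode images into a small latent state and predict dynamics there,
    instead of predicting the next frame directly in pixel space."""

    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        # Compress a 64x64 RGB image into a latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # Predict the next latent state from (latent, action).
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, image, action):
        z = self.encoder(image)                          # compressed representation
        z_next = self.transition(torch.cat([z, action], dim=-1))
        return z, z_next
```

Training would then minimise the gap between z_next and the encoding of the actually observed next frame, so all prediction happens in the small latent space rather than in pixels.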
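
@agenticmark's idea of deriving rewards from user responses could look roughly like the sketch below. The binary kept/killed label, the network sizes, and the training loop are my assumptions, not their code; the learned scorer could then stand in for a hand-written reward function (or be swapped for a target Q-network) during fine-tuning.

```python
import torch
import torch.nn as nn

class UserFeedbackRewardModel(nn.Module):
    """Predict a scalar reward from a state, trained on binary user feedback
    (e.g. 1 if the user kept the agent running, 0 if they 'killed' it)."""

    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

def train_on_feedback(reward_model, states, kept, epochs=10, lr=1e-3):
    """Fit the reward model so high scores correlate with sessions the user kept."""
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(reward_model(states), kept.float())
        loss.backward()
        opt.step()
    return reward_model
```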
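
To make @sai4007's point concrete, here is a minimal sketch of filtering a Markovian belief state from observations. It is not Dreamer's actual RSSM: it shows only the posterior (filtering) path and omits the learned prior and the KL/variational objective that Dreamer-style models train against.

```python
import torch
import torch.nn as nn

class BeliefStateFilter(nn.Module):
    """Minimal belief-state filter: a GRU carries the history, and a Gaussian
    posterior over the latent state is inferred from (history, observation)."""

    def __init__(self, obs_dim, action_dim, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.rnn = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        # Posterior q(z_t | h_t, o_t): mean and log-variance of a Gaussian.
        self.posterior = nn.Linear(hidden_dim + obs_dim, 2 * latent_dim)
        self.latent_dim = latent_dim

    def step(self, prev_latent, prev_action, hidden, obs):
        # Deterministic history update, then infer the stochastic belief state.
        hidden = self.rnn(torch.cat([prev_latent, prev_action], dim=-1), hidden)
        stats = self.posterior(torch.cat([hidden, obs], dim=-1))
        mean, log_var = stats.chunk(2, dim=-1)
        latent = mean + torch.randn_like(mean) * (0.5 * log_var).exp()  # reparameterize
        return latent, hidden
```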
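
On @codybattery8370's question: rewarding the policy with the world model's prediction error is essentially the curiosity-style intrinsic motivation discussed around 00:48:06. A toy version, with world_model.predict as an assumed interface:

```python
import torch

def intrinsic_reward(world_model, obs, action, next_obs):
    """Reward the agent by how much the world model mis-predicted the next
    observation. `world_model.predict(obs, action)` is an assumed interface
    returning a predicted next observation."""
    with torch.no_grad():
        predicted = world_model.predict(obs, action)
        # Squared prediction error as a proxy for 'how much could be learned'.
        return ((predicted - next_obs) ** 2).mean().item()
```

One known disadvantage is the "noisy TV" problem: where the environment is irreducibly random, prediction error never shrinks, so the agent can fixate on noise. Plan2Explore (discussed at 01:09:49) addresses this by rewarding disagreement across an ensemble of models, i.e. only the reducible, epistemic part of the uncertainty.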