No Priors Ep. 24 | With Devi Parikh from Meta

No Priors: AI, Machine Learning, Tech, & Startups | 2023-07-20


💫 Short Summary

Devi Parikh discusses her journey in computer vision and AI research, with a focus on generative AI and machine learning. She emphasizes meaningful human-machine interaction and creative expression through AI. The discussion covers her transition from academia to industry at Meta, where she works on generative AI research. The 'Make-A-Video' project explores text-to-video generation by learning appearance from paired image and text data and motion from unlabeled video. The conversation also covers the limitations of current video generation technology, the challenges of processing video, the importance of data for video models, and the potential of AI in social media for creative expression.

✨ Highlights

📊 Transcript

✦

Devi Parikh's background in computer vision and AI research.

00:39

Originally from India, she moved to the US for education and was introduced to pattern recognition, sparking her interest in research projects.

She initially wanted a master's degree with a thesis, but Carnegie Mellon's ECE department had dropped that track the year she applied, so she was placed on the PhD track.

Her interest in computer vision stemmed from the visual element of image processing and the ability to visually interpret algorithm outputs.

✦

Transitioning research focus from pattern recognition to machine learning, with an emphasis on human-machine interaction.

02:47

Importance of meaningful human-machine interaction leading to a shift towards visual modalities for better communication.

Evolution towards natural language processing and creative expression through AI.

Journey from random projects to dedicated research agenda in generative modeling to enhance human creativity through AI.

✦

Transition from academia to industry at Meta's FAIR lab (then Facebook AI Research, now Fundamental AI Research).

05:58

Initially planned a one-year stint but stayed on because she enjoyed the work and FAIR wanted to keep her, splitting her time between Georgia Tech and Meta for about five years.

Shift towards generative AI research, focusing on large language models, image and video generation, 3D content, audio, and music.

Emphasis on creating content in various modalities to cater to the need for more creators in addition to consumers.

✦

The 'Make-A-Video' project aimed to explore video generation using image and text data.

08:26

The approach involved leveraging diffusion-based models to make video generation feasible.

The project separated learning visual appearance and how people describe it (from images and associated text) from learning how things move (from video).

This unique approach offered advantages such as reducing the model's learning complexity.

Overall, 'Make-A-Video' sought to push the boundaries of video generation by reusing progress in image generation.

✦

Benefits of using image data sets for training video generation models.

10:12

Image data sets offer diversity with fantastical depictions like dragons and unicorns.

Because appearance and motion are learned separately, no paired video and text data is needed: paired image and text data teaches appearance, and unlabeled video teaches motion.

Model initializes with image generation parameters and gradually learns to generate temporally coherent videos.

Models may combine concepts seen separately in images and videos (for example, a flying corgi), though what they actually learn is hard to interpret.
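
The initialization idea in this highlight can be illustrated with a minimal sketch. It assumes, purely for illustration, that the video model adds a temporal mixing layer on top of per-frame image outputs and that this layer starts as an identity, so the untrained model still returns independent images. The function `temporal_mix`, the kernels, and the toy frames are hypothetical and are not taken from the Make-A-Video implementation.

```python
# Hypothetical sketch: names, shapes, and values are illustrative only.

def temporal_mix(frames, kernel):
    """Mix each frame with its neighbours along the time axis.

    frames: list of frames, each a flat list of pixel values.
    kernel: weights (previous, current, next); edges reuse the nearest frame.
    """
    w_prev, w_cur, w_next = kernel
    last = len(frames) - 1
    mixed = []
    for t, frame in enumerate(frames):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        mixed.append([
            w_prev * p + w_cur * c + w_next * n
            for p, c, n in zip(prev_frame, frame, next_frame)
        ])
    return mixed


# Stand-in for frames produced independently by a pretrained image generator.
independent_frames = [[0.0, 1.0], [4.0, 3.0], [2.0, 8.0]]

# Identity initialization (an assumption): before any video training the
# temporal layer passes every frame through unchanged.
identity_kernel = (0.0, 1.0, 0.0)
at_init = temporal_mix(independent_frames, identity_kernel)

# A kernel with non-zero neighbour weights, such as training might produce,
# shares information across frames.
smoothing_kernel = (0.25, 0.5, 0.25)
after_training = temporal_mix(independent_frames, smoothing_kernel)
```

With the identity kernel the output equals the independent frames, which mirrors the description of starting from image generation parameters; training on video then only has to learn the cross-frame weights.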

✦

Limitations of current video generation technology.

13:26

Generated videos are still short (about four seconds) and essentially animated images, with no scene transitions or complex storytelling.

Emphasis on the need for longer, more intricate videos with consistent object appearances and scene transitions.

Comparison between the current state of video generation and advancements in image generation.

Questioning the existence of fundamental missing elements in approaching video generation, potentially causing a bottleneck in development.

✦

Challenges in video processing include slower iteration cycles, inadequate representations, and the need for hierarchical architectures.

15:51

Data is emphasized as crucial in video processing, urging the development of strategies for optimizing data.

Progress has been made in language and image processing, but effective data management for video applications still needs improvement.

✦

Challenges in training models with video data include limited motion in short videos and difficulty in learning from complex videos.

19:22

Video understanding is progressing slower than image understanding, affecting overall advancements in the field.

For robotics, advances in video understanding are likely more relevant than advances in video generation.

Embodied agents consuming visual content emphasize the significance of video understanding in AI applications.

✦

Importance of Controllability in Video Creation

21:39

Embodied agents and robots are active participants: their actions influence the visual signals they receive next.

Controllability is essential in aligning generative models with users' creative expression.

Text prompts are crucial for controlling models and enhancing content generation.

There is a need for more diverse and multimodal prompts to improve direct control in inputting text prompts and receiving images or videos.

✦

Improving video generation models through various inputs.

22:49

Emphasizing the importance of controlling generated content and suggesting iterative editing mechanisms.

Predicting advancements in core capabilities will come before improvements in prompting techniques.

Discussing the trend of stylizing and editing existing videos, with products like Runway as examples.

✦

Overview of Text-to-Speech Systems and Audio Integration in Visual Content.

25:46

Text-to-speech systems are evolving to enhance the expressiveness and delightfulness of visual content through audio.

There is an underinvestment in audio and music integration in these systems.

Challenges in audio compositionality include creating longer pieces with multiple sounds happening simultaneously.

Despite the value audio adds to visual content, further development and investment are needed for improvement.

✦

Limitations of current AI models in generating natural speech and handling complex sequences of sounds.

27:11

Potential applications of AI agents in media creation, including animated gifs, video editing, and marketing.

Anticipation of near-term and longer-term uses for AI technology, focusing on unexpected and delightful user experiences.

Importance of identifying spaces where AI can emerge as a primary use case, hinting at innovative applications in various fields.

✦

Impact of social media on photography and video creation.

29:43

Instagram and TikTok have simplified control parameters, democratizing the creation of high-quality imagery.

Generative technologies have become popular, attracting both skilled artists and individuals seeking creative expression.

The rise of AI tools for self-expression has sparked questions about artistic engagement and creativity.

Some artists have built their brand around using AI as their primary tool for self-expression.

✦

Excitement about new technology and creative possibilities.

32:45

Importance of control in using tools for desired outcomes.

Artist planning to use AI in a video project.

Diversity in perspectives on AI's role in creation.

Some focus on specific visions while others embrace unpredictability in the artistic process.

✦

Multi-modality in image, audio, and video generation.

34:51

Lack of research in integrating all modalities into one system.

Progress seen in this area.

Insights from experience at CVPR, focusing on scaling models and impact on research labs.

Prediction of increased use of AI tools in social media for creative expression and communication.

✦

The impact of AI agents on social networks and communication.

37:15

AI is changing how people connect with each other.

Recognizing the impact AI has already had, even if it is sometimes overlooked.

Bot-based interactions and other social expression modalities can impact social and communicative media.

Advice on time management and productivity in AI research, emphasizing scheduling tasks on a calendar over a to-do list.

✦

Encouragement to not self-select opportunities and take a chance.

39:35

The conversation with the guest ends positively.

Gratitude expressed for the guest's time and participation.


00:05text prompts are democratizing creative

00:07expression and the Holy Grail is AI

00:09generated and edited video Elad Gil and

00:12I sit down with Devi Parikh she's a

00:15research director in generative AI at

00:17meta a leading researcher at

00:19multimodality and AI for visual audio

00:21and video and she's an associate

00:23professor in the school of interactive

00:25Computing at Georgia Tech recently she

00:27worked on Make-A-Video3D which creates

00:29animations from text prompts she's also

00:31a talented artist herself Devi welcome

00:34to no priors thank you thank you for

00:36having me let's start with your

00:37background and how you got started in

00:39computer vision I I've heard you say you

00:42choose projects based on what brings you

00:44Joy is that how you got into AI research

00:48um kind of kind of yeah so my background

00:50is that I grew up in India and then I

00:53moved to the US After High School and I

00:56went uh to a small school called Rowan

00:58University in Southern New Jersey for my

01:00undergrad and that is where I first got

01:03exposed to

01:05um what at the time was being called

01:06pattern recognition we weren't even

01:07calling it machine learning and got

01:09exposed to some research projects there

01:11was a professor there who kind of showed

01:13some interest in me thought I might have

01:14potential to contribute meaningfully to

01:16research projects

01:18um and that's how I got exposed and I

01:20really really enjoyed what I was doing

01:21there decided to go to grad school to

01:24Carnegie Mellon I knew I was enjoying it

01:26but I wasn't sure if I wanted to do a

01:28PhD so at first I wanted to just kind of

01:30get a master's degree with the thesis

01:32where I can do some research but the

01:34year that I applied uh that the ECE

01:37department at CMU decided that there

01:39wasn't going to be a masters track for

01:41thesis like either you can just take

01:42courses or you go to a PhD and so they

01:45kind of slotted me onto the PHD track

01:48um which I wasn't so sure of but my

01:50advisor there was reasonably confident

01:52that I'm going to enjoy it and I'm going

01:53to want to keep going

01:55um so yeah that's how I got started in

01:57this space

01:58at first I was doing projects that didn't

02:01have a visual element to it how did you

02:03pick a thesis project

02:04so at first I wasn't I was working on

02:07projects that didn't have too much of a

02:09visual element to them

02:11um but when I got to CMU my advisor's

02:14lab was working in image processing in

02:15computer vision and I always thought

02:17that it was pretty cool that everybody

02:18gets to kind of look at the outputs of

02:20their algorithms and see what they're

02:23doing whereas if it's kind of non-visual

02:25then yeah you see these metrics but you

02:26don't really have a sense for what's

02:27what's uh happening if it's working if

02:30it's not

02:31um and so that's how I got interested in

02:32computer vision and that then defined

02:34the topic of my thesis over the course

02:36of my PhD

02:38so you have been working in machine

02:42learning long enough that as you said it

02:43was called pattern recognition and

02:45you've worked across a bunch of

02:47different modalities how does how has

02:49that changed your your research path

02:51because like things like diffusion

02:54models and GANs and large Transformers

02:56none of that existed when you were first

02:58starting and I I think you have managed

03:00to sort of translate or transition your

03:03interests in a way that keeps you on The

03:05Cutting Edge how does that happen

03:08yes I think I mean it you can always

03:11kind of look back and try and find

03:13patterns like when you're actually doing

03:14it you don't necessarily have a grand

03:16strategy of anything in mind but when I

03:18look back I think one common theme that

03:21led to me transitioning across topics a

03:24little bit

03:25um was that I was interested in seeing

03:27how we can get humans to interact with

03:30machines in more meaningful ways and so

03:34kind of even my transition from kind of

03:35non-visual to visual modalities in

03:37hindsight I feel like was essentially

03:38that I felt like you can't interact with

03:40these systems too much if it's sort of

03:42these abstract modalities that you're

03:43looking at and then when I was working

03:45in computer vision I wanted to find ways

03:48for humans to be able to interact with

03:49these systems more so I started looking

03:51at kind of these attributes and

03:53adjectives of like oh something is furry

03:54or something is shiny and using that as

03:57a mode of communication between humans

03:58and machines both for humans to teach

04:01machines new Concepts and for machines

04:03to be more interpretable than explaining

04:05why they're making the decisions that

04:06they're making and that slowly led to

04:08the the sort of more natural

04:10language processing where instead of

04:12these kind of just adjectives and

04:13attributes looking at more natural

04:14language as a way of interacting so a

04:16lot of my work in visual question

04:18answering where you're answering

04:19questions about images image captioning

04:21was coming from there and then over time

04:24I sort of started thinking of are there

04:27ways to go even deeper in this

04:29interaction are there other ways where

04:30AI tools can enhance sort of creative

04:33expression for people give them more

04:35tools for expressing themselves

04:37um and that's how I got interested in AI

04:39for creativity and I was dabbling in

04:40kind of a few fairly random projects a

04:44few years ago kind of my bread and

04:46butter research was still multimodal

04:48vision and language

04:50um but then sort of I enjoyed what I was

04:52doing with AI for creativity and a

04:54couple of years ago that took sort of a

04:55little bit more of a serious turn where

04:57I made it more of my sort of

04:59full-fledged research agenda and that's

05:01how I started working more seriously on

05:03generative modeling including

05:05Transformer based approaches diffusion

05:07models for images but video for 3D video

05:10things of that sort so you became a

05:13professor and you also you know now work

05:15in Industry at meta what brought you

05:17there

05:18so this was I've been at meta for about

05:21seven years now and so this had started

05:23when I was transitioning from Virginia

05:25Tech to Georgia Tech so I was an

05:27assistant professor at Virginia Tech and

05:28I was getting started at Georgia Tech

05:30and in that uh transition uh I decided

05:34to spend a year at Fair

05:36um at the time it was called Facebook AI

05:38research now Fundamental AI Research at

05:40meta and I knew colleagues there from

05:43some of them had been at Microsoft

05:44research before that I had interned at

05:46MSR I had spent summers at MSR even as a

05:49faculty member

05:50um and so I had a lot of colleagues who

05:52I knew once I thought it would be fun to

05:54kind of spend a year collaborate with

05:55them

05:56um get to know what fair is like

05:58um so that's what that was It was

05:59supposed to be a one-year uh stint and

06:01then I was going to go back to Georgia

06:02Tech and kind of continue with my uh

06:04academic position

06:06um but in that one year I enjoyed it

06:08enough uh I think Fair enjoyed having me

06:11around enough where we tried to figure

06:12out is there a way to keep this going

06:14for longer

06:16um and so for uh many years after that

06:18for five years or so I was splitting my

06:20time but every fall I would go back to

06:22Georgia Tech to Atlanta spend the fall

06:24semester there teach and then the rest

06:26of the year I would be in Menlo Park at

06:28Facebook now Meta and and you

06:30transition from fundamental AI research

06:33to a new sort of generative AI group can

06:35you talk about why that's interesting to

06:37Meta sort of what what kind of things

06:38you're working on now

06:39yeah yeah I mean yeah that's a very

06:42exciting space right now there's a lot

06:44happening both uh within Meta and

06:46outside as I'm sure many of the people

06:48listening to this are aware of

06:50um but yeah so that the new organization

06:52was created a few months ago

06:54um so not not a long not a long time ago

06:57um and it's uh looking at things like

06:59large language models image Generation

07:01video generation generating 3D content

07:03audio music

07:05um yeah all sort of all sorts of

07:07modalities that you might think of

07:10um and

07:11why is it interesting I mean like right

07:14now if you think about all the content

07:16there's so much content that we consume

07:19in all modalities and all sorts of

07:21surfaces

07:22um and it makes a lot of sense to ask

07:24that instead of uh maybe not instead of

07:26but in addition to all of this

07:28consumption can more of us be creating

07:30more of this content right and so

07:33um almost everything that you think of

07:35images video you can ask this question

07:38like for any situation where you're

07:39searching for something trying to find

07:40something it's relevant to ask well

07:42could I just create what it is that I

07:44have in my head

07:45um and so when you think of it that way

07:46you can see how it can touch

07:49um a lot of different things across a

07:51variety of products and surfaces so yeah

07:53yeah that makes a ton of sense I think

07:55we'll um come back and ask you some

07:56questions about images and audio and a

07:59few other things since you've done so

08:00much interesting work across so many

08:01different areas but maybe we can start a

08:03little bit with video generation in part

08:05you know due to the fact that you had a

08:07really interesting recent project called

08:08make a video and in that approach users

08:11can generate video with a text prompt so

08:13you could type in Imagine a corgi

08:15playing with a ball and it would

08:16generate a short video of a corgi

08:18playing with a ball

08:19um could you tell us a bit more about

08:21that project how it started and also how

08:24it works and what's the basis for the

08:25technology there

08:26yeah yeah so make a video

08:30um it it started because I mean this was

08:33um a couple of years ago where uh this

08:35was before DALL-E 2 by the way so this was

08:37like after DALL-E 1 had happened and I

08:39feel like a lot of people don't even

08:40remember DALL-E 1 anymore like people

08:42don't even talk about that it's fun to

08:44go check out what those images look like

08:46um and that had that had blown our minds

08:48at the time but now when you go back

08:49you're like wait like that's not even

08:50interesting but anyway so we had seen a

08:53lot of progress in image generation

08:55um and so it seemed like the next kind

08:57of entirely open question where we

08:59hadn't seen much work at all was to see

09:01what can we do with video generation and

09:03so that was kind of the the inspiration

09:04behind that

09:06um and for make a video the approach uh

09:09specifically the thinking was we have

09:13these image generation models by this

09:14time we had seen a lot of progress with

09:16diffusion based models

09:18um from a variety of different

09:19institutions and so the idea was is

09:21there a way of leveraging all the

09:23progress that's happened with images in

09:26a way to sort of make video

09:28generation possible and so that led to

09:31this intuition that what if we try and

09:34use images and Associated text as a way

09:37of learning what the world looks like

09:39and how we how people talk about the

09:42visual content and then separate that

09:45out from trying to learn how the how

09:48things in the world move so separate out

09:50appearance and language and that

09:51correspondence from motion

09:54um of how things of how things move and

09:57so that is what led to make a video

09:59um and so the there are sort of multiple

10:01advantages of thinking of it that way

10:04one is there's less for the model to

10:06learn because you're directly bringing

10:07in everything that you already know

10:09about images to start with

10:12um the second is all of the diversity

10:15that we have in our image data sets that

10:17image models already know all sorts of

10:19Fantastical depictions of like dragons

10:20and unicorns and things like that where

10:22you may not have as much video data

10:25easily available all of that is

10:27inherited so all the diversity of the

10:29visual concepts can come in through

10:30images even if your video data set

10:32doesn't have all of that

10:34um and the third benefit maybe the

10:36biggest one is that because of the

10:38separation you don't need video and

10:41associated text as paired data you have

10:43images and text as paired data and then you

10:45just have unlabeled video to learn

10:46motion from

10:48and so these were kind of three things

10:50that we thought were quite interesting

10:52in how we approached make a video

10:54um and so concretely the way it works is

10:56that when you initialize the model

10:58you're starting off with image

11:00Generation

11:01Um sort of parameters that have already

11:03been learned so before you do any

11:05training for make a video you you we set

11:07it up so that it can generate a few

11:10frames that are not temporally coherent

11:13so there is going to be independent

11:15images like the Corgi playing with the

11:16ball is just going to be independent

11:18images of corgis playing with balls

11:20but they're not going to be temporally

11:22coherent and then what the network is

11:24trying to do as it goes through the

11:25learning process is to make these images

11:27temporally coherent so that at the end

11:29of training

11:30um it is generating a video rather than

11:33just unrelated images and that's where

11:35the videos come in as training data

11:37that's a great explanation one question

11:39just like if we if we use an example of

11:42something that is not going to be in

11:45your video training set right um so so I

11:49want a flying Corgi for example right

11:52how should I think of this in terms of

11:54like interpreted motion

11:57yeah so one way of thinking of it could

11:59be that you may not have seen a flying

12:01Corgi but you've probably seen flying

12:03airplanes or flying birds and other

12:05things that fly

12:07um in images and in video and so from

12:09images you will have text associated

12:12with it so you will have a sense for

12:13what things tend to look like when

12:15someone is saying oh this is X flying or

12:16Y flying and then in videos you will

12:18have seen the motion of what stuff looks

12:20like when it when it flies and in images

12:23you'll have seen what corgis look like

12:25and so it's hard to kind of know for

12:28sure what it is that these models learn

12:29sort of interpretability is not a

12:32strength of many of these uh deep large

12:34architectures but that could be one

12:36intuitive explanation for how the model

12:39is managing to figure out what a flying

12:41Corgi might look like yeah

12:42well what are some of the major

12:44forward-looking aspects of this sort of

12:45project and research

12:47I think there's a ton to do in the

12:50context of video generation like if you

12:51look at make a video it was very

12:53exciting you know sort of first of its

12:55kind

12:55um capabilities at the time but it's

12:57still it's a four second video it's

12:59essentially an animated image right it's

13:01kind of the same scene the same set of

13:03objects that are moving around in

13:05reasonable ways but you're not seeing

13:07objects appear objects uh disappear

13:09you're not seeing objects reappear

13:11you're not seeing scene transitions

13:14um none of this is is is in there and so

13:17if you look at if you think about the

13:19complexity of videos that you just

13:20regularly come across on various

13:22surfaces this is far from that and so

13:24there is a ton to be done in terms of

13:26making these videos longer more complex

13:29having memory so that if an object

13:32reappears it's actually consistent it

13:34doesn't now look entirely different

13:36um things of that sort of being able to

13:38tell more complex stories through videos

13:40all of this is uh entirely open well and

13:45I know these things are always extremely

13:46hard to predict but if you you look

13:48forward a year or two what do you think the

13:49state of the art will be in terms of

13:51length of video complexity of the scenes

13:53that you can animate things like that

13:54yeah that is that is hard to say and

13:59to have been seen more of this already

14:02like make a video was I think what nine

14:05months or so ago maybe approaching one

14:07year and it's not like even from other

14:10institutions it's not like we are seeing

14:12amazingly longer videos or significantly

14:14higher resolution or much more

14:15complexity we're still kind of in this

14:17videos equals animated images and yeah

14:20maybe the resolution is a little bit

14:21bigger quality is a little bit higher um

14:24but it's not like we've made significant

14:25breakthroughs unlike for example what

14:27we've seen with images

14:29um so I do I I that has given me a sense

14:32that maybe this is harder than what we

14:35might think and sort of our usual curves

14:37of like with language models or image

14:38models we're like oh just six more months

14:41then there's going to be something else

14:42that's an entirely different step change

14:43over this

14:45um I think that might be harder in video

14:46and I wonder if there is something that

14:48we're kind of fundamentally missing in

14:50terms of how we approach video

14:51Generation

14:52Um so it's not quite answering uh what

14:55you asked me but I do think that it

14:56might be a little bit slower than what

14:58we might have guessed just because based

15:00on progress in other modalities what do

15:02you think is the main either challenge

15:04or bottleneck

15:05that you think has slowed progress in

15:07this field or not slowed it I mean

15:08obviously there's been you know every a

15:10lot of people are working very hard on

15:11these problems but to your point it

15:13seems like sometimes you have these

15:14fundamental breakthroughs and sometimes

15:15it's an architecture like Transformer

15:17based models versus traditional NLP and

15:20sometimes it's

15:21um you know iterating on a lot of

15:23other things that already exist in the

15:25pre-existing approaches and just sort of

15:26solving specific engineering or

15:28technical challenges

15:29if you were to sort of list out the

15:31bottlenecks to this what do you think

15:32they're likely to be

15:33yeah I think there's a few different

15:35things one is videos are just sort of from

15:39an infrastructure perspective harder to

15:40work with right they're just sort of

15:42larger more storage and sort of more

15:44expensive to process and more expensive

15:46to generate and all of that so there's

15:48just that iteration cycle that is much

15:51slower with video than it would be with

15:52other modalities so that is is one

15:55um the second is I don't I don't think

15:58we've still figured out the right

16:00representations for video right there is

16:03a lot of redundancy in video One frame

16:05to the next frame there's not a whole

16:06lot that changes

16:08um we still kinda approach them fairly

16:10independently as sort of independent

16:12images even if you're generating it it's

16:14kind of one after the other or if even

16:16if you're generating in parallel and

16:18then making it finer grain

16:20um so I think maybe that could be

16:21something that helps with a breakthrough

16:23that if we really figure out how to

16:25represent videos efficiently

16:28um and the third is this hierarchical

16:30architecture that if you want longer

16:33videos there's just so many pixels that

16:35you're trying to generate right it's a

16:36very very high dimensional signal um

16:38compared to anything else that we're

16:41doing

16:41and so just thinking through how do we

16:43even approach that what sort of

16:44hierarchical representation

16:46um makes sense especially if you want

16:48these scene transitions if you want to

16:49have this consistency which may be a

16:52form of memory

16:53um figuring those architectural pieces

16:55out um I think may be another piece of

16:57this puzzle um and then finally data

16:59right data is kind of gold in anything

17:01that we're uh we're trying to do

17:03um and I don't know if as the community

17:05we've quite built the muscle of thinking

17:07through data

17:09um sort of massaging the data

17:10appropriately

17:11um and all of that in the context of

17:13video we have that muscle quite a bit

17:15with language quite a bit with images

17:17but with video we're perhaps not quite

17:19there yet what would be the ideal

17:21training set for for video in that case

17:24or what's lacking from the existing

17:26approaches

17:27yeah I think what's what's lacking may

17:31not be so much

17:32um the data source itself although that

17:34is certainly a challenge as it is with

17:36other modalities but I think it might

17:38also be the data recipes that do we want

17:41to start with training with sort of

17:43these very short videos where not much

17:45is happening the scene isn't really

17:46changing but then that also tends to

17:48limit the motion there's just not much

17:50happening and so you're going to end up

17:51with these kind of animated image

17:52looking things

17:54um and on the other hand you have sort

17:55of very complex video might be multiple

17:57minutes long with all sorts of scene

17:58Transitions and that's ideally what you

18:00want to shoot for that's where you want

18:02to get but if you're just kind of

18:03directly throw all of that into the

18:05network it's unclear if the models will

18:06be able to learn

18:08um all of that complexity well so I

18:10think thinking through some sort of a

18:11curriculum may be valuable here and I

18:14don't think we've quite nailed that

18:15recipe down
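
The curriculum idea described in this answer can be made concrete with a minimal sketch: train first on short, single-scene clips and only later admit longer clips with scene transitions. The clip fields, stage limits, and the function `build_curriculum` are hypothetical illustrations, not a recipe stated in the conversation.

```python
# Hypothetical sketch: clip fields and stage limits are illustrative assumptions.

def build_curriculum(clips, stage_limits):
    """Group clips into training stages of increasing complexity.

    clips: list of dicts with 'name', 'seconds', and 'scene_cuts'.
    stage_limits: list of (max_seconds, max_scene_cuts), easiest stage first.
    Each stage holds every clip within its limits, so later stages keep
    the easy clips and add harder ones.
    """
    stages = []
    for max_seconds, max_cuts in stage_limits:
        stage = [
            clip["name"]
            for clip in clips
            if clip["seconds"] <= max_seconds and clip["scene_cuts"] <= max_cuts
        ]
        stages.append(stage)
    return stages


clips = [
    {"name": "corgi_ball", "seconds": 4, "scene_cuts": 0},
    {"name": "street_walk", "seconds": 20, "scene_cuts": 1},
    {"name": "short_film", "seconds": 180, "scene_cuts": 12},
]

# Stage 1: animated-image-like clips; stage 3: multi-minute video with transitions.
stage_limits = [(5, 0), (30, 2), (600, 50)]
curriculum = build_curriculum(clips, stage_limits)
```

Keeping earlier clips in later stages is one possible design choice; an alternative would be disjoint stages, and which works better for video models is exactly the open question raised here.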

18:17I feel like every generation of sort of

18:19Technology shifts always runs into video

18:22is the hardest thing to do and if you

18:24look at sort of just the first

18:25instantiation of the web one of the

18:26reasons YouTube sold was the

18:27infrastructure point you made earlier

18:29where just dealing with that huge amount

18:31of streaming and the costs associated

18:32with it and everything else even in the

18:34prior generation of just you know can we

18:36host and stream this effectively in part

18:39led to them you know getting sold to

18:41Google reasonably early in the life of a

18:43company so it's interesting how video is

18:44always that much more complicated yeah

18:46yeah and same thing for computer vision

18:48right like here we talk about generation

18:50but even just understanding

18:52um with images with image understanding

18:53there was so much progress that was

18:55being made and videos was always kind of

18:57not only trailing behind but just sort

18:59of continue to be harder even sort of

19:00the rate of progress uh was slower not

19:03just the absolute progress and I think

19:05yeah I think we're seeing some of that

19:07uh for generative models as well

19:09Devi I know this isn't within your like

19:11core field but I'm sure you also pay

19:13attention like how do you think advances

19:15in video may like impact robotics

19:22so I think

19:23there the video understanding piece is

19:26probably more relevant than the video

19:28generation piece

19:29um and video is like if you think of

19:32embodied agents right they're sort of

19:33moving around and consuming visual

19:36content which inherently is video right

19:37they're not looking at static images and

19:39so I think that video understanding

19:41piece is is very relevant there what's

19:43also interesting in the context of

19:45embodied agents or sort of robotics uh

19:47physical robots that are moving around

19:48is that it's not passive consumption of

19:51videos right it's not like how you and I

19:53might be watching videos on YouTube or

19:55anything else it's that the I'm yelling

19:58at the screen I'm not passive

20:01it's that the the next visual signal

20:05that the robot will see will be a

20:07consequence of the action that the robot

20:08had taken right so if it chose to move a

20:10certain way that's going to change what

20:12the video looks like in the next few

20:14seconds

20:15um and so there's that interesting

20:16feedback loop there where it knows what

20:18action it had taken it sees how that

20:20changed the visual signal that it is uh

20:23that it is now getting as input and so

20:26that connection makes it adds a layer of

20:29interestingness to how it can process

20:31the video sort of in contrast to with

20:34sort of regular computer vision

20:35disembodied tasks where we think of sort

20:37of it just streaming a video is just

20:39kind of happening and you're not

20:40controlling what you're seeing
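
[Editor's sketch] The contrast drawn above, between passively watching a stream and an embodied agent whose next observation depends on its own action, can be shown with a toy example. The `ToyCorridor` world and the function names are hypothetical, invented purely for illustration.

```python
def passive_stream(frames):
    """Disembodied setting: frames arrive regardless of what the viewer does."""
    for frame in frames:
        yield frame


class ToyCorridor:
    """Hypothetical 1-D world: the 'frame' is just the agent's position."""

    def __init__(self, start=0):
        self.position = start

    def step(self, action):
        # action is -1 (move left) or +1 (move right);
        # the next observation is a consequence of that action.
        self.position += action
        return self.position


def rollout(env, actions):
    """Record (action, next_observation) pairs an embodied agent could learn from."""
    return [(a, env.step(a)) for a in actions]


trajectory = rollout(ToyCorridor(start=0), [+1, +1, -1])
print(trajectory)
```

Each recorded pair ties an action to the observation it caused, which is the extra signal a passive video stream does not provide.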

20:42you started by saying that human

20:45interaction was a big driving force in

20:49in your research interests and

20:51um you know going beyond like metrics as

20:53outputs and um even language as inputs

20:57um how do you think about uh

20:59controllability in video and like how

21:01important text prompting is to sort of

21:04the next generation of creation

21:07yeah I think that's um I think that's

21:09very important exactly to your point

21:11that if we want these generative models

21:13not just for video but for any modality

21:15to be

21:16um tools for creative expression then it

21:20needs to be generating content that

21:22corresponds to what someone wants to

21:25express right like it has to bring

21:26somebody's voice to life and that is not

21:29possible if there aren't good enough

21:30ways of controlling

21:32um these models and so text is one way

21:34that's better than random samples that's

21:37that's one way in which I can I can say

21:39what I want but right now for the most

21:40part you type in a text prompt you get

21:43an image back a video back

21:45um and either you take it or leave it

21:47right like if you like it that's great

21:48if not you're just going to try again

21:50and maybe you tweak the prompt a little

21:52bit you sort of

21:53um try a whole bunch of these prompt

21:55engineering tricks and and hope that you

21:57get lucky but it's not really a very

21:59direct form of control

22:02um and so I think of more control at

22:04least in two different ways one is to

22:07allow for prompts that are not just text

22:09but are multimodal themselves so for

22:12image generation for example instead of

22:14just text it would be nice if I can kind

22:16of sketch out what I want the

22:17composition of the scene to look like

22:19and and the model would be expected to

22:21kind of respect that

22:23for video instead of just text as input

22:26maybe I can also provide an image as

22:27input so that I can tell the system that

22:29this is the kind of scene that I want

22:30maybe I can provide sort of a little

22:32audio clip as input to convey that this

22:34is the kind of audio or sound that I

22:36want associated with it maybe I also

22:38bring in a short video clip

22:40um and expect the model to sort of bring

22:41in all of these different modalities in

22:43a reasonable way

22:45um to generate a video so that's one

22:46piece where I can bring in more inputs

22:49as a way of more control and the second

22:51piece is sort of the predictability part

22:53that even if I bring in all of these

22:55modalities as input if the model then

22:58goes off and kind of does its own thing

22:59with these inputs maybe it's reasonable

23:02but that's not what I'm looking for what

23:04do I do right like do I just go back and

23:05try again

23:06it would be ideal if there's some way of

23:08having iterative editing mechanisms

23:11where whatever I get back I have a way

23:12of communicating to the model what it is

23:14that I want changed in what way so that

23:16over iterations I can get to the content

23:18that I intended in sort of a fairly

23:22reasonable way without having to sort of

23:24spend hours learning on YouTube or

23:25something like that right so if that can

23:27be done in a very intuitive interface I

23:30think that would be pretty awesome

23:31where do you think we will get to in

23:33terms of the frontier of like controls

23:36for video generation over the next

23:37couple years or five years

23:40I think control sort of tends to lag

23:43behind the core capability like even

23:45with images I feel like we first had to

23:47get to a point where these models can

23:48actually generate nice looking images

23:50before we start worrying about whether this

23:52is really doing what I wanted it to do

23:54and I feel like we're not quite there

23:55with video yet so get random good first

23:58exactly exactly exactly like at least

24:00get random good first then maybe let me

24:02give a text then let me give it these

24:03other prompts

24:05um so I do think we'll first probably

24:06see more progress in just the core

24:09capabilities of sort of text to video

24:10generation

24:12um before we look at uh uh prompting

24:14although we are and this is in the

24:17context of sort of me generating

24:19something from scratch right which is

24:20where I might want the iterative control

24:22and things like that a parallel scenario

24:24is where I already have a video and I'm

24:26trying to edit it in an interesting way I

24:28might want to stylize it and all of that

24:30I think we're already seeing that even

24:31in products um with Runway for

24:34example right so I think that we'll

24:36probably see much more of uh we're

24:37already seeing that and I think we'll

24:39see more of where you already have a

24:40video that you're starting with and then

24:42you're trying to edit it um which has

24:44similarities too but is a little bit is

24:46a little bit different in my mind uh

24:48compared to sort of generating something

24:50from scratch and wanting control over

24:52that the other um potential uh part of

24:55output for videos obviously is

24:56text-to-speech or some sort of voice or

24:59other ways to sort of accompany the

25:01video or animate it what is your view in

25:03terms of the state of the art of

25:04text-to-speech systems and how those are

25:06evolving

25:07I think I've

25:11quite as much what I have tracked a

25:13little bit more closely is things like

25:15text to audio where you might say that

25:17the sound of a car driving down the

25:19street and what you expect is sort of

25:21effect of a car driving down the street

25:23uh to be generated

25:25um and so there uh the state of the art

25:27right now is

25:29um sort of roughly sort of a few seconds

25:32to tens of seconds long audio

25:36um and I would say that roughly it

25:38probably works reasonably well uh one in

25:40five times or so

25:43um it's like because there aren't

25:44concrete metrics it's kind of hard to

25:46articulate where state of the art is but

25:49hopefully this is this is helpful and I

25:51do think that

25:53um audio added to visual content makes

25:57it much more expressive and much more

25:59delightful and I do think that it tends

26:01to be under invested uh both for audio

26:04similarly for music I think it just

26:06makes the content much more expressive

26:08much more delightful

26:09um but I feel like we don't do enough of

26:11that yeah it's interesting too because

26:13there are actually very large sound

26:14effect libraries out there they're very

26:16well labeled as well in terms of what

26:18the exact sound effect is and the length

26:19and the components and all the rest and

26:21so yeah it's interesting that the state

26:23of the art hasn't quite caught up with

26:24with you know what used to be a really

26:26interesting old business where you

26:28generate an enormous amount of IP for

26:29different sound effects and then you

26:30just license them out

26:32yeah yeah which it seems like eventually

26:34that industry is likely to go away so

26:36and even with audio similar to what we

26:38were talking about with video there is

26:39the right like the same kinds of

26:40challenges and dimensions exist that you

26:42want the piece to be longer you may want

26:45compositionality right I might want to

26:47be able to say that well first it's the

26:48car driving down the street and then

26:49there is a sound of I don't know a baby

26:52crying and then something else and maybe

26:54I'm saying that two of these sounds are

26:55happening simultaneously which is not

26:57like that's something that can happen in

26:59audio where you can have the

27:00superposition

27:01um but in video is not something that

27:03would uh where it's quite as natural and

27:06so all of that isn't stuff that these

27:08models can do uh very well right now if

27:11I described a complex sequence of sounds

27:13or if I try to talk about these

27:15different sounds simultaneously

27:17um these models can't do that very well
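
[Editor's sketch] The two kinds of audio compositionality mentioned above, one sound after another versus two sounds at the same time, correspond to two simple waveform operations. The sine-wave `tone` function merely stands in for generated sound effects, and the sample rate and frequencies are arbitrary assumptions for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second, chosen arbitrarily for the sketch


def tone(freq_hz, seconds):
    """Stand-in for a generated sound effect: a plain sine wave."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def in_sequence(a, b):
    """'First the car, then the baby': concatenate along time."""
    return a + b


def simultaneously(a, b):
    """Both sounds at once: average sample by sample, padding with silence."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [0.5 * (x + y) for x, y in zip(a, b)]


car = tone(110.0, 1.0)   # hypothetical "car" sound
baby = tone(440.0, 0.5)  # hypothetical "baby" sound
sequence = in_sequence(car, baby)
overlay = simultaneously(car, baby)
```

Mixing waveforms is the easy part; the difficulty described in the conversation is getting a model to produce the right sequence or overlay from a text description.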

27:19where do you think we'll see the first

27:21application areas or what do you think

27:22are the first sort of use cases that

27:24we'll see immediately and then how does

27:25that evolve over time

27:27yeah I think

27:28I think not too much

27:31have the strongest intuitions there

27:33um but I think for like kind of like I

27:36was touching on earlier that a lot of

27:38these situations where we find ourselves

27:40searching for things to express

27:41ourselves

27:43um I think thinking of whether that can

27:44be generated so that it's a closer

27:47reflection of what you're trying to

27:49communicate is likely things that we'll

27:51see and I know we're not talking about

27:52sort of LLMs and conversational agents

27:55and all of that too much but I think AI

27:56agents is going to be a thing that we'll

27:58see a whole bunch of um across many

28:01different surfaces and then thinking

28:03about what media creation looks like in

28:05the context of AI agents is is another

28:07uh dimension to this yeah that makes a

28:10lot of sense I mean there's all sorts of

28:12obvious sort of near-term applications

28:14in terms of generating your own animated

28:16gifs or to your point

28:18um mid-stream video editing

28:21or you know different types of shorter

28:23form animations or other things that

28:25you could do marketing etc and so you

28:28know it definitely feels like there are

28:29some near-term applications and some

28:30longer term ones and then the thing I

28:31always find interesting about these

28:32sorts of technologies is the spaces

28:34where they kind of emerge in a way that

28:36you don't quite expect but end up being

28:38a primary use case you know it's sort of

28:39the Uber version of the

28:41of the mobile revolution where you push

28:43a button and a stranger picks you up in

28:44a car and you're fine with it right

28:46um and it feels like those sort of

28:48unexpected delightful experiences are

28:49are going to be very exciting in terms

28:51of a lot of areas of this field

29:01not that we can kind of directly plug

29:03into and kind of see or did the metric

29:05go up did the metric go down I think

29:06there's a lot of just kind of thinking

29:08about where do we anticipate people will

29:11be excited to use this and I as you said

29:13I think I there's a very good chance

29:15that there will be things that we don't

29:17necessarily foresee um but just kind of

29:19come up as very exciting spaces

29:23um I I think an interesting cynicism has

29:26been that like there aren't that many

29:27artists out there or like people don't

29:29want to create imagery uh when when

29:33looking at some of these uh generative

29:35like much less video when looking at

29:37some of these generative technologies

29:38but you know the the recent history of

29:42social media would say that's like

29:43certainly not true right um if you look

29:46at Instagram democratizing photography

29:48or TikTok democratizing short form

29:51video creation by just like reducing the

29:53number of parameters of control right as

29:55you said like you know sound makes video

29:58much richer but it's also really hard to

30:01produce any one of these pieces so you

30:03just take one control away and like you

30:05know record with your phone and you get

30:07you get something like TikTok but I

30:09think it's really exciting like the

30:11explosion in usage of things like

30:14Midjourney right because the traction

30:16suggests there are an awful lot of people

30:18who are actually interested in

30:19generating high quality imagery for a

30:22whole range of use cases professional or

30:24not yeah yeah and I think there's people

30:27across the entire Spectrum right like on

30:29one hand you can talk about artists who

30:31already had a voice were already

30:32involved in sort of um creating

30:35art and then the other end of the

30:37spectrum are people who don't

30:39necessarily have the skills may not have

30:41had the training but are still

30:42interested in being able to express

30:44their voices a little bit more

30:45creatively than they would have

30:47otherwise and so I do think that there

30:50is one question of whether or not

30:51artists want to be engaging with this

30:54technology and there is the other

30:55question of does it kind of just lift

30:57the tide for all of the rest of us to be

31:00able to be more

31:01um expressive in in what we can uh

31:03create and what we can communicate

31:06um and so I think both of those ends are

31:08are relevant here

31:10um and with artists there are artists

31:11who are like whose sort of brand is AI

31:14artists where they are explicitly using

31:16AI as the tool of choice

31:19um for expressing themselves and their

31:21entire practice is around that someone

31:22like Sofia Crespo or Scott Eaton and

31:25others

31:26um so there's also and this was before

31:28Midjourney or anything like that right

31:30like they've been doing this for years

31:31um even with like GANs for example that

31:33existed that were popular before

31:36yeah you're an artist yourself both um

31:39you know digital AI driven analog

31:43um some of it's behind you like how does

31:44that impact like your view of this

31:47I I kind of always

31:49um hesitate a little bit to call myself

31:51an artist I feel like somebody else

31:52should be deciding whether I'm an artist

31:54or not but then there's this whole

31:55community we'll say you're an artist

31:56yeah by the way we should mention um

31:59some of your lovely macrame art is on

32:01the wall behind you as well so I think

32:03it looks great yeah thank you thank you

32:05and yeah so to be honest I don't know if

32:08I it's hard to kind of look back and get

32:11a sense for did that play a certain role

32:13in it or not I know for sure that it

32:15plays a role in just how excited I am

32:17about this technology that anytime

32:19there's some new model out there whether

32:21it's from the teams that I'm working

32:23with or if it's something external I'm

32:25definitely very enthusiastic to want to

32:27try it out and see what it can do what

32:28it can't do and sort of tell people

32:29about it and so just kind of my baseline

32:31level of excitement around this

32:33technology

32:34um is high uh in part because of all

32:37these other interests that I have I I'm

32:39pretty sure that my emphasis on control

32:41is probably also coming from that where

32:43I feel like I want to be using these

32:45tools to kind of have them do this thing

32:48that I want to do and sort of text

32:50prompts are restrictive in that way

32:53um and I mean the we talked about it in

32:55the context of control that if you can

32:57bring in multiple modalities as input

32:59that definitely gives you more control

33:00but it also means that there is more

33:02space to be creative right I can now

33:04pick interesting images or interesting

33:06videos or interesting pieces of audio

33:08then pair that up with like this really

33:10interesting text prompt and just kind of

33:12see what happens like if I put all of

33:14this in you don't know what the model is

33:15necessarily going to do as it's also

33:17just more knobs to play with

33:20um as you're as you're trying to

33:22interact with these that yeah there's

33:24just more more space to be creative if

33:26there's more ways of more knobs to

33:28control these models

33:30yeah I was talking to um Alex Israel

33:32who's an LA based artist and uh he's not

33:36a technical guy but an amazing artist

33:38and he was describing this new video

33:41project he wants to do that involves use

33:44of AI and I was very inspired by like

33:46how specific the vision was and like

33:49thinking through the implementation a

33:50little bit for somebody who doesn't come

33:52from the technical field and imagine

33:54there would be a whole crop of people

33:55who look at the capabilities as another

33:57tool for expression yeah yeah and so and

34:01there are some people who have a very

34:03specific vision and they just want the

34:04tool to kind of help them get there and

34:06then there are others whose process

34:08involves sort of bringing the model

34:10along where the unpredictability and

34:13sort of not necessarily knowing what

34:14this model is going to generate is a

34:16part of their process and is a part of

34:17the final piece that they create so some

34:20of them view these models very much

34:22as tools and then others tend to view

34:24them as more of a collaborator

34:27um in this process of of creating and

34:29it's always interesting to see what end

34:31of the spectrum different people lie on

34:33yeah okay so as we're nearing the end of

34:36our time together we want to run through

34:37a few rapid fire questions if that's

34:39okay

34:40um maybe I'll start with one just given

34:42your breadth in the field

34:44um is there an area of image audio video

34:48generation understanding control that

34:51you feel like is um just under explored

34:53for people looking for research problems

34:57yeah so one is the control piece that we

35:00already talked about quite a bit and I

35:01think the other is

35:03multi-modality like bringing all of

35:04these modalities together right now we

35:06have models that can generate text we

35:08have models that can generate images

35:09models that can generate video but

35:12there's no reason these all need to be

35:13independent you can envision systems

35:15that are sort of ingesting all of these

35:16modalities understanding all of it and

35:18generating all of these modalities

35:21um and I haven't I'm starting to see

35:23some work in that direction but I

35:24haven't seen a whole lot of it that goes

35:27across many different modalities

35:29you just got back from CVPR and

35:31presented there can you mention both

35:34what you were talking about and then

35:35sort of the project or work that most

35:38inspired you there yes I was at CVPR

35:42um I was on a few different panels and I

35:44was giving some talks so one of it was

35:45on vision language and creativity at the

35:48at the main conference

35:49um and so yeah that was kind of what I

35:51was presenting there

35:52um in terms of something exciting that I

35:54saw the

35:56um not necessarily a paper but there was

35:58a workshop there called uh Scholars and

36:01Big Models where the topic of discussion

36:04was as these models are getting larger

36:06and larger making a lot of progress in

36:09what way can sort of academics or labs

36:11that don't have sort of as many

36:13compute resources what should their

36:15strategy be how should they be

36:17approaching these things

36:19um and that I thought was a really nice

36:20discussion in general I tend to enjoy

36:22venues that talk about kind of the meta

36:25like the human aspects of the work that

36:27we do we always we have a lot of

36:28technical sessions but we don't tend to

36:30talk about these other components and so

36:32that Workshop is something that I

36:34enjoyed quite a bit

36:36um is there a prediction you'd make

36:38about the impact of all these

36:40technologies on social media since

36:42you're working at the intersection

36:43yeah there's going to be more of it

36:45that's one that's one prediction that I

36:48can uh very confidently make that we are

36:50going to see uh all of these tools sort

36:52of show up where uh millions and

36:54billions of people can be can be using

36:56these in various forms

36:58um and I'm I'm excited about that like I

37:01said I think it kind of enhances uh

37:03creative expression and just sort of

37:05communication and we're going to have

37:06these entirely new ways of interacting

37:09with each other

37:10um then even the social entities on

37:13these networks might change right like

37:15when we talk about AI agents and you

37:17think about them being sort of part of

37:19the social graph

37:20um what what that does and how that

37:23changes how we connect with each other

37:24all of that is is fascinating and I

37:26can't I can't wait to see how that

37:28evolves it actually feels a little bit

37:30under discussed in terms of how much

37:32impact it's already had in some ways

37:33even if it sometimes is transitory like

37:35a Lensa or you know there's my

37:38understanding is tens of millions of

37:39people at least who ended up using that

37:40product there's Character in terms of

37:42the engagement there one could argue

37:45Midjourney for certain types of art you

37:46share with friends etc and so I feel

37:48like there's already these

37:49social expression modalities bot-based

37:52interactions etc that are already

37:54impacting aspects of social or

37:56communicative media in ways that um

37:58people don't really recognize it as such

37:59so to your point it's very exciting to

38:01see where all this is heading yeah yeah

38:03Devi you also write quite a bit online

38:06in in terms of sort of giving back to

38:08the community and um advice for

38:10researchers or young people newer to the

38:13fields you talk about time management

38:14and other topics what's one one piece of

38:17wisdom you'd offer to people in terms of

38:19productivity and joy in AI research

38:28blog post on that its sort of main

38:30philosophy is that you should be writing

38:33everything that you want to do down on

38:36your calendar so it's kind of the point

38:37is that it should be on your calendar it

38:39shouldn't be your to-do list and the

38:40reason it should be on your calendar is

38:41it forces you to think through how much

38:43time everything is going to take if it's

38:45just a list you have no idea how long

38:46it's going to take and that's not a good

38:48way to plan your time out so that's kind

38:50of the main thesis of it

38:52um that I hadn't anticipated but it

38:54resonated a whole lot with many people

38:55which was kind of surprising when I put

38:57it out there so yeah if anyone's

38:58interested

38:59um you should check that out in terms of

39:01advice outside of what I've written one

39:04advice that has stuck with me over the

39:06years is don't self-select that if you

39:09want something go for it if you want a

39:11job apply for it if you want a

39:13fellowship for any students who might be

39:15listening just apply for it you wanted

39:17an internship just apply for it

39:19um and yeah like don't don't assume

39:22don't question oh am I good enough am I

39:24not

39:25um it's on the world to say no to you if

39:27you are not a good fit the world will

39:29tell you that and so yeah there's

39:31nothing to lose by just kind of giving

39:33it a shot so don't self-select it's a

39:35great note to end on Devi thank you so

39:37much for joining us on No Priors thank

39:39you thank you for having me ah thanks

39:41for the time
