Text prompts are democratizing creative expression, and the holy grail is AI-generated and edited video. Elad Gil and I sit down with Devi Parikh. She's a research director in generative AI at Meta, a leading researcher in multimodality and AI for visual, audio, and video, and she's an associate professor in the School of Interactive Computing at Georgia Tech. Recently she worked on Make-A-Video 3D, which creates animations from text prompts. She's also a talented artist herself. Devi, welcome to No Priors.

Thank you, thank you for having me.

Let's start with your background and how you got started in computer vision. I've heard you say you choose projects based on what brings you joy. Is that how you got into AI research?
Kind of, kind of, yeah. My background is that I grew up in India and then I moved to the US after high school. I went to a small school called Rowan University in southern New Jersey for my undergrad, and that is where I first got exposed to what at the time was being called pattern recognition; we weren't even calling it machine learning. I got exposed to some research projects there. There was a professor there who showed some interest in me and thought I might have the potential to contribute meaningfully to research, and that's how I got exposed. I really, really enjoyed what I was doing there and decided to go to grad school at Carnegie Mellon. I knew I was enjoying it, but I wasn't sure if I wanted to do a PhD, so at first I wanted to just get a master's degree with a thesis where I could do some research. But the year that I applied, the ECE department at CMU decided that there wasn't going to be a thesis track for the master's: either you just take courses or you go for a PhD. So they kind of slotted me onto the PhD track, which I wasn't so sure of, but my advisor there was reasonably confident that I was going to enjoy it and was going to want to keep going. So yeah, that's how I got started.

At first you were doing projects that didn't have a visual element to them. How did you pick a thesis project?
So at first I wasn't; I was working on projects that didn't have too much of a visual element to them. But when I got to CMU, my advisor's lab was working in image processing and computer vision, and I always thought it was pretty cool that everybody gets to look at the outputs of their algorithms and see what they're doing, whereas if it's non-visual then, yeah, you see these metrics, but you don't really have a sense for what's happening, whether it's working or not. And so that's how I got interested in computer vision, and that then defined the topic of my thesis over the course of my PhD.
So you have been working in machine learning long enough that, as you said, it was called pattern recognition, and you've worked across a bunch of different modalities. How has that changed your research path? Things like diffusion models and GANs and large Transformers, none of that existed when you were first starting, and I think you have managed to translate or transition your interests in a way that keeps you on the cutting edge. How does that happen?
Yes. I mean, you can always look back and try to find patterns; when you're actually doing it you don't necessarily have a grand strategy in mind. But when I look back, I think one common theme that led to me transitioning across topics was that I was interested in seeing how we can get humans to interact with machines in more meaningful ways. Even my transition from non-visual to visual modalities, in hindsight, was essentially that I felt you can't interact with these systems very much if what you're looking at is these abstract modalities. Then, when I was working in computer vision, I wanted to find ways for humans to be able to interact with these systems more, so I started looking at attributes and adjectives, like something is furry or something is shiny, and using that as a mode of communication between humans and machines, both for humans to teach machines new concepts and for machines to be more interpretable, explaining why they're making the decisions they're making. That slowly led to more natural language processing, where instead of just adjectives and attributes we were looking at natural language as a way of interacting; a lot of my work in visual question answering, where you're answering questions about images, and in image captioning was coming from there. And then over time I started thinking, are there ways to go even deeper in this interaction? Are there ways AI tools can enhance creative expression for people, give them more tools for expressing themselves? That's how I got interested in AI for creativity. I was dabbling in a few fairly random projects a few years ago while my bread-and-butter research was still multimodal AI, but I enjoyed what I was doing with AI for creativity, and a couple of years ago that took a little bit more of a serious turn, where I made it more of my full-fledged research agenda. That's how I started working more seriously on generative modeling, including Transformer-based approaches and diffusion models, for images, for video, for 3D, things of that sort.
So you became a professor, and you also now work in industry at Meta. What brought you there?
I've been at Meta for about seven years now. It started when I was transitioning from Virginia Tech to Georgia Tech: I was an assistant professor at Virginia Tech and I was getting started at Georgia Tech, and in that transition I decided to spend a year at FAIR. At the time it was called Facebook AI Research; now it's Fundamental AI Research at Meta. I knew colleagues there; some of them had been at Microsoft Research before that, and I had interned at MSR and spent summers at MSR. So I had a lot of colleagues who I knew, and I thought it would be fun to spend a year, collaborate with them, and get to know what FAIR is like. It was supposed to be a one-year stint, and then I was going to go back to Georgia Tech and continue there. But in that one year I enjoyed it enough, and I think FAIR enjoyed having me around enough, that we tried to figure out whether there was a way to keep it going. And so for many years after that, for five years, I was splitting my time: every fall I would go back to Georgia Tech in Atlanta, spend the fall semester there, and teach, and then the rest of the year I would be in Menlo Park at Facebook, now Meta.
And you transitioned from fundamental AI research to a new generative AI group. Can you talk about why that's interesting to Meta and what kind of things you're working on now?
Yeah, it's a very exciting space right now. There's a lot happening, both within Meta and outside, as I'm sure many of the people listening to this are aware. The new organization was created a few months ago, so not a long time ago, and it's looking at things like large language models, image generation, video generation, generating 3D content, all sorts of modalities that you might think of.

Why is it interesting? Right now, if you think about all the content we consume, in all modalities and all sorts of forms, it makes a lot of sense to ask whether, maybe not instead of, but in addition to, all of this consumption, more of us can be creating more of this content. For almost everything you can think of, images, video, you can ask this question: in any situation where you're searching for something, trying to find something, it's relevant to ask, well, could I just create what it is that I'm looking for? When you think of it that way, you can see how it can touch a lot of different things across a variety of products and surfaces.
Yeah, that makes a ton of sense. I think we'll come back and ask you some questions about images and audio and a few other things, since you've done so much interesting work across so many different areas, but maybe we can start with video generation, in part due to the fact that you had a really interesting recent project called Make-A-Video. In that approach, users can generate video with a text prompt, so you could type in, say, "a corgi playing with a ball," and it would generate a short video of a corgi. Could you tell us a bit more about that project, how it started, and also how it works and what the basis for the approach is?
Yeah, so Make-A-Video. It started, I mean, this was a couple of years ago, before DALL-E 2, by the way; this was after DALL-E 1 had happened. I feel like a lot of people don't even remember DALL-E 1 anymore; people don't even talk about it, and it's fun to go check out what those images look like. That had blown our minds at the time, but now when you go back you're like, wait, that's not even interesting. But anyway, we had seen a lot of progress in image generation, and so the next entirely open question, where we hadn't seen much work at all, was what we can do with video generation. That was the inspiration.
For Make-A-Video specifically, the thinking behind the approach was: we have these image generation models, and by this time we had seen a lot of progress with diffusion-based models from a variety of different institutions. So the idea was, is there a way of leveraging all the progress that has happened with images to make video generation possible? That led to this intuition: what if we use images and associated text as a way of learning what the world looks like and how people talk about visual content, and then separate that out from trying to learn how things in the world move? So separate out appearance and language and that correspondence from motion, from how things move. That is what led to Make-A-Video.

There are multiple advantages to thinking of it that way. One is that there's less for the model to learn, because you're directly bringing in everything you already know about images to start with. The second is that all of the diversity we have in our image data sets, which image models already know, all sorts of fantastical depictions like dragons and unicorns and things like that, where you may not have as much video data easily available, all of that is inherited. So all the diversity of the visual concepts can come in through images even if your video data set doesn't have all of that. And the third benefit, maybe the biggest one, is that because of the separation you don't need video and associated text as paired data. You have images and text as paired data, and then you just have unlabeled video to learn from. So these were three things that we thought were quite interesting in how we approached Make-A-Video.
Concretely, the way it works is that when you initialize the model, you're starting off with image parameters that have already been learned. So before you do any training for Make-A-Video, we set it up so that it can generate a few frames that are not temporally coherent. They're going to be independent images: the corgi playing with the ball is just going to be independent images of corgis playing with a ball, but they're not going to be temporally coherent. And then what the network is trying to do as it goes through the learning process is to make these images temporally coherent, so that at the end it is generating a video rather than just unrelated images. That's where the videos come in as training data.
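To make that separation concrete, here is a rough, hypothetical sketch in PyTorch of the general recipe described above, not the actual Make-A-Video code or architecture: a frozen spatial block stands in for a pretrained text-to-image model, and a temporal block, initialized to behave like the identity, is the only part left to train on unlabeled video so that independently generated frames become coherent over time.

```python
# Hypothetical sketch of factorized spatial + temporal processing for video.
# Assumptions: the spatial block stands in for pretrained image-model layers,
# and only the temporal block would be trained, on unlabeled video.
import torch
import torch.nn as nn

class SpatialBlock(nn.Module):
    """Stand-in for a pretrained text-to-image block (kept frozen below)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):  # x: (batch*frames, channels, H, W)
        return self.conv(x)

class TemporalBlock(nn.Module):
    """1D convolution across the time axis, initialized so its residual is zero:
    before any video training, the model behaves like the image model per frame."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):  # x: (batch, frames, channels, H, W)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)  # treat time as the sequence
        y = self.conv(y)
        y = y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return x + y  # residual: the temporal layer only adjusts the frames

class FactorizedVideoBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = SpatialBlock(channels)    # would be loaded from an image model
        self.temporal = TemporalBlock(channels)  # trained on unlabeled video
        for p in self.spatial.parameters():      # keep the image knowledge fixed
            p.requires_grad = False

    def forward(self, x):  # x: (batch, frames, channels, H, W)
        b, t, c, h, w = x.shape
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return self.temporal(x)

if __name__ == "__main__":
    video_features = torch.randn(2, 8, 16, 32, 32)  # two clips of 8 frames of feature maps
    block = FactorizedVideoBlock(16)
    print(block(video_features).shape)  # torch.Size([2, 8, 16, 32, 32])
```

A zero-initialized residual is one common way to get the behavior described here: at initialization the stack produces frames that are just independent image-model outputs, and training on unlabeled video then teaches the temporal layers to tie them together.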
That's a great explanation. One question: if we use an example of something that is not going to be in your video training set, say I want a flying corgi, how should I think about how the model comes up with that motion?
Yeah, one way of thinking of it could be that you may not have seen a flying corgi, but you've probably seen flying airplanes or flying birds and other things, in images and in video. From images you will have text associated with them, so you will have a sense for what things tend to look like when someone says, oh, this is X flying or Y flying. In videos you will have seen the motion of what stuff looks like when it flies, and in images you'll have seen what corgis look like. It's hard to know for sure what it is that these models learn; interpretability is not a strength of many of these large, deep architectures, but that could be one intuitive explanation for how the model manages to figure out what a flying corgi might look like.
What are some of the major forward-looking aspects of this sort of project and research?
I think there's a ton to do in the context of video generation. If you look at Make-A-Video, it was very exciting, sort of first-of-its-kind capabilities at the time, but it's still a four-second video. It's essentially an animated image: it's the same scene, the same set of objects moving around in reasonable ways, but you're not seeing objects appear, objects disappear, you're not seeing objects reappear, you're not seeing scene transitions. None of that is in there. If you think about the complexity of videos that you regularly come across on various surfaces, this is far from that. So there is a ton to be done in terms of making these videos longer and more complex, having memory so that if an object reappears it's actually consistent and doesn't now look entirely different, things of that sort, and being able to tell more complex stories through videos. All of this is entirely open.
And I know these things are always extremely hard to predict, but if you look forward a year or two, what do you think the state of the art will be in terms of length of video, complexity of the scenes that you can animate, things like that?
That is hard to say, and I would have expected to have seen more of this already. Make-A-Video was, I think, what, nine months or so ago, maybe approaching a year, and it's not like, even from other institutions, we are seeing amazingly longer videos or significantly higher resolution or much more complexity. We're still in this "video equals animated image" mode, and yeah, maybe the resolution is a little bit bigger and the quality is a little bit higher, but it's not like we've made significant breakthroughs, unlike, for example, what we've seen with images. So that has given me a sense that maybe this is harder than we might think. Our usual curves, like with language models or image models, where we say, oh, just six more months and there's going to be something else that's an entirely different step change, I think that might be harder in video, and I wonder if there is something we're fundamentally missing in terms of how we approach video. So it's not quite answering what you asked me, but I do think it might be a little bit slower than we might have guessed just based on progress in other modalities.
15:00on problems and other modalities what do
15:02you think is the main either challenge
15:05that you think has slowed progress in
15:07this field or not slowed it I mean
15:08obviously there's been you know every a
15:10lot of people are working very hard on
15:11these problems but to your point it
15:13seems like sometimes you have these
15:14fundamental breakthroughs and sometimes
15:15it's an architecture like Transformer
15:17based models versus traditional NLP and
15:21um you know iterating on a lot of
15:23other things that already exist in the
15:25pre-existing approaches and just sort of
15:26solving specific engineering or
15:28technical challenges
15:29if you were to sort of list out the
15:31bottle next to this why do you think
15:32they're likely to be
Yeah, I think there are a few different things. One is that videos are just, from an infrastructure perspective, harder to work with. They're larger, they're more storage, they're more expensive to process and more expensive to generate, and all of that, so the iteration cycle is much slower with video than it would be with other modalities. That is one.

The second is that I don't think we've figured out the right representations for video yet. There is a lot of redundancy in video; from one frame to the next frame there's not a whole lot that changes. We still approach frames fairly independently, as sort of independent images, whether you're generating them one after the other or generating in parallel and then making it finer-grained. So I think that could be something that helps with a breakthrough, if we really figure out how to represent videos efficiently.

The third is hierarchical architecture. If you want longer videos, there are just so many pixels you're trying to generate; it's a very, very high-dimensional signal compared to anything else we're working with. So thinking through how we even approach that, what sort of hierarchical representation makes sense, especially if you want scene transitions and if you want consistency, figuring those architectural pieces out is, I think, maybe another piece of this puzzle.

And then finally, data. Data is gold in anything we're trying to do, and I don't know if, as a community, we've quite built the muscle of thinking about the data, sort of massaging the data and all of that, in the context of video. We have that muscle quite a bit with language, quite a bit with images, but with video we're perhaps not quite there yet.
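As a toy, synthetic-data illustration of the redundancy point raised above (not any real video representation or codec): consecutive frames share most of their content, so the frame-to-frame differences carry far less signal than the frames themselves, which is part of why treating frames as independent images is wasteful.

```python
# Toy illustration of temporal redundancy with synthetic "frames".
# Assumption: a static background plus a small per-frame drift stands in for real video.
import numpy as np

rng = np.random.default_rng(0)
background = rng.random((64, 64))                       # shared static content
drift = rng.random((16, 64, 64)).cumsum(axis=0) * 0.01  # small changes frame to frame
frames = background + drift                             # 16 frames, mostly identical

frame_energy = np.abs(frames[1:]).mean()
delta_energy = np.abs(np.diff(frames, axis=0)).mean()   # what actually changes per frame
print(f"mean |frame| = {frame_energy:.3f}, mean |frame-to-frame delta| = {delta_energy:.3f}")
# The deltas are roughly two orders of magnitude smaller than the frames.
```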
What would be the ideal training set for video in that case, or what's lacking from the existing data?
Yeah, I think what's lacking may not just be the data source itself, although that is certainly a challenge, as it is with other modalities. I think it might also be the data recipes. Do we want to start training with very short videos where not much is happening and the scene isn't really changing? That also tends to limit the motion; there's just not much happening, and so you're going to end up with these kinds of animated images. On the other hand, you have very complex videos that might be multiple minutes long with all sorts of scene transitions, and that's ideally what you want to shoot for, that's where you want to get. But if you just directly throw all of that at the network, it's unclear whether the models will pick up all of that complexity well. So I think thinking through some sort of a curriculum may be valuable here, and I don't think we've quite nailed that.
I feel like every generation of technology shifts always runs into video being the hardest thing to do. If you look at the first instantiation of the web, one of the reasons YouTube sold was the infrastructure point you made earlier: just dealing with that huge amount of streaming and the costs associated with it and everything else, even in that prior generation of just, can we host and stream this effectively, in part led to them getting sold to Google reasonably early in the life of the company. So it's interesting how video is always that much more complicated.
Yeah, and same thing for computer vision, right? Here we're talking about generation, but even just understanding: with image understanding there was so much progress being made, and video was always not only trailing behind but continuing to be harder. Even the rate of progress was slower, not just the absolute progress. And I think we're seeing some of that for generative models as well.
Devi, I know this isn't within your core field, but I'm sure you also pay attention: how do you think advances in video may impact robotics?
There, the video understanding piece is probably more relevant than the video generation piece. If you think of embodied agents, they're moving around and consuming visual content, which inherently is video; they're not looking at static images. So I think the video understanding piece is very relevant there. What's also interesting in the context of embodied agents, or robotics, physical robots that are moving around, is that it's not passive consumption of video. It's not like how you and I might be watching videos on YouTube or anything else.

I'm yelling at the screen, I'm not passive!

It's that the next visual signal the robot will see will be a consequence of the action the robot had taken. If it chose to move a certain way, that's going to change what the video looks like in the next few moments. So there's an interesting feedback loop there: the robot knows what action it took, and it sees how that changed the visual signal it is now getting as input. That connection adds a layer of interestingness to how it can process the video, in contrast to regular, disembodied computer vision tasks, where the video is just happening and you're not controlling what you're seeing.
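Here is a minimal, entirely hypothetical toy loop (not a real robotics or embodied-AI stack) that shows the feedback structure being described: each new observation is a function of the action the agent just took, which is what separates embodied video from passively streamed video.

```python
# Toy action-conditioned observation loop: what the agent "sees" next
# depends on what it just did. The environment here is invented for illustration.
import numpy as np

class ToyWorld:
    """The agent moves a bright dot around a small image."""
    def __init__(self, size: int = 32):
        self.size = size
        self.pos = np.array([size // 2, size // 2])

    def step(self, action: np.ndarray) -> np.ndarray:
        self.pos = np.clip(self.pos + action, 0, self.size - 1)
        frame = np.zeros((self.size, self.size), dtype=np.float32)
        frame[self.pos[0], self.pos[1]] = 1.0
        return frame  # the observation is a consequence of the action

world = ToyWorld()
rng = np.random.default_rng(0)
history = []
for _ in range(5):
    action = rng.integers(-1, 2, size=2)   # move one step in x and/or y
    frame = world.step(action)
    history.append((action, frame))        # (what I did, what I then saw)
print([a.tolist() for a, _ in history])
```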
You started by saying that human interaction was a big driving force in your research interests, going beyond metrics as outputs and even language as inputs. How do you think about controllability in video, and how important is text prompting to the next generation of creation?
Yeah, I think that's very important, exactly to your point. If we want these generative models, not just for video but for any modality, to be tools for creative expression, then they need to generate content that corresponds to what someone wants to express. They have to bring somebody's voice to life, and that is not possible if there aren't good enough ways of controlling these models. Text is one way; it's better than random samples, and it's one way in which I can say what I want. But right now, for the most part, you type in a text prompt, you get an image back or a video back, and you either take it or leave it. If you like it, that's great; if not, you're just going to try again, maybe tweak the prompt a little bit, try a whole bunch of these prompt engineering tricks, and hope that you get lucky. But it's not really a very direct form of control.

So I think of more control in at least two ways. One is to allow for prompts that are not just text but are multimodal themselves. For image generation, for example, instead of just text it would be nice if I could sketch out what I want the composition of the scene to look like, and the model would be expected to respect that. For video, instead of just text as input, maybe I can also provide an image as input to tell the system this is the kind of scene I want; maybe I can provide a little audio clip as input to convey that this is the kind of audio or sound I want associated with it; maybe I also bring in a short video clip, and I expect the model to bring in all of these different modalities to generate a video. So that's one piece: I can bring in more inputs as a way of getting more control.

The second piece is predictability. Even if I bring in all of these modalities as input, if the model then goes off and does its own thing with them, maybe the result is reasonable, but it's not what I'm looking for. What do I do then? Do I just go back and try again? It would be ideal if there were some iterative editing mechanism, where whatever I get back, I have a way of communicating to the model what I want changed and in what way, so that over iterations I can get to the content I intended in a fairly reasonable way, without having to spend hours learning a tool on YouTube or something like that. If that can be done with a very intuitive interface, I think that would be pretty awesome.
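To make those two kinds of control concrete, here is a purely hypothetical sketch of what a multimodal prompt plus an iterative-edit loop could look like as an interface. None of these types, fields, or names correspond to a real product or API; they are only meant to show the shape of the idea.

```python
# Hypothetical interface sketch: multimodal prompts (more inputs as control)
# and iterative edit requests (predictability over successive rounds).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MultimodalPrompt:
    text: str                                # what I want, in words
    reference_image: Optional[bytes] = None  # the kind of scene I mean
    reference_audio: Optional[bytes] = None  # the sound I want paired with it
    reference_video: Optional[bytes] = None  # motion or style to borrow
    layout_sketch: Optional[bytes] = None    # rough composition of the scene

@dataclass
class EditRequest:
    """One step of the iterative loop: point at the last result, say what to change."""
    previous_result_id: str
    change_description: str                  # e.g. "keep the scene, slow the camera down"

@dataclass
class CreationSession:
    prompt: MultimodalPrompt
    edits: List[EditRequest] = field(default_factory=list)
```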
Where do you think we will get to in terms of the frontier of controls for video generation over the next couple of years, or five years?
I think control tends to lag behind the core capability. Even with images, I feel like we first had to get to a point where these models can actually generate nice-looking images before we start worrying about whether it's really doing what I wanted it to do, and I feel like we're not quite there with video yet.

So get random good first.

Exactly, exactly. At least get random good first, then maybe let me give it text, then let me give it these other modalities. So I do think we'll first see more progress in just the core capabilities of text-to-video before we look at richer prompting. Although that's in the context of generating something from scratch, which is where I might want iterative control and things like that. A parallel scenario is where I already have a video and I'm trying to edit it in an interesting way; I might want to stylize it and all of that. I think we're already seeing that in products, with Runway for example, and I think we'll see more of it: you already have a video that you're starting with and you're trying to edit it. That has similarities, but in my mind it's a little bit different from generating something from scratch and wanting control over that.
The other potential part of the output for video, obviously, is text-to-speech, or some sort of voice, or other ways to accompany the video or animate it. What is your view on the state of the art of text-to-speech systems and how those are evolving?
I haven't followed text-to-speech quite as much. What I have tracked a little bit more closely is things like text-to-audio, where you might say "the sound of a car driving down the street," and what you expect is the sound effect of a car driving down the street. There, the state of the art is roughly a few seconds to tens of seconds of audio, and I would say it probably works reasonably well maybe one try in a few. Because there aren't concrete metrics, it's kind of hard to articulate where the state of the art is, but hopefully that's helpful. And I think audio added to visual content makes it much more expressive and much more delightful, and I do think it tends to be under-invested in, both for audio and similarly for music. It just makes the content much more expressive, much more delightful, but I feel like we don't do enough of that.
Yeah, it's interesting too, because there are actually very large sound effect libraries out there, and they're very well labeled in terms of what the exact sound effect is and the length and the components and all the rest. So it's interesting that the state of the art hasn't quite caught up with what used to be a really interesting old business, where you generate an enormous amount of IP for different sound effects and then you just license them out.

Yeah, yeah.

It seems like eventually that industry is likely to go away.
And even with audio, similar to what we were talking about with video, the same kinds of challenges and dimensions exist. You want the piece to be longer, and you may want compositionality: I might want to say that first it's the car driving down the street, and then there is the sound of, I don't know, a baby crying, and then something else, and maybe I'm saying that two of these sounds are happening simultaneously, which is something that can happen in audio, where you can have that superposition, whereas in video it's not quite as natural. All of that isn't stuff these models can do very well right now. If I describe a complex sequence of sounds, or if I try to talk about different sounds happening simultaneously, these models can't do that very well.
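As a small, self-contained sketch (plain sine tones standing in for generated sounds, not a generative audio model), this is the compositional structure being described: one sound after another versus two sounds superimposed at the same time.

```python
# Sequential vs. simultaneous composition of two stand-in "sound effects".
import numpy as np

SR = 16_000  # sample rate in Hz

def tone(freq_hz: float, seconds: float) -> np.ndarray:
    t = np.linspace(0.0, seconds, int(SR * seconds), endpoint=False)
    return 0.5 * np.sin(2 * np.pi * freq_hz * t)

car = tone(110.0, 2.0)    # stand-in for "car driving down the street"
baby = tone(440.0, 2.0)   # stand-in for "baby crying"

sequential = np.concatenate([car, baby])     # first one sound, then the other
simultaneous = (car + baby) / 2.0            # both sounds at the same time
print(sequential.shape, simultaneous.shape)  # (64000,) (32000,)
```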
Where do you think we'll see the first application areas, or what do you think are the first use cases we'll see immediately, and then how does that evolve over time?
I don't know that I have the strongest intuitions there, but like I was touching on earlier, there are a lot of situations where we find ourselves searching for things to express, and asking whether that content can instead be generated, so that it's a closer reflection of what you're trying to communicate, is a likely thing we'll see. And I know we're not talking about LLMs and conversational agents too much here, but I think AI agents are going to be something we see a whole bunch of across many different surfaces, and thinking about what media creation looks like in the context of AI agents is another dimension to this.
Yeah, that makes a lot of sense. There are all sorts of obvious near-term applications in terms of generating your own animated GIFs, or, to your point, mid-stream video editing, or different types of shorter-form animations or other things you could do for marketing, etc. So it definitely feels like there are some near-term applications and some longer-term ones. And then the thing I always find interesting about these sorts of technologies is the spaces where they emerge in a way that you don't quite expect but that end up being a primary use case. It's sort of the Uber version of the mobile revolution, where you push a button and a stranger picks you up in a car and you're fine with it. And it feels like those sorts of unexpected, delightful experiences are going to be very exciting in a lot of areas of this field.
...not something we can directly plug into to see whether the metric went up or the metric went down. I think there's a lot of just thinking about where we anticipate people will be excited to use this, and, as you said, I think there's a very good chance that there will be things we don't necessarily foresee that just come up as very exciting spaces.
I think an interesting cynicism has been that there aren't that many artists out there, or that people don't want to create imagery, much less video, when looking at some of these generative technologies. But the recent history of social media would say that's certainly not true. If you look at Instagram democratizing photography, or TikTok democratizing short-form video creation by reducing the number of parameters of control: as you said, sound makes video much richer, but it's also really hard to produce any one of these pieces, so you take one control away, record with your phone, and you get something like TikTok. And I think the explosion in usage of things like Midjourney is really exciting, because the traction suggests there are an awful lot of people who are actually interested in generating high-quality imagery for a whole range of use cases, professional or not.
Yeah, and I think there are people across the entire spectrum. On one end you can talk about artists who already had a voice and were already involved in creating art, and on the other end of the spectrum are people who don't necessarily have the skills, who may not have had the training, but who are still interested in being able to express their voice a little bit more creatively than they would have otherwise. So I do think there is one question of whether or not artists want to be engaging with this technology, and there is the other question of whether it just lifts the tide for all of the rest of us to be more expressive in what we can create and what we can communicate. I think both of those ends are interesting. And with artists, there are artists whose brand is AI artist, who are explicitly using AI as their tool of choice for expressing themselves, and their entire practice is around that, someone like Sofia Crespo or Scott Eaton. And this was before Midjourney or anything like that; they've been doing this for years, even with GANs, for example, which existed and were popular before.
You're an artist yourself, both digital and AI-driven and analog; some of it is behind you. How does that impact your view of this space?
I hesitate a little bit to call myself an artist; I feel like somebody else should be deciding whether I'm an artist or not, but then there's this whole community...

We'll say you're an artist. And by the way, we should mention that some of your lovely macrame art is on the wall behind you as well. I think it looks great.

Yeah, thank you, thank you. So to be honest, it's hard to look back and get a sense for whether that played a certain role or not. I know for sure that it plays a role in just how excited I am about this technology: anytime there's some new model out there, whether it's from the teams I'm working with or something external, I'm definitely very enthusiastic to try it out, see what it can do and what it can't do, and tell people about it. So my baseline level of excitement around this space is high, in part because of all these other interests that I have. I'm pretty sure that my emphasis on control is probably also coming from that, where I feel like I want to be using these tools to have them do the thing that I want to do, and text prompts are restrictive in that way. And, I mean, we talked about it in the context of control: if you can bring in multiple modalities as input, that definitely gives you more control, but it also means there is more space to be creative. I can now pick interesting images or interesting videos or interesting pieces of audio, pair that up with a really interesting text prompt, and just see what happens. If I put all of this in, you don't know what the model is necessarily going to do, and there are also just more knobs to play with as you're trying to interact with these models. There's just more space to be creative if there are more knobs to control these models with.
I was talking to Alex Israel, who's an LA-based artist. He's not a technical guy, but he's an amazing artist, and he was describing this new video project he wants to do that involves the use of AI. I was very inspired by how specific the vision was, and by how much he was thinking through the implementation for somebody who doesn't come from the technical field. I imagine there will be a whole crop of people who look at these capabilities as another tool for expression.
Yeah. There are some people who have a very specific vision and just want the tool to help them get there, and then there are others whose process involves bringing the model along, where the unpredictability, not necessarily knowing what the model is going to generate, is part of their process and part of the final piece they create. So some of them view these models very much as tools, and others tend to view them more as a collaborator in the process of creating, and it's always interesting to see what end of the spectrum different people lie on.
Okay, as we're nearing the end of our time together, we want to run through a few rapid-fire questions, if that's okay. Maybe I'll start with one, just given your breadth in the field: is there an area of image, audio, or video generation, understanding, or control that you feel is just under-explored, for people looking for research problems?
Yeah, one is the control piece that we already talked about quite a bit, and the other is multimodality, bringing all of these modalities together. Right now we have models that can generate text, models that can generate images, models that can generate video, but there's no reason these all need to be independent. You can envision systems that are ingesting all of these modalities, understanding all of it, and generating all of these modalities. I'm starting to see some work in that direction, but I haven't seen a whole lot of it that goes across many different modalities.
You just got back from CVPR and presented there. Can you mention both what you were talking about and the project or work that most inspired you there?

Yes, I was at CVPR. I was on a few different panels and I was giving some talks; one of them was on vision, language, and creativity at the main conference, so that was what I was representing there. In terms of something exciting, not necessarily a paper, there was a workshop there called Scholars and Big Models, where the topic of discussion was: as these models are getting larger and larger and making a lot of progress, what should the strategy be for academics, or for labs that don't have as many compute resources, and how should they be approaching these things? That I thought was a really nice discussion. In general, I tend to enjoy venues that talk about the meta, the human aspects of the work that we do. We have a lot of technical sessions, but we don't tend to talk about these other components, and so that workshop is something that I enjoyed.
Is there a prediction you'd make about the impact of all these technologies on social media, since you're working at the intersection?
Yeah: there's going to be more of it. That's one prediction I can very confidently make, that we are going to see all of these tools show up where millions and billions of people can be using them in various forms. And I'm excited about that. Like I said, I think it enhances creative expression and communication, and we're going to have entirely new ways of interacting. Then even the social entities on these networks might change, right? When we talk about AI agents and you think about them being part of these networks, what that does and how that changes how we connect with each other, all of that is fascinating, and I can't wait to see how it evolves.
37:28evolves it actually feels a little bit
37:30under discussed in terms of how much
37:32impact it's already had in some ways
37:33even if it sometimes is transitory like
37:35a lensa or you know there's my
37:38understanding is tens of millions of
37:39people at least who ended up using that
37:40product there's character in terms of
37:42the engagement there one could argue mid
37:45Journey for certain types of art you
37:46share with friends Etc and so I feel
37:48like there's already these
37:49social expression modalities bot-based
37:52interactions Etc that are already
37:54impacting aspects of social or
37:56communicative media in ways that um
37:58people don't really recognize it as such
37:59so to your point it's very exciting to
38:01see where all this is heading yeah yeah
Devi, you also write quite a bit online, in terms of giving back to the community and advice for researchers or young people newer to the field; you talk about time management and other topics. What's one piece of wisdom you'd offer to people in terms of productivity and joy in AI research?
...I have a post on that. Its main philosophy is that you should be writing everything you want to do down on your calendar. The point is that it should be on your calendar, not on your to-do list, and the reason it should be on your calendar is that it forces you to think through how much time everything is going to take. If it's just a list, you have no idea how long things will take, and that's not a good way to plan your time. So that's the main thesis of it, and it's something I hadn't anticipated, but it resonated a whole lot with many people, which was kind of surprising when I put it out there. So yeah, if anyone's interested, you should check that out.

In terms of advice outside of what I've written, one piece of advice that has stuck with me over the years is: don't self-select. If you want something, go for it. If you want a job, apply for it. If you want a fellowship, for any students who might be listening, just apply for it. You wanted an internship? Just apply for it. Don't assume, don't question, "Oh, am I good enough?" It's on the world to say no to you; if you are not a good fit, the world will tell you that. There's nothing to lose by just giving it a shot, so don't self-select.

That's a great note to end on. Devi, thank you so much for joining us on No Priors.

Thank you, thank you for having me.

Thanks.