00:00Suppose I give you two different lists of numbers, or maybe two different functions,

00:04and I ask you to think of all the ways you might combine those two lists to get

00:07a new list of numbers, or combine the two functions to get a new function.

00:12Maybe one simple way that comes to mind is to simply add them together term by term.

00:17Likewise with the functions, you can add all the corresponding outputs.

00:20In a similar vein, you could also multiply the two lists

00:23term by term and do the same thing with the functions.

00:26But there's another kind of combination just as fundamental as both of those,

00:30but a lot less commonly discussed, known as a convolution.

00:34But unlike the previous two cases, it's not something that's

00:37merely inherited from an operation you can do to numbers.

00:39It's something genuinely new for the context of lists of numbers or combining functions.

00:45Convolutions show up all over the place: they're ubiquitous in image processing,

00:49they're a core construct in the theory of probability,

00:51they're used a lot in solving differential equations,

00:54and one context where you've almost certainly seen it, if not by this name,

00:58is multiplying two polynomials together.

01:00For someone in the business of visual explanations, this is an especially great topic,

01:04because the formulaic definition in isolation and without context can look kind of

01:09intimidating, but if we take the time to really unpack what it's saying,

01:12and before that actually motivate why you would want something like this,

01:16it's an incredibly beautiful operation.

01:18And I have to admit, I actually learned a little something

01:21while putting together the visuals for this project.

01:23In the case of convolving two different functions,

01:25I was trying to think of different ways you might picture what that could mean,

01:29and with one of them I had a little bit of an aha moment for why it is that

01:33normal distributions play the role that they do in probability,

01:36why it's such a natural shape for a function.

01:39But I'm getting ahead of myself, there's a lot of setup for that one.

01:41In this video, our primary focus is just going to be on the discrete case,

01:45and in particular building up to a very unexpected but very clever algorithm for computing them.

01:50And I'll pull out the discussion for the continuous case into a second part.

01:58It's very tempting to open up with the image processing examples,

02:01since they're visually the most intriguing, but there are a couple bits of finickiness

02:05that make the image processing case less representative of convolutions overall,

02:09so instead let's kick things off with probability,

02:11and in particular one of the simplest examples that I'm sure everyone here has

02:15thought about at some point in their life, which is rolling a pair of dice and

02:18figuring out the chances of seeing various different sums.

02:22And you might say, not a problem, not a problem.

02:24Each of your two dice has six different possible outcomes,

02:27which gives us a total of 36 distinct possible pairs of outcomes,

02:31and if we just look through them all we can count up how many pairs have a given sum.

02:36And arranging all the pairs in a grid like this,

02:39one pretty nice thing is that all of the pairs that have a constant sum are visible

02:43along one of these different diagonals.

02:45So simply counting how many exist on each of those diagonals

02:48will tell you how likely you are to see a particular sum.

02:53And I'd say, very good, very good, but can you think of

02:55any other ways that you might visualize the same question?

02:59Other images that can come to mind to think of

03:01all the distinct pairs that have a given sum?

03:04And maybe one of you raises your hand and says, yeah, I've got one.

03:08Let's say you picture these two different sets of possibilities each in a row,

03:12but you flip around that second row.

03:13That way all of the different pairs which add up to seven line up vertically like this.

03:19And if we slide that bottom row all the way to the right,

03:22then the unique pair that adds up to two, the snake eyes, are the only ones that align,

03:26and if I schlunk that over one unit to the right,

03:28the pairs which align are the two different pairs that add up to three.

03:32And in general, different offset values of this lower array,

03:36which remember I had to flip around first, reveal all the distinct pairs that have a given sum.

03:44As far as probability questions go, this still isn't especially interesting because

03:48all we're doing is counting how many outcomes there are in each of these categories.

03:52But that is with the implicit assumption that there's

03:55an equal chance for each of these faces to come up.

03:58But what if I told you I have a special set of dice that's not uniform?

04:02Maybe the blue die has its own set of numbers describing the probabilities for

04:05each face coming up, and the red die has its own unique distinct set of numbers.

04:10In that case, if you wanted to figure out, say,

04:12the probability of seeing a 2, you would multiply the probability

04:16that the blue die is a 1 times the probability that the red die is a 1.

04:20And for the chances of seeing a 3, you look at the two distinct

04:23pairs where that's possible, and again multiply the corresponding

04:26probabilities and then add those two products together.

04:30Similarly, the chances of seeing a 4 involves multiplying together

04:33three different pairs of possibilities and adding them all together.

04:36And in the spirit of setting up some formulas, let's name these top probabilities a1,

04:41a2, a3, and so on, and name the bottom ones b1, b2, b3, and so on.

04:46And in general, this process where we're taking two different arrays of numbers,

04:50flipping the second one around, and then lining them up at various different

04:54offset values, taking a bunch of pairwise products and adding them up,

04:57that's one of the fundamental ways to think about what a convolution is.

05:04So just to spell it out a little more exactly, through this process,

05:08we just generated probabilities for seeing 2, 3, 4, on and on up to 12,

05:12and we got them by mixing together one list of values, a, and another list of values, b.

05:17In the lingo, we'd say the convolution of those two sequences gives us this new sequence,

05:22the new sequence of 11 values, each of which looks like some sum of pairwise products.
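In Python, for instance, NumPy's `convolve` computes exactly this operation; the face probabilities below are made up purely for illustration.

```python
import numpy as np

# Hypothetical face probabilities for the two weighted dice
# (any six nonnegative values summing to 1 would work).
blue = np.array([0.10, 0.20, 0.30, 0.20, 0.10, 0.10])
red = np.array([0.05, 0.10, 0.20, 0.30, 0.25, 0.10])

# Convolving the two length-6 lists gives the 11 probabilities
# for the sums 2 through 12.
sums = np.convolve(blue, red)

print(len(sums))   # 11
print(sums[0])     # P(sum = 2), i.e. 0.10 * 0.05
```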

05:27If you prefer, another way you could think about the same operation is to first

05:31create a table of all the pairwise products, and then add up along all these diagonals.

05:37Again, that's a way of mixing together these two

05:39sequences of numbers to get us a new sequence of 11 numbers.

05:43It's the same operation as the sliding windows thought, just another perspective.

05:47Putting a little notation to it, here's how you might see it written down.

05:50The convolution of a and b, denoted with this little asterisk, is a new list,

05:54and the nth element of that list looks like a sum,

05:58and that sum goes over all different pairs of indices, i and j,

06:01so that the sum of those indices is equal to n.

06:05It's kind of a mouthful, but for example, if n was 6,

06:08the pairs we're going over are 1 and 5, 2 and 4, 3 and 3, 4 and 2, 5 and 1,

06:13all the different pairs that add up to 6.

06:16But honestly, however you write it down, the notation is secondary in

06:19importance to the visual you might hold in your head for the process.

06:23Here, maybe it helps to do a super simple example,

06:26where I might ask you what's the convolution of the list 1, 2, 3 with the list 4, 5, 6.

06:31You might picture taking both of these lists, flipping around that second one,

06:35and then starting with it slid all the way over to the left.

06:38Then the pair of values which align are 1 and 4,

06:40multiply them together, and that gives us our first term of our output.

06:43Slide that bottom array one unit to the right,

06:46the pairs which align are 1 and 5, and 2 and 4; multiply those pairs,

06:50add them together, and that gives us 13, the next entry in our output.

06:54Slide things over once more, and we'll take 1 times 6 plus 2 times 5 plus 3 times 4,

07:00which happens to be 28.

07:02One more slide and we get 2 times 6 plus 3 times 5, and that gives us 27.

07:07And finally the last term will look like 3 times 6.
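That flip-and-slide process translates almost directly into code. This is a from-scratch sketch of the definition (0-indexed, unlike the 1-indexed narration), not how libraries actually implement it.

```python
def convolve(a, b):
    # output index n collects every product a[i] * b[j] with i + j == n,
    # which is exactly the flip-and-slide process described above
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

print(convolve([1, 2, 3], [4, 5, 6]))  # [4, 13, 28, 27, 18]
```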

07:10If you'd like, you can pull up whatever your favorite programming

07:13language is and your favorite library that includes various numerical operations,

07:17and you can confirm I'm not lying to you.

07:18If you take the convolution of 1, 2, 3 against 4,

07:215, 6, this is indeed the result that you'll get.
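With NumPy, for instance, that check is a one-liner:

```python
import numpy as np

# np.convolve computes the full flip-and-slide convolution by default
print(np.convolve([1, 2, 3], [4, 5, 6]))  # [ 4 13 28 27 18]
```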

07:25We've seen one case where this is a natural and desirable operation,

07:29adding two probability distributions, and another common example would be a moving average.

07:34Imagine you have some long list of numbers, and you take

07:37another smaller list of numbers that all add up to 1.

07:40In this case, I just have a little list of 5 values and they're all equal to 1 5th.

07:44Then if we do this sliding window convolution process,

07:46and kind of close our eyes and sweep under the rug what happens at

07:50the very beginning of it, once our smaller list of values entirely

07:53overlaps with the bigger one, think about what each term in this convolution really means.

07:59At each iteration, what you're doing is multiplying each of the values

08:03from your data by 1 5th and adding them all together,

08:06which is to say you're taking an average of your data inside this little window.

08:11Overall, the process gives you a smoothed out version of the original data,

08:14and you could modify this starting with a different little list of numbers,

08:18and as long as that little list all adds up to 1,

08:20you can still interpret it as a moving average.

08:23In the example shown here, that moving average

08:25would be giving more weight towards the central value.

08:28This also results in a smoothed out version of the data.
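As a sketch of that idea: in NumPy, `mode="valid"` is what keeps only the fully-overlapping positions, sweeping the boundary effects under the rug as described. The data here is just made-up noise.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=200).cumsum()   # some noisy made-up data

window = np.ones(5) / 5                # five values, each 1/5

# mode="valid" keeps only positions where the window fully overlaps
# the data, so every output is a plain 5-term average
smoothed = np.convolve(data, window, mode="valid")

print(len(smoothed))   # 196, i.e. 200 - 5 + 1
```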

08:33If you do kind of a two-dimensional analog of this,

08:35it gives you a fun algorithm for blurring a given image.

08:38And I should say the animations I'm about to show are modified from something

08:42I originally made for part of a set of lectures I did with the Julia lab at

08:46MIT for a certain open courseware class that included an image processing unit.

08:51There we did a little bit more to dive into the code behind all of this,

08:54so if you're curious, I'll leave you some links.

08:56But focusing back on this blurring example, what's going on is I've got this

09:00little 3x3 grid of values that's marching along our original image, and if we zoom in,

09:05each one of those values is 1 9th, and what I'm doing at each iteration is

09:09multiplying each of those values by the corresponding pixel that it sits on top of.

09:13And of course in computer science, we think of colors as little vectors of three values,

09:17representing the red, green, and blue components.

09:20When I multiply all these little values by 1 9th and I add them together,

09:24it gives us an average along each color channel,

09:26and the corresponding pixel for the image on the right is defined to be that sum.

09:31The overall effect, as we do this for every single pixel on the image,

09:35is that each one kind of bleeds into all of its neighbors,

09:38which gives us a blurrier version than the original.

09:41In the lingo, we'd say that the image on the right is a

09:44convolution of our original image with a little grid of values.

09:48Or more technically, maybe I should say that it's the convolution

09:51with a 180 degree rotated version of that little grid of values.

09:54Not that it matters when the grid is symmetric,

09:56but it's just worth keeping in mind that the definition of a convolution,

10:00as inherited from the pure math context, should always invite you to think about

10:03flipping around that second array.
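A two-dimensional sketch of this box blur with SciPy, using a random array as a stand-in grayscale image; `convolve2d` does flip the kernel, though for this symmetric grid of 1/9's it makes no difference.

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.default_rng(1).random((64, 64))  # stand-in grayscale image

kernel = np.full((3, 3), 1 / 9)   # each of the nine values is 1/9

# mode="same" keeps the output the same size as the input;
# boundary="symm" mirrors the image at its edges
blurred = convolve2d(image, kernel, mode="same", boundary="symm")

# each interior output pixel is the average of the 3x3 patch around it
print(blurred.shape)   # (64, 64)
```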

10:05If we modify this slightly, we can get a much more elegant

10:08blurring effect by choosing a different grid of values.

10:11In this case, I have a little 5x5 grid, but the distinction is not so much its size.

10:15If we zoom in, we notice that the value in the middle

10:18is a lot bigger than the value towards the edges.

10:21And where this is coming from is they're all sampled from a bell curve,

10:24known as a Gaussian distribution.

10:26That way, when we multiply all of these values by the corresponding

10:29pixel that they're sitting on top of, we're giving a lot more weight

10:33to that central pixel, and much less towards the ones out at the edge.

10:36And just as before, the corresponding pixel on the right is defined to be this sum.

10:41As we do this process for every single pixel, it gives a blurring effect,

10:44which much more authentically simulates the notion of putting your lens out of focus,

10:48or something like that.
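Sampling that bell curve to build the kernel might look like this sketch; the 5x5 size and the sigma value are arbitrary choices.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # sample a 2D bell curve on a size x size grid, centered in the middle,
    # then normalize so the values sum to 1 (keeping it a weighted average)
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

k = gaussian_kernel()
print(k[2, 2] > k[0, 0])   # True: the center far outweighs the corners
```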

10:49But blurring is far from the only thing that you can do with this idea.

10:53For instance, take a look at this little grid of values,

10:56which involves some positive numbers on the left,

10:58and some negative numbers on the right, which I'll color with blue and red respectively.

11:03Take a moment to see if you can predict and understand

11:06what effect this will have on the final image.

11:10So in this case, I'll just be thinking of the image as grayscale instead of colored,

11:14so each of the pixels is just represented by one number instead of three.

11:18And one thing worth noticing is that as we do this convolution,

11:21it's possible to get negative values.

11:23For example, at this point here, if we zoom in,

11:25the left half of our little grid sits entirely on top of black pixels,

11:28which would have a value of zero, but the right half of negative values all sit on

11:32top of white pixels, which would have a value of one.

11:36So when we multiply corresponding terms and add them together,

11:39the results will be very negative.

11:41And the way I'm displaying this with the image on the right

11:43is to color negative values red and positive values blue.

11:46Another thing to notice is that when you're on a patch that's all the same color,

11:50everything goes to zero, since the sum of the values in our little grid is zero.

11:55This is very different from the previous two examples where the sum of our little

11:58grid was one, which let us interpret it as a moving average and hence a blur.

12:03All in all, this little process basically detects wherever there's

12:06variation in the pixel value as you move from left to right,

12:09and so it gives you a kind of way to pick up on all the vertical edges from your image.
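Here is a sketch with one plausible kernel of that shape (a Sobel-style grid, positive on the left and negative on the right), run on a tiny synthetic black-and-white image. Note that `convolve2d` flips the kernel, so this black-to-white boundary comes out positive; correlating with the unflipped grid would give the very negative values described in the narration.

```python
import numpy as np
from scipy.signal import convolve2d

# positive values on the left, negative on the right, summing to zero
kernel = np.array([[0.25, 0.0, -0.25],
                   [0.50, 0.0, -0.50],
                   [0.25, 0.0, -0.25]])

# tiny synthetic image: black (0) on the left, white (1) on the right
image = np.zeros((8, 8))
image[:, 4:] = 1.0

edges = convolve2d(image, kernel, mode="same")

print(edges[4, 0])   # 0.0: flat patches go to zero, since the kernel sums to zero
print(edges[4, 4])   # nonzero only near the vertical boundary
```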

12:16And similarly, if we rotated that grid around so that it varies as you move from

12:20the top to the bottom, this will be picking up on all the horizontal edges,

12:24which in the case of our little pie creature image does result in some pretty clean edge detection.

12:30This smaller grid, by the way, is often called a kernel,

12:32and the beauty here is how just by choosing a different kernel,

12:35you can get different image processing effects, not just blurring and edge detection,

12:39but also things like sharpening.

12:40For those of you who have heard of a convolutional neural network,

12:44the idea there is to use data to figure out what the kernels should be in

12:47the first place, as determined by whatever the neural network wants to detect.

12:52Another thing I should maybe bring up is the length of the output.

12:55For something like the moving average example,

12:57you might only want to think about the terms when both of the windows

13:00fully align with each other.

13:02Or in the image processing example, maybe you want

13:04the final output to have the same size as the original.

13:07Now, convolutions as a pure math operation always produce an

13:10array that's bigger than the two arrays that you started with,

13:13at least assuming one of them doesn't have a length of one.

13:16Just know that in certain computer science contexts,

13:19you often want to deliberately truncate that output.

13:24Another thing worth highlighting is that in the computer science context,

13:28this notion of flipping around that kernel before you let it march across

13:31the original often feels really weird and just uncalled for, but again,

13:35note that that's what's inherited from the pure math context,

13:38where like we saw with the probabilities, it's an incredibly natural thing to do.

13:43And actually, I can show you one more pure math example where

13:45even the programmers should care about this one,

13:48because it opens the doors for a much faster algorithm to compute all of these.

13:52To set up what I mean by faster here, let me go back and pull up some Python again,

13:56and I'm going to create two different relatively big arrays.

13:59Each one will have a hundred thousand random elements in it,

14:03and I'm going to assess the runtime of the convolve function from the NumPy library.

14:08And in this case, it runs it for multiple different iterations, tries to find an average,

14:12and it looks like, on this computer at least, it averages 4.87 seconds.

14:16By contrast, if I use a different function from the SciPy library called fftconvolve,

14:21which is the same thing just implemented differently,

14:25that only takes 4.3 milliseconds on average, so three orders of magnitude improvement.

14:30And again, even though it flies under a different name,

14:32it's giving the same output that the other convolve function does,

14:36it's just doing something to go about it in a cleverer way.
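A sketch you could run yourself; these arrays are smaller than the hundred-thousand-element ones above so it finishes quickly, and the exact times will of course vary by machine.

```python
import time
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
a = rng.random(5_000)
b = rng.random(5_000)

t0 = time.perf_counter()
direct = np.convolve(a, b)   # O(n^2) sliding-window method
t1 = time.perf_counter()
fast = fftconvolve(a, b)     # O(n log n) FFT-based method
t2 = time.perf_counter()

print(np.allclose(direct, fast))   # True: identical outputs
print(f"direct: {t1 - t0:.4f}s, fft: {t2 - t1:.4f}s")
```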

14:42Remember how with the probability example, I said another way you could

14:45think about the convolution was to create this table of all the pairwise products,

14:49and then add up those pairwise products along the diagonals.

14:53There's of course nothing specific to probability.

14:55Any time you're convolving two different lists of numbers,

14:57you can think about it this way.

14:59Create this kind of multiplication table with all pairwise products,

15:02and then each sum along the diagonal corresponds to one of your final outputs.

15:07One context where this view is especially natural

15:10is when you multiply together two polynomials.

15:13For example, let me take the little grid we already have and replace the top terms

15:18with 1, 2x, and 3x squared, and replace the other terms with 4, 5x, and 6x squared.

15:24Now, think about what it means when we're creating all of

15:26these different pairwise products between the two lists.

15:29What you're doing is essentially expanding out the full product of

15:32the two polynomials I have written down, and then when you add up along the diagonal,

15:37that corresponds to collecting all like terms.

15:40Which is pretty neat.

15:41Expanding a polynomial and collecting like terms

15:44is exactly the same process as a convolution.
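NumPy makes the correspondence easy to check; its polynomial helpers store coefficients from the constant term up, matching the lists here.

```python
import numpy as np
from numpy.polynomial import polynomial as P

a = [1, 2, 3]   # 1 + 2x + 3x^2
b = [4, 5, 6]   # 4 + 5x + 6x^2

# expanding the product and collecting like terms...
product = P.polymul(a, b)

# ...gives exactly the convolution of the coefficient lists
print(product)                  # 4, 13, 28, 27, 18
print(list(np.convolve(a, b)))  # the same five numbers
```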

15:47But this allows us to do something that's pretty cool,

15:50because think about what we're saying here.

15:52We're saying if you take two different functions and you multiply them together,

15:56which is a simple pointwise operation, that's the same thing as if you had

16:00first extracted the coefficients from each one of those, assuming they're polynomials,

16:05and then taken a convolution of those two lists of coefficients.

16:09What makes that so interesting is that convolutions feel,

16:12in principle, a lot more complicated than simple multiplication.

16:15And I don't just mean conceptually they're harder to think about.

16:18I mean, computationally, it requires more steps to perform a convolution

16:22than it does to perform a pointwise product of two different lists.

16:26For example, let's say I gave you two really big polynomials,

16:29say each one with a hundred different coefficients.

16:32Then if the way you multiply them was to expand out this product,

16:36you know, filling in this entire 100 by 100 grid of pairwise products,

16:40that would require you to perform 10,000 different products.

16:43And then, when you're collecting all the like terms along the diagonals,

16:47that's another set of around 10,000 operations.

16:50More generally, in the lingo, we'd say the algorithm is O of n squared,

16:54meaning for two lists of size n, the way that the number of

16:58operations scales is in proportion to the square of n.

17:01On the other hand, if I think of two polynomials in terms of their outputs, for example,

17:06sampling their values at some handful of inputs,

17:09then multiplying them only requires as many operations as the number of samples,

17:13since again, it's a pointwise operation.

17:16And with polynomials, you only need finitely many

17:18samples to be able to recover the coefficients.

17:20For example, two outputs are enough to uniquely specify a linear polynomial,

17:24three outputs would be enough to uniquely specify a quadratic polynomial,

17:29and in general, if you know n distinct outputs,

17:32that's enough to uniquely specify a polynomial that has n different coefficients.

17:37Or, if you prefer, we could phrase this in the language of systems of equations.

17:41Imagine I tell you I have some polynomial, but I don't tell you what the coefficients are.

17:45Those are a mystery to you.

17:46In our example, you might think of this as the product that we're trying to figure out.

17:50And then, suppose I say, I'll just tell you what the outputs of this polynomial

17:54would be if you plugged in various different inputs, like 0, 1, 2, 3, on and on,

17:59and I give you enough so that you have as many equations as you have unknowns.

18:04It even happens to be a linear system of equations, so that's nice,

18:07and in principle, at least, this should be enough to recover the coefficients.

18:11So the rough algorithm outline then would be, whenever you want to convolve two lists

18:16of numbers, you treat them like they're coefficients of two polynomials,

18:20you sample those polynomials at enough outputs, multiply those samples point-wise,

18:24and then solve this system to recover the coefficients as a sneaky backdoor way to find the convolution.
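A sketch of this plan in its naive form, sampling at the arbitrary inputs 0 through 4 and solving the resulting (Vandermonde) linear system:

```python
import numpy as np
from numpy.polynomial import polynomial as P

a = [1, 2, 3]   # coefficients of 1 + 2x + 3x^2
b = [4, 5, 6]   # coefficients of 4 + 5x + 6x^2

n_out = len(a) + len(b) - 1   # the product has 5 coefficients
xs = np.arange(n_out)         # sample at x = 0, 1, 2, 3, 4

# evaluate both polynomials at the samples and multiply pointwise
ys = P.polyval(xs, a) * P.polyval(xs, b)

# solve the linear system V @ coeffs = ys for the unknown coefficients
V = np.vander(xs, n_out, increasing=True)
coeffs = np.linalg.solve(V, ys)

print(np.round(coeffs).astype(int))   # [ 4 13 28 27 18], the convolution
```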

18:31And, as I've stated it so far, at least, some of you could rightfully complain, Grant,

18:36that is an idiotic plan, because, for one thing,

18:38just calculating all these samples for one of the polynomials we know already takes

18:43on the order of n-squared operations.

18:45Not to mention, solving that system is certainly going to be

18:48computationally as difficult as just doing the convolution in the first place.

18:52So, like, sure, we have this connection between multiplication and convolutions,

18:56but all of the complexity happens in translating from one viewpoint to the other.

19:01But there is a trick, and those of you who know about Fourier

19:04transforms and the FFT algorithm might see where this is going.

19:07If you're unfamiliar with those topics, what I'm

19:09about to say might seem completely out of the blue.

19:12Just know that there are certain paths you could have

19:14walked in math that make this more of an expected step.

19:17Basically, the idea is that we have a freedom of choice here.

19:20If instead of evaluating at some arbitrary set of inputs, like 0, 1, 2, 3,

19:24on and on, you choose to evaluate on a very specially selected set of complex numbers,

19:29specifically the ones that sit evenly spaced on the unit circle,

19:32what are known as the roots of unity, this gives us a friendlier system.

19:38The basic idea is that by finding a number where taking its powers falls into

19:42this cycling pattern, it means that the system we generate is going to have a

19:46lot of redundancy in the different terms that you're calculating,

19:49and by being clever about how you leverage that redundancy,

19:52you can save yourself a lot of work.

19:56This set of outputs that I've written has a special name.

19:58It's called the discrete Fourier transform of the coefficients,

20:02and if you want to learn more, I actually did another lecture for that same Julia MIT

20:06class all about discrete Fourier transforms, and there's also a really excellent video on

20:11the channel Reducible talking about the fast Fourier transform,

20:14which is an algorithm for computing these more quickly.

20:17Also Veritasium recently did a really good video on FFTs, so you've got lots of options.

20:22And that fast algorithm really is the point for us.

20:25Again, because of all this redundancy, there exists a method to go from

20:28the coefficients to all of these outputs, where instead of doing on

20:32the order of n squared operations, you do on the order of n times the

20:35log of n operations, which is much much better as you scale to big lists.

20:39And, importantly, this FFT algorithm goes both ways.

20:42It also lets you go from the outputs to the coefficients.

20:46So, bringing it all together, let's look back at our algorithm outline.

20:49Now we can say, whenever you're given two long lists of numbers,

20:52and you want to take their convolution, first compute the fast Fourier transform of

20:56each one of them, which, in the back of your mind,

20:59you can just think of as treating them like they're the coefficients of a polynomial,

21:03and evaluating it at a very specially selected set of points.

21:06Then, multiply together the two results that you just got, point-wise,

21:10which is nice and fast, and then do an inverse fast Fourier transform,

21:13and what that gives you is the sneaky backdoor way to compute the convolution that

21:17we were looking for.

21:19But this time, it only involves O of n log n operations.
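Put into code, the whole outline is only a few lines. Here's a sketch using NumPy's real-input FFT; the zero-padding to the full output length is what makes the FFT's circular product line up with the ordinary convolution.

```python
import numpy as np

def fft_convolve(a, b):
    # transform, multiply pointwise, transform back: O(n log n) overall
    n = len(a) + len(b) - 1   # length of the full convolution
    fa = np.fft.rfft(a, n)    # zero-pads each input up to length n
    fb = np.fft.rfft(b, n)
    return np.fft.irfft(fa * fb, n)

print(np.round(fft_convolve([1, 2, 3], [4, 5, 6])))   # [ 4. 13. 28. 27. 18.]
```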

21:23That's really cool to me.

21:25This very specific context where convolutions show up,

21:27multiplying two polynomials, opens the doors for an algorithm

21:30that's relevant everywhere else where convolutions might come up.

21:34If you want to add probability distributions, do some large image processing,

21:38whatever it might be, and I just think that's such a good example of why you should be

21:42excited when you see some operation or concept in math show up in a lot of seemingly unrelated areas.

21:48If you want a little homework, here's something that's fun to think about.

21:51Explain why when you multiply two different numbers,

21:54just ordinary multiplication the way we all learn in elementary school,

21:57what you're doing is basically a convolution between the digits of those numbers.

22:02There's some added steps with carries and the like, but the core step is a convolution.
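A sketch of that homework's forward direction: convolve the digit lists, then propagate the carries.

```python
import numpy as np

def multiply(x, y):
    # digits least-significant first, so digit i carries a weight of 10^i
    xd = [int(d) for d in str(x)[::-1]]
    yd = [int(d) for d in str(y)[::-1]]

    conv = np.convolve(xd, yd)   # the core step: a convolution of digits

    # the added steps: propagate carries so every place is a single digit
    result, carry = 0, 0
    for place, value in enumerate(conv):
        total = int(value) + carry
        result += (total % 10) * 10**place
        carry = total // 10
    return result + carry * 10**len(conv)

print(multiply(1234, 5678) == 1234 * 5678)   # True
```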

22:07In light of the existence of a fast algorithm,

22:09what that means is if you have two very large integers,

22:12then there exists a way to find their product that's faster than the method we learn

22:16in elementary school, that instead of requiring O of n squared operations,

22:20only requires O of n log n, which doesn't even feel like it should be possible.

22:25The catch is that before this is actually useful in practice,

22:28your numbers would have to be absolutely monstrous.

22:31But still, it's cool that such an algorithm exists.

22:35And next up, we'll turn our attention to the continuous

22:37case with a special focus on probability distributions.