00:00I created a patch that allows the
00:01PlayStation 3 version of new to run to
00:03any frame rate while I was creating this
00:04patch I found an odd pattern in the
00:06game's code the game likes a zero Vector
00:08register then convert that zeroed
00:09register to a floating Point number
00:11thing is unlike all other numbers the
00:13way that zero is represented in binary
00:14is the same for both floating point and
00:16integer numbers in other words the
00:18second instruction here is essentially
00:20nothing this pattern appears in near
00:242434 times thankfully when translating
00:27the power PC code of the PlayStation 3
00:29to x86 we're able to optimize the
00:31floating point conversion away but I
00:33can't show you the code in rpcs3 that
00:35optimizes this pattern because rpcs3
00:38doesn't optimize this
00:40pattern rather than translating power PC
00:43code directly to x86 we instead
00:45translate to something known as llvm IR
00:48the open- source llvm project knows how
00:50to ingest llvm ir and output code for
00:52many different computer architectures
00:55since the LM project receives so many
00:56contributions simple optimizations like
00:59evaluating floating Point conversions on
01:00a constant value don't need to be
01:02reimplemented by the rpcs3
01:04team this is why when you boot up a game
01:06for the first time in rpcs3 it takes so
01:08long on this step compiling PPU modules
01:12an analyzer in rpcs3 finds all the power
01:14PC code in the binary translates it to
01:17llvm ir and llvm converts it into well
01:19optimized x86 code performing the step
01:23before boot helps mitigate any stutters
01:24or hitches that would normally occur due
01:26to translating the code at runtime but
01:28the core that runs power PC code on the
01:30PS3 known as the PPU is pretty much the
01:32least interesting part of the PS3 this
01:35is a die shot of the PS3 CPU this large
01:40PPU it supports everything needed to run
01:42a modern operating system and can run
01:44generic code these eight repeating
01:46structures are the spu cores they're
01:49Built For maximum throughput at the
01:51programmability there's eight of them
01:53here on the D itself but one is disabled
01:55for better yields and another is
01:56reserved by the operating system so when
01:59emulating the place 3 we only need to
02:02them let's take a look at the spu
02:07manual I can't read this let's take a
02:10look at one of the official manuals
02:12instead there's a lot of features that
02:15make the spus unique let's talk about
02:17how they handle floating Point
02:19numbers first of all the manual mentions
02:22that much of the code base for game
02:23applications assumes a single Precision
02:25floating Point format that is distinct
02:26from the i e 754 format commonly
02:29implemented on general purpose
02:31processors what in the world does that
02:32mean well it turns out that the spus
02:34inherit the same crackhead floating
02:36Point format that the PlayStation 2 uses
02:38in this format values of infinity and
02:40Nan are not supported and are instead
02:42interpreted as a number with a very
02:44large exponent under i e 754 any number
02:48with all bits and exponents set are
02:49interpreted as either Nan or Infinity
02:52but in the extended range floating Point
02:53format they're instead treated as an
02:56number this poses a problem for
02:58emulators as we need to ex Ute
03:00additional instructions to ensure
03:01compatibility with PlayStation 3
03:03software this is what Ninja guid and
03:05sigma 2 looks like when we don't
03:06properly emulate the extended range
03:08floating Point format you can see that
03:10most of the geometry is missing how do
03:12we handle it one method that is common
03:14to both PlayStation 2 and Playstation 3
03:16emulators is to clamp the input values
03:18down from what would be interpreted as
03:19Nan or INF on i e processors to fmax
03:22instead this isn't entirely accurate
03:25however it is pretty fast still even a
03:28fast method adds several instructions to
03:29our implementation what does that look
03:31like this is what the code that emits
03:33the clamping instructions looks
03:35like and this is sort of what the
03:37resulting x86 assembly looks
03:40like two instructions isn't a lot of
03:42overhead but considering that the
03:43PlayStation 3 is 40 times faster than
03:45the PlayStation 2 at crunching through
03:46floating Point numbers you still hope
03:49faster and so I did come up with
03:51something Faster by using the V range PS
03:54instruction we only need one instruction
03:56to do both positive and negative
03:58clamping the behavior of this
04:00instruction is controlled by a constant
04:01value by using the correct constant we
04:03can take the absolute value of our input
04:05take the minimum of fmax in our input
04:06then copy the sign bit from the original
04:08input to the resulting
04:10value crucially the instruction also
04:12differs in how it behaves when either
04:14inputs are Nan if either input is Nan it
04:16will use the second input value as the
04:18result since our second input is fmax
04:20this behavior is desirable unfortunately
04:23the V range PS instruction can only be
04:25used on CPUs that support the AVX 512
04:27instruction set and most people still
04:28don't have a CPU that can execute AVX
04:33instructions so what does ninja giden
04:35look like with more accurate floating
04:45emulation the game was missing geometry
04:48with inaccurate floating Point emulation
04:49because the game is performing as much
04:51geometry processing on the spus as
04:53possible many games use a Sony developed
04:55Library known as spus geometry for this
05:00if we zoom into momiji's face here you
05:02can see that the game is actually
05:03calling triangles that are too small to
05:04be rendered at the game's native
05:08720p if we switch over to the same
05:10footage from the PC version you can see
05:12all of that clear up there's still one
05:15more difference with the PlayStation 3
05:16version if you shake the controller like
05:18a [Â __Â ] then her boobs jiggle like this
05:21for some reason they didn't include this
05:22feature in the PC version what the
05:26hell let's take a look at some specific
05:28spu instructions now fcq or floating
05:31compare equal is a pretty
05:32straightforward instruction take two
05:34numbers if the numbers are equal fill
05:36the result with all ones if they aren't
05:38equal fill the result with all zeros
05:40there's one issue when comparing to the
05:42equivalent instruction in x86 how it
05:44deals with Nan values Nan can't equal
05:46Nan so there's instead two choices of
05:48behavior to choose from an unordered
05:50comparison will return a result of all
05:52ones if either operand is Nan and an
05:54ordered comparison will never return a
05:55result of all ones if either operand is
05:57Nan The Simple Solution here is to
06:00perform both an ordered floating Point
06:01comparison along with an integer
06:03comparison at the same time then
06:04logically order the results together we
06:07do this because an equals comparison in
06:08floating point is essentially an integer
06:10comparison with special cases for Nan
06:12and zero so by combining the results we
06:14can get something accurate to the
06:15PlayStation 3 comparison instructions
06:18are typically followed by this next
06:19instruction cellb this instruction takes
06:21three inputs depending on the contents
06:23of the third input the instruction
06:24selects a bit from either the first or
06:26the second input this process is
06:28repeated 12 28 times for each bit in the
06:32registers since comparison instructions
06:34like fcq fill the result with the result
06:36of all ones or all zeros this can be
06:38used to select between not just bits but
06:40entire 32-bit floating Point numbers but
06:43if the spu registers are 128 bits why
06:45are we talking about 32-bit numbers well
06:48the spus Implement something called simd
06:50single instruction multiple data this
06:52means that instructions like fcq
06:54actually perform four floating Point
06:56comparisons and pack the results into a
06:57single 128bit register
07:00this is more efficient than executing
07:02four Standalone instructions but comes
07:03at the expense of programmability simd
07:06instructions aren't something exotic all
07:08modern processors support them for
07:10instance SS AVX and AVX 512 are some x86
07:14simd instruction sets that you may have
07:15heard of however The spus Stand Out by
07:18not supporting scaler instructions
07:20instructions that only operate on a
07:23input moving back to cellb how do we
07:25emulate it well there's a simple
07:27solution that takes three instructions
07:28we need to Launch logically and one of
07:30the inputs logically and the inverse of
07:32the other input and then or the result
07:34of those two operations together with
07:36AVX 512 there's actually an instruction
07:38that allows you to do all of that in
07:39just one step the VP turn log
07:41instruction can perform any combination
07:43of bitwise operations on up to three
07:46operation but even without AVX 512
07:49there's a fast way to execute celb 90%
07:51of the time the VP blend instructions in
07:53x86 will select an entire 32-bit Lane of
07:56either of the first two inputs based off
07:58of the most significant bit of the third
08:00Vector since we know that the selection
08:02mask provided for celby will always be
08:04all ones or all zeros following a
08:06comparison instruction the VP blend
08:08instruction is equivalent in this
08:10case let's take a look at the fcg or
08:12floating compare greater than
08:14instruction for just a second we have to
08:16execute some extra instructions to match
08:18PlayStation 3 Behavior just like with
08:20fcq there's a special case optimization
08:22for this instruction that is able to
08:24simplify our emulation to a single
08:25integer comparison given that one of the
08:27inputs is a positive number the great
08:30thing about this is that LM is able to
08:31see the pattern of an integer comparison
08:33greater than followed by a select into a
08:36much simpler integer mid Max instruction
08:38meaning that we only need one
08:39instruction to emulate two PlayStation 3
08:45neat let's move on to what I consider to
08:48be the most iconic spu instruction Shu
08:52B Shuff b or Shuffle bites is an
08:55incredibly powerful instruction that
08:56also sees a lot of usage since it's so
08:58commonly used it's important that the
09:00implementation is very
09:02optimized x86 programmers may be
09:05familiar with the SS instruction pshu be
09:07pshu b means pack Shuffle bites since
09:10pshu B is simpler than Shu B we're going
09:12to do this in reverse and explain pshu B
09:14first first let's picture an array of 16
09:17numbers and number them in Reverse from
09:1915 down to zero then if we have another
09:22array of 16 numbers we can think of
09:24these values as indices values that will
09:26select a result from our other array so
09:28a value of zero will select a result of
09:30nine from over here this is what it
09:32looks like once the process is repeated
09:33for each element of our Vector you'll
09:35note that this case resulted in a zero
09:37being written to the result P sh B has a
09:39special case where it will write zero to
09:41the result if the indices have the most
09:43significant bit set so how does shf B
09:46differ first of all our input array is
09:48numbered in the opposite order this time
09:50from 0 to 15 secondly we actually have a
09:52second set of numbers this time numbered
09:5416 to 31 since shuy takes two 128bit
09:58input vectors rather than one
10:00just like with pshu B shuby has a
10:02special case when the most significant
10:03bid is set but shuby has three special
10:06cases in total allowing some special
10:08constants to be written to the result
10:09when desired that's a lot of behavior to
10:11emulate how do we do it we're going to
10:14start by taking the input indices and
10:15bit shifting them four places to the
10:17right we then need to take those shifted
10:19indices and use them to index into a
10:21special constant with p b we'll now have
10:23an array filled with our special
10:25constants and zeros for all other values
10:27we'll just hang on to this Vector now
10:28and come back to it later next we need
10:31to xor the value hex F into the original
10:33unshifted indices the purpose of this is
10:35to reverse the order which we index into
10:37the Shu B indices such that pshu B can
10:40match Its Behavior now we need to
10:42execute two pshu B instructions with the
10:44Reversed indices this has to be done
10:46twice since psh B can't take two input
10:48vectors like Shu B can finally we need
10:51to merge the two psh B results along
10:53with our special cases from earlier we
10:55need to bit shift the Shu B indices to
10:56the left three times and then take
10:58advantage of the X8 six blend
10:59instructions to select from each of our
11:01two pshu B results finally we need to
11:04logically or our special cases from the
11:06start into our result since pshu B write
11:08zero in all three of the special cases
11:09where Shu B would write a result the
11:11special cases can be cleanly added to
11:13the result this is what the resulting
11:15x86 assembly looks like we need about
11:17nine instructions here note that I
11:19didn't include loading any constants in
11:21the instruction count in most situations
11:23lovm is able to find some additional
11:24optimizations to reduce the number of
11:26needed instructions here for instance if
11:28shuy is used twice with the same indices
11:30it will recognize that it's already
11:32calculated the special cases for that
11:33set of indices and will admit
11:35calculating it again alternatively if
11:37shuby was called in a loop lvm would
11:39hoise calculating the special cases
11:41outside of the loop such that it's not
11:43being recalculated each iteration
11:45despite the fact that lovm is able to
11:47add so many optimizations by itself we
11:49still have some extra optimizations of
11:50our own for Shu lines of code isn't a
11:52good metric to judge code but just look
11:54at how big this implementation is what I
11:56showed you earlier is all that's needed
11:58for an accurate implementation but all
11:59this extra code exists to make your
12:01video games run fast we have
12:03optimizations for when the input comes
12:04from special indic generating
12:06instructions that are meant to be paired
12:08B we have optimizations for when the
12:10indices are a constant value allowing
12:12llvm to simplify the code even further
12:15we have optimizations for when llvm's
12:17known bits analysis can determine that
12:18the indices don't contain any special
12:21cases we have optimizations for when the
12:23input vectors were recently bite swapped
12:25allowing us to shuffle the unswapped
12:26data and skip the step where we reverse
12:30finally we have a special AVX 512 path
12:33this is the piece of code that I'm most
12:34proud of writing in my whole life not
12:36only is it a real mind Bender it's also
12:38blazingly fast this is going to take a
12:40minute to explain so strap in let's take
12:43a look at the resulting assembly this
12:44code produces first this time we only
12:46need five instructions here if we don't
12:48count loading constants the key to all
12:50this is a vg2 p8 apine QB instruction
12:54this instruction is a real mouthful and
12:55I often see people who are new to x86
12:57programming fixate on this instruction
12:59in particular and gawk at the absurdity
13:01of that long name however this
13:03instruction is adored by experienced
13:05programmers who have taken the time to
13:07it this instruction is part of the gfn
13:10instruction set these instructions
13:12operate on something either known as a
13:14galwa field or a finite field galwa
13:16fields are named after the French
13:18mathematician G this doesn't really have
13:21much to do with PlayStation 3 emulation
13:23but wow this guy made a number of
13:25contributions to mathematics before
13:26dying in a duel at the age of 20 what
13:28did do accomplish at 20 years old what
13:30do you intend to accomplish at 20 years
13:32old anyways uh without a background in
13:35mathematics the description for this
13:37instruction is pretty
13:38impenetrable this instruction was
13:40designed to accelerate cryptography and
13:41the description is appropriate for that
13:43audience but this instruction has become
13:45so well known for its potential for
13:46non-cryptographic uses that even Intel
13:48has produced documentation on how it may
13:50be used for non-cryptographic
13:51purposes rather than plagiarizing what
13:53they've written I'm just going to quote
13:54a little bit from the documentation and
13:56then follow up with some
13:58examples Intel says the gf2 field is
14:01simply a gwa field with only two
14:02elements 0 and one the addition of two
14:05values in gf2 is equivalent to addition
14:07modulo 2 or an exclusive or the
14:10multiplication of two values in gf2 is
14:12equivalent to multiplication modulo 2 or
14:14the logical and operation other
14:16operations are similarly defined so a
14:19background and math isn't really
14:20required here from now on we can simply
14:22think of addition in a gwa field as an
14:25ore here's an example Intel gives for
14:28transforming an input bite for each
14:30resulting bit we have one input bite a
14:32constant bit and one source bite for
14:34each bit in the first input we take a
14:36bit from The Source bite in the same
14:37position if there's a bit there set it
14:39once we've taken all those bits we
14:41horizontally exor all those bits
14:42together along with the constant bit as
14:45well by only setting one bit in each of
14:47the first input bites we can treat this
14:49instruction as sort of a bit Shuffle
14:51instruction not to similar to how pshu B
14:55work all sorts of other possibilities
14:57arise like emulating a sinx exension of
14:59an odd bit count or perhaps an 8 bit
15:01shift instruction the kind of thing
15:02that's missing from x86
15:04simd let's take a look back at our shfb
15:07llvm code these 16 bytes are mirrored
15:09down the middle since each set of eight
15:11bytes operates in an independent Lane
15:13this bite will select the second most
15:15significant bit from our input these
15:17seven bits will select the third most
15:18significant bit from our input finally
15:21we exclusive or each result by hex 7f
15:24what we're left with are the special
15:25cases with a Shu D instruction we need
15:27to take the minimum between our results
15:28and z because if the second most
15:30significant bit of the indices was not
15:31set the constant should be
15:33zero I'm really proud of how this code
15:36works I've been waiting to talk about
15:37how it works in detail for several years
15:39now and honestly I'm so tired of people
15:42bashing on gf2 p8 aine QB that I'm
15:45officially announcing that I will be
15:46dueling anyone who talks smack about
15:48this instruction from now on that's
15:49right if you say something stupid I'm
15:51going to find you and we're going to
15:54death anyways rpcs3 has some other uses
15:57of the gf2 p apine QB instruction too
16:01but this video is long enough already
16:02apparently I left some comments here
16:04though so maybe you can figure out how
16:06help okay enough distractions we're
16:09almost done talking about Chef B there's
16:11just one more part of the AVX 512 path
16:13that I want to talk about this vperm 2B
16:16function is unassuming enough if you're
16:18running an AMD processor that supports
16:19ABX 512 it will just output vperm 2B an
16:22instruction that takes two input vectors
16:24sort of like shf B does just without any
16:26special cases cool but on Intel we
16:29instead emulate vperm 2B with a sequence
16:31of two instructions why well vperm 2B is
16:34implemented with three Micro Ops on
16:36Intel you can imagine that it looks
16:37something like how we shuffled twice and
16:39then merge the results when we're
16:40emulating Shu B with pshu B but there's
16:42a faster way to handle this on Intel
16:44slightly Faster by inserting our Second
16:46Source Vector into the upper 128 bits of
16:48the First Source Vector vperm B actually
16:50becomes equivalent in Behavior to vperm
16:522B and yes emitting all of this llvm
16:55magically compiles down to just two
16:57instructions I I was unsure if this
16:59optimization really would be faster
17:01despite documentation suggesting that it
17:03would so I benchmarked it back in the
17:04day on my tiger Lake laptop the avx2
17:07path had a throughput of one emulated sh
17:09B per four Cycles the old AVX 512 path
17:12had a speed of one sh B per three cycles
17:14and the latest AVX 512 path has a speed
17:16of one shy per 2.3 Cycles pretty
17:19sweet so I could spend more time talking
17:22about specific PlayStation 3
17:24instructions but I think by now you
17:25understand the type of effort that the
17:26rpcs3 team puts into making them run
17:29so let's switch subject a little what
17:31kind of impact do different instruction
17:34performance this is how God of War 3
17:37performs when targeting a processor with
17:38ssse 4.1 instructions the game performs
17:41about 30% faster with avx2 instructions
17:44largely thanks to the addition of fuse
17:45multiply ad instructions another 20%
17:48performance is gained when targeting AVX
17:52instructions so when looking for a new
17:54CPU you should get one with AVX 512
17:56right well sort of depending on the game
17:59the fastest CPU might be something
18:00without AVX 512 support and yeah the
18:03latest Intel CPUs don't support AVX 512
18:06this is despite Intel creating the
18:07instruction set and shipping it on CPUs
18:09from several years ago how
18:12come well Intel's latest CPUs have
18:14something they call efficiency cores on
18:16them unfortunately since the efficiency
18:17cores don't support AVX 512 the larger
18:20performance cores can't enable AVX 512
18:22support until is working on a solution
18:24to this that they call
18:25avx1 avx1 is essentially AVX 512 but
18:28with with only 256bit long vectors so
18:31you can use all the new AVX 512
18:32instructions without implementing
18:34512-bit long vectors problem is ax10 is
18:37still several years away from Shipping
18:39product AMD recently posted some patches
18:42to the open source GCC compiler with
18:44details on their upcoming Zen 5 CPUs one
18:47of the new features with Zen 5 is that
18:48it can execute 512-bit wide AVX 512
18:51instructions in a single cycle I made an
18:53entire blog post on this subject
18:55previously with the goal of explaining
18:56exactly why AVX 512 is useful for RPC 3
18:59which has nothing to do with how wide
19:00the instructions are rpcs3 does use some
19:02wide AVX 512 instructions but since
19:04those make up less than 1% of the code
19:06executed doubling the speed won't make
19:09impact however the other new features
19:11detailed in the Zen 5 GCC patches are a
19:13big deal for rpcs3 like the doubled L1
19:15and L2 bandwidth let's get back to
19:18talking about the PlayStation 3 Hardware
19:20Theus are a pain to program part of the
19:22problem is that the spus execute nothing
19:24but 128bit simd instructions but there's
19:26an even bigger problem when it comes to
19:27programmability the spus cannot directly
19:30access main memory why do they do this
19:33let's check out how a conventional
19:34machine loads from memory first we
19:36execute a load instruction which
19:37generates an address next we need to
19:39translate that address from a virtual
19:41address to a physical address the
19:42mapping for translating addresses is
19:44held somewhere in memory but it would be
19:45too slow to look that up each time so
19:47address translations are cached in
19:48something called a translation look
19:50aside buffer once the address is
19:52translated the processor will check if
19:54that address exists in the caches
19:56typical modern processors have three
19:59finally if all level of caches didn't
20:01contain our address the processor makes
20:02a request to the memory controller to
20:04load the cach line that contains our
20:05address that's an extreme amount of
20:07complexity considering how common load
20:08and store instructions are on the spus
20:11load and store instructions are quite
20:12simple once the address is generated the
20:14processor loads the data found at that
20:16address there's no virtual to physical
20:18translation and there's no CES to check
20:21each spu can only directly access 256
20:23KOB of local storage which is made up of
20:25SRAM similar to caches on other CPS
20:27without the complexity of having the
20:29hardware manage evictions and fills to
20:31actually perform real work there still
20:32needs to be a way to communicate with
20:34the rest of the system to accomplish
20:35this the spus makes use of something
20:37known as dma direct memory access the
20:39dma controller receives requests from
20:41programs and moves blocks of memory
20:42between main memory and the spus the big
20:44pain point is that all of this is
20:47managed let's talk about one last
20:49subject I'm going to use near as an
20:51example again this game has more than a
20:53few funky things hiding in its code the
20:55game likes to sleep for 10 microc
20:56seconds on the main thread on Linux this
20:58work works fine enough we sleep for 10
21:00micros on Windows this becomes a bit of
21:02a problem because the minimum amount of
21:04time we can sleep for is 500 microc the
21:07solution is to busy weight keep the
21:09thread active and wait down the timer in
21:10user space instead the problem is that
21:12this Burns Power and doesn't let the
21:14operating system reschedule the thread
21:15elsewhere while waiting on the timer
21:18there exists instructions that allow the
21:19processor to efficiently wait on a timer
21:21without blowing up power usage
21:23historically these instructions have
21:24only been available to the operating
21:25system and normal programs can't execute
21:27them recently however AMD and Intel have
21:30added instructions that allow you to
21:31wait on a timer efficiently to user
21:33space thing is this is one rare
21:35situation where AMD and Intel have
21:37implemented different instructions for
21:38the same purpose so we need to write
21:40different paths for each brand this time
21:42the good news is that using these
21:43instructions comes with the measurable
21:44performance and power draw
21:47Improvement it's about time to wrap up
21:49this video I talked about a lot of stuff
21:51that I find exciting but for each
21:53subject I touched upon there were about
21:5410 others that I admitted I make
21:56contributions to the rpcs3 proc project
21:58so what I want to speak about aligns
22:00with the kinds of things that I want to
22:01work on as well but there are many other
22:03parts of the emulator that I don't work
22:04on like the GPU emulation which is full
22:06of just as much genius as the rest of
22:08the emulator I'll leave some links in
22:10the description that cover some
22:11additional details on rpcs3 if you're
22:14curious I wanted to cover some very
22:16technical subjects but I also didn't
22:17want to spend time explaining some
22:19Basics like what a bti is and so on so I
22:21hope that I struck a good balance if you
22:24enjoyed the video I'd appreciate it if
22:25you could like And subscribe as I do
22:27intend to create several more videos
22:28which may or may not be rpcs3 related
22:31I'm between jobs at the moment so I'd
22:32appreciate if you could share the video
22:34with as many people who you think would
22:35like it as you can I don't intend to
22:37become a full-time YouTuber but I don't
22:39have much else going on right now so I
22:41do intend to make at least a couple more
22:42videos all right peace out see you nerds