Why is PS3 emulation so fast: RPCS3 optimizations explained

Whatcookie2024-04-12

8K views|4 months ago

💫 Short Summary

The video delves into optimizing floating point conversion for PlayStation 3 emulation, including handling unique formats inherited from PlayStation 2. It explores clamping input values for improved performance, utilizing SIMD instructions and AVX 512 for parallel operations. Emulating behavior through pshu B and shu B instructions in x86 assembly is discussed, along with AVX 512 path code for cryptographic acceleration. Benchmarking results show AVX 512's impact on game performance, with upcoming Zen 5 CPUs supporting these instructions. AMD and Intel have added efficient waiting instructions, with the speaker contributing to the rpcs3 project and planning more technical content.

✨ Highlights

📊 Transcript

✦

Emulation of PlayStation 3 game code requires handling unique floating point format.

03:03

PlayStation 3 floating point format inherited from PlayStation 2 does not support infinity or NaN values.

Emulators need to implement additional instructions for compatibility with PS3 software.

Failure to properly emulate floating point format can lead to missing geometry in games like Ninja Gaiden and Sigma 2.

✦

Clamping input values for PlayStation 2 and PlayStation 3 emulators.

03:18

V range PS instruction allows for positive and negative clamping with one instruction.

Behavior controlled by a constant value for accurate floating-point emulation.

V range PS instruction requires CPUs supporting AVX 512 instruction set.

Issues with Nan values in comparisons addressed with ordered and unordered comparisons for accuracy in PlayStation 3 comparison instructions.

✦

Discussion on SIMD (Single Instruction Multiple Data) instructions and their efficiency in performing multiple operations in parallel.

06:50

SIMD instructions like VP blend and VP turn log can optimize bitwise operations and comparisons on data.

Mention of AVX 512 instruction set and its benefits.

Importance of optimized implementations for common instructions like ShuB (Shuffle bytes) and PshuB (Pack Shuffle bytes) in x86 programming.

✦

Explanation of emulating behavior using pshu B and shu B instructions in x86 assembly.

10:14

The process involves shifting indices, indexing into special constants, XOR operations, and merging results.

Additional optimizations by LLVM include recognizing repeated calculations and simplifying code based on known bits analysis.

Optimizations for constant values, input vectors, and byte swapping are highlighted to improve code efficiency and execution speed for video games.

✦

Overview of AVX 512 Path Code.

12:30

The instruction vg2 p8 apine QB operates on galwa fields with two elements for addition and multiplication modulo 2.

Background in mathematics is not required to understand the instruction, which functions as an exclusive or operation.

Intel provides examples showcasing the versatility of the instruction in applications such as bit shuffling and emulation.

The instruction's potential extends beyond cryptography, offering unique possibilities not found in x86 SIMD.

✦

Optimization of the vperm2B function on Intel processors.

16:16

Inserting the second source vector into the upper 128 bits of the first source vector reduces the number of instructions required and improves performance significantly.

Benchmarking results show faster throughput for the latest AVX 512 path compared to previous versions.

AVX2 and AVX 512 instructions offer substantial speed improvements in game performance.

Selecting a CPU with AVX 512 support is emphasized for enhanced performance.

✦

Overview of AVX1 and Zen 5 CPU features.

18:25

Zen 5 CPUs will support AVX 512 instructions, benefiting RPCS3.

Zen 5 offers doubled L1 and L2 bandwidth.

Memory access complexities on SPUs with limited local storage.

Game programming example involving sleep times on different operating systems.

✦

AMD and Intel have added instructions for efficient waiting on a timer, resulting in performance and power draw improvements.

21:47

The speaker contributes to the rpcs3 project and focuses on technical subjects without covering basics.

Plans to create more videos, possibly unrelated to rpcs3, while in between jobs.

Encourages viewers to like, subscribe, and share the video, with intentions to continue creating content despite not aiming to be a full-time YouTuber.

00:00I created a patch that allows the

00:01PlayStation 3 version of new to run to

00:03any frame rate while I was creating this

00:04patch I found an odd pattern in the

00:06game's code the game likes a zero Vector

00:08register then convert that zeroed

00:09register to a floating Point number

00:11thing is unlike all other numbers the

00:13way that zero is represented in binary

00:14is the same for both floating point and

00:16integer numbers in other words the

00:18second instruction here is essentially

00:19doing

00:20nothing this pattern appears in near

00:23code

00:242434 times thankfully when translating

00:27the power PC code of the PlayStation 3

00:29to x86 we're able to optimize the

00:31floating point conversion away but I

00:33can't show you the code in rpcs3 that

00:35optimizes this pattern because rpcs3

00:38doesn't optimize this

00:40pattern rather than translating power PC

00:43code directly to x86 we instead

00:45translate to something known as llvm IR

00:48the open- source llvm project knows how

00:50to ingest llvm ir and output code for

00:52many different computer architectures

00:55since the LM project receives so many

00:56contributions simple optimizations like

00:59evaluating floating Point conversions on

01:00a constant value don't need to be

01:02reimplemented by the rpcs3

01:04team this is why when you boot up a game

01:06for the first time in rpcs3 it takes so

01:08long on this step compiling PPU modules

01:12an analyzer in rpcs3 finds all the power

01:14PC code in the binary translates it to

01:17llvm ir and llvm converts it into well

01:19optimized x86 code performing the step

01:23before boot helps mitigate any stutters

01:24or hitches that would normally occur due

01:26to translating the code at runtime but

01:28the core that runs power PC code on the

01:30PS3 known as the PPU is pretty much the

01:32least interesting part of the PS3 this

01:35is a die shot of the PS3 CPU this large

01:38structure is the

01:40PPU it supports everything needed to run

01:42a modern operating system and can run

01:44generic code these eight repeating

01:46structures are the spu cores they're

01:49Built For maximum throughput at the

01:50expense of

01:51programmability there's eight of them

01:53here on the D itself but one is disabled

01:55for better yields and another is

01:56reserved by the operating system so when

01:59emulating the place 3 we only need to

02:00worry about six of

02:02them let's take a look at the spu

02:07manual I can't read this let's take a

02:10look at one of the official manuals

02:12instead there's a lot of features that

02:15make the spus unique let's talk about

02:17how they handle floating Point

02:19numbers first of all the manual mentions

02:22that much of the code base for game

02:23applications assumes a single Precision

02:25floating Point format that is distinct

02:26from the i e 754 format commonly

02:29implemented on general purpose

02:31processors what in the world does that

02:32mean well it turns out that the spus

02:34inherit the same crackhead floating

02:36Point format that the PlayStation 2 uses

02:38in this format values of infinity and

02:40Nan are not supported and are instead

02:42interpreted as a number with a very

02:44large exponent under i e 754 any number

02:48with all bits and exponents set are

02:49interpreted as either Nan or Infinity

02:52but in the extended range floating Point

02:53format they're instead treated as an

02:55exceedingly large

02:56number this poses a problem for

02:58emulators as we need to ex Ute

03:00additional instructions to ensure

03:01compatibility with PlayStation 3

03:03software this is what Ninja guid and

03:05sigma 2 looks like when we don't

03:06properly emulate the extended range

03:08floating Point format you can see that

03:10most of the geometry is missing how do

03:12we handle it one method that is common

03:14to both PlayStation 2 and Playstation 3

03:16emulators is to clamp the input values

03:18down from what would be interpreted as

03:19Nan or INF on i e processors to fmax

03:22instead this isn't entirely accurate

03:25however it is pretty fast still even a

03:28fast method adds several instructions to

03:29our implementation what does that look

03:31like this is what the code that emits

03:33the clamping instructions looks

03:35like and this is sort of what the

03:37resulting x86 assembly looks

03:40like two instructions isn't a lot of

03:42overhead but considering that the

03:43PlayStation 3 is 40 times faster than

03:45the PlayStation 2 at crunching through

03:46floating Point numbers you still hope

03:48for something

03:49faster and so I did come up with

03:51something Faster by using the V range PS

03:54instruction we only need one instruction

03:56to do both positive and negative

03:58clamping the behavior of this

04:00instruction is controlled by a constant

04:01value by using the correct constant we

04:03can take the absolute value of our input

04:05take the minimum of fmax in our input

04:06then copy the sign bit from the original

04:08input to the resulting

04:10value crucially the instruction also

04:12differs in how it behaves when either

04:14inputs are Nan if either input is Nan it

04:16will use the second input value as the

04:18result since our second input is fmax

04:20this behavior is desirable unfortunately

04:23the V range PS instruction can only be

04:25used on CPUs that support the AVX 512

04:27instruction set and most people still

04:28don't have a CPU that can execute AVX

04:30512

04:33instructions so what does ninja giden

04:35look like with more accurate floating

04:36Point

04:45emulation the game was missing geometry

04:48with inaccurate floating Point emulation

04:49because the game is performing as much

04:51geometry processing on the spus as

04:53possible many games use a Sony developed

04:55Library known as spus geometry for this

04:58purpose

05:00if we zoom into momiji's face here you

05:02can see that the game is actually

05:03calling triangles that are too small to

05:04be rendered at the game's native

05:06resolution of

05:08720p if we switch over to the same

05:10footage from the PC version you can see

05:12all of that clear up there's still one

05:15more difference with the PlayStation 3

05:16version if you shake the controller like

05:18a [ __ ] then her boobs jiggle like this

05:21for some reason they didn't include this

05:22feature in the PC version what the

05:26hell let's take a look at some specific

05:28spu instructions now fcq or floating

05:31compare equal is a pretty

05:32straightforward instruction take two

05:34numbers if the numbers are equal fill

05:36the result with all ones if they aren't

05:38equal fill the result with all zeros

05:40there's one issue when comparing to the

05:42equivalent instruction in x86 how it

05:44deals with Nan values Nan can't equal

05:46Nan so there's instead two choices of

05:48behavior to choose from an unordered

05:50comparison will return a result of all

05:52ones if either operand is Nan and an

05:54ordered comparison will never return a

05:55result of all ones if either operand is

05:57Nan The Simple Solution here is to

06:00perform both an ordered floating Point

06:01comparison along with an integer

06:03comparison at the same time then

06:04logically order the results together we

06:07do this because an equals comparison in

06:08floating point is essentially an integer

06:10comparison with special cases for Nan

06:12and zero so by combining the results we

06:14can get something accurate to the

06:15PlayStation 3 comparison instructions

06:18are typically followed by this next

06:19instruction cellb this instruction takes

06:21three inputs depending on the contents

06:23of the third input the instruction

06:24selects a bit from either the first or

06:26the second input this process is

06:28repeated 12 28 times for each bit in the

06:31S spu

06:32registers since comparison instructions

06:34like fcq fill the result with the result

06:36of all ones or all zeros this can be

06:38used to select between not just bits but

06:40entire 32-bit floating Point numbers but

06:43if the spu registers are 128 bits why

06:45are we talking about 32-bit numbers well

06:48the spus Implement something called simd

06:50single instruction multiple data this

06:52means that instructions like fcq

06:54actually perform four floating Point

06:56comparisons and pack the results into a

06:57single 128bit register

07:00this is more efficient than executing

07:02four Standalone instructions but comes

07:03at the expense of programmability simd

07:06instructions aren't something exotic all

07:08modern processors support them for

07:10instance SS AVX and AVX 512 are some x86

07:14simd instruction sets that you may have

07:15heard of however The spus Stand Out by

07:18not supporting scaler instructions

07:20instructions that only operate on a

07:21single

07:23input moving back to cellb how do we

07:25emulate it well there's a simple

07:27solution that takes three instructions

07:28we need to Launch logically and one of

07:30the inputs logically and the inverse of

07:32the other input and then or the result

07:34of those two operations together with

07:36AVX 512 there's actually an instruction

07:38that allows you to do all of that in

07:39just one step the VP turn log

07:41instruction can perform any combination

07:43of bitwise operations on up to three

07:45inputs in just one

07:46operation but even without AVX 512

07:49there's a fast way to execute celb 90%

07:51of the time the VP blend instructions in

07:53x86 will select an entire 32-bit Lane of

07:56either of the first two inputs based off

07:58of the most significant bit of the third

08:00Vector since we know that the selection

08:02mask provided for celby will always be

08:04all ones or all zeros following a

08:06comparison instruction the VP blend

08:08instruction is equivalent in this

08:10case let's take a look at the fcg or

08:12floating compare greater than

08:14instruction for just a second we have to

08:16execute some extra instructions to match

08:18PlayStation 3 Behavior just like with

08:20fcq there's a special case optimization

08:22for this instruction that is able to

08:24simplify our emulation to a single

08:25integer comparison given that one of the

08:27inputs is a positive number the great

08:30thing about this is that LM is able to

08:31see the pattern of an integer comparison

08:33greater than followed by a select into a

08:36much simpler integer mid Max instruction

08:38meaning that we only need one

08:39instruction to emulate two PlayStation 3

08:41instructions

08:45neat let's move on to what I consider to

08:48be the most iconic spu instruction Shu

08:52B Shuff b or Shuffle bites is an

08:55incredibly powerful instruction that

08:56also sees a lot of usage since it's so

08:58commonly used it's important that the

09:00implementation is very

09:02optimized x86 programmers may be

09:05familiar with the SS instruction pshu be

09:07pshu b means pack Shuffle bites since

09:10pshu B is simpler than Shu B we're going

09:12to do this in reverse and explain pshu B

09:14first first let's picture an array of 16

09:17numbers and number them in Reverse from

09:1915 down to zero then if we have another

09:22array of 16 numbers we can think of

09:24these values as indices values that will

09:26select a result from our other array so

09:28a value of zero will select a result of

09:30nine from over here this is what it

09:32looks like once the process is repeated

09:33for each element of our Vector you'll

09:35note that this case resulted in a zero

09:37being written to the result P sh B has a

09:39special case where it will write zero to

09:41the result if the indices have the most

09:43significant bit set so how does shf B

09:46differ first of all our input array is

09:48numbered in the opposite order this time

09:50from 0 to 15 secondly we actually have a

09:52second set of numbers this time numbered

09:5416 to 31 since shuy takes two 128bit

09:58input vectors rather than one

10:00just like with pshu B shuby has a

10:02special case when the most significant

10:03bid is set but shuby has three special

10:06cases in total allowing some special

10:08constants to be written to the result

10:09when desired that's a lot of behavior to

10:11emulate how do we do it we're going to

10:14start by taking the input indices and

10:15bit shifting them four places to the

10:17right we then need to take those shifted

10:19indices and use them to index into a

10:21special constant with p b we'll now have

10:23an array filled with our special

10:25constants and zeros for all other values

10:27we'll just hang on to this Vector now

10:28and come back to it later next we need

10:31to xor the value hex F into the original

10:33unshifted indices the purpose of this is

10:35to reverse the order which we index into

10:37the Shu B indices such that pshu B can

10:40match Its Behavior now we need to

10:42execute two pshu B instructions with the

10:44Reversed indices this has to be done

10:46twice since psh B can't take two input

10:48vectors like Shu B can finally we need

10:51to merge the two psh B results along

10:53with our special cases from earlier we

10:55need to bit shift the Shu B indices to

10:56the left three times and then take

10:58advantage of the X8 six blend

10:59instructions to select from each of our

11:01two pshu B results finally we need to

11:04logically or our special cases from the

11:06start into our result since pshu B write

11:08zero in all three of the special cases

11:09where Shu B would write a result the

11:11special cases can be cleanly added to

11:13the result this is what the resulting

11:15x86 assembly looks like we need about

11:17nine instructions here note that I

11:19didn't include loading any constants in

11:21the instruction count in most situations

11:23lovm is able to find some additional

11:24optimizations to reduce the number of

11:26needed instructions here for instance if

11:28shuy is used twice with the same indices

11:30it will recognize that it's already

11:32calculated the special cases for that

11:33set of indices and will admit

11:35calculating it again alternatively if

11:37shuby was called in a loop lvm would

11:39hoise calculating the special cases

11:41outside of the loop such that it's not

11:43being recalculated each iteration

11:45despite the fact that lovm is able to

11:47add so many optimizations by itself we

11:49still have some extra optimizations of

11:50our own for Shu lines of code isn't a

11:52good metric to judge code but just look

11:54at how big this implementation is what I

11:56showed you earlier is all that's needed

11:58for an accurate implementation but all

11:59this extra code exists to make your

12:01video games run fast we have

12:03optimizations for when the input comes

12:04from special indic generating

12:06instructions that are meant to be paired

12:07with SH

12:08B we have optimizations for when the

12:10indices are a constant value allowing

12:12llvm to simplify the code even further

12:15we have optimizations for when llvm's

12:17known bits analysis can determine that

12:18the indices don't contain any special

12:21cases we have optimizations for when the

12:23input vectors were recently bite swapped

12:25allowing us to shuffle the unswapped

12:26data and skip the step where we reverse

12:28the indices

12:30finally we have a special AVX 512 path

12:33this is the piece of code that I'm most

12:34proud of writing in my whole life not

12:36only is it a real mind Bender it's also

12:38blazingly fast this is going to take a

12:40minute to explain so strap in let's take

12:43a look at the resulting assembly this

12:44code produces first this time we only

12:46need five instructions here if we don't

12:48count loading constants the key to all

12:50this is a vg2 p8 apine QB instruction

12:54this instruction is a real mouthful and

12:55I often see people who are new to x86

12:57programming fixate on this instruction

12:59in particular and gawk at the absurdity

13:01of that long name however this

13:03instruction is adored by experienced

13:05programmers who have taken the time to

13:06learn how to use

13:07it this instruction is part of the gfn

13:10instruction set these instructions

13:12operate on something either known as a

13:14galwa field or a finite field galwa

13:16fields are named after the French

13:18mathematician G this doesn't really have

13:21much to do with PlayStation 3 emulation

13:23but wow this guy made a number of

13:25contributions to mathematics before

13:26dying in a duel at the age of 20 what

13:28did do accomplish at 20 years old what

13:30do you intend to accomplish at 20 years

13:32old anyways uh without a background in

13:35mathematics the description for this

13:37instruction is pretty

13:38impenetrable this instruction was

13:40designed to accelerate cryptography and

13:41the description is appropriate for that

13:43audience but this instruction has become

13:45so well known for its potential for

13:46non-cryptographic uses that even Intel

13:48has produced documentation on how it may

13:50be used for non-cryptographic

13:51purposes rather than plagiarizing what

13:53they've written I'm just going to quote

13:54a little bit from the documentation and

13:56then follow up with some

13:58examples Intel says the gf2 field is

14:01simply a gwa field with only two

14:02elements 0 and one the addition of two

14:05values in gf2 is equivalent to addition

14:07modulo 2 or an exclusive or the

14:10multiplication of two values in gf2 is

14:12equivalent to multiplication modulo 2 or

14:14the logical and operation other

14:16operations are similarly defined so a

14:19background and math isn't really

14:20required here from now on we can simply

14:22think of addition in a gwa field as an

14:24exclusive

14:25ore here's an example Intel gives for

14:28transforming an input bite for each

14:30resulting bit we have one input bite a

14:32constant bit and one source bite for

14:34each bit in the first input we take a

14:36bit from The Source bite in the same

14:37position if there's a bit there set it

14:39once we've taken all those bits we

14:41horizontally exor all those bits

14:42together along with the constant bit as

14:45well by only setting one bit in each of

14:47the first input bites we can treat this

14:49instruction as sort of a bit Shuffle

14:51instruction not to similar to how pshu B

14:53and Shu B

14:55work all sorts of other possibilities

14:57arise like emulating a sinx exension of

14:59an odd bit count or perhaps an 8 bit

15:01shift instruction the kind of thing

15:02that's missing from x86

15:04simd let's take a look back at our shfb

15:07llvm code these 16 bytes are mirrored

15:09down the middle since each set of eight

15:11bytes operates in an independent Lane

15:13this bite will select the second most

15:15significant bit from our input these

15:17seven bits will select the third most

15:18significant bit from our input finally

15:21we exclusive or each result by hex 7f

15:24what we're left with are the special

15:25cases with a Shu D instruction we need

15:27to take the minimum between our results

15:28and z because if the second most

15:30significant bit of the indices was not

15:31set the constant should be

15:33zero I'm really proud of how this code

15:36works I've been waiting to talk about

15:37how it works in detail for several years

15:39now and honestly I'm so tired of people

15:42bashing on gf2 p8 aine QB that I'm

15:45officially announcing that I will be

15:46dueling anyone who talks smack about

15:48this instruction from now on that's

15:49right if you say something stupid I'm

15:51going to find you and we're going to

15:52duel to the

15:54death anyways rpcs3 has some other uses

15:57of the gf2 p apine QB instruction too

16:01but this video is long enough already

16:02apparently I left some comments here

16:04though so maybe you can figure out how

16:05it works without my

16:06help okay enough distractions we're

16:09almost done talking about Chef B there's

16:11just one more part of the AVX 512 path

16:13that I want to talk about this vperm 2B

16:16function is unassuming enough if you're

16:18running an AMD processor that supports

16:19ABX 512 it will just output vperm 2B an

16:22instruction that takes two input vectors

16:24sort of like shf B does just without any

16:26special cases cool but on Intel we

16:29instead emulate vperm 2B with a sequence

16:31of two instructions why well vperm 2B is

16:34implemented with three Micro Ops on

16:36Intel you can imagine that it looks

16:37something like how we shuffled twice and

16:39then merge the results when we're

16:40emulating Shu B with pshu B but there's

16:42a faster way to handle this on Intel

16:44slightly Faster by inserting our Second

16:46Source Vector into the upper 128 bits of

16:48the First Source Vector vperm B actually

16:50becomes equivalent in Behavior to vperm

16:522B and yes emitting all of this llvm

16:55magically compiles down to just two

16:57instructions I I was unsure if this

16:59optimization really would be faster

17:01despite documentation suggesting that it

17:03would so I benchmarked it back in the

17:04day on my tiger Lake laptop the avx2

17:07path had a throughput of one emulated sh

17:09B per four Cycles the old AVX 512 path

17:12had a speed of one sh B per three cycles

17:14and the latest AVX 512 path has a speed

17:16of one shy per 2.3 Cycles pretty

17:19sweet so I could spend more time talking

17:22about specific PlayStation 3

17:24instructions but I think by now you

17:25understand the type of effort that the

17:26rpcs3 team puts into making them run

17:28fast

17:29so let's switch subject a little what

17:31kind of impact do different instruction

17:33sets have on

17:34performance this is how God of War 3

17:37performs when targeting a processor with

17:38ssse 4.1 instructions the game performs

17:41about 30% faster with avx2 instructions

17:44largely thanks to the addition of fuse

17:45multiply ad instructions another 20%

17:48performance is gained when targeting AVX

17:49512

17:52instructions so when looking for a new

17:54CPU you should get one with AVX 512

17:56right well sort of depending on the game

17:59the fastest CPU might be something

18:00without AVX 512 support and yeah the

18:03latest Intel CPUs don't support AVX 512

18:06this is despite Intel creating the

18:07instruction set and shipping it on CPUs

18:09from several years ago how

18:12come well Intel's latest CPUs have

18:14something they call efficiency cores on

18:16them unfortunately since the efficiency

18:17cores don't support AVX 512 the larger

18:20performance cores can't enable AVX 512

18:22support until is working on a solution

18:24to this that they call

18:25avx1 avx1 is essentially AVX 512 but

18:28with with only 256bit long vectors so

18:31you can use all the new AVX 512

18:32instructions without implementing

18:34512-bit long vectors problem is ax10 is

18:37still several years away from Shipping

18:38in a client

18:39product AMD recently posted some patches

18:42to the open source GCC compiler with

18:44details on their upcoming Zen 5 CPUs one

18:47of the new features with Zen 5 is that

18:48it can execute 512-bit wide AVX 512

18:51instructions in a single cycle I made an

18:53entire blog post on this subject

18:55previously with the goal of explaining

18:56exactly why AVX 512 is useful for RPC 3

18:59which has nothing to do with how wide

19:00the instructions are rpcs3 does use some

19:02wide AVX 512 instructions but since

19:04those make up less than 1% of the code

19:06executed doubling the speed won't make

19:08much of an

19:09impact however the other new features

19:11detailed in the Zen 5 GCC patches are a

19:13big deal for rpcs3 like the doubled L1

19:15and L2 bandwidth let's get back to

19:18talking about the PlayStation 3 Hardware

19:20Theus are a pain to program part of the

19:22problem is that the spus execute nothing

19:24but 128bit simd instructions but there's

19:26an even bigger problem when it comes to

19:27programmability the spus cannot directly

19:30access main memory why do they do this

19:33let's check out how a conventional

19:34machine loads from memory first we

19:36execute a load instruction which

19:37generates an address next we need to

19:39translate that address from a virtual

19:41address to a physical address the

19:42mapping for translating addresses is

19:44held somewhere in memory but it would be

19:45too slow to look that up each time so

19:47address translations are cached in

19:48something called a translation look

19:50aside buffer once the address is

19:52translated the processor will check if

19:54that address exists in the caches

19:56typical modern processors have three

19:57levels of cache

19:59finally if all level of caches didn't

20:01contain our address the processor makes

20:02a request to the memory controller to

20:04load the cach line that contains our

20:05address that's an extreme amount of

20:07complexity considering how common load

20:08and store instructions are on the spus

20:11load and store instructions are quite

20:12simple once the address is generated the

20:14processor loads the data found at that

20:16address there's no virtual to physical

20:18translation and there's no CES to check

20:21each spu can only directly access 256

20:23KOB of local storage which is made up of

20:25SRAM similar to caches on other CPS

20:27without the complexity of having the

20:29hardware manage evictions and fills to

20:31actually perform real work there still

20:32needs to be a way to communicate with

20:34the rest of the system to accomplish

20:35this the spus makes use of something

20:37known as dma direct memory access the

20:39dma controller receives requests from

20:41programs and moves blocks of memory

20:42between main memory and the spus the big

20:44pain point is that all of this is

20:46programmer

20:47managed let's talk about one last

20:49subject I'm going to use near as an

20:51example again this game has more than a

20:53few funky things hiding in its code the

20:55game likes to sleep for 10 microc

20:56seconds on the main thread on Linux this

20:58work works fine enough we sleep for 10

21:00micros on Windows this becomes a bit of

21:02a problem because the minimum amount of

21:04time we can sleep for is 500 microc the

21:07solution is to busy weight keep the

21:09thread active and wait down the timer in

21:10user space instead the problem is that

21:12this Burns Power and doesn't let the

21:14operating system reschedule the thread

21:15elsewhere while waiting on the timer

21:18there exists instructions that allow the

21:19processor to efficiently wait on a timer

21:21without blowing up power usage

21:23historically these instructions have

21:24only been available to the operating

21:25system and normal programs can't execute

21:27them recently however AMD and Intel have

21:30added instructions that allow you to

21:31wait on a timer efficiently to user

21:33space thing is this is one rare

21:35situation where AMD and Intel have

21:37implemented different instructions for

21:38the same purpose so we need to write

21:40different paths for each brand this time

21:42the good news is that using these

21:43instructions comes with the measurable

21:44performance and power draw

21:47Improvement it's about time to wrap up

21:49this video I talked about a lot of stuff

21:51that I find exciting but for each

21:53subject I touched upon there were about

21:5410 others that I admitted I make

21:56contributions to the rpcs3 proc project

21:58so what I want to speak about aligns

22:00with the kinds of things that I want to

22:01work on as well but there are many other

22:03parts of the emulator that I don't work

22:04on like the GPU emulation which is full

22:06of just as much genius as the rest of

22:08the emulator I'll leave some links in

22:10the description that cover some

22:11additional details on rpcs3 if you're

22:14curious I wanted to cover some very

22:16technical subjects but I also didn't

22:17want to spend time explaining some

22:19Basics like what a bti is and so on so I

22:21hope that I struck a good balance if you

22:24enjoyed the video I'd appreciate it if

22:25you could like And subscribe as I do

22:27intend to create several more videos

22:28which may or may not be rpcs3 related

22:31I'm between jobs at the moment so I'd

22:32appreciate if you could share the video

22:34with as many people who you think would

22:35like it as you can I don't intend to

22:37become a full-time YouTuber but I don't

22:39have much else going on right now so I

22:41do intend to make at least a couple more

22:42videos all right peace out see you nerds

🎥 Related Videos

This Game is an Absolute Mess! Lords of the Fallen - Review

Scumbag Intel: Shady Practices, Terrible Responses, & Failure to Act

The Omnissiah EXPLAINED By An Australian | Warhammer 40k Lore

Let's Talk About Rebirth's Ending | Theories & Ultimania Details SPOILERS

I Became a Millionaire by Obliterating an Entire Island in Hydroneer

The History of Cheating in Clash of Clans

🔥 Recently Summarized Examples

The Hitler-Stalin Pact | Reflections Episode 9

Uncovering Corruption From Health "Experts" | Scott Carney

The Forgotten Geometry: A New Path to Unification

Joe Rogan Experience #2194 - Luis Elizondo

From Tesla to DNA: The Science of Scalar Waves - Dr. Sandra Rose Michael - Think Tank E44

Bitcoin Holders...Watch Out for Sept

View original video