WEBVTT

00:00.000 --> 00:12.520
So, hello everyone. My name is Tibur Afayak. It is an honor to be here for the second

00:12.520 --> 00:20.800
time as a speaker at FOSDEM. I'm here to share with you some more innovations in H.264

00:20.800 --> 00:27.240
and video decoding in general. A bit about me first: I have a PhD in human-computer

00:27.240 --> 00:32.560
interaction and software engineering. Also, a background in algorithmics and

00:32.560 --> 00:37.200
competitive programming. I worked on the architectures of graphical user interfaces in

00:37.200 --> 00:44.640
the past. I do a lot of code golfing in my spare time. I'm really passionate about

00:44.640 --> 00:51.600
software architectures and simplifying things in general. So, the basis of this talk is my

00:51.600 --> 00:56.820
work that has been going on for 10 years on edge264, which I will call edge in this

00:56.820 --> 01:07.560
talk. This is a decoder for AVC that supports the Progressive High and MVC 3D profiles. In

01:07.560 --> 01:14.560
bold, I'm displaying things that were added last year. So, I've worked on SSE2 and ARM64 support,

01:14.560 --> 01:20.320
which are now complete. They won't evolve. It's fully optimized for them.

01:20.320 --> 01:25.760
It's supposed to build on Windows and Linux, but at the moment it needs CI, continuous integration,

01:25.760 --> 01:32.760
because it keeps breaking when I add things on my Mac. So, I need to work on that. Multi-threading

01:32.760 --> 01:38.280
was also implemented. It was super hard. It's still unsatisfactory at the moment, because

01:38.280 --> 01:43.720
it's not really my priority. I just added it so that I could see how the API would change because

01:43.720 --> 01:49.920
of multi-threading, but it will change, of course. It will improve. With multi-view support

01:49.920 --> 01:57.240
now in FFmpeg, hopefully I will actually add it this year. That's a mood. So,

01:57.240 --> 02:04.600
this is the right time to add it into FFmpeg. Okay, first, some benchmarks. So, I did

02:04.600 --> 02:12.960
this benchmark last week. It's currently about 10% faster than FFmpeg, both on Intel

02:13.040 --> 02:19.880
and ARM chips on my Raspberry. I just bought a Raspberry because it's cheaper. The code size

02:19.880 --> 02:28.320
is actually what's important. It's four times lighter than all of the state of the art.

02:28.320 --> 02:32.600
And last year, it was a three times code gap. Now, it's a four times code gap, because

02:32.600 --> 02:42.720
I now added ARM support from other decoders. And the binary size includes all the x86

02:42.800 --> 02:49.440
microarchitecture levels and the debug. So, that's why it's two times lighter,

02:49.440 --> 02:54.440
only, but it becomes six times lighter if you remove all this and compile it natively for

02:54.440 --> 02:57.800
your chip.

02:57.800 --> 03:04.440
So, no, thank you for your attention. This was from last year. So, last year, I actually presented

03:04.440 --> 03:10.440
nine techniques, which I encourage you to watch if you haven't already. Overall, I've been

03:10.440 --> 03:17.320
really satisfied with them from last year. Except for one, which is the use of global

03:17.320 --> 03:22.920
register variables for the context pointer. It required patching all of the function definitions

03:22.920 --> 03:27.880
and calls with a macro. But the benefit vanished over the years, probably due to higher

03:27.880 --> 03:34.680
register pressure. So, today I'm going to present five new techniques. And

03:34.760 --> 03:40.280
first of all, a question, who among you worked already on video or audio codecs, just to

03:40.280 --> 03:49.240
know? Okay, that's a third of the room. This talk will speak mostly to people

03:49.240 --> 03:54.440
who have worked on codecs, but it might also interest people who worked on multi-architecture

03:54.440 --> 04:03.960
SIMD programming. Let's go first to an introduction on AVC. So, AVC is a video format that

04:04.040 --> 04:10.680
segments your image into 16 by 16 blocks called macroblocks, that it gives

04:10.680 --> 04:17.400
to you in zigzag order, so in a frame-scan order, and then each macroblock is further segmented

04:17.400 --> 04:24.040
into 8x8 or 4x4 blocks. And your code base is going to be split roughly into six parts,

04:24.600 --> 04:30.840
which I won't enumerate here, but which you can get for reference. Let's go for the first.

04:31.320 --> 04:37.720
So, the first technique, I'm sharing these techniques here as I think they are interesting to

04:37.720 --> 04:42.840
know, and also because they are different from the usual things you will see in the code bases of other codecs.

04:42.840 --> 04:49.400
So, that's why I'm sharing them here. The first thing is the piston bitstream reader. In AVC,

04:49.960 --> 04:57.560
you will usually get a sequence of bytes from the memory. And you want to extract a number of bits

04:57.640 --> 05:01.400
each time from the byte stream that are going to be your symbols that you are going to fetch,

05:01.400 --> 05:07.240
one by one. And the spec says that you always start from the most significant bit inside each byte

05:07.240 --> 05:13.720
to the least significant. So, the multi-byte order will be big-endian. So, you always have to convert your

05:13.720 --> 05:22.440
little-endian words to big-endian. Then the most common codes that are found in AVC are called

05:22.520 --> 05:30.440
Exponential-Golomb codes. These codes encode the size of the code in the number of zeros.

05:30.440 --> 05:35.960
For example, the first one is here. The size of the code in the number of zeros. And then you have

05:35.960 --> 05:42.840
the code itself in bits plus one. So, the first one is going to be 14. Then if you still have

05:42.920 --> 05:54.360
always Exponential-Golomb codes, it's going to be 144. Then 85. With Exponential-Golomb codes,

05:54.360 --> 06:01.960
the maximum bit length of any Exponential-Golomb code is going to be 63 bits, which matters

06:01.960 --> 06:10.520
for our bitstream reader. And the maximum number of zeros that you may have is 31. And what I

06:10.520 --> 06:16.200
describe as a bitstream reader essentially boils down to two functions. One that

06:16.200 --> 06:21.320
shows you the next bits that are in your bitstream that you need to consume. And one that allows you

06:21.320 --> 06:27.160
to consume the number of bits you want to consume, or to discard them.
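
NOTE
Editor's sketch, not code from the talk. A minimal show/skip reader with direct byte reads and an unsigned Exp-Golomb decoder on top of it. The names BitReader, show_bits32, skip_bits and read_ue are hypothetical, and __builtin_clz assumes GCC or Clang.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
/* Hypothetical minimal reader: direct byte reads, no cache. */
typedef struct {
    const uint8_t *buf; /* start of the byte stream */
    size_t size;        /* size of the stream in bytes */
    size_t bitpos;      /* number of bits consumed so far */
} BitReader;
/* Show the next 32 bits, most significant bit first, zero-padded past the end. */
static uint32_t show_bits32(const BitReader *r) {
    size_t byte = r->bitpos >> 3;
    uint64_t v = 0;
    for (size_t i = 0; i < 5; i++)
        v = (v << 8) | (byte + i < r->size ? r->buf[byte + i] : 0);
    return (uint32_t)(v >> (8 - (r->bitpos & 7)));
}
/* Consume n bits so that the next call shows the following ones. */
static void skip_bits(BitReader *r, unsigned n) { r->bitpos += n; }
/* Decode one unsigned Exp-Golomb code: z zeros, then z+1 bits holding value+1. */
static uint32_t read_ue(BitReader *r) {
    uint32_t next = show_bits32(r);
    assert(next != 0); /* at most 31 leading zeros are allowed */
    unsigned z = (unsigned)__builtin_clz(next);
    skip_bits(r, z);
    uint32_t code = show_bits32(r) >> (31 - z);
    skip_bits(r, z + 1);
    return code - 1;
}
```
For instance, the bit string 1 010 011 00100 00111 decodes to the values 0, 1, 2, 3 and 6.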

06:27.160 --> 06:37.080
So, that you may move on to the next bits. Now for the technique itself. I'm going to give you

06:37.400 --> 06:42.600
a quick overview of some approaches you may try or have tried. They are not complete,

06:42.600 --> 06:47.560
but they are just what I could enumerate. For each approach, I will give the number

06:47.560 --> 06:54.600
of bits that you get when you show bits. That is the working number of bits that you have access to.

06:55.160 --> 07:01.240
And the number of reads and writes each time you call the function show bits. So, basically the

07:01.240 --> 07:10.760
number of reads per code read. The first series of techniques use direct reads to memory either

07:10.760 --> 07:19.480
unaligned or aligned. They always incur a big-endian conversion. And they are branchless. So,

07:20.120 --> 07:27.240
they are branchless in the sense that every show bits doesn't incur a test. It just updates

07:27.400 --> 07:32.840
the memory pointer. So, the variables you have to update there are the byte pointer, the byte

07:32.840 --> 07:39.800
pointer to memory, and the shift, a bit shift to actually realign the value to a bit position.

07:40.600 --> 07:45.160
The problem here is that we have a read dependency that is you have to first fetch your

07:46.440 --> 07:51.960
pointer then to read from that pointer. And you have a direct memory dependency between two reads,

07:52.040 --> 07:58.760
which is awful on every CPU. The second series of techniques are going to be single cache.

07:59.480 --> 08:04.920
So, you have a cache where you store your bits and that you refill sometimes when you don't have

08:04.920 --> 08:12.040
enough bits. You have to maintain two variables, one for the cache and one for the bit shift

08:12.760 --> 08:20.040
to count the number of bits that are available in your cache or remain. It introduces a

08:20.120 --> 08:26.120
refill test each time you call your show function. And the bits depend on the refill size.

08:26.120 --> 08:32.680
So, it really depends on the implementation. And you also have a variant that updates the cache

08:32.680 --> 08:39.320
itself each time you show, which has a shorter dependency. So, it trades a write for a shorter dependency

08:39.320 --> 08:44.920
and that's usually a good thing. And this last variant is used

08:44.920 --> 08:59.720
in FFmpeg's H.264 decoder and in dav1d, the AV1 decoder. The last series uses two

08:59.720 --> 09:07.960
caches, a wider cache, and trades a slower shift, that is a double shift, and more

09:08.520 --> 09:16.200
writes for a bigger available number of bits. And my design is going to be a double cache

09:16.920 --> 09:23.880
without a shift count and so I replace the shift count with a set bit at the end of the cache

09:23.880 --> 09:31.560
to mark the end of the bits available. So, it's roughly equivalent to the one that FFmpeg and

09:31.560 --> 09:39.880
dav1d use, but with a bigger number of bits; however, you get a slower shift. So, it's a trade-off,

09:39.880 --> 09:47.400
it's not going to be perfect. The additional design choices that you may have are whether you

09:47.400 --> 09:55.160
refill before or after showing bits, and whether you inline your show bits or not. So, to describe

09:55.160 --> 10:01.720
this technique, let's say you have your two caches. I set them as 32 bits because otherwise it's

10:01.720 --> 10:08.120
too large to display. Let's say we have bits. In black bold we have the bits

10:08.120 --> 10:18.120
actually read from the byte stream, and the red one is the set bit. When showing bits

10:18.200 --> 10:23.480
everything is already in the MSB cache. So, we just return this variable. Then to

10:24.520 --> 10:34.120
discard the bits, you actually shift both caches down, and you do so on and on until the

10:34.120 --> 10:41.400
second cache is 0. So, to test for refill you actually test that the second cache is 0. If it is

10:42.280 --> 10:49.480
then it's easy. You just mark the position of the set bit, which I call the piston bit. You remove

10:49.480 --> 10:56.600
it. You call your refill function, insert the bits inside both caches, shifted, and then you

10:56.600 --> 11:05.320
re-insert the piston bit. The trailing bit is essentially a piston that pushes the bits to be

11:05.320 --> 11:11.640
consumed, which is why I call it the piston bit. This technique might already exist. I haven't seen

11:11.640 --> 11:17.320
it anywhere in any other code base, but I'm not inventing things. I'm just rediscovering things that

11:17.320 --> 11:25.240
probably already exist. Note that the refills here are register-sized now. So, here each refill is

11:25.240 --> 11:31.080
going to be wide. It's going to be either 32 bits on 32 bit machines or 64 bits on 64 bit machines.
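
NOTE
Editor's sketch of the double-cache design described above, under stated assumptions: 64-bit caches, zero padding past the end of the stream, and hypothetical names (PistonReader, load_be64, refill, reader_init, show_bits, skip_bits). It is not the code from edge264. __builtin_ctzll assumes GCC or Clang.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
typedef struct {
    uint64_t msb;       /* next bits to show, most significant first */
    uint64_t lsb;       /* following bits, terminated by the piston bit */
    const uint8_t *p;   /* next byte to load */
    const uint8_t *end; /* end of the byte stream */
} PistonReader;
/* Load 8 bytes as one big-endian word, zero-padded past the end. */
static uint64_t load_be64(PistonReader *r) {
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w = (w << 8) | (r->p < r->end ? *r->p++ : 0);
    return w;
}
/* Called when lsb == 0, that is when the piston bit has moved into msb. */
static void refill(PistonReader *r) {
    unsigned s = (unsigned)__builtin_ctzll(r->msb); /* piston position */
    uint64_t w = load_be64(r);
    r->msb = (r->msb & (r->msb - 1)) | (w >> (63 - s)); /* remove piston, append bits */
    r->lsb = ((w << s) << 1) | (1ULL << s);             /* rest of w, then the piston */
}
static void reader_init(PistonReader *r, const uint8_t *buf, size_t size) {
    r->p = buf;
    r->end = buf + size;
    r->msb = 1ULL << 63; /* no valid bits yet, only the piston */
    r->lsb = 0;
    refill(r);
}
/* Return the next n bits without consuming them. */
static uint64_t show_bits(const PistonReader *r, unsigned n) {
    assert(n >= 1 && n <= 63);
    return r->msb >> (64 - n);
}
/* Discard n bits by shifting both caches, then refill if the piston left lsb. */
static void skip_bits(PistonReader *r, unsigned n) {
    assert(n >= 1 && n <= 63);
    r->msb = (r->msb << n) | (r->lsb >> (64 - n));
    r->lsb <<= n;
    if (r->lsb == 0) /* cheap refill test: no bit counting needed */
        refill(r);
}
```
In this sketch msb always holds 64 readable bits after each skip, because the refill happens as soon as the piston bit leaves lsb.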

11:32.040 --> 11:37.880
And note also that the refill test is cheap. It's just testing for 0. So, you don't need to compute

11:37.880 --> 11:45.640
the number of remaining bits with a count-leading-zeros, which is good. The second technique follows

11:45.640 --> 11:52.680
from that: now that we have wide refills, we can talk about unescaping. Actually, AVC

11:52.760 --> 12:00.760
may contain escape sequences, which are 0, 0, 3. This sequence must be trimmed every time

12:01.320 --> 12:09.160
you parse by removing the byte 3 before parsing. It prevents sequences 0, 0, 0 and 0, 0, 1

12:09.160 --> 12:14.600
from appearing anywhere inside the sequence, because they are commonly used as delimiters in transport streams.
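
NOTE
Editor's sketch: a plain scalar reference for the unescaping rule just described, useful as a baseline to test a SIMD version against. This is not the vectorized algorithm from the talk, and the name unescape_scalar is hypothetical.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
/* Copy src to dst, dropping every byte 3 that follows two zero bytes.
   Returns the number of bytes written; dst must hold at least size bytes. */
static size_t unescape_scalar(uint8_t *dst, const uint8_t *src, size_t size) {
    size_t out = 0;
    unsigned zeros = 0; /* number of consecutive zero bytes just seen */
    for (size_t i = 0; i < size; i++) {
        if (zeros >= 2 && src[i] == 3) {
            zeros = 0; /* escape byte: drop it and restart the count */
            continue;
        }
        zeros = (src[i] == 0) ? zeros + 1 : 0;
        dst[out++] = src[i];
    }
    return out;
}
```
A byte 3 that follows only one zero is kept, since it is not part of an escape sequence.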

12:16.280 --> 12:22.120
Usually what decoders do is that they scan beforehand and perform an unescaping

12:22.200 --> 12:28.120
if the buffer contains escape sequences. But now with wide refills,

12:29.000 --> 12:36.680
SIMD becomes viable because we have 8 bytes to unescape. So, we can unescape these 8 bytes in

12:36.680 --> 12:44.680
parallel, all at once. And since these are wide, we can actually unescape them on the fly.

12:44.680 --> 12:49.080
We don't need to actually manage a buffer or store anything beforehand or even scan the

12:49.080 --> 12:56.120
stream beforehand. We can just unescape these bytes on the fly. And since this is a good example

12:56.120 --> 13:01.880
of an algorithm that we call first matching condition in SIMD, I'm going to detail it here

13:01.880 --> 13:06.840
because I think it's a classical algorithm that anybody has to know and that's used a lot

13:06.840 --> 13:12.120
in the whole code base. So, first matching condition.

13:12.280 --> 13:21.480
The process takes a memory pointer as input, your byte pointer, byte stream pointer,

13:21.480 --> 13:28.120
and outputs a 64-bit value in the end. In non-gray is the 64-bit value

13:28.120 --> 13:34.440
that you want to extract in the end. What you would do is first load 16 bytes out of memory,

13:35.160 --> 13:40.520
two bytes ahead of the pointer because the two bytes before may contain the start of an escape

13:40.600 --> 13:45.960
sequence that is going to make your first byte be escaped. It's possible.

13:47.400 --> 13:54.200
If we get too close to the end of the buffer, we actually read up to an aligned boundary rather than

13:54.200 --> 14:02.120
up to the end itself because there is a very unlikely case of a tiny buffer where reading 16 bytes

14:02.120 --> 14:09.960
before the end of the buffer could make you cross a page

14:09.960 --> 14:19.960
boundary and trigger a segfault. So, reading up to an aligned boundary avoids that risk because

14:19.960 --> 14:25.560
pages always have a bigger alignment, so you're sure you're not going to run into a segfault.

14:26.440 --> 14:36.680
So, once we have our vector, we compare it to 0 and 3 to make masks of the positions of 0s

14:36.680 --> 14:42.280
and 3s in the vector, then we shift and combine them to form a byte mask

14:42.840 --> 14:48.520
of the positions where we are going to delete the number 3, the byte 3.

14:49.160 --> 14:57.160
We convert that mask into a bit mask using a movemask on Intel and a narrowing right shift on

14:57.160 --> 15:03.800
ARM that was published two or three years ago by Google engineers. I encourage you to google that

15:03.800 --> 15:11.960
if you're interested, which is faster than the equivalent movemask, and it's important because

15:11.960 --> 15:18.600
here this code is actually critical in the code. Once you have your bit mask, you would iterate

15:18.600 --> 15:24.440
on set bits, masking only the first 8 bits because these are the bytes

15:24.440 --> 15:28.920
we're going to want to extract and for each position we're going to shuffle the vector

15:28.920 --> 15:34.040
to remove the byte and this is a shuffle on a mask that is selected by an array basically.

15:36.040 --> 15:41.160
And as you can see, the first shuffle made another 3 appear in the bit mask, so we are also

15:42.120 --> 15:47.880
and in the end, once our bit mask is 0, we have our vector, we just,

15:47.880 --> 15:53.320
is it this? Nope, yeah. We just extract this value and give it to the refill function.

15:54.840 --> 16:00.920
Is it good so far? Third technique. Okay, this is probably what you will be most interested in.

16:01.640 --> 16:07.720
It is multi-architecture SIMD programming. What I mean is programming in C for both

16:07.800 --> 16:17.320
Intel SSE and ARM NEON. As a reminder, edge uses both GCC vector

16:17.320 --> 16:23.240
extensions in addition to vector intrinsics. The reason for that is basically vector extensions

16:23.240 --> 16:29.800
replace all of the arithmetic instructions and element accesses with just array accesses and

16:29.880 --> 16:38.440
plus operators from C. So it helps your code be more compact. I use typedefs, I use

16:39.880 --> 16:47.080
unions for vector-capable arrays, and since last year I shortened all of the

16:47.080 --> 16:53.400
Intel intrinsics into aliases because they are unreadable and hard to type, you know, just

16:53.480 --> 16:59.080
typing the underscores is annoying. So I replaced all of them and, as a bonus,

16:59.080 --> 17:05.160
it facilitated the transition to ARM because I already had aliases to all of the macros,

17:05.160 --> 17:13.240
to all of the intrinsics. Who among you, as a second and last question, already worked with ARM,

17:13.320 --> 17:25.080
ARM code, just to know? Okay, not many, but still. So, one more disclaimer here:

17:25.080 --> 17:29.240
I make a difference between multi-architecture and portable. This is not portable code.

17:30.120 --> 17:35.480
Here, I'm working on multi-architecture code, that is code that is going to yield the best

17:35.480 --> 17:43.400
possible code for both Intel and ARM. It is not going to compile for RISC-V or for other

17:43.400 --> 17:49.640
architectures, not now. My goal is only to have the best possible code and restricting the number of

17:49.640 --> 17:57.880
CPUs helps you have actually better performance on these architectures. Last year I actually

17:58.520 --> 18:02.840
talked against using the shufflevector, convertvector, elementwise and reduce

18:02.840 --> 18:09.400
built-ins from Clang, and I did not really explain why. So now I'll actually describe it a bit more.

18:09.400 --> 18:15.640
I think these extensions are designed so that they work on the largest number of architectures

18:15.640 --> 18:22.600
possible. But behind the scenes, they may actually translate to a bigger number of instructions,

18:22.600 --> 18:27.480
and that's not what I want. I want to have the best possible code. So usually I try to avoid them

18:27.960 --> 18:31.800
because I'm not always sure that they're going to translate to one instruction.

18:33.400 --> 18:38.440
So their philosophy is more about being portable, and my code is designed to be multi-architecture. It's

18:38.440 --> 18:46.920
kind of a different philosophy. So here I'm going to describe the workflow, how I added

18:47.000 --> 19:00.440
NEON support on top of code using SSE intrinsics. I think this workflow is reproducible and I'm going to describe

19:00.440 --> 19:05.480
it this way. So the first thing you would do is to write unit tests. That's essential. You don't

19:05.480 --> 19:11.640
want to debug afterwards. You want to have something that works right when you code it,

19:12.360 --> 19:17.480
that is correct. So you want to write unit tests using your existing code. So you create test

19:17.480 --> 19:24.520
cases. You run your existing code on these test cases. Pick the results and serialize these results

19:24.520 --> 19:33.560
so that you can match your future code against these tests. Then that's not avoidable. You have

19:33.560 --> 19:39.000
to write the best possible NEON code in intrinsics. You don't write something that is in between.

19:39.080 --> 19:47.160
You have to write the best possible code in assembly. You can't go around that. Then once you

19:47.160 --> 19:54.920
have your best code in NEON and SSE, then you can work on actually merging the codes together

19:54.920 --> 20:04.360
to make your code shorter. So what I usually do is find aliases whenever I can; otherwise

20:04.360 --> 20:10.280
I fall back to if/else preprocessor blocks, and that accounts for about 20% of the

20:10.280 --> 20:16.680
SIMD code, which is not that much. The code base is still quite small. If you're interested,

20:16.680 --> 20:21.960
you can look inside the code base at the files inter, intra, residual and deblock.

20:23.960 --> 20:31.480
Now let's go through what went right and what went wrong. The good: aliases for instructions

20:31.480 --> 20:39.720
that are essentially the same in ARM and Intel. There are instructions that are close,

20:39.720 --> 20:45.560
but not exactly the same. And for these, I created semantic variations. That is, I created

20:45.560 --> 20:50.760
functions, like I duplicate the function with semantic variations. A typical

20:50.760 --> 20:58.760
example is the blend. In SSE, you have a blend instruction that operates on the most significant

20:58.760 --> 21:06.920
bit of a vector to select between the two other vectors. On ARM, it's called bit

21:06.920 --> 21:12.760
select. It's based on the mask, not the most significant bit. So I created actually two functions,

21:12.760 --> 21:19.880
if else based on the most significant bit or if else based on mask. Same goes for the shuffle instructions

21:19.880 --> 21:28.200
because the semantics are very different. On ARM, the shuffle instruction, if you give it a negative

21:28.200 --> 21:35.080
or an out-of-bounds positive value, it will return zero.

21:35.080 --> 21:44.840
On Intel, if you give it an out-of-bounds value that is positive, it will AND it. So it will

21:44.840 --> 21:53.160
give you a value, not zero. Polyfills. For key ARM instructions that are absent in SSE,

21:53.160 --> 22:02.040
I created polyfills. So for example, the duplicate, or the broadcast, in Intel is going to be

22:02.040 --> 22:08.920
set plus shuffle. Polyfills the other way around. So missing SSE instructions are going to be

22:08.920 --> 22:17.400
compiled into a set of instructions for ARM. And I created specialized helper functions for code

22:17.400 --> 22:21.800
that is very redundant. For example, loading a 4x4 matrix. Basically, I have a function

22:21.800 --> 22:29.160
for that, or for the multiply, add, shift right and pack to 8 bits, which is very common.
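
NOTE
Editor's sketch of the two blend variants mentioned earlier, written with GCC/Clang vector extensions instead of intrinsics so that it compiles on both Intel and ARM. The names i8x16, ifelse_mask and ifelse_msb are hypothetical; a real code base would presumably map each one to the native blend or bit-select instruction.
```c
#include <assert.h>
#include <stdint.h>
typedef int8_t i8x16 __attribute__((vector_size(16)));
/* Select bits of a where the mask bits are set, bits of b elsewhere
   (the semantics of a bit-select). */
static inline i8x16 ifelse_mask(i8x16 m, i8x16 a, i8x16 b) {
    return (a & m) | (b & ~m);
}
/* Select whole bytes of a where the most significant bit of m is set,
   bytes of b elsewhere (the semantics of a byte blend). */
static inline i8x16 ifelse_msb(i8x16 m, i8x16 a, i8x16 b) {
    i8x16 zero = {0};
    return ifelse_mask(m < zero, a, b); /* comparison yields 0 or -1 per byte */
}
```
The two functions only agree when every mask byte is 0 or -1, which is why keeping both variants lets each call site pick the cheaper one.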

22:31.080 --> 22:38.440
Now what went wrong? GCC and Clang do not model multiple-latency instructions. So basically,

22:39.080 --> 22:44.360
what I understand is that in GCC and Clang, for what I have tested so far, any instruction

22:44.360 --> 22:50.040
is going to be modeled as having a single latency, but ARM has instructions that have several

22:50.120 --> 22:54.600
latencies. What I mean is, for example, you have an instruction that is

22:55.640 --> 23:01.880
MLA, multiply-accumulate. It's going to be a multiplication and then an addition to another vector.

23:01.880 --> 23:06.520
The multiplication is going to be like four cycles. The addition is going to be one cycle,

23:07.080 --> 23:15.320
but GCC and Clang will think it's four cycles overall, but actually inside the microcode,

23:15.400 --> 23:22.200
it's modeled as two separate instructions somehow. So it has two latencies, and the compiler

23:22.200 --> 23:26.760
is not understanding that. So that's actually a big, big, big problem, and that probably

23:26.760 --> 23:33.080
accounts for the fact that the code is not so much faster than FFmpeg in practice.

23:35.560 --> 23:39.960
Other problem: Clang splits accumulating intrinsics instead of merely generating them.

23:39.960 --> 23:46.680
Boo, Clang. Boo. It thinks it's clever, so it's just splitting my multiply-accumulate,

23:47.720 --> 23:51.960
and then I have to work afterwards to make it generate multiply accumulate the way I want.

23:55.640 --> 24:03.800
Boo. Now, narrowing intrinsics in ARM return 64-bit vectors, whereas the instructions

24:03.800 --> 24:11.080
actually return wide vectors. So the upper part is going to be zero, but the API doesn't show

24:11.080 --> 24:17.160
that. So you can't expect that, and that's a problem. The across-vector intrinsics, for example

24:17.160 --> 24:22.600
add all the elements in a vector, return an int, whereas the instructions actually return a

24:22.600 --> 24:32.040
wide vector. So that's a huge problem. I usually just ignore that. And between ARM and Intel,

24:32.120 --> 24:38.440
the instructions for shifting by element and multiplying by element are wildly incompatible,

24:38.440 --> 24:49.640
so that will lead us to the ugly. The workarounds. Okay, the first one is to use inline asm,

24:49.640 --> 24:57.080
in a macro, of course, to force the use of MLA, or to make an intrinsic return a vector

24:57.160 --> 25:03.880
that is proper. Another one, this one is ugly, but I prefer this, is to rely on

25:03.880 --> 25:10.840
dead code elimination from the compiler to ensure that one of the instructions is not going to

25:10.840 --> 25:17.880
be compiled actually. So I said that the variable shift is different between, okay,

25:17.880 --> 25:24.120
the variable shift is different between Intel and ARM. So I implement on the bottom different

25:24.200 --> 25:30.040
shift instructions. And on the top, the computed variable shift is different between the two,

25:30.040 --> 25:39.000
but it's in common code. And whenever I go here, this makes WD64 going to be used, and the other

25:39.000 --> 25:43.480
one is going to be unused. The compiler detects that and removes it altogether.

25:45.160 --> 25:51.640
I disable the unused warning. Come on, I mean, we know what we're doing here. We're

25:51.640 --> 26:01.480
grown-ups. Address calculation. This one is annoying. GCC and Clang do not model multiple latencies.

26:01.480 --> 26:07.480
So they won't generate post-index for reads and writes that are going to be spaced by less than a few

26:07.480 --> 26:13.400
cycles apart. So in practice they tend to generate a lot of pointer arithmetic, where you don't

26:13.400 --> 26:19.400
want that. That's a problem. Because binary size is crucial, especially if you want to fit that

26:19.480 --> 26:27.080
in L1 cache and L1 cache is scarce. My solution is to compute a set of variables beforehand,

26:27.880 --> 26:35.560
and then reach all of the needed addresses, in gray, with only the offset addressing

26:35.560 --> 26:42.440
from Intel and ARM. And since Intel and ARM addressing modes differ, we are going to

26:42.440 --> 26:51.560
track them with macros. Intel is amazing on that part, I must admit it's really cool. Intel can

26:51.560 --> 26:59.400
add a register times 1, 2, 4, or 8, and an immediate. So in yellow, it makes all of the addresses

26:59.400 --> 27:07.240
reachable from one single address plus offset addressing. So with five variables,

27:07.240 --> 27:15.160
with just five precomputed variables, the three pointers plus two strides, we can reach

27:15.160 --> 27:22.280
all of the necessary addresses, which is a trick from FFmpeg, of course. On ARM, it's harder.

27:23.800 --> 27:30.920
ARM supports only an immediate or an offset from a register, not both. And worse, some instructions

27:30.920 --> 27:38.680
like load with duplication cannot offset. So we cannot offset without post-index,

27:38.680 --> 27:43.400
but we can't use that, so we don't have access to that. So sometimes we have to actually offset

27:43.400 --> 27:51.320
the pointers by one or two to the left. So here we have like 10 or so, I think 14, 14 variables,

27:51.320 --> 27:59.160
but ARM has 32 registers, so that's fine, of course. So I'm not really completely satisfied with that,

27:59.880 --> 28:06.920
but at least macroing the addresses keeps track of all the pointer uses in the code,

28:06.920 --> 28:14.920
so I may change that in the future. Fourth technique. I'm quite late, but that's fine.

28:15.880 --> 28:20.520
This is an architectural pattern. It's called the structure-of-arrays pattern.

28:21.560 --> 28:27.480
You usually see it as array of structures versus structure of arrays. This is an architectural

28:27.640 --> 28:33.240
pattern that is used in optimization to vectorize the operations on arrays of structures.

28:34.120 --> 28:39.800
So what you do is to convert arrays of structures into individual arrays for each of the fields.

28:41.560 --> 28:47.880
In the decoder, in edge, it's used for frame lists, for frame buffers, and the list of tasks,

28:47.880 --> 28:55.480
so for multi-threading. What you do typically is you explode your arrays into independent arrays.

28:56.120 --> 29:00.840
And in the code, it makes you replace loops on structures with vector operations.

29:01.640 --> 29:08.200
That's going to make your code both shorter and more compact, shorter in binary and also in code.

29:08.200 --> 29:14.600
It's going to make it more compact. And I've put three examples here. To find an empty frame buffer out of

29:16.280 --> 29:21.560
the masks of reference flags that I showed earlier. Reference flags, output flags. These are 32-bit masks.

29:21.560 --> 29:28.200
To find an empty frame buffer, you just count the trailing zeros on the combination of reference

29:28.200 --> 29:33.720
flags and output flags. So find me a frame that is not a reference and not waiting to be output.

29:34.760 --> 29:39.720
Then I combine them and then I count leading zeros to find the first empty frame buffer.

29:40.280 --> 29:45.000
To count the number of pending frames, the frames that have the output flag bit set,

29:45.400 --> 29:50.360
I just do a popcount. That's a one-liner, awesome. And then to loop through reference frames,

29:50.440 --> 29:53.320
it's a classical looping through set bits.
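
NOTE
Editor's sketch of the three bit-mask operations just listed, assuming that bit i of each 32-bit mask describes frame buffer i and that a free frame has both bits clear. FrameFlags and the function names are hypothetical, and the builtins assume GCC or Clang. This sketch picks the lowest free index with a count of trailing zeros.
```c
#include <assert.h>
#include <stdint.h>
/* Hypothetical decoder state: one bit per frame buffer in each mask. */
typedef struct {
    uint32_t reference_flags; /* frames still used as references */
    uint32_t output_flags;    /* frames waiting to be output */
} FrameFlags;
/* Index of the first frame that is neither a reference nor pending output, or -1. */
static int find_empty_frame(const FrameFlags *f) {
    uint32_t busy = f->reference_flags | f->output_flags;
    return ~busy ? __builtin_ctz(~busy) : -1;
}
/* Number of frames waiting to be output. */
static int count_pending_frames(const FrameFlags *f) {
    return __builtin_popcount(f->output_flags);
}
/* Visit every reference frame by looping through the set bits. */
static int sum_reference_indices(const FrameFlags *f) {
    int sum = 0;
    for (uint32_t m = f->reference_flags; m != 0; m &= m - 1)
        sum += __builtin_ctz(m); /* index of the lowest set bit */
    return sum;
}
```
Each query replaces a loop over an array of frame structures with one or two integer instructions.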

29:57.000 --> 30:04.440
This pattern has advantages that is, it makes you have less allocation and thus less

30:04.440 --> 30:11.640
memory fragmentation, fewer mallocs and so on. But it also gives you smaller code

30:11.640 --> 30:16.360
and binary overall. So that's much easier, especially when you want to work on your frame

30:16.360 --> 30:22.440
buffer and all these operations, which then when you work on big code, especially in this part

30:22.440 --> 30:26.360
on the frame buffer management, you want to have things that are short because it's usually very

30:26.360 --> 30:31.080
hard to work on. The problem with that is that the code is a bit less intuitive. You have to know

30:31.080 --> 30:36.440
binary operations, and it leads to cache pollution and false sharing in multi-threading. So you have to

30:36.440 --> 30:42.040
be careful in how you manage this. And five minutes left, that's perfect for the last technique.

30:43.000 --> 30:52.360
The last technique. Here is some code from FFmpeg, about error checking. So when you read

30:52.360 --> 30:58.200
symbols from memory, from your bytestream, these are the first two function

30:58.200 --> 31:04.120
calls. You're going to call a function, get_ue_golomb_31, then you get your value,

31:04.120 --> 31:11.160
and then you're going to test your value against conditions and then fail. Typically, error

31:11.320 --> 31:21.640
checking works with gotos or returns. You get your value, test whether it's right or wrong,

31:21.640 --> 31:29.240
and then fail with either goto or return. But the problem is that it needs code to bubble up the errors

31:29.240 --> 31:36.360
if your code is in sub-functions. It makes many exit paths that you may want to test if you

31:36.360 --> 31:43.320
want your code to be robust. And in the critical code, it acts as noise for the branch predictor,

31:43.320 --> 31:49.720
because these tests are going to be just passing tests most of the time. So better than having

31:49.720 --> 31:56.840
them passing all the time is not having them at all. In general, error reporting is a

31:56.840 --> 32:02.680
gray zone. We report errors where they appear in the code, but we never really know how to make

32:02.840 --> 32:08.280
them useful, how to make useful messages. So usually they are very factual, like I showed earlier,

32:08.280 --> 32:16.280
illegal bit depth value, that's very factual. But here we can actually separate the different users

32:16.280 --> 32:20.600
that are going to be interested in our code. First there are the viewers, the people who actually

32:20.600 --> 32:24.040
view your stream. They just don't care about error messages, they don't want to see them.

32:24.040 --> 32:30.760
Your decoder either works or it doesn't; if it doesn't work, well, reboot, restart,

32:30.840 --> 32:36.440
or go to another player. Media players are going to be interested if you're misusing their

32:36.440 --> 32:42.440
API. Otherwise, they don't really care about the message beneath, because it's not going to change

32:42.440 --> 32:47.720
that your stream is not going to work anyway. And the encoder developers, people who

32:47.720 --> 32:52.920
develop encoders and test their bitstream with your decoder, are going to be interested in everything,

32:52.920 --> 33:00.600
of course. Here I make a distinction, to simplify things, between a validator and a parser.

33:00.920 --> 33:07.560
A validator is going to check if a stream is valid, otherwise it's going to help you fix the

33:07.560 --> 33:14.440
stream. And the parser is going to try to decode the stream if it's valid; if it's not valid,

33:15.640 --> 33:24.200
run away crying, and that's fine. But here if we drop the responsibility to validate the stream

33:24.360 --> 33:31.240
and if we stick only to encoder users or media players, we can relax the error reporting

33:31.800 --> 33:36.360
to actually say we're not going to be interested in where we find errors, but whether we find an error.

33:40.760 --> 33:45.800
And on top of that, fortunately, checksums of the stream also ensure we actually rarely see

33:45.880 --> 33:51.960
a runaway stream. I hope I'm right, please correct me if I'm wrong. I'm not really a

33:51.960 --> 34:00.440
demuxer developer. And a good thing with AVC is that the variable-length

34:00.440 --> 34:09.240
codes are very sensitive to desynchronization. That is, on the first line we actually have three

34:09.240 --> 34:14.120
Exp-Golomb codes. And on the second line, if we discard just one bit, if we

34:14.200 --> 34:19.640
just discard one bit or turn it into a one, it's going to desynchronize and

34:19.640 --> 34:27.080
move the read pointer by a wild amount. So in the end, any error that appears in the stream,

34:27.800 --> 34:37.880
you're going to see it at the end of the stream by an offset in the read pointer. So what I do

34:38.760 --> 34:47.160
is for each input, I just clamp the values to the expected range. And I don't care anymore.

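NOTE
Editor's sketch of the clamping idea described above, under stated assumptions.
BitReader, read_bit, read_ue and read_ue_clamped are hypothetical names made up
for illustration; this is not code from edge264 or FFmpeg. The reader returns
zero bits past the end of the buffer, so the parsing code itself needs no error
branch, and each parsed value is forced into its expected range.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
// Hypothetical MSB-first bit reader; reads past the end return zero bits.
typedef struct { const uint8_t *buf; size_t size_bits, pos; } BitReader;
static unsigned read_bit(BitReader *r) {
    unsigned bit = 0;
    if (r->pos < r->size_bits)
        bit = (r->buf[r->pos >> 3] >> (7 - (r->pos & 7))) & 1;
    r->pos++;
    return bit;
}
// Unsigned Exp-Golomb code: N zero bits, a one bit, then N suffix bits.
// The zero count is capped at 31 so that garbage input cannot loop forever.
static uint32_t read_ue(BitReader *r) {
    int zeros = 0;
    while (zeros < 31 && read_bit(r) == 0)
        zeros++;
    uint32_t suffix = 0;
    for (int i = 0; i < zeros; i++)
        suffix = (suffix << 1) | read_bit(r);
    return ((uint32_t)1 << zeros) - 1 + suffix;
}
// Read a value and clamp it to [lo, hi] instead of failing when out of range.
static uint32_t read_ue_clamped(BitReader *r, uint32_t lo, uint32_t hi) {
    assert(lo <= hi);  // programmer error, not a stream error
    uint32_t v = read_ue(r);
    return v < lo ? lo : v > hi ? hi : v;
}
```
Whatever the stream contains, the caller always gets an in-range value, so code
that uses it as an array index or a loop bound stays safe without an early exit.
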
34:47.160 --> 34:52.120
I just clamp the value so that actually the rest of the code is going to be robust because I know

34:52.120 --> 34:58.280
that the value is still valid. And at the end, I just check for desynchronization of the

34:58.280 --> 35:03.960
bitstream pointer. And these are actually the values that I should expect in the MSBs and LSBs.

35:03.960 --> 35:10.440
It has to be a one on top, which is the RBSP trailing bit in AVC, if you know the spec.

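NOTE
Editor's sketch of the single end-of-parse check described above, under stated
assumptions. All names (BitReader, read_ue, ends_on_trailing_bits,
parse_toy_header) and the three-field toy header are hypothetical, not edge264
or FFmpeg code. After parsing, the remaining bits must be exactly one 1 bit
followed by zeros up to the end of the last byte; any other position means the
reader desynchronized somewhere earlier.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
// Hypothetical MSB-first bit reader; reads past the end return zero bits.
typedef struct { const uint8_t *buf; size_t size_bits, pos; } BitReader;
static unsigned read_bit(BitReader *r) {
    unsigned bit = 0;
    if (r->pos < r->size_bits)
        bit = (r->buf[r->pos >> 3] >> (7 - (r->pos & 7))) & 1;
    r->pos++;
    return bit;
}
// Unsigned Exp-Golomb code: N zero bits, a one bit, then N suffix bits.
static uint32_t read_ue(BitReader *r) {
    int zeros = 0;
    while (zeros < 31 && read_bit(r) == 0)
        zeros++;
    uint32_t suffix = 0;
    for (int i = 0; i < zeros; i++)
        suffix = (suffix << 1) | read_bit(r);
    return ((uint32_t)1 << zeros) - 1 + suffix;
}
// The bits left after parsing must be a single one bit followed only by
// zeros up to the end of the last byte (the RBSP trailing bits).
static int ends_on_trailing_bits(const BitReader *r) {
    if (r->pos >= r->size_bits || r->size_bits - r->pos > 8)
        return 0;
    BitReader tail = *r;
    if (read_bit(&tail) != 1)
        return 0;
    while (tail.pos < tail.size_bits)
        if (read_bit(&tail) != 0)
            return 0;
    return 1;
}
// Toy header of three clamped fields; the only error test is at the end.
static int parse_toy_header(const uint8_t *buf, size_t size, uint32_t out[3]) {
    assert(buf != NULL && out != NULL);
    BitReader r = {buf, size * 8, 0};
    for (int i = 0; i < 3; i++) {
        uint32_t v = read_ue(&r);
        out[i] = v > 7 ? 7 : v;  // clamp to an assumed range of [0, 7]
    }
    return ends_on_trailing_bits(&r);
}
```
With the bytes 0xA2 0xC0 the three fields decode and the check passes; flipping
the second bit (0xE2 0xC0) leaves the read pointer far from the end, so the
check fails. A toy check like this can still miss a corruption that happens to
resynchronize, which is why the talk also relies on checksums upstream.
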
35:12.120 --> 35:16.840
The impact of that is that there are no branches inside the parsing code anymore, except for refill.

35:17.720 --> 35:21.560
But the refill is inside the function, so it's not going to act as noise for the branch predictor.

35:22.680 --> 35:27.720
We check for errors and unsupported stuff once, at the end of the process.

35:28.040 --> 35:35.880
There are fewer exit paths to test and to care about. So it may make the code more robust to errors.

35:37.320 --> 35:44.360
And a good thing inside the slice parsing code is that the return values are not used anymore for errors.

35:44.360 --> 35:50.600
So you can use them to return something else, which is always good.

35:50.600 --> 35:58.440
Now the roadmap for release: these are the things that I have to work on this year. Hopefully MVC

35:58.440 --> 36:04.760
support for H264 is going to be added this year. I will have more content for one more presentation

36:04.760 --> 36:09.880
next year. It will hopefully be about optimizing CABAC, which is the arithmetic decoder.

36:09.880 --> 36:14.520
I have lots on that. It's going to be awesome. And optimization for caches, and profiling.

36:15.560 --> 36:19.800
With this, I would like to thank you for your attention. Here are some key resources that you may

36:19.800 --> 36:24.120
want to read if you're interested in working on that. And I'm also looking for jobs, so please

36:24.120 --> 36:29.400
reach me by mail if you want. If you have a job. Thank you.

36:36.040 --> 36:37.800
Questions?

36:38.680 --> 36:42.680
I got it. It was too intense.

36:45.160 --> 36:46.680
Yes?

36:46.680 --> 36:53.080
When you detect errors in the stream, you would just hope for the best.

36:53.080 --> 36:58.360
When I detect code in the non-destructive questions.

36:58.360 --> 37:05.240
Yes. So the question is, when I detect errors in the stream, do I do nothing or do I

37:05.800 --> 37:11.640
continue decoding? So there are two things here. First is to know where the error is

37:11.640 --> 37:21.560
in the stream. And the second thing is falling back to a clean image. So,

37:22.200 --> 37:31.000
when we parse, we don't detect errors. We only detect errors at the end. So for example, if we parse a

37:31.000 --> 37:38.120
header, the header might eventually decode to something that is just not valid, but we don't care,

37:38.120 --> 37:43.000
because we then check if there is an error. And then we know that the header that we

37:43.000 --> 37:48.600
decoded is wrong. I don't know if that answers your question.

37:48.600 --> 37:53.480
And the second thing is that for frames, if in the end, I finish decoding a frame and I

37:53.480 --> 37:59.480
realize that it went wrong, I actually go back over the entire frame, compute a probability for each

37:59.480 --> 38:08.120
block, each macroblock, that it contained the error. And then I reconstruct the pixels based on that

38:08.120 --> 38:17.960
probability. So I actually do a weighting between a predicted fallback and the thing

38:17.960 --> 38:26.200
that was decoded. And hopefully, if anyone from FFmpeg is here, I would really like to have an API

38:26.280 --> 38:31.720
that allows me to call FFmpeg and say: these pixels are wrong, please replace them. I don't know if

38:31.720 --> 38:39.960
there is one, hopefully. That's it. Yeah, concealment. Is there an API to actually tap into this

38:39.960 --> 38:51.400
concealment from outside? Okay. Yeah. I can do that from the codec itself, but the thing that I don't

38:51.400 --> 38:56.040
want is, so the question is, no, is it a question? That concealment is usually done from the codec

38:56.120 --> 39:01.160
itself. My problem is that good concealment, usually you look at the state of the art and you

39:01.160 --> 39:06.440
want to implement things like with image processing and complex image processing. I'm not going

39:06.440 --> 39:11.000
to want to implement that, especially since the state of the art is going to evolve. So I would

39:11.000 --> 39:17.080
really like to have something in FFmpeg that I can tap into to do error concealment.

39:17.080 --> 39:25.000
I don't know if I'm clear. I believe that the permissive of a big issue is to use a better

39:25.000 --> 39:31.400
default color for the frames that are empty. So when you corrupt the reference, you can hear it out

39:32.040 --> 39:38.280
as such, and you're right. Yeah. And you're not that, it was the less costly and

39:39.720 --> 39:45.720
visually affected. Yeah. But I know it from hardware hardware tells you which macro

39:45.720 --> 39:53.320
block art parameters and amount of corruption, and you get to decide if you copy a reference frame

39:53.320 --> 39:59.880
over that block, or you can do that. Okay, that's perfect. Thank you for saying that because

39:59.880 --> 40:04.520
I'm actually doing the same thing also. Yeah. Yeah.

40:04.520 --> 40:10.200
So as soon as those those starts all the way to the end when you start a point. For example,

40:10.200 --> 40:15.240
when you allocate a frame, when you start playing a frame, you take a look at the references.

40:15.240 --> 40:20.440
And the reference reference goes not exist. You'll make one up and note that's how the

40:20.440 --> 40:22.440
And that does.

40:22.440 --> 40:28.440
It usually takes the blind color to use as a missing reference.

40:28.440 --> 40:32.440
And it's only because it does the best critical.

40:32.440 --> 40:36.440
So it's very interlead with a way to go in.

40:36.440 --> 40:42.440
We have tried in the past actually recently over the past year

40:42.440 --> 40:46.440
to export much of the interaction

40:46.440 --> 40:50.440
and our Americans people call it the introation.

40:50.440 --> 40:53.440
But the look that's at the end.

40:53.440 --> 40:55.440
I think it's so built and chill.

40:55.440 --> 41:00.440
And so it's the fundamental defaulting that it's part of the track.

41:00.440 --> 41:03.440
Yeah, and I do the same actually.

41:03.440 --> 41:06.440
And that's because I actually looked at FFmpeg

41:06.440 --> 41:07.440
when I saw Scott.

41:07.440 --> 41:08.440
Let's do the same.

41:08.440 --> 41:10.440
Well, if you really think you can do a better job

41:10.440 --> 41:12.440
and you just want to replace individual pixels.

41:12.440 --> 41:14.440
You just do the filter.

41:14.440 --> 41:16.440
Yeah.

41:16.440 --> 41:21.440
Any other questions?

41:21.440 --> 41:23.440
We don't have something.

41:23.440 --> 41:24.440
Perfect.

41:24.440 --> 41:25.440
Thank you.