WEBVTT

00:00.000 --> 00:14.560
I'll talk about UDP and how it can make QUIC a little bit quicker — in particular, optimizing

00:14.560 --> 00:19.600
Firefox's HTTP/3 IO stack.

00:19.600 --> 00:26.200
Quick introduction about myself: I'm Max, a software engineer at Mozilla, working on the

00:26.200 --> 00:29.960
HTTP/3 and QUIC stack in Firefox.

00:29.960 --> 00:35.200
You can reach me by mail, but also here in the hallway track or anywhere

00:35.200 --> 00:36.200
around.

00:36.200 --> 00:37.200
Cool.

00:37.200 --> 00:42.600
mxinden pretty much everywhere — you'll find me on GitHub — and then you might have seen me here

00:42.600 --> 00:49.200
talking about peer-to-peer networking, or Kubernetes and Prometheus in a past life.

00:49.200 --> 00:53.200
Okay, so — what is Firefox?

00:53.200 --> 00:55.200
You don't have the time for this.

00:55.200 --> 00:57.200
What is QUIC?

00:57.200 --> 01:01.200
QUIC is a general-purpose transport protocol.

01:01.200 --> 01:10.200
It's running on top of UDP, and the big thing here is that QUIC comes with its own encryption,

01:10.200 --> 01:16.200
and it encrypts both the data itself and the metadata, so the protocol data.

01:16.200 --> 01:20.200
That is very powerful, and we'll go a little bit into that in a bit.

01:20.200 --> 01:26.200
It does connection establishment within one RTT, so you can send your first request after the first

01:26.200 --> 01:33.200
RTT, and then in ideal cases, even on consecutive connections, you can sometimes do zero RTT,

01:33.200 --> 01:38.200
which is wonderful, especially in a web context where latency really matters.

01:38.200 --> 01:43.200
It is stream-based, for those familiar with HTTP/2 streams, for example.

01:43.200 --> 01:49.200
It does not have the problem of head-of-line blocking here, as the stream mechanism is built into the

01:49.200 --> 01:52.200
network protocol itself, and not built on top of it.

01:52.200 --> 01:58.200
It has fancy features, like, for example, connection migration, so let's say you're at home,

01:58.200 --> 02:04.200
in your Wi-Fi, and you're going outside, and you're switching to your 5G,

02:04.200 --> 02:09.200
then the connection can migrate between those two networks.

02:09.200 --> 02:11.200
It's easy to evolve.

02:11.200 --> 02:18.200
One big thing here is that a lot of QUIC is encrypted, especially also the transport part,

02:18.200 --> 02:25.200
and those boxes in the middle cannot make assumptions about a lot of properties of the protocol,

02:25.200 --> 02:36.200
and thus cannot bake them into their processing, and thus it's easier to evolve the QUIC protocol at the two endpoints.

02:36.200 --> 02:43.200
And, relevant for this talk, it's often implemented in user space — on top of UDP, always on top of UDP,

02:43.200 --> 02:47.200
in most cases implemented in user space.

02:47.200 --> 02:55.200
There are kernel-space implementations — the folks at Microsoft do that — but yeah, we'll focus on user-space QUIC here.

02:55.200 --> 02:56.200
Cool.

02:56.200 --> 03:00.200
Let's put this a little bit in the larger picture of web protocols.

03:00.200 --> 03:04.200
Most of you are probably familiar with the HTTP semantics.

03:04.200 --> 03:08.200
Now, in the early days, that would go over HTTP/1.

03:08.200 --> 03:12.200
Optionally, that traffic would be encrypted with TLS/SSL.

03:12.200 --> 03:18.200
That would then be on top of TCP, and then, yeah, that would be handed down to your IP stack.

03:18.200 --> 03:31.200
Later on, we had HTTP/2 — encryption mandatory, with TLS 1.2 or 1.3 — running on top of TCP, and then again on top of IP.

03:31.200 --> 03:37.200
And now, the new thing: the semantics stay the same for your application.

03:37.200 --> 03:43.200
Under that, we have HTTP/3, and that is then running on top of QUIC,

03:43.200 --> 03:52.200
tightly integrated with TLS, on top of UDP this time, with IP underneath.

03:52.200 --> 03:54.200
Cool.

03:54.200 --> 03:57.200
So, why is all of this relevant?

03:57.200 --> 04:00.200
Well, QUIC is already powering a big chunk of the internet.

04:00.200 --> 04:03.200
Obviously, there are many perspectives.

04:03.200 --> 04:07.200
When it comes to the internet, one is, for example, Cloudflare Radar.

04:07.200 --> 04:11.200
They're roughly seeing one third of the traffic currently being QUIC.

04:11.200 --> 04:19.200
Firefox roughly sees 25% of its traffic being HTTP/3, and thus QUIC.

04:19.200 --> 04:24.200
But yeah, different perspectives on a very complex system.

04:25.200 --> 04:31.200
So, coming to the subject of the talk itself — QUIC in user space.

04:31.200 --> 04:38.200
unsurprisingly, if you run a transport protocol unoptimized in user space,

04:38.200 --> 04:44.200
this will not be as efficient as the heavily optimized TCP stack in the various kernels.

04:44.200 --> 04:51.200
Especially since TCP has been optimized all the way down to the NIC,

04:51.200 --> 04:53.200
the network interface card.

04:53.200 --> 04:56.200
So, yeah, very optimized stack, very hard to compete,

04:56.200 --> 04:59.200
especially when you're in user space.

04:59.200 --> 05:04.200
Now, in the unoptimized case, you would think that the user space

05:04.200 --> 05:10.200
QUIC implementation would do one syscall per UDP datagram.

05:10.200 --> 05:14.200
And if you think of the internet with an MTU,

05:14.200 --> 05:17.200
so maximum transmission unit of 1,500 bytes,

05:17.200 --> 05:22.200
that's one syscall for every 1,500 bytes, both receiving and sending.

05:22.200 --> 05:24.200
That's a lot of syscalls.
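
To put that in perspective — a back-of-the-envelope sketch of my own, not a figure from the talk — here is what one syscall per 1,500-byte datagram means for a 1 GiB transfer:

```python
# Back-of-the-envelope: syscalls needed for a 1 GiB transfer at one
# syscall per 1,500-byte datagram (the unoptimized user-space case).
transfer_bytes = 1 * 1024**3   # 1 GiB
mtu = 1500                     # bytes per UDP datagram

syscalls = -(-transfer_bytes // mtu)  # ceiling division
print(syscalls)  # 715828 syscalls just to move the payload one way
```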

05:24.200 --> 05:29.200
And then, just one other thing here is that in the worst case,

05:29.200 --> 05:32.200
you get one ACK for every second packet.

05:32.200 --> 05:36.200
So, again, a lot of syscalls to send and receive those ACKs.

05:36.200 --> 05:40.200
So, in the early days, folks at Google,

05:40.200 --> 05:42.200
I've posted the link here — the slides are online —

05:42.200 --> 05:44.200
so if you want to check out the paper.

05:44.200 --> 05:52.200
Folks at Google roughly estimated a 3.5x increase in terms of CPU cycles per byte

05:52.200 --> 05:57.200
when you compare QUIC to TCP in the unoptimized case.

05:57.200 --> 05:58.200
So, that's a lot.

05:58.200 --> 06:05.200
That's a lot of CPU cycles that you're wasting by using QUIC in that case.

06:05.200 --> 06:08.200
Cool.

06:08.200 --> 06:12.200
Yeah, I should go into this, since we'll be coming back to it later on.

06:13.200 --> 06:19.200
On the right side you see this diagram — this diagram will be coming up more often.

06:19.200 --> 06:22.200
The basic case: Firefox has a QUIC stack.

06:22.200 --> 06:25.200
That QUIC stack exchanges datagrams with the operating system,

06:25.200 --> 06:28.200
the operating system exchanges datagrams with the NIC,

06:28.200 --> 06:31.200
and then the NIC sends it out on the internet, right?

06:31.200 --> 06:36.200
Very simplified version of everything.

06:36.200 --> 06:41.200
How is this affecting applications like, for example, Firefox?

06:41.200 --> 06:46.200
I don't expect you to read any of this, just to give you a little bit of a ballpark.

06:46.200 --> 06:50.200
What you see here is the socket thread, the Firefox socket thread,

06:50.200 --> 06:54.200
Firefox drives all its I/O with a single socket thread.

06:54.200 --> 07:00.200
This socket thread here is doing a one gigabyte transfer on loopback,

07:00.200 --> 07:05.200
transferring data over quick.

07:05.200 --> 07:09.200
And the circled area down there, what you're seeing here,

07:09.200 --> 07:12.200
is Firefox just allocating a receive buffer,

07:12.200 --> 07:17.200
and then passing it down to the operating system, for the operating system to fill it with UDP data.

07:17.200 --> 07:23.200
Again, for perspective — that's a lot of CPU time in the unoptimized case.

07:23.200 --> 07:26.200
So how can we do better than this?

07:26.200 --> 07:30.200
The holy grail in this is segmentation offloading,

07:30.200 --> 07:34.200
and most of you probably know this from the TCP world.

07:34.200 --> 07:39.200
Linux and Windows both support this also on UDP.

07:39.200 --> 07:47.200
The idea is that instead of sending one small — 1,500 bytes in the ideal case — datagram

07:47.200 --> 07:53.200
down to the operating system, how about I give the operating system a very large one,

07:53.200 --> 07:57.200
and then tell the operating system where it should divide them later on.

07:57.200 --> 07:59.200
So where to segment them.

07:59.200 --> 08:03.200
And then in the ideal case again, Firefox in its quick stack,

08:03.200 --> 08:06.200
passes down the very large datagram to the operating system,

08:06.200 --> 08:08.200
the operating system to the NIC.

08:08.200 --> 08:13.200
Then the NIC segments it into separate datagrams,

08:13.200 --> 08:19.200
puts the headers in front of them, and then sends them out on the internet.

08:19.200 --> 08:23.200
Yeah. Linux and Windows support this.
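
A minimal sketch of what UDP segmentation offload looks like at the socket layer — in Python for brevity (Firefox's actual IO code is Rust, via quinn-udp). The `UDP_SEGMENT` value (103) is the Linux constant from `linux/udp.h`; the Linux guard and kernel-support fallback are assumptions, since the option only exists on kernel 4.18+:

```python
import socket
import sys

# UDP_SEGMENT is the Linux-only socket option for UDP GSO
# (value taken from linux/udp.h; kernel 4.18+ only).
UDP_SEGMENT = 103

payload = b"x" * 4500        # one large "super-datagram"
segment_size = 1500          # where the kernel/NIC should split it
num_segments = -(-len(payload) // segment_size)  # 3 wire datagrams

if sys.platform == "linux":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # One sendmsg() syscall; the control message tells the stack
        # to segment the payload into three 1,500-byte datagrams.
        sock.sendmsg(
            [payload],
            [(socket.IPPROTO_UDP, UDP_SEGMENT,
              segment_size.to_bytes(2, sys.byteorder))],
            0,
            ("127.0.0.1", 9999),
        )
    except OSError:
        pass  # kernel without UDP GSO support
    finally:
        sock.close()
```

One syscall instead of three — and with larger super-datagrams the ratio gets much better.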

08:23.200 --> 08:28.200
We have GSO — and actually also GRO,

08:28.200 --> 08:33.200
though there are some caveats to that — in Firefox Nightly today,

08:33.200 --> 08:36.200
and looking at the metrics in the wild,

08:36.200 --> 08:42.200
we see at the 75th percentile two or more packets being read.

08:42.200 --> 08:45.200
So that's already powerful: when you read

08:45.200 --> 08:49.200
two instead of one, the overhead of the syscall

08:49.200 --> 08:51.200
basically diminishes by 50%.

08:51.200 --> 08:56.200
And then at the 95th percentile we see 10 or more packets being read.

08:56.200 --> 09:01.200
So on large, high-throughput transfers,

09:01.200 --> 09:04.200
this is already giving you a lot of benefit here.

09:04.200 --> 09:09.200
If you compare this — oftentimes, as

09:09.200 --> 09:12.200
packet trains move through the internet,

09:12.200 --> 09:16.200
ideally they arrive together at your NIC on the receiver side.

09:16.200 --> 09:19.200
And yeah, a lot of CDNs do GSO,

09:19.200 --> 09:22.200
so segmented send of 10 packets,

09:22.200 --> 09:29.200
and so it's unsurprising that we see the number 10 here in our 95th percentile.

09:29.200 --> 09:36.200
Yeah, in terms of kilobytes, it's 2.4 at the 75th percentile,

09:36.200 --> 09:43.200
and then yeah, in the 95th again, we're reading quite some large chunks here.

09:43.200 --> 09:49.200
This all gets us close to a gigabit on benchmarks,

09:50.200 --> 09:59.200
which is, I think, quite nice for a browser running on low-powered hardware.

09:59.200 --> 10:03.200
So when we can't use the holy grail of segmentation offloading,

10:03.200 --> 10:07.200
we fall back to multi-message syscalls.

10:07.200 --> 10:10.200
And the idea here, again, depicted on the right,

10:10.200 --> 10:14.200
instead of sending one small datagram, we send multiple small datagrams

10:14.200 --> 10:18.200
down to the operating system, and then to the NIC and so on.

10:18.200 --> 10:22.200
We do this on macOS.

10:22.200 --> 10:26.200
There are the sendmsg_x and recvmsg_x calls,

10:26.200 --> 10:30.200
but they're not officially supported —

10:30.200 --> 10:33.200
but they are there and they do work.

10:33.200 --> 10:36.200
It's similar to what Linux offers with the sendmmsg

10:36.200 --> 10:38.200
and recvmmsg —

10:38.200 --> 10:39.200
"many messages" — syscalls.

10:39.200 --> 10:44.200
But on Linux, we have the fancier segmentation offloading.

10:44.200 --> 10:49.200
Combining the two has not been shown fruitful so far,

10:49.200 --> 10:53.200
so that's why we only do segmentation offloading on Linux,

10:53.200 --> 10:56.200
for example, and multi-message on macOS.

10:56.200 --> 11:00.200
And here again, if you are CPU-bound,

11:00.200 --> 11:02.200
so on a CPU bound benchmark,

11:02.200 --> 11:05.200
we see roughly an 11% performance improvement,

11:05.200 --> 11:10.200
just by using these multi-message receive calls, on our QUIC throughput benchmark.

11:10.200 --> 11:17.200
So that's quite massive; it really shows that those syscalls are very expensive for us.

11:17.200 --> 11:19.200
Okay.

11:19.200 --> 11:22.200
Another optimization is PLPMTUD —

11:22.200 --> 11:27.200
that's the packetization layer path MTU discovery for datagram transports,

11:27.200 --> 11:28.200
for those not familiar with it.

11:28.200 --> 11:30.200
And MTU is also an abbreviation,

11:30.200 --> 11:32.200
so it's actually the packetization layer path,

11:32.200 --> 11:36.200
maximum transmission unit discovery for datagram transports.

11:37.200 --> 11:40.200
In short, it's RC889.

11:40.200 --> 11:47.200
The idea is, let's see whether our path can send larger datagrams,

11:47.200 --> 11:50.200
and let's try it out, basically,

11:50.200 --> 11:52.200
and fall back to smaller ones if we can't.

11:52.200 --> 11:56.200
So for example, if you're tunneling through a VPN,

11:56.200 --> 11:58.200
your MTU will be smaller, right?

11:58.200 --> 12:00.200
But if you're not tunneling through a VPN,

12:00.200 --> 12:02.200
you can send larger datagrams.

12:02.200 --> 12:05.200
So ideally, we support the VPN case:

12:05.200 --> 12:08.200
we start small and then ramp up eventually.

12:08.200 --> 12:10.200
So here, again, the picture on the right,

12:10.200 --> 12:13.200
we have a datagram that is just slightly larger,

12:13.200 --> 12:16.200
and we pass that to the OS and so on.

12:16.200 --> 12:18.200
I think you get the idea.

12:18.200 --> 12:23.200
We hope, at least, that we get like a 10% improvement

12:23.200 --> 12:26.200
in terms of the amount of bytes

12:26.200 --> 12:27.200
we can send per datagram,

12:27.200 --> 12:29.200
and then, again, the overhead per datagram

12:29.200 --> 12:32.200
is significant, as introduced earlier.
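
The probing idea behind PLPMTUD can be sketched in a few lines. This is a deliberately simplified simulation, not Firefox's actual algorithm — `probe_plpmtu`, the step size, and the simulated `path_mtu` are all made up for illustration:

```python
# Simplified PLPMTUD-style probing: start at a safe size and probe
# upward, keeping the largest size the path acknowledged.
def probe_plpmtu(path_mtu, start=1200, ceiling=1500, step=50):
    """Return the largest datagram size <= ceiling the path accepts.

    `path_mtu` stands in for the real network: a probe "succeeds"
    iff it fits the path. 1200 is QUIC's minimum datagram size.
    """
    plpmtu = start
    size = start + step
    while size <= ceiling:
        if size <= path_mtu:   # probe of this size was acknowledged
            plpmtu = size
            size += step
        else:                  # probe lost: stop raising
            break
    return plpmtu

print(probe_plpmtu(path_mtu=1380))  # e.g. a VPN path -> 1350
print(probe_plpmtu(path_mtu=1500))  # plain Ethernet  -> 1500
```

A real implementation also has to handle probe loss vs. congestion loss and re-probe when the path changes, which this sketch ignores.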

12:33.200 --> 12:37.200
Yep, again, links — those slides are online.

12:39.200 --> 12:42.200
Then, ACK frequency. Early on, I said

12:42.200 --> 12:46.200
that we do a lot of syscalls,

12:46.200 --> 12:49.200
and a lot of those syscalls are for ACKs.

12:49.200 --> 12:52.200
There is a draft at the ITF.

12:52.200 --> 12:54.200
It's called a quick acknowledgement frequency.

12:54.200 --> 12:58.200
The idea is, let's not send so many X,

12:58.200 --> 13:00.200
but obviously in a controlled way,

13:00.200 --> 13:02.200
so our congestion controller, for example,

13:02.200 --> 13:04.200
doesn't suffer from it.

13:04.200 --> 13:07.200
And then a contrived example,

13:07.200 --> 13:10.200
just to give you an idea of where the problem is.

13:10.200 --> 13:13.200
Let's say you have a one-gigabit-per-second transfer.

13:13.200 --> 13:16.200
Let's say you want to convert that to bytes,

13:16.200 --> 13:17.200
so divide by eight.

13:17.200 --> 13:19.200
We have an MTU of 1,500,

13:19.200 --> 13:23.200
divide by that, and divide by two because you ACK every second packet.

13:23.200 --> 13:26.200
So that gives you 40k ACKs per second.

13:26.200 --> 13:29.200
And in the worst case, you read every single one of those ACKs

13:29.200 --> 13:31.200
with a single syscall each, right?

13:31.200 --> 13:34.200
So that's 40k syscalls per second,

13:34.200 --> 13:36.200
just for the acknowledgement.

13:36.200 --> 13:38.200
That's not even for the data itself, right?
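
The contrived example, worked out in code (same arithmetic as in the talk; the exact per-step values are mine):

```python
# 1 Gbit/s transfer, MTU 1500, one ACK per two packets,
# one read syscall per ACK on the sender in the worst case.
bits_per_second = 1_000_000_000
bytes_per_second = bits_per_second // 8          # 125,000,000 B/s
packets_per_second = bytes_per_second // 1500    # ~83,333 packets/s
acks_per_second = packets_per_second // 2        # ~41,666 ACKs/s

print(acks_per_second)  # roughly the "40k syscalls/s" from the talk
```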

13:38.200 --> 13:40.200
Now, that's the worst-case scenario;

13:40.200 --> 13:44.200
QUIC implementations will optimize that by default,

13:44.200 --> 13:48.200
but then there is, coming up, the QUIC Acknowledgement

13:48.200 --> 13:50.200
Frequency draft,

13:50.200 --> 13:55.200
which will allow the sender to propose an ACK rate to the receiver —

13:55.200 --> 13:58.200
this way, basically, the receiver sending fewer ACKs

13:58.200 --> 14:02.200
and then the sender having to receive fewer ACKs.

14:02.200 --> 14:04.200
So that's coming.

14:04.200 --> 14:06.200
A couple of additional wins,

14:06.200 --> 14:11.200
that are not directly tied to performance.

14:11.200 --> 14:14.200
As part of this project in Firefox,

14:14.200 --> 14:18.200
to optimize the IO path — the UDP IO path —

14:18.200 --> 14:22.200
we've been looking around and, instead of reinventing the wheel,

14:22.200 --> 14:27.200
we are using quinn-udp for the UDP syscalls there.

14:27.200 --> 14:31.200
And quinn-udp is actually part of the Quinn project.

14:31.200 --> 14:35.200
Quinn is a different Rust QUIC implementation.

14:35.200 --> 14:38.200
And so we're collaborating with them,

14:38.200 --> 14:42.200
and yeah, basing everything on their code here on the IO path.

14:42.200 --> 14:43.200
So that's very nice.

14:43.200 --> 14:46.200
And now Firefox's QUIC stack, already in Rust,

14:46.200 --> 14:49.200
does all of its IO also in Rust.

14:49.200 --> 14:51.200
So in a memory-safe language.

14:51.200 --> 14:55.200
In addition, using all these modern syscalls,

14:55.200 --> 14:58.200
we can now get more metadata

14:58.200 --> 15:01.200
when we send and receive UDP datagrams.

15:01.200 --> 15:04.200
And one very important one here is ECN,

15:04.200 --> 15:08.200
explicit congestion notification.

15:08.200 --> 15:11.200
I'm not going to introduce that in depth today, for those

15:11.200 --> 15:13.200
that are already familiar with it.

15:13.200 --> 15:17.200
In the wild — so, Firefox Nightly already does mark

15:17.200 --> 15:19.200
and read ECN —

15:19.200 --> 15:24.200
We see roughly 50% of paths being ECN capable,

15:25.200 --> 15:28.200
which is very promising and great for us.

15:28.200 --> 15:30.200
And in the 75th percentile,

15:30.200 --> 15:36.200
we see roughly 0.6% of packets being marked.

15:36.200 --> 15:41.200
So, obviously, congestion in itself is not great,

15:41.200 --> 15:46.200
but packets being marked shows us that the boxes on our path

15:46.200 --> 15:50.200
are actually able to manage their queues with ECN.
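
For those who want the concrete bits: ECN lives in the two low-order bits of the IP TOS / Traffic Class byte (RFC 3168 codepoints). A tiny decoder, for illustration only — the function name is mine:

```python
# RFC 3168 ECN codepoints, carried in the low two bits of the
# IP TOS (IPv4) / Traffic Class (IPv6) byte.
ECN_NAMES = {
    0b00: "Not-ECT",  # endpoint not ECN-capable
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced: a queue marked us
}

def ecn_codepoint(tos: int) -> str:
    """Decode the ECN field from a TOS/Traffic Class byte."""
    return ECN_NAMES[tos & 0b11]

print(ecn_codepoint(0x02))  # ECT(0): sender signalled ECN capability
print(ecn_codepoint(0x03))  # CE: counted as a "marked" packet
```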

15:51.200 --> 15:54.200
And then lastly,

15:54.200 --> 15:58.200
we have been working a bunch on our memory management.

15:58.200 --> 16:00.200
I showed you earlier.

16:00.200 --> 16:03.200
There is a big chunk in our CPU flame graphs

16:03.200 --> 16:06.200
just of allocating memory in order to receive something.

16:06.200 --> 16:10.200
And what we're doing now is we have a single 64K buffer

16:10.200 --> 16:14.200
for the entire Firefox process for all quick connections.

16:14.200 --> 16:17.200
That's allocated once on the first connection,

16:18.200 --> 16:23.200
and then used for receiving throughout the entire lifetime of the Firefox process.

16:23.200 --> 16:28.200
So this entire chunk of CPU that you saw earlier in the profile is gone.

16:28.200 --> 16:31.200
What is nice is that Rust's borrow checker

16:31.200 --> 16:37.200
gives us a certain check of our memory reuse at compile time.

16:37.200 --> 16:43.200
And yeah, it does show a significant CPU time reduction.
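
The reuse pattern itself is simple; here it is sketched in Python for brevity (Firefox does this in Rust, where the borrow checker polices the reuse — socket names and payloads below are made up):

```python
# One long-lived receive buffer, handed to the OS for every read,
# instead of allocating a fresh buffer per datagram.
import socket

recv_buf = bytearray(64 * 1024)   # single 64 KiB buffer, allocated once

a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b.bind(("127.0.0.1", 0))
b.settimeout(2)

for payload in (b"first datagram", b"second datagram"):
    a.sendto(payload, b.getsockname())
    n = b.recv_into(recv_buf)     # OS fills the same memory each time
    print(recv_buf[:n].decode())

a.close()
b.close()
```

No per-read allocation: `recv_into` writes into the existing buffer, which is the same trick as Firefox's process-wide 64K receive buffer.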

16:43.200 --> 16:45.200
So what's next?

16:45.200 --> 16:47.200
Well, we have one more QUIC talk

16:47.200 --> 16:49.200
here later today,

16:49.200 --> 16:53.200
and then Lars is giving another talk in the Mozilla devroom,

16:53.200 --> 16:57.200
for those interested who want to learn more about QUIC.

16:57.200 --> 17:00.200
In terms of what's next for Firefox,

17:00.200 --> 17:03.200
we're rolling out PMTUD.

17:03.200 --> 17:06.200
Then ECN is already in Firefox nightly,

17:06.200 --> 17:10.200
and hopefully it's going to make it into Beta and Release soon.

17:10.200 --> 17:14.200
And then we're looking into the ACK frequency draft,

17:14.200 --> 17:16.200
to get our current implementation,

17:16.200 --> 17:20.200
which is on an older version of the draft, up to date.

17:20.200 --> 17:24.200
The optimizations I talked about today,

17:24.200 --> 17:27.200
I introduced in a generic way, for both send and receive.

17:27.200 --> 17:30.200
We have mostly focused on the receive path so far.

17:30.200 --> 17:34.200
That's unsurprising — the browser mostly downloads things.

17:34.200 --> 17:37.200
But long-term, we would also like to optimize the send path,

17:37.200 --> 17:43.200
so introduce GSO, USO (which is the Windows equivalent), sendmsg_x, and so on,

17:43.200 --> 17:46.200
and also ideally have a long-lived send buffer

17:46.200 --> 17:49.200
to reduce those memory allocations.

17:49.200 --> 17:51.200
Other things around congestion controllers —

17:51.200 --> 17:54.200
like, for example, HyStart for CUBIC would be nice,

17:54.200 --> 17:57.200
and various other things.

17:57.200 --> 18:01.200
So in case you want to get involved,

18:01.200 --> 18:05.200
as I said, a lot of this is already in Firefox nightly,

18:05.200 --> 18:09.200
so check out Firefox nightly, and you're already running those optimizations.

18:09.200 --> 18:13.200
Some of them are already in Beta and Release.

18:13.200 --> 18:15.200
Check that out.

18:15.200 --> 18:20.200
Then, in addition, maybe I can convince some folks here:

18:20.200 --> 18:24.200
the QUIC implementation of Firefox is on GitHub.

18:24.200 --> 18:26.200
It's entirely written in Rust.

18:26.200 --> 18:30.200
And so if you want to help make a modern transport protocol faster,

18:30.200 --> 18:33.200
and thus also make a browser faster,

18:33.200 --> 18:37.200
come over on GitHub and talk to us.

18:37.200 --> 18:42.200
There's also a matrix room, and you can reach out to me.

18:42.200 --> 18:45.200
That's all from my end.

18:45.200 --> 18:47.200
Thank you very much.

