WEBVTT

00:00.000 --> 00:12.000
So, basically, I'm Franck Pachot, I'm a developer advocate for YugabyteDB.

00:12.000 --> 00:19.000
Daniël van Eeden, I'm a technical support engineer at PingCAP, which is the company behind TiDB.

00:19.000 --> 00:26.000
And we look at the different layers. We will start at the SQL layer.

00:26.000 --> 00:33.000
Yes, of course, at the SQL layer there's a big difference here, because there's both Postgres and MySQL.

00:33.000 --> 00:42.000
But another very important difference here is that we decided to be MySQL compatible, but not to use any of the MySQL source code.

00:42.000 --> 00:52.000
So, we implemented the protocol from scratch. The good thing about this is, like, we were able to do this in Go, which is a really nice modern language.

00:53.000 --> 01:02.000
The drawback of this whole thing is that you don't get any of the things implemented for free, because you have to re-implement all the parts of the protocol.

01:02.000 --> 01:13.000
And the advantage of reusing Postgres is that we have a lot of features that just come like this, but it needs significant changes, and that makes it a bit more difficult.

01:14.000 --> 01:23.000
So, that was about the SQL layer. Now about data distribution, because that's the goal of a distributed database.

01:23.000 --> 01:32.000
Yeah, so, with TiDB, we have TiKV, which is a CNCF project.

01:32.000 --> 01:37.000
So, there are multiple TiKV servers, on which we are actually storing the data.

01:37.000 --> 01:46.000
So, we do have to make sure that, like, the data gets distributed over all the servers, and not that, like, one server has all the data and the other server has no data.

01:46.000 --> 01:52.000
So, we need to make sure that the data is distributed, but also that there are, like, multiple copies being stored.

01:52.000 --> 01:59.000
To make sure that if one server goes down, that everything, like, still is able to continue to run.

02:00.000 --> 02:08.000
Which also means that, like, you need to take care of, like, availability zones, because you don't want to have all the copies in the same availability zone.

02:08.000 --> 02:22.000
And very similar on YugabyteDB. Basically, this is based on the Google Spanner paper, where the table rows and the index entries are distributed, sharded on their key.

02:22.000 --> 02:28.000
And in YugabyteDB, we have also range sharding, so ranges of keys, and hash sharding.

02:28.000 --> 02:39.000
When you have only equality queries, equality predicates on the key, then you can apply a hash function, but you cannot do that when you have ranges.
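
NOTE
Editor's sketch of the hash versus range trade-off described above, not code from either database.
NUM_SHARDS, RANGE_SPLITS and the function names are hypothetical, chosen only for illustration.
```python
import bisect
import hashlib
NUM_SHARDS = 4  # hypothetical shard count
RANGE_SPLITS = ["g", "n", "t"]  # hypothetical split points, giving 4 key ranges
def hash_shard(key: str) -> int:
    """Hash sharding: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
def range_shard(key: str) -> int:
    """Range sharding: the shard is the key range that contains the key."""
    return bisect.bisect_right(RANGE_SPLITS, key)
def shards_for_range(lo: str, hi: str, hashed: bool) -> list[int]:
    """Shards that a scan over keys lo..hi (inclusive) has to visit."""
    if hashed:
        return list(range(NUM_SHARDS))  # hashing scatters neighbouring keys
    return list(range(range_shard(lo), range_shard(hi) + 1))
```
With these assumptions an equality lookup touches one shard either way, while a scan from "h" to "m" touches one range shard but every hash shard.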

02:39.000 --> 02:47.000
I don't know. If you have more questions, just tell us; we cannot go into all details for all layers.

02:47.000 --> 02:59.000
So, if you have questions, tell us. Maybe I would just mention the name: when we distribute the table rows and the index entries, we call the shards tablets.

02:59.000 --> 03:05.000
And, basically, the idea is to store that in an LSM-tree in RocksDB.

03:05.000 --> 03:14.000
In a small volume, small enough that it can move from one server to another in case of any problem.

03:14.000 --> 03:22.000
Yeah, also with TiDB, we are splitting the data. We are not calling it a tablet, we call it a region, but basically it's the same thing.

03:22.000 --> 03:30.000
So, these are, like, parts of a table, like 100 megabytes, those are the things that are distributed over all the servers.

03:30.000 --> 03:41.000
And by doing this with a smaller volume than, like, a complete table, you can easily, like, move things in such a way that things are distributed equally.

03:41.000 --> 03:54.000
And then it can be redistributed. I think it's the case in both cases. You can add new nodes to your cluster, and then it will be rebalanced to have the whole cluster running your queries.

03:54.000 --> 04:06.000
Yeah, another thing to think about with data distribution is that if you have one table, there can be, like, say, there's YouTube videos in there.

04:06.000 --> 04:11.000
There might be some very popular ones, there might be some ones that are, like, old and no one really looks at them.

04:11.000 --> 04:21.000
So, you need to be able to, like, move things around in your cluster, that the data is distributed, but also that, like, your reads and your writes are distributed throughout the whole cluster.

04:21.000 --> 04:30.000
Then I have a question: do you distribute, then, not only on the size, but also on the number of read and write operations?

04:30.000 --> 04:34.000
we just continue to guide, we look on the other side.

04:34.000 --> 04:40.000
Yeah, we try to distribute based on, like, many of these things.

04:40.000 --> 04:52.000
And that's also, like, one thing there to take into account: with MySQL, you have an auto increment, which ends up always in the same data region.

04:52.000 --> 04:58.000
So, all the new writes to a table might go to the same place, to the same server.

04:58.000 --> 05:08.000
Usually, that's not much of a problem, but that's also why we created AUTO_RANDOM, where, instead of a unique ID that increments, you get a random ID,

05:08.000 --> 05:14.000
then we can distribute writes even more. Now, I'll run up to take the question.
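
NOTE
Editor's sketch of the idea behind a randomized ID: random high bits spread consecutive inserts over the key space.
The bit layout, SHARD_BITS and the class name are assumptions for illustration, not TiDB's implementation.
```python
import random
SHARD_BITS = 5  # assumed number of random high bits
INCR_BITS = 63 - SHARD_BITS  # remaining bits of a signed 64-bit integer
class RandomizedIds:
    """Hypothetical allocator: random high bits, incrementing low bits."""
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._next = 1
    def next_id(self) -> int:
        shard = self._rng.getrandbits(SHARD_BITS)  # picks a slice of the key space
        low = self._next  # keeps IDs unique within this allocator
        self._next += 1
        return (shard << INCR_BITS) | low
```
The IDs stay unique because of the incrementing low bits, but they are no longer sortable by insertion order.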

05:14.000 --> 05:37.000
So, the question is: in Yugabyte, do we move the whole RocksDB instance, or just a part of it? In Yugabyte, each tablet is a RocksDB, and we move the whole of it.

05:37.000 --> 05:47.000
So, the goal of the rebalancing and auto splitting is to keep each tablet between 10 gigabytes and 100 gigabytes; the exact number doesn't really matter,

05:47.000 --> 05:53.000
but the goal is not to have one terabyte to move when there is something to move.

05:53.000 --> 05:59.000
Okay, that was about the distribution.

05:59.000 --> 06:01.000
There's a question there.

06:01.000 --> 06:29.000
So, AUTO_RANDOM really gives you a random ID. So, that's not really sortable, by design.

06:30.000 --> 06:43.000
There's even another thing there, is that with the auto increment in a distributed system, like, we have the option to either have like a central service, giving out the IDs, make sure that they increment,

06:43.000 --> 06:56.000
but also, like, by default, we assign ranges of IDs to servers, and then they are given out, which means that from a client, you might see, like, different gaps of numbers,

06:56.000 --> 07:06.000
if the IDs are given out by different servers. So, the central one might be better for MySQL compatibility, but might not be the best for scalability.
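
NOTE
Editor's sketch of handing out ranges of IDs to servers, which explains the gaps a client can see.
IdRangeAllocator, SqlNode and the batch size are hypothetical names and values, not an API of either database.
```python
class IdRangeAllocator:
    """Central counter that hands out whole batches of IDs (hypothetical)."""
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self._next_start = 1
    def lease(self) -> range:
        start = self._next_start
        self._next_start += self.batch_size
        return range(start, start + self.batch_size)
class SqlNode:
    """A server that serves IDs from its leased batch without coordination."""
    def __init__(self, allocator: IdRangeAllocator):
        self._allocator = allocator
        self._ids = iter(allocator.lease())
    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self._allocator.lease())  # batch used up, lease a new one
            return next(self._ids)
```
Two nodes sharing an allocator with a batch size of 1000 would hand out 1, 1001, 2, 1002 when a client alternates between them.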

07:06.000 --> 07:20.000
So, with distributed systems, you sometimes have to adapt things a little bit to take real advantage of the resources that you have.

07:20.000 --> 07:30.000
Okay, so that was about how we shard, with ranges of keys that go somewhere in a tablet or in a region.

07:30.000 --> 07:44.000
And then, before storing it, we want to replicate it, because one goal in a distributed SQL database is to replicate the data in different data centers, different zones, so that if one is down everything continues.

07:44.000 --> 07:48.000
And then, what do we use to replicate?

07:48.000 --> 08:02.000
So, Raft is what's being used for data replication. So, this means that, like, if you're writing something, it will also send this write to, like, two other nodes,

08:02.000 --> 08:09.000
and then wait to get acknowledgments from the other nodes to make sure that, like, the data is always stored on multiple machines.

08:10.000 --> 08:29.000
Yeah, so the idea of Raft consensus is to get a consensus between multiple nodes, and the idea of Raft is that you have a leader and you have followers, which makes it easier, because the leader can synchronize many things, and in a SQL database, we have transactions, we have many things to synchronize.

08:29.000 --> 08:39.000
So, all the reads and writes go to the leader, then the writes are sent to the followers, and it waits for the quorum.

08:39.000 --> 08:53.000
So, when you look at it, it looks like a primary/standby database, except that this leader is only for a small part of the rows, because of the distribution on top of it.

08:53.000 --> 09:02.000
Question for both of you: are we going to assume that in a Raft group, all the nodes are necessarily on different physical nodes?

09:02.000 --> 09:14.000
Yes, these should all be in different availability zones. Well, one thing also to consider is that when we're talking about nodes, like, we have a data region, and the data region, that's a Raft group.

09:14.000 --> 09:25.000
So, if you have, like, three physical servers, and many, like, data regions, you have many, many Raft groups. It's not that a physical server has one Raft group.

09:25.000 --> 09:42.000
Typically, those deployments are very nice in the cloud, where it's easy to have three availability zones. You can also do multi-region, but then you think about the placement of your data, because you don't want to pay the multi-region latency too often.

09:43.000 --> 09:52.000
Yeah. So, my question is about Raft and consensus. You've been kind of vague about it.

09:52.000 --> 10:06.000
I think that when you write, you go through Raft consensus. So, if you write to multiple Raft groups, or, like, establishing the consensus and stuff, in some cases it's more tricky than that.

10:06.000 --> 10:16.000
So, there's the writing to a Raft group, and there's a whole consensus that happens with, like, leader election, et cetera, if the configuration of the group changes.

10:16.000 --> 10:27.000
So, if you're writing, you're writing to the leader, and then this is sent to the other nodes of the Raft group.

10:27.000 --> 10:36.000
And then they acknowledge back, and I think as soon as you have a majority, the write goes through.

10:36.000 --> 10:48.000
Yeah, the idea is that if the majority has acknowledged, then you know. So, the leader has acknowledged, and one of the two followers has acknowledged, if you have a replication factor of three.

10:48.000 --> 10:58.000
And then if a follower is down, you still have the leader with the latest value. If the leader is down, the two followers can talk together, and know which one has the latest value.

10:58.000 --> 11:14.000
They need to talk together to know which one, but then they can. So, that's the idea, which means that the writes involve the latency from the leader to the fastest follower.
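
NOTE
Editor's sketch of why a write only pays the latency to the fastest follower with a replication factor of three.
This is a simplified model of majority acknowledgment, not a Raft implementation; the function names are hypothetical.
```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1
def commit_latency_ms(follower_acks_ms: list[float]) -> float:
    """Time until a majority (leader included) has the write in its log.
    follower_acks_ms holds one round-trip time per follower; a follower
    that is down can be modelled as float("inf")."""
    rf = len(follower_acks_ms) + 1  # followers plus the leader
    needed_followers = quorum(rf) - 1  # the leader counts as one acknowledgment
    if needed_followers == 0:
        return 0.0
    return sorted(follower_acks_ms)[needed_followers - 1]
```
With followers answering in 2 ms and 40 ms the write is acknowledged after 2 ms; if the fast follower is down, it takes 40 ms.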

11:14.000 --> 11:24.000
Okay, so, on the successful path, it cannot be worse than synchronous replication? So, that's a very good question.

11:24.000 --> 11:31.000
So, the question is: when everything is up, then it's not worse than synchronous replication.

11:31.000 --> 11:38.000
Except that in synchronous replication, usually in traditional databases, you wait only at commit.

11:38.000 --> 11:47.000
In Yugabyte, we wait more often, because we want to be resilient: if one node is down, you can continue your transaction without rolling it back.

11:47.000 --> 11:54.000
So, we don't synchronize for each write; we batch the writes, and we synchronize when we need it.

11:54.000 --> 12:03.000
For example, if you have reads and writes, then the reads must wait until it is synchronized to go to the other.

12:03.000 --> 12:17.000
Yeah, and if you're doing, like, an update statement, you're probably not even dealing with, like, one Raft group, but with, like, 10 or 100, because there might be, like, multiple data regions that you need to change.

12:17.000 --> 12:27.000
Yeah, secondary indexes, and maybe checking some foreign keys, and yeah, another question.

12:27.000 --> 12:37.000
These are all Raft limitations. The distribution is mostly about durability of the message rather than applying it to the follower nodes.

12:37.000 --> 12:57.000
Is that the case in both cases as well? So that if I insert a row, the Raft protocol just negotiates the fact that I requested it, rather than actually waiting for it to be acknowledged and applied on the replicas.

12:57.000 --> 13:15.000
Yeah, I'm not completely sure of the exact details, but as far as I know, it needs to be, like, persisted. But also, because the leader knows about it, and it's the leader where we by default are, like, reading and writing,

13:15.000 --> 13:23.000
it's not like you're seeing outdated data, and you don't have to go to, like, multiple layers of something.

13:23.000 --> 13:40.000
For example, in YugabyteDB, when you write, it writes to the Raft log, the write-ahead log for the tablet on the leader, which sends it to the followers; the followers write to their log and acknowledge.

13:40.000 --> 13:51.000
And then when you have one that has acknowledged, so the majority has acknowledged, then you can acknowledge to the client, and the leader then writes to its memtable.

13:51.000 --> 14:04.000
So, it's really written after it is in the log of the majority.

14:04.000 --> 14:13.000
So, this is the key-value storage. One of the common components here is RocksDB.

14:13.000 --> 14:23.000
Both TiDB and YugabyteDB are using RocksDB as an LSM storage engine, which is a library.

14:23.000 --> 14:34.000
And for us with TiDB, it's TiKV that's actually using RocksDB to persist things on disk.

14:34.000 --> 14:44.000
I think in both cases, we're not using just normal RocksDB, but always with small additions, small changes to make the performance better.

14:44.000 --> 14:48.000
But it's still one of the common components.

14:48.000 --> 14:57.000
And of course, some modifications on RocksDB. RocksDB is very good at inserts, maybe a bit less at reads.

14:57.000 --> 15:05.000
And in a SQL database, you have a lot of reads. Even when you write, you have to read for the foreign key, duplicate key, all that.

15:05.000 --> 15:19.000
So, a lot was improved for reads. That's the case also for full table scans: you don't want a full table scan to just use all the cache and evict everything else.

15:19.000 --> 15:31.000
So, it's based on RocksDB, but with some modifications to make it efficient in this context.

15:31.000 --> 15:46.000
And RocksDB is also one of the things that you can actually use with MySQL, with the MyRocks plugin. So, that's an interesting thing if you're interested in LSM.

15:46.000 --> 15:55.000
OK, and then there are transactions, because in a SQL database, it's not as easy as just writing what you update.

15:55.000 --> 16:04.000
You need to do that within a transaction. And the big thing that is difficult is that when you read, you need to know the commit time.

16:04.000 --> 16:11.000
But when you write, you don't know the commit time. You need to come back and change the commit time.

16:11.000 --> 16:20.000
So, basically, for all the data, all the data gets a timestamp attached to it from the transaction, and all the transactions also have timestamp.

16:20.000 --> 16:29.000
So, based on this timestamp and based on the data, you know whether or not this transaction needs to be able to see this row.

16:29.000 --> 16:37.000
So, that's why it's really important that it's not just mostly correct, but that it is really correct within a very small timeframe.

16:38.000 --> 16:47.000
And in a distributed system, you have multiple nodes, and they all have their own clock, but here you need a time that is cluster wide.

16:47.000 --> 16:54.000
So, here we do things differently. You can explain how you do it in TiDB to get a timestamp that is cluster-wide.

16:54.000 --> 17:01.000
So, in TiDB, we have a timestamp oracle. This is one of the things of the PD role of the cluster.

17:01.000 --> 17:10.000
So, it's a single node that's also in a Raft group with two other nodes, but this one gives out timestamps.

17:10.000 --> 17:20.000
And the timestamps which are given out are the actual normal physical timestamp, like the one you get from a clock, with a logical increment.

17:20.000 --> 17:25.000
In case you need multiple timestamps for the same second.

17:25.000 --> 17:31.000
And because it's always given out by this one component, we can guarantee that the order is always the same.

17:31.000 --> 17:38.000
Like, anything that gets a timestamp later than the client before always gets a higher number.
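
NOTE
Editor's sketch of a timestamp oracle that combines a physical clock with a logical counter.
The millisecond resolution, the 18-bit counter width and the class name are assumptions; counter overflow is not handled.
```python
import time
LOGICAL_BITS = 18  # assumed width of the logical counter
class TimestampOracle:
    """Single allocator of strictly increasing timestamps (a sketch)."""
    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock  # physical time in milliseconds
        self._physical = 0
        self._logical = 0
    def next_ts(self) -> int:
        now = self._clock()
        if now > self._physical:
            self._physical, self._logical = now, 0
        else:
            self._logical += 1  # same or earlier physical time: bump the counter
        return (self._physical << LOGICAL_BITS) | self._logical
```
Because one component hands out every timestamp, a later request always gets a higher number, even if the physical clock stalls or steps back.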

17:38.000 --> 17:43.000
And for Yugabyte, the choice was different: it was not to rely on a single point of truth.

17:43.000 --> 17:57.000
So, the way is to rely on the clock, which can drift, but adding a Lamport clock to it: when messages are exchanged between two nodes, they update their clock to the latest one.

17:57.000 --> 18:07.000
So, it's a hybrid logical clock, where you have a physical component that comes from NTP, but may drift and may have a high skew.

18:07.000 --> 18:16.000
And an additional logical clock that doesn't synchronize all nodes at the same time, but at least those that exchange messages.
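
NOTE
Editor's sketch of a hybrid logical clock as published in the literature: a physical part plus a logical counter that is merged whenever nodes exchange messages.
The class and method names are hypothetical and this is not YugabyteDB's code.
```python
class HybridLogicalClock:
    """Hybrid logical clock: (physical, logical) pairs compared as tuples."""
    def __init__(self, physical_clock):
        self._now = physical_clock  # callable returning local physical time
        self.physical = 0
        self.logical = 0
    def tick(self) -> tuple[int, int]:
        """Timestamp a local event or an outgoing message."""
        prev = self.physical
        self.physical = max(prev, self._now())
        self.logical = self.logical + 1 if self.physical == prev else 0
        return (self.physical, self.logical)
    def observe(self, remote: tuple[int, int]) -> tuple[int, int]:
        """Merge the timestamp carried by an incoming message."""
        prev = self.physical
        r_phys, r_log = remote
        self.physical = max(prev, r_phys, self._now())
        if self.physical == prev and self.physical == r_phys:
            self.logical = max(self.logical, r_log) + 1
        elif self.physical == prev:
            self.logical += 1
        elif self.physical == r_phys:
            self.logical = r_log + 1
        else:
            self.logical = 0
        return (self.physical, self.logical)
```
A node whose physical clock lags still produces timestamps above those of any message it has received.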

18:16.000 --> 18:24.000
And to do that, and to guarantee consistency with that, you need to define the maximum clock skew.

18:24.000 --> 18:29.000
You know that you have a clock skew, but it cannot be more than 500 milliseconds.

18:29.000 --> 18:38.000
If you read something that was written less than 500 milliseconds ago, then you re-read it, because you are not sure whether it was done before or after.
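
NOTE
Editor's sketch of how a bounded clock skew leaves an uncertainty window around a read timestamp.
The 500 ms bound comes from the talk; the function name and the three outcomes are an assumed simplification.
```python
MAX_CLOCK_SKEW_MS = 500  # the bound mentioned in the talk
def classify_version(write_ts_ms: int, read_ts_ms: int) -> str:
    """Decide what a reader at read_ts_ms does with a version written at write_ts_ms.
    The two timestamps may come from different nodes, so they are only
    comparable up to the skew bound."""
    if write_ts_ms <= read_ts_ms:
        return "visible"  # written before the read point
    if write_ts_ms <= read_ts_ms + MAX_CLOCK_SKEW_MS:
        return "ambiguous"  # may be earlier in real time: read again at a later timestamp
    return "invisible"  # definitely written after the read point
```
Shrinking the skew bound shrinks the ambiguous window, which is why better clock synchronization reduces re-reads.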

18:38.000 --> 18:46.000
But more and more in the clouds, you have access to Precision Time Protocol or atomic clocks.

18:46.000 --> 18:57.000
And then you can reduce that. For example, on AWS, we can reduce this clock skew from 500 milliseconds to something very small, in microseconds.

18:57.000 --> 19:01.000
And both databases are based on the idea of Spanner.

19:01.000 --> 19:07.000
And Spanner itself requires, like, good-quality physical clocks.

19:07.000 --> 19:13.000
But if you're deploying on cloud, it might be available, like on AWS, but that's not guaranteed.

19:13.000 --> 19:23.000
If you want people to be able to run it in their own data center, you also cannot assume that the clock will be high quality and high precision.

19:23.000 --> 19:28.000
And you need to be 100% sure, or you have inconsistency. You have a question?

19:28.000 --> 19:38.000
Can I just ask about the timestamp oracle? Do you have problems when the leader changes?

19:38.000 --> 19:45.000
Because you know what I'm going to say.

19:45.000 --> 19:51.000
So with the placement driver, it's a Raft group of three nodes.

19:51.000 --> 20:11.000
Like, if the leader changes, then there might be, like, a very short time where the timestamp might be a problem, but I don't really see that, like, often causing issues, or at all basically.

20:11.000 --> 20:15.000
Sorry, I'm kind of new to distributed SQL.

20:15.000 --> 20:26.000
Is there like a centralized place where all the locations for your data are stored, or is that kind of split out?

20:26.000 --> 20:34.000
So I'm struggling with something you've said to, of course, how will you know what's going to be?

20:34.000 --> 20:44.000
That's a very good question. So for TiDB, you might have, like, nine nodes. You have a table of a few hundred megabytes.

20:44.000 --> 20:55.000
It's split up in chunks. And then, in a central place, we keep, like: well, this range of keys is stored on this machine,

20:55.000 --> 21:05.000
this range is stored there. So that's why we can, like, really do a quick lookup if we know, like, the primary key of a table.
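
NOTE
Editor's sketch of a central map from key ranges to machines and the quick lookup it allows.
RegionMap, the start keys and the store names are hypothetical.
```python
import bisect
class RegionMap:
    """Sorted map from region start key to the store holding that range."""
    def __init__(self, regions: dict[str, str]):
        if "" not in regions:
            raise ValueError("the first region must start at the empty key")
        self._starts = sorted(regions)  # region start keys in key order
        self._stores = [regions[s] for s in self._starts]
    def locate(self, key: str) -> str:
        """Find the region whose start key is the largest one not above key."""
        idx = bisect.bisect_right(self._starts, key) - 1
        return self._stores[idx]
```
A binary search over the start keys finds the machine for any primary key without scanning the list of ranges.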

21:05.000 --> 21:15.000
So if I can ask, is that just a document, a record of: this entry starts here and that entry starts there?

21:15.000 --> 21:27.000
So I'm not sure what Yugabyte does, but with TiKV,

21:27.000 --> 21:35.000
it's the data region that sends heartbeats with all kinds of metadata. This is part of this metadata.

21:35.000 --> 21:39.000
So if anything happens there, like it's easily rebuilt.

21:39.000 --> 21:49.000
And the other thing there is that in case, like, TiDB, the SQL layer, has the wrong information about what key is where,

21:49.000 --> 22:03.000
it doesn't really matter, because it might try to write to a node that is not a leader for, like, that region, but then it will just get an error, basically, refresh the metadata it has and try again.
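
NOTE
Editor's sketch of the retry loop described above: route with cached metadata, refresh it on a not-leader error, try again.
NotLeaderError, write_with_retry and the three callables are hypothetical, not a real client API.
```python
class NotLeaderError(Exception):
    """Raised by a store that is not the leader for the requested region."""
def write_with_retry(key, value, cache, refresh, send, max_attempts=3):
    """Send a write using cached routing; refresh the cache on NotLeaderError.
    cache is a dict of key to store, refresh(key) returns the current leader,
    send(store, key, value) performs the write."""
    for _ in range(max_attempts):
        store = cache.get(key) or refresh(key)
        cache[key] = store
        try:
            return send(store, key, value)
        except NotLeaderError:
            cache.pop(key, None)  # stale routing info: drop it and look it up again
    raise RuntimeError("no leader found after retries")
```
Stale metadata costs one extra round trip but never a wrong write, because only the leader accepts it.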

22:03.000 --> 22:06.000
So it's quite safe.

22:06.000 --> 22:28.000
The idea is to not have a single point of truth you depend on, so each tablet or each region is really an independent Raft group, sending heartbeats and trying to solve failures by itself.

22:28.000 --> 22:32.000
And we can move a bit more on transaction.

22:32.000 --> 22:34.000
If you have the time.

22:34.000 --> 22:42.000
A comment and a question regarding the leader election, and the time it takes for Raft to do an election.

22:42.000 --> 22:48.000
You've mentioned it, but I think it was more than the clock skew.

22:48.000 --> 23:00.000
Question: does it mean that every request to TiDB also includes another hop to the clock service to get a timestamp?

23:00.000 --> 23:04.000
In many cases, yes.

23:08.000 --> 23:14.000
I don't know the actual details, but about what I said, like, that there's only, like, one PD:

23:14.000 --> 23:17.000
of course there are two other PDs.

23:17.000 --> 23:22.000
I think you can like delegate a few things to there.

23:22.000 --> 23:28.000
And for some things, delegated timestamps might be good enough.

23:28.000 --> 23:32.000
But yeah, I'm a bit light on what the details are.

23:32.000 --> 23:38.000
This is typically a trade-off: having one timestamp oracle, or clock skew.

23:38.000 --> 23:44.000
And the last topic is about MVCC, multi-version concurrency control.

23:44.000 --> 23:57.000
So this is also based on the timestamp, not the same timestamp, but the idea is the same: if you update data, someone may need to read as of a previous read time.

23:57.000 --> 24:00.000
So you need to store all the versions.

24:00.000 --> 24:08.000
And it requires this timestamp, and it requires storage of all the versions for some time.

24:08.000 --> 24:11.000
And then, at some point, some garbage collection.

24:11.000 --> 24:16.000
Typically in YugabyteDB, we set the retention to 15 minutes.

24:16.000 --> 24:19.000
If you have long queries, you can increase it.

24:19.000 --> 24:26.000
But basically you can read as of any read point in the last 15 minutes.

24:26.000 --> 24:29.000
So your query is consistent.

24:29.000 --> 24:34.000
And garbage collection is done, because it's RocksDB, where there is compaction.

24:34.000 --> 24:41.000
So the removal of the old versions is built into the compaction process.
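
NOTE
Editor's sketch of MVCC reads as of a timestamp, plus garbage collection of versions older than a retention point.
MvccStore is a toy in-memory model with hypothetical names; real engines do this inside compaction.
```python
import bisect
class MvccStore:
    """Keeps every version of a key as (commit_ts, value), oldest first."""
    def __init__(self):
        self._versions = {}  # key mapped to a list of (commit_ts, value)
    def put(self, key, commit_ts, value):
        versions = self._versions.setdefault(key, [])
        versions.append((commit_ts, value))
        versions.sort(key=lambda v: v[0])
    def _newest_at(self, key, ts):
        """Index just past the newest version with commit_ts <= ts."""
        stamps = [v[0] for v in self._versions.get(key, [])]
        return bisect.bisect_right(stamps, ts)
    def read(self, key, read_ts):
        """Newest value committed at or before read_ts, or None."""
        idx = self._newest_at(key, read_ts)
        return self._versions[key][idx - 1][1] if idx else None
    def gc(self, key, safe_ts):
        """Drop versions that no read at or after safe_ts can need."""
        idx = self._newest_at(key, safe_ts)
        if idx > 1:
            del self._versions[key][: idx - 1]  # keep the newest version at or before safe_ts
```
After garbage collection, reads at or after the retention point see the same data, while older read points no longer do.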

24:41.000 --> 24:46.000
Yeah, and TiDB will allow you to see, like, old versions of a table.

24:46.000 --> 24:51.000
So you can just say like, I want the version of this table as it was like five minutes ago.

24:51.000 --> 24:54.000
It will give you the data of the table of course.

24:54.000 --> 24:57.000
But it will also give you the schema at that time.

24:57.000 --> 25:04.000
So if you dropped a column or added an index, you will see those changes as well, which is quite powerful.

25:04.000 --> 25:11.000
And the other powerful thing that we've done there is for when you drop a table by accident, which of course never happens.

25:11.000 --> 25:16.000
But we allow you to just do a FLASHBACK TABLE.

25:16.000 --> 25:23.000
And because it was not garbage collected yet, we just restore it to the last version that we had and you're back.

25:23.000 --> 25:34.000
And so that's one of the, like, big advantages of, like, a garbage-collected setup.

25:34.000 --> 25:39.000
No, for backups, we have a different way of doing this.

25:39.000 --> 25:46.000
So if we take a backup, all the TiKV storage nodes are writing their own backup.

25:46.000 --> 25:54.000
And then we rely on the MVCC as well to be able to have a consistent backup of the whole cluster.

25:54.000 --> 26:00.000
So it's a combination: you take snapshots, but the snapshots may be from different points in time on the different nodes.

26:00.000 --> 26:07.000
And this is where MVCC allows you to restore at a specific point in time for all nodes.

26:07.000 --> 26:12.000
Typically, in YugabyteDB, by default, we have 15 minutes retention for MVCC.

26:12.000 --> 26:19.000
When you do a snapshot every hour, it will be garbage collected only after one hour.

26:19.000 --> 26:27.000
And besides all the components that we talked about, there are so many other components that are shared between distributed databases.

26:27.000 --> 26:39.000
And not just Yugabyte and TiDB. It also features, for example, like, gRPC and many of the other things there.

26:39.000 --> 26:41.000
Any more questions?

26:43.000 --> 26:51.000
Anyone? How do you do the discovery of the storage nodes from the JETS?

26:51.000 --> 26:57.000
So for us, we have the placement driver, the placement driver knows of all the machines.

26:57.000 --> 27:02.000
So you connect to the placement driver, you get all the information, and that's how things work there.

27:02.000 --> 27:09.000
And something similar: there is a catalog, a map of the nodes and where the tablets are, which must be shared.

27:09.000 --> 27:18.000
Because we have something that we call master. I don't really like the name; it's more a master that orchestrates where the nodes are.

27:18.000 --> 27:27.000
It's a component that must be shared, so you do not want to rely on it for all operations to be scalable, but this mapping is shared.

27:27.000 --> 27:34.000
And we also store the Postgres catalog in it.

27:34.000 --> 27:44.000
So it's basically a two-phase commit that's being used.

27:44.000 --> 27:48.000
I think the same is true also for Yugabyte.

27:48.000 --> 27:51.000
We have an optimization.

28:03.000 --> 28:09.000
I think Yugabyte has the same one that if you're only hitting a single data region or tablet,

28:09.000 --> 28:13.000
then we don't do the whole two-phase commit or don't require you to do so.

28:13.000 --> 28:18.000
But for multi-shard transactions, you have a transaction table somewhere that is also distributed.

28:18.000 --> 28:20.000
That has the status of the transaction.

28:20.000 --> 28:26.000
So when you read, you need to know if it's committed or not from it.
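
NOTE
Editor's sketch of a transaction status table: writes on several shards stay provisional until one status entry flips to committed.
TxnStore is a toy single-process model with hypothetical names; it omits locking, timestamps and the prepare phase.
```python
class TxnStore:
    """Toy model: provisional writes plus a transaction status table."""
    def __init__(self):
        self.status = {}  # txn_id mapped to "pending", "committed" or "aborted"
        self.rows = {}  # key mapped to a list of (txn_id, value), oldest first
    def write(self, txn_id, key, value):
        self.status.setdefault(txn_id, "pending")
        self.rows.setdefault(key, []).append((txn_id, value))
    def commit(self, txn_id):
        self.status[txn_id] = "committed"  # one status flip covers every write of the transaction
    def read(self, key):
        """Newest value whose transaction is committed according to the status table."""
        for txn_id, value in reversed(self.rows.get(key, [])):
            if self.status.get(txn_id) == "committed":
                return value
        return None
```
A reader that finds a provisional write consults the status table to decide whether that write counts.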

28:27.000 --> 28:28.000
Okay, thank you.

28:28.000 --> 28:29.000
Thank you.

28:29.000 --> 28:32.000
Thank you.

28:40.000 --> 28:41.000
Here, Mike.

28:41.000 --> 28:42.000
Yep.