MVCC Principles and the HBase Implementation

The simplest way to keep reads consistent is to make all readers wait until the writer is done, which is known as a read-write lock. Locks are known to create contention, especially between long read transactions and update transactions. MVCC aims at solving the problem by keeping multiple copies of each data item. In this way, each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other users of the database until the changes have been completed (or, in database terms, until the transaction has been committed). In other words: in a concurrent environment, the straightforward way to guarantee read-write consistency is a read-write lock, but performance suffers because every operation has to compete for the lock. MVCC was proposed to avoid this: writes create new versions, and reads become snapshot reads (timestamp sources such as HLC, TrueTime, and TSO play a big role here).
MVCC uses timestamps (TS) and incrementing transaction IDs to achieve transactional consistency. MVCC ensures a transaction (T) never has to wait to Read a database object (P) by maintaining several versions of the object. Each version of object P has both a Read Timestamp (RTS) and a Write Timestamp (WTS), which lets a particular transaction Ti read the most recent version of the object which precedes the transaction's Read Timestamp RTS(Ti). (The design idea here is to use a timestamp for data time and an incrementing transaction ID for transaction time; ordering in MVCC is determined by the transaction ID. Each version carries an RTS and a WTS, and a transaction T can read data whose TS <= RTS(T). The main benefit is that reads never have to wait for writes. HBase's design is a simplification of this scheme.)
If transaction Ti wants to Write to object P, and there is also another transaction Tk happening to the same object, the Read Timestamp RTS(Ti) must precede the Read Timestamp RTS(Tk), i.e., RTS(Ti) < RTS(Tk), for the object Write Operation (WTS) to succeed. A Write cannot complete if there are other outstanding transactions with an earlier Read Timestamp (RTS) to the same object. Like standing in line at the store, you cannot complete your checkout transaction until those in front of you have completed theirs. (That is, the write by Ti cannot complete while there is a transaction Tx with an earlier read timestamp, RTS(Tx) < RTS(Ti).)
To restate: every object (P) has a Timestamp (TS); however, if transaction Ti wants to Write to an object and the transaction has a Timestamp (TS) that is earlier than the object's current Read Timestamp, TS(Ti) < RTS(P), then the transaction is aborted and restarted. (This is because a later transaction already depends on the old value.) Otherwise, Ti creates a new version of object P and sets the read/write timestamp TS of the new version to the timestamp of the transaction, TS = TS(Ti). (That is, a write transaction with TS(Ti) < RTS(P) is aborted and restarted.)
The drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs. (The main goal of MVCC is that reads are never blocked.)
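The timestamp rules above can be sketched in a few lines of Python. This is only a toy, single-threaded model of timestamp-ordered MVCC for one object P; the class and method names (`MVObject`, `Version`, `TransactionAborted`) are made up for illustration and do not come from any real database.

```python
from dataclasses import dataclass, field
from typing import Any, List


class TransactionAborted(Exception):
    """Raised when a write arrives too late and the transaction must restart."""


@dataclass
class Version:
    value: Any
    wts: int  # timestamp of the transaction that wrote this version
    rts: int  # largest timestamp of any transaction that has read it


@dataclass
class MVObject:
    # An initial empty version exists at timestamp 0 (an assumption of this sketch).
    versions: List[Version] = field(default_factory=lambda: [Version(None, 0, 0)])

    def _visible(self, ts: int) -> Version:
        # Most recent version whose write timestamp does not exceed ts.
        return max((v for v in self.versions if v.wts <= ts), key=lambda v: v.wts)

    def read(self, ts: int) -> Any:
        v = self._visible(ts)
        v.rts = max(v.rts, ts)  # record that a transaction at ts depends on v
        return v.value

    def write(self, ts: int, value: Any) -> None:
        v = self._visible(ts)
        if ts < v.rts:
            # A later transaction already read the old value: abort and restart.
            raise TransactionAborted(f"TS {ts} < RTS {v.rts}")
        if v.wts == ts:
            v.value = value  # the same transaction overwrites its own version
        else:
            self.versions.append(Version(value, ts, ts))
```

A read at timestamp 7 of a version written at 5 raises that version's RTS to 7, so a later attempt to write at timestamp 6 is aborted, while a write at timestamp 8 simply adds a new version and leaves the reader at 7 undisturbed.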
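HBase MVCC design:
Each MVCC instance belongs to exactly one region and governs the writes going into that region's memstore. On openRegion, the next sequence id is used as the MVCC's initial writeNumber. The sequence id is an auto-incrementing number maintained per store and persisted in the HFiles, so for the MVCC that belongs to a given region, the MVCC version only ever increases.
Question 1: Does writeNumber have a concrete meaning, how does HBase handle data timestamps versus version numbers, and how is readNumber used?
Physically, writeNumber and readNumber are both sequenceIds. When data is written to the memstore, every cell is stamped with a sequenceId obtained from writeEntry.getWriteNumber().
Question 2: How are multiple versions handled? Is expired version data deleted, or something else?
HBase itself is multi-versioned; expired versions are removed by compaction.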
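As a rough illustration of what "removed by compaction" means, the sketch below keeps only the newest N versions of each column and drops the rest. It is a simplified model, not HBase's compaction code: the tuple layout, the `compact` function, and the `max_versions` parameter are assumptions made for this example (tombstones, TTL, and file merging are ignored).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, str, int, str]  # (row, column, timestamp, value)


def compact(cells: List[Cell], max_versions: int) -> List[Cell]:
    """Keep at most max_versions cells per (row, column), newest timestamp first."""
    by_column: Dict[Tuple[str, str], List[Cell]] = defaultdict(list)
    for cell in cells:
        by_column[(cell[0], cell[1])].append(cell)

    kept: List[Cell] = []
    for key in sorted(by_column):
        newest_first = sorted(by_column[key], key=lambda c: c[2], reverse=True)
        kept.extend(newest_first[:max_versions])  # older versions are dropped
    return kept
```

Until a compaction runs, the older versions are still physically present in the store files; they are just never returned once they fall outside the configured version count.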

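Question 3: How do HBase reads and writes tie into MVCC?
Write:
1. Acquire the relevant locks. Because HBase guarantees row-level atomicity, the lock taken covers the whole rowkey rather than a single cell. The method only proceeds once at least one lock has been acquired; otherwise it returns immediately.
2. Update the timestamp in each cell and obtain the MVCC-related parameters. The timestamp (also called the version) can be set manually by the client, so it cannot be relied on for consistency. That is probably why a separate concept, the sequenceId, was introduced to implement MVCC (although its main use is to guarantee the order of modifications in the HLog). MVCC and how HBase handles it here are introduced at the end.
3. Write the put operations into the memstore. In a database system, writing the log always matters more than writing the data, but at this point the current "transaction" can be considered not yet committed: even if the server crashes now with no log to recover from, it does not matter, because the "transaction" was never committed.
4. Build the walEdit. This step constructs a variable walEdit of type WALEdit, which aggregates many HBase cells as a list and is later written to the HLog.
5. Append the walEdit just built. First construct a walKey; note that its sequenceId starts as the default value -1 and is only later changed to the unique, increasing id tied to the region. Then call the WAL's append method, which returns an increasing number (txid) identifying the log entry just appended to the in-memory WAL. In step 7 this number is passed in as an argument to make sure that all log entries up to it have been written to the HLog file. The append method also guarantees that the walKey's sequenceId is set to the region's sequenceId (also an increasing sequence).
6. Release the acquired locks.
7. Sync the WAL to disk. As noted in step 5, this guarantees that the entry with this txid and every entry before it have been written to the log file. Once the sync finishes, the "transaction" can be considered successful, much like auto commit in MySQL.
8. Commit the operation so that the put becomes visible to reads. The core step is advancing the read point of the corresponding memstore so that the MemStoreScanner discussed earlier can see the newly put data. This ties into the MVCC discussion later.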
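The relationship between the write number handed out at the start of a write and the read point advanced in step 8 can be sketched as follows. This is a simplified Python model, not HBase's Java code: `SimpleMvcc`, `WriteEntry`, and `put` are hypothetical names, the row lock is omitted, and a plain list stands in for the WAL append and sync.

```python
import threading
from collections import deque


class WriteEntry:
    def __init__(self, write_number):
        self.write_number = write_number
        self.completed = False


class SimpleMvcc:
    def __init__(self, start_point=0):
        self._lock = threading.Lock()
        self._write_point = start_point
        self._read_point = start_point
        self._queue = deque()  # in-flight writes, in write-number order

    def begin(self):
        # Hand out the next write number; the write is not yet visible.
        with self._lock:
            self._write_point += 1
            entry = WriteEntry(self._write_point)
            self._queue.append(entry)
            return entry

    def complete(self, entry):
        # Advance the read point only over a contiguous prefix of completed
        # writes, so readers never see a gap in the sequence.
        with self._lock:
            entry.completed = True
            while self._queue and self._queue[0].completed:
                self._read_point = self._queue.popleft().write_number
            return self._read_point

    def read_point(self):
        with self._lock:
            return self._read_point


def put(mvcc, memstore, wal, row, value):
    entry = mvcc.begin()
    memstore.append((row, value, entry.write_number))  # step 3: not yet visible
    wal.append((entry.write_number, row, value))       # steps 4-7: stand-in for WAL append + sync
    mvcc.complete(entry)                                # step 8: publish to readers
    return entry.write_number
```

If a write with number 2 completes before the write with number 1, the read point stays where it was until number 1 also completes, and then jumps straight to 2.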
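Read:
When a StoreScanner is initialized it fetches the current readPoint, then uses that readPoint to filter the cells, keeping only those that are visible (sequenceId no greater than the readPoint).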
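A minimal sketch of that filtering, assuming each cell is a `(row, value, sequence_id)` tuple; `SnapshotScanner` is an illustrative name, not an HBase class:

```python
from typing import List, Tuple

Cell = Tuple[str, str, int]  # (row, value, sequence_id)


class SnapshotScanner:
    def __init__(self, cells: List[Cell], read_point: int):
        self._cells = cells            # shared with writers; may grow later
        self._read_point = read_point  # captured once, at scanner creation

    def scan(self) -> List[Cell]:
        # Only cells written at or before the captured read point are visible.
        return [c for c in self._cells if c[2] <= self._read_point]
```

Because the read point is captured once, cells added to the memstore after the scanner was created stay invisible to it, which is what gives each scan a consistent snapshot.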
Question 4: How is MVCC handled on failover?
On failover, the largest sequenceId across all store files of the region is used as the initial version number of the MVCC.
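In code terms the rule is just a maximum over the store files. The function below is a hypothetical helper for illustration; it assumes the maximum sequenceId of each store file is already known and ignores WAL replay.

```python
from typing import List


def initial_mvcc_point(store_file_max_seq_ids: List[int]) -> int:
    """Return the starting MVCC version for a reopened region.

    An empty region (no store files) starts from 0 in this sketch.
    """
    return max(store_file_max_seq_ids, default=0)
```

Starting from this value means every write accepted after the region reopens gets a number greater than anything already persisted, so the version sequence stays increasing across the failover.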
