Object storage atomic operations and consistency: considerations and design
July 18, 2016
Background
- osd/bluestore
- rgw/s3/swift
- others
Concepts
atomic put
- making sure that two or more concurrent writers that write to the same object don’t end up with an inconsistent object
atomic get
- when one client reads an object while another client writes to the same object, the result is consistent. That is, when reading an object a client should get either the old or the new version of the object, and never a mix of the two[^1]
AWS S3 Data Consistency Model
- Amazon S3 achieves high availability by replicating data across multiple servers within Amazon’s data centers.
- Amazon S3 provides read-after-write consistency for PUTs of new objects in your S3 bucket
- If a PUT request is successful, your data is safely stored.
- A process writes a new object to Amazon S3 and is immediately able to read the object
- A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
- Amazon S3 provides eventual consistency for overwrite PUTs and DELETEs in all regions.
- For updates and deletes to objects, the changes are eventually reflected and are not available immediately
- A process replaces an existing object and immediately attempts to read it. Until the change is fully propagated, Amazon S3 might return the prior data.
- A process deletes an existing object and immediately attempts to read it. Until the deletion is fully propagated, Amazon S3 might return the deleted data.
- A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.
- Updates to a single key are atomic. For example, if you PUT to an existing key, a subsequent read might return the old data or the updated data, but it will never return corrupted or partial data.
- Amazon S3 does not currently support object locking. For example, if two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you will need to build an object-locking mechanism into your application.
- Updates are key-based; there is no way to make atomic updates across keys. For example, you cannot make the update of one key dependent on the update of another key unless you design this functionality into your application.
- tips
- A successful response to a PUT request only occurs when a complete object is saved
- S3 provides eventual consistency for overwrite PUTs and DELETEs
Layered considerations: replication consistency, or read/write consistency
- single node
- same data center
- same region
- cross-region
Design principles
- Make the consistency model an option, with one setting per region
- Could it be exposed to users?
Current problems
Object storage is built on top of librados, so a single object read or write may map to multiple librados reads or writes. For atomic PUT and GET we do not plan to introduce locks: building locking on top of librados would complicate the implementation, hurt scalability, and complicate the object gateway.
Implementation
- atomic put
- Every RADOS operation is atomic, but writing a large object may require multiple librados write operations spanning multiple RADOS objects. Two implementations so far:
- Write the object to a temporary object. Once the temp object write completes, we issue a single librados clonerange operation that atomically clones the entire temp object to the destination, and once the clone succeeds we delete the temp object. Because RADOS distributes objects, the temp and destination objects must land in the same PG on the same OSD, so we use the destination object name as the temp object's object locator. (See the sketch after this list.)
- A RADOS feature called compound operations allows sending several operations bundled together and applied atomically; if one of the operations fails, nothing is applied. We use this for atomic PUT in order to set both data and metadata on the target object in a single atomic operation.
- Problems: on non-btrfs backends the object data is written twice; the temp object must be flushed to disk before the clone; and the clone requires an object locator, which hurts data balance.
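A minimal sketch of both approaches against the librados C++ API. Error handling is abbreviated; the temp-object name and xattr name are illustrative, and clone_range() reflects the librados API of this era (it was later deprecated and removed):

```cpp
#include <rados/librados.hpp>
#include <string>

// Approach 1: write to a temp object placed (via the locator) in the
// same PG as the destination, then atomically clone it over.
int put_via_clone(librados::IoCtx& ioctx, const std::string& dst,
                  librados::bufferlist& data)
{
  const std::string tmp = dst + ".tmp";  // illustrative temp name

  // Use the destination name as the object locator so temp and
  // destination land in the same PG on the same OSD.
  ioctx.locator_set_key(dst);

  int r = ioctx.write_full(tmp, data);   // a real large object would use several write() calls
  if (r < 0)
    return r;
  r = ioctx.clone_range(dst, 0, tmp, 0, data.length());  // atomic clone
  if (r < 0)
    return r;
  return ioctx.remove(tmp);              // drop the temp object
}

// Approach 2: a compound operation applies data and metadata to the
// target in one atomic step; if any sub-op fails, nothing is applied.
int put_compound(librados::IoCtx& ioctx, const std::string& dst,
                 librados::bufferlist& data, librados::bufferlist& attrs)
{
  librados::ObjectWriteOperation op;
  op.write_full(data);
  op.setxattr("user.rgw.acl", attrs);    // attribute name is illustrative
  return ioctx.operate(dst, &op);        // all-or-nothing
}
```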
- New mechanism: split the object into a head (OLH) and a tail.
- The head stores the manifest, which describes where all the data lives: the offset and size of each object part, and the underlying RADOS object and offset. For a large object, the head holds the object attributes and the first object chunk, with one or more tail objects. The tail objects of each object instance are uniquely distinguishable.
- Normal write: generate a unique tail object name derived from the original object name, placed in a separate namespace. The first 512K of data is not written to disk right away but buffered in memory; the tail is written first, and once the tail data is complete, the head RADOS object (the first 512K of data plus the object manifest and attributes) is written in a single atomic compound operation. (See the sketch after this list.)
- Multipart write: each chunk is uploaded separately to a unique location in the 'multipart' namespace. When the multipart upload completes, we generate a head object with a manifest that points to where all the object parts reside. Note that in the multipart case the head object contains only the object manifest and attributes, no data.
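A sketch of the OLH-style normal write under the assumptions above (512K head chunk buffered in memory, tail written first, head committed atomically); the tail naming scheme and xattr name are illustrative:

```cpp
#include <rados/librados.hpp>
#include <string>

// OLH normal write: write the tail first, then commit the head (first
// 512K of data + manifest + attributes) in one atomic compound op.
int olh_put(librados::IoCtx& ioctx, const std::string& name,
            librados::bufferlist& head_data,  // first 512K, buffered in memory
            librados::bufferlist& tail_data,  // the rest of the object
            librados::bufferlist& manifest)   // serialized chunk offsets/sizes/oids
{
  // Unique tail name derived from the object name; a real implementation
  // would also place it in a separate namespace.
  const std::string tail_oid = "_shadow." + name + ".0001";  // illustrative

  int r = ioctx.write_full(tail_oid, tail_data);
  if (r < 0)
    return r;

  librados::ObjectWriteOperation op;
  op.write_full(head_data);                    // first 512K of data
  op.setxattr("user.rgw.manifest", manifest);  // manifest + attrs, same op
  return ioctx.operate(name, &op);             // all-or-nothing
}
```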
atomic get
- A single GET can race with a rewrite of the same object. When one GET maps to multiple librados reads and a concurrent rewrite occurs, the result can be inconsistent. Two implementations so far:
- Also use compound operations, building on the temp-object approach from PUT.
For the atomic GET we introduce an object "tag," which is a random value that we generate for each PUT and store as an object attribute (xattr). When radosgw writes to an object it first checks for an existing object and fetches its tag (which it can do atomically). If the object exists it clones it to a new object with the tag as a suffix (taking the necessary steps to avoid name collisions) and the original object name as the locator. The compound clone operation looks like:
- check that the object name's tag attribute is tag
- clone to name_tag
The first operation is a guard to make sure that the object hasn't been rewritten since we first read it. (Had it been rewritten, we would need to restart the whole operation and reread the tag.) We put the same guard on the write of the new object instance, to make sure there was no racing operation. (A sketch follows.)
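As a sketch, the guard can be expressed with cmpxattr() inside the same compound operation; a failed comparison cancels the whole op with -ECANCELED, signaling a race with another PUT (the attribute name is illustrative):

```cpp
#include <rados/librados.hpp>
#include <cerrno>
#include <string>

// Write a new object instance, guarded by the tag we read earlier.
// If the tag changed underneath us, nothing is applied and we must
// reread the tag and restart.
int guarded_rewrite(librados::IoCtx& ioctx, const std::string& name,
                    const librados::bufferlist& old_tag,
                    librados::bufferlist& new_data,
                    librados::bufferlist& new_tag)
{
  librados::ObjectWriteOperation op;
  op.cmpxattr("user.rgw.tag", LIBRADOS_CMPXATTR_OP_EQ, old_tag);  // guard
  op.write_full(new_data);
  op.setxattr("user.rgw.tag", new_tag);  // tag of the new instance
  int r = ioctx.operate(name, &op);
  if (r == -ECANCELED) {
    // A racing operation replaced the object: restart from the tag read.
  }
  return r;
}
```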
A client that reads the object also starts by reading the tag, and puts the same guard before each subsequent read operation. If the guard fails, the client knows that the object has been rewritten. However, it also knows that, since it has been rewritten, the object it started reading can now be found at name_tag. So reading an object named foo looks like this:
- read object foo tag > 123
- verify object foo tag is "123"; read object foo (offset = 0, size = 512K) > ok, read 512K
- check object foo tag is "123"; read object foo (offset = 512K, size = 512K) > not ok, object was replaced
- read object foo_123 (offset = 512K, size = 512K) > ok, read 512K
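A sketch of one guarded chunk read: the cmpxattr() guard is bundled with the read in one compound op, and on -ECANCELED the reader falls back to the cloned instance name_tag (names are illustrative):

```cpp
#include <rados/librados.hpp>
#include <cerrno>
#include <string>

// Read one chunk of `name`, guarded by `tag`; if the object was
// replaced mid-GET, continue from the clone named "<name>_<tag>".
int guarded_chunk_read(librados::IoCtx& ioctx, const std::string& name,
                       librados::bufferlist tag,
                       uint64_t off, size_t len, librados::bufferlist* out)
{
  librados::ObjectReadOperation op;
  int read_rval = 0;
  op.cmpxattr("user.rgw.tag", LIBRADOS_CMPXATTR_OP_EQ, tag);  // guard
  op.read(off, len, out, &read_rval);
  int r = ioctx.operate(name, &op, NULL);
  if (r == -ECANCELED) {
    // Object was rewritten: the instance we were reading lives on
    // under name_tag (cloned there by the rewriting PUT).
    std::string clone = name + "_" + std::string(tag.c_str(), tag.length());
    return ioctx.read(clone, *out, len, off);  // no guard needed anymore
  }
  return r;
}
```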
The final component is an intent log. Since we end up creating multiple instances of the same object under different names, we need to make sure these objects are cleaned up after some reasonable amount of time. We added a log object in which we record each object that needs to be removed. After a sufficient amount of time (however long we expect very slow GETs to still succeed), a process iterates over the log and removes old objects. (A sketch follows.)
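A sketch of recording an intent, assuming a single shared log object and a newline-delimited entry format (both illustrative; the real gateway encodes structured entries with timestamps):

```cpp
#include <rados/librados.hpp>
#include <string>

// Record that `stale_oid` should be removed once slow GETs have had
// enough time to finish; a background process later replays the log.
int log_removal_intent(librados::IoCtx& ioctx, const std::string& stale_oid)
{
  librados::bufferlist entry;
  entry.append(stale_oid + "\n");  // real entries carry a timestamp too
  return ioctx.append("gc.intent_log", entry, entry.length());
}
```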
The object is only cloned on PUT, when the object is rewritten. If an object is rewritten while a GET operation is in progress, the GET continues on the cloned object (cloned by the rewriting PUT operation). In any case, when the backend runs over btrfs the clone operation doesn't copy the entire object; it generates a new object using the existing data.
- New mechanism, based on the OLH
Similar to before: read the first chunk and all the attributes (the head) in a single atomic RADOS operation, then continue reading the object data at the locations given by the manifest. The first data chunk doesn't have to live in the head. Reading the head is atomic; reading the tail is not considered atomic, but since the tail resides in unique RADOS objects (different object instances use different RADOS objects), we don't need to access it atomically.
For rewrites, which used to rely on the tag, we now rely on the manifest: there is no need to clone the original object, only its original manifest. (my addition)
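A sketch of the OLH-style GET head read: data and manifest/attributes come back from one atomic compound read, after which tail chunks can be read without guards (the xattr name and 512K head size follow the assumptions above):

```cpp
#include <rados/librados.hpp>
#include <string>

// Fetch the head atomically: first chunk + manifest in one compound read.
// Tail objects are unique per object instance, so subsequent tail reads
// need no guard.
int olh_get_head(librados::IoCtx& ioctx, const std::string& name,
                 librados::bufferlist* data, librados::bufferlist* manifest)
{
  librados::ObjectReadOperation op;
  int read_rval = 0, xattr_rval = 0;
  op.read(0, 512 * 1024, data, &read_rval);                 // first chunk
  op.getxattr("user.rgw.manifest", manifest, &xattr_rval);  // manifest/attrs
  return ioctx.operate(name, &op, NULL);                    // atomic
}
```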
refs
http://shlomoswidler.com/2009/12/read-after-write-consistency-in-amazon.html
http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Eventually Consistent (ACM), to study:
http://dl.acm.org/citation.cfm?id=1435432
http://dl.acm.org/citation.cfm?id=2576794
http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html#access-bucket-intro
https://www.quora.com/What-is-the-data-consistency-model-for-Amazon-S3
Commentary
http://jayendrapatil.com/aws-s3-data-consistency-model/
http://ceph.com/dev-notes/atomicity-of-restful-radosgw-operations/
rgw atomic operations, revisited
Announcement: US Standard now supports read-after-write consistency
https://forums.aws.amazon.com/ann.jspa?annID=3112
Read-after-write consistency allows you to retrieve objects immediately after creation in Amazon S3. Prior to this change, Amazon S3 buckets in the US Standard Region provided eventual consistency for newly created objects, which meant that some small set of objects might not have been available to read immediately after new object upload. These occasional delays could complicate data processing workflows where applications need to read objects immediately after creating the objects. Please note that in US Standard Region, this consistency change applies to the Northern Virginia endpoint (s3-external-1.amazonaws.com). Customers using the global endpoint (s3.amazonaws.com) should switch to using the Northern Virginia endpoint (s3-external-1.amazonaws.com) in order to leverage the benefits of this read-after-write consistency in the US Standard Region.
Amazon S3 Data Consistency Model
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#Regions
[^1]: ATOMICITY OF RESTFUL RADOSGW OPERATIONS