July 18, 2016
- ensuring that two or more concurrent writers to the same object do not leave it in an inconsistent state
- ensuring that when one client reads an object while another client writes to the same object, the result is consistent. That is, when reading an object a client should get either the old or the new version of the object, and never a mix of the two[^1]
AWS S3 Data Consistency Model
- Amazon S3 achieves high availability by replicating data across multiple servers within Amazon’s data centers.
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket
- If a PUT request is successful, your data is safely stored.
- A process writes a new object to Amazon S3 and is immediately able to read the object.
- A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
Amazon S3 provides eventual consistency for overwrite PUTS and DELETES in all regions.
- For updates and deletes to Objects, the changes are eventually reflected and not available immediately
- A process replaces an existing object and immediately attempts to read it. Until the change is fully propagated, Amazon S3 might return the prior data.
- A process deletes an existing object and immediately attempts to read it. Until the deletion is fully propagated, Amazon S3 might return the deleted data.
- A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.
- Updates to a single key are atomic. For example, if you PUT to an existing key, a subsequent read might return the old data or the updated data, but it will never return corrupted or partial data.
- Amazon S3 does not currently support object locking. For example, if two PUT requests are simultaneously made to the same key, the request with the latest time stamp wins. If this is an issue, you will need to build an object-locking mechanism into your application.
- Updates are key-based; there is no way to make atomic updates across keys. For example, you cannot make the update of one key dependent on the update of another key unless you design this functionality into your application.
- A successful response to a PUT request only occurs when a complete object is saved
- S3 provides eventual consistency for overwrite PUTS and DELETES
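Two of the rules above can be made concrete in a small sketch: updates to a single key are atomic (a reader always sees one complete write), and simultaneous PUTs resolve by last-writer-wins on timestamp. This is a hypothetical model, not the S3 API; the `Bucket` class and its methods are illustrative only.

```python
# Minimal sketch (not the S3 API) of last-writer-wins semantics for
# concurrent PUTs to the same key; names here are illustrative.
class Bucket:
    def __init__(self):
        self.objects = {}  # key -> (timestamp, data): always one complete write

    def put(self, key, data, timestamp):
        current = self.objects.get(key)
        if current is None or timestamp >= current[0]:
            self.objects[key] = (timestamp, data)  # latest time stamp wins

    def get(self, key):
        entry = self.objects.get(key)
        return None if entry is None else entry[1]

bucket = Bucket()
bucket.put("photo.jpg", b"new", timestamp=2)  # the "later" request lands first
bucket.put("photo.jpg", b"old", timestamp=1)  # earlier request loses the race
assert bucket.get("photo.jpg") == b"new"      # latest time stamp wins
```

Note that the stored value is always one writer's complete payload, never an interleaving of two writers, which is the per-key atomicity guarantee described above.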
Write the object to a temporary object first. Once the temporary object is fully written, we issue a single librados clonerange operation that atomically clones the entire temp object to the destination. Once the clone succeeds, the temporary object is removed. Because the temporary object and the destination object are placed by RADOS, they must land in the same PG on the same OSD; we use the destination object's name as the temporary object's object locator to guarantee this.
- RADOS has a feature called compound operations, which allows you to send several operations bundled together and applied atomically. If one of the operations fails, nothing is applied. We use this for atomic PUT in order to set both data and metadata on the target object in a single atomic operation.
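The all-or-nothing behavior of a compound operation can be sketched as a bundle of guard checks and mutations where any failed check aborts the whole bundle. This is a hypothetical model of the idea, not the librados API; `CompoundOp`, `add`, and `apply` are illustrative names.

```python
# Hypothetical sketch of a compound operation: guards + mutations, applied
# all-or-nothing. Not the librados API; names are illustrative.
class CompoundOp:
    def __init__(self):
        self.steps = []  # (check, mutate) pairs

    def add(self, check, mutate):
        self.steps.append((check, mutate))
        return self

    def apply(self, obj):
        # If any check fails, nothing is applied.
        if not all(check(obj) for check, _ in self.steps):
            return False
        for _, mutate in self.steps:
            mutate(obj)
        return True

target = {"data": None, "xattrs": {}}
op = CompoundOp()
op.add(lambda o: True, lambda o: o.update(data=b"payload"))      # set data
op.add(lambda o: True, lambda o: o["xattrs"].update(etag="ab"))  # set metadata
assert op.apply(target)  # data and metadata land together, as in atomic PUT
```

In real RADOS the bundle executes atomically on the OSD; here the check-then-mutate split only illustrates the "if one fails, nothing is applied" contract.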
- Problems with the temp-object approach: on non-btrfs backends the object is written twice; the temporary object must be flushed to disk before it can be cloned; and the clone requires an object locator, which hurts data balance.
- Plain (non-multipart) writes: derive a unique name for the tail from the original object name; this name also lives in a separate namespace. The first 512K of data is not written to disk immediately but buffered in memory; the tail is written first, and once the tail data is complete, the head RADOS object (the first 512K of data plus the object manifest and attributes) is written in a single atomic compound operation.
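The plain-write path above can be sketched with a dict standing in for RADOS. The tail-naming scheme (`name + "_" + tag`) and the record layout are illustrative assumptions, not radosgw's actual on-disk format.

```python
# Sketch of the plain-write path: buffer the head, write the tail first,
# then write head data + manifest + attributes together. A dict stands in
# for RADOS; the naming scheme is illustrative.
HEAD_SIZE = 512 * 1024  # first 512K is buffered in memory

def atomic_put(store, name, data, tag):
    tail_name = f"{name}_{tag}"  # unique tail name derived from the object name
    head, tail = data[:HEAD_SIZE], data[HEAD_SIZE:]
    if tail:
        store[tail_name] = tail  # tail is written first
    # Head is written last, in one compound op: first 512K of data,
    # the object manifest, and the attributes.
    store[name] = {
        "data": head,
        "manifest": [tail_name] if tail else [],
        "xattrs": {"tag": tag},
    }
```

Because the head (with its manifest) appears only after the tail is durable, a reader never sees a manifest pointing at missing data.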
Each chunk is uploaded separately to a unique location that is located in the ‘multipart’ namespace. When the multipart upload completes we generate a head object with a manifest that points to where all the object parts reside. Note that in the multipart case the head object only contains the object manifest and attributes but does not contain any data.
For the atomic GET we introduce an object “tag,” which is a random value that we generate for each PUT and store as an object attribute (xattr). When radosgw writes to an object it first checks for an existing object and fetches its tag (which it can do atomically). If the object exists it clones it to a new object with the tag as a suffix (taking necessary steps to avoid name collisions) and the original object name as the locator. The compound clone operation looks like:
- check that the tag attribute of object name is tag
- clone to name_tag
The first operation is a guard to make sure that the object hasn’t been rewritten since we first read it. (Had it been rewritten, we would need to restart the whole operation and reread the tag.) We apply the same guard when we write the new object instance, to make sure that there was no racing operation.
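The guard-then-clone compound op can be sketched as follows, again with a dict standing in for RADOS. In a real compound operation the guard and the clone apply atomically on the OSD; the function name and record layout here are illustrative.

```python
# Sketch of the guard + clone compound op: fail if the tag changed since
# it was read, otherwise clone the object to name_tag. A dict stands in
# for RADOS; in real RADOS both steps apply atomically.
def guarded_clone(store, name, expected_tag):
    obj = store.get(name)
    # Guard: on a stale tag the caller must restart the whole
    # operation and reread the tag.
    if obj is None or obj["xattrs"].get("tag") != expected_tag:
        return False
    store[f"{name}_{expected_tag}"] = dict(obj)  # clone to name_tag
    return True
```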
A client that reads the object also starts by reading the tag, and applies the same guard before each subsequent read operation. If the guard fails, the client knows that the object has been rewritten. However, it also knows that since it has been rewritten, the object that it started reading can now be found at name_tag. So, reading an object named foo looks like this:
- read object foo tag > "123"
- verify object foo tag is "123"; read object foo (offset = 0, size = 512K) > ok, read 512K
- check object foo tag is "123"; read object foo (offset = 512K, size = 512K) > not ok, object was replaced
- read object foo_123 (offset = 512K, size = 512K) > ok, read 512K
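The read sequence above can be sketched end to end, with a dict standing in for RADOS and a `guarded_read` helper that is an illustrative name, not a librados call:

```python
# Sketch of the guarded read sequence: read the tag, guard each chunk
# read on it, and fall back to name_tag when the guard fails.
CHUNK = 512 * 1024

def guarded_read(store, name, tag, offset, size):
    # Returns (ok, data); ok is False when the tag guard fails.
    obj = store.get(name)
    if obj is None or obj["xattrs"].get("tag") != tag:
        return False, b""
    return True, obj["data"][offset:offset + size]

store = {"foo": {"data": b"A" * CHUNK + b"B" * CHUNK, "xattrs": {"tag": "123"}}}
tag = store["foo"]["xattrs"]["tag"]                     # read tag > "123"
ok1, part1 = guarded_read(store, "foo", tag, 0, CHUNK)  # ok, read 512K
# A racing PUT clones foo to foo_123, then replaces foo under a new tag:
store["foo_123"] = store["foo"]
store["foo"] = {"data": b"C" * CHUNK, "xattrs": {"tag": "456"}}
ok2, _ = guarded_read(store, "foo", tag, CHUNK, CHUNK)  # not ok: replaced
ok3, part2 = guarded_read(store, "foo_123", tag, CHUNK, CHUNK)  # old tail, ok
```

The reader thus always assembles one consistent version: the original, via the clone, even though a rewrite landed mid-read.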
The final component is an intent log. Since we end up creating multiple instances of the same object under different names, we need to make sure that these objects are cleaned up after some reasonable amount of time. We added a log object in which we record each object that needs to be removed. After a sufficient amount of time (however long we expect very slow GETs to still succeed), a process iterates over the log and removes the old objects.
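A minimal sketch of that cleanup, assuming an in-memory log of `(created_at, name)` entries and a configurable grace period (both are illustrative, not radosgw's actual log format):

```python
import time

# Sketch of the intent log: record each clone for deferred removal, then
# trim entries older than a grace period. Entry format is illustrative.
intent_log = []  # (created_at, object_name) entries awaiting removal

def log_intent(name, now=None):
    intent_log.append((time.time() if now is None else now, name))

def trim_intents(store, grace_seconds, now=None):
    # Remove clones older than the grace period (long enough that even
    # very slow GETs still reading them have finished).
    now = time.time() if now is None else now
    remaining = []
    for created, name in intent_log:
        if now - created >= grace_seconds:
            store.pop(name, None)  # old enough: delete the stale clone
        else:
            remaining.append((created, name))
    intent_log[:] = remaining
```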
The object is only cloned on PUT, when the object is rewritten. If an object is rewritten while a GET operation is in progress, the operation continues on the cloned object (cloned by the rewriting PUT operation). In any case, when the backend runs on btrfs the clone operation doesn’t copy the entire object; rather, it generates a new object that references the existing data.
The GET path is similar to before: read the object's first chunk and all its attributes (the head) in a single atomic RADOS operation, then continue reading the object data at the locations given by the manifest. The first data chunk does not have to live in the head. Reading the head is atomic; reading the tail is not considered atomic. However, since the tail resides in a unique RADOS object (each object instance maps to different RADOS objects), we don’t need to access it atomically.
Eventually Consistent (ACM) — close reading
rgw atomic operations, revisited
Announcement: US Standard now supports read-after-write consistency
Read-after-write consistency allows you to retrieve objects immediately after creation in Amazon S3. Prior to this change, Amazon S3 buckets in the US Standard Region provided eventual consistency for newly created objects, which meant that some small set of objects might not have been available to read immediately after new object upload. These occasional delays could complicate data processing workflows where applications need to read objects immediately after creating the objects. Please note that in US Standard Region, this consistency change applies to the Northern Virginia endpoint (s3-external-1.amazonaws.com). Customers using the global endpoint (s3.amazonaws.com) should switch to using the Northern Virginia endpoint (s3-external-1.amazonaws.com) in order to leverage the benefits of this read-after-write consistency in the US Standard Region.
Amazon S3 Data Consistency Model
[^1]: ATOMICITY OF RESTFUL RADOSGW OPERATIONS