
SERVER-47244 Add the Auto-splitting section to the sharding arch guide

Haley Connelly 2020-06-15 23:01:06 +00:00 committed by Evergreen Agent
parent 128ea14211
commit 7419262b63


@@ -314,6 +314,88 @@ The config server replica set durably stores settings for the maximum chunk size
should be automatically split and balanced.
## Auto-splitting
When the mongos routes an update or insert to a chunk, the chunk may grow beyond the configured
chunk size (specified by the server parameter maxChunkSizeBytes) and trigger an auto-split, which
partitions the oversized chunk into smaller chunks. The shard that houses the chunk is responsible
for the following steps, outlined in the sketch below:
* determining if the chunk should be auto-split
* selecting the split points
* committing the split points to the config server
* refreshing the routing table cache
* updating in-memory chunk size estimates
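
At a high level, the steps fit together roughly as in the sketch below. Every name here is a
hypothetical stand-in (the real logic is spread across the classes listed under the code references
at the end of this section), and each step is covered in the subsections that follow.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the real machinery (ChunkManager, ChunkWritesTracker,
// ChunkSplitter, splitVector, splitChunk), trivially defined so the sketch compiles.
using SplitPoints = std::vector<std::string>;
bool chunkShouldBeAutoSplit() { return true; }                 // size estimate vs. threshold
SplitPoints selectSplitPoints() { return {}; }                 // splitVector
void commitSplitPointsToConfigServer(const SplitPoints&) {}
void refreshRoutingTableCache() {}
void updateInMemoryChunkSizeEstimates(const SplitPoints&) {}

// Illustrative outline of one auto-split on the shard that owns the chunk.
void autoSplitChunkOutline() {
    if (!chunkShouldBeAutoSplit())
        return;
    const SplitPoints splitPoints = selectSplitPoints();
    if (splitPoints.empty())
        return;  // no usable split points; the task is abandoned
    commitSplitPointsToConfigServer(splitPoints);
    refreshRoutingTableCache();
    updateInMemoryChunkSizeEstimates(splitPoints);
}
```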
### Deciding when to auto-split a chunk
The server schedules an auto-split if:
1. it detects that the chunk exceeds a threshold based on the maximum chunk size
2. there is not already a split in progress for the chunk
Every time an update or insert gets routed to a chunk, the server tracks the bytes written to the
chunk in memory through the collection's ChunkManager. The ChunkManager has a ChunkInfo object for
each of the collection's entries in the local config.chunks. Through the ChunkManager, the server
retrieves the chunk's ChunkInfo and uses its ChunkWritesTracker to increment the estimated chunk
size.
Even if the new size estimate exceeds the maximum chunk size, the server still needs to check that
there is not already a split in progress for the chunk. If the ChunkWritesTracker is locked, there
is already a split in progress on the chunk and trying to start another split is prohibited.
Otherwise, if the chunk is oversized and no split is already in progress for the chunk, the server
submits the chunk to the ChunkSplitter to be auto-split.
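
The bookkeeping behind both checks can be pictured as a small per-chunk tracker with an atomic byte
counter and a split-in-progress flag. The sketch below is simplified and its method names are
approximations; the real interface is in ChunkWritesTracker (see the code references).

```cpp
#include <atomic>
#include <cstdint>

// Simplified sketch of a per-chunk writes tracker; the real ChunkWritesTracker
// (see the code references below) is the authoritative version.
class WritesTrackerSketch {
public:
    // Called on the write path every time an update/insert is routed to the chunk.
    void addBytesWritten(uint64_t bytes) {
        _bytesWritten.fetch_add(bytes, std::memory_order_relaxed);
    }

    // Illustrative threshold: the real check compares the estimate against a
    // threshold derived from the configured maximum chunk size.
    bool shouldSplit(uint64_t maxChunkSizeBytes) const {
        return _bytesWritten.load(std::memory_order_relaxed) > maxChunkSizeBytes;
    }

    // The write path only looks at the flag: a locked tracker means a split is
    // already running for this chunk and no new split should be started.
    bool isLockedForSplitting() const {
        return _splitInProgress.load(std::memory_order_relaxed);
    }

    // Used by the split task itself: only one caller can win the lock.
    bool acquireSplitLock() {
        bool expected = false;
        return _splitInProgress.compare_exchange_strong(expected, true);
    }

    void releaseSplitLock() { _splitInProgress.store(false, std::memory_order_relaxed); }

    // Reset the estimate so it can count writes arriving during an auto-split.
    void clearBytesWritten() { _bytesWritten.store(0, std::memory_order_relaxed); }

private:
    std::atomic<uint64_t> _bytesWritten{0};
    std::atomic<bool> _splitInProgress{false};
};
```

On the write path the server would call addBytesWritten, consult shouldSplit and
isLockedForSplitting, and only then hand the chunk to the ChunkSplitter.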
### The auto-split task
The ChunkSplitter is a replica set primary-only service that manages the process of auto-splitting
chunks. The ChunkSplitter runs auto-split tasks asynchronously, so multiple chunks can undergo an
auto-split concurrently, provided they don't overlap.
To prepare for the split point selection process, the ChunkSplitter flags that an auto-split for the
chunk is in progress. There may be incoming writes to the original chunk while the split is in
progress. For this reason, the estimated data size in the ChunkWritesTracker for this chunk is
reset, and the same counter is used to track the number of bytes written to the chunk while the
auto-split is in progress.
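
Assuming a tracker like the one sketched in the previous subsection, the hand-off might look roughly
like this; the detached thread is only a stand-in for the ChunkSplitter's own task executor, and all
names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for what one auto-split task needs.
struct SplitRequest {
    std::string ns;      // collection namespace
    std::string minKey;  // chunk bounds, simplified to strings
    std::string maxKey;
};
std::vector<std::string> selectSplitPoints(const SplitRequest&) { return {}; }  // splitVector
void commitSplitAndRefresh(const SplitRequest&, const std::vector<std::string>&) {}
void releaseSplitInProgressFlag(const SplitRequest&) {}

// Sketch only: the split-in-progress flag was set when the task was scheduled, the
// byte counter is cleared so it now counts writes arriving during the split, and the
// remaining work happens off the write path.
void submitAutoSplit(SplitRequest request, std::atomic<uint64_t>& chunkBytesWritten) {
    chunkBytesWritten.store(0);

    std::thread([request = std::move(request)] {
        const auto splitPoints = selectSplitPoints(request);
        if (!splitPoints.empty()) {
            commitSplitAndRefresh(request, splitPoints);
        }
        releaseSplitInProgressFlag(request);  // whether or not a split happened
    }).detach();
}
```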
splitVector manages the bulk of the split point selection logic. First, the data size and number of
records are retrieved from the storage engine to approximate the number of keys that each chunk
partition should have. This number is calculated such that, if each document were uniform in size,
each resulting chunk would hold roughly half of maxChunkSizeBytes worth of data.
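
For instance, with made-up numbers, the target number of keys per resulting chunk works out like
this:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Illustrative values as the storage engine might report them for the collection.
    const int64_t dataSizeBytes = 256LL * 1024 * 1024;     // 256 MB of data
    const int64_t numRecords = 524288;                     // number of documents
    const int64_t maxChunkSizeBytes = 64LL * 1024 * 1024;  // configured maximum chunk size

    const int64_t avgDocSizeBytes = dataSizeBytes / numRecords;              // 512 bytes
    // Aim for each resulting chunk to hold about half the maximum chunk size.
    const int64_t keysPerChunk = maxChunkSizeBytes / (2 * avgDocSizeBytes);  // 65536 keys

    std::cout << "pick a split point roughly every " << keysPerChunk << " keys\n";
    return 0;
}
```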
If the actual data size is less than the maximum chunk size, no splits are made at all.
Additionally, if all documents in the chunk have the same shard key, no splits can be made. In this
case, the chunk may be classified as a jumbo chunk.
In the general case, splitVector (sketched below):
* performs an index scan on the shard key index
* adds every k'th key to the vector of split points, where k is the approximate number of keys each chunk should have
* returns the split points
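
A minimal sketch of that scan, assuming the keys arrive in shard key order and keysPerChunk was
derived as in the previous example (the real splitVector also caps the number of split points and
copes with long runs of identical keys):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Walk the shard key index in order and record every keysPerChunk-th key as a
// split point. Keys are simplified to strings here.
std::vector<std::string> pickSplitPoints(const std::vector<std::string>& orderedShardKeys,
                                         int64_t keysPerChunk) {
    std::vector<std::string> splitPoints;
    int64_t keysSinceLastSplit = 0;
    for (const auto& key : orderedShardKeys) {
        if (++keysSinceLastSplit >= keysPerChunk) {
            splitPoints.push_back(key);  // boundary between two of the resulting chunks
            keysSinceLastSplit = 0;
        }
    }
    return splitPoints;
}
```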
If no split points were returned, the auto-split task is abandoned.
If split points are successfully generated, the ChunkSplitter executes the final steps of the
auto-split task, sketched after this list, where the shard:
* commits the split points to config.chunks on the config server by removing the document containing
the original chunk and inserting new documents corresponding to the new chunks indicated by the
split points
* refreshes the routing table cache
* replaces the original oversized chunk's ChunkInfo with a ChunkInfo object for each partition. The
estimated data size for each new chunk is the number of bytes written to the original chunk while the
auto-split was in progress
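
The last of those steps amounts to replacing one stale entry with an entry per partition, each
seeded with the bytes written during the split. The types below are simplified stand-ins for
ChunkInfo:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the per-chunk routing metadata the shard keeps in memory.
struct ChunkInfoSketch {
    std::string minKey;
    std::string maxKey;
    uint64_t estimatedBytesWritten;  // feeds future auto-split decisions
};

// Replace the original oversized chunk's entry with one entry per new chunk. Each new
// estimate starts at the number of bytes written while the auto-split was in progress.
std::vector<ChunkInfoSketch> rebuildChunkInfos(const ChunkInfoSketch& original,
                                               const std::vector<std::string>& splitPoints,
                                               uint64_t bytesWrittenDuringSplit) {
    std::vector<ChunkInfoSketch> newChunks;
    std::string lowerBound = original.minKey;
    for (const auto& splitPoint : splitPoints) {
        newChunks.push_back({lowerBound, splitPoint, bytesWrittenDuringSplit});
        lowerBound = splitPoint;
    }
    newChunks.push_back({lowerBound, original.maxKey, bytesWrittenDuringSplit});
    return newChunks;
}
```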
### Top Chunk Optimization
While there are several optimizations in the auto-split process that won't be covered here, it's
worthwhile to note the concept of top chunk optimization. If the chunk being split is the first or
last one in the collection, the assumption is that the chunk is likely to see more insertions
if the user is inserting in ascending/descending order with respect to the shard key. So, in top
chunk optimization, the first (or last) key in the chunk is set as a split point. Once the split
points are committed to the config server and the shard refreshes its CatalogCache, the
ChunkSplitter tries to move the top chunk out of the shard to prevent the hot spot from sitting on a
single shard.
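
Sketched with similarly simplified types, the optimization only changes which split points are
proposed; migrating the resulting top chunk is requested afterwards:

```cpp
#include <string>
#include <vector>

// Simplified description of the chunk being split.
struct ChunkBoundsSketch {
    bool isFirstChunkOfCollection;  // chunk's min is the collection's global min
    bool isLastChunkOfCollection;   // chunk's max is the collection's global max
};

// Top chunk optimization: also split at the chunk's first (or last) key so the
// resulting boundary chunk is small enough to be moved to another shard.
void applyTopChunkOptimization(const ChunkBoundsSketch& chunk,
                               const std::string& firstKeyInChunk,
                               const std::string& lastKeyInChunk,
                               std::vector<std::string>& splitPoints) {
    if (chunk.isFirstChunkOfCollection) {
        splitPoints.insert(splitPoints.begin(), firstKeyInChunk);
    }
    if (chunk.isLastChunkOfCollection) {
        splitPoints.push_back(lastKeyInChunk);
    }
    // After the split points are committed and the CatalogCache is refreshed, the
    // ChunkSplitter tries to move the resulting top chunk off this shard.
}
```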
#### Code references
* [**ChunkInfo**](https://github.com/mongodb/mongo/blob/18f88ce0680ab946760b599437977ffd60c49678/src/mongo/s/chunk.h#L44) class
* [**ChunkManager**](https://github.com/mongodb/mongo/blob/master/src/mongo/s/chunk_manager.h) class
* [**ChunkSplitter**](https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/chunk_splitter.h) class
* [**ChunkWritesTracker**](https://github.com/mongodb/mongo/blob/master/src/mongo/s/chunk_writes_tracker.h) class
* [**splitVector**](https://github.com/mongodb/mongo/blob/18f88ce0680ab946760b599437977ffd60c49678/src/mongo/db/s/split_vector.cpp#L61) method
* [**splitChunk**](https://github.com/mongodb/mongo/blob/18f88ce0680ab946760b599437977ffd60c49678/src/mongo/db/s/split_chunk.cpp#L128) method
* [**commitSplitChunk**](https://github.com/mongodb/mongo/blob/18f88ce0680ab946760b599437977ffd60c49678/src/mongo/db/s/config/sharding_catalog_manager_chunk_operations.cpp#L316) method where chunk splits are committed
## Auto-balancing