FSM-based Concurrency Testing Framework
Overview
The FSM tests are meant to exercise concurrency within MongoDB. The suite consists of workloads, which define discrete units of work as states in an FSM, and runners, which define which tests to run and how they should be run. Each workload defines states, which are JS functions that perform some meaningful series of tasks and assertions, and transitions, which define how to move between those states. A workload begins by executing its setup function, which is called once on the runner's thread. Next, the runner spawns the number of threads specified by the workload, and each spawned thread executes the start state (typically named "init") defined by the workload. From this point on, each worker thread executes its own independent copy of the FSM and, after executing each state's function, randomly moves to a next state based on the probabilities defined in the workload's transition table. Each worker thread continues until the number of transitions it has made reaches the number of iterations defined by the workload. Once all the worker threads have finished, the runner executes the workload's teardown function.
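The lifecycle above can be sketched as a single-process simulation. This is a minimal sketch under stated assumptions, not the runner's code: the hypothetical runWorkloadSketch runs the "threads" one after another instead of concurrently, omits the db and collName arguments, and takes a caller-supplied chooser for the next state so that the output is deterministic.

```javascript
// Hypothetical, single-process sketch of the workload lifecycle. "Threads"
// run one after another here; the real runner spawns them concurrently and
// also passes db and collName to every state.
function runWorkloadSketch(config, chooseNext) {
    const log = [];
    const startState = config.startState || "init";

    // 1. setup runs once, on the runner's own thread.
    config.setup.call(config.data);
    log.push("setup");

    // 2. Each worker gets its own copy of data (plus a tid) and its own FSM.
    for (let tid = 0; tid < config.threadCount; tid++) {
        const data = Object.assign({}, config.data, {tid: tid});
        let state = startState;
        for (let i = 0; i < config.iterations; i++) {
            config.states[state].call(data);
            log.push("t" + tid + ":" + state);
            state = chooseNext(config.transitions[state]);
        }
    }

    // 3. teardown runs once, after every worker has finished.
    config.teardown.call(config.data);
    log.push("teardown");
    return log;
}

const config = {
    threadCount: 2,
    iterations: 3,
    states: {init() {}, work() {}},
    transitions: {init: {work: 1}, work: {work: 1}},
    setup() {},
    teardown() {},
    data: {},
};
// Deterministic chooser for the demo: always take the first listed successor.
const firstKey = (row) => Object.keys(row)[0];
console.log(runWorkloadSketch(config, firstKey).join(" "));
// setup t0:init t0:work t0:work t1:init t1:work t1:work teardown
```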
The runner provides two modes of execution for workloads: serial and parallel. Serial mode runs the provided workloads one after the other, waiting for all threads of a workload to complete before moving on to the next workload. Parallel mode runs subsets of the provided workloads in separate threads simultaneously.
New methods were added to allow for finer-grained assertions under different
situations. For example, a test that inserts a document into a collection, and
wants to assert its existence will fail if another test removes that document.
One option would have been to disable all assertions when running a mixture of
different workloads together, but doing so would make the system incapable of
detecting anything other than server crashes. Another option would have been to
design the workloads to be conflict-free (e.g. writing to separate collections,
using commutative operators), but this would leave large gaps in the achievable
test coverage. Neither of those options was found to be very appealing.
Instead, we chose to introduce the concept of an "assertion level" that acts as
a precondition for when an assertion is evaluated. This allows us to still make
some assertions, even when running a mixture of different workloads together.
There are three assertion levels: ALWAYS, OWN_COLL, and OWN_DB. They can be
thought of as follows:
- ALWAYS: A statement that remains unequivocally true, regardless of what another workload might be doing to the collection I was given (hint: think defensively). Examples include "1 = 1" or inserting a document into a collection (disregarding any unique indices).
- OWN_COLL: A statement that is true only if I am the only workload operating on the collection I was given. Examples include counting the number of documents in a collection or updating a previously inserted document.
- OWN_DB: A statement that is true only if I am the only workload operating on the database I was given. Examples include renaming a collection or verifying that a collection is capped. Such a workload typically relies on the use of another collection besides the one it was given.
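One way to picture an assertion level as a precondition is the sketch below. The numeric ordering of the levels and the makeLeveledAssert helper are assumptions made for illustration, not the FSM library's API: a check runs only when the isolation the runner can guarantee is at least as strong as the level the check requires.

```javascript
// Hypothetical sketch: assertion levels act as preconditions on checks.
const AssertLevel = Object.freeze({ALWAYS: 0, OWN_COLL: 1, OWN_DB: 2});

// `guaranteed` is the strongest isolation the runner can promise this workload:
// OWN_DB if it has its own database, OWN_COLL if it only owns its collection,
// ALWAYS if even the collection may be shared.
function makeLeveledAssert(guaranteed) {
    return function leveledAssert(requiredLevel, condition, message) {
        if (requiredLevel > guaranteed) {
            return false; // precondition not met, so the check is skipped
        }
        if (!condition) {
            throw new Error("assertion failed: " + message);
        }
        return true; // the check ran and passed
    };
}

// A workload that owns its collection but shares its database with others.
const check = makeLeveledAssert(AssertLevel.OWN_COLL);
console.log(check(AssertLevel.ALWAYS, 1 === 1, "1 = 1")); // true
console.log(check(AssertLevel.OWN_COLL, 5 === 5, "document count")); // true
console.log(check(AssertLevel.OWN_DB, false, "collection is capped")); // false (skipped)
```

In this sketch the condition is evaluated eagerly even when the check is skipped; only the decision to assert on it is gated by the level.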
Creating your own workload
All workloads are stored in jstests/concurrency/fsm_workloads. For specific
examples of how to write new workloads, refer to
jstests/concurrency/fsm_example.js and
jstests/concurrency/fsm_example_inheritance.js. Every workload is loaded as
inline JavaScript using the "load" function, which behaves much more like a C
#include than like require.js. This means that whatever variables are declared
in the global scope of the file become part of the scope where load is called.
The runner looks for a variable called $config, which stores the configuration
of your workload.
The $config object
There should be exactly one $config per workload. For style consistency as
well as safety, be sure to wrap the value of $config in an immediately invoked
anonymous function. This creates a JS closure and a new scope:
$config = (function() {
    /* ... */
    return {
        threadCount: "<number of threads>",
        iterations: "<number of steps>",
        startState: "<start state for this workload>",
        states: "<state functions>",
        transitions: "<transition probability map>",
        setup: "<function to initialize workload>",
        teardown: "<function to cleanup workload if necessary>",
        data: "<'this' property available to each state function>",
    };
})();
When it finishes executing, the anonymous function must return an object
containing the properties above (some of which are optional; see below).
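The shape of the returned object can be checked mechanically. The sketch below assumes that threadCount, iterations, states, and transitions are required and that the other properties are optional; validateConfigSketch is a hypothetical helper, and the FSM library's own validation may check more or different things. It also fills in the 'init' default for startState.

```javascript
// Hypothetical validator for a $config object; the required/optional split
// is an assumption made for this sketch.
function validateConfigSketch(config) {
    const required = ["threadCount", "iterations", "states", "transitions"];
    for (const key of required) {
        if (!(key in config)) {
            throw new Error("$config is missing required property: " + key);
        }
    }
    const startState = config.startState || "init";
    if (typeof config.states[startState] !== "function") {
        throw new Error("start state '" + startState + "' is not defined");
    }
    // Every state named in the transition table must exist.
    for (const [from, row] of Object.entries(config.transitions)) {
        for (const to of [from, ...Object.keys(row)]) {
            if (typeof config.states[to] !== "function") {
                throw new Error("transition references unknown state: " + to);
            }
        }
    }
    // Optional properties get harmless defaults.
    return Object.assign({setup() {}, teardown() {}, data: {}}, config, {startState});
}

const checked = validateConfigSketch({
    threadCount: 5,
    iterations: 20,
    states: {init() {}, scan() {}},
    transitions: {init: {scan: 1}, scan: {scan: 1}},
});
console.log(checked.startState); // init
```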
Defining states
It's best to also declare the states within their own closure so as not to interfere with the scope of $config. Each state function takes two arguments: the db object and the collection name. Keep in mind for later that this db and collection are the only ones you are guaranteed to "own" when asserting. Try to make each state a discrete unit of work that can stand alone without the other states. Additionally, try to give each state function a name rather than defining it anonymously; this makes backtraces easier to read when things go wrong.
$config = (function () {
    /* ... */
    var states = (function () {
        function getRand() {
            return Random.randInt(10);
        }
        function init(db, collName) {
            this.start = getRand() * this.tid;
        }
        function scanGT(db, collName) {
            db[collName].find({_id: {$gt: this.start}}).itcount();
        }
        function scanLTE(db, collName) {
            db[collName].find({_id: {$lte: this.start}}).itcount();
        }
        return {
            init: init,
            scanGT: scanGT,
            scanLTE: scanLTE,
        };
    })();
    /* ... */
    return {
        /* ... */
        states: states,
        /* ... */
    };
})();
Defining transitions
The transitions object defines the probabilities of moving from one state to the next, which may be the same state again. When a state's function finishes executing, the FSM randomly chooses the next state using the probabilities provided in the transitions object. The probabilities in each row of the transitions object do not need to sum to 1.0, because they are normalized when the next state is chosen. A separate closure is not necessary here. In the example below, the init state has an equal probability of moving to either of the scan states, and each scan state tends to repeat itself:
$config = (function () {
    /* ... */
    var transitions = {
        init: {scanGT: 0.5, scanLTE: 0.5},
        scanGT: {scanGT: 0.8, scanLTE: 0.2},
        scanLTE: {scanGT: 0.2, scanLTE: 0.8},
    };
    /* ... */
    return {
        /* ... */
        transitions: transitions,
        /* ... */
    };
})();
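The selection rule described above amounts to a weighted random choice over one row of the table. This sketch is an illustration, not the runner's implementation; chooseNextState is a hypothetical name, and its rand parameter (defaulting to Math.random) is injectable only so that the examples are deterministic.

```javascript
// Weighted choice over one row of a transitions table. Weights are divided
// by their sum, so {a: 2, b: 6} behaves exactly like {a: 0.25, b: 0.75}.
function chooseNextState(row, rand = Math.random) {
    const entries = Object.entries(row);
    const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
    let r = rand() * total; // uniform in [0, total)
    for (const [state, weight] of entries) {
        if (r < weight) {
            return state;
        }
        r -= weight;
    }
    // Only reachable through floating-point rounding; fall back to the last state.
    return entries[entries.length - 1][0];
}

const transitions = {
    init: {scanGT: 0.5, scanLTE: 0.5},
    scanGT: {scanGT: 0.8, scanLTE: 0.2},
    scanLTE: {scanGT: 0.2, scanLTE: 0.8},
};
console.log(chooseNextState(transitions.scanGT, () => 0.1)); // scanGT
console.log(chooseNextState(transitions.scanGT, () => 0.9)); // scanLTE
// Weights need not sum to 1: 0.3 falls past the first quarter, so "b" is chosen.
console.log(chooseNextState({a: 2, b: 6}, () => 0.3)); // b
```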
Setup and teardown functions
The setup and teardown functions are special in that they are only executed in
one thread. See the Execution modes section for more information about when
they run relative to other workloads in the various modes. The setup and
teardown functions take three arguments: db, collName, and cluster. The setup
function (and the corresponding teardown) should perform most of the
initialization your workload needs, for example setting parameters on the
server, adding seed data, or creating indexes. Note that rather than running
adminCommands (and similar commands) against the provided db, you should use
the provided cluster.executeOnMongodNodes and cluster.executeOnMongosNodes
functions.
$config = (function () {
    /* ... */
    function setup(db, collName, cluster) {
        // Workloads should NOT drop the collection db[collName], as doing so
        // is handled by jstests/concurrency/fsm_libs/runner.js before 'setup' is called.
        for (var i = 0; i < 1000; ++i) {
            db[collName].insert({_id: i});
        }
        // Change a server parameter on every mongod in the cluster.
        cluster.executeOnMongodNodes(function (db) {
            db.adminCommand({
                setParameter: 1,
                internalQueryExecYieldIterations: 5,
            });
        });
        // Print the command-line options of every mongos.
        cluster.executeOnMongosNodes(function (db) {
            printjson(db.serverCmdLineOpts());
        });
    }
    function teardown(db, collName, cluster) {
        // Reset the parameter that setup changed.
        cluster.executeOnMongodNodes(function (db) {
            db.adminCommand({
                setParameter: 1,
                internalQueryExecYieldIterations: 128,
            });
        });
    }
    /* ... */
    return {
        /* ... */
        setup: setup,
        teardown: teardown,
        /* ... */
    };
})();
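The example relies on executeOnMongodNodes and executeOnMongosNodes calling the supplied function once per node with that node's db. The stand-in below illustrates that contract without a running cluster; MockCluster and makeFakeDb are hypothetical, and the real cluster object comes from jstests/concurrency/fsm_libs/cluster.js.

```javascript
// Hypothetical stand-in for the cluster object passed to setup/teardown.
class MockCluster {
    constructor(mongodDbs, mongosDbs) {
        this.mongodDbs = mongodDbs;
        this.mongosDbs = mongosDbs;
    }
    // Call fn once for each mongod, passing that node's db.
    executeOnMongodNodes(fn) {
        this.mongodDbs.forEach((db) => fn(db));
    }
    // Call fn once for each mongos, passing that node's db.
    executeOnMongosNodes(fn) {
        this.mongosDbs.forEach((db) => fn(db));
    }
}

// Fake db whose adminCommand only records the commands it receives.
function makeFakeDb(name) {
    return {
        name: name,
        commands: [],
        adminCommand(cmd) {
            this.commands.push(cmd);
            return {ok: 1};
        },
    };
}

const shards = [makeFakeDb("shard0"), makeFakeDb("shard1")];
const routers = [makeFakeDb("mongos0")];
const cluster = new MockCluster(shards, routers);
cluster.executeOnMongodNodes(function (db) {
    db.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 5});
});
console.log(shards.map((db) => db.commands.length)); // [ 1, 1 ]
console.log(routers[0].commands.length); // 0
```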
The data object
The data object preserves information between different states of an FSM within
an individual thread. Within a single state, the data object becomes the 'this'
context in which the state executes. Additionally, the runner adds a tid
attribute to data so that each thread can access a unique ID. Data is usually
declared above the states inside the config closure, but listed after them in
the returned object. Data is also available as the 'this' context in the setup
and teardown functions. Note that once the FSM begins, the data that was passed
to the setup function is copied into each thread, so each thread has its own
copy and modifications made by the threads are not passed back to the teardown
function; teardown sees only the changes made in setup. Additionally, in
composition, each workload has its own data, so you don't have to worry about
properties being overridden by workloads other than the current one.
$config = (function () {
    var data = {
        start: 0,
    };
    /* ... */
    return {
        /* ... */
        data: data,
        /* ... */
    };
})();
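The copying rules above can be illustrated with plain objects. In this sketch, a JSON round-trip stands in for however the runner actually copies data into each thread (an assumption, and one that only works for plain, serializable values), and the three "threads" run in sequence.

```javascript
// setup's changes reach every thread; each thread's changes stay local.
const data = {start: 0};

function setup() {
    this.seeded = true; // made before the copy, so every thread sees it
}
function init() {
    this.start = 10 * this.tid; // only changes this thread's copy
}

setup.call(data);
const threadContexts = [0, 1, 2].map((tid) => {
    const copy = JSON.parse(JSON.stringify(data)); // independent deep copy
    copy.tid = tid; // the runner adds a unique tid per thread
    init.call(copy);
    return copy;
});

console.log(threadContexts.map((ctx) => ctx.start)); // [ 0, 10, 20 ]
// What teardown would see: setup's change, but none of the threads' changes.
console.log(data.start, data.seeded); // 0 true
```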
Other properties of $config
threadCount
threadCount is the number of threads that will be used to run your workload in serial and parallel modes. In both modes, the number of threads you provide will execute the FSM simultaneously, cycling through the different states of the workload. Note that in serial mode, no threads other than those belonging to this workload will be running, whereas in parallel mode, other workloads will also be given threads to execute their FSMs. In some cases in parallel mode, this number will be scaled down so that all workloads fit within the number of threads available, due to system or performance constraints.
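The text does not say exactly how the thread counts are scaled down. A plausible rule, shown here purely as an assumption, scales every workload's threadCount by the same factor and keeps at least one thread per workload; scaleThreadCounts is a hypothetical helper.

```javascript
// Hypothetical proportional scale-down of per-workload thread counts.
// Note: the one-thread floor can push the total above maxThreads when many
// workloads run together; a real runner may handle that case differently.
function scaleThreadCounts(threadCounts, maxThreads) {
    const total = threadCounts.reduce((a, b) => a + b, 0);
    if (total <= maxThreads) {
        return threadCounts.slice(); // everything fits; nothing to do
    }
    const factor = maxThreads / total;
    return threadCounts.map((n) => Math.max(1, Math.floor(n * factor)));
}

console.log(scaleThreadCounts([10, 20, 30], 100)); // [ 10, 20, 30 ]
console.log(scaleThreadCounts([10, 20, 30], 30)); // [ 5, 10, 15 ]
```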
iterations
This is the number of states each thread's FSM will go through before exiting. NOTE: it is not the number of times each state will be executed.
startState (optional)
The default value is 'init'. If your workload does not have an init state, then you must specify the state in which to begin.
Workload helpers
jstests/concurrency/fsm_workload_helpers contains a few files that you can
include using 'load' at the top of a workload. These provide auxiliary
functionality that some workloads may need. The most important of these is
probably server_types.js.
server_types.js
This helper file contains four functions: isMongos, isMongod, isMMAPv1, and isWiredTiger. These can be used to restrict operations to the functionality available in sharded environments or with a particular storage engine, and they work as you would expect. Note that before calling either isMMAPv1 or isWiredTiger, you should first verify isMongod. When special-casing functionality for sharded environments or storage engines, try to special-case only the exceptional behavior while still leaving assertions in place for both cases.
indexed_noindex.js
This helper can be used along with inheritance to create a workload that is exactly the same as an existing workload, but with the index created during setup removed. To use it, pass indexedNoindex as the function argument to extendWorkload. Additionally, ensure that the workload you are extending has a function in its data object called "getIndexSpec" that returns the spec of the index to be removed.
import {extendWorkload} from "jstests/concurrency/fsm_libs/extend_workload.js";
load("jstests/concurrency/fsm_workload_modifiers/indexed_noindex.js"); // for indexedNoindex
import {$config as $baseConfig} from "jstests/concurrency/fsm_workloads/workload_with_index.js";
export const $config = extendWorkload($baseConfig, indexedNoindex);
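Conceptually, extendWorkload hands a copy of the base config to a modifier function, which rewrites it while keeping the original reachable as $super. The sketch below mimics that pattern with hypothetical extendWorkloadSketch and indexedNoindexSketch functions and a fake collection; it is not the library's code, which may copy more deeply and check command results.

```javascript
// Copy the base config, let the modifier rewrite the copy, keep the base as $super.
function extendWorkloadSketch(baseConfig, modifier) {
    const $super = Object.assign({}, baseConfig);
    const copy = Object.assign({}, baseConfig, {
        data: Object.assign({}, baseConfig.data),
    });
    return modifier(copy, $super);
}

// In the spirit of indexedNoindex: run the base setup, then drop the index
// described by this.getIndexSpec() (`this` is the workload's data object).
function indexedNoindexSketch($config, $super) {
    $config.setup = function (db, collName, cluster) {
        $super.setup.call(this, db, collName, cluster);
        db[collName].dropIndex(this.getIndexSpec());
    };
    return $config;
}

// Fake collection that tracks index specs as JSON strings.
const indexes = new Set();
const db = {
    coll: {
        createIndex(spec) { indexes.add(JSON.stringify(spec)); },
        dropIndex(spec) { indexes.delete(JSON.stringify(spec)); },
    },
};

// A base workload whose setup creates an index.
const $baseConfig = {
    data: {getIndexSpec() { return {x: 1}; }},
    setup(db, collName) { db[collName].createIndex(this.getIndexSpec()); },
};

const $config = extendWorkloadSketch($baseConfig, indexedNoindexSketch);
$config.setup.call($config.data, db, "coll");
console.log(indexes.size); // 0: the base setup created the index, then it was dropped
```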
drop_utils.js
These helpers provide safe methods for dropping collections, databases, roles, and users created during a workload's execution. The methods take a regular expression that the collection, database, role, or user name must match for it to be dropped. It is a good idea to prefix the names of anything you create in these categories with your workload's name: workload file names can be assumed to be unique, so the pattern will only affect items your workload created.
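The name-matching rule can be sketched on its own. namesToDrop and the example workload prefix are hypothetical; the real drop_utils.js helpers also issue the actual drop commands, which this sketch leaves out.

```javascript
// Keep only the names that match the workload's pattern.
function namesToDrop(existingNames, pattern) {
    // No "g" flag on the pattern, so RegExp.prototype.test keeps no state.
    return existingNames.filter((name) => pattern.test(name));
}

// Prefix everything the workload creates with its (unique) file name.
const prefix = "create_collection"; // hypothetical workload name
const pattern = new RegExp("^" + prefix + "_");
const collections = ["create_collection_0", "create_collection_1", "other_workload_0"];
console.log(namesToDrop(collections, pattern)); // [ 'create_collection_0', 'create_collection_1' ]
```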
Test runners
By default, all runners below are allowed to open a maximum of
maxAllowedConnections (100 by default) explicit connections. In replicated
and sharded environments, implicit connections are created to the original
mongod provided to the mongo shell executing the runner (one for each thread).
This behavior cannot be controlled, but it highlights the importance of always
using the db object provided in the FSM states rather than the global db which
will always correspond to the mongod the mongo shell initially connected to.
Execution modes
Serial
Serial is the simpler of the two modes and works as explained above. Setup runs single-threaded, data is copied into the multiple threads where the states are executed, and once all the threads have finished, the teardown function runs and the runner moves on to the next workload.
Parallel (Simultaneous)
In parallel or simultaneous mode (the naming has been slightly inconsistent), the ordering is a little different. All workloads have their setup functions run, then threads are spawned for each workload, and once all the threads complete, every workload has its teardown function run.
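The difference in ordering between the two modes can be shown by logging each phase. In this sketch, runSerially and runInParallel are hypothetical names, and all of a workload's thread execution is collapsed into a single log entry instead of real threads.

```javascript
// Serial: each workload's setup, threads, and teardown finish before the next starts.
function runSerially(workloads, log) {
    for (const w of workloads) {
        log.push("setup:" + w);
        log.push("threads:" + w);
        log.push("teardown:" + w);
    }
}

// Parallel: all setups, then every workload's threads together, then all teardowns.
function runInParallel(workloads, log) {
    workloads.forEach((w) => log.push("setup:" + w));
    log.push("threads:" + workloads.join("+"));
    workloads.forEach((w) => log.push("teardown:" + w));
}

const serialLog = [];
runSerially(["A", "B"], serialLog);
console.log(serialLog.join(" "));
// setup:A threads:A teardown:A setup:B threads:B teardown:B

const parallelLog = [];
runInParallel(["A", "B"], parallelLog);
console.log(parallelLog.join(" "));
// setup:A setup:B threads:A+B teardown:A teardown:B
```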
Existing runners
The existing runners all use jstests/concurrency/fsm_libs/runner.js to
actually execute the workloads. Most information about the arguments and the
available runWorkloads methods can be found by inspecting the source. The
existing runners are explained below. The first argument to the runWorkloads
methods (each of which corresponds to a different run mode) is an array of
workload files to run. clusterOptions, the second argument to the runWorkloads
functions, is explained in the Cluster part of the "Other components of the FSM
library" section below. The third argument, the execution options for the
runWorkloads functions, can contain the following options (some depend on the
run mode):
- numSubsets: not available in serial mode; determines how many subsets of workloads to execute in parallel mode
- subsetSize: not available in serial mode; determines how large each subset of workloads is
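The text does not specify how workloads are assigned to subsets. One plausible approach, shown here as an assumption, draws each subset of subsetSize distinct workloads at random with a Fisher-Yates shuffle; the workloads within a subset would then be started together. makeSubsets is a hypothetical helper.

```javascript
// Hypothetical subset construction for parallel mode.
function makeSubsets(workloads, numSubsets, subsetSize, rand = Math.random) {
    const subsets = [];
    for (let i = 0; i < numSubsets; i++) {
        // Fisher-Yates shuffle of a copy, then take the first subsetSize items.
        const pool = workloads.slice();
        for (let j = pool.length - 1; j > 0; j--) {
            const k = Math.floor(rand() * (j + 1));
            [pool[j], pool[k]] = [pool[k], pool[j]];
        }
        subsets.push(pool.slice(0, Math.min(subsetSize, pool.length)));
    }
    return subsets;
}

const subsets = makeSubsets(["a.js", "b.js", "c.js", "d.js"], 3, 2);
console.log(subsets.length, subsets.every((s) => s.length === 2)); // 3 true
```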
fsm_all.js
Runs all workloads serially. For each workload, $config.threadCount threads
are spawned and each thread runs for exactly $config.iterations steps, starting
at $config.startState and transitioning to other states based on the
transition probabilities defined in $config.transitions.
fsm_all_simultaneous.js
options: numSubsets, subsetSize
Runs numSubsets subsets of size subsetSize of all workloads. The workloads in
each subset are started in parallel, and each workload is run according to the
settings in its $config.
fsm_all_replication.js
Sets up a replica set (with 3 mongods by default) and runs workloads serially or in parallel. For example,
runWorkloadsSerially([<workload1>, <workload2>, ...], { replication: true } )
creates a replica set with 3 members and runs some workloads serially on the primary.
fsm_all_sharded.js
Sets up a sharded cluster (with 2 shards and 1 mongos by default) and runs workloads serially or in parallel. For example,
runWorkloadsInParallel([<workload1>, <workload2>, ...], { sharded: true } )
creates a sharded cluster and runs workloads in parallel.
fsm_all_sharded_replication.js
Sets up a sharded cluster (with 2 shards, each having 3 replica set members, and 1 mongos by default) and runs workloads serially or in parallel.
Excluding a workload
If any workloads fail because of known bugs in MongoDB, persistent MCI failures or timeouts, the troublesome workload can be excluded from running by placing it in the exclusion array in the corresponding runner. Please remember to place a comment next to the excluded workload name identifying the reason a workload is being excluded. For example,
'agg_sort_external.js', // SERVER-16700 Deadlock on WiredTiger LSM
Each runner's exclusion list should also have two predefined sections: one for known bugs and one for restrictions. The exclusion above would be considered a known bug. However, excluding a compact workload from the sharded runners would be a restriction, because compact can only be run against individual mongods.
Other components of the FSM library
Most of these components live in jstests/concurrency/fsm_libs and provide the functionality used by the runner.
ThreadManager
Responsible for spawning and joining worker threads. Each spawned thread is
wrapped in a try/finally block to ensure that the database connection implicitly
created during the thread's execution is eventually closed explicitly. Before
executing each workload, the ThreadManager sets a random seed drawn with
randInt(1e13), i.e. from [0, 1e13), which covers the range of values returned by
new Date().getTime().
Worker Thread
Thread spawned by ThreadManager and used to run a Finite State Machine.
Cluster
cluster.js is responsible for providing the cluster object that is passed to
the setup and teardown functions, as well as the initial db connection that the
runner passes to the workloads. For anything other than a standalone, it makes
use of the shell's built-in cluster test helpers, ShardingTest and ReplSetTest.
clusterOptions are passed to cluster.js for initialization.
clusterOptions include:
- replication: boolean, whether or not to use replication in the cluster
- sameCollection: boolean, whether or not all workloads are passed the same collection
- sameDB: boolean, whether or not all workloads are passed the same DB
- setupFunctions: object, containing at most two functions under the keys 'mongod' and 'mongos'. This allows you to run a function against all mongod or mongos nodes in the cluster as part of the cluster initialization. Each function takes a single argument, the db object against which configuration can be run (it is set for each mongod/mongos)
- sharded: boolean, whether or not to use sharding in the cluster
Note that sameCollection and sameDB can increase contention for a resource, but they also weaken the assertions a workload can make: sameDB rules out OWN_DB assertions, and sameCollection rules out both OWN_COLL and OWN_DB assertions.
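The effect of sameDB and sameCollection on the assertion levels can be written as a small function. The options object mirrors the list above with example values, and strongestAssertLevel is a hypothetical helper; the setupFunctions entry is only defined here, never run.

```javascript
// Example clusterOptions (values chosen for illustration).
const clusterOptions = {
    replication: true,
    sharded: false,
    sameDB: true,
    sameCollection: false,
    setupFunctions: {
        // Would run against each mongod during cluster initialization.
        mongod: function (db) {
            db.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 5});
        },
    },
};

// Strongest assertion level a workload can still rely on.
function strongestAssertLevel(options) {
    if (options.sameCollection) {
        return "ALWAYS"; // another workload may touch my collection
    }
    if (options.sameDB) {
        return "OWN_COLL"; // my collection is mine, but the database is shared
    }
    return "OWN_DB";
}

console.log(strongestAssertLevel(clusterOptions)); // OWN_COLL
```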
Miscellaneous Execution Notes
- A CountDownLatch (exposed through the v8-based mongo shell, as of MongoDB 3.0) is used by the ThreadManager as a synchronization primitive to wait until all threads have been spawned before starting workload execution.
- If more than 20% of the threads fail while spawning, we abort the test. If fewer than 20% of the threads fail while spawning, we allow the non-failed threads to continue with the test. The 20% threshold is somewhat arbitrary; the goal is to abort if "mostly all" of the threads failed but to tolerate "a few" threads failing.
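The spawn-failure rule reduces to a ratio check. In this sketch, shouldAbortAfterSpawn is a hypothetical helper, and treating exactly 20% as "continue" is an assumption, since the text only covers "more than" and "fewer than" 20%.

```javascript
// Abort only when strictly more than 20% of the threads failed to spawn.
function shouldAbortAfterSpawn(numFailed, numThreads) {
    return numFailed / numThreads > 0.2;
}

console.log(shouldAbortAfterSpawn(1, 10)); // false: 10% failed, continue
console.log(shouldAbortAfterSpawn(3, 10)); // true: 30% failed, abort
```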