## Running the core analyzer
There are two main ways of running the core analyzer.
- Running the core analyzer with local core dumps and binaries.
- Running the core analyzer with core dumps and binaries from an Evergreen task. Note that some analysis might fail if you are not on the same AMI (Amazon Machine Image) that the task was run on.
To run the core analyzer with local core dumps and binaries:

```
python3 buildscripts/resmoke.py core-analyzer
```
This will look for binaries in the `build/install` directory and for core dumps in the current directory. If your local environment is different, you can include `--install-dir` and `--core-dir` in your invocation to specify other locations.
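For example, with hypothetical paths:

```
python3 buildscripts/resmoke.py core-analyzer --install-dir=/path/to/install --core-dir=/path/to/core/dumps
```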
To run the core analyzer with core dumps and binaries from an Evergreen task:

```
python3 buildscripts/resmoke.py core-analyzer --task-id={task_id}
```
This will download all of the core dumps and binaries from the task and put them into the configured `--working-dir`, which defaults to the `core-analyzer` directory. All of the task analysis will be added to the `analysis` directory inside the configured `--working-dir`.
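For example, to keep everything under a directory of your choosing (the directory name here is made up):

```
python3 buildscripts/resmoke.py core-analyzer --task-id={task_id} --working-dir=my_core_analysis
```

The analysis would then end up in `my_core_analysis/analysis`.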
Note: The core analyzer currently only runs on Linux. Windows still uses the legacy hang analyzer and will be switched over when we run into issues or have time to do the transition. We have not tackled the problem of getting core dumps on macOS, so there is no core dump analysis on that operating system.
## Getting core dumps
```mermaid
sequenceDiagram
    Task Timed Out ->> Hang Analyzer: Scan all python processes<br/>for resmoke process
    Hang Analyzer ->> Resmoke: Signal resmoke to archive data<br/>files and take core dumps
    Resmoke ->> Hang Analyzer: Report resmoke pids to hang<br/>analyzer to take core dumps of
    Hang Analyzer ->> Core Dumps: Attach to pid and generate core dumps
```
When a task times out, it hits the timeout section in the defined Evergreen config. In this timeout section, we run this task, which runs the hang analyzer with the following invocation:

```
python3 buildscripts/resmoke.py hang-analyzer -o file -o stdout -m exact -p python
```
This tells the hang analyzer to look for all of the python processes on the machine (we are specifically looking for resmoke) and to signal them. When resmoke is signaled, it invokes the hang analyzer again with the specific pids of its child processes. Most of the time, the invocation will look similar to this:
```
python3 buildscripts/resmoke.py hang-analyzer -o file -o stdout -k -c -d pid1,pid2,pid3
```
The things to note here are `-k`, which kills the processes, and `-c`, which takes core dumps.
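For illustration, here is a minimal sketch of the scan-and-signal idea using `psutil` (an assumption; the real implementation lives in `hang_analyzer.py` and `process_list.py` and matches processes more carefully):

```python
import signal

import psutil  # assumption: illustrative only, not necessarily what hang_analyzer.py uses


def signal_python_processes(sig: signal.Signals = signal.SIGUSR1) -> None:
    """Send `sig` to every python process on the machine.

    Hypothetical simplification of the `-m exact -p python` scan; the
    signal choice here is an assumption.
    """
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            if "python" in (proc.info["name"] or ""):
                proc.send_signal(sig)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or belongs to another user; skip it
```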
The resulting core dumps are put into the current working directory.
When a task fails normally, core dumps may also be generated by the Linux kernel and put into the working directory.
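On Linux, whether and where the kernel writes a core dump is governed by the core-size ulimit and the `kernel.core_pattern` setting; you can inspect both with:

```
ulimit -c
cat /proc/sys/kernel/core_pattern
```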
We use a non-standard way of uploading core dumps to Evergreen because of timeout issues we faced when archiving and uploading them normally through Evergreen commands. After investigating, we found that compressing and uploading core dumps was slow for a few reasons:
- Tarring all of the core dumps into one file requires a lot of disk I/O, and disk I/O was the bottleneck.
- Gzip is single-threaded.
- Uploading one big file synchronously is slow.
We made a script that gzips all of the core dumps in parallel and uploads each one to S3 individually and asynchronously, which solved all of the problems listed above.
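A minimal sketch of that approach, assuming `boto3` and hypothetical bucket/prefix names (the real script may differ in client, naming, and error handling):

```python
import gzip
import shutil
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import boto3  # assumption: any S3 client would do


def gzip_core(core_path: Path) -> Path:
    """Compress one core dump; each call runs in its own process."""
    gz_path = core_path.with_name(core_path.name + ".gz")
    with open(core_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def upload_cores(core_paths: list[Path], bucket: str, prefix: str) -> None:
    """Gzip core dumps in parallel and upload each file as soon as it is ready."""
    s3 = boto3.client("s3")
    with ProcessPoolExecutor() as pool:
        for gz_path in pool.map(gzip_core, core_paths):
            s3.upload_file(str(gz_path), bucket, f"{prefix}/{gz_path.name}")
```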
## Generating the core analyzer task
```mermaid
sequenceDiagram
    Task Shut Down ->> Generate Task Script: If core dumps are present,<br/>generate task config
    Generate Task Script ->> Task Shut Down: Write generated task config to disk
    Task Shut Down ->> Generated Task: Use Evergreen command to generate task
    Task Shut Down ->> Core Analyzer Output: Upload temporary text file containing a link to the generated task
    Generated Task ->> Core Analyzer Output: Overwrite output with<br/>core dump analysis
```
In the post task section, we define the Evergreen function used to generate the core analyzer task. This script runs on every task (passing or failing), is independent of anything that happened earlier in the task, and performs all of the checks to ensure it should run. These checks include (a sketch follows the list):
- The task is being run on an operating system supported by the core analyzer.
- The task has core dumps uploaded and attached to it.
- At least one of the binaries uploaded is one we know how to process.
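A rough sketch of those checks (all names here are hypothetical; the real logic likely lives in this directory's `gen_hang_analyzer_tasks.py`):

```python
SUPPORTED_PLATFORMS = {"linux"}  # assumption: mirrors the Linux-only note above

ANALYZABLE_BINARIES = {"mongod", "mongos", "mongo"}  # hypothetical list


def should_generate_core_analyzer_task(platform: str, uploaded_files: list[str]) -> bool:
    """Hypothetical version of the three pre-generation checks."""
    if platform not in SUPPORTED_PLATFORMS:
        return False
    if not any(".core" in name for name in uploaded_files):
        return False  # no core dumps were uploaded and attached to the task
    # At least one uploaded binary must be one we know how to process.
    return any(name in ANALYZABLE_BINARIES for name in uploaded_files)
```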
The output from this script is a JSON file in the format Evergreen expects.
We then pass this JSON file into the `generate.tasks` Evergreen command to generate the task.
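For a sense of the shape, here is a hypothetical payload built in Python (field values are made up; see the Evergreen docs for the full `generate.tasks` format):

```python
import json

# Hypothetical example of the generated config: one new task attached to a
# build variant. generate.tasks accepts top-level "tasks" and "buildvariants"
# keys like these.
generated_config = {
    "tasks": [
        {
            "name": "core_analyzer",
            "commands": [{"func": "run core analyzer"}],
        }
    ],
    "buildvariants": [
        {
            "name": "enterprise-rhel-80-64-bit",  # made-up variant name
            "tasks": [{"name": "core_analyzer"}],
        }
    ],
}

with open("core_analyzer_task.json", "w") as fh:
    json.dump(generated_config, fh, indent=2)
```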
After the task is generated, we have another script that finds the task that was just generated and attaches it to the current task being run.
The reason we upload a temporary file to the original task is to attach that S3 file link to the task. Evergreen does not currently have a way to attach files to a task after it has run, so we need to upload something while the original task is still in progress.