
Database migration

Introduction

In this document, I'll explain how we migrate the NMDC database from conforming to one version of the NMDC Schema to conforming to another.

Overview

We currently use an Extract, Transform, Load (ETL) process to migrate the database.

FAQ: Why not "Transform in place?"

...like Alembic (Python), Active Record (Ruby), Sequelize (JS), etc. do?

At the time we began designing a migration process, some NMDC team members did not feel comfortable with us using a "Transform in place" process. A contributing factor may have been that—at that time—the MongoDB instances used in the Runtime's local development and CI (GHA) test environments did not support the use of transactions.

The decision not to use a "Transform in place" process is one we expect to revisit, now that (a) team members' confidence in the migration process has increased, and (b) the MongoDB instances used in the aforementioned environments now support the use of transactions.

We use Jupyter notebooks to perform the "Extract" and "Load" steps, and to orchestrate the "Transform" step. We use Python scripts to perform the "Transform" step.

The Jupyter notebooks reside in the db/ directory of the nmdc-runtime repository. In general, we try to keep all code that interacts directly with the NMDC database in that repository.

FAQ: Why use a Jupyter notebook?

...as opposed to a Python script/module?

At the time we began designing a migration process, we wanted to micromanage the process (i.e. scrutinize each command—whether shell or Python—and its output in the moment) the first few times we executed it. We thought a Jupyter notebook would make that easier than a CLI script would.

Now that (a) the notebook has remained roughly the same for each recent migration, and (b) we can use transactions in the Runtime's local development and CI (GHA) test environments, we think moving to a Python script/module is within reach. Ultimately, we want to eliminate human intervention from (i.e. automate) the migration process.

The Python scripts [that we use to perform the "Transform" step] reside in the nmdc_schema/migrators/ directory of the nmdc-schema repository. These are typically written by data modelers.
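As a rough illustration of what such a transformation script can look like, here is a minimal, hypothetical migrator: the class name, field names, and the `get_transformations` convention shown here are assumptions for the sake of the sketch, not the actual API defined in the nmdc-schema repository.

```python
from typing import Any, Callable, Dict

# Hypothetical migrator sketch. Real migrators live in nmdc_schema/migrators/
# in the nmdc-schema repository and follow that repository's conventions;
# the names below are invented for illustration.
class Migrator_X_to_Y:
    """Transforms documents from one (hypothetical) schema version to the next."""

    def upgrade_study(self, study: Dict[str, Any]) -> Dict[str, Any]:
        # Example transformation: rename a (hypothetical) field.
        if "principal_investigator_name" in study:
            study["principal_investigator"] = study.pop("principal_investigator_name")
        return study

    def get_transformations(self) -> Dict[str, Callable]:
        # Map each collection name to the function that migrates its documents.
        return {"study_set": self.upgrade_study}


# Usage: the orchestrating notebook would apply each function to every
# document in the corresponding collection of the "transformer" database.
migrator = Migrator_X_to_Y()
doc = {"id": "nmdc:sty-00-000001", "principal_investigator_name": "Ada"}
migrated = migrator.get_transformations()["study_set"](doc)
```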

Process

The migration process looks like this:

Note

When viewing this web page in light mode, some of the text in the diagram below may not be legible. This is due to this issue. As a workaround, you can view the web page in dark mode.

%% This is a Mermaid diagram.
%% Docs: https://docs.mermaidchart.com/mermaid-oss/syntax/sequenceDiagram.html

sequenceDiagram
    actor USER as Administrator

    box transparent Laptop
        participant DB_T as Mongo<br>(transformer)
        participant NB   as Jupyter<br>Notebook
    end

    box transparent Production infrastructure
        participant DB_O    as Mongo<br>(origin)
        participant RUNTIME as Runtime
    end

    activate RUNTIME
    USER ->> RUNTIME: Take offline
    deactivate RUNTIME

    USER ->> NB: Run notebook

    activate NB
    NB ->> DB_O: Revoke access<br>by other users
    NB ->> DB_O: Extract data<br>via mongodump
    DB_O -->> NB: 
    NB ->> DB_T: Load data<br>via mongorestore
    activate DB_T
    NB ->> DB_T: Transform data<br>via Python scripts
    NB ->> DB_T: Validate data<br>via LinkML
    NB ->> DB_T: Extract data<br>via mongodump
    DB_T -->> NB: 
    deactivate DB_T
    Note right of NB: Last chance to<br>abort migration
    NB ->> DB_O: Load data<br>via mongorestore
    NB ->> DB_O: Restore access<br>by other users
    deactivate NB

    USER ->> RUNTIME: Bring online (typically a new version, using the new schema)
    activate RUNTIME
Glossary
  • Mongo: A nickname (alias) for MongoDB.
  • Mongo (origin): The database we are migrating.
  • Mongo (transformer): The database we are using to transform data.

Each Jupyter notebook walks the administrator through the above steps, except for the "Take offline" and "Bring online" steps at the beginning and end of the process.
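The "Extract" and "Load" steps boil down to invoking mongodump and mongorestore against the two databases. The sketch below builds such invocations as argument lists (the flags shown are standard MongoDB Database Tools options, but the exact flags, URIs, and paths our notebook uses are assumptions here, not the project's actual invocation).

```python
import shlex

# Hypothetical sketch of the Extract/Load command construction.
# URIs and paths below are placeholders.

def mongodump_command(uri: str, dump_dir: str) -> list:
    """Build an argv list that dumps a database to a directory, gzipped."""
    return ["mongodump", f"--uri={uri}", f"--out={dump_dir}", "--gzip"]

def mongorestore_command(uri: str, dump_dir: str) -> list:
    """Build an argv list that restores a dump, dropping existing collections first."""
    return ["mongorestore", f"--uri={uri}", f"--dir={dump_dir}", "--gzip", "--drop"]

# Usage: the notebook would pass these lists to subprocess.run(...)
# against the origin and transformer databases, respectively.
origin_uri = "mongodb://localhost:27017"  # placeholder
cmd = mongodump_command(origin_uri, "/tmp/nmdc-dump")
print(shlex.join(cmd))
```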

Those two exceptional steps—which are specific to our hosting environment—are covered in the Runtime release management documentation, located in our internal infrastructure administration documentation repository (named infra-admin).

Appendix

Precursors to this document