Use Python to access the NMDC Runtime API

Introduction

In this tutorial, I'll show you how you can use Python to interact with the NMDC Runtime API.

By the end of this tutorial, you will have:

Accessed several NMDC Runtime API endpoints
Learned how you can discover additional NMDC Runtime API endpoints
Learned how you can contact NMDC team members for help

Getting help

If you have questions about the contents of this notebook, you can post them as GitHub issues in the microbiomedata/nmdc-runtime GitHub repository (that's where this notebook resides). NMDC team members regularly review open issues there. If you don't have a GitHub account, you can email your questions to the NMDC Support Team.

1. Install dependencies

Before you can access the NMDC Runtime API—which runs as an HTTP service—you'll need an HTTP client. A popular HTTP client for Python is called requests. You can install it on your computer by running the following cell:

%pip install requests

Now that the requests package is installed, you can use it to send HTTP requests to HTTP servers. For example, you can run the following cell to submit an HTTP GET request to an example HTTP server:

Note: This example HTTP server is not maintained by the NMDC team. It is a third-party HTTP server you can use to confirm your HTTP client works, independently of the NMDC Runtime.

import requests

# Submit an HTTP GET request to an example HTTP server.
response = requests.get("https://jsonplaceholder.typicode.com/posts/1")

Now that you've submitted the HTTP request, the response variable contains information about the HTTP response the example HTTP server sent back. You can examine it by running the following cells:

# Get the HTTP status code from the response.
response.status_code

# Parse the response body, which is JSON, into a Python object.
response.json()

If the first of those cells outputs the number 200 and the second one outputs a Python dictionary having several keys (including id and title), you are good to go!

💡 Tip: If those cells did not produce that output, try the following: (1) check your Internet connection, (2) visit the URL from the example above in your web browser, (3) review the documentation of the requests package, and (4) restart your Jupyter kernel so it "becomes aware" of recently installed packages (in this case, the requests package).
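
The status check and JSON parsing shown above can be wrapped into one small helper. This is only a sketch, not part of the original notebook: get_json is a hypothetical name, and the http argument is assumed to be the requests module itself (or a requests.Session), so that http.get(url, timeout=...) behaves like requests.get.

```python
def get_json(http, url, timeout=10.0):
    """Send an HTTP GET request and return the parsed JSON body.

    `http` is assumed to be the `requests` module or a `requests.Session`.
    """
    # A timeout keeps the cell from hanging indefinitely on a bad connection.
    response = http.get(url, timeout=timeout)

    # Raise an exception for 4xx/5xx responses instead of silently continuing.
    response.raise_for_status()

    # Decode the JSON body into Python objects (dicts, lists, strings, ...).
    return response.json()
```

With requests imported, calling get_json(requests, "https://jsonplaceholder.typicode.com/posts/1") should return the same dictionary as the cell above, and it raises an exception if the server reports an error.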

Now that you can access an HTTP server, let's access the NMDC Runtime API.

2. Access an NMDC Runtime API endpoint

The NMDC Runtime API has a variety of API endpoints that you can send HTTP requests to.

💡 Tip: The full list of API endpoints is available in the NMDC Runtime API's API documentation.

One of the API endpoints that I like to send HTTP requests to is /studies. That API endpoint responds with a list of all the studies that exist in the NMDC database!

You can run the following cell to send an HTTP GET request to that API endpoint:

Note: The HTTP response the server sends back will be stored in the response variable.

response = requests.get("https://api.microbiomedata.org/studies")

Now that you have received an HTTP response from the endpoint, you can examine it like before. You can see the JSON data—in this case, a list of studies—by running the code in this cell:

response.json()

Whoa! That's a lot of output. Let's break it down.

💡 Tip: In the API documentation for the /studies API endpoint, the "Responses" section contains an example response from the API endpoint, as well as a generic schema that all of the API endpoint's responses will conform to. You can use both of those things to make sense of the API endpoint's response.

Given that—for this API endpoint—response.json() returns a Python dictionary, you can run the following cell to see the dictionary's top-level keys:

response.json().keys()

The meta item contains data about the response, such as pagination parameters and search filter criteria.

The results item contains the requested data—in this case, a list of studies.

You can ignore the group_by item. According to the NMDC Runtime's API documentation, group_by is not implemented yet.

Let's examine the meta item:

response.json()["meta"]

According to the meta item, there are 32 studies in the database.

Note: At the time of this writing, there are 32. When you run the cell, you may see a different number as the database is constantly changing.

Let's count the studies we received in the results list:

len(response.json()["results"])

The results list contains only 25 studies—as opposed to 32. That's because this endpoint uses pagination, and the default page size happens to be 25.
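
To make the arithmetic concrete, here is a small sketch that computes how many pages a result set spans. The function name pages_needed is hypothetical; the numbers 32 and 25 are the ones reported in this tutorial and may differ when you run it.

```python
import math


def pages_needed(total_items, per_page):
    """Return how many pages are needed to hold `total_items`."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_items / per_page)


# With the numbers from this tutorial: 32 studies, 25 per page.
print(pages_needed(32, 25))  # prints 2
```

In other words, with the default page size, the remaining 7 studies sit on a second page.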

You can customize the page size like this:

# Resend the same HTTP request, but include a higher page size than the default of 25.
response = requests.get("https://api.microbiomedata.org/studies?per_page=100")

# Count the studies in the `results` list.
len(response.json()["results"])

There they are!

You can use the per_page parameter to customize the number of items you want to receive per HTTP response.
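
Raising per_page works while the data set is small. For larger collections you may prefer to walk through the pages one at a time. The sketch below shows the general pattern under two assumptions that this tutorial does not confirm: that the endpoint accepts a 1-based page parameter alongside per_page, and that a page shorter than per_page marks the end. The names iter_results and fetch_page are hypothetical; check the API documentation for the parameters the endpoint actually supports.

```python
def iter_results(fetch_page, per_page=25):
    """Yield every item from a paginated endpoint, one page at a time.

    `fetch_page(page, per_page)` is a caller-supplied function that returns
    the decoded JSON payload (a dict with a "results" list) for that page.
    """
    page = 1
    while True:
        payload = fetch_page(page, per_page)
        results = payload.get("results", [])
        yield from results
        # A short (or empty) page means there is nothing left to fetch.
        if len(results) < per_page:
            break
        page += 1
```

Assuming the page parameter exists, fetch_page could be a function that calls requests.get("https://api.microbiomedata.org/studies", params={"page": page, "per_page": per_page}) and returns its .json() result.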

You can use other parameters to customize the response in other ways, too. For example, you can run the following cell to request only studies whose ecosystem_category value is Aquatic, request that the API response contain at most 2 studies, and request that they be sorted by name.

response = requests.get("https://api.microbiomedata.org/studies?filter=ecosystem_category:Aquatic&per_page=2&sort_by=name")

# Print the number of studies in the response.
print(len(response.json()["results"]))

# Print their names in the order in which they appear in the response.
for study in response.json()["results"]:
    print(study["name"])
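
Writing long query strings by hand is error-prone. As a sketch, the same URL can be assembled from a dictionary of parameters with Python's standard library. The helper name build_studies_url is hypothetical; the parameter names (filter, per_page, sort_by) are the ones used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.microbiomedata.org/studies"


def build_studies_url(**params):
    """Return the /studies URL with `params` encoded as a query string."""
    if not params:
        return BASE_URL
    return f"{BASE_URL}?{urlencode(params)}"


url = build_studies_url(
    filter="ecosystem_category:Aquatic",
    per_page=2,
    sort_by="name",
)
print(url)
```

The printed URL contains %3A in place of the colon; that is the percent-encoded form of the same character. The requests package can also do this encoding for you if you pass the dictionary as requests.get(BASE_URL, params={...}).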

Congratulations! You've used a Python notebook to retrieve data residing in the NMDC database, via the NMDC Runtime API. 🎉

3. Access a private NMDC Runtime API endpoint

In the previous section, you accessed an API endpoint that did not require authentication. In this tutorial, I'll refer to such an API endpoint as a "public" API endpoint. Indeed, most of the NMDC Runtime's API endpoints are "public."

However, there are some API endpoints that do require authentication; for example, API endpoints that can be used to modify existing data or perform resource-intensive operations. In this tutorial, I'll refer to those API endpoints as "private" API endpoints.

💡 Tip: You can tell whether an API endpoint is "public" or "private" by checking whether there is a padlock icon next to it in the API documentation. If there is, the API endpoint is "private" (i.e., accessing it requires authentication); otherwise, it is "public" (i.e., accessing it does not require authentication).

In this section, I'll show you how you can access a "private" API endpoint.

The first step is to tell this notebook what your NMDC Runtime username and password are. You can do that by running the cell below, which will prompt you for input:

⚠️ Warning: Storing real usernames and passwords directly in a Python notebook (or in any other form of source code) increases the risk that they will be accidentally committed to a source code repository. That's why I'm using Python's getpass module here, instead of suggesting that you type your username and password directly into the cell.

from getpass import getpass

# Prompt the user for their NMDC Runtime username and password.
username = getpass(prompt="NMDC Runtime username: ")
password = getpass(prompt="NMDC Runtime password: ")

# Display string lengths as a "sanity test."
print(f"Username length: {len(username)}")
print(f"Password length: {len(password)}")

Now that the username and password variables contain your NMDC Runtime username and password, you can exchange those for an NMDC Runtime API access token. You can do that by running this cell:

response = requests.post(
    "https://api.microbiomedata.org/token",
    data={
        "grant_type": "password",
        "username": username,
        "password": password,
    },
)

# Display the response payload, which includes the access token.
response.json()

The API response will contain several properties (you can list them via response.json().keys()). One of them is named access_token. Its value is an access token; i.e., a string you can use to access "private" API endpoints.
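
If the username or password is wrong, the payload presumably will not contain an access_token property at all. The sketch below guards against that case. The helper name extract_access_token is hypothetical, and the exact shape of a failed login response is an assumption, so the helper only checks for the one property this tutorial relies on.

```python
def extract_access_token(payload):
    """Return the access token from a decoded /token response payload."""
    token = payload.get("access_token")
    if not isinstance(token, str) or not token:
        # Avoid echoing the payload here: it may contain sensitive values.
        raise ValueError(
            "Response did not contain an access token; check your credentials."
        )
    return token
```

You would call it as extract_access_token(response.json()) right after the token request.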

I recommend storing that access token in a Python variable for future reference. You can do that by running this cell:

access_token = response.json()["access_token"]

print(f"Access token: {access_token}")

Now that you have an access token, you can use it to access a "private" API endpoint.
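
An access token is sent to the server in an HTTP Authorization header using the Bearer scheme. The two helpers below are a sketch with hypothetical names: bearer_headers builds that header, and mask_token produces a shortened form of the token for when you want to confirm you have one without displaying all of it in the notebook output.

```python
def bearer_headers(access_token):
    """Build the HTTP headers that carry a Bearer access token."""
    return {"Authorization": f"Bearer {access_token}"}


def mask_token(access_token, visible=4):
    """Return a shortened form of the token that is safer to print."""
    if len(access_token) <= visible:
        return "*" * len(access_token)
    return access_token[:visible] + "..."
```

The resulting dictionary can be passed as the headers argument of requests.get or requests.post.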

One of the "private" API endpoints I like to access is called /queries:run. I use it to query the NMDC database in more sophisticated ways than some of the "public" API endpoints allow.

💡 Tip: As with all API endpoints, you can learn about this one by reading the NMDC Runtime's API documentation.

Let's use the "private" /queries:run API endpoint to find all the studies whose ecosystem_category value is Aquatic (just like we did with the "public" /studies API endpoint earlier).

response = requests.post(
    "https://api.microbiomedata.org/queries:run",
    headers={
        "Authorization": f"Bearer {access_token}",
    },
    json={
        "find": "study_set",
        "filter": {"ecosystem_category": "Aquatic"},
    },
)

response.json()

The API response's shape is different from that of the /studies API endpoint. Let's explore this API response. You can get a list of its top-level properties by running the following cell:

response.json().keys()

In the case of the /queries:run API endpoint, the results are in the cursor property. Let's dig into that property. You can see its properties by running the following cell:

response.json()["cursor"].keys()

The studies are in the firstBatch property. You can count them by running this cell:

len(response.json()["cursor"]["firstBatch"])

You can print their names by running this cell:

for study in response.json()["cursor"]["firstBatch"]:
    print(study["name"])
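
The nested lookups above can be collected into one defensive helper. This is a sketch with a hypothetical name, study_names; it relies only on the cursor and firstBatch properties described in this section and tolerates payloads where they are missing.

```python
def study_names(payload):
    """Return the `name` of each study in a decoded /queries:run payload."""
    first_batch = payload.get("cursor", {}).get("firstBatch", [])
    # Use .get() so a study without a name yields None instead of a KeyError.
    return [study.get("name") for study in first_batch]
```

Calling study_names(response.json()) gives you a plain list you can sort, count, or print. The property name firstBatch suggests that larger result sets may arrive in more than one batch; the API documentation is the place to confirm how to retrieve any later batches.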

Congratulations! You've used a Python notebook to retrieve data residing in the NMDC database, via a "private" NMDC Runtime API endpoint. 🎉

Finally, let's see what would have happened if you had visited the same API endpoint without including your access token in the API request. You can do that by running this cell:

response = requests.post(
    "https://api.microbiomedata.org/queries:run",
    json={
        "find": "study_set",
        "filter": {"ecosystem_category": "Aquatic"},
    },
)

response.json()

Since this is a "private" API endpoint, when you access it without specifying an access token, it responds with the message, "Could not validate credentials" (in this case, we didn't give it any credentials to validate).
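
In your own scripts, you may want to detect this situation rather than read the message by eye. The sketch below rests on assumptions this tutorial does not confirm: that the server signals the problem with an HTTP 401 or 403 status code, and that the message sits under a detail key in the JSON payload. The function name explain_failure is hypothetical.

```python
def explain_failure(status_code, payload):
    """Return a short, human-readable explanation of an API response."""
    # Assumption: error payloads carry their message under a "detail" key.
    detail = "no detail provided"
    if isinstance(payload, dict):
        detail = payload.get("detail", detail)

    if status_code in (401, 403):
        return f"Authentication problem ({status_code}): {detail}"
    if status_code >= 400:
        return f"Request failed ({status_code}): {detail}"
    return "Request succeeded"
```

You would call it as explain_failure(response.status_code, response.json()) after the request above.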

Conclusion

In this tutorial, I showed you how you could access a "public" API endpoint, how you could obtain an access token, and how you could use that access token to access a "private" API endpoint. I also showed you how you could explore a few API responses. Finally, I told you where you could find the API documentation, which contains a list of all API endpoints.

Thank you for going through this tutorial. You can continue to explore the API documentation and send API requests to API endpoints you find interesting.

We'd love to know what you think about the NMDC Runtime API and about this tutorial. You can tell us what you think by creating a GitHub issue in the microbiomedata/nmdc-runtime GitHub repository or sending the NMDC Support Team an email at [email protected].