Bioremediation use-case description


The following is a use-case demonstrating the federation of data across SPARQL endpoints and Solid Pods.

The objective of this use-case is to provide a demonstrator showing how data from different SPARQL endpoints can be combined to retrieve complex information: in this specific case, identifying organisms with possible bioremediation potential for the pollutant atrazine.

The queries in this use-case target the following 5 SPARQL endpoints: Wikidata, IDSM, Rhea, UniProt and OMA.

The steps involved in this demonstrator pipeline are the following:

  1. Query data from a Solid Pod to retrieve the CAS registry number of atrazine, the pollutant for which bioremediation is sought in this demonstrator use-case.

  2. Identify chemical compounds that are similar to atrazine. This is done to widen the search for organisms with potential for bioremediation: if an organism can metabolize a structurally similar chemical compound, then it may also be able to metabolize the original pollutant.

    The search is done via the IDSM SPARQL endpoint, which is queried here using the pollutant’s CAS number retrieved in step 1.

    Each retrieved “similar” chemical compound is identified by its ChEBI identifier.

  3. Retrieve metabolic chemical reactions that involve atrazine, or one of its similar chemical compounds.

    This is done using the Rhea service, a database of chemical and transport reactions of biological interest.

  4. Retrieve proteins/enzymes that are involved in the metabolic reactions returned by the Rhea endpoint (step 3). This is done using the UniProt service, the largest available protein database.

  5. Identify the biological organisms with potential for bioremediation. This is done by querying the OMA (Orthologous Matrix) service.



Use-case code and results


Environment setup

library(rlang)

# Load our own SPARQL library and helper functions.
source("../sparqlr.git/sparql.R")
source("utils.R")

# Set endpoints and paths to SPARQL queries used throughout the use-case.
endpoint_wikidata <- "https://query.wikidata.org/sparql"
endpoint_idsm <- "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
endpoint_rhea <- "https://sparql.rhea-db.org/sparql"
endpoint_uniprot <- "https://sparql.uniprot.org/sparql"
endpoint_oma <- "https://sparql.omabrowser.org/sparql"

query_file_wikidata <- "queries/query_1_wikidata.rq"
query_file_idsm <- "queries/query_2_idsm.rq"
query_file_uniprot <- "queries/query_3_uniprot.rq"
query_file_oma <- "queries/query_4_oma.rq"
subquery_file_wikidata <- "queries/subquery_1_wikidata.rq"
subquery_file_idsm <- "queries/subquery_2_idsm.rq"
subquery_file_uniprot <- "queries/subquery_3_uniprot.rq"

pollutant_name <- "atrazine"


Step 1: retrieve the CAS number of atrazine from a Solid Pod

The first step of the use-case demonstrates the retrieval of data from a Solid Pod.

Specifically, the following table, which lists pollutants and their Chemical Abstracts Service (CAS) registry numbers, is retrieved.

# Retrieve data from a Solid Pod.
pollutant_table_source <- paste0(
  "https://triple.ilabt.imec.be/bioremediation-use-case/",
  "input/pollutants_table.ttl"
)
pollutants <- read.table(pollutant_table_source, sep = "\t", header = TRUE)
pollutant_cas_number <- pollutants[
  pollutants$compoundLabel == pollutant_name, "cas_number"
]

List of pollutants retrieved from the Solid Pod and their CAS number
Pollutant           CAS Number  Average LD50
benomyl             17804-35-2  37.0000
benomyl             17804-35-2  37.0000
aldrin              309-00-2    136.6407
ANTU                86-88-4     709.5708
atrazine            1912-24-9   1503.5455
ammonium sulfamate  7773-06-0   2952.4000
amitrole            61-82-5     5333.3333

From the above pollutant list, the use-case will from here on focus on the pollutant atrazine, which is identified by the CAS number 1912-24-9.
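
The lookup above boils down to reading a tab-separated table and filtering it by compound name. The following is a minimal, self-contained sketch of that logic. It assumes the column names `compoundLabel` and `cas_number` used in the code above; the third column name (`avg_ld50`) and the inline sample rows are illustrative assumptions that stand in for the file hosted on the Solid Pod.

```r
# Inline sample standing in for the TSV file served by the Solid Pod.
# The "avg_ld50" column name is an assumption made for this sketch.
sample_tsv <- paste(
  "compoundLabel\tcas_number\tavg_ld50",
  "aldrin\t309-00-2\t136.6407",
  "atrazine\t1912-24-9\t1503.5455",
  "amitrole\t61-82-5\t5333.3333",
  sep = "\n"
)

# Parse the tab-separated text into a data frame.
pollutants_sample <- read.table(
  text = sample_tsv, sep = "\t", header = TRUE, stringsAsFactors = FALSE
)

# Keep the CAS number of the row whose label matches the pollutant of interest.
cas_number_sample <- pollutants_sample[
  pollutants_sample$compoundLabel == "atrazine", "cas_number"
]
print(cas_number_sample)
```

Because CAS numbers contain hyphens, `read.table()` keeps them as character strings, which is what is needed when they are later inserted into a SPARQL query.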


Note: the data in the table downloaded from the Solid Pod was originally retrieved via the following commands:

query_wikidata <- load_query_from_file(query_file_wikidata)
pollutants <- sparql_query(
  endpoint = endpoint_wikidata, query = query_wikidata
)
readr::write_tsv(pollutants, "pollutants_table.ttl")


Step 2: identify chemical substances similar to atrazine

In this second step of the use-case, chemical substances with structural similarity to atrazine are retrieved by running a SPARQL query on the IDSM endpoint.

The query is made using the CAS number, a unique ID for chemical substances assigned by the Chemical Abstracts Service (CAS). It returns a list of similar chemical compounds, each identified by its ChEBI identifier. Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on “small” chemical compounds.

query_idsm <- load_query_from_file(query_file_idsm)

similar_pollutants <- sparql_query(
  endpoint = endpoint_idsm,
  query = replace_values_clause(
    "cas_number",
    as_values_clause(single_quote(pollutant_cas_number), "cas_number"),
    query_idsm
  ),
  use_post = TRUE
)

Here are some of the chemical compounds similar to atrazine returned by the query:

List of chemical substances with structural similarity to atrazine
Chemical compound (ChEBI identifier)  Similarity score
obo:CHEBI_15930                       1.0000000
obo:CHEBI_83790                       0.7575758
obo:CHEBI_83789                       0.7575758
obo:CHEBI_83791                       0.7575758
obo:CHEBI_82227                       0.7142857
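
The query in this step is parameterized by substituting a VALUES clause into a query template. The helpers used above (`as_values_clause()`, `replace_values_clause()`, `single_quote()`) are defined in `utils.R`, which is not shown here. The sketch below illustrates one way such a substitution could work; the functions `sketch_values_clause()` and `sketch_replace_values()` and the toy query template are hypothetical and are not the actual helpers or the actual IDSM query.

```r
# Hypothetical helper: build a SPARQL VALUES clause for a single variable.
sketch_values_clause <- function(values, variable) {
  paste0("VALUES ?", variable, " { ", paste(values, collapse = " "), " }")
}

# Hypothetical helper: swap the existing VALUES clause of `variable` in a
# query string for a new clause.
sketch_replace_values <- function(variable, new_clause, query) {
  pattern <- paste0("VALUES\\s+\\?", variable, "\\s*\\{[^}]*\\}")
  sub(pattern, new_clause, query, perl = TRUE)
}

# Toy query template (not the real IDSM query) with a placeholder value.
query_template <- "SELECT ?compound WHERE {
  VALUES ?cas_number { 'PLACEHOLDER' }
  ?compound ?has_cas ?cas_number .
}"

# Insert the quoted CAS number of atrazine into the template.
clause <- sketch_values_clause("'1912-24-9'", "cas_number")
query_filled <- sketch_replace_values("cas_number", clause, query_template)
cat(query_filled, "\n")
```

Keeping the query text in a separate file and injecting only the VALUES clause means the same query can be reused for another pollutant by changing a single value.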


Step 3: identify enzymes involved in metabolic pathways of atrazine degradation

In order to identify micro-organisms that are potential candidates for the bioremediation of atrazine, a federated SPARQL query is run on the Rhea and UniProt endpoints to retrieve the enzymes/proteins that are involved in the degradation of atrazine or similar chemical compounds (as identified in step 2).

Specifically, this federated query consists of the following two steps:

  1. A sub-query to the Rhea endpoint (via a SERVICE clause) retrieves metabolic reactions involving atrazine or chemical compounds similar to it.
  2. A query to the UniProt endpoint then retrieves the UniProt identifiers of proteins/enzymes that take part in the metabolic reactions identified from the Rhea endpoint.

query_uniprot <- load_query_from_file(query_file_uniprot)

chebi_values <- similar_pollutants |>
  dplyr::pull("similar_compound_chebi") |>
  sort() |>
  unique() |>
  stringr::str_replace(
    "<http://purl.obolibrary.org/obo/CHEBI_(.*)>",
    "CHEBI:\\1"
  )

uniprot_ids <- sparql_query(
  endpoint = endpoint_uniprot,
  query = replace_values_clause(
    "similar_compound_chebi",
    as_values_clause(chebi_values, "similar_compound_chebi"),
    query_uniprot
  ),
  use_post = TRUE
)


Here are some of the enzymes retrieved by the query:

Examples of enzymes involved in metabolic reactions involving atrazine or a structurally similar compound
UniProt ID  Rhea ID   Rhea equation
upk:P72156  rh:11312  atrazine + H2O = hydroxyatrazine + chloride + H(+)
upk:O66908  rh:11348  arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
upk:O66674  rh:11348  arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
upk:O52027  rh:11348  arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
upk:O50593  rh:11348  arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
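
To make the two-step structure of the federated query more concrete, the sketch below assembles a query of that general shape as an R string. It is not the content of `queries/query_3_uniprot.rq`, which is not shown here: the triple patterns are assumptions modelled on the public example queries of the UniProt and Rhea SPARQL endpoints, and the variable names are chosen for illustration only. The sketch also shows the conversion of ChEBI IRIs into prefixed names, done here with base R `sub()` instead of `stringr::str_replace()`.

```r
# Convert full ChEBI IRIs into prefixed names usable in a VALUES clause.
chebi_iris <- c(
  "<http://purl.obolibrary.org/obo/CHEBI_15930>",
  "<http://purl.obolibrary.org/obo/CHEBI_83790>"
)
chebi_curies <- sub(
  "^<http://purl.obolibrary.org/obo/CHEBI_(.*)>$", "CHEBI:\\1", chebi_iris
)

# Assemble the federated query text. The triple patterns below are
# assumptions for illustration, not the query used in this use-case.
federated_query <- paste(
  "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
  "PREFIX rh: <http://rdf.rhea-db.org/>",
  "PREFIX up: <http://purl.uniprot.org/core/>",
  "PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>",
  "SELECT DISTINCT ?uniprot ?rhea ?chebi",
  "WHERE {",
  paste0("  VALUES ?chebi { ", paste(chebi_curies, collapse = " "), " }"),
  "  # Sub-query delegated to the Rhea endpoint.",
  "  SERVICE <https://sparql.rhea-db.org/sparql> {",
  "    ?rhea rdfs:subClassOf rh:Reaction ;",
  "          rh:side/rh:contains/rh:compound/rh:chebi ?chebi .",
  "  }",
  "  # Pattern evaluated by the UniProt endpoint that receives the query.",
  "  ?uniprot up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea .",
  "}",
  sep = "\n"
)
cat(federated_query, "\n")
```

The query is sent to the UniProt endpoint, which forwards the SERVICE block to Rhea and joins the returned reactions with its own protein annotations on the shared `?rhea` variable.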


Step 4: identify organisms with bioremediation potential

In the last step of this illustrative bioremediation use-case, the OMA database is queried to retrieve all organisms that produce at least one of the enzymes/proteins identified as potentially taking part in the metabolic degradation of atrazine.

# Query organisms with potential for bioremediation from the OMA database.
query_oma <- load_query_from_file(query_file_oma, remove_comments = TRUE)

uniprot_values <- uniprot_ids |>
  dplyr::pull("uniprot") |>
  unique() |>
  sort() |>
  stringr::str_replace("<http://purl.uniprot.org/uniprot/(.*)>", "upk:\\1")

oma_taxa <- sparql_query(
  endpoint = endpoint_oma,
  query = replace_values_clause(
    "uniprot",
    as_values_clause(uniprot_values, "uniprot"),
    query_oma
  ),
  use_post = FALSE
)

# Filter the OMA results to keep only bacterial taxa.
# This is done by querying UniProt for the "mnemonic" codes of all bacterial
# taxa present in UniProt, then filtering the query results from OMA to keep
# only those with a bacterial mnemonic code.
query_file_uniprot_mnemonic <- "queries/query_5_uniprot.rq"
query_uniprot_mnemonic <- load_query_from_file(
  query_file_uniprot_mnemonic,
  remove_comments = TRUE
)
bacterial_taxon <- sparql_query(
  endpoint = endpoint_uniprot,
  query = query_uniprot_mnemonic,
  use_post = TRUE
)
taxa_counts <- oma_taxa |>
  dplyr::semi_join(bacterial_taxon, by = "mnemonic") |>
  dplyr::count(mnemonic, taxon_sci_name, name = "enzyme_count") |>
  dplyr::arrange(dplyr::desc(enzyme_count)) |>
  dplyr::select("taxon_sci_name", "mnemonic", "enzyme_count")


This query identifies 829 bacterial taxa with potential for bioremediation of atrazine.

Here are the 10 taxa with the highest number of mappings to enzymes potentially involved in the metabolic degradation of atrazine. A higher count could be interpreted as a higher likelihood of bioremediation potential for atrazine.

Taxa with bioremediation potential for atrazine
UniProt mnemonic  Enzyme count  Taxon name
BURM1             321           Burkholderia multivorans (strain ATCC 17616 / 249)
PARXL             222           Paraburkholderia xenovorans (strain LB400)
ENTCC             217           Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCTC 10005 / WDCM 00083 / NCDC 279-56)
CUPTR             193           Cupriavidus taiwanensis (strain DSM 17343 / BCRC 17206 / CCUG 44338 / CIP 107171 / LMG 19424 / R1)
METEA             192           Methylorubrum extorquens (strain ATCC 14718 / DSM 1338 / JCM 2805 / NCIMB 9133 / AM1)
RALP1             167           Ralstonia pickettii (strain 12D)
BURVG             164           Burkholderia vietnamiensis (strain G4 / LMG 22486)
BURA4             163           Burkholderia ambifaria (strain MC40-6)
BURCJ             163           Burkholderia cenocepacia (strain ATCC BAA-245 / DSM 16553 / LMG 16656 / NCTC 13227 / J2315 / CF5610)
BURCM             163           Burkholderia ambifaria (strain ATCC BAA-244 / AMMD)
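
The filtering and counting logic used to produce this ranking can be illustrated on a toy data set. The sketch below uses base R rather than dplyr so that it runs without extra packages; all identifiers and mnemonic codes in it are made up for illustration, and it assumes that the OMA result holds one row per enzyme-to-taxon mapping.

```r
# Toy stand-in for the OMA result: one row per (enzyme, taxon) mapping.
# All identifiers and mnemonic codes below are made up for illustration.
oma_toy <- data.frame(
  uniprot = c("E1", "E2", "E3", "E1", "E2", "E1"),
  mnemonic = c("BACT1", "BACT1", "BACT1", "BACT2", "BACT2", "EUKA1"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the list of bacterial mnemonic codes from UniProt.
bacteria_toy <- data.frame(
  mnemonic = c("BACT1", "BACT2"),
  stringsAsFactors = FALSE
)

# Equivalent of a semi-join: keep only rows with a bacterial mnemonic code.
oma_bacteria <- oma_toy[oma_toy$mnemonic %in% bacteria_toy$mnemonic, ]

# Count the mappings per taxon, then sort by decreasing count.
counts <- as.data.frame(
  table(mnemonic = oma_bacteria$mnemonic),
  responseName = "enzyme_count",
  stringsAsFactors = FALSE
)
counts <- counts[order(-counts$enzyme_count), ]
print(counts)
```

A semi-join only filters rows and never duplicates them, so the per-taxon counts are not inflated if a mnemonic code happens to appear more than once in the bacterial list.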



Summary network graph

Visual representation of the SPARQL queries run for the first 3 steps of the bioremediation use-case.

The following graph is built using SPARQL CONSTRUCT queries representing the first 3 steps of the bioremediation use-case. The graph visually illustrates the connection between the queries made across 3 different SPARQL endpoints.

As can be seen in the figure below, only the cluster for atrazine is linked across the 3 SPARQL endpoints. The other clusters have no link to a protein/enzyme in the UniProt database.


Note: hovering over a node in the figure below shows more detailed information about that node, when available.


List of prefixes used in the above graph.
Prefix  Expanded prefix
CHEBI   http://purl.obolibrary.org/obo/CHEBI_
cas     https://identifiers.org/cas:
cmp     http://rdf.ncbi.nlm.nih.gov/pubchem/compound/
rh      http://rdf.rhea-db.org/
up      http://purl.uniprot.org/core/
upk     http://purl.uniprot.org/uniprot/
upt     http://purl.uniprot.org/taxonomy/
wd      http://www.wikidata.org/entity/