CHIST-ERA TRIPLE demonstrator use-case in R language

Bioremediation use-case description

The following is a use-case demonstrating the federation of data across SPARQL endpoints and Solid Pods.

The objective of this use-case is to provide a demonstrator showing how data from different SPARQL endpoints can be combined to retrieve complex information: in this specific case, identifying organisms with possible bioremediation potential for the pollutant atrazine.

The queries in this use-case target the following 5 data sources (a Solid Pod and 4 SPARQL endpoints):

A Solid Pod with public data
IDSM
UniProt
Rhea
OMA

The steps involved in this demonstrator pipeline are the following:

1. Query data from a Solid Pod to retrieve the CAS registry number of atrazine, the pollutant for which bioremediation is sought in this demonstrator use-case.
2. Identify chemical compounds that are similar to atrazine. This is done to widen the search for organisms with potential for bioremediation: if an organism can metabolize a closely resembling chemical compound, then it is possible that it can also metabolize the original pollutant.

   The search is done via the IDSM SPARQL endpoint, which is queried with the pollutant’s CAS number retrieved at step 1.

   Each retrieved “similar” chemical compound is identified by its ChEBI identifier.
3. Retrieve metabolic chemical reactions that involve atrazine, or one of its similar chemical compounds.

   This is done using the Rhea service, a database of chemical and transport reactions of biological interest.
4. Retrieve proteins/enzymes that are involved in the metabolic reactions returned by the Rhea endpoint (step 3). This is done using the UniProt service, the largest available protein database.
5. Identify the biological organisms with potential for bioremediation. This is done by querying the OMA (Orthologous Matrix) service.

Use-case code and results

Environment setup

library(rlang)

# Load our own SPARQL library and helper functions.
source("../sparqlr.git/sparql.R")
source("utils.R")

# Set endpoints and paths to SPARQL queries used throughout the use-case.
endpoint_wikidata <- "https://query.wikidata.org/sparql"
endpoint_idsm <- "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
endpoint_rhea <- "https://sparql.rhea-db.org/sparql"
endpoint_uniprot <- "https://sparql.uniprot.org/sparql"
endpoint_oma <- "https://sparql.omabrowser.org/sparql"

query_file_wikidata <- "queries/query_1_wikidata.rq"
query_file_idsm <- "queries/query_2_idsm.rq"
query_file_uniprot <- "queries/query_3_uniprot.rq"
query_file_oma <- "queries/query_4_oma.rq"
subquery_file_wikidata <- "queries/subquery_1_wikidata.rq"
subquery_file_idsm <- "queries/subquery_2_idsm.rq"
subquery_file_uniprot <- "queries/subquery_3_uniprot.rq"

pollutant_name <- "atrazine"

Step 1: retrieve the CAS number of atrazine from a Solid Pod

The first step of the use-case demonstrates the retrieval of data from a Solid Pod.

Specifically, the following table with a list of pollutants and their CAS registry number (Chemical Abstracts Service) is retrieved.

# Retrieve data from a Solid Pod.
pollutant_table_source <- paste0(
  "https://triple.ilabt.imec.be/bioremediation-use-case/",
  "input/pollutants_table.ttl"
)
pollutants <- read.table(pollutant_table_source, sep = "\t", header = TRUE)
pollutant_cas_number <- pollutants[
  pollutants$compoundLabel == pollutant_name, "cas_number"
]

List of pollutants retrieved from the Solid Pod and their CAS number
Pollutant	CAS Number	Average LD50
benomyl	17804-35-2	37.0000
benomyl	17804-35-2	37.0000
aldrin	309-00-2	136.6407
ANTU	86-88-4	709.5708
atrazine	1912-24-9	1503.5455
ammonium sulfamate	7773-06-0	2952.4000
amitrole	61-82-5	5333.3333

From the above pollutant list, the use-case will from here on focus on the pollutant atrazine, which is identified by the CAS number 1912-24-9.
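
The lookup of a single CAS number can be made more defensive than a plain subset, for instance because the table above lists benomyl twice. The following is a minimal sketch on a toy copy of that table, not the code used in this report: the helper `lookup_cas_number()` and the column name `avg_ld50` are hypothetical, while `compoundLabel` and `cas_number` are the column names used in the code above.

```r
# Toy copy of part of the pollutant table (values taken from the table above).
# `avg_ld50` is an assumed column name; it does not appear in the report code.
pollutants_demo <- data.frame(
  compoundLabel = c("benomyl", "benomyl", "aldrin", "ANTU", "atrazine"),
  cas_number = c("17804-35-2", "17804-35-2", "309-00-2", "86-88-4", "1912-24-9"),
  avg_ld50 = c(37.0000, 37.0000, 136.6407, 709.5708, 1503.5455),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the CAS number of one pollutant and fail loudly
# if the name is absent or maps to several distinct CAS numbers.
lookup_cas_number <- function(table, name) {
  matches <- unique(table$cas_number[table$compoundLabel == name])
  if (length(matches) != 1) {
    stop(
      "Expected exactly one CAS number for '", name,
      "', found ", length(matches)
    )
  }
  matches
}

cas_demo <- lookup_cas_number(pollutants_demo, "atrazine")
print(cas_demo)
```

Deduplicating with `unique()` means a repeated row such as the one for benomyl still yields a single value, whereas a misspelled pollutant name stops the pipeline instead of silently passing an empty value to the next query.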

Note: the data in the table downloaded from the Solid Pod was originally retrieved via the following commands:

query_wikidata <- load_query_from_file(query_file_wikidata)
pollutants <- sparql_query(
  endpoint = endpoint_wikidata, query = query_wikidata
)
readr::write_tsv(pollutants, "pollutants_table.ttl")

Step 2: identify chemical substances similar to atrazine

In this second step of the use-case, chemical substances with structural similarity to atrazine are retrieved by running a SPARQL query on the IDSM endpoint.

The query is made using the CAS number, a unique ID for chemical substances assigned by the Chemical Abstracts Service (CAS). It returns a list of similar chemical compounds that are identified by their ChEBI identifier. Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on “small” chemical compounds.

query_idsm <- load_query_from_file(query_file_idsm)

similar_pollutants <- sparql_query(
  endpoint = endpoint_idsm,
  query = replace_values_clause(
    "cas_number",
    as_values_clause(single_quote(pollutant_cas_number), "cas_number"),
    query_idsm
  ),
  use_post = TRUE
)
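
The call above passes the CAS number to the query through a SPARQL VALUES clause. The helpers `replace_values_clause()`, `as_values_clause()` and `single_quote()` come from this project's own `utils.R`, whose implementation is not shown here. The sketch below illustrates one way such helpers could work; the function names `quote_literal()`, `build_values_clause()` and `swap_values_clause()` and the toy query template are hypothetical.

```r
# Hypothetical stand-ins for the report's helpers; the real implementations
# in utils.R may differ.

# Wrap each value in single quotes so it becomes a SPARQL string literal.
quote_literal <- function(x) paste0("'", x, "'")

# Build a clause of the form: VALUES ?variable { value1 value2 ... }
build_values_clause <- function(values, variable) {
  paste0("VALUES ?", variable, " { ", paste(values, collapse = " "), " }")
}

# Replace the existing VALUES clause for `variable` in a query string.
# Assumes the clause contains no nested braces.
swap_values_clause <- function(variable, new_clause, query) {
  pattern <- paste0("VALUES\\s+\\?", variable, "\\s*\\{[^}]*\\}")
  hit <- regexpr(pattern, query, perl = TRUE)
  if (hit == -1) {
    stop("No VALUES clause found for ?", variable)
  }
  # regmatches<- inserts the replacement literally (no backreference parsing).
  regmatches(query, hit) <- new_clause
  query
}

# Toy query template, invented for illustration only.
query_template <- paste(
  "SELECT ?compound WHERE {",
  "  VALUES ?cas_number { 'PLACEHOLDER' }",
  "  ?compound ?has_cas ?cas_number .",
  "}",
  sep = "\n"
)

query_ready <- swap_values_clause(
  "cas_number",
  build_values_clause(quote_literal("1912-24-9"), "cas_number"),
  query_template
)
cat(query_ready, "\n")
```

Keeping a placeholder VALUES clause in the `.rq` file means the query remains valid SPARQL on its own and can be tested directly against the endpoint before being parameterized from R.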

Here are some of the chemical compounds similar to atrazine returned by the query:

List of chemical substances with structural similarity to atrazine
Chemical compound (ChEBI identifier)	Similarity score
obo:CHEBI_15930	1.0000000
obo:CHEBI_83790	0.7575758
obo:CHEBI_83789	0.7575758
obo:CHEBI_83791	0.7575758
obo:CHEBI_82227	0.7142857

Step 3: identify enzymes involved in metabolic pathways of atrazine degradation

In order to identify micro-organisms that are potential candidates for the bioremediation of atrazine, a federated SPARQL query is run on the Rhea and UniProt endpoints to retrieve the enzymes/proteins that are involved in the degradation of atrazine or similar chemical compounds (as identified in step 2).

Specifically, this federated query consists of the following two steps:

A sub-query to the Rhea endpoint (via a SERVICE clause) is used to retrieve metabolic reactions involving atrazine or chemical compounds that are similar to it.
A query to the UniProt endpoint then retrieves the UniProt identifiers of proteins/enzymes that are part of the metabolic reactions identified from the Rhea endpoint.
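
To make the two steps more concrete, the sketch below assembles a federated query of this shape as an R string. It is an illustration only: the actual content of `queries/query_3_uniprot.rq` is not shown in this report, and the property paths used here are assumptions based on the publicly documented Rhea and UniProt RDF vocabularies.

```r
# Hypothetical query text; it is NOT the query stored in
# queries/query_3_uniprot.rq.
rhea_endpoint <- "https://sparql.rhea-db.org/sparql"

federated_query_sketch <- paste(
  "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
  "PREFIX rh: <http://rdf.rhea-db.org/>",
  "PREFIX up: <http://purl.uniprot.org/core/>",
  "PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>",
  "SELECT DISTINCT ?uniprot ?rhea ?equation",
  "WHERE {",
  "  VALUES ?similar_compound_chebi { CHEBI:15930 }",
  "  # Step 1: ask the Rhea endpoint for reactions involving the compound.",
  sprintf("  SERVICE <%s> {", rhea_endpoint),
  "    ?rhea rdfs:subClassOf rh:Reaction ;",
  "          rh:equation ?equation ;",
  "          rh:side/rh:contains/rh:compound/rh:chebi ?similar_compound_chebi .",
  "  }",
  "  # Step 2: find UniProt entries annotated with those reactions.",
  "  ?uniprot up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea .",
  "}",
  sep = "\n"
)
cat(federated_query_sketch, "\n")
```

Because the SERVICE block is evaluated by the Rhea endpoint while the rest of the query runs on UniProt, a single request sent to the UniProt endpoint is enough to combine both data sources.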

query_uniprot <- load_query_from_file(query_file_uniprot)

chebi_values <- similar_pollutants |>
  dplyr::pull("similar_compound_chebi") |>
  sort() |>
  unique() |>
  stringr::str_replace(
    "<http://purl.obolibrary.org/obo/CHEBI_(.*)>",
    "CHEBI:\\1"
  )

uniprot_ids <- sparql_query(
  endpoint = endpoint_uniprot,
  query = replace_values_clause(
    "similar_compound_chebi",
    as_values_clause(chebi_values, "similar_compound_chebi"),
    query_uniprot
  ),
  use_post = TRUE
)
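
The `stringr::str_replace()` call above shortens the full ChEBI IRIs returned by IDSM into the compact `CHEBI:` form expected by the VALUES clause. As a small illustration, the same transformation can be reproduced in base R on a few of the identifiers listed in step 2; the input vector below is a toy example that assumes, as the pattern above implies, that IRIs are returned wrapped in angle brackets.

```r
# Toy input: full IRIs in angle brackets, with one duplicate.
chebi_iris <- c(
  "<http://purl.obolibrary.org/obo/CHEBI_83790>",
  "<http://purl.obolibrary.org/obo/CHEBI_15930>",
  "<http://purl.obolibrary.org/obo/CHEBI_15930>"
)

# Deduplicate and sort, then keep only the numeric part captured by "(.*)"
# and prepend the CHEBI: prefix.
chebi_curies <- sub(
  "^<http://purl.obolibrary.org/obo/CHEBI_(.*)>$",
  "CHEBI:\\1",
  unique(sort(chebi_iris))
)
print(chebi_curies)
```

Anchoring the pattern with `^` and `$` is a small tightening compared with the pattern used above: a value that is not a complete ChEBI IRI is left unchanged rather than partially rewritten.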

Here are some of the enzymes retrieved by the query:

Examples of enzymes that take part in metabolic reactions involving atrazine or a structurally similar compound.
UniProt ID	Rhea ID	Rhea equation
upk:P72156	rh:11312	atrazine + H2O = hydroxyatrazine + chloride + H(+)
upk:O66908	rh:11348	arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
upk:O66674	rh:11348	arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
upk:O52027	rh:11348	arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)
upk:O50593	rh:11348	arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+)

Step 4: identify organisms with bioremediation potential

In this last step of this illustrative bioremediation use-case, the OMA database is queried to retrieve all organisms that produce some of the enzymes/proteins identified to potentially take part in the metabolic degradation of atrazine.

# Query organisms with potential for bioremediation from the OMA database.
query_oma <- load_query_from_file(query_file_oma, remove_comments = TRUE)

uniprot_values <- uniprot_ids |>
  dplyr::pull("uniprot") |>
  unique() |>
  sort() |>
  stringr::str_replace("<http://purl.uniprot.org/uniprot/(.*)>", "upk:\\1")

oma_taxa <- sparql_query(
  endpoint = endpoint_oma,
  query = replace_values_clause(
    "uniprot",
    as_values_clause(uniprot_values, "uniprot"),
    query_oma
  ),
  use_post = FALSE
)

# Filter the OMA results to keep only bacterial taxa.
# This is done by querying UniProt for the "mnemonic" codes of all bacterial
# taxa present in UniProt, then filtering the query results from OMA to keep
# only those with a bacterial mnemonic code.
query_file_uniprot_mnemonic <- "queries/query_5_uniprot.rq"
query_uniprot_mnemonic <- load_query_from_file(
  query_file_uniprot_mnemonic,
  remove_comments = TRUE
)
bacterial_taxon <- sparql_query(
  endpoint = endpoint_uniprot,
  query = query_uniprot_mnemonic,
  use_post = TRUE
)
taxa_counts <- oma_taxa |>
  dplyr::semi_join(bacterial_taxon, by = "mnemonic") |>
  dplyr::count(mnemonic, taxon_sci_name, name = "enzyme_count") |>
  dplyr::arrange(dplyr::desc(enzyme_count)) |>
  dplyr::select("taxon_sci_name", "mnemonic", "enzyme_count")
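
The filter-and-count logic of the pipeline above can be checked on a small example. The sketch below reproduces it in base R on invented data: the protein identifiers and the `uniprot` column name are assumptions made for illustration, and only the `mnemonic` and `taxon_sci_name` columns are taken from the code above.

```r
# Invented OMA-like results: one row per (organism, enzyme) mapping.
oma_demo <- data.frame(
  mnemonic = c("BURM1", "BURM1", "PARXL", "YEAST"),
  taxon_sci_name = c(
    "Burkholderia multivorans", "Burkholderia multivorans",
    "Paraburkholderia xenovorans", "Saccharomyces cerevisiae"
  ),
  uniprot = c("upk:DEMO1", "upk:DEMO2", "upk:DEMO1", "upk:DEMO3"),
  stringsAsFactors = FALSE
)

# Invented list of bacterial mnemonic codes (YEAST is deliberately absent).
bacteria_demo <- data.frame(
  mnemonic = c("BURM1", "PARXL"),
  stringsAsFactors = FALSE
)

# Equivalent of dplyr::semi_join(): keep rows whose mnemonic is bacterial.
kept <- oma_demo[oma_demo$mnemonic %in% bacteria_demo$mnemonic, ]

# Equivalent of dplyr::count(): number of rows per organism.
counts_demo <- aggregate(
  uniprot ~ mnemonic + taxon_sci_name,
  data = kept, FUN = length
)
names(counts_demo)[names(counts_demo) == "uniprot"] <- "enzyme_count"

# Sort by decreasing count and put the columns in the reported order.
counts_demo <- counts_demo[
  order(-counts_demo$enzyme_count),
  c("taxon_sci_name", "mnemonic", "enzyme_count")
]
print(counts_demo)
```

Note that the count is a number of rows, so an enzyme returned several times for the same organism would be counted several times unless the results are deduplicated first.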

This query identifies 829 bacterial species with potential for bioremediation of atrazine.

Here are the 10 with the highest number of mappings to enzymes potentially involved in the metabolic degradation of atrazine. A higher count could be interpreted as a higher likelihood of bioremediation potential for atrazine.

Taxa with bioremediation potential for atrazine
Taxon name	UniProt mnemonic	Enzyme count
Burkholderia multivorans (strain ATCC 17616 / 249)	BURM1	321
Paraburkholderia xenovorans (strain LB400)	PARXL	222
Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCTC 10005 / WDCM 00083 / NCDC 279-56)	ENTCC	217
Cupriavidus taiwanensis (strain DSM 17343 / BCRC 17206 / CCUG 44338 / CIP 107171 / LMG 19424 / R1)	CUPTR	193
Methylorubrum extorquens (strain ATCC 14718 / DSM 1338 / JCM 2805 / NCIMB 9133 / AM1)	METEA	192
Ralstonia pickettii (strain 12D)	RALP1	167
Burkholderia vietnamiensis (strain G4 / LMG 22486)	BURVG	164
Burkholderia ambifaria (strain MC40-6)	BURA4	163
Burkholderia cenocepacia (strain ATCC BAA-245 / DSM 16553 / LMG 16656 / NCTC 13227 / J2315 / CF5610)	BURCJ	163
Burkholderia ambifaria (strain ATCC BAA-244 / AMMD)	BURCM	163

Summary network graph

Visual representation of the SPARQL queries run for the first 3 steps of the bioremediation use-case.

The following graph is built using SPARQL CONSTRUCT queries representing the first 3 steps of the bioremediation use-case. The graph visually illustrates the connection between the queries made across 3 different SPARQL endpoints.

Nodes shown in red are data retrieved from Wikidata. These correspond to the pollutants, which are the starting point of the use-case.
Nodes shown in light green are data retrieved from the IDSM SPARQL endpoint: they correspond to nodes that link a chemical compound, as defined by its CAS number, to its ChEBI identifier and to those of similar chemical compounds.
Nodes shown in blue are data retrieved from the UniProt and Rhea SPARQL endpoints. They illustrate the link between chemical compounds, as defined by their ChEBI identifiers, and the proteins/enzymes that participate in a reaction involving those compounds.

As can be seen in the figure below, only the cluster for atrazine is linked across the 3 SPARQL endpoints. Other clusters do not have any link with a protein/enzyme in the UniProt database.

Note: performing a "mouse-over" on the nodes in the figure below will show more detailed information about the node, when available.

List of prefixes used in the above graph.
Prefix	Expanded prefix
CHEBI	http://purl.obolibrary.org/obo/CHEBI_
cas	https://identifiers.org/cas:
cmp	http://rdf.ncbi.nlm.nih.gov/pubchem/compound/
rh	http://rdf.rhea-db.org/
up	http://purl.uniprot.org/core/
upk	http://purl.uniprot.org/uniprot/
upt	http://purl.uniprot.org/taxonomy/
wd	http://www.wikidata.org/entity/
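
The prefix table can also be used programmatically to turn a compact identifier back into a full IRI. The sketch below shows one possible way to do this. The helper `expand_curie()` is hypothetical, and the `upk` prefix for `http://purl.uniprot.org/uniprot/` follows the UniProt result tables earlier in this report.

```r
# Subset of the prefix table, stored as a named character vector.
prefixes <- c(
  CHEBI = "http://purl.obolibrary.org/obo/CHEBI_",
  cas = "https://identifiers.org/cas:",
  rh = "http://rdf.rhea-db.org/",
  upk = "http://purl.uniprot.org/uniprot/"
)

# Hypothetical helper: expand "prefix:local_id" into a full IRI.
expand_curie <- function(curie, prefix_map) {
  prefix <- sub(":.*$", "", curie)       # text before the first colon
  local_id <- sub("^[^:]*:", "", curie)  # text after the first colon
  unknown <- setdiff(prefix, names(prefix_map))
  if (length(unknown) > 0) {
    stop("Unknown prefix: ", paste(unknown, collapse = ", "))
  }
  paste0(prefix_map[prefix], local_id)
}

expanded <- expand_curie(
  c("upk:P72156", "rh:11312", "CHEBI:15930", "cas:1912-24-9"),
  prefixes
)
print(expanded)
```

Only the first colon is treated as the separator, so local identifiers that contain further colons or hyphens, such as CAS numbers, are preserved as-is.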