The following is a use-case demonstrating the federation of data across SPARQL endpoints and Solid Pods.
The objective of this use-case is to provide a demonstrator showing how data from different SPARQL endpoints can be combined to retrieve complex information: in this specific case, identifying organisms with possible bioremediation potential for the pollutant atrazine.
The queries in this use-case target the following 5 SPARQL endpoints: Wikidata, IDSM, Rhea, UniProt and OMA.
The steps involved in this demonstrator pipeline are the following:
1. Query data from a Solid Pod to retrieve the CAS registry number of atrazine, the pollutant for which bioremediation is sought in this demonstrator use-case.
2. Identify chemical compounds that are similar to atrazine. This widens the search for organisms with potential for bioremediation: if an organism can metabolize a closely related chemical compound, it may also be able to metabolize the original pollutant. The search is done via the IDSM SPARQL endpoint, which is queried with the pollutant’s CAS number retrieved in step 1. Each retrieved “similar” chemical compound is identified by its ChEBI identifier.
3. Retrieve metabolic chemical reactions that involve atrazine or one of its similar chemical compounds. This is done using the Rhea service, a database of chemical and transport reactions of biological interest.
4. Retrieve proteins/enzymes that are involved in the metabolic reactions returned by the Rhea endpoint (step 3). This is done using the UniProt service, the largest available protein database.
5. Identify the biological organisms with potential for bioremediation. This is done by querying the OMA (Orthologous Matrix) service.
library(rlang)
# Load our own SPARQL library and helper functions.
source("../sparqlr.git/sparql.R")
source("utils.R")
# Set endpoints and paths to SPARQL queries used throughout the use-case.
endpoint_wikidata <- "https://query.wikidata.org/sparql"
endpoint_idsm <- "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
endpoint_rhea <- "https://sparql.rhea-db.org/sparql"
endpoint_uniprot <- "https://sparql.uniprot.org/sparql"
endpoint_oma <- "https://sparql.omabrowser.org/sparql"
query_file_wikidata <- "queries/query_1_wikidata.rq"
query_file_idsm <- "queries/query_2_idsm.rq"
query_file_uniprot <- "queries/query_3_uniprot.rq"
query_file_oma <- "queries/query_4_oma.rq"
subquery_file_wikidata <- "queries/subquery_1_wikidata.rq"
subquery_file_idsm <- "queries/subquery_2_idsm.rq"
subquery_file_uniprot <- "queries/subquery_3_uniprot.rq"
pollutant_name <- "atrazine"
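The helper functions sourced above are not shown in this use-case. As a rough illustration only, a query loader with optional comment stripping might look like the sketch below. The name `load_query_sketch` is hypothetical, and the real `load_query_from_file()` may behave differently.

```r
# Hypothetical stand-in for load_query_from_file(); the real helper is
# defined in the sourced files and may differ.
load_query_sketch <- function(path, remove_comments = FALSE) {
  lines <- readLines(path, warn = FALSE)
  if (remove_comments) {
    # Drop only full-line comments. A '#' inside a line is left alone because
    # it is often part of an IRI, e.g. <http://www.w3.org/2000/01/rdf-schema#>.
    lines <- lines[!grepl("^[[:space:]]*#", lines)]
  }
  paste(lines, collapse = "\n")
}

# Usage with a temporary query file (illustrative content only).
query_path <- tempfile(fileext = ".rq")
writeLines(
  c(
    "# Retrieve a few labelled resources.",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
    "SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 5"
  ),
  query_path
)
query_text <- load_query_sketch(query_path, remove_comments = TRUE)
cat(query_text, "\n")
```

Stripping only full-line comments is a deliberate choice in this sketch: removing everything after any `#` would corrupt prefix declarations whose IRIs end in `#`.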
The first step of the use-case demonstrates the retrieval of data from a Solid Pod.
Specifically, the following table with a list of pollutants and their CAS registry number (Chemical Abstracts Service) is retrieved.
# Retrieve data from a Solid Pod.
pollutant_table_source <- paste0(
"https://triple.ilabt.imec.be/bioremediation-use-case/",
"input/pollutants_table.ttl"
)
pollutants <- read.table(pollutant_table_source, sep = "\t", header = TRUE)
pollutant_cas_number <- pollutants[
pollutants$compoundLabel == pollutant_name, "cas_number"
]
Pollutant | CAS Number | Average LD50 |
---|---|---|
benomyl | 17804-35-2 | 37.0000 |
benomyl | 17804-35-2 | 37.0000 |
aldrin | 309-00-2 | 136.6407 |
ANTU | 86-88-4 | 709.5708 |
atrazine | 1912-24-9 | 1503.5455 |
ammonium sulfamate | 7773-06-0 | 2952.4000 |
amitrole | 61-82-5 | 5333.3333 |
From the above pollutant list, the use-case will from here on focus on the pollutant atrazine, which is identified by the CAS number 1912-24-9.
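The lookup of a CAS number by pollutant name can be sketched on a small inline table. Because the pollutant table can contain repeated rows (benomyl appears twice above), this sketch deduplicates the result and fails if the name does not map to exactly one CAS number. The helper `lookup_cas_number` and the `toy_*` objects are hypothetical and not part of the use-case code.

```r
# Toy tab-separated table using the same column names as the code above;
# the rows are copied from the pollutant table shown in this section.
toy_tsv <- paste(
  "compoundLabel\tcas_number",
  "benomyl\t17804-35-2",
  "benomyl\t17804-35-2",
  "atrazine\t1912-24-9",
  sep = "\n"
)
toy_pollutants <- read.table(
  text = toy_tsv, sep = "\t", header = TRUE, stringsAsFactors = FALSE
)

# Hypothetical helper: return exactly one CAS number or stop with an error.
lookup_cas_number <- function(table, name) {
  cas <- unique(table[table$compoundLabel == name, "cas_number"])
  if (length(cas) != 1) {
    stop(
      "Expected exactly one CAS number for '", name, "', found ", length(cas)
    )
  }
  cas
}

atrazine_cas <- lookup_cas_number(toy_pollutants, "atrazine")
benomyl_cas <- lookup_cas_number(toy_pollutants, "benomyl")
print(atrazine_cas)
print(benomyl_cas)
```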
Note: the data in the table downloaded from the Solid Pod was originally retrieved via the following commands:
query_wikidata <- load_query_from_file(query_file_wikidata)
pollutants <- sparql_query(
endpoint = endpoint_wikidata, query = query_wikidata
)
readr::write_tsv(pollutants, "pollutants_table.ttl")
In this second step of the use-case, chemical substances with structural similarity to atrazine are retrieved by running a SPARQL query on the IDSM endpoint.
The query is made using the CAS number, a unique ID for chemical substances assigned by the Chemical Abstracts Service (CAS). It returns a list of similar chemical compounds, which are identified by their ChEBI identifiers. Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on “small” chemical compounds.
query_idsm <- load_query_from_file(query_file_idsm)
similar_pollutants <- sparql_query(
endpoint = endpoint_idsm,
query = replace_values_clause(
"cas_number",
as_values_clause(single_quote(pollutant_cas_number), "cas_number"),
query_idsm
),
use_post = TRUE
)
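The helpers `single_quote()`, `as_values_clause()` and `replace_values_clause()` used above come from the authors' own files and are not shown. The sketch below is one possible way such helpers could work, assuming the query template contains a single-variable `VALUES ?cas_number { ... }` placeholder. All `*_sketch` names and the template query are hypothetical.

```r
# Hypothetical re-implementations; the real helpers may differ.
single_quote_sketch <- function(x) paste0("'", x, "'")

as_values_clause_sketch <- function(values, variable) {
  paste0("VALUES ?", variable, " { ", paste(values, collapse = " "), " }")
}

replace_values_clause_sketch <- function(variable, new_clause, query) {
  # Locate the first "VALUES ?variable { ... }" clause in the query text.
  pattern <- paste0("VALUES\\s+\\?", variable, "\\s*\\{[^}]*\\}")
  m <- regexpr(pattern, query, perl = TRUE)
  start <- as.integer(m)
  if (start == -1L) {
    stop("No VALUES clause found for ?", variable)
  }
  end <- start + attr(m, "match.length")
  # Splice the new clause in literally (no regex replacement semantics).
  paste0(
    substr(query, 1, start - 1),
    new_clause,
    substr(query, end, nchar(query))
  )
}

# Illustrative template, not the content of queries/query_2_idsm.rq.
template <- paste(
  "SELECT ?compound WHERE {",
  "  VALUES ?cas_number { 'PLACEHOLDER' }",
  "  ?compound ?p ?cas_number .",
  "}",
  sep = "\n"
)
filled <- replace_values_clause_sketch(
  "cas_number",
  as_values_clause_sketch(single_quote_sketch("1912-24-9"), "cas_number"),
  template
)
cat(filled, "\n")
```

Keeping a placeholder `VALUES` clause in the `.rq` file means the query stays valid SPARQL on its own and can be tested directly against the endpoint before being parameterized from R.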
Here are some of the chemical compounds similar to atrazine returned by the query:
Chemical compound (ChEBI identifier) | Similarity score |
---|---|
obo:CHEBI_15930 | 1.0000000 |
obo:CHEBI_83790 | 0.7575758 |
obo:CHEBI_83789 | 0.7575758 |
obo:CHEBI_83791 | 0.7575758 |
obo:CHEBI_82227 | 0.7142857 |
In order to identify micro-organisms that are potential candidates for the bioremediation of atrazine, a federated SPARQL query is run on the Rhea and UniProt endpoints to retrieve the enzymes/proteins that are involved in the degradation of atrazine or similar chemical compounds (as identified in step 2).
Specifically, this federated query consists of the following two steps:
1. The Rhea endpoint (queried via a SERVICE clause) is used to retrieve metabolic reactions involving atrazine or chemical compounds similar to it.
2. The UniProt endpoint is used to retrieve the enzymes/proteins involved in these reactions.
query_uniprot <- load_query_from_file(query_file_uniprot)
chebi_values <- similar_pollutants |>
dplyr::pull("similar_compound_chebi") |>
sort() |>
unique() |>
stringr::str_replace(
"<http://purl.obolibrary.org/obo/CHEBI_(.*)>",
"CHEBI:\\1"
)
uniprot_ids <- sparql_query(
endpoint = endpoint_uniprot,
query = replace_values_clause(
"similar_compound_chebi",
as_values_clause(chebi_values, "similar_compound_chebi"),
query_uniprot
),
use_post = TRUE
)
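The conversion from full IRIs to prefixed names performed above can be illustrated with base R alone. This is a minimal sketch on three example IRIs, assuming the endpoint returns ChEBI identifiers wrapped in angle brackets as in the pattern used above. The resulting `CHEBI:` names are only valid in a query that declares the corresponding `PREFIX`.

```r
# Example IRIs in the form matched by the pattern used above.
chebi_iris <- c(
  "<http://purl.obolibrary.org/obo/CHEBI_83790>",
  "<http://purl.obolibrary.org/obo/CHEBI_15930>",
  "<http://purl.obolibrary.org/obo/CHEBI_15930>"
)

# Sort, deduplicate, then rewrite each IRI as a prefixed name.
chebi_curies <- sub(
  "^<http://purl\\.obolibrary\\.org/obo/CHEBI_(.*)>$",
  "CHEBI:\\1",
  unique(sort(chebi_iris))
)
print(chebi_curies)
```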
Here are some of the enzymes retrieved by the query:
UniProt ID | Rhea ID | Rhea equation |
---|---|---|
upk:P72156 | rh:11312 | atrazine + H2O = hydroxyatrazine + chloride + H(+) |
upk:O66908 | rh:11348 | arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+) |
upk:O66674 | rh:11348 | arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+) |
upk:O52027 | rh:11348 | arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+) |
upk:O50593 | rh:11348 | arsenite(in) + ATP + H2O = arsenite(out) + ADP + phosphate + H(+) |
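For readers unfamiliar with federated SPARQL, the two-step structure described in this section could look roughly like the query string below. This is a sketch only and not the content of `queries/query_3_uniprot.rq`. The property paths are assumptions based on the public Rhea and UniProt RDF vocabularies and should be checked against each endpoint's documentation. The single ChEBI value is the identifier with similarity score 1 from the previous step.

```r
# Hypothetical federated query, intended to be sent to the UniProt endpoint,
# which would then call the Rhea endpoint through the SERVICE clause.
federated_query <- paste(
  "PREFIX rh: <http://rdf.rhea-db.org/>",
  "PREFIX up: <http://purl.uniprot.org/core/>",
  "PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>",
  "SELECT DISTINCT ?uniprot ?rhea ?equation WHERE {",
  "  VALUES ?similar_compound_chebi { CHEBI:15930 }",
  "  # Step 1: reactions involving the compound, fetched from Rhea.",
  "  SERVICE <https://sparql.rhea-db.org/sparql> {",
  "    ?rhea rh:side/rh:contains/rh:compound/rh:chebi ?similar_compound_chebi ;",
  "          rh:equation ?equation .",
  "  }",
  "  # Step 2: UniProt entries annotated with those reactions.",
  "  ?uniprot up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea .",
  "}",
  sep = "\n"
)
cat(federated_query, "\n")
```

The `?rhea` variable is what joins the two endpoints: it is bound inside the `SERVICE` block and reused in the UniProt triple pattern.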
In the last step of this illustrative bioremediation use-case, the OMA database is queried to retrieve all organisms that produce some of the enzymes/proteins identified as potentially taking part in the metabolic degradation of atrazine.
# Query organisms with potential for bioremediation from the OMA database.
query_oma <- load_query_from_file(query_file_oma, remove_comments = TRUE)
uniprot_values <- uniprot_ids |>
dplyr::pull("uniprot") |>
unique() |>
sort() |>
stringr::str_replace("<http://purl.uniprot.org/uniprot/(.*)>", "upk:\\1")
oma_taxa <- sparql_query(
endpoint = endpoint_oma,
query = replace_values_clause(
"uniprot",
as_values_clause(uniprot_values, "uniprot"),
query_oma
),
use_post = FALSE
)
# Filter the OMA results to keep only bacterial taxa.
# This is done by querying UniProt for the "mnemonic" codes of all bacterial
# taxa present in UniProt, then filtering the query results from OMA to keep
# only those with a bacterial mnemonic code.
query_file_uniprot_mnemonic <- "queries/query_5_uniprot.rq"
query_uniprot_mnemonic <- load_query_from_file(
query_file_uniprot_mnemonic,
remove_comments = TRUE
)
bacterial_taxon <- sparql_query(
endpoint = endpoint_uniprot,
query = query_uniprot_mnemonic,
use_post = TRUE
)
taxa_counts <- oma_taxa |>
dplyr::semi_join(bacterial_taxon, by = "mnemonic") |>
dplyr::count(mnemonic, taxon_sci_name, name = "enzyme_count") |>
dplyr::arrange(dplyr::desc(enzyme_count)) |>
dplyr::select("taxon_sci_name", "mnemonic", "enzyme_count")
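The filter-then-count logic above can be reproduced on toy data with base R only, which may help when checking the result without dplyr installed. This is a minimal sketch: the `*_toy` data frames are invented for illustration and are not query results, and each row of `oma_toy` stands for one enzyme-to-organism mapping.

```r
# Toy stand-ins for the OMA results and the list of bacterial mnemonics.
oma_toy <- data.frame(
  mnemonic = c("BURM1", "BURM1", "PARXL", "YEAST"),
  taxon_sci_name = c(
    "Burkholderia multivorans", "Burkholderia multivorans",
    "Paraburkholderia xenovorans", "Saccharomyces cerevisiae"
  ),
  stringsAsFactors = FALSE
)
bacteria_toy <- data.frame(
  mnemonic = c("BURM1", "PARXL"), stringsAsFactors = FALSE
)

# Equivalent of a semi-join: keep rows whose mnemonic is in the bacterial list.
kept <- oma_toy[oma_toy$mnemonic %in% bacteria_toy$mnemonic, ]

# Equivalent of count(): one unit per row, summed per taxon.
kept$enzyme_count <- 1L
counts_toy <- aggregate(
  enzyme_count ~ mnemonic + taxon_sci_name, data = kept, FUN = sum
)

# Sort by decreasing count.
counts_toy <- counts_toy[order(-counts_toy$enzyme_count), ]
print(counts_toy)
```

A semi-join is used rather than an inner join because only the presence of a matching mnemonic matters; no columns from the bacterial table are needed and rows are never duplicated by the filter.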
This query identifies 829 bacterial species with potential for bioremediation of atrazine.
Here are the 10 with the highest number of mappings to enzymes potentially involved in the metabolic degradation of atrazine, which could be interpreted as a higher likelihood of bioremediation potential for atrazine.
Taxon name | UniProt mnemonic | Enzyme count |
---|---|---|
Burkholderia multivorans (strain ATCC 17616 / 249) | BURM1 | 321 |
Paraburkholderia xenovorans (strain LB400) | PARXL | 222 |
Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCTC 10005 / WDCM 00083 / NCDC 279-56) | ENTCC | 217 |
Cupriavidus taiwanensis (strain DSM 17343 / BCRC 17206 / CCUG 44338 / CIP 107171 / LMG 19424 / R1) | CUPTR | 193 |
Methylorubrum extorquens (strain ATCC 14718 / DSM 1338 / JCM 2805 / NCIMB 9133 / AM1) | METEA | 192 |
Ralstonia pickettii (strain 12D) | RALP1 | 167 |
Burkholderia vietnamiensis (strain G4 / LMG 22486) | BURVG | 164 |
Burkholderia ambifaria (strain MC40-6) | BURA4 | 163 |
Burkholderia cenocepacia (strain ATCC BAA-245 / DSM 16553 / LMG 16656 / NCTC 13227 / J2315 / CF5610) | BURCJ | 163 |
Burkholderia ambifaria (strain ATCC BAA-244 / AMMD) | BURCM | 163 |
Visual representation of the SPARQL queries run for the first 3 steps of the bioremediation use case.
The following graph is built using SPARQL CONSTRUCT queries representing the first 3 steps of the bioremediation use-case. The graph visually illustrates the connection between the queries made across 3 different SPARQL endpoints.
As can be seen in the figure below, only the cluster for atrazine is linked across the 3 SPARQL endpoints. Other clusters do not have any link with a protein/enzyme in the UniProt database.
Note: performing a "mouse-over" on the nodes in the figure below will show more detailed information about the node, when available.
Prefix | Expanded prefix |
---|---|
CHEBI | http://purl.obolibrary.org/obo/CHEBI_ |
cas | https://identifiers.org/cas: |
cmp | http://rdf.ncbi.nlm.nih.gov/pubchem/compound/ |
rh | http://rdf.rhea-db.org/ |
up | http://purl.uniprot.org/core/ |
upk | http://purl.uniprot.org/uniprot/ |
upt | http://purl.uniprot.org/taxonomy/ |
wd | http://www.wikidata.org/entity/ |
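A prefix table like the one above can be used to expand prefixed names back into full IRIs. The sketch below uses a subset of the prefixes, with `upk` for UniProt entries as in the results tables of this use-case. The function `expand_curie` is a hypothetical helper, not part of the use-case code.

```r
# Subset of the prefix table, stored as a named character vector.
prefixes <- c(
  CHEBI = "http://purl.obolibrary.org/obo/CHEBI_",
  rh = "http://rdf.rhea-db.org/",
  upk = "http://purl.uniprot.org/uniprot/"
)

# Hypothetical helper: expand "prefix:local" names into full IRIs.
expand_curie <- function(curie, prefixes) {
  prefix <- sub(":.*$", "", curie)
  local_part <- sub("^[^:]*:", "", curie)
  unknown <- setdiff(prefix, names(prefixes))
  if (length(unknown) > 0) {
    stop("Unknown prefix: ", paste(unknown, collapse = ", "))
  }
  paste0(unname(prefixes[prefix]), local_part)
}

expanded <- expand_curie(c("CHEBI:15930", "rh:11312", "upk:P72156"), prefixes)
print(expanded)
```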