Overview

Background on Systems Biology Modeling

Biological Expression Language (BEL)

Biological Expression Language (BEL) is a domain-specific language that enables the expression of complex molecular relationships and their context in a machine-readable form. Its simple grammar and expressive power have led to its successful use in describing complex disease networks with several thousand relationships. For a detailed explanation, see the BEL 1.0 and 2.0 specifications.

Design Considerations

Missing Namespaces and Improper Names

The use of openly shared controlled vocabularies (namespaces) within BEL facilitates the exchange and consistency of information. Finding the correct namespace:name pair is often a difficult part of the curation process.

Outdated Namespaces

OpenBEL provides a variety of namespaces covering each of the BEL function types. These namespaces are generated by code found at https://github.com/OpenBEL/resource-generator and distributed at http://resources.openbel.org/belframework/.

This code has not been maintained to reflect the changes in the underlying resources, so this repository has been forked and updated at https://github.com/pybel/resource-generator to reflect the most recent versions of the underlying namespaces. The files are now distributed using the Fraunhofer SCAI Artifactory server.

Generating New Namespaces

In some cases, it is appropriate to design a new namespace, using the custom namespace specification provided by the OpenBEL Framework. Packages for generating namespace, annotation, and knowledge resources have been grouped in the Bio2BEL organization on GitHub.

Synonym Issues

Due to the huge number of terms across many namespaces, it’s difficult for curators to know the domain-specific synonyms that obscure the controlled/preferred term. However, the issue of synonym resolution and semantic searching has already been generally solved by the use of ontologies. Besides a controlled vocabulary, they also provide a hierarchical model of knowledge, synonyms with cross-references to databases and other ontologies, and other information for semantic reasoning. Ontologies in the biomedical domain can be found at OBO and EMBL-EBI OLS.

Additionally, as a tool for curators, the EMBL-EBI Ontology Lookup Service (OLS) allows for semantic searching. Simple queries for the terms ‘mitochondrial dysfunction’ and ‘amyloid beta-peptides’ immediately returned results from relevant ontologies, and ended a long debate over how to represent these objects within BEL. EMBL-EBI also provides a programmatic API to the OLS service, for searching terms (http://www.ebi.ac.uk/ols/api/search?q=folic%20acid) and suggesting resolutions (http://www.ebi.ac.uk/ols/api/suggest?q=folic+acid).
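As a minimal sketch of how a curation script might build these two query URLs, the snippet below uses only the Python standard library. The helper name ols_url is hypothetical, and the base URL and the q parameter are taken from the examples above; nothing is assumed about the structure of the JSON that the service returns.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the example links above.
OLS_API = "http://www.ebi.ac.uk/ols/api"


def ols_url(endpoint: str, query: str) -> str:
    """Build an OLS URL for the 'search' or 'suggest' endpoint (hypothetical helper)."""
    if endpoint not in {"search", "suggest"}:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{OLS_API}/{endpoint}?{urlencode({'q': query}, quote_via=quote)}"


print(ols_url("search", "folic acid"))
print(ols_url("suggest", "folic acid"))
```

The resulting URL can be passed to any HTTP client, such as urllib.request.urlopen, to retrieve the results.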

Implementation

PyBEL is implemented using the PyParsing module. It provides flexibility and speed in parsing compared to an implementation based on regular expressions. It also allows for the addition of parse action hooks, which allow the graph to be checked semantically at compile time.

It uses SQLite to provide a consistent and lightweight caching system for external data, such as namespaces, annotations, and ontologies, and SQLAlchemy to provide a cross-platform interface. The same data management system is used to store graphs for high-performance querying.
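The caching idea can be illustrated with a small sketch. It uses the standard-library sqlite3 module directly instead of SQLAlchemy, and the table name, columns, and function names are hypothetical; they are not PyBEL's actual schema or API.

```python
import sqlite3
from typing import Iterable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with a hypothetical namespace table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS namespace_entry ("
        " keyword TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " PRIMARY KEY (keyword, name))"
    )
    return conn


def cache_namespace(conn: sqlite3.Connection, keyword: str, names: Iterable[str]) -> None:
    """Store the names of a downloaded namespace; duplicates are ignored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO namespace_entry (keyword, name) VALUES (?, ?)",
            [(keyword, name) for name in names],
        )


def is_cached(conn: sqlite3.Connection, keyword: str, name: str) -> bool:
    """Check whether a namespace:name pair is already in the cache."""
    row = conn.execute(
        "SELECT 1 FROM namespace_entry WHERE keyword = ? AND name = ?",
        (keyword, name),
    ).fetchone()
    return row is not None


conn = open_cache()
cache_namespace(conn, "HGNC", ["APOE", "NDUFB6"])
print(is_cached(conn, "HGNC", "APOE"))
```

Passing a file path instead of ":memory:" would persist the cache between runs, so a namespace only needs to be downloaded once.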

Extensions to BEL

The PyBEL compiler is fully compliant with both BEL v1.0 and v2.0 and automatically upgrades legacy statements. Additionally, PyBEL includes several additions to the BEL specification to enable expression of important concepts in molecular biology that were previously missing and to facilitate integrating new data types. A short example is the inclusion of protein oxidation in the default BEL namespace for protein modifications. Other, more elaborate additions are outlined below.

Syntax for Epigenetics

PyBEL introduces the gene modification function, gmod(), as a syntax for encoding epigenetic modifications. Its usage mirrors the pmod() function for proteins and includes arguments for methylation.

For example, the methylation of NDUFB6 was found to be negatively correlated with its expression in a study of insulin resistance and Type II diabetes. This can now be expressed in BEL with the following statement:

g(HGNC:NDUFB6, gmod(Me)) negativeCorrelation r(HGNC:NDUFB6)

Note

This syntax is currently under consideration as BEP-0006.

Definition of Namespaces as Regular Expressions

BEL imposes the constraint that each identifier must be qualified with an enumerated namespace to enable semantic interoperability and data integration. However, enumerating a namespace with potentially billions of names, such as dbSNP, poses a computational issue. To overcome this, PyBEL introduces syntax for defining namespaces that follow a consistent pattern using a regular expression. For these namespaces, semantic validation can be performed in post-processing against the underlying database. The dbSNP namespace can be defined with a syntax familiar from BEL annotation definitions with regular expressions as follows:

DEFINE NAMESPACE dbSNP AS PATTERN "rs[0-9]+"

Note

This syntax was proposed with BEP-0005 and has been officially accepted as part of the BEL 2.1 specification.
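To make the idea concrete, the following sketch checks names against the dbSNP pattern defined above. The dictionary and function names are hypothetical, and the use of a full match (so that the whole name must fit the pattern) is an assumption about how such a pattern would be applied.

```python
import re

# Pattern copied from the DEFINE NAMESPACE statement above.
PATTERN_NAMESPACES = {"dbSNP": re.compile(r"rs[0-9]+")}


def matches_namespace(namespace: str, name: str) -> bool:
    """Return True if the whole name fits the namespace's pattern."""
    pattern = PATTERN_NAMESPACES.get(namespace)
    if pattern is None:
        raise KeyError(f"no pattern defined for namespace {namespace!r}")
    # Assumption: the pattern must cover the entire name, not just a prefix.
    return pattern.fullmatch(name) is not None


print(matches_namespace("dbSNP", "rs429358"))
print(matches_namespace("dbSNP", "429358"))
```

A syntactic match only shows that the identifier is well-formed; whether it refers to a real record still has to be checked against the underlying database.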

Definition of Resources using OWL

Versions of PyBEL up to 0.11.2 supported an alternative namespace definition. It is now recommended either to generate namespace files with reproducible build scripts following the Bio2BEL framework, or to add them directly to the database with the Bio2BEL bio2bel.manager.namespace_manager.NamespaceManagerMixin extension.

Explicit Node Labels

While the BEL 2.0 specification made it possible to represent new terms, such as the APOE gene with two variants resulting in the E2 allele, it came at the price of encoding terms in a technical and less readable way. An explicit statement for labeling nodes has been added, such that the resulting data structure will have a label for the node:

g(HGNC:APOE, var(c.388T>C), var(c.526C>T)) labeled "APOE E2"

When InChI is used, these strings are very hard to visualize, so a label is helpful for later visualization. Below, a molecule is represented with an InChIKey and given a readable label:

a(INCHIKEY:"GBXSMTUPTTWBMN-XIRDDKMYSA-N") labeled "Enalapril"

It’s also easy to use the universe of RESTful API services from UniChem, ChEMBL, or WikiData to download and annotate these automatically. Further information on Enalapril can be found on WikiData, UniChem, and ChEMBL.

Things to Consider

Do All Statements Need Supporting Text?

Yes! All statements must be minimally qualified with a citation and evidence (now called SupportingText in BEL 2.0) to maintain provenance. Statements without evidence can’t be traced to their source or evaluated independently from the curator, so they are excluded.

Multiple Annotations

All single annotations are considered as single element sets. When multiple annotations are present, all are unioned and attached to a given edge.

SET Citation = {"PubMed","Example Article","12345"}
SET ExampleAnnotation1 = {"Example Value 11", "Example Value 12"}
SET ExampleAnnotation2 = {"Example Value 21", "Example Value 22"}
p(HGNC:YFG1) -> p(HGNC:YFG2)
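The set semantics can be sketched as follows. This is not PyBEL's parser: the regular expressions and the function name are hypothetical, the citation line is skipped because it is a structured list and not an annotation, and escaped quotation marks inside values are not handled.

```python
import re
from typing import Dict, Iterable, Set

# Hypothetical patterns for lines such as: SET Key = {"a", "b"} or SET Key = "a"
SET_LINE = re.compile(r"^SET\s+(\w+)\s*=\s*(.+)$")
QUOTED = re.compile(r'"([^"]*)"')


def collect_annotations(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect annotation values as sets; a single value becomes a one-element set."""
    annotations: Dict[str, Set[str]] = {}
    for line in lines:
        match = SET_LINE.match(line.strip())
        if match is None or match.group(1) == "Citation":
            continue  # not an annotation line
        key, raw_value = match.groups()
        # A later SET for the same key replaces the earlier value.
        annotations[key] = set(QUOTED.findall(raw_value))
    return annotations


document = [
    'SET Citation = {"PubMed","Example Article","12345"}',
    'SET ExampleAnnotation1 = {"Example Value 11", "Example Value 12"}',
    'SET ExampleAnnotation2 = {"Example Value 21", "Example Value 22"}',
    "p(HGNC:YFG1) -> p(HGNC:YFG2)",
]
print(collect_annotations(document))
```

Under this reading, the edge from p(HGNC:YFG1) to p(HGNC:YFG2) carries both annotation keys, each with its full set of values.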

Namespace and Annotation Name Choices

*.belns and *.belanno configuration files include an entry called “Keyword” in their respective [Namespace] and [AnnotationDefinition] sections. To maintain understandability between BEL documents, PyBEL warns when the names given in *.bel documents do not match their respective resources. For now, capitalization is not considered, but in the future, PyBEL will also warn when capitalization is not properly stylized, like forgetting the lowercase ‘h’ in “ChEMBL”.
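A rough sketch of such a check is shown below, using the standard-library configparser on the header of a resource file. The example header content and the function names are hypothetical, and only the header sections are parsed. The sketch also distinguishes capitalization differences to illustrate the planned warning, which goes beyond the current behavior described above.

```python
import configparser

# Hypothetical header of a *.belns file; real files contain further sections.
EXAMPLE_BELNS_HEADER = """\
[Namespace]
Keyword=ChEMBL
NameString=Hypothetical example namespace
"""


def resource_keyword(text: str, section: str = "Namespace") -> str:
    """Read the Keyword entry from the given section of a resource header."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser[section]["Keyword"]


def check_keyword(document_keyword: str, resource_text: str) -> str:
    """Compare the name used in a BEL document with the resource's Keyword."""
    expected = resource_keyword(resource_text)
    if document_keyword == expected:
        return "ok"
    if document_keyword.lower() == expected.lower():
        return "capitalization"  # same name, different stylization
    return "mismatch"


print(check_keyword("CHEMBL", EXAMPLE_BELNS_HEADER))
```

For annotation files, the same function can be called with section="AnnotationDefinition".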

Why Not Nested Statements?

BEL has different relationships for modeling direct and indirect causal relations.

Direct

  • A => B means that A directly increases B through a physical process.

  • A =| B means that A directly decreases B through a physical process.

Indirect

The relationship between two entities can be coded in BEL, even if the process is not well understood.

  • A -> B means that A indirectly increases B. There are hidden elements X that mediate this interaction through a pathway of direct interactions A (=> or =|) X_1 (=> or =|) ... X_n (=> or =|) B, or through a set of multiple pathways that constitute a network.

  • A -| B means that A indirectly decreases B. Like for A -> B, this process involves hidden components with varying activities.
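One way to read the relationship between direct and indirect relations is sketched below for a single linear pathway. It assumes that the net effect of a chain of direct relations is the product of their signs, which is a simplification introduced for illustration and not a rule stated by the BEL specification; the names are hypothetical.

```python
from typing import Sequence

# Sign of each direct relation: increase is +1, decrease is -1.
DIRECT_SIGNS = {"=>": 1, "=|": -1}
INDIRECT_SYMBOLS = {1: "->", -1: "-|"}


def indirect_relation(path_relations: Sequence[str]) -> str:
    """Summarize a chain A rel X_1 rel ... rel B as one indirect relation.

    Assumption: signs multiply along the pathway, so two decreases
    in a row amount to a net increase.
    """
    if not path_relations:
        raise ValueError("a pathway needs at least one direct relation")
    sign = 1
    for relation in path_relations:
        sign *= DIRECT_SIGNS[relation]
    return INDIRECT_SYMBOLS[sign]


print(indirect_relation(["=>", "=|"]))
```

A network of several parallel pathways with opposing signs cannot be summarized this way, which is part of why indirect relations are curated from evidence and not derived.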

Increasing Nested Relationships

BEL also allows the object of a relationship to be another statement.

  • A => (B => C) means that A increases the process by which B increases C. The example in the BEL Spec p(HGNC:GATA1) => (act(p(HGNC:ZBTB16)) => r(HGNC:MPL)) represents GATA1 directly increasing the process by which ZBTB16 directly increases MPL. Before, directly increasing was used to specify physical contact, so it’s reasonable to conclude that p(HGNC:GATA1) => act(p(HGNC:ZBTB16)). The specification cites examples when B is an activity that is only affected in the context of A and C. This is complicated enough that it is both impractical to standardize during curation, and impractical to represent in a network.

  • A -> (B => C) can be interpreted by assuming that A indirectly increases B, and because of monotonicity, conclude that A -> C as well.

  • A => (B -> C) is more difficult to interpret, because it does not describe which part of process B -> C is affected by A or how. Is it that A => B, and B => C, so we conclude A -> C, or does it mean something else? Perhaps A impacts a different portion of the hidden process in B -> C. These statements are ambiguous enough that they should be written as just A => B, and B -> C. If there is no literature evidence for the statement A -> C, then it is not the job of the curator to make this inference. Identifying statements of this kind might be the goal of a bioinformatics analysis of the BEL network after compilation.

  • A -> (B -> C) introduces even more ambiguity, and it should not be used.

  • A => (B =| C) states A increases the process by which B decreases C. One interpretation of this statement might be that A => B and B =| C. An analysis could infer A -| C. Statements in the form of A -> (B =| C) can also be resolved this way, but with added ambiguity.

Decreasing Nested Relationships

While we could agree on usage for the previous examples, the decrease of a nested statement introduces an unreasonable amount of ambiguity.

  • A =| (B => C) could mean A decreases B, and B also increases C. Does this mean A decreases C, or does it mean that C is still increased, but just not as much? Which of these statements takes precedence? Or do their effects cancel? The same can be said about A -| (B => C), and with added ambiguity for indirect increases A -| (B -> C).

  • A =| (B =| C) could mean that A decreases B and B decreases C. We could conclude that A increases C, or could we again run into the problem of not knowing the precedence? The same is true for the indirect versions.

Recommendations for Use in PyBEL

Because the ambiguity of nested statements poses a great risk to clarity, PyBEL disables the usage of nested statements by default. See the Input and Output section for different parser settings. At Fraunhofer SCAI, curators resolved these statements to single statements to improve the precision and readability of our BEL documents.

While most statements in the form A rel1 (B rel2 C) can be reasonably expanded to A rel1 B and B rel2 C, the few that cannot are the difficult-to-interpret cases that we need to be careful about in our curation and later analyses.
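The expansion of A rel1 (B rel2 C) into A rel1 B and B rel2 C can be sketched with a small recursive function. The Statement type and the function name are hypothetical and are not part of PyBEL; the sketch performs the mechanical rewrite only and makes no judgment about whether the expansion is biologically justified.

```python
from typing import List, NamedTuple, Union


class Statement(NamedTuple):
    """A BEL-like statement whose object may itself be a statement."""
    subject: str
    relation: str
    obj: Union[str, "Statement"]


def expand_nested(statement: Statement) -> List[Statement]:
    """Rewrite A rel1 (B rel2 C) as the flat statements A rel1 B and B rel2 C."""
    if isinstance(statement.obj, Statement):
        inner = statement.obj
        outer = Statement(statement.subject, statement.relation, inner.subject)
        # Recurse in case the inner statement is nested as well.
        return [outer] + expand_nested(inner)
    return [statement]


nested = Statement(
    "p(HGNC:GATA1)",
    "=>",
    Statement("act(p(HGNC:ZBTB16))", "=>", "r(HGNC:MPL)"),
)
for flat in expand_nested(nested):
    print(flat.subject, flat.relation, flat.obj)
```

Each flat statement produced this way still needs its own supporting evidence before it belongs in a curated document.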

Why Not RDF?

Current bel2rdf serialization tools build URLs with the OpenBEL Framework domain as a namespace, rather than respecting the original namespaces of the original entities. This does not follow the best practices of the semantic web, where URLs representing an object point to a real page with additional information. For example, UniProt does an exemplary job of this. Ultimately, using non-standard URLs makes harmonization and data integration difficult.

Additionally, the RDF format does not easily allow for the annotation of edges. A simple statement in BEL that one protein up-regulates another can be easily represented as a triple in RDF, but when the annotations and citation from the BEL document need to be included, this forces RDF serialization to use approaches like representing the statement itself as a node. RDF was not intended to represent this type of information, but more properly for describing resources (hence its name). Furthermore, many blank nodes are introduced throughout the process. This makes the resulting RDF difficult to understand or work with. Later, writing queries in SPARQL becomes very difficult because the data format is complicated and the language is limited. For example, it would be very complicated to write a query in SPARQL to get the objects of statements from publications by a certain author.
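The overhead of representing a statement as a node can be illustrated in plain Python, without an RDF library. The sketch uses the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object); the ex: prefix, the blank node label, and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def reify(
    subject: str,
    predicate: str,
    obj: str,
    annotations: Dict[str, str],
    statement_id: str = "_:s1",
) -> List[Triple]:
    """Represent one annotated edge as triples about a statement node."""
    triples: List[Triple] = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    # Every annotation becomes another triple hanging off the blank node.
    triples.extend((statement_id, key, value) for key, value in annotations.items())
    return triples


triples = reify(
    "ex:YFG1",
    "ex:increases",
    "ex:YFG2",
    {"ex:citation": "PubMed:12345", "ex:ExampleAnnotation1": "Example Value 11"},
)
print(len(triples))
```

A single BEL edge with two annotations thus becomes six triples around a blank node, whereas a graph data structure with edge attributes would store it as one edge with an attached dictionary.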