Overview
Background on Systems Biology Modeling
Biological Expression Language (BEL)
Biological Expression Language (BEL) is a domain-specific language that enables the expression of complex molecular relationships and their context in a machine-readable form. Its simple grammar and expressive power have led to its successful use to describe complex disease networks with several thousands of relationships. For a detailed explanation, see the BEL 1.0, 2.0, and 2.0+ specifications.
BEL Community Links
BEL Community Portal
BEL Google Group
Design Considerations
Missing Namespaces and Improper Names
The use of openly shared controlled vocabularies (namespaces) within BEL facilitates the exchange and consistency of information. Finding the correct namespace:name pair is often a difficult part of the curation process.
Outdated Namespaces
BEL provides a variety of namespaces covering each of the BEL function types. Selventa used to provide BEL namespace files generated by the deprecated project at https://github.com/OpenBEL/resource-generator and hosted at the abandoned website http://www.belframework.org/. Newer versions of these namespaces can be found at https://github.com/pharmacome/conso/tree/master/external.
Generating New Namespaces
In some cases, it is appropriate to design a new namespace, using the custom namespace specification provided by the OpenBEL Framework. Packages for generating namespace, annotation, and knowledge resources have been grouped in the Bio2BEL organization on GitHub.
Synonym Issues
Due to the huge number of terms across many namespaces, it’s difficult for curators to know the domain-specific synonyms that obscure the controlled/preferred term. However, the issue of synonym resolution and semantic searching has already been generally solved by the use of ontologies. Besides a controlled vocabulary, they also provide a hierarchical model of knowledge, synonyms with cross-references to databases and other ontologies, and other information for semantic reasoning. Ontologies in the biomedical domain can be found at OBO and EMBL-EBI OLS.
Additionally, as a tool for curators, the EMBL-EBI Ontology Lookup Service (OLS) allows for semantic searching. Simple queries for the terms ‘mitochondrial dysfunction’ and ‘amyloid beta-peptides’ immediately returned results from relevant ontologies, and ended a long debate over how to represent these objects within BEL. EMBL-EBI also provides a programmatic API to the OLS, for searching terms (http://www.ebi.ac.uk/ols/api/search?q=folic%20acid) and suggesting resolutions (http://www.ebi.ac.uk/ols/api/suggest?q=folic+acid).
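As a minimal sketch, the two query URLs above can be built with Python's standard library. The helper name build_ols_url is hypothetical; only the base URL, the search and suggest endpoints, and the q parameter come from the example links. Fetching and decoding the response is left out, because the response format is not described here.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example links above.
OLS_API = "http://www.ebi.ac.uk/ols/api"


def build_ols_url(endpoint: str, query: str) -> str:
    """Build an OLS query URL for the 'search' or 'suggest' endpoint.

    Hypothetical helper for illustration; not part of PyBEL or OLS.
    """
    if endpoint not in {"search", "suggest"}:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{OLS_API}/{endpoint}?{urlencode({'q': query}, quote_via=quote)}"


search_url = build_ols_url("search", "folic acid")
suggest_url = build_ols_url("suggest", "folic acid")
```

Both "%20" and "+" are accepted encodings of a space in a query string, so the suggest URL built here is equivalent to the one shown above.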
Implementation
PyBEL is implemented using the PyParsing module. It provides flexibility and incredible speed in parsing compared to a regular expression implementation. It also allows for the addition of parsing action hooks, which allow the graph to be checked semantically at compile time.
It uses SQLite to provide a consistent and lightweight caching system for external data, such as namespaces, annotations, and ontologies, and SQLAlchemy to provide a cross-platform interface. The same data management system is used to store graphs for high-performance querying.
Extensions to BEL
The PyBEL compiler is fully compliant with both BEL v1.0 and v2.0 and automatically upgrades legacy statements. Additionally, PyBEL includes several additions to the BEL specification to enable expression of important concepts in molecular biology that were previously missing and to facilitate integrating new data types. A short example is the inclusion of protein oxidation in the default BEL namespace for protein modifications. Other, more elaborate additions are outlined below.
Syntax for Epigenetics
PyBEL introduces the gene modification function, gmod(), as a syntax for encoding epigenetic modifications. Its usage mirrors the pmod() function for proteins and includes arguments for methylation.
For example, the methylation of NDUFB6 was found to be negatively correlated with its expression in a study of insulin resistance and Type II diabetes. This can now be expressed in BEL, as in the following statement:
g(HGNC:NDUFB6, gmod(Me)) negativeCorrelation r(HGNC:NDUFB6)
Note
This syntax is currently under consideration as BEP-0006.
Definition of Namespaces as Regular Expressions
BEL imposes the constraint that each identifier must be qualified with an enumerated namespace to enable semantic interoperability and data integration. However, enumerating a namespace with potentially billions of names, such as dbSNP, poses a computational issue. PyBEL introduces syntax for defining namespaces with a consistent pattern using a regular expression to overcome this issue. For these namespaces, semantic validation can be performed in post-processing against the underlying database. The dbSNP namespace can be defined with a syntax familiar from BEL annotation definitions with regular expressions as follows:
DEFINE NAMESPACE dbSNP AS PATTERN "rs[0-9]+"
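A minimal sketch of what pattern-based validation amounts to, using Python's re module. The PATTERN_NAMESPACES dictionary and the is_valid_name helper are hypothetical and do not reflect PyBEL's internals; only the dbSNP pattern comes from the definition above.

```python
import re

# Namespace keyword -> compiled pattern.
# Only the dbSNP entry comes from the text; the structure is an assumption.
PATTERN_NAMESPACES = {"dbSNP": re.compile(r"rs[0-9]+")}


def is_valid_name(namespace: str, name: str) -> bool:
    """Return True if the whole name matches the namespace's pattern."""
    pattern = PATTERN_NAMESPACES.get(namespace)
    # fullmatch rejects names with trailing characters, such as "rs12a".
    return pattern is not None and pattern.fullmatch(name) is not None


well_formed = is_valid_name("dbSNP", "rs123456")
malformed = is_valid_name("dbSNP", "rs12a")
```

This check is purely syntactic. Whether a well-formed identifier actually exists in dbSNP still has to be checked against the underlying database in post-processing.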
Note
This syntax was proposed with BEP-0005 and has been officially accepted as part of the BEL 2.1 specification.
Definition of Resources using OWL
Versions of PyBEL up to 0.11.2 had an alternative namespace definition. Now it is recommended to either generate namespace files with reproducible build scripts following the Bio2BEL framework, or to directly add them to the database with the Bio2BEL bio2bel.manager.namespace_manager.NamespaceManagerMixin extension.
Things to Consider
Do All Statements Need Supporting Text?
Yes! All statements must be minimally qualified with a citation and evidence (now called SupportingText in BEL 2.0) to maintain provenance. Statements without evidence can’t be traced to their source or evaluated independently from the curator, so they are excluded.
Multiple Annotations
Each single-valued annotation is treated as a single-element set. When multiple annotations are present, all are unioned and attached to a given edge.
SET Citation = {"PubMed","Example Article","12345"}
SET ExampleAnnotation1 = {"Example Value 11", "Example Value 12"}
SET ExampleAnnotation2 = {"Example Value 21", "Example Value 22"}
p(HGNC:YFG1) -> p(HGNC:YFG2)
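One way to picture the result is a mapping from each annotation keyword to a set of values, all attached to the single edge produced by the statement. This is a plain-Python sketch; the normalize_annotations helper and the dictionary layout are assumptions for illustration, not PyBEL's internal edge format.

```python
def normalize_annotations(annotations):
    """Turn every annotation value into a set.

    A single string value becomes a one-element set. Hypothetical helper.
    """
    normalized = {}
    for keyword, values in annotations.items():
        if isinstance(values, str):
            values = [values]
        normalized[keyword] = set(values)
    return normalized


# Illustrative edge for the statement above; the keys are assumptions.
edge = {
    "source": "p(HGNC:YFG1)",
    "relation": "->",
    "target": "p(HGNC:YFG2)",
    "annotations": normalize_annotations({
        "ExampleAnnotation1": ["Example Value 11", "Example Value 12"],
        "ExampleAnnotation2": ["Example Value 21", "Example Value 22"],
    }),
}
```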
Namespace and Annotation Name Choices
*.belns and *.belanno configuration files include an entry called “Keyword” in their respective [Namespace] and [AnnotationDefinition] sections. To maintain understandability between BEL documents, PyBEL warns when the names given in *.bel documents do not match their respective resources. For now, capitalization is not considered, but in the future, PyBEL will also warn when capitalization is not properly stylized, like forgetting the lowercase ‘h’ in “ChEMBL”.
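The check described above can be sketched with Python's configparser, assuming the resource header is INI-like. The sample file content and the read_keyword and check_keyword helpers are hypothetical; PyBEL's own implementation may differ.

```python
from configparser import ConfigParser

# Hypothetical excerpt of a *.belns file; real files contain more entries.
SAMPLE_BELNS = """\
[Namespace]
Keyword=ChEMBL
"""


def read_keyword(text, section="Namespace"):
    """Read the Keyword entry from the given section of a resource file."""
    parser = ConfigParser()
    parser.read_string(text)
    return parser[section]["Keyword"]


def check_keyword(name_in_document, resource_text):
    """Classify a name used in a BEL document against the resource keyword.

    Returns "ok", "capitalization", or "mismatch".
    """
    keyword = read_keyword(resource_text)
    if name_in_document == keyword:
        return "ok"
    if name_in_document.lower() == keyword.lower():
        return "capitalization"
    return "mismatch"
```

Under the current behavior only the "mismatch" case would produce a warning. The "capitalization" case corresponds to the planned stricter check.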
Why Not Nested Statements?
BEL has different relationships for modeling direct and indirect causal relations.
Direct
A => B means that A directly increases B through a physical process.
A =| B means that A directly decreases B through a physical process.
Indirect
The relationship between two entities can be coded in BEL, even if the process is not well understood.
A -> B means that A indirectly increases B. There are hidden elements in X that mediate this interaction through a pathway of direct interactions A (=> or =|) X_1 (=> or =|) ... X_n (=> or =|) B, or through a set of multiple pathways that constitute a network.
A -| B means that A indirectly decreases B. Like for A -> B, this process involves hidden components with varying activities.
Increasing Nested Relationships
BEL also allows the object of a relationship to be another statement.
A => (B => C) means that A increases the process by which B increases C. The example in the BEL specification, p(HGNC:GATA1) => (act(p(HGNC:ZBTB16)) => r(HGNC:MPL)), represents GATA1 directly increasing the process by which ZBTB16 directly increases MPL. Before, directly increasing was used to specify physical contact, so it’s reasonable to conclude that p(HGNC:GATA1) => act(p(HGNC:ZBTB16)). The specification cites examples when B is an activity that only is affected in the context of A and C. This is complicated enough that it is both impractical to standardize during curation and impractical to represent in a network.
A -> (B => C) can be interpreted by assuming that A indirectly increases B, and because of monotonicity, concluding that A -> C as well.
A => (B -> C) is more difficult to interpret, because it does not describe which part of process B -> C is affected by A or how. Is it that A => B, and B => C, so we conclude A -> C, or does it mean something else? Perhaps A impacts a different portion of the hidden process in B -> C. These statements are ambiguous enough that they should be written as just A => B, and B -> C. If there is no literature evidence for the statement A -> C, then it is not the job of the curator to make this inference. Identifying statements of this type might be the goal of a bioinformatics analysis of the BEL network after compilation.
A -> (B -> C) introduces even more ambiguity, and it should not be used.
A => (B =| C) states A increases the process by which B decreases C. One interpretation of this statement might be that A => B and B =| C. An analysis could infer A -| C. Statements in the form of A -> (B =| C) can also be resolved this way, but with added ambiguity.
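An inference such as concluding A -| C from A => B and B =| C follows a simple sign rule: two relations with the same polarity compose to an increase, and two with opposite polarity compose to a decrease. The sketch below encodes that rule for a post-compilation analysis. The compose helper is hypothetical, and treating every composed relation as indirect is an assumption made for illustration.

```python
# Polarity of each causal relation: +1 for increases, -1 for decreases.
POLARITY = {"=>": 1, "->": 1, "=|": -1, "-|": -1}


def compose(rel_ab: str, rel_bc: str) -> str:
    """Infer the relation between A and C from A rel_ab B and B rel_bc C."""
    sign = POLARITY[rel_ab] * POLARITY[rel_bc]
    # The inferred relation passes through B, so it is reported as indirect.
    return "->" if sign > 0 else "-|"


inferred = compose("=>", "=|")
```

Such inferred edges are the product of an analysis, not of curation, so they should be kept apart from statements backed by literature evidence.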
Decreasing Nested Relationships
While we could agree on usage for the previous examples, the decrease of a nested statement introduces an unreasonable amount of ambiguity.
A =| (B => C) could mean A decreases B, and B also increases C. Does this mean A decreases C, or does it mean that C is still increased, but just not as much? Which of these statements takes precedence? Or do their effects cancel? The same can be said about A -| (B => C), and with added ambiguity for indirect increases A -| (B -> C).
A =| (B =| C) could mean that A decreases B and B decreases C. We could conclude that A increases C, or could we again run into the problem of not knowing the precedence? The same is true for the indirect versions.
Recommendations for Use in PyBEL
We consider the ambiguity of nested statements to be a great risk to clarity, so PyBEL disables the usage of nested statements by default. See the Input and Output section for different parser settings. At Fraunhofer SCAI, curators resolved these statements to single statements to improve the precision and readability of our BEL documents.
While most statements in the form A rel1 (B rel2 C) can be reasonably expanded to A rel1 B and B rel2 C, the few that cannot are the difficult-to-interpret cases that we need to be careful about in our curation and later analyses.
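The expansion of A rel1 (B rel2 C) into A rel1 B and B rel2 C can be sketched with a regular expression. This is not a BEL parser: the expand_nested helper is hypothetical, handles only the four causal arrows discussed above, and assumes one level of nesting.

```python
import re

# The four causal relations discussed above.
RELATION = r"=>|=\||->|-\|"

# Matches "A rel1 (B rel2 C)" with exactly one nested statement.
NESTED = re.compile(
    rf"^(?P<a>.+?)\s+(?P<rel1>{RELATION})\s+"
    rf"\((?P<b>.+?)\s+(?P<rel2>{RELATION})\s+(?P<c>.+)\)$"
)


def expand_nested(statement: str) -> list:
    """Split "A rel1 (B rel2 C)" into ["A rel1 B", "B rel2 C"]."""
    match = NESTED.match(statement.strip())
    if match is None:
        raise ValueError(f"not a nested statement: {statement!r}")
    a, rel1, b, rel2, c = match.group("a", "rel1", "b", "rel2", "c")
    return [f"{a} {rel1} {b}", f"{b} {rel2} {c}"]


expanded = expand_nested(
    "p(HGNC:GATA1) => (act(p(HGNC:ZBTB16)) => r(HGNC:MPL))"
)
```

For the GATA1 example this yields p(HGNC:GATA1) => act(p(HGNC:ZBTB16)) and act(p(HGNC:ZBTB16)) => r(HGNC:MPL). Whether that expansion preserves the intended meaning remains a curation decision, especially for the decreasing cases.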
Why Not RDF?
Current bel2rdf serialization tools build URLs with the OpenBEL Framework domain as a namespace, rather than respecting the original namespaces of the entities. This does not follow the best practices of the semantic web, where URLs representing an object point to a real page with additional information. For example, UniProt does an exemplary job of this. Ultimately, using non-standard URLs makes harmonization and data integration difficult.
Additionally, the RDF format does not easily allow for the annotation of edges. A simple statement in BEL that one protein up-regulates another can be easily represented in a triple in RDF, but when the annotations and citation from the BEL document need to be included, this forces RDF serialization to use approaches like representing the statement itself as a node. RDF was not intended to represent this type of information, but more properly for locating resources (hence its name). Furthermore, many blank nodes are introduced throughout the process. This makes RDF incredibly difficult to understand or work with. Later, writing queries in SPARQL becomes very difficult because the data format is complicated and the language is limited. For example, it would be incredibly complicated to write a query in SPARQL to get the objects of statements from publications by a certain author.