N. D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Moscow, Russia
KEYWORDS: glycoinformatics, structure, glycomics, bacterial carbohydrates, bibliography, taxonomy, database, CSDB
Glycoconjugate Journal, 2013, v.30, p.347-349
Nowadays, it is impossible to navigate the accumulated volume of glyco-related information without dedicated informatics tools. The progress of glycobiology therefore strongly depends on the availability of an information environment that includes data on the structures, properties and functions of carbohydrates, as well as on the taxonomy and properties of their biological sources. The main approach to creating such an environment is the development of carbohydrate databases. In contrast to genomics and proteomics, the informatization of glycomics still suffers from incompatibility between the existing projects. In this mini-review I report a comparative analysis of the currently active carbohydrate databases of worldwide scale, with emphasis on the Russian Carbohydrate Structure Databases [1] (CSDBs).
Glyco-databases providing wide coverage are in the highest demand; among them are the meta-database GlycomeDB, GLYCOSCIENCES.de, GlycoSuiteDB, CFG Glycan Database, KEGG, JCGGDB, GlycoBase-Dublin, UniCarbKB, Bacterial&Archaeal CSDB, Plant&Fungal CSDB (currently in development) and others. Historically the first carbohydrate database, CCSD (Carbbank), claimed complete coverage of the structures published before 1996, when its support ceased. Collecting and digitizing primary data are the most time-consuming stages of database development, and therefore almost all modern projects use the Carbbank data in some way.
Analysis of the distinctive features of various projects makes it possible to establish criteria for database evaluation: types of data stored, completeness of coverage, data quality, functions provided to users, interface (usability, stability and performance), integration with other projects, and database architecture. Although the last criterion is invisible to users, it has the strongest impact on the usefulness of a database, since architectural mistakes hamper maintainability, upgradability and error control, and continuously increase the cost of a project.
The minimal types of data stored and processed in a glyco-database are the primary molecular structure and its taxonomical and bibliographic annotations. Many databases also store analytical data, such as NMR or MS spectra. Storage of biochemical, genetic, medical and other related data is often supported, but the coverage of such data remains poor. Among the major carbohydrate databases, KEGG lacks taxonomical annotations and GlycomeDB lacks bibliographic annotations. The databases that store NMR spectra provide spectral coverage of 5-25% of the published data.
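As a minimal sketch of the three mandatory data types named above, a record might be modelled as follows. The class name, the field names and the sample values are hypothetical and do not reflect the schema of any of the databases mentioned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlycanRecord:
    """Hypothetical minimal record: a structure plus two kinds of annotation."""
    structure: str                                          # primary structure in some linear notation
    taxa: List[str] = field(default_factory=list)           # taxonomical annotations
    publications: List[str] = field(default_factory=list)   # bibliographic annotations
    nmr_spectrum: Optional[List[float]] = None              # optional analytical data

    def missing_annotations(self) -> List[str]:
        """Return the names of the mandatory annotation types that are empty."""
        missing = []
        if not self.taxa:
            missing.append("taxonomy")
        if not self.publications:
            missing.append("bibliography")
        return missing


# Placeholder values for illustration only.
record = GlycanRecord(structure="<structure string>", taxa=["<organism name>"])
print(record.missing_annotations())  # ['bibliography']
```

A check of this kind makes gaps such as a missing taxonomical or bibliographic annotation visible per record rather than per database.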
Higher coverage significantly increases the value of a database, since even a negative answer to a search request then constitutes valuable scientific information. The limited potential for automating the search for suitable publications restricts the acquisition of primary data and, therefore, the coverage. At present, only the Bacterial CSDB reports almost complete (>75%) structural coverage within a chosen compound class, namely glycans from prokaryotic microorganisms. Since 2005, it has accumulated ca. 10 000 structures assigned to 5 000 microorganisms from 4 000 publications [3]. The newly established Plant&Fungal CSDB aims to achieve this level of coverage in 2016 and is currently being populated with re-annotated data published before 1996.
To keep the coverage up to date, periodic updates are needed, which implies a lag of one or two years between publication and deposition of data. The universal solution is to require that every structure be uploaded to a database prior to publication, with the obtained IDs subsequently provided to the journal's editorial office. Such an approach has long been practiced in genomics but is still missing in glycomics. One of the reasons is the insufficient standardization of glycan description languages, which stems from the high chemical variability of carbohydrates. This problem, as well as limited cross-project compatibility, can be overcome by a source-independent data framework such as RDF. Several databases, including CSDB, GlycomeDB, UniCarbKB and others, export data as RDF triples according to an experimental version of the GlycanRDF ontology formulated at BioHackathon 2012 in Japan.
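To illustrate what export as RDF triples amounts to, the sketch below serializes one record as N-Triples lines using only string formatting. The namespace `http://example.org/glyco#` and the predicate names are placeholders invented for this example; they are not terms of the GlycanRDF ontology, and the record content is a placeholder as well.

```python
from typing import Dict, Iterable, List

# Hypothetical namespace for illustration; NOT the GlycanRDF ontology.
NS = "http://example.org/glyco#"


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_triples(record_id: str, record: Dict[str, Iterable[str]]) -> List[str]:
    """Turn one record into N-Triples lines of the form: <subject> <predicate> "object" ."""
    subject = f"<{NS}record/{record_id}>"
    lines = []
    for predicate, values in record.items():
        for value in values:
            lines.append(f'{subject} <{NS}{predicate}> "{escape_literal(value)}" .')
    return lines


# Placeholder record; keys become (hypothetical) predicate names.
triples = record_to_triples(
    "1",
    {"structure": ["<structure string>"], "taxon": ["<organism name>"]},
)
for line in triples:
    print(line)
```

Because every statement is a self-contained subject-predicate-object line, a consumer needs no knowledge of the exporting database's internal schema, only of the shared vocabulary.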
Data entry can hardly be automated, not only at the level of publication selection but also at the level of article interpretation. As a result, all chemical and biological databases contain errors. These errors originate from (in descending order of occurrence): annotators' mistakes, other imported databases, original publications, architectural inconsistencies, and faults in import and auto-annotation software. According to our investigation [2], most records in Carbbank contain errors, and more than one third of the records contain two or more errors. The most abundant error type is an incorrect taxonomical annotation of a structure. Significant gaps in the Carbbank coverage were also discovered. As most of the modern projects use the Carbbank data, these errors migrate from database to database. Some of them can be detected, and sometimes corrected, automatically. Such control is present in a number of databases; however, only a retrospective expert analysis of publications can provide really high data quality. Two thirds of the CSDB budget is devoted to manual literature processing.
Database functionality is the capability to process various search requests and to combine and refine them using logical operations and other types of queries. For example: "find all structures published from 2001 to 2005 that contain either an α-Gal(1-->3)KDO fragment or a monosaccharide-bound lysine or alanine, except synthetic structures or those found in gamma-proteobacteria, and display their 13C NMR spectra". In contrast to a search for bibliography, taxonomy, keywords, text fragments and similar data, a search for structural fragments in bigger molecules (as well as for structures or spectra resembling a specified one) requires more meticulous programming and more computational power, which makes the internal database architecture critical for the performance of such queries. In the mid-2000s, the developers of GLYCOSCIENCES.de formulated the "Ten golden rules of carbohydrate database development", which summarized the experience of the German and Russian groups. The key points of this document include the use of a connection table for internal structure representation, the maximum possible indexing, a minimum of free-text data (which, regretfully, are present in virtually every project), and unambiguous controlled vocabularies for as many data types as possible. An attempt to separate the monosaccharide vocabulary from glyco-databases was made within MonosaccharideDB. At present it provides full coverage of the monosaccharides present in mammalian glycans.
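The way such a compound request decomposes into combinable conditions can be sketched as follows. This is an illustration under stated assumptions, not the query engine of any database discussed here: the `Record` class, its fields, the fragment labels `FRAG_A` and `FRAG_B` and the sample records are all hypothetical, and the fragment test is a naive substring match standing in for a real substructure search.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Record:
    """Hypothetical flattened record used only for this illustration."""
    structure: str
    year: int
    lineage: List[str] = field(default_factory=list)
    synthetic: bool = False
    c13_spectrum: Optional[List[float]] = None


Predicate = Callable[[Record], bool]


def published_between(first: int, last: int) -> Predicate:
    return lambda r: first <= r.year <= last


def contains_fragment(fragment: str) -> Predicate:
    # Naive substring test; a stand-in for a real substructure search.
    return lambda r: fragment in r.structure


def from_taxon(name: str) -> Predicate:
    return lambda r: name in r.lineage


def is_synthetic(r: Record) -> bool:
    return r.synthetic


def any_of(*preds: Predicate) -> Predicate:
    return lambda r: any(p(r) for p in preds)


def all_of(*preds: Predicate) -> Predicate:
    return lambda r: all(p(r) for p in preds)


def negate(pred: Predicate) -> Predicate:
    return lambda r: not pred(r)


# "Published 2001-2005, contains fragment A or B, except synthetic or from a given taxon."
query = all_of(
    published_between(2001, 2005),
    any_of(contains_fragment("FRAG_A"), contains_fragment("FRAG_B")),
    negate(any_of(is_synthetic, from_taxon("Gammaproteobacteria"))),
)

# Placeholder records; structures are arbitrary labels, not real glycans.
records = [
    Record("FRAG_A-X", 2003, ["Bacteria", "Alphaproteobacteria"]),
    Record("FRAG_B-Y", 2004, ["Bacteria", "Gammaproteobacteria"]),
    Record("FRAG_A-Z", 1999, ["Bacteria"]),
    Record("FRAG_B-W", 2005, [], synthetic=True),
]

hits = [r for r in records if query(r)]
print([r.structure for r in hits])  # ['FRAG_A-X']
```

The bibliographic, taxonomic and status conditions are cheap field comparisons, whereas the fragment condition hides the expensive part, which is why the internal structure representation dominates query performance.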
Correct processing of structural data depends directly on the format of both the internal and the user-facing structure descriptions. The limited capabilities and mutual incompatibility of glycan description languages have been limiting the progress of glycoinformatics for decades. The main criteria of carbohydrate language efficiency are: (1) unambiguity and uniqueness of every chemically distinct structure, including non-carbohydrate moieties; (2) support of all structural features of carbohydrates and glycoconjugates (single and multiple repeats, cyclic and combined glycans, glycolipids, glycoproteins, non-carbohydrate and atypical constituents, phospho- and sulfo-linkages, cyclic esters, amide and ether linkages, etc.); (3) support of underdetermined structures at the level of monomers and their configurations, stoichiometry, substitution positions, and chain topology; (4) computer-readability with no need for ambiguous parsing, as in the case of extended IUPAC, and human-readability, which is required for tracking errors that appear during human processing of data dumps; (5) compatibility with other formats (presence of converters that help language learning and cross-database operations), e.g. use of a monomer vocabulary widely recognized by glycobiologists. The CSDB Linear and GlycoCT languages possess most of these features. However, the former does not support nested repeats and has limited aglycon support, and the latter is not human-readable and supports carbohydrate moieties only. Glycomics still lacks a standard language other than IUPAC nomenclature, which is highly imperfect.
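The uniqueness requirement of the first criterion can be illustrated with a toy canonicalization: a branched structure is stored as a tree, and its branches are serialized in sorted order so that the same structure always yields the same string regardless of the order in which branches were entered. The notation produced here is invented for this example and is neither CSDB Linear nor GlycoCT; anomeric configuration, ring size, repeats and non-carbohydrate moieties are deliberately ignored, and the residue arrangement is arbitrary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Residue:
    """Node of a hypothetical residue tree."""
    name: str
    # Each child is (substitution position on this residue, substituent residue).
    children: List[Tuple[int, "Residue"]] = field(default_factory=list)


def canonical(residue: Residue) -> str:
    """Serialize a residue tree so that child order does not affect the result."""
    branches = sorted(
        f"[{canonical(child)}({position})]" for position, child in residue.children
    )
    return "".join(branches) + residue.name


# The same branched structure entered with its branches in two different orders.
a = Residue("Glc", [(3, Residue("Gal")), (6, Residue("Man"))])
b = Residue("Glc", [(6, Residue("Man")), (3, Residue("Gal"))])

print(canonical(a))                   # [Gal(3)][Man(6)]Glc
print(canonical(a) == canonical(b))   # True
```

With a canonical string available, identity of two records reduces to string equality, which can be indexed; without it, every comparison requires a tree match.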
The functionality can be extended by carbohydrate-related services, such as conformation map simulation, spectrum prediction, search for structural motifs, etc. The CSDB project provides a generator of structural variants restricted by user-defined constraints derived from simple experiments (e.g. the number of residues per repeat). This approach is a gateway to theoretical structure elucidation and ranking based on experimental data, and can be applied to this or to any other glycan database. Every structure in the generator output is subjected to either database-driven averaging or fully theoretical simulation of its properties. Among these properties are the NMR spectra. As implemented in CSDB, 13C NMR simulation combines environment-dependent database search and statistical processing with empirical incremental prediction of chemical shifts. To find a chemical shift in the database, a twelve-step generalization of the atomic environment is applied. Like other related projects (CASPER, BIOPSEL), this feature is aimed at dramatically simplifying structural studies of natural carbohydrates.
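The idea of looking up a chemical shift under a progressively generalized atomic environment can be sketched as below. This covers only the database-search-and-averaging part and is not the CSDB implementation: the twelve steps are not listed in the text, so a hypothetical three-level environment is used, generalization is modelled simply as dropping the least significant descriptor, and the numbers in the table are arbitrary placeholders rather than real chemical shifts.

```python
from statistics import mean
from typing import Dict, Iterator, List, Optional, Tuple

# An atomic environment as a tuple of descriptors, most significant first.
# Hypothetical scheme for illustration only.
Environment = Tuple[str, ...]


def generalizations(env: Environment) -> Iterator[Environment]:
    """Yield the environment from most to least specific by dropping trailing descriptors."""
    for size in range(len(env), 0, -1):
        yield env[:size]


def predict_shift(
    env: Environment,
    table: Dict[Environment, List[float]],
    min_hits: int = 2,
) -> Optional[Tuple[float, int]]:
    """Average the stored values at the most specific level that has enough hits.

    Returns (mean value, number of descriptors used), or None if no level qualifies.
    """
    for key in generalizations(env):
        hits = table.get(key, [])
        if len(hits) >= min_hits:
            return mean(hits), len(key)
    return None


# Placeholder table, assumed to be pre-aggregated for every generalization level.
# The values are arbitrary and are NOT real chemical shifts.
table = {
    ("res", "C1", "sub3"): [100.0],
    ("res", "C1"): [100.0, 102.0],
}

result = predict_shift(("res", "C1", "sub3"), table)
print(result)  # (101.0, 2)
```

Returning the level at which the match was found alongside the averaged value gives a rough indication of how much the environment had to be generalized, and hence of how far the estimate can be trusted.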
The modern quality standard in informatics implies that both user and administrative interfaces are intuitive, well-documented and freely accessible via the Internet. Intuitiveness extends to structure input and output formats, which should be comprehensible to users without special training. Standalone services for structure input and visual editing are extremely useful, as they allow users of any database to stay within the interface they are used to. Cross-project integration implies not only a common interface for search requests but also automated data interchange via an API. This concerns interactions with non-carbohydrate databases as well, at least with NCBI Taxonomy and NCBI PubMed. The first two projects that reported protocols for automated data exchange were GLYCOSCIENCES.de and Bacterial CSDB, and since then the development of glyco-related web services has intensified.
Within the Bacterial CSDB project, we attempted to develop a database architecture, and to implement it in software, free of the disadvantages of other glyco-databases. Since 2006, Bacterial & Archaeal CSDB has been maintained and regularly updated. In 2012, we started its expansion to plant and fungal carbohydrates, the penultimate step towards the creation of a complete database of natural glycans, which, we hope, will conceptually replace Carbbank.