Carbohydrate databases in the recent decade
presentation slides

Download the presentation as high-resolution PDF (6.0 Mb)

Annotation

Today it is impossible to navigate the accumulated volume of glyco-related information without dedicated informatics tools. The progress of glycobiology therefore depends strongly on an information environment that includes data on the structures, properties and functions of carbohydrates, as well as on the taxonomy and properties of their biological sources. The main approach to creating such an environment is the development of carbohydrate databases (DBs). In contrast to genomics and proteomics, the informatization of glycomics is still in the making. Existing projects are oriented toward particular problems and are not fully compatible with each other, in either coverage or data format.

Glyco-databases with wide coverage are in greatest demand. Among them are GlycomeDB (meta-database), GLYCOSCIENCES.de (imported Carbbank + selected mammalian glycans + NMR data), GlycoSuiteDB (mammalian O- and N-glycans), CFG Glycan Database (mammalian glycans from Carbbank and Glycominds), KEGG (mainly imported Carbbank), GlycoBase-Dublin (N-glycans + MS data), GlycoBase-Lille (amphibian glycans + NMR data), ECODAB (E. coli O-antigens), and Bacterial CSDB (bacterial glycans + NMR data).

Historically, the first glyco-database was Carbbank, which claimed complete coverage of the structures published before 1996, when its support ceased. Collecting and digitizing primary data are the most time-consuming stages of database development, and therefore almost all modern projects use the Carbbank data and underlying concepts in some way.

Analysis of the distinctive features of the various projects makes it possible to establish criteria for database evaluation: types of data stored, completeness of coverage, data quality, functions provided to users, user interface (including stability and performance), integration with other projects, and, indirectly, database architecture.

At a minimum, a glyco-database stores and processes primary molecular structures together with taxonomic and bibliographic annotations. Many databases also store experimental analytical data, such as NMR or MS spectra. Storage of biochemical, genetic, medical and other related data is often supported, but the coverage of such data remains poor. Among the major carbohydrate databases, KEGG lacks taxonomic annotations and GlycomeDB lacks bibliographic annotations. Databases that store NMR spectra cover 5-25% of the published NMR data.
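
The minimal record described above can be sketched as a small data structure. This is an illustration only: the class name GlycanRecord, its field names and the helper missing_core_fields are assumptions and do not reproduce the schema of any database mentioned here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GlycanRecord:
    """Hypothetical minimal record: structure plus taxonomic and bibliographic annotation."""
    structure: str                      # primary structure in some linear notation
    taxon: Optional[str] = None         # biological source
    citation: Optional[str] = None      # bibliographic reference
    spectra: Tuple[str, ...] = ()       # optional analytical data (e.g. NMR, MS)


def missing_core_fields(record: GlycanRecord) -> List[str]:
    """Return the names of the core annotations that are absent from a record."""
    required = {
        "structure": record.structure,
        "taxon": record.taxon,
        "citation": record.citation,
    }
    return [name for name, value in required.items() if not value]


# Placeholder values, not real data.
example = GlycanRecord(structure="<structure string>", citation="<reference>")
print(missing_core_fields(example))
```

A check of this kind makes gaps such as a missing taxonomic or bibliographic annotation visible per record.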

Higher coverage significantly increases the value of a database, since even a negative answer to a search request is then valuable scientific information. The limited potential for automating the search for suitable publications restricts the acquisition of primary data and, therefore, the coverage. At present only Bacterial CSDB claims complete (>80%) structural coverage within a chosen compound class (bacterial glycans). To keep the coverage up to date, periodic updates are needed. We consider a two-year time lag between publication and deposition of data acceptable. A universal solution for keeping the data current is to require that every published structure be uploaded to a database prior to publication, with the obtained IDs then provided to the journal's editorial office. Such an approach has long been in place in genomics but is still missing in glycomics. One of the reasons is the insufficient standardization of glycan description languages, which stems from the high chemical variability of carbohydrates.

Data entry can hardly be automated, not only at the level of publication selection but also at the level of article interpretation. As a result, all chemical and biological databases contain errors. In decreasing order of occurrence, these errors originate from annotators' mistakes, are transferred from other databases, are copied from original publications, or arise from inconsistencies in DB architecture and software bugs in importers and auto-annotators. According to our investigation, most records in Carbbank contain errors, and more than one third of the records contain two or more. The most common error type is an incorrect taxonomic annotation of a structure. Significant gaps in the Carbbank coverage were also discovered. As most modern projects use the Carbbank data, these errors are being reproduced. Some of them can be detected, and sometimes corrected, automatically. Such checks are implemented in a number of databases; however, only retrospective expert analysis of the publications can provide really high data quality.
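
As an illustration of automatic detection and correction, the sketch below checks a taxonomic annotation against a controlled list. The contents of KNOWN_TAXA and ALIASES and the function check_taxon are hypothetical; a real check would rely on an external taxonomy such as NCBI Taxonomy rather than on a hard-coded list.

```python
from typing import Optional, Tuple

# Hypothetical controlled vocabulary and alias table (illustrative values only).
KNOWN_TAXA = {"Escherichia coli", "Shigella flexneri"}
ALIASES = {"E. coli": "Escherichia coli"}


def check_taxon(raw: str) -> Tuple[Optional[str], str]:
    """Return (normalized name or None, status) for a raw taxon annotation."""
    name = " ".join(raw.split())  # collapse repeated or stray whitespace
    if name in KNOWN_TAXA:
        status = "ok" if name == raw else "corrected"
        return name, status
    if name in ALIASES:
        return ALIASES[name], "corrected"
    # Anything else cannot be resolved automatically and is left to an annotator.
    return None, "needs expert review"


for annotation in ["Escherichia coli", "E. coli", "Eschericia coli"]:
    print(annotation, "->", check_taxon(annotation))
```

The last annotation, a misspelling, falls through to manual review, which reflects the point made above: automatic checks catch only part of the errors.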

The functionality of a database is its capability to process various search requests, combine them using diverse logical operations, and refine the results with other types of queries. For example: "find all structures published from 2001 to 2005 that contain either an α-Gal(1→3)KDO fragment or a monosaccharide-bound lysine or alanine, except synthetic structures or those found in gamma-proteobacteria, and display their 13C NMR spectra". The functionality can be extended by carbohydrate-related services, such as conformation map simulation, spectra prediction, and search for structural motifs. In contrast to a search by bibliography, taxonomy, keywords, text fragments and similar data, a search for structures containing a specified fragment (or for structures or spectra resembling specified ones) requires more meticulous programming and more computational power, which makes the internal database architecture critical for the performance of structural queries. In the mid-2000s, the developers of GLYCOSCIENCES.de formulated the "Ten golden rules of carbohydrate database development", which summarized the experience of the German and Russian groups. The key points of this document include the use of a connection table for internal structure representation, the maximum possible indexing, a minimum of free-text data (which, regrettably, are present in every project), and unambiguous controlled vocabularies for as many data types as possible, monosaccharide names in the first place. An attempt to separate the monosaccharide vocabulary from glyco-databases was made within MonosaccharideDB. At present it provides full coverage of the monosaccharides present in mammalian glycans.
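
A minimal sketch of how search requests might be combined with logical operations is given below. It models each record with a toy connection table (a set of linkages) and builds the example query from small predicates. All names (Record, all_of, any_of, negate, has_linkage and so on), the residue labels and the sample records are assumptions made for illustration; no existing database is claimed to work this way.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# One linkage of a toy connection table: (child residue, bond, parent residue).
Linkage = Tuple[str, str, str]


@dataclass(frozen=True)
class Record:
    year: int
    taxon_class: str
    synthetic: bool
    linkages: FrozenSet[Linkage]


Predicate = Callable[[Record], bool]


def all_of(*preds: Predicate) -> Predicate:
    return lambda r: all(p(r) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    return lambda r: any(p(r) for p in preds)


def negate(pred: Predicate) -> Predicate:
    return lambda r: not pred(r)


def published_between(first: int, last: int) -> Predicate:
    return lambda r: first <= r.year <= last


def has_linkage(link: Linkage) -> Predicate:
    return lambda r: link in r.linkages


def has_residue(name: str) -> Predicate:
    # True if the residue occurs at either end of any linkage.
    return lambda r: any(name in (child, parent) for child, _, parent in r.linkages)


def from_taxon_class(name: str) -> Predicate:
    return lambda r: r.taxon_class == name


def is_synthetic(r: Record) -> bool:
    return r.synthetic


# Published 2001-2005 AND (aGal(1-3)Kdo OR Lys OR Ala)
# AND NOT (synthetic OR gamma-proteobacteria).
query = all_of(
    published_between(2001, 2005),
    any_of(has_linkage(("aGal", "1-3", "Kdo")), has_residue("Lys"), has_residue("Ala")),
    negate(any_of(is_synthetic, from_taxon_class("Gammaproteobacteria"))),
)

# Invented sample records.
r1 = Record(2003, "Gammaproteobacteria", False, frozenset({("aGal", "1-3", "Kdo")}))
r2 = Record(2004, "Betaproteobacteria", False, frozenset({("aGal", "1-3", "Kdo")}))
r3 = Record(2002, "Alphaproteobacteria", False, frozenset({("Ala", "2-6", "GalA")}))
r4 = Record(1999, "Betaproteobacteria", False, frozenset({("aGal", "1-3", "Kdo")}))
r5 = Record(2005, "Betaproteobacteria", True, frozenset({("Lys", "2-6", "GalA")}))

hits = [r for r in (r1, r2, r3, r4, r5) if query(r)]
print(len(hits))
```

Here the fragment test is an exact set lookup on single linkages. Matching larger fragments amounts to subgraph search over the connection table, which is why structural queries are far more demanding than the bibliographic or taxonomic filters.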

The ability to process structural data correctly depends directly on the formats of both the internal and the user-facing structure descriptions. The shortcomings and mutual incompatibility of glycan description languages have been limiting the progress of glycoinformatics for decades. These are the main criteria of carbohydrate language evaluation:

The CSDB Linear and GlycoCT languages possess most of these features. However, the former does not support some topologies, and the latter is not human-readable. Glycomics still lacks a standard language other than the IUPAC nomenclature, which is highly imperfect.

The modern quality standard in informatics implies that both user and administrative interfaces are intuitive, well documented and freely accessible via the Internet. Intuitiveness extends to structure input and output formats, which users should not have to study. Standalone services for structure input and editing are extremely useful, as they let users of any database stay within the interface they are used to. Integration between projects implies not only a common interface for search requests but also automated data interchange via an API. This also concerns interactions with non-carbohydrate databases, NCBI Taxonomy and NCBI PubMed at the least. The first two projects to report protocols for automated data exchange were GLYCOSCIENCES.de and Bacterial CSDB, and since then the development of glyco-related web services has intensified.

Two special projects should be mentioned. EurocarbDB was funded as a design study of a database free of any disadvantages; however, its development was limited to the design of approaches without their implementation (which is always a bottleneck because of the human factor) and to a Carbbank import. At the opposite end of the spectrum is the meta-database GlycomeDB, which does not provide its own data but imports and integrates the data of other databases. Thus GlycomeDB is a database of databases and allows cross-project operation within a single interface. Its disadvantages are the absence of bibliographic search, the limitation of structural scope to carbohydrate moieties only, and the time lag between updates of GlycomeDB and of its source databases.

Within the Bacterial CSDB project, started in 2005, we tried to develop a database architecture, and to implement it in software, free of the disadvantages of other glyco-databases. Since then Bacterial CSDB has been maintained and regularly updated. In 2012 we started its expansion to plant and fungal carbohydrates, the penultimate step toward a complete database of natural glycans, which, we hope, will become the conceptual successor to Carbbank.

Last update: 2013 jun 20