Phyl Toukach: Glyco databases

Carbohydrate databases in the recent decade
presentation slides


Annotation

Carbohydrates are one of the most chemically diverse classes of biomacromolecules. Since the role of protein glycosylation was revealed, interest in the role of glycans in cellular interactions has been growing continuously. Today, navigating the accumulated body of glyco-related information is impossible without dedicated informatics tools. Progress in glycobiology therefore strongly depends on the availability of an informatics environment that includes data on the structure, properties and functions of carbohydrates, linked to taxonomy and biological sources. The main way to create such an environment is through glycomic databases and predictive services built on top of them. In contrast to genomics and proteomics, structure identification and data exchange protocols in glycomics were standardized only recently, and this process is not yet finished. The projects in this new field (glycoinformatics) are not fully compatible with each other in coverage, data formats, or the services they provide to chemists, biologists, geneticists, and pharmacists. Each project is devoted to its own specific tasks; nevertheless, the trend toward integration is clear.

Glyco-databases providing wide coverage are in greatest demand; among them are GLYCOSCIENCES (imported Carbbank + selected mammalian glycans + NMR data), UniCarbKB (mammalian O- and N-glycans), KEGG Glycan (mainly imported Carbbank), the Carbohydrate Structure Database (bacterial, plant and fungal glycans + NMR data), and others. Specialized databases, such as GlycoBase-Dublin (N-glycans + MS data), GlycoBase-Lille (amphibian glycans + NMR data), ECODAB (E. coli O-antigens) and GlyTouCan (meta-repository), should also be noted. Historically, the first universal glycan database was CCSD (Carbbank), which claimed complete coverage of structures published before 1996, when its support ceased. Collecting and digitizing primary data are the most time-consuming stages of database development, and therefore almost all modern projects inherit the CCSD data and concepts in one way or another.

Analysis of the distinctive features of the various projects makes it possible to establish criteria for database evaluation: types of data deposited, completeness of coverage, data quality, functions provided to users (including stability and performance), user interface, integration with other projects, and, implicitly, database architecture.

The minimal types of data stored and processed in a glycan database are a primary molecular structure and its taxonomic and bibliographic annotations. Many databases store experimental analytical data, such as NMR or MS spectra. Storage of biochemical, genetic, medical and other related data is often supported, but the coverage of such data remains poor. Even some major carbohydrate databases lack taxonomic and bibliographic annotations in a considerable share of their records. Databases that store NMR spectra provide NMR coverage for 5-35% of the published sequences.
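
As a rough illustration of the minimal record described above, the sketch below models a structure together with its taxonomic and bibliographic annotations, reports which annotations are missing, and computes the share of records that carry an NMR spectrum. All class, field and function names are hypothetical and are not taken from the schema of any real database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlycanRecord:
    """Hypothetical minimal record: a structure plus its annotations."""
    structure: str                                            # primary structure in some linear notation
    taxon_ids: List[int] = field(default_factory=list)        # taxonomic annotation (placeholder IDs)
    publication_ids: List[str] = field(default_factory=list)  # bibliographic annotation (placeholder IDs)
    nmr_spectrum: Optional[List[float]] = None                # optional chemical shifts, ppm

    def missing_annotations(self) -> List[str]:
        """Return the names of the basic annotations that are absent."""
        missing = []
        if not self.taxon_ids:
            missing.append("taxonomy")
        if not self.publication_ids:
            missing.append("bibliography")
        return missing


def nmr_coverage(records: List[GlycanRecord]) -> float:
    """Fraction of records that carry an NMR spectrum (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(r.nmr_spectrum is not None for r in records) / len(records)
```

Under these assumptions, a record created with only a structure string would report both "taxonomy" and "bibliography" as missing, which is the situation the paragraph describes for a considerable share of records in some databases.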

Higher coverage significantly increases the value of a database, since even a negative answer to a search request then constitutes valuable scientific information. The limited potential for automating the search for suitable publications restricts the acquisition of primary data and, therefore, the coverage. At present, only the Carbohydrate Structure Database (CSDB, in its bacterial and fungal parts) claims complete (>80%) structural coverage within a chosen domain. To keep the coverage up to date, regular updates are needed, and the time lag between publication and deposition should not exceed two years. A universal solution for keeping the data current would be to require that every published structure be uploaded to a database prior to publication, with the resulting IDs provided upon article submission. Such an approach has long existed in genomics but is still missing in glycomics. One of the reasons for this gap is insufficient standardization of glycan description languages (notations), which stems from the high chemical diversity of carbohydrates.

Data deposition is hard to automate not only at the level of publication selection but also at the level of article interpretation. As a result, all chemical and biological databases contain errors. These errors originate from annotators' mistakes, are transferred from other databases, are copied from original publications, or arise from inconsistencies in database architecture and from software bugs in importers and auto-annotators (listed in descending order of occurrence). According to our investigation, most records in Carbbank contained errors, and more than one third of records contained more than one error. The most common error type was an incorrect taxonomic annotation of structures. Significant gaps in the CCSD coverage were also discovered. As most modern projects use the Carbbank data, these errors are continuously reproduced. Some of them can be detected, and sometimes corrected, automatically. Such checks are present in a number of databases; however, only a retrospective expert analysis of publications can provide high data quality.
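
The kind of automatic error detection mentioned above might look roughly like the following sketch. It is only an illustration: the record layout, the tiny residue vocabulary and the three rules are assumptions made for this example and do not reproduce the checks of any particular database.

```python
from typing import Dict, List

# Hypothetical controlled vocabulary; a real database would use a far larger one.
KNOWN_RESIDUES = {"Glc", "Gal", "Man", "GlcNAc", "Kdo"}


def check_record(record: Dict) -> List[str]:
    """Return human-readable warnings for one record (empty list if none found)."""
    warnings = []
    # Rule 1: every residue name must come from the controlled vocabulary.
    for residue in record.get("residues", []):
        if residue not in KNOWN_RESIDUES:
            warnings.append(f"unknown residue name: {residue}")
    # Rule 2: a taxonomic annotation must be present.
    if not record.get("taxon"):
        warnings.append("missing taxonomic annotation")
    # Rule 3: the publication year, if given, must be plausible.
    year = record.get("year")
    if year is not None and not (1900 <= year <= 2100):
        warnings.append(f"implausible publication year: {year}")
    return warnings
```

Checks of this sort only catch formal inconsistencies. An incorrect but well-formed taxonomic annotation, the most common error type named above, passes them unnoticed, which is why expert review remains necessary.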

Database functionality is its capability to process various search requests, combine them using logical operations, and refine the results with queries of other types. For example: "find all structures published from 2001 to 2005 that contain either an α-Gal(1→3)KDO fragment or a monosaccharide-bound lysine or alanine, except synthetic structures or those found in gamma-proteobacteria, and display their ¹³C NMR spectra". The functionality can be extended by carbohydrate-related services, such as conformation map simulation, spectra prediction, search for structural motifs, etc. In contrast to searches by bibliography, taxonomy, keywords, text fragments and similar data, the search for structures containing a specific fragment (as well as for structures or spectra resembling specified ones) requires more careful programming and more computing resources, which makes the internal database architecture critical for the performance of structural queries. In the mid-2000s, the developers of GLYCOSCIENCES.de formulated "The decalogue of carbohydrate database development", which summarized the experience of the German and Russian groups. The key points of this document included the use of a connection table for internal structure representation, the maximum possible indexing, a minimum of free-text data (which, regrettably, is present in every project), and controlled vocabularies for as many data types as possible, first of all for monosaccharide names. An attempt to separate the monosaccharide vocabulary from glyco-databases was made within MonosaccharideDB. Later, rules for glycan data storage and processing were advanced by the Glycoinformatics Consortium and the NCBI Glycoinformatics advisory group. The improvements included standardization of glycan presentation in publications and on the web (IUPAC appendices, SNFG), and a trend toward using the semantic web (with the Resource Description Framework) to derive implicit, database-independent knowledge. Adapting this model to chemistry and biology led to dedicated carbohydrate (GlycoRDF) and glycoconjugate (GlycoCoO) ontologies.
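
The composite search request quoted above can be sketched as a combination of simple predicates joined by logical operations. This is a toy illustration under strong assumptions: records are plain dictionaries, and fragments are pre-computed text labels, whereas a real structural search operates on connection tables and is far more demanding. All function names, field names and the fragment label are hypothetical.

```python
from typing import Callable, Dict

Predicate = Callable[[Dict], bool]


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND of several predicates."""
    return lambda r: all(p(r) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR of several predicates."""
    return lambda r: any(p(r) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda r: not pred(r)


def published_between(first: int, last: int) -> Predicate:
    return lambda r: first <= r["year"] <= last


def has_fragment(label: str) -> Predicate:
    return lambda r: label in r["fragments"]


def from_taxon(name: str) -> Predicate:
    return lambda r: name in r["lineage"]


# "published 2001-2005, containing fragment A or Lys or Ala,
#  except synthetic structures or those from gamma-proteobacteria"
query = all_of(
    published_between(2001, 2005),
    any_of(has_fragment("aGal(1-3)Kdo"), has_fragment("Lys"), has_fragment("Ala")),
    negate(any_of(lambda r: r["synthetic"], from_taxon("Gammaproteobacteria"))),
)

records = [
    {"id": 1, "year": 2003, "fragments": {"aGal(1-3)Kdo"}, "synthetic": False,
     "lineage": {"Bacteria", "Alphaproteobacteria"}},
    {"id": 2, "year": 2003, "fragments": {"Lys"}, "synthetic": True,
     "lineage": {"Bacteria"}},
    {"id": 3, "year": 1999, "fragments": {"Ala"}, "synthetic": False,
     "lineage": {"Fungi"}},
    {"id": 4, "year": 2005, "fragments": {"Ala"}, "synthetic": False,
     "lineage": {"Bacteria", "Gammaproteobacteria"}},
]

hits = [r["id"] for r in records if query(r)]
print(hits)  # prints [1]
```

Building the request from small composable predicates mirrors the idea of refining results with queries of other types: each further condition is simply one more argument to `all_of`.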

Correct processing of structural data depends directly on the format of both internal and end-user structure descriptions. The limited capabilities and mutual incompatibility of glycan notations have been holding back the progress of glycoinformatics for decades. The main criteria for evaluating a carbohydrate language are listed below:

unambiguity (strict rules that allow recording of every chemically distinct structure in a unique way);
support of all structural features present in published carbohydrates (polymeric, oligomeric, cyclic and combined glycans, glycolipids, glycoproteins), including those containing non-carbohydrate constituents and various special cases (atypical residues, phospho- and sulfo-linkages, cyclic esters, amide and ether linkages, etc.);
support of incompletely determined structures at the level of monomers and their configurations, substitution positions, chain topology and side-chain stoichiometry;
computer-readability (with no need for complicated parsing, as is required for Extended IUPAC) and human-readability (required for tracking down errors introduced during manual data preparation);
compatibility with other formats (availability of converters that help in learning the language and in cross-database work), e.g. use of the monomer vocabulary conventional among glycobiologists;
the greatest possible independence from manually curated resources, such as monomer or ligand vocabularies (partly in conflict with the previous criterion).
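
The first criterion in the list above, unambiguity, can be illustrated with a small sketch: a branched structure is stored as a tree, and the serializer sorts the branches so that the same structure always produces the same string regardless of the order in which the branches were entered. The tree layout and the output syntax are invented for this example and do not correspond to CSDB Linear, GlycoCT, WURCS or any other real notation.

```python
from typing import Tuple

# A node is (residue_name, [(linkage, child_node), ...]); purely illustrative.
Node = Tuple[str, list]


def canonical(node: Node) -> str:
    """Serialize a residue tree so that branch order does not affect the result."""
    name, children = node
    # Serialize every branch first, then sort the resulting strings:
    # the sorted order is what makes the final string unique.
    parts = sorted(f"[{canonical(child)}({link})]" for link, child in children)
    return "".join(parts) + name


# The same branched structure entered with its two branches in different order.
variant_a = ("Man", [("1-3", ("Gal", [])), ("1-6", ("Glc", []))])
variant_b = ("Man", [("1-6", ("Glc", [])), ("1-3", ("Gal", []))])

print(canonical(variant_a) == canonical(variant_b))  # prints True
```

Real notations have to settle many more choices than branch order (repeating units, uncertain linkages, non-carbohydrate substituents), so this sketch covers only the simplest source of ambiguity.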

The CSDB Linear, GlycoCT, and WURCS notations possess most of these features. However, the first does not support some topologies, and the latter two are not human-readable. In contrast to genomics and proteomics, glycomics still lacks a widely recognized language, apart from the highly imperfect IUPAC notation.

The modern quality standard in scientific software engineering implies that both user and administrative interfaces are intuitive, well documented and freely accessible via the Internet. Intuitiveness extends to structure input and output formats, which users should not have to study. Standalone services for structure input and editing are extremely useful, as they allow users of any database to stay within the interface they are used to. Integration between projects implies not only a common interface for search requests but also automated data interchange. This also concerns interactions with non-carbohydrate projects, such as bibliographic (NCBI PubMed), taxonomic (NCBI Taxonomy), genetic (NCBI GenBank), proteomic (UniProt), and other databases. The first two projects to report protocols for automated data exchange were GLYCOSCIENCES.de and Bacterial CSDB, and since then the development of glyco-related web services has intensified.

Two special projects should be mentioned. EurocarbDB was funded as a design study of a database free of the common disadvantages; however, its development was limited to designing approaches without implementing them (implementation is always a bottleneck because of the human factor) and to importing CCSD. At the opposite end of the conceptual spectrum is the GlyTouCan repository, which does not provide its own data but integrates other databases and assigns unique IDs to the glycan moieties of published glycans and glycoconjugates. This "database of databases" allows cross-project operations in a single interface.

Within the Carbohydrate Structure Database (CSDB) project, started in 2005, we tried to develop a database free of the disadvantages of other glyco-databases at both the architectural and the content level. Since then, CSDB has been regularly updated and upgraded, and it has served as a platform for multiple glycoinformatics services. CSDB has become one of the most recognized sources of data on carbohydrates of microorganisms; it aims to be a conceptual replacement for Carbbank. More details on the position of CSDB in glycoinformatics are discussed in the Russian version of this lecture.


Last update: 2021 Mar 31
