From A critical assessment of text mining methods in molecular biology
[Background] Molecular biology has accumulated substantial amounts of data concerning the functions of
genes and proteins. Information relating to functional descriptions is generally extracted manually
from textual data and stored in biological databases to build up annotations for large collections of
gene products. Those annotation databases are crucial for the interpretation of large scale analysis
approaches using bioinformatics or experimental techniques. Due to the growing accumulation of
functional descriptions in the biomedical literature, there is an urgent need for text mining tools to
facilitate the extraction of such annotations. To make text mining tools usable in real-world
scenarios, for instance to assist database curators during annotation of protein function,
comparisons and evaluations of different approaches on full text articles are needed.
[Results] The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest
consists of a community wide competition aiming to evaluate different strategies for text mining
tools, as applied to biomedical literature. We report on task 2, which addressed the automatic
extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text
articles. The predictions of task 2 are based on triplets of protein – GO term – article passage. The
annotation-relevant text passages were returned by the participants and evaluated by expert
curators of the GO annotation (GOA) team at the European Bioinformatics Institute (EBI). Each
participant could submit up to three results for each sub-task comprising task 2. In total, more than
15,000 individual results were provided by the participants. In addition to the annotation itself,
the curators evaluated whether the protein and the GO term were correctly predicted and traceable
through the submitted text fragment.
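The triplet format and the two curator checks described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, field names, and the yes/no judgement scale are hypothetical and are not taken from the BioCreAtIvE submission or evaluation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One submitted result: protein, GO term, and supporting passage."""
    protein: str   # protein identifier or name (format is an assumption)
    go_term: str   # GO identifier, e.g. "GO:0005515"
    passage: str   # text fragment taken from the full text article


@dataclass(frozen=True)
class CuratorJudgement:
    """Hypothetical record of the two curator checks on one prediction.

    A simple yes/no scale is assumed here for illustration only.
    """
    protein_supported: bool   # protein correct and traceable in the passage
    go_term_supported: bool   # GO term correct and traceable in the passage


def is_correct(judgement: CuratorJudgement) -> bool:
    """A prediction counts as correct only if both checks pass."""
    return judgement.protein_supported and judgement.go_term_supported


def fraction_correct(judgements: list[CuratorJudgement]) -> float:
    """Share of evaluated predictions that pass both checks."""
    if not judgements:
        return 0.0
    return sum(is_correct(j) for j in judgements) / len(judgements)
```

Requiring both checks to pass reflects the statement that the protein and the GO term each had to be traceable through the submitted passage; a real evaluation could equally report the two checks separately.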
[Conclusion] Concepts provided by GO are currently the most extensive set of terms used for
annotating gene products; they were therefore used to assess how effectively text mining tools are
able to extract those annotations automatically. Although the results obtained are promising, they
are still far from the performance demanded by real-world applications. Among
the principal difficulties encountered in addressing the proposed task were the complex nature of
the GO terms and protein names (the large range of variants which are used to express proteins
and especially GO terms in free text), and the lack of a standard training set. A range of very
different strategies was used to tackle this task. The dataset generated as part of the BioCreAtIvE
challenge is publicly available and will open new possibilities for training information extraction
methods in the domain of molecular biology.
The Protein Design Group (PDG) contributions to the BioCreAtIvE workshop
were funded by the European Commission as part of the E-BioSci and
ORIEL projects, contract nos. QLRI-CT-2001-30266 and IST-2001-32688,
under the RTD Programs "Quality of Life and Management of Living
Resources" and "Multimedia Content and Tools (KA3)". The work of M.
Krallinger was sponsored by the DOC scholarship program of the Austrian
Academy of Sciences.
Peer reviewed