This article is available from: http://www.biomedcentral.com/1471-2105/7/41
[Background] Experimental techniques such as DNA microarrays, serial analysis of gene expression
(SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data
related to genes and proteins at different levels. As in any other experimental approach, it is
necessary to analyze these data in the context of previously known information about the biological
entities under study. The literature is a particularly valuable source of information for experiment
validation and interpretation. Therefore, the development of automated text mining tools to assist
in such interpretation is one of the main challenges in current bioinformatics research.
[Results] We present a method to create literature profiles for large sets of genes or proteins
based on common semantic features extracted from a corpus of relevant documents. These
profiles can be used to establish pair-wise similarities among genes, applied to gene/protein
classification, or even combined with experimental measurements. Semantic features can be
used by researchers to facilitate the understanding of the commonalities indicated by experimental
results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning
algorithm for data analysis, capable of identifying local patterns that characterize a subset of the
data. The literature is thus used to establish putative relationships among subsets of genes or
proteins and to provide coherent justification for this clustering into subsets. We demonstrate the
utility of the method by applying it to two independent and vastly different sets of genes.
[Conclusion] The presented method can create literature profiles from documents relevant to
sets of genes. The representation of genes as additive linear combinations of semantic features
allows for the exploration of functional associations as well as for clustering, suggesting a valuable
methodology for the validation and interpretation of high-throughput experimental data.
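
As a rough illustration of the idea summarized above, the following sketch factorizes a small non-negative term-by-gene matrix into semantic features and per-gene profiles. It is not the authors' implementation: the multiplicative update rule (a standard Euclidean-loss NMF scheme), the matrix orientation (terms as rows, genes as columns), the number of features k, and the toy terms, gene names and counts are all assumptions made for the example.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (terms x genes) as W @ H.

    W (terms x k) holds the semantic features; H (k x genes) holds the
    non-negative weight of each feature in each gene's literature profile.
    Uses standard multiplicative updates for the squared-error objective
    (an assumption; the abstract does not specify the update scheme).
    """
    rng = np.random.default_rng(seed)
    n_terms, n_genes = V.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus: term counts per gene (invented for illustration).
terms = ["kinase", "phosphorylation", "signaling",
         "ribosome", "translation", "rRNA"]
genes = ["geneA", "geneB", "geneC", "geneD"]
V = np.array([
    [3, 6, 0, 0],
    [2, 4, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 4, 2],
    [0, 0, 6, 3],
], dtype=float)

W, H = nmf(V, k=2)

# Assign each gene to the semantic feature with the largest weight
# in its profile (one simple way to derive clusters from H).
labels = H.argmax(axis=0)

for f in range(W.shape[1]):
    top_terms = [terms[i] for i in np.argsort(W[:, f])[::-1][:3]]
    members = [g for g, lab in zip(genes, labels) if lab == f]
    print(f"feature {f}: top terms {top_terms}; genes {members}")
```

In this sketch each column of H is a gene's literature profile, an additive combination of the features in W, and the highest-weighted terms of each feature suggest why the genes assigned to it were grouped together.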
This work has been partially funded by Santander-UCM (grant PR27/05-13964),
Comunidad Autonoma de Madrid (grant CAM GR/SAL/0653/2004), Comision
Interministerial de Ciencia y Tecnologia (grants CICYT BFU2004-00217/BMC
and GEN2003-20235-c05-05) and a collaborative grant between the Spanish
Research Council and the National Research Council of Canada
(CSIC-050402040003). PCS is the recipient of a grant from Comunidad
Autonoma de Madrid. APM acknowledges the support of the Spanish Ramón y
Cajal program. HS is supported by the Canadian NSERC Discovery Grant
298292-04.
Peer reviewed