Interactive WikiProteins Project Invites Researchers to Annotate Biological Concepts

June 12, 2008

Hoping to bring together “a million minds” to annotate proteins, a team of researchers has launched a large-scale, community-based project called WikiProteins that combines automated text, data, and concept mining, manual annotation, and a newly developed software component called the Knowlet.

The project, which aims to annotate proteins and protein-related biomedical concepts such as diseases or organisms, will be powered by a platform technology designed by a group of scientists and a Rockville, Md.-based startup and will be made available free of charge in perpetuity for the scientific community and the public.

Called WikiProfessional, the platform “in a technical sense powers the community-version that we have now put out there, but you could take exactly the same technology platform and install it locally at a pharmaceutical company for them to do drug-lead discovery,” said Albert Mons, a computational linguist and co-founder of the startup, Knewco.

“Philosophically we wanted to hand off the responsibility for the quality of the content of the system to the community and then provide generic technology that can be used to grow the knowledgebase on a daily basis,” Mons said.

Given high-throughput data and the increasing number of papers describing them, “comprehensive and timely annotation of the literature for facts by any central team of experts [is] an unachievable goal. Computer assistance in the annotation process is, therefore urgently needed,” the scientists wrote in a paper published in the current issue of Genome Biology.

WikiProteins was created over the last two years and announced on May 28 by a team of scientists from the Swiss Institute of Bioinformatics; the GO consortium and the IntAct database at the European Molecular Biology Laboratory-European Bioinformatics Institute; Erasmus Medical Centre and Leiden University Medical Centre, both in the Netherlands; the Brazilian Stela Institute; the Wikimedia Foundation; and Knewco.

The project is an interactive and semantically supported workspace based on Wiki pages. It contains a knowledgebase, a navigation tool, and a section on the people in its annotating community. Beneath that layer is a relational Wiki based on WikiData software, an indexer, and software that creates components called Knowlets, which store the relationships between all the mined concepts.

At the core of this platform is the Knowlet, a proprietary concept-mining software component and ontology format that Knewco has developed over the last two years.

Writing in the Genome Biology paper, the team said that information is mined from scientific publications and that the Knowlet links two given concepts but records that link only once, no matter how often it is restated.

“This approach results in a minimal growth of the ‘concept space’ as compared to the text space,” the authors wrote. New and unique facts in the scientific literature expand the corpus to a much lesser degree than the totality of the text generated by new academic journal articles.
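To illustrate the difference between concept space and text space, here is a minimal Python sketch. It assumes a hypothetical `ConceptPairStore` class and toy sentences that have already been reduced to concept labels. None of these names come from Knewco's software, and the actual Knowlet format is not described in this article.

```python
from itertools import combinations


class ConceptPairStore:
    """Toy store that keeps each unordered concept pair only once (hypothetical)."""

    def __init__(self):
        # Maps an unordered pair of concepts to the number of times it was seen.
        self.pairs = {}
        # Total pair mentions processed, standing in for the "text space".
        self.mentions = 0

    def add_sentence(self, concepts):
        # Record every pair of distinct concepts found in one sentence.
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(concepts)), 2):
            key = (a, b)
            self.pairs[key] = self.pairs.get(key, 0) + 1
            self.mentions += 1


store = ConceptPairStore()
store.add_sentence(["BRCA1", "breast cancer"])
store.add_sentence(["breast cancer", "BRCA1"])           # same fact, restated
store.add_sentence(["BRCA1", "breast cancer", "human"])  # one known pair, two new

print(store.mentions)    # pair mentions seen in the text
print(len(store.pairs))  # distinct pairs actually stored
```

In this toy run the number of stored pairs grows more slowly than the number of mentions, because a restated pair only increments a counter instead of adding a new entry.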

Concept pairs are placed in what the WikiProteins creators call a “related concept cloud.” By applying a meta-analysis algorithm, the software calculates a semantic association to reflect the strength and type of relationship the concepts have.

The relationship is dynamic and recalculated based on newly mined information. Its calculated value is based on three factors: factual statements found in the scientific literature or databases, increasing co-occurrence of two concepts in a sentence or a paragraph, and predictive associations based on the overlap of the two concepts.
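The article does not give the formula behind this value, so the following Python sketch shows only one way three such factors could be combined. The `association_score` function, the weights, and the 0-to-1 input scale are all assumptions made for illustration, not the WikiProteins algorithm.

```python
def association_score(factual, cooccurrence, predicted,
                      weights=(0.5, 0.3, 0.2)):
    """Combine three evidence values in [0, 1] into one score (illustrative only).

    factual      -- evidence from explicit statements in literature or databases
    cooccurrence -- evidence from the two concepts appearing together in text
    predicted    -- evidence inferred from overlap between the two concepts
    weights      -- hypothetical weights, not taken from WikiProteins
    """
    values = (factual, cooccurrence, predicted)
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError("each evidence value must lie in [0, 1]")
    # Weighted average, normalised so the result also lies in [0, 1].
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Recalculating after more co-occurrences are mined raises the score.
before = association_score(factual=0.0, cooccurrence=0.2, predicted=0.4)
after = association_score(factual=0.0, cooccurrence=0.6, predicted=0.4)
print(before < after)
```

A weighted average is used here only because it is easy to read; any monotonic combination of the three inputs would show the same recalculation behavior.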

The WikiProteins terminology has been mapped to concept identifiers in the Wiki-based terminology system called OmegaWiki. WikiProteins and OmegaWiki are driven by a relational database that is linked to the Knowlets by on-the-fly indexing of all Wiki pages. An indexer called Peregrine is designed to recognize concepts in the Wiki text, and it is coupled to a terminology system derived from OmegaWiki.
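As a rough illustration of dictionary-based concept recognition, the Python sketch below maps terms found in a piece of text to concept identifiers. The term list, the identifiers, and the `index_text` function are invented for this example and do not reflect how Peregrine or OmegaWiki actually work.

```python
import re

# Hypothetical terminology: several terms (synonyms) can map to one concept ID.
TERMINOLOGY = {
    "brca1": "C001",
    "breast cancer type 1 susceptibility protein": "C001",
    "breast cancer": "C002",
    "homo sapiens": "C003",
    "human": "C003",
}


def index_text(text, terminology=TERMINOLOGY):
    """Return the sorted list of concept IDs whose terms occur in the text."""
    lowered = text.lower()
    found = set()
    # Try longer terms first so multi-word terms win over their substrings.
    for term in sorted(terminology, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, lowered):
            found.add(terminology[term])
            # Blank out the match so shorter terms inside it are not counted.
            lowered = re.sub(pattern, " ", lowered)
    return sorted(found)


print(index_text("BRCA1 mutations raise breast cancer risk in human patients."))
```

Matching the longest terms first keeps a full protein name from also being indexed as the disease name it happens to contain.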

Each biomedical concept has its own page that includes up-to-date annotations. Registered users can become annotators and edit records. WikiProteins shows the edited record alongside the original one that was mined from the authoritative databases. Professional annotators at those databases can then choose to incorporate some of the new community-entered information into their own records.

Bio Saga Blog - A Chronicle of Life Sciences & Informatics
