The BioText Project

University of California, Berkeley

BioText Data

This web page contains links to training and testing sets for various research results produced by the BioText project.

Recognizing Abbreviation Definitions

Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel Schwartz and Marti Hearst, in the proceedings of the Pacific Symposium on Biocomputing (PSB 2003) pdf

To develop this collection, 1000 MEDLINE abstracts were randomly selected from the results of a query on the term "yeast". These were then hand-tagged, producing a list of 954 correct pairs.

The dataset was first annotated by a researcher in computational and biosciences. The data was further verified by comparing any questionable pairs against other occurrences of the same abbreviation in other abstracts, using the web site provided by Chang, Schuetze, and Altman 2002. A pair extracted by the Schwartz and Hearst algorithm is considered correct only if it exactly matches a pair labeled in the dataset.
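The exact-match criterion can be illustrated with a small scoring sketch. The function name `score_pairs` and the example pairs are hypothetical and are not taken from the dataset. Treating the pairs as sets, without regard to which abstract they came from, is also an assumption. Only the rule that an extracted pair counts as correct when it is identical to a labeled pair comes from the description above.

```python
def score_pairs(extracted, gold):
    """Score (short form, long form) pairs by exact match against the labeled pairs.

    Hypothetical helper: pairs are compared as whole tuples, so a pair with the
    right abbreviation but a partially matching long form counts as incorrect.
    """
    extracted_set = set(extracted)
    gold_set = set(gold)
    correct = extracted_set & gold_set  # exact tuple equality only
    precision = len(correct) / len(extracted_set) if extracted_set else 0.0
    recall = len(correct) / len(gold_set) if gold_set else 0.0
    return precision, recall, correct


# Invented example pairs, for illustration only.
gold = [("ORF", "open reading frame"), ("PCR", "polymerase chain reaction")]
extracted = [("ORF", "open reading frame"), ("PCR", "chain reaction")]

precision, recall, correct = score_pairs(extracted, gold)
print(precision, recall, sorted(correct))
```

In this invented example the second extracted pair has a truncated long form, so it does not count as a match even though the abbreviation itself is right.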

Protein-Protein Interaction Data

Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

Multi-way Relation Classification: Application to Protein-Protein Interaction, Barbara Rosario and Marti Hearst, in HLT-NAACL'05, Vancouver, 2005. pdf

The dataset was annotated by a researcher in computational and biosciences. In the paper above we describe how we extracted the data. The format is the following:

interaction_type====PaperPubMedID_Prot1_ID_Prot2_ID==>sentence with proteins labeled|| .....
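A record in this format could be parsed along the following lines. This is a minimal sketch under several assumptions that the format line does not confirm: that the three identifiers are separated by single underscores and contain none themselves, that `||` separates multiple sentences within one record, and that each record sits on one line. The names `InteractionRecord` and `parse_record` and the sample line are invented for illustration; the way proteins are labeled inside the sentence is not specified here, so the sample uses placeholder tokens.

```python
from typing import List, NamedTuple


class InteractionRecord(NamedTuple):
    interaction_type: str
    pubmed_id: str
    protein1_id: str
    protein2_id: str
    sentences: List[str]


def parse_record(line: str) -> InteractionRecord:
    """Split one record into its fields (hypothetical helper)."""
    # "====" separates the interaction type from the rest of the record.
    interaction_type, rest = line.split("====", 1)
    # "==>" separates the identifier key from the labeled sentence text.
    key, text = rest.split("==>", 1)
    # Assumption: the key is PubMedID_Prot1ID_Prot2ID with underscore separators.
    pubmed_id, protein1_id, protein2_id = key.split("_", 2)
    # Assumption: "||" delimits sentences; empty fragments are dropped.
    sentences = [s.strip() for s in text.split("||") if s.strip()]
    return InteractionRecord(
        interaction_type.strip(), pubmed_id, protein1_id, protein2_id, sentences
    )


# Invented sample line, not taken from the dataset.
sample = "binds====12345678_111_222==>PROT1 binds PROT2 in vitro.||PROT1 and PROT2 form a complex."
record = parse_record(sample)
print(record.interaction_type, record.pubmed_id, len(record.sentences))
```

If the identifiers in the actual files contain underscores, the key would need a different splitting rule, so the output should be checked against a few real records before relying on it.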

Relations between DISEASE/TREATMENT Entities

Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti A. Hearst, in the proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004. pdf

Information about, and links to, the files

Noun Compound Semantics

Please acknowledge your access to this data by citing this paper if you use the data in research or for other purposes:

Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy, Barbara Rosario and Marti Hearst, in the proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Pittsburgh, PA. pdf

The following files contain all of the labeled noun compounds (NCs) used in the experiments described in the paper above.