===========================================================
Documentation for MPQA Opinion Corpus version 2.0
===========================================================

Contents:
  1. Introduction
  2. Overview of Changes
     2.1 Additional data
     2.2 Addition of attitude and target annotations
     2.3 Removal of exploratory attributes
     2.4 Addition of OpQA answer annotations
     2.5 Refinement of some annotations
  3. Data
     3.1 MPQA original subset
     3.2 OpQA subset
     3.3 XBank subset
     3.4 ULA subset
     3.5 ULA-LU subset
  4. MPQA Annotation Scheme
     4.1 agent
     4.2 expressive-subjectivity
     4.3 direct-subjective
     4.4 objective-speech-event
     4.5 attitude
     4.6 target
     4.7 inside
  5. Subjective Sentences
  6. OpQA Answer Annotations
  7. Database Structure
     7.1 database/docs
     7.2 database/meta_anns
     7.3 database/man_anns
  8. MPQA Annotation Format
  9. Acknowledgements
  10. Contact Information
  11. References

-----------------------------------------------------------

1. Introduction

This corpus contains news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).

The main changes in this version of the MPQA Corpus are the addition of new attitude and target annotations, the inclusion of answer annotations for the OpQA subset of the corpus, and the addition of new annotated documents, growing the corpus to 692 documents. These changes are described in more detail in the following section.

The previous version of the MPQA Corpus was released with two different versions of the terminology used to describe the MPQA annotation scheme. For this version of the corpus, only the newer (LRE) terminology from (Wiebe, Wilson, and Cardie, 2005) is used.

-----------------------------------------------------------

2. Overview of Changes

2.1 Additional data

This release contains an additional 157 annotated documents. The new documents come from XBank (85 Wall Street Journal texts), the ULA (48 texts from the American National Corpus), and the ULA-LU (24 texts from the ULA language understanding subcorpus). These additional documents are annotated with the full MPQA annotation scheme, including attitudes and targets.

2.2 Addition of attitude and target annotations

The MPQA annotation scheme has been extended to include two new types of annotations: attitude annotations and target annotations. The new annotations are described in Theresa Wilson's dissertation (Wilson, 2008). The relevant chapter is included in the release: TAWilsonDissertationCh7Attitudes.pdf.

As an overview, the attitude annotations aim to capture the attitudes (e.g., positive sentiments, negative sentiments, agreements, negative arguings) being expressed overall by the private states represented by the direct subjective annotations. To capture the relationship between attitudes and direct subjective annotations, a new attribute has been added to the direct subjective annotations: the attitude-link attribute. The attitude-link indicates which attitude is associated with which direct subjective annotation.

The target annotations aim to capture what each attitude is about, e.g., what the sentiment is about, or what is being argued about. Just as the attitude annotations are linked to the direct subjective annotations using the attitude-link attribute, the target annotations are linked to the attitude annotations using the target-link attribute.

The 157 new documents and 349 documents from the original MPQA Corpus have the new attitude and target annotations.
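To make the linking concrete, here is a hypothetical illustration in the stand-off annotation format described in section 8 (the ids, byte spans, and attribute values below are invented for this example). A direct-subjective annotation points to its attitude through attitude-link, and the attitude points to what it is about through target-link:

    120  451,458  string  GATE_direct-subjective  nested-source="w,smith" intensity="high" expression-intensity="medium" attitude-link="a5"
    121  451,490  string  GATE_attitude           id="a5" attitude-type="negative sentiment" target-link="t3"
    122  470,490  string  GATE_target             id="t3"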
2.3 Removal of exploratory attributes

In the original MPQA annotation scheme, some exploratory attributes were included to capture some attitude and target information. These attributes were:

- agent annotation -> nested-target attribute
- direct subjective annotation -> attitude-type attribute
- direct subjective annotation -> attitude-toward attribute

The new attitude and target annotations replace these earlier, exploratory annotations, so for this release, these attributes have been removed.

2.4 Addition of OpQA answer annotations

The OpQA Corpus is a 98-document subset of the original MPQA Corpus. In work by Stoyanov, Cardie, and Wiebe (2005), these documents were annotated for answers to a small set of fact and opinion questions. These answer annotations are included in this release.

2.5 Refinement of some annotations

As part of annotating attitudes and targets, refinements were made to some of the existing annotations. These refinements include changing annotation span boundaries, changing intensity and polarity attributes, and, in a few cases, removing annotations, adding new annotations, or changing a direct subjective annotation to an objective speech event annotation (or vice versa).

-----------------------------------------------------------

3. Data

This release of the corpus contains 692 documents, a total of 15802 sentences. There are 5 different sets of documents:

1. MPQA original subset
2. OpQA (Opinion Question Answering) subset
3. XBank
4. ULA (Unified Linguistic Annotation)
5. ULA-LU (Language Understanding subcorpus)

The XBank, ULA, and ULA-LU data, as well as some documents from the original 535-document release, carry attitude and target annotations following the scheme described below in section 4. This set of documents is listed in the file doclist.attitudeSubset.

An assignment to topic categories is available for the documents of the original MPQA Corpus (see section 3.1), but not for any of the other data sets.

3.1 MPQA original subset

The documents in the MPQA original subset are listed in the file doclist.mpqaOriginalSubset.

These articles are from 187 different foreign and U.S. news sources. They date from June 2001 to May 2002. They were identified by human searches and by an information retrieval system. The majority of the articles are on 10 different topics, but a number of additional articles were selected more or less at random from a larger corpus of 270,000 documents. This last set of articles has the topic misc.

The 10 topics are:

argentina:    economic collapse in Argentina
axisofevil:   reaction to President Bush's 2002 State of the Union Address
guantanamo:   U.S. holding prisoners in Guantanamo Bay
humanrights:  reaction to U.S. State Department report on human rights
kyoto:        ratification of the Kyoto Protocol
mugabe:       2002 presidential election in Zimbabwe
settlements:  Israeli settlements in Gaza and the West Bank
spacestation: space missions of various countries
taiwan:       relations between Taiwan and China
venezuela:    presidential coup in Venezuela

The file doclist.mpqaOriginalByTopic gives the topic for each document.

3.2 OpQA subset

The documents in the OpQA subset are listed in the file doclist.opqaSubset.

This section of the corpus consists of a set of 98 documents from the original 535-document MPQA Corpus. These documents were annotated for the research on Opinion Question Answering presented in Stoyanov, Cardie, and Wiebe (2005). Text segments that contributed answers to each of 30 questions, half fact-based and half opinion-based, are annotated.
3.3 XBank

The documents in the XBank subset are listed in the file doclist.xbankSubset.

The XBank subset contains 85 Wall Street Journal texts from the Penn TreeBank. They were annotated for inclusion in XBank, which is a simple tree-based merger of PropBank, NomBank, Discourse TreeBank, and TimeBank annotations. For more information on XBank and the additional annotations available on this data, please visit: http://www.cs.brandeis.edu/~marc/ula/xbank-browser/.

3.4 ULA

The documents in the ULA subset are listed in the file doclist.ulaSubset.

The ULA-OANC-1 corpus is a 40K-word subcorpus of the American National Corpus (ANC). The documents in the subcorpus were chosen to be representative of the open portion of the ANC. There are 48 documents in this set, falling into 6 categories:

- travel guides
- transcriptions of spoken conversation
- fundraising letters
- a chapter of the 9/11 report
- a chapter from a linguistics textbook
- articles from Slate magazine

All documents in the ULA-OANC-1 corpus have been annotated by participants in the Unified Linguistic Annotation project. In addition to the MPQA annotations, the following annotations are available on this data: PropBank, NomBank, Penn TreeBank, TimeBank, Penn Discourse Treebank, WordNet senses, and FrameNet frames. For more information on the ULA-OANC corpus and the additional annotations available on this data, please visit: http://nlp.cs.nyu.edu/wiki/corpuswg/ULA-OANC-1.

3.5 ULA-LU (the ULA language understanding subcorpus)

The documents in the ULA-LU subset are listed in the file doclist.ula-luSubset.

There are 24 documents in this set, falling into these main categories:

- emails related to the Enron case
- spoken language transcripts
- newswire text
- Wall Street Journal texts (from the Penn TreeBank)
- translations of Arabic source texts

-----------------------------------------------------------

4. MPQA Annotation Scheme

This section contains an overview of the types of annotations that you will see marked in the documents of this corpus. For more details on the MPQA annotations, see the instructions available here:

http://www.cs.pitt.edu/~wiebe/pubs/pub1.html (original MPQA scheme)
http://homepages.inf.ed.ac.uk/twilson/attitude-instructions.pdf (attitude and target annotations)

4.1 agent annotation

Marks phrases that refer to sources of private states and speech events, or phrases that refer to agents who are targets of an attitude.

Possible attributes:

id - Unique identifier assigned by the annotator to the first meaningful and descriptive reference to an agent.

    There are two agent annotations with a 0,0 byte span in every document. These two annotations give an id for the writer of the document ('w') and for an implicit agent ('implicit'). Private states and speech events are sometimes attributed to implicit agents.

nested-source - Used when the agent reference is the source of a private state/speech event. The nested-source is a list of agent ids beginning with the writer and ending with the id for the immediate agent being referenced.
    Example: w, Foreign Ministry, US State Dept

agent-uncertain - Used when the annotator is uncertain whether the agent is the correct source of a private state/speech event.
    Possible values: somewhat-uncertain, very-uncertain

4.2 expressive-subjectivity annotation

Marks expressive-subjective elements: words and phrases that indirectly express a private state. For example, 'fraud' and 'daylight robbery' in the following sentence are expressive-subjective elements.

    "We foresaw electoral fraud but not daylight robbery," Tsvangirai said.

Possible attributes:

nested-source - List of agent ids beginning with the writer and ending with the id for the immediate agent that is the source of the private state being expressed by the expressive-subjective element.

nested-source-uncertain - Used when an annotator is uncertain as to whether the agent is the correct nested source.
    Possible values: somewhat-uncertain, very-uncertain

intensity - Indicates the intensity of the private state being expressed by the expressive-subjective element.
    Possible values: low, medium, high, extreme

polarity - Indicates the contextual polarity of the private state.
    Possible values: positive, negative, both, neutral, uncertain-positive, uncertain-negative, uncertain-both, uncertain-neutral
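For concreteness, the expressive-subjective element 'daylight robbery' in the sentence above might carry an annotation line like the following, in the stand-off format described in section 8 (the id, byte span, and agent id are invented for this illustration):

    212  38,54  string  GATE_expressive-subjectivity  nested-source="w,tsvangirai" intensity="high" polarity="negative"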
4.3 direct-subjective annotation

Marks direct mentions of private states and speech events (spoken or written) expressing private states.

Possible attributes:

nested-source - List of agent ids, beginning with the writer and ending with the id for the immediate agent that is the source of the private state or speech event.

annotation-uncertain - Used when an annotator is uncertain as to whether the expression marked is indeed a direct private state or a speech event.
    Possible values: somewhat-uncertain, very-uncertain

implicit - The presence of this attribute indicates that the speech event is implicit. This attribute is used when there is no private state or speech event phrase on which to actually make an annotation. For example, there is no phrase "I write" for the writer of the sentence.

subjective-uncertain - Used when an annotator is uncertain as to whether a private state is being expressed.
    Possible values: somewhat-uncertain, very-uncertain

intensity - Indicates the overall intensity of the private state being expressed, considering the 'direct-subjective' phrase and everything inside its scope.
    Possible values: low, medium, high, extreme

expression-intensity - Indicates the intensity of the speech event or private state expression itself.
    Possible values: neutral, low, medium, high, extreme

polarity - Indicates the contextual polarity of the private state. Only included when expression-intensity is not neutral.
    Possible values: positive, negative, both, neutral, uncertain-positive, uncertain-negative, uncertain-both, uncertain-neutral

attitude-link - Id of attitude annotation(s) that are linked to this direct-subjective annotation. If there is more than one linked attitude, this is represented as a comma-separated list of attitude ids.

insubstantial - Used when the private state or speech event is not substantial in the discourse.
    Possible values are a combination of: c1, c2, c3
    These values correspond to criteria necessary for a private state or speech event to be substantial. Please see the annotation instructions for a complete description of these criteria. The criteria listed for this attribute are the criteria that the private state or speech event fails to meet.

4.4 objective-speech-event annotation

Marks speech events that do not express private states.

Possible attributes:

nested-source - List of agent ids, beginning with the writer and ending with the id for the immediate agent that is the source of the speech event.

annotation-uncertain - Used when an annotator is uncertain as to whether the expression marked is indeed a speech event.
    Possible values: somewhat-uncertain, very-uncertain

implicit - The presence of this attribute indicates that the speech event is implicit. This attribute is used when there is no speech event phrase on which to actually make an annotation. For example, there is no phrase "I write" for the writer of the sentence.

objective-uncertain - Used when an annotator is uncertain as to whether the speech event is objective.
    Possible values: somewhat-uncertain, very-uncertain

insubstantial - Used when the speech event is not substantial in the discourse.
    Possible values are a combination of: c1, c2, c3
    These values correspond to criteria necessary for a private state or speech event to be substantial. Please see the annotation instructions for a complete description of these criteria. The criteria listed for this attribute are the criteria that the private state or speech event fails to meet.
4.5 attitude annotation

Marks the attitudes that make up the private states being expressed.

Possible attributes:

id - Identifier assigned to the attitude annotation, typically beginning with an 'a' followed by a number.

attitude-type - Type of attitude.
    Possible values:
        positive sentiment
        negative sentiment
        positive arguing
        negative arguing
        positive agreement
        negative agreement (disagreement)
        positive intention
        negative intention
        speculation
        other-attitude

attitude-uncertain - Used when an annotator is uncertain about the type of attitude, or whether the attitude should be marked.
    Possible values: somewhat-uncertain, very-uncertain

target-link - Id of target annotation(s) that are linked to this attitude annotation. If there is more than one linked target, this is represented as a comma-separated list of target ids. If the attitude does not have a target (or the target is unmarkable), this attribute has the value 'none'.

inferred - Used when a fairly prominent attitude can be inferred. For example, in the sentence below, the most prominent attitude is a positive sentiment being expressed by the people toward the fall of Chavez. However, there is also clearly a negative attitude toward Chavez that can be inferred.
    Example: People are happy because Chavez has fallen.

4.6 target annotation

Marks the targets of the attitudes, i.e., what the attitudes are about or what the attitudes are directed toward.

Possible attributes:

id - Identifier assigned to the target annotation, typically beginning with a 't' followed by a number.

target-uncertain - Used when an annotator is uncertain about whether this is the correct target for the attitude.
    Possible values: somewhat-uncertain, very-uncertain

4.7 inside annotation

The term 'inside' refers to the words inside the scope of a direct private state or speech event phrase ('on'). The annotators did not mark 'inside' annotations. However, 'inside' annotations were created automatically for each writer 'on' annotation. Each writer 'inside' corresponds to a GATE sentence.

-----------------------------------------------------------

5. Subjective Sentences

The annotations described in section 4 are expression-level annotations, performed below the level of the sentence. We ask annotators to identify all subjective expressions in a sentence, which gives us very fine-grained, detailed annotations. Although the annotators sometimes differ over which particular expressions are subjective, and over how many subjective expressions are in a sentence, they have very good agreement as to whether there is subjectivity in a sentence (see Wiebe, Wilson, and Cardie (2005)).

For the work using this data that appeared at CoNLL-03 (Riloff et al., 2003) and EMNLP-03 (Riloff & Wiebe, 2003), the following definition of a subjective sentence was used. The definition is in terms of the annotations. A sentence was considered subjective if 1 OR 2:

1. the sentence contains a "GATE_direct-subjective" annotation WITH attribute intensity NOT IN ['low', 'neutral'] AND NOT WITH attribute insubstantial.

2. the sentence contains a "GATE_expressive-subjectivity" annotation WITH attribute intensity NOT IN ['low'].

Otherwise, a sentence was considered objective.
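As a minimal sketch, the definition above might be implemented as follows, assuming each sentence's annotations are available as dictionaries with 'ann_type' and 'attributes' fields (the function and field names are illustrative; this code is not part of the corpus distribution):

    def is_subjective(sentence_annotations):
        """Apply the CoNLL-03/EMNLP-03 subjective-sentence rule to the
        annotations whose spans fall within a single sentence."""
        for ann in sentence_annotations:
            attrs = ann["attributes"]
            # Condition 1: a direct-subjective annotation whose intensity
            # is present and not low/neutral, with no insubstantial attribute.
            if (ann["ann_type"] == "GATE_direct-subjective"
                    and attrs.get("intensity") not in (None, "low", "neutral")
                    and "insubstantial" not in attrs):
                return True
            # Condition 2: an expressive-subjectivity annotation whose
            # intensity is present and not low.
            if (ann["ann_type"] == "GATE_expressive-subjectivity"
                    and attrs.get("intensity") not in (None, "low")):
                return True
        return False  # otherwise the sentence is considered objective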
The file test_setCoNLL03 contains the list of files used for evaluation in (Riloff et al., 2003).

NOTE: Since the experiments performed in (Riloff et al., 2003) and (Riloff & Wiebe, 2003), some annotation errors and errors in the sentence splits have been corrected, and some annotations have been refined.

-----------------------------------------------------------

6. OpQA Answer Annotations

This section contains an overview of the answer annotations marked in the OpQA subset. For more details on the OpQA answer annotations, see the instructions available here: http://www.cs.cornell.edu/~ves/Publications/MPQA_annot_instr.pdf

Each answer annotation marks a text segment that contributes (or partially contributes) an answer to one of 30 different fact or opinion questions. These questions can be found in the file opqa-questions.

The answer annotations have the following attributes:

a. annotator - the annotator who provided the given annotation
b. topic - topic of the question being answered (see opqa-questions)
c. questionnumber - within the topic, the number of the question being answered (see opqa-questions)
d. confidence - value ranging from 1 to 5 expressing the annotator's confidence that the segment answers the question
e. confidencecomment - optional comment given by the annotator
f. partial - true or false, used to indicate whether the segment is only a partial answer to the question

-----------------------------------------------------------

7. Database Structure

The database/ directory contains three subdirectories: docs, meta_anns, and man_anns. Each subdirectory has the following structure:

              subdir
             /      \
        parent  ...  parent
        /    \
    docleaf ... docleaf

Within each subdirectory, each document is uniquely identified by its parent/docleaf. For example, 20010927/23.18.15-25073 identifies one document. 20010927 is the parent; 23.18.15-25073 is the docleaf.

7.1 database/docs

The docs subdirectory contains the document collection. In this subdirectory, each docleaf (e.g., 23.18.15-25073) is a text file containing one document.

7.2 database/meta_anns

Each docleaf (e.g., 23.18.15-25073) in the meta_anns subdirectory contains information about the document (e.g., source, date). The meta_anns files are in MPQA format, which is described in section 8. All the documents in the MPQA original subset have corresponding meta_anns files except the following five documents: 20020516/22.23.24-9583, 20020517/22.08.22-24562, 20020521/22.21.24-5526, 20020522/22.34.49-13286, and 20020523/22.37.46-10374.

7.3 database/man_anns

This subdirectory contains the manual annotations for the documents. In this subdirectory, each docleaf (e.g., 23.18.15-25073) is a directory that contains two or three files: gateman.mpqa.lre.2.0, gatesentences.mpqa.2.0, and, for those documents that are part of the OpQA corpus subset, answer.mpqa.2.0.

The file gateman.mpqa.lre.2.0 contains the human opinion annotations, including the new attitude and target annotations (Wilson, 2008) for those documents that have been annotated for attitudes and targets. The file gatesentences.mpqa.2.0 contains the sentence spans, minus junk sentences that contain metadata or other spurious information that was not part of the article. These junk sentences were removed by hand.

All of these files, gateman.mpqa.lre.2.0, gatesentences.mpqa.2.0, and answer.mpqa.2.0, are in MPQA format, described in section 8.
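As a sketch of how the pieces fit together, the paths for a single document can be assembled from its parent/docleaf identifier like this (a minimal example; the function name is illustrative, and it assumes the release is unpacked so that database/ is reachable from the working directory):

    import os

    def document_paths(doc_id, root="database"):
        """Given a parent/docleaf id such as '20010927/23.18.15-25073',
        return the document text file and its annotation files."""
        paths = {
            "text":      os.path.join(root, "docs", doc_id),
            "anns":      os.path.join(root, "man_anns", doc_id, "gateman.mpqa.lre.2.0"),
            "sentences": os.path.join(root, "man_anns", doc_id, "gatesentences.mpqa.2.0"),
        }
        # answer.mpqa.2.0 exists only for documents in the OpQA subset
        answers = os.path.join(root, "man_anns", doc_id, "answer.mpqa.2.0")
        if os.path.exists(answers):
            paths["answers"] = answers
        return paths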
-----------------------------------------------------------

8. MPQA Annotation Format

The MPQA format is a type of general stand-off annotation. Every line in an annotation file is either a comment line (beginning with a '#') or an annotation line (one annotation per line).

An MPQA annotation line consists of text fields separated by a single TAB. The fields used are listed below, with an example annotation underneath.

    id   span      data_type   ann_type     attributes
    58   730,740   string      GATE_agent   nested-source="w,chinarep"

Every annotation has an identifier, id. This id is unique ONLY within a given MPQA annotation file.

The span is the starting and ending byte of the annotation in the document. For example, the annotation listed above is from the document temp_fbis/20.20.10-3414. The span of this annotation is 730,740. This means that the start of this annotation is byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740 is the character after the last character of the annotation.

    blah, blah, blah, example annotation, blah, blah, blah
                      |                 |
                 start byte          end byte

The data_type of all annotations should be 'string'.

The types of annotations in the gateman.mpqa files are GATE_agent, GATE_expressive-subjectivity, GATE_direct-subjective, GATE_objective-speech-event, GATE_attitude, GATE_target, GATE_inside, and GATE_split. With the exception of GATE_split, these annotation types correspond to the annotation types described in section 4. Sentence annotations in the gatesentences.mpqa.2.0 files have the type GATE_sentence. The annotations in the answer.mpqa.2.0 files are all of type 'answer'.

Each attribute is an attribute_name="attribute_value" pair. An annotation may have any number of attributes, including 0 attributes. Multiple attributes for an annotation are separated by single spaces, and they may be listed in any order. The attributes that an annotation may have depend on the type of annotation. The set of possible attributes for each MPQA annotation type is listed in section 4. The set of possible attributes for the OpQA answer annotations is listed in section 6.
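A minimal parser for this format might look like the following (a sketch, not part of the distribution; it assumes attribute values contain no embedded double quotes, which holds for the attributes described in sections 4 and 6):

    import re

    ATTR_RE = re.compile(r'(\S+)="([^"]*)"')

    def parse_mpqa_line(line):
        """Parse one MPQA annotation line into a dictionary.
        Returns None for blank lines and comment lines (starting with '#')."""
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            return None
        ann_id, span, data_type, ann_type, *rest = line.split("\t")
        start, end = (int(b) for b in span.split(","))
        attributes = dict(ATTR_RE.findall(rest[0])) if rest else {}
        return {
            "id": ann_id,
            "start": start,          # byte offset of the first character
            "end": end,              # byte offset just past the last character
            "data_type": data_type,  # always 'string'
            "ann_type": ann_type,    # e.g., GATE_direct-subjective
            "attributes": attributes,
        }

Applied to the example line above, this yields ann_type 'GATE_agent', span 730,740, and attributes {'nested-source': 'w,chinarep'}.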
-----------------------------------------------------------

9. Acknowledgements

The development of the MPQA Opinion Corpus version 1.0 was performed in support of the Northeast Regional Research Center (NRRC), which is sponsored by the Advanced Research and Development Activity (ARDA), a U.S. Government entity which sponsors and promotes research of import to the Intelligence Community, which includes but is not limited to the CIA, DIA, NSA, NIMA, and NRO.

The development of version 1.2 was supported in part by the NSF under grant IIS-0208798 and by the Advanced Research and Development Activity (ARDA).

The development of version 2.0 was supported in part by an Andrew Mellon Pre-doctoral Fellowship, Department of Homeland Security Grant N0014-07-1-0152, and National Science Foundation grant CNS-0551615.

-----------------------------------------------------------

10. Contact Information

Please direct any questions that you have about this corpus or the annotation scheme to Janyce Wiebe at the University of Pittsburgh.

Theresa Wilson   email: twilson@inf.ed.ac.uk
Janyce Wiebe     email: wiebe@cs.pitt.edu

-----------------------------------------------------------

11. References

Janyce Wiebe, Eric Breck, Chris Buckley, Claire Cardie, Paul Davis, Bruce Fraser, Diane Litman, David Pierce, Ellen Riloff, Theresa Wilson, David Day, and Mark Maybury (2003). Recognizing and Organizing Opinions Expressed in the World Press. 2003 AAAI Spring Symposium on New Directions in Question Answering.

Theresa Wilson and Janyce Wiebe (2003). Annotating Opinions in the World Press. 4th SIGdial Workshop on Discourse and Dialogue (SIGdial-03). ACL SIGdial.

Ellen Riloff, Janyce Wiebe, and Theresa Wilson (2003). Learning Subjective Nouns Using Extraction Pattern Bootstrapping. Seventh Conference on Natural Language Learning (CoNLL-03). ACL SIGNLL.

Ellen Riloff and Janyce Wiebe (2003). Learning Extraction Patterns for Subjective Expressions. Conference on Empirical Methods in Natural Language Processing (EMNLP-03). ACL SIGDAT.

Veselin Stoyanov, Claire Cardie, and Janyce Wiebe (2005). Multi-Perspective Question Answering Using the OpQA Corpus. Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP-05).

Janyce Wiebe, Theresa Wilson, and Claire Cardie (2005). Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation (formerly Computers and the Humanities) 1(2).

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proceedings of HLT/EMNLP 2005, Vancouver, Canada.

Theresa Wilson (2008). Fine-grained Subjectivity and Sentiment Analysis: Recognizing the Intensity, Polarity, and Attitudes of Private States. Chapter 7, "Representing Attitudes and Targets". Ph.D. Dissertation, University of Pittsburgh.

-----------------------------------------------------------

Theresa Wilson
Josef Ruppenhofer
Janyce Wiebe

version 2.0
last modified 12/10/08