===========================================================

Documentation for OpinionFinder 1.5

===========================================================

Contents:

  1. Introduction
     1.1 Background
     1.2 READMEs
     1.3 Steps OpinionFinder Goes Through for Processing
  2. Notes on the System
  3. System Requirements and Programs Used by OpinionFinder
  4. Installation
  5. Running OpinionFinder
  6. Acknowledgements
  7. Contact Information
  8. Citing Use of OpinionFinder
  9. References
  10. List of Contributors

-----------------------------------------------------------

1. Introduction

OpinionFinder is a system that processes documents and
automatically identifies subjective sentences as well as various
aspects of subjectivity within sentences, including agents who are
sources of opinion, direct subjective expressions and speech
events, and sentiment expressions. It outputs files using inline
SGML markup.

The "Background" section gives a brief description of
subjectivity, sources, direct subjective expressions and speech
events, and sentiment expressions. The "READMEs" section lists
other READMEs that are included with this system.

1.1 Background

Subjective sentences express private states. Private states are
internal mental or emotional states, including speculations,
beliefs, emotions, evaluations, goals, and judgments. Below are a
few examples of subjective sentences:

  (1) Jill said, "I hate Bill."
  (2) John thought he won the race.
  (3) Mary hoped her presentation would go well.

Direct subjective expressions are direct references to private
states. Speech events include both speaking and writing events. In
the subjective sentences above, "said" is a speech event and
"hate", "thought", and "hoped" are all direct subjective
expressions.

Sentiment expressions are a type of subjective expression.
Specifically, they are expressions of positive and negative
emotions, judgments, evaluations, and stances.
In the examples above, "hate" is a negative sentiment expression
and "hoped" is a positive sentiment expression.

The sources of the private states for (1), (2), and (3) are Jill,
John, and Mary, respectively.

For more information on subjectivity, subjective expressions, and
the ways in which private states may be expressed in language, see
Wiebe (2002) and Wiebe, Wilson, and Cardie (2005).

1.2 READMEs

This README describes how to install and run OpinionFinder. For
more information about the different components of OpinionFinder,
please see the following READMEs:

README.database
  - Describes the file directory structure used by OpinionFinder.
    The database directory is where the documents are put to be
    processed by the system and where the system puts the SGML
    tagged output of the subjective sentences, sentiment
    expressions, direct subjective expressions and speech events,
    and sources. It contains the docs, auto_anns, and output_anns
    directories.
  - Describes the general MPQA file format. The MPQA format is one
    form of output that the system provides.

README.polarity
  - Describes the polarity classifier, including the format of the
    MPQA and SGML files it outputs.

README.subjectivity
  - Describes the subjective sentence classifier, including the
    format of the MPQA and SGML files it outputs.

README.speech_dirsubj
  - Describes the direct subjective expression and speech event
    classifier, including the format of the MPQA and SGML files it
    outputs.

README.source
  - Describes the source classifier, including the format of the
    MPQA and SGML files it outputs.

README.knownbugs
  - Describes known errors in OpinionFinder.

README.featuresclues
  - Describes what to cite if you use only the feature and
    subjectivity clue information collected as part of our ongoing
    work.

1.3 Steps OpinionFinder Goes Through for Processing

OpinionFinder goes through the following steps:

  1) Preprocessing
     The documents in the docs directory are prepared for
     processing. XML and HTML meta information is removed.
  2) Sentence Splitting and POS Tagging
     OpenNLP 1.3.0 is used to split the documents into sentences
     and to part-of-speech tag them.
  3) Stemming
     The stemmer from Steven Abney's SCOL, version 1k, is used to
     stem the documents.
  4) Feature Finder
     Clues useful for identifying subjective sentences and
     sentiment expressions are found in the text document.
  5) Shallow Parsing
     SUNDANCE (Sentence UNDerstanding ANd Concept Extraction), a
     partial parser from the NLP laboratory at the University of
     Utah, is used by AutoSlog-TS to identify extraction patterns
     needed by the sentence classifiers and the SourceFinder.
  6) SourceFinder
     The SourceFinder uses the extraction patterns from Choi et al.
     (2005) to mark the sources of private states.
  7) Direct Subjective Expression and Speech Event Classifier
     The direct subjective expression and speech event classifier,
     built by Eric Breck, tags the direct subjective expressions
     and speech events found within the document. This classifier
     uses WordNet 1.6 and PyWordNet 1.6. For information about
     PyWordNet or to download a copy, go to
     http://osteele.com/projects/pywordnet/.
  8) Subjectivity Classifier
     The subjectivity classifier tags sentences in the document as
     subjective or objective.
  9) Polarity Classifier
     The polarity classifier tags the words in the document with
     their contextual polarity.
  10) SGML markup
     The MPQA files from the subjectivity, direct subjective
     expression and speech event, source, and polarity classifiers
     are written to the output_anns folder with inline SGML markup.
     The document that is tagged is the original document, but with
     the original XML and HTML tags replaced with white space.

-----------------------------------------------------------

2. Notes on the System

The program will only process documents in which every sentence
has fewer than 1000 words.

When documents are put in the docs directory, they are consumed by
the program. You should have a copy of the documents elsewhere.
If you want to rerun OpinionFinder on the original documents, you
should recopy the documents to the docs directory.

-----------------------------------------------------------

3. System Requirements and Programs Used by OpinionFinder

OpinionFinder runs on Linux and uses the "/tmp" and "intermediate"
folders for intermediate files. The "intermediate" folder is in the
directory of this README.

3.1 Software installation

The following software should be installed for OpinionFinder to
work: Python (version 2.3 or higher), Perl, and Java (version 1.5).
Both Python and Perl must be accessible by invoking their names
(ex: python example.py and perl example.pl).

3.2 Installation of required programs

Decompress the downloaded archive to access and install the
programs.

  1) Decompress the file opinionfinderv1.5.tar.bz2:
       bzip2 -dvf opinionfinderv1.5.tar.bz2
  2) Extract the files from the tar archive:
       tar -xvf opinionfinderv1.5.tar

Install the following programs on your system before installing
OpinionFinder:

  1) SUNDANCE (Sentence UNDerstanding ANd Concept Extraction) and
     autoannotate are software developed by the NLP laboratory at
     the University of Utah. OpinionFinder uses version 4.37 of
     SUNDANCE. SUNDANCE has been packaged with OpinionFinder and is
     available in the opinionfinder/software directory. The readme
     file opinionfinder/software/sundance.README gives step-by-step
     instructions for installing this program. The file
     sundance-techreport.pdf in the same directory gives technical
     details about SUNDANCE.
     NOTE: if you get errors executing the makefile, please refer
     to the FAQ that addresses some known issues.

  2) SCOL, version 1k, is a partial parser that has several tools,
     including a stemmer. SCOL has been packaged with OpinionFinder
     and is available in the opinionfinder/software directory.
     Unpack the file scol1k.tgz:
       gunzip scol1k.tgz
       tar -xvf scol1k.tar
     This will create a directory scol1k under
     opinionfinder/software.
     Go to this directory and follow the instructions in the README
     file to install SCOL.

  3) BoosTexter 2.1 is an implementation of a boosting machine
     learning algorithm developed by Erin Allwein, Robert Schapire,
     and Yoram Singer. Information on how to download and install
     BoosTexter is available at
     http://www.research.att.com/~gsf/download/ref/boostexter/boostexter.html

  4) WordNet 1.6 is a lexical reference system developed at
     Princeton University. WordNet is available for download from
     http://wordnet.princeton.edu/obtain
     Follow the "Download old versions of WordNet" link to obtain
     WordNet version 1.6.
     NOTE: OpinionFinder has been developed using WordNet 1.6 and
     may not work with other versions.

3.3 Miscellaneous software

The programs below don't have to be installed by the user, but are
used by OpinionFinder:

  1) OpenNLP 1.3.0 is a set of Natural Language Processing tools.
     The files from OpenNLP that are necessary for our program to
     run have been put in the lib folder so OpinionFinder can
     access them. The software directory contains OpenNLP 1.3.0. To
     unpack it, decompress the file opennlp-tools-1.3.0.tgz with
     the command
       gunzip -dvf opennlp-tools-1.3.0.tgz
     and extract the files using the command
       tar -xvf opennlp-tools-1.3.0.tar

  2) The SourceFinder is software developed by the NLP groups at
     Cornell University and the University of Utah. See Choi et al.
     (2005) for more information.

  3) PyWordNet 1.6 is used by the direct subjective expression and
     speech event classifier. For information about PyWordNet or to
     download a copy, go to http://osteele.com/projects/pywordnet/.

See the LICENSES folder for licensing information and for more
information about software used by OpinionFinder.

-----------------------------------------------------------

4. Installation

The following describes how to install OpinionFinder on your
system. Make sure you have performed all the steps in section 3,
System Requirements, before proceeding.
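
The requirement in section 3.1 that the interpreters be accessible
by name can be checked before installing. The following is only a
minimal sketch, not part of OpinionFinder: the function name
find_missing and the list REQUIRED_COMMANDS are hypothetical, and
the script assumes a current Python 3 interpreter (shutil.which is
not available in the Python 2 versions that OpinionFinder itself
targets). Including "java" in the list is also an assumption, since
OpinionFinder locates Java through the javapath variable in
config.txt rather than through PATH.

```python
import shutil

# Command names from section 3.1. "java" is an assumption: OpinionFinder
# itself finds Java through the javapath variable in config.txt.
REQUIRED_COMMANDS = ["python", "perl", "java"]


def find_missing(commands, which=shutil.which):
    """Return the command names that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_COMMANDS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required commands were found on PATH.")
```

The check only confirms that each name resolves to an executable;
it does not verify versions, so Java 1.5 and Python 2.3 or higher
still need to be confirmed by hand.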

  1) Go into the opinionfinder directory and edit config.txt using
     your favorite editor, specifying the directories where
     programs and data are located. The file config.txt gives
     details on what must be specified. Note that the javapath
     variable must point to the bin directory of Java version 1.5.

  2) Run install.py. You must run install.py from the directory it
     is in.
       python install.py config.txt
     If Java prints the following notes, you can ignore them:
       Note: Some input files use unchecked or unsafe operations.
       Note: Recompile with -Xlint:unchecked for details.

We do not support the third-party software that OpinionFinder needs
in order to run. If you have problems installing that software,
please contact the distributor.

-----------------------------------------------------------

5. Running OpinionFinder

To run OpinionFinder, you can either specify variables on the
command line or write a config file. Information that you provide
to the system includes the doclist, whether or not to compress the
intermediate files (.bz2) used by the system, whether to create the
SGML output, and whether or not manual sentence splits will be
provided.

A doclist file is a list of documents to be processed. Each line is
an absolute path to a document in the docs directory of the
database. Lines are separated by a newline character ('\n').

The simplest way to run OpinionFinder is

  python opinionfinder.py -f doclist

where doclist is the path to the doclist.

For more information on how to use opinionfinder.py, type

  python opinionfinder.py

For an example of a config file, see opin.config. For an example of
a doclist, see twain.doclist, which is in the "examples" directory.

To make sure that OpinionFinder works on your system, you can do
the following. Note that the files mentioned can be found in the
"examples" directory:

  1) Copy the marktwain folder into the docs folder of your
     database directory.
  2) Edit the doclist, twain.doclist, so that it gives the absolute
     path to each of the documents in the marktwain folder.
  3) From the opinionfinder directory, run the command:
       python opinionfinder.py -f ./examples/twain.doclist

If you see exceptions of the form:

  Exception exceptions.AttributeError: "DbfilenameShelf instance has no attribute 'writeback'" in ignored

from the direct subjective expression and speech event classifier,
ignore them.

You will see SGML output for these files in the output_anns
directory of the database. You can compare this output with the
files in the "marktwainprocessed" folder of the "examples"
directory.

-----------------------------------------------------------

6. Acknowledgements

This work was supported by the Advanced Research and Development
Activity (ARDA), by the NSF under grants IIS-0208028, IIS-0208798
and IIS-0208985, and by the Xerox Foundation.

-----------------------------------------------------------

7. Contact Information

Please direct any questions or problems that you have to the
following email address:

  opin@cs.pitt.edu

-----------------------------------------------------------

8. Citing Use of OpinionFinder

Please cite the use of the various components of OpinionFinder
individually.

a. Subjective Sentence Classifiers:

   Ellen Riloff and Janyce Wiebe (2003). Learning Extraction
   Patterns for Subjective Expressions. Conference on Empirical
   Methods in Natural Language Processing (EMNLP-03). ACL SIGDAT.
   Pages 105-112.

   Janyce Wiebe and Ellen Riloff (2005). Creating subjective and
   objective sentence classifiers from unannotated texts. Sixth
   International Conference on Intelligent Text Processing and
   Computational Linguistics (CICLing-2005).

b. Source Identifier:

   Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth
   Patwardhan (2005). Identifying Sources of Opinions with
   Conditional Random Fields and Extraction Patterns.
   Proceedings of Human Language Technology Conference/Conference
   on Empirical Methods in Natural Language Processing (HLT/EMNLP
   2005), Vancouver, Canada.

c. Direct Subjective and Speech Event Identifier:

   Yejin Choi, Eric Breck, and Claire Cardie (2006). Joint
   Extraction of Entities and Relations for Opinion Recognition.
   Conference on Empirical Methods in Natural Language Processing
   (EMNLP-2006).

d. Polarity Classifier:

   Theresa Wilson, Janyce Wiebe and Paul Hoffmann (2005).
   Recognizing Contextual Polarity in Phrase-Level Sentiment
   Analysis. Proceedings of Human Language Technologies
   Conference/Conference on Empirical Methods in Natural Language
   Processing (HLT/EMNLP 2005), Vancouver, Canada.

e. Features and Clues of Subjectivity:

   See README.featuresclues.

-----------------------------------------------------------

9. References

Yejin Choi, Eric Breck, and Claire Cardie (2006). Joint Extraction
of Entities and Relations for Opinion Recognition. Conference on
Empirical Methods in Natural Language Processing (EMNLP-2006).

Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan
(2005). Identifying Sources of Opinions with Conditional Random
Fields and Extraction Patterns. Proceedings of Human Language
Technology Conference/Conference on Empirical Methods in Natural
Language Processing (HLT/EMNLP 2005), Vancouver, Canada.

Ellen Riloff (1996). Automatically Generating Extraction Patterns
from Untagged Text. Proceedings of the Thirteenth National
Conference on Artificial Intelligence (AAAI-96). Pages 1044-1049.

Ellen Riloff and Janyce Wiebe (2003). Learning Extraction Patterns
for Subjective Expressions. Conference on Empirical Methods in
Natural Language Processing (EMNLP-03). ACL SIGDAT. Pages 105-112.

Ellen Riloff, Janyce Wiebe, and Theresa Wilson (2003). Learning
Subjective Nouns Using Extraction Pattern Bootstrapping. Seventh
Conference on Natural Language Learning (CoNLL-03). ACL SIGNLL.

Robert E. Schapire and Yoram Singer.
BoosTexter: A boosting-based system for text categorization.
Machine Learning, 39(2/3):135-168, 2000.

Janyce Wiebe (2002). Instructions for Annotating Opinions in
Newspaper Articles. Department of Computer Science Technical Report
TR-02-101, University of Pittsburgh, Pittsburgh, PA.

Janyce Wiebe and Ellen Riloff (2005). Creating subjective and
objective sentence classifiers from unannotated texts. Sixth
International Conference on Intelligent Text Processing and
Computational Linguistics (CICLing-2005).

Janyce Wiebe, Theresa Wilson, and Claire Cardie (2005). Annotating
expressions of opinions and emotions in language. Language
Resources and Evaluation, volume 39, issue 2-3, pp. 165-210.

Theresa Wilson, Janyce Wiebe and Paul Hoffmann (2005). Recognizing
Contextual Polarity in Phrase-Level Sentiment Analysis. Proceedings
of Human Language Technologies Conference/Conference on Empirical
Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver,
Canada.

-----------------------------------------------------------

10. List of Contributors

University of Pittsburgh:
  Janyce Wiebe
  Paul Hoffmann
  Colin Ihrig
  Jason Kessler
  Swapna Somasundaran
  Theresa Wilson

University of Utah:
  Ellen Riloff
  Siddharth Patwardhan

Cornell University:
  Claire Cardie
  Eric Breck
  Yejin Choi