Archive for sentiment analysis

The grilling is scheduled a fortnight hence

Posted in Uncategorized on 2011 July 20 by Asad Sayeed

Yep, my Special Day of Reckoning is approaching:

 THE DISSERTATION DEFENSE FOR THE DEGREE OF Ph.D. IN COMPUTER SCIENCE FOR

Asad Basheer Sayeed

Will be held:

DATE: Wednesday August 3, 2011 at 12:00 p.m.

LOCATION: Room 3258 A.V. Williams Bldg.

TITLE: A Distributional and Syntactic Approach to Fine-Grained Opinion Mining

ABSTRACT:  This thesis contributes to a larger social science research program of analyzing the diffusion of IT innovations. We show how to automatically discriminate portions of text dealing with opinions about innovations by finding {source, target, opinion} triples in text. In this context, we can discern a list of innovations as targets from the domain itself. We can then use this list as an anchor for finding the other two members of the triple at a “fine-grained” level—paragraph contexts or less.
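For concreteness, here is a minimal sketch of how such a triple might be represented; the field names and example values are illustrative, not taken from the thesis:

```python
from dataclasses import dataclass

@dataclass
class OpinionTriple:
    source: str   # who holds the opinion, e.g. "industry analysts"
    target: str   # the IT innovation under discussion, e.g. "virtualization"
    opinion: str  # the opinion-bearing expression, e.g. "overhyped"

# Hypothetical example of a triple found at paragraph level:
triple = OpinionTriple("industry analysts", "virtualization", "overhyped")
```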

We first demonstrate a vector space model for finding opinionated contexts in which the innovation targets are mentioned. We can find paragraph-level contexts by searching for an “expresses-an-opinion-about” relation between sources and targets, using a supervised SVM model with features derived from a general-purpose subjectivity lexicon and a corpus indexing tool. We show that our algorithm correctly filters the domain-relevant subset of subjectivity terms so that they are more highly valued.
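A rough sketch of the kind of supervised setup this describes; the lexicon, documents, and labels below are stand-ins, not the actual system or data:

```python
# Toy sketch: classify paragraph contexts for an "expresses-an-opinion-about"
# relation, using counts of subjectivity-lexicon terms as SVM features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

lexicon = ["promising", "overhyped", "flawed", "impressive"]  # stand-in lexicon

paragraphs = [
    "Analysts call virtualization promising but overhyped.",
    "The server room holds forty racks.",
]
labels = [1, 0]  # 1 = source expresses an opinion about the target

vec = CountVectorizer(vocabulary=lexicon)  # only lexicon terms become features
clf = LinearSVC().fit(vec.transform(paragraphs), labels)
print(clf.predict(vec.transform(["A flawed but impressive rollout."])))
```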

We then turn to identifying the opinion. Typically, opinions in opinion mining are taken to be positive or negative. We discuss a crowdsourcing technique developed to create the seed data describing human perception of opinion-bearing language needed for our supervised learning algorithm. Our user interface successfully limited the meta-subjectivity inherent in the task (“What is an opinion?”) while reliably retrieving relevant opinionated words using labour that was not expert in the domain.
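One simple way to picture “reliably retrieving relevant opinionated words” from non-expert labour is vote aggregation over worker judgments; the sketch below is my own construction, not the interface from the thesis:

```python
# Toy aggregation of crowdsourced judgments: keep a word as opinion-bearing
# only if enough workers independently highlighted it.
from collections import Counter

def aggregate(highlights, n_workers, threshold=0.6):
    """highlights: one set of highlighted words per worker."""
    votes = Counter(word for chosen in highlights for word in chosen)
    return {w for w, c in votes.items() if c / n_workers >= threshold}

workers = [{"overhyped", "promising"}, {"overhyped"}, {"overhyped", "racks"}]
print(aggregate(workers, n_workers=3))  # {'overhyped'}
```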

Finally, we developed a new data structure and modeling technique for connecting targets with the correct within-sentence opinionated language. Syntactic relatedness tries (SRTs) contain all paths from a dependency graph of a sentence that connect a target expression to a candidate opinionated word. We use factor graphs to model how far a path through the SRT must be followed in order to connect the right targets to the right words. It turns out that we can correctly label significant portions of these tries, with minimal processing, using very rudimentary features such as part-of-speech tags and dependency labels. This technique uses the data gathered by our crowdsourcing technique as training data.
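A minimal sketch of what an SRT might look like in code, assuming each path is a sequence of (dependency label, POS tag) steps from the target outward; the details are a guess at the shape of the structure, not the thesis implementation:

```python
# Sketch of a syntactic relatedness trie (SRT): each stored path runs from a
# target mention through the dependency graph to a candidate opinion word.
class SRTNode:
    def __init__(self):
        self.children = {}  # (dep_label, pos_tag) -> SRTNode
        self.label = None   # to be filled in by the learned model

class SRT:
    def __init__(self):
        self.root = SRTNode()

    def insert(self, path):
        """path: iterable of (dep_label, pos_tag) steps, target end first."""
        node = self.root
        for step in path:
            node = node.children.setdefault(step, SRTNode())
        return node

srt = SRT()
srt.insert([("nsubj", "NOUN"), ("ccomp", "VERB"), ("dobj", "NOUN")])
```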

We conclude by placing our work in the context of a larger sentiment classification pipeline and by describing a model for learning from the data structures produced by our work. This work contributes to computational linguistics by proposing and verifying new data gathering techniques and applying recent developments in machine learning to inference over grammatical structures for highly subjective purposes. It applies a suffix tree-based data structure to model opinion in a specific domain by imposing a restriction on the order in which the data is stored in the structure.

Examining Committee:

COMMITTEE CHAIR: Dr. Amy Weinberg

Dean’s Representative: Dr. William Idsardi

Committee Members:

Dr. Jordan Boyd-Graber

Dr. Hal Daume III

Dr. Donald Perlis

EVERYONE IS INVITED TO ATTEND THE PRESENTATION PORTION OF THIS DEFENSE

Sentiment analysis seminar

Posted in Uncategorized on 2009 September 17 by Asad Sayeed

Hi ho, people. I attended the sentiment analysis seminar again this week, but this time I was helping lead the discussion on current efforts in sentiment annotation, particularly the Multi-Perspective Question Answering (MPQA) corpus. We covered three papers, summarized below.

The first paper mainly covers the basic effort in MPQA sentiment annotation—how to break down the problem so as to achieve consistent annotation and how to measure inter-annotator agreement.  The basic paradigm that Wiebe and her team use is to view sentiment annotation as being about a mapping between stretches of text and the “private states” of an opinion-holder.  A private state is simply a property of an opinion-holding entity that is not independently verifiable. In other words, a private state is a description of subjectivity.

We can thus develop an ontology of private states, the holders of those states, and how they are reflected in text. There are two major categories of private state expressions: expressive (implicit) subjectivity and direct (explicit) subjectivity. In “Bob squashed the hated insect”, the hatedness of the insect is clearly subjective, but it’s not directly attributed to Bob—even though in context it could be Bob’s opinion. “Bob claims he hates insects,” on the other hand, attributes the private state (hating insects) directly and explicitly to Bob.
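To make the two categories concrete, here is one way the examples above might be encoded as annotation records; this is a simplification of my own, not the MPQA schema:

```python
# Simplified private-state records for the two example sentences above.
annotations = [
    {"span": "hated", "category": "expressive_subjectivity",
     "source": "the writer (contextually, perhaps Bob)", "polarity": "negative"},
    {"span": "claims he hates insects", "category": "direct_subjectivity",
     "source": "Bob", "polarity": "negative"},
]
```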

It was interesting to watch the reactions of some of my fellow classmates. Most of them come from either straight-up CS or hardcore statistical NLP, so fine-grained philosophical distinctions of subjectivity are not day-to-day staples in their work, and some said on the course mailing list and in class that reading the papers required a real shift in mindset. Philip went through some of the techniques linguists use to make these distinctions, including how they vary the context of a statement to construct tests for its linguistic properties.

Sentiment analysis seems to be one of those places where a stronger bridge between linguistics and applied NLP can be made.

The second paper followed much in the same vein as the first, except that it emphasized the extent to which current NLP techniques cannot yet handle some of the distinctions in the MPQA annotation. In particular, semantic role labeling, a family of existing techniques for identifying the predicate-argument structure of sentences, cannot be used to directly infer some of the participants in a private state expression. For example, when the holder of an opinion is only implied, semantic role labeling as we currently conceive it will never find that holder.

The last reading dealt with some additions to the MPQA made by Theresa Wilson, particularly the addition of target/topic annotations and the subdivision of opinion types into “attitudes” such as “sentiment” and “arguing”.

Having fooled around a bit with the MPQA myself, I had the opportunity to show the class a little of what it looks like, and what the challenges of using a somewhat inconsistent standoff annotation format can be. Philip also took the opportunity to try some collaborative annotation of a small passage of text, with hilarious results: it took a remarkable amount of arguing to decide the subjectivity of even small stretches of text. Assigning subjectivity is too subjective! In that sense, the high inter-annotator agreement in the original MPQA effort seems somewhat surprising, a point that had already been raised on the mailing list before class.
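For a flavour of what “standoff” means in practice: the annotations live in a separate file and point into the source text by offsets. A minimal reader, assuming a simplified tab-separated format of id, span, type, and attribute pairs (the real MPQA files are messier, which was rather the point):

```python
# Minimal standoff-annotation reader for a simplified, hypothetical format:
#   id <TAB> start,end <TAB> type <TAB> key="value" key="value" ...
# Real MPQA annotations use byte offsets and a less regular layout.
import re

def parse_standoff(path, doc_text):
    annotations = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            ann_id, span, ann_type, rest = line.rstrip("\n").split("\t", 3)
            start, end = (int(x) for x in span.split(","))
            annotations.append({
                "id": ann_id,
                "type": ann_type,
                "text": doc_text[start:end],  # the span indexes the source text
                "attrs": dict(re.findall(r'(\S+)="([^"]*)"', rest)),
            })
    return annotations
```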

Sentiment analysis seminar

Posted in Uncategorized on 2009 September 9 by Asad Sayeed

I will attempt to blog some of the things I attend during the semester. One of them is a weekly seminar on sentiment analysis taught by Philip Resnik. I am in it right now. This is therefore a liveblog, and hence not guaranteed to make sense or be complete. Especially the latter; it is far from complete.

Tim Hawes’ thesis and conversational analysis

The first thing we’re talking about is Tim Hawes’ work for his Master’s degree, which he defended just yesterday. I attended it, before I had even started this blog. It was about predicting the outcomes of US Supreme Court cases from transcripts of oral arguments. This is particularly timely, as Philip just mentioned, since Sonia Sotomayor showed up for work at SCOTUS for the first time today. I proposed on the mailing list that one further means of predicting how an individual justice would vote, even one who rarely says anything on the bench (true of some justices), would be their body of writing and argument prior to confirmation. Philip proposed the use of a mixture model over prior argument, updated as the justice moves through his/her career.

In the case of legal arguments, we have to make some assumptions.  Hawes’ thesis mentioned a couple of textual assumptions: cohesion and coherence.  That is, we assume that there are topical and other elements that evolve through the text in a consistent way.  There are techniques we can use to measure and segment a document based on these kinds of assumptions, such as TextTiling and lexical chaining.
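The TextTiling intuition is easy to sketch: compare the vocabulary just before each candidate boundary with the vocabulary just after, and look for dips in similarity. A toy version (not Hearst’s full algorithm, which adds smoothing and depth scoring):

```python
# Toy TextTiling-style gap scoring: low similarity across a gap between
# sentence blocks suggests a topic boundary.
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a.keys() & b.keys())
    den = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return num / den if den else 0.0

def gap_scores(sentences, block=3):
    """Similarity of the `block` sentences before each gap to the `block` after."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return {
        gap: cosine(sum(bags[gap - block:gap], Counter()),
                    sum(bags[gap:gap + block], Counter()))
        for gap in range(block, len(bags) - block + 1)
    }
```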

The distinction between cohesion and coherence: the latter is a semantic value that really must be judged by a person—it’s about interpretation. It is possible to have cohesion without coherence. (We mentioned the word “zeugma”; look it up.)

The right place to look for discourse analysis from an NLP point of view is to start with Rhetorical Structure Theory (RST).  Daniel Marcu has written on this topic.

Hawes’ thesis had two kinds of results. One of these was encoded in “rose diagrams.” These are modified pie charts in which each slice varies in radius as though it were the petal of a rose, and each petal is coloured differently along a gradient of shades. In this representation, we can visualize a large number of things at once. In the case of SCOTUS, each justice can be represented as a petal whose colour represents political leaning, whose radius represents agreement, and whose width represents the number of follow-up turns at questioning. It’s a bit of a complicated representation and hard to describe without a diagram, which I’m not about to draw on the fly.
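Since I can’t draw one here, a matplotlib sketch with entirely made-up data may help; it shows only the visual encoding (radius, width, and colour as three separate channels), not anything from Hawes’ actual results:

```python
# Hypothetical "rose diagram": a polar bar chart where each petal's radius,
# width, and colour carry separate variables. All data below is invented.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
justices = list("ABCDEFGHI")            # placeholder justice labels
agreement = rng.random(9)               # petal radius: agreement
followups = 0.2 + 0.4 * rng.random(9)   # petal width: follow-up turns
leaning = rng.random(9)                 # petal colour: political leaning

theta = np.linspace(0, 2 * np.pi, len(justices), endpoint=False)
ax = plt.subplot(projection="polar")
ax.bar(theta, agreement, width=followups, color=plt.cm.coolwarm(leaning))
ax.set_xticks(theta)
ax.set_xticklabels(justices)
plt.show()
```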

While this form of visualization is quite complicated by itself, it can be used to draw contrasts between types of cases and judicial situations, where it often produces very strong and visible differences. Contrasts we can examine include “liberal” vs. “conservative”, affirm vs. overturn, plaintiff win vs. lose, and so on. It turns out that by this method you can predict the vote of Clarence Thomas (who rarely speaks) to a high degree of accuracy.

We didn’t get to the other technique Hawes used; we had to change rooms.

Sentiment analysis

We didn’t end up changing rooms.

In this part of the seminar, we briefly touch on the basics of sentiment analysis, particularly with reference to Bing Liu’s recent review article. So we begin with a discussion of the general dimensions and challenges of sentiment analysis, such as:

  • What do people think of _____?
  • What features/facets/topics matter?
  • Mixed and neutral sentiment.
  • The effect of comparatives.
  • How opinions change, and what influences this.
  • Covert opinion/spin vs. overt expressions.
  • The holders of opinions.
  • Multilingual and cross-language issues.

We then had a rather wide-ranging discussion of the different kinds of issues, including a detailed discussion of the syntactic relationships within sentences that might relate opinion-holders to sentiments to opinion targets/objects. The discussion was so wide-ranging, yet so compressed, that it is hard to capture in a liveblog, but it touched on issues that we will revisit in later sessions.
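As a small taste of those syntactic relationships, here is a sketch that extracts the dependency path between two words using spaCy (assuming its small English model is installed); it is an illustration of the general idea, not anything we built in class:

```python
# Extract the dependency-label path between two tokens via their lowest
# common ancestor in the parse tree.
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_path(tok_a, tok_b):
    a_chain = [tok_a] + list(tok_a.ancestors)  # tok_a up to the root
    b_chain = [tok_b] + list(tok_b.ancestors)  # tok_b up to the root
    a_ids = [t.i for t in a_chain]
    b_ids = [t.i for t in b_chain]
    common_i = next(i for i in a_ids if i in set(b_ids))  # lowest common ancestor
    up = [t.dep_ for t in a_chain[:a_ids.index(common_i)]]
    down = [t.dep_ for t in b_chain[:b_ids.index(common_i)]]
    return up + [a_chain[a_ids.index(common_i)].pos_] + list(reversed(down))

doc = nlp("Bob claims he hates insects")
print(dep_path(doc[0], doc[4]))  # e.g. ['nsubj', 'VERB', 'ccomp', 'dobj']
```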