Linguists on the lam

Posted in Uncategorized on 2011 March 10 by Asad Sayeed

So, the reason why I decided to start writing again as per the previous post is that I was inspired by this post by Melody Dye which was intended, I guess, to stir up an old debate, and kind of also succeeded.  I didn’t participate on the thread due to time constraints, but I vehemently disagree with the argument she presents, and I eventually got into a “tweetflooding” argument with old friend Jeremy Kahn and new virtual friend Zoltan Varju on the matter. I will eventually get to responding to it, I hope, even though I really shouldn’t as I have something called a “dissertation” to write.  (Ugh.)  I’m going to write something related but just slightly tangential here.

In a nutshell: Melody’s thread is yet another rehash of the old methodological arguments against linguistic (particularly syntactic) theory that are destined to be visited on every generation. Multiple times. Forever and ever—it is simply a fact one must accept that people are going to believe that Google is a sort of linguistic counterexample engine.  I am in the peculiar position of someone who works with Big Corpora as his bread-and-butter and dissertation topic and so on—but remains quite skeptical of the ability of this work to provide us with particularly interesting insights as to the human capacity for language in itself.

But the main point I want to make, briefly, is about the linguistics blogosphere itself.  Is it just me, or is it wildly unrepresentative of the linguistics field as a whole?  Maybe it’s because I live very near to/participate in the Maryland hothouse of unreconstructed generative grammarians (Philip Resnik excepted, heh), but, um, it doesn’t seem to reflect the other “hothouses” (Carleton U and U of Ottawa) to which I’ve belonged, nor does it reflect my brushes with other real-life linguists and other departments.  On the occasions that I have read Language Log, it and its commentariat have tended to take positions a lot closer to Melody’s than to those of mainstream syntactic theory.

Aside from the obvious accusations of “anecdote” and “sample bias”, let me throw out another possible explanation that might actually tie together a number of issues: a lot of syntactic theorists, both faculty and students, tend to come from humanities (lit. and philosophy) backgrounds, so it is not really surprising that the linguistics blogosphere is pretty saturated with Big Corpus types and neo-empiricists and so on—and that a Google (heh) search for “minimalist linguistics blog” and various terms like that doesn’t tend to turn up much.

Again, perhaps I missed the Big Syntax Blog out there, but I’m pretty connected and well-read *cough* online, and I’d be surprised if I had truly missed it.

Now as to why syntacticians tend to have this background, why the technically-oriented ones might drift to the Big Corpus side of things, and what this all means for the field, well, those are interesting questions indeed.  It seems to be the case, for example, that a lot of syntactic theorists are getting jobs in English departments rather than, say, applied math or logic positions.  (More anecdotal experience.)

And as to what it all means, well, it means that syntactic theory is susceptible to criticism from the camp on the opposite side, the “European-style” logicians and formal grammarians—a criticism to which I am much more sympathetic than to the claims of post-Chomskyans/neo-empiricists. (And about which I intend to write a post in the not-too-distant future!) But it also means that, regardless of who is right about these matters, syntax is not growing its base in the places where it needs to grow its base, insofar as academic blogs are potential incubators of future collaborators and grad students.  And I believe that they are, these days, to a goodly extent.

And unfortunately it kind of also means that a lot of syntacticians will only be dimly aware that these issues are being revisited, even if the arguments aren’t really all that different from the ones that have been made in the past.  I am definitely sympathetic to people who might think we’ve been here and done that, like, 50 years ago.  So it goes.

Again! Again!

Posted in Uncategorized on 2011 March 10 by Asad Sayeed

Hi. *waves a little sheepishly*

Recent events have caused me to reconsider firing up this blog again, although I’ve been having a ball on Twitter.  In the meantime:

Not long after I named this blog, I realized that the title of this blog has a certain, um, unfortunate acronym.  But I quickly decided that I wouldn’t consider changing it.  It just amuses me to think that someone might one day say to themselves, “Asad is writing BS!” and my spirit self would think “Yes, yes he is.” So, I think it’s perfect.

Derivational cycles: syntax seminar

Posted in Uncategorized on 2009 September 17 by Asad Sayeed

By the way, I am also attending a seminar in syntax (specifically, theories of derivational cycles) held by Norbert Hornstein and Juan Uriagereka.  Unfortunately, it overlaps with Philip Resnik’s sentiment analysis seminar, and technically being CS and all, I attend Philip’s class fully and then barge in an hour late to the syntax seminar.  That means I have a devil of a time picking up the thread of the conversation, but so far—as it has focused on a historical review of cyclicity in the syntax literature, much of which I am already familiar with—I don’t yet feel like I am suffering.

For those not as familiar with theoretical syntax and wondering what a “derivational cycle” might be…hoo, boy.  One of the criticisms of theoretical linguistics of the so-called “Chomskyan” variety (that UMD linguistics practices with gusto) is that it has its head in the formalistic clouds, far away from language, but never far enough away that it can be described with a great deal of mathematical precision.  Cyclicity in syntax is both a prime example of this, and one of the most important and IMO convincing and interesting aspects of the approach.

But one simple way of thinking about it is that there are definite limitations to the scope of question words in a sentence, and that these limitations happen in “cycles” roughly—but not strictly—defined by nested clauses.  Making the case requires a lot of examples and reams of PhD theses, but here’s an illustrative pair:

  • Why was the man sleeping in the boat?
  • What boat was the man sleeping in?

We can extend both questions by adding another clause-embedding, in a sense recursively (to appeal to CS sensibilities).

  • Why did you tell the reporter that the man was sleeping in the boat?
  • What boat did you tell the reporter that the man was sleeping in?

In the “why” case, the shorter question asks the reason for sleeping in the boat, but the longer question no longer allows that interpretation.  We are instead forced to interpret it as asking why “you [told] the reporter” about it.  In other words, the introduction of the embedded “that” clause seems to have had the effect of “blocking” the question from applying to the lower clause.

But not so for “What boat”!  There is therefore something special about the introduction of a clause boundary in English that blocks some interpretations of questions but not others.  We can extend these examples further, and into other languages.  As clauses can be nested further, we can suggest that these phenomena are therefore in some sense cyclic or recursive, yet apply to very abstract human faculties of interpretation.

This particular class was a review of Chomsky’s classic Barriers monograph (in a nutshell, how clause boundaries act as barriers to certain interpretations) followed by a review of Lasnik and Saito’s work (that’s UMD’s Howard Lasnik) on “proper government”, which elaborates on some of the conditions that permit barriers to form through characteristics of abstract variables called “traces”.  Each of these, however, would take me hours to summarize, so I won’t, at this point.

Sentiment analysis seminar

Posted in Uncategorized on 2009 September 17 by Asad Sayeed

Hi ho, people.  I attended the sentiment analysis seminar again this week, but this time I was helping lead the discussion, on current efforts in sentiment annotation, particularly the Multi-Perspective Question Answering (MPQA) corpus.  We covered these papers:

The first paper mainly covers the basic effort in MPQA sentiment annotation—how to break down the problem so as to achieve consistent annotation and how to measure inter-annotator agreement.  The basic paradigm that Wiebe and her team use is to view sentiment annotation as a mapping between stretches of text and the “private states” of an opinion-holder.  A private state is simply a property of an opinion-holding entity that is not independently verifiable.  In other words, a private state is a description of subjectivity.

We can thus develop an ontology of private states, the holders of those states, and how they are reflected in text.  There are two major categories of private-state expressions: expressive (implicit) subjectivity and direct (explicit) subjectivity.  “Bob squashed the hated insect” is a statement in which the hatedness of the insect is clearly subjective, but the subjectivity is not directly attributed to Bob—even though in context it could be Bob’s opinion.  “Bob claims he hates insects,” on the other hand, directly attributes a private state to Bob.
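
For concreteness, here’s a toy sketch in Python of how one might represent that two-way distinction.  This is my own illustration, not the actual MPQA scheme (the real annotation has many more attributes, and all the names here are made up):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Subjectivity(Enum):
    DIRECT = "direct"          # private state explicitly attributed to a holder
    EXPRESSIVE = "expressive"  # subjectivity implicit in the word choice itself

@dataclass
class PrivateState:
    """A stretch of text expressing some opinion-holder's private state."""
    span: str              # the annotated stretch of text
    start: int             # character offsets into the source sentence
    end: int
    holder: Optional[str]  # the opinion-holding entity, if one is attributed
    kind: Subjectivity

# "Bob squashed the hated insect": "hated" is subjective, but the
# opinion is not explicitly attributed to anyone.
expressive = PrivateState("hated", 17, 22, holder=None,
                          kind=Subjectivity.EXPRESSIVE)

# "Bob claims he hates insects": the private state is directly
# attributed to Bob via the speech event "claims".
direct = PrivateState("claims he hates insects", 4, 27, holder="Bob",
                      kind=Subjectivity.DIRECT)
```

The point of the structure is just that every annotation carries both a textual span and an (optional) holder, so that the expressive/direct distinction is recoverable from the data rather than left implicit.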

It was interesting to watch the reactions of some of my fellow classmates.  Most of them come from either straight-up CS or hardcore statistical NLP, so fine-grained philosophical distinctions of subjectivity are not day-to-day staples in their work, and some of them said on the course mailing list and in class that reading the papers required something of a shift in mentality.  Philip went through some of the techniques linguists use in making these distinctions, including how linguists vary the contexts of statements in order to construct tests for their linguistic properties.

Sentiment analysis seems to be one of those places where a stronger bridge between linguistics and applied NLP can be made.

The second paper followed much in the same vein as the first, except that it emphasized the extent to which current NLP techniques cannot yet handle some of the distinctions in the MPQA annotation.  In particular, semantic role labeling—a family of existing techniques for identifying the participants in the events a sentence describes—cannot be used to directly infer some of the participants in a private-state expression.  For example, when the holder of an opinion is only implied, semantic role labeling as we currently conceive it will never find that holder.

The last reading dealt with some additions that Theresa Wilson made to the MPQA, particularly target/topic annotations, as well as a subdivision of opinion types into “attitudes” like “sentiment” and “arguing”.

Having fooled around a bit with the MPQA myself, I had the opportunity to show the class a little bit of what it looked like, and what the challenges of using a somewhat inconsistent standoff annotation format could be.  Philip also took the opportunity to try out some collaborative annotation of a small passage of text, with hilarious results—in the amount of arguing it took to decide the subjectivity of even small stretches of text.  Assigning subjectivity is too subjective! In that sense, the high inter-annotator agreement in the original MPQA effort seems somewhat surprising—a point that had been raised on the mailing list before the class.

Multiple annotations

Posted in Uncategorized on 2009 September 11 by Asad Sayeed

I am now (yet again) running into what must be a familiar problem to any researcher performing experiments on annotated text: managing and coordinating multiple streams of annotation, from humans and from software, onto the same texts.  Right now, I have documents marked up in XML via a named entity tagger, but now I need to mark off certain kinds of phrases without involving anything that was marked in XML as a named entity.  This will require an annoying song-and-dance to coordinate an XML parser with an, well, English-language parser—stripping out the tags and then putting them back.

Couple this with other sources of annotation down the pipeline, and it’s at the very least a long series of the tedious and repetitive kind of programming, not the fun kind.
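
One way I could imagine taming the tag-stripping part—a minimal sketch, assuming non-nested tags and a made-up `<NE>` tag name, not my actual pipeline—is to strip the XML while recording, in plain-text coordinates, which spans were named entities, so a later phrase-marking pass can simply be told which spans to avoid:

```python
import re
from typing import List, Optional, Tuple

NE_TAG = re.compile(r"</?NE[^>]*>")

def strip_ne_tags(xml_text: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Remove <NE> tags, returning the plain text plus the character
    spans (in plain-text coordinates) that were inside a named entity.
    Assumes the tags are well-formed and not nested."""
    pieces: List[str] = []
    ne_spans: List[Tuple[int, int]] = []
    pos = out_len = 0
    open_at: Optional[int] = None
    for m in NE_TAG.finditer(xml_text):
        chunk = xml_text[pos:m.start()]
        pieces.append(chunk)
        out_len += len(chunk)
        if m.group().startswith("</"):
            ne_spans.append((open_at, out_len))  # close: record the span
        else:
            open_at = out_len                    # open: remember where it began
        pos = m.end()
    pieces.append(xml_text[pos:])
    return "".join(pieces), ne_spans

def overlaps_ne(span: Tuple[int, int], ne_spans: List[Tuple[int, int]]) -> bool:
    """True if a candidate phrase span overlaps any named-entity span."""
    s, e = span
    return any(s < ne_e and ne_s < e for ne_s, ne_e in ne_spans)

plain, spans = strip_ne_tags('I saw <NE type="PER">Bob</NE> in the park.')
# plain == "I saw Bob in the park.", spans == [(6, 9)]
```

The downstream parser then works entirely on plain text, and the original markup can be re-serialized from the recorded spans afterward—tedious bookkeeping, but at least it only has to be written once.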

Now, of course, some portion of my nonexistent readership will pipe up: what about UIMA?  UIMA is very nice in theory, but it is (to be blunt) a big, Javafied, over-engineered mess.  My teammates and I used it last year for an information extraction project that is now on ice, and it was one of those cases where the cure was worse than the disease.  Modularization and reusability and generality and safety and all that software-engineering virtue is very nice and elegant in theory, but it imposes a steep and unacknowledged price in programmer usability—especially since we were trying to integrate it into an annotation pipeline heavily dependent on tools I had already created in Python.

(Someone really needs to rein in software engineers.  UIMA reminded me of the UML, whose problems were apparent in the 90s for similar reasons.  Note in the Wikipedia article the problem of OO orthodoxy in UML.  As scripting languages have evolved and become popular, UML becomes even more difficult to conceive as a practical way of translating design to implementation.)

So until someone comes up with a nice, friendly way of mangling multiple NLP-related annotation streams together in Python and Perl and whatever, without expecting enormous Eclipse workbenches and UML-ish software-engineering doodads, I guess I’m just going to stick to combining annotations via irritating ad hoc scripts.

Sentiment analysis seminar

Posted in Uncategorized on 2009 September 9 by Asad Sayeed

I will attempt to blog some of the things I attend during the semester.  One of them is a weekly seminar on sentiment analysis taught by Philip Resnik.  I am in it right now. This is therefore a liveblog and hence not guaranteed to make sense or be complete—especially the latter, far from it.

Tim Hawes’ thesis and conversational analysis

The first thing we’re talking about is Tim Hawes’ work for his Master’s degree, which he defended just yesterday—I attended the defence before I had even started this blog.  It was about predicting the outcomes of US Supreme Court cases from transcripts of oral arguments.  This is particularly interesting today, as Philip just mentioned, because Sonia Sotomayor showed up for work at SCOTUS for the first time.  I proposed on the mailing list that one further means of predicting how an individual justice would vote, even if they rarely say anything on the bench (true of some justices), would be their body of writing and argument prior to confirmation.  Philip proposed the use of a mixture model based on prior argument, updated as the justice moves through his/her career.

In the case of legal arguments, we have to make some assumptions.  Hawes’ thesis mentioned a couple of textual assumptions: cohesion and coherence.  That is, we assume that there are topical and other elements that evolve through the text in a consistent way.  There are techniques we can use to measure and segment a document based on these kinds of assumptions, such as TextTiling and lexical chaining.
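
To give a flavour of the cohesion intuition behind TextTiling—the real algorithm slides windows over the text and looks for depth minima in a similarity curve, so this toy function (my own illustration, not Hawes’ method) just scores one pair of adjacent blocks:

```python
import math
from collections import Counter

def lexical_cohesion(block_a: str, block_b: str) -> float:
    """Cosine similarity of raw word-count vectors: a crude proxy for
    lexical cohesion between two adjacent stretches of text."""
    a = Counter(block_a.lower().split())
    b = Counter(block_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Adjacent blocks that stay on topic score higher than a topic break:
same_topic = lexical_cohesion("the court heard the case today",
                              "the case reached the court quickly")
topic_break = lexical_cohesion("the court heard the case today",
                               "my garden needs watering badly")
```

A TextTiling-style segmenter would compute a score like this at every candidate boundary and hypothesize a topic shift wherever the similarity curve dips sharply; lexical chaining pursues the same cohesion assumption by tracking related words across the text instead.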

The distinction between cohesion and coherence: the latter is a semantic value that really must be judged by a person—it’s about interpretation.  It is possible to have cohesion without coherence.  (We mentioned the word “zeugma”; look it up.)

The right place to start for discourse analysis from an NLP point of view is Rhetorical Structure Theory (RST).  Daniel Marcu has written on this topic.

Hawes’ thesis had two kinds of results.  One of these was encoded in “rose diagrams”: modified pie charts in which each slice varies in radius as though it were a petal of a rose, and each petal is coloured in a gradient of shades.  In this representation, we can visualize a large number of things at once.  In the case of SCOTUS, each justice can be represented as a petal whose colour represents political leaning, whose radius represents agreement, and whose width represents the number of follow-up turns at questioning.  It’s a bit of a complicated representation and hard to describe without a diagram, which I’m not about to draw on the fly.

While this form of visualization is quite complicated by itself, it can be used to make contrasts between types of cases and judicial situations, and it often produces very strong and visible contrasts.  Contrasts we can examine include “liberal” vs. “conservative”, affirm vs. overturn, plaintiff win vs. lose, and so on.  It turns out that by this method you can predict the vote of Clarence Thomas (who rarely speaks) to a high degree of accuracy.

We didn’t seem to get to the other technique Hawes used, but we had to change rooms.

Sentiment analysis

We didn’t end up changing rooms.

In this part of the seminar we briefly touch on the basics of sentiment analysis, particularly with reference to Bing Liu’s recent review article.  So we begin with a discussion of the general dimensions and challenges of sentiment analysis, such as

  • What do people think of _____?
  • What features/facets/topics matter?
  • Mixed and neutral sentiment.
  • The effect of comparatives.
  • How opinions change, and what influences this.
  • Covert opinion/spin vs. overt expressions.
  • The holders of opinion.
  • Multilingual and cross-language issues.

We then had a rather wide-ranging discussion of the different kinds of issues, including a detailed discussion on syntactic relationships within sentences that might relate opinion-holders to sentiments to opinion targets/objects. This discussion was so wide-ranging and yet very compressed that it is hard to represent it in a liveblog, but it covers issues that we will revisit in later sessions.

Why I am still a No Facebook zone

Posted in Uncategorized on 2009 September 8 by Asad Sayeed

I have given in a little bit to the social networking trend that I have generally resisted by adopting Twitter for certain purposes.  It’s useful for running an occasional headline feed of my life and activities or for publishing a running commentary.  While it has the followspam problem, it’s relatively easy to hide in the crowd.

From fairly early on, I have had a Facebook page.  But almost from its inception, I have set its status to say in no uncertain terms that I will not respond to friend requests.  I got the Facebook page because certain aspects of my life required interaction with undergrads who live their lives publicly and mostly use the Facebook Internet to manage their social lives, as I pay attention to one or two undergraduate-run campus cultural organizations.  Suffice it to say that since that time, these organizations have failed to pique my interest, but I have kept my identity “parked” on Facebook anyway, rather than delete it.

Nevertheless, I admit that I have never found a good way to articulate my dislike of Facebook and its reasons. However, Valerie Aurora has done so for me (incl. link to great Wired article).

Facebook’s internal messaging system is 100% pure evil and completely representative of what I hate about Facebook. I can get an email notification that someone has sent me a message, but replying to them via email requires me to look up their email address on their profile by hand – so you don’t, you just reply using internal Facebook messaging, which requires me to go back to their damn web site every time I want to communicate with this person. I can see this being very attractive if you have a sucky email provider with bad spam filtering, but you could also write this in such a way that it integrates smoothly with your existing email account. Facebook didn’t, because they want Facebook Internet to partition from Real Internet, leaving them with far more control over your online data than Google could ever dream of acquiring.

That said, despite the temptation of Gmail at times, I have not allowed myself to become too well-integrated into the BorG for now, as I am a little bit paranoid about them too—although, it’s true, they are still not as problematic as Facebook.

Woohoo, first substantive post!