Multiple annotations

I am (yet again) running into what must be a familiar problem to any researcher performing experiments on annotated text: managing and coordinating multiple streams of annotation, from humans and from software, on the same texts.  I have documents marked up in XML by a named entity tagger, and I now need to mark off certain kinds of phrases without touching anything that was already tagged as a named entity.  This will require an annoying song-and-dance to coordinate an XML parser with a, well, English-language parser: stripping the tags out and then putting them back.

Couple this with other sources of annotation down the pipeline, and at the very least it adds up to a long stretch of the tedious, repetitive kind of programming, not the fun kind.
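For the curious, here is roughly what the tags-off half of that song-and-dance can look like in Python. This is only a minimal sketch, not what my pipeline actually does: it assumes the tagger's output is well-formed XML under a single root element, the tag names (PERSON, LOC) and the sample sentence are made up, and the helper names to_standoff and drop_overlapping are hypothetical. The idea is to pull the plain text out once, remember each entity as a (start, end, tag) character offset, and then throw away any candidate phrase that overlaps an entity.

```python
import xml.etree.ElementTree as ET


def to_standoff(xml_string):
    """Split inline XML into (plain_text, spans).

    Each span is (start, end, tag), with character offsets into plain_text.
    Assumes well-formed XML with a single root element; the root itself
    is not recorded as a span.
    """
    root = ET.fromstring(xml_string)
    pieces = []
    spans = []
    offset = 0

    def walk(elem, is_root=False):
        nonlocal offset
        start = offset
        if elem.text:
            pieces.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            # Text after a child's closing tag belongs to the parent.
            if child.tail:
                pieces.append(child.tail)
                offset += len(child.tail)
        if not is_root:
            spans.append((start, offset, elem.tag))

    walk(root, is_root=True)
    return "".join(pieces), spans


def drop_overlapping(candidates, entity_spans):
    """Keep only (start, end) candidates that overlap no entity span."""
    return [
        (c_start, c_end)
        for c_start, c_end in candidates
        if not any(c_start < e_end and e_start < c_end
                   for e_start, e_end, _tag in entity_spans)
    ]


# Made-up example input; the tag names are assumptions, not a real tagger's output.
doc = ("<doc>Yesterday <PERSON>Ada Lovelace</PERSON> visited "
       "<LOC>London</LOC> by train.</doc>")
text, entities = to_standoff(doc)

# Pretend a phrase chunker run on the plain text proposed these two spans:
# "by train" (38, 46) and "visited London" (23, 37).
candidates = [(38, 46), (23, 37)]
kept = drop_overlapping(candidates, entities)
```

With the offsets kept separately like this, the original text never has to be re-tagged just to check for overlaps; putting everything back into one XML file is a separate (and still annoying) step that this sketch does not attempt.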

Now, of course, some portion of my nonexistent readership will pipe up: what about UIMA?  UIMA is very nice in theory, but it is a big Javafied over-engineered mess (to be blunt).  My teammates and I used it last year for an information extraction project that is now on ice, and it was one of those cases where the cure was worse than the disease.  Modularization, reusability, generality, safety, and the rest of that software engineering are very nice and elegant in theory, but they impose a steep and unacknowledged price in programmer usability, especially since we were trying to integrate UIMA into an annotation pipeline heavily dependent on tools I had already created in Python.

(Someone really needs to rein in software engineers.  UIMA reminded me of UML, whose problems were apparent in the 90s for similar reasons.  Note in the Wikipedia article the problem of OO orthodoxy in UML.  As scripting languages have evolved and become popular, UML has become even harder to see as a practical way of translating design to implementation.)

So until someone comes up with a nice, friendly way of mangling multiple NLP-related annotation streams together in Python and Perl and whatever, without expecting enormous Eclipse workbenches and UML-ish software-engineering doodads, I guess I’m just going to stick to combining annotations via irritating ad hoc scripts.

