Unsupervised Morphological Analysis and Crosslingual Word Sense Disambiguation using Large Quantities of Unannotated Data

Edward Kenschaft, master's thesis based on pre-candidacy doctoral research, University of Maryland.

This page displays best under Firefox, but it should also be readable under Internet Explorer. Likewise, it makes use of color, but should be perfectly understandable in black & white.

This paper is also available in a PDF version.

1. Introduction

1.1. Abstract

My research at the University of Maryland involved three distinct but related areas within the field of computational linguistics:

  1. Unsupervised morphological analysis (UMA) – Breaking words into component morphemes for nearly any written language, using only raw text for analysis.
  2. Word sense disambiguation (WSD) – Distinguishing between a word's possible meanings.
  3. Machine translation (MT) – Translating text automatically from one language to another, particularly making use of UMA and CWSD.

We evaluated the following experimental hypotheses:

  1. Unsupervised morphological analysis (UMA)
    1. We can design a UMA system that applies to any morphologically complex language, and has accuracy comparable (not necessarily superior) to language-specific systems. (Task I)
  2. Word sense disambiguation (WSD)
    1. WSD accuracy can be improved by providing unannotated data in addition to annotated training data. (Task II.A)
    2. WSD accuracy can be improved by using document-level features in addition to sentence-level features. (Task II.B)
    3. WSD accuracy can be improved by performing UMA on the text. (Task II.C)
  3. Machine translation (MT)
    1. MT accuracy can be improved by performing UMA on the source text. (Task III.A)
    2. MT accuracy can be improved by performing CWSD on the source text. (Task III.B)

This paper summarizes all experimental results, with conclusions supporting hypotheses A, B, C and D.

Please note the partial glossary at the end of the paper.

Disclaimers: This paper is not intended for publication, but is rather a summary of my research efforts at UMD, from fall 2005 through 2006. It incorporates portions of several project reports, minimally edited, as well as descriptions of incomplete experiments. Summaries of related research efforts have generally not been updated since 2006.

1.2. Goals

1.2.1. Task I: Unsupervised morphological analysis

Morphological analysis (or induction) is the process of breaking a word, or all the words of a given language, into component morphemes. The field has a substantial history, but efforts have typically been constrained in major ways:

Building upon this tradition, we sought to:

The goal was an unsupervised morphological analysis (UMA) system that can be used to induce morphological structure in any language where written text is available, regardless of what else is known about the language.

1.2.2. Task II: Crosslingual word sense disambiguation

Word sense disambiguation (WSD) is the process of determining which possible sense of a word is used in a given context. For instance, the English word fence has at least three senses:

  1. He climbed over the fence.
  2. He learned to fence with a sabre.
  3. He went into the city to fence the stolen TV's.

Monolingual WSD is often a poorly defined problem, in that:

Various researchers have begun to explore crosslingual word sense disambiguation (CWSD), based on a corpus of sentences that have been translated between two languages. The possible translations of a word in the target language are taken to be its sense inventory. That is, if a word can be translated in three different ways, then these are its three "senses" – regardless of whether some other theoretical formalism would agree.

So far, research into CWSD has been constrained in several ways:

Building upon this tradition, we sought to:

Three of these approaches reduced data sparseness by introducing noisy data. Naturally, this involves trade-offs. However, results show that these approaches nonetheless helped in some cases. More research presumably could further improve noise filtering.

1.2.3. Task III: Machine translation

We attempted to evaluate UMA and CWSD in terms of their impact on machine translation (MT). However, we ran out of time before reaching the desired results.

1.3. Related work

Disclaimer: These highlights do not scratch the surface of a full literature survey on these topics.

1.3.1. Related work: Unsupervised morphological analysis

Morphological analysis is the process of parsing a word into its component morphemes. This can be done manually or automatically, using either a rule-based or statistical system. Many of the most commonly studied languages have commercially available rule-based systems which achieve greater than 95% accuracy (citation???). However, developing such a system for a new language is time-intensive, as is annotating data for use by a supervised statistical system. We therefore focused our attention on unsupervised induction of morphological structure. Such a system could be created for a new language much faster than a rule-based or supervised system, as long as unannotated text is available.

This field has a long and rich history (e.g. Koskenniemi 1983; Kay 1987). We mention just a couple of efforts that most closely influenced our work.

Dayne Freitag (2005) offers a methodology for inferring morphosyntactic transformation rules, as opposed to morphemes per se. In a nutshell, his method involves the following steps:

  1. Group all tokens into co-occurrence classes (i.e. tokens that appear in similar contexts).
  2. Identify transformations between members of one co-occurrence class to members of another.
  3. Cull transforms to a minimal set.

Some advantages of Freitag's system:

Morphological analysis is related to a task known in the literature as (deep) lexical acquisition (DLA). Timothy Baldwin (2005) assumes a seed database, and describes a variety of methods for augmenting that database. His method for identifying morphemes is summarized as follows:

  1. Generate all 1- to 6-grams occurring in the corpus.
  2. Filter out n-grams with fewer than (m=3) occurrences.
  3. Filter out any n-gram with the same frequency as any supersequence (i.e. a larger string in which it appears).
  4. Select the (k=3900) n-grams with the highest saturation (i.e. most complete coverage of the data?).

This method, I suspect, is theoretically inferior to Freitag's, since it does not share any of the advantages listed above, and arbitrarily limits the number of recognized morphemes. However, it should be significantly simpler and faster to implement.

Baldwin's non-morphological features depend on deeper semantic knowledge, which I do not presume to have.

1.3.2. Related work: Morphology & word sense disambiguation

Florian and Wicentowski (2002) used morphological analysis to improve WSD, referenced in (Wicentowski 2002).

I only discovered this paper recently (2008), and have not examined their work.

1.3.3. Related work: Morphology & alignment / machine translation

Philipp Koehn & Kevin Knight (2003) experimented with heuristics for dividing German compounds into component morphemes, by looking for known roots that occur as substrings of larger words. They evaluated the effects of their approach on word alignment (German↔English), word-based statistical machine translation (WBSMT), and phrase-based statistical machine translation (PBSMT).

Because German is agglutinative, each German word (particularly noun compound) may align with an arbitrarily long sequence of English words. This provides a significant challenge for word alignment, and for other applications that depend upon alignment, such as machine translation.

Koehn & Knight attempted to address this problem by learning rules for splitting German noun compounds into component parts. A part is either a subword or filler between subwords. They used only unannotated text for training.

Their algorithm involved two steps: generation and selection.

For generation, they looked at a new word, and listed every possible covering of that word with known words and fillers. For example, if the new word is aktionsplan, the four possible coverings are:

  1. aktionsplan (no covering)
  2. aktions-plan
  3. aktion-s-plan
  4. akt-ion-s-plan
For selection, they employed various heuristics to determine which of these coverings is in some sense optimal. They experimented with a variety of different heuristics, including:
  1. eager – split each word into the most (hence, smallest) possible chunks
  2. frequency-based – employ a metric which takes into account frequency counts of various chunks in the training text
  3. parallel text – use German↔English parallel text to test each proposed split against aligned text, keeping it if maps to an actual English phrase, discarding it if it does not
  4. part-of-speech (POS) – restrict covering words to content words (e.g. not prepositions or determiners)

Their results are shown in (01).

(01) Table: Evaluation of compound-splitting methods (Koehn & Knight 2003)
Splitting method Alignment BLEU
precision   recall    accuracy WBSMT PBSMT
none 94.2% 29.1 30.5
eager 24.8% 73.3% 87.1% 22.2 34.4
frequency-based 57.4% 86.6% 95.7% 31.7 34.2
parallel 83.3% 89.1% 98.6% 29.4 33.0
parallel & POS 93.8% 90.1% 99.1% 30.6 32.6

Not surprisingly, the best word alignment, 99.1% accuracy, occurred when using all possible information, parallel text plus POS, up from 94.2% with unmodified text. The frequency-based metric produced only produced moderate improvements, to 95.7%. Eager splitting hurt alignment.

It is perhaps surprising that the best downstream results with MT, as measured by BLEU score, did not involve optimization with parallel text. For word-based MT, the frequency-based metric was by far the best performer, with a 2.6 and 1.6 improvement in BLEU score, respectively, over no splitting and splitting based on all information.

Even more interesting are the results with phrase-based MT. Here the best approach was dumb, eager splitting, with a whopping 3.9 BLEU improvement over no splitting (34.4 vs. 30.5). The frequency-based metric was a close second (34.2).

The success of eager splitting is attributed to the peculiar strength of PBSMT, which has no problem taking an arbitrary string of "words" and regrouping them into "phrases" as needed.

Nießen & Ney (2001a, 2004) introduced the idea of a hierarchical lexicon, where a word is represented at various levels of inflectional specificity, starting with the bare root. German↔English alignment is performed at each of these levels. The technique was adopted by various later researchers.

Nießen & Ney also employed various heuristics for syntactic restructuring, such as reordering questions and detaching verb prefixes, to make the German more like the English. Finally, they added other heuristics, such as detecting multiword phrases based on POS sequence.

Used all together, the techniques led to improvements comparable to an order of magnitude of data (02). This claim would be stronger if an additional order of magnitude were represented, e.g. from 500 to 5K.

Each column represents a different accuracy measure; lower is better.

(02) Table: Evaluation of syntactic restructuring methods (lower is better) (Nießen & Ney 2004)
#sent Method 1-BLEU mWER SSER ISER
0 baseline 76.7 53.6 60.4 29.8
restructure 70.9 50.2 57.8 30.0
  + all 67.4 48.0 52.8 24.1
5K baseline 52.6 38.0 37.3 17.4
restructure 47.9 34.7 33.6 15.2
  + all 47.1 33.9 31.8 13.7
58K

baseline 46.3 34.1 30.2 14.1
restructure 43.7 32.5 26.6 12.8
  + all 42.9 31.8 26.3 11.8

Todo: Results with just hierarchical lexicon, no restructuring?

Popović & Ney (2004a) built on the concept of the hierarchical lexicon (Nießen & Ney 2001a). They first parsed German words into roots and artificial inflectional morphemes, e.g. PRES for present tense. They then used a modified EM alignment algorithm to treat each of these complex "words" as a hierarchy, with alignment possible at any level. They evaluated effects on a variety of baseline systems, producing at least modest improvements in every case.

Popović et al. (2004b, 2005) used knowledge of the specific language pair to remove inflectional morphemes or function words that are not translatable, leading to a significantly reduced lexicon, and reduced alignment error rate.

I presume the alignment of function words to NULL has been handled in the mainstream alignment literature, although I haven't read up on it. Removing inflectional morphemes by hand seems (a) ad hoc, and (b) potentially detrimental to translation due to loss of information. Splitting apart but retaining all morphemes should provide the same benefits, without these objections.

Bettina Schrader (2004) also addressed German↔English alignment & translation (among other things), first looking at inflectional affixes, then noun compounding.

In experiments on three different data sets, Schrader removed inflectional affixes from the German, and aligned/translated only the lemmas. She evaluated on translation candidates generated using a bag-of-words dictionary generator (Hiemstra 1996). With all three data sets, recall went up with lemmatization. However, with the exception of one data set (Patente), precision went down. Schrader attributes this to a naive precision metric, which unduly penalizes the system for generating multiple translation candidates for the same term.

Using only the Patente data set, Schrader tried another experiment where she broke long compounds into component morphemes before alignment, e.g. Dämpfungsscheibenanordnung ('dampening disk assembly') → Dämpfung scheibe anordnung. Recall improved somewhat, while precision did not change significantly. Schrader noted that the system was severely hindered by its inability to handle alignments involving multi-word units. This suggests that results should be significantly better when evaluated according to a downstream PBSMT system.

Schrader (2006) addressed the problem of aligning German noun compounds with English word sequences. For data, she used the Europarl corpus (Koehn 2005) with both sides automatically POS tagged, and manually corrected alignments of ~100,000 German tokens to English counterparts.

Motivation

Schrader was particularly interested in the problems posed by an inordinately large proportion of hapax legomena (i.e. word forms occurring only once) in the German data, as hapax legomena cannot be handled neatly by a statistical system. As shown in table (03), only 38% of English word forms are hapax, compared to 49% of German.

(03) Table: Corpus characteristics (Schrader 2006)
Language Tokens Forms Hapax Other Rare Frequent
English 29,077,024 101,967 39,200 (38%) 35,608 (35%) 27,159 (27%)
German 27,643,792 286,330 140,826 (49%) 98,126 (34%) 47,378 (17%)

Of 512 German hapax legomena examined:

(04) Table: German compound length vs. English phrase length (# of occurrences) (Schrader 2006)
English German compound length
1 2 3 4 >4
1 word 59 2 1 0 3
2 words 30 119 56 15 10
3 words 2 20 15 8 10
4 words 0 1 0 0 2

The English phrase length is often one more than the compound length, due to the presence of a prepositional phrase. For example, Kongresh-vorlage 'submission to Congress'.

Observation: Table (04) only seems to support this observation for German noun compounds of length 1 or 2. For longer compounds especially, a more precise analysis might lead to improved heuristics.

Approach

Instead of word-to-word mappings, Schrader mapped German compound nouns to English word sequences.

Results
Side argument: Compositional analysis of compound nouns

Schrader argues that German compounds should not be broken into components, since the meaning of the compound may not be compositional. For instance, the German word Personen-stand is made up of roots literally translated as 'personal status', but the meaning of the compound is 'marital status'. However, a modern PBSMT system will take this into account, translating the sequence as a non-compositional unit when possible, and as compositional components otherwise.

Goldwater & McClosky (2005) addressed the problems of translating from a morphologically rich language (Czech) to a morphologically poor language (English).

For data, they used:

Evaluation was based on word-based (not phrase-based) MT BLEU scores.
Motivation

Czech's morphological richness leads to high data sparseness. While the numbers of rare lemmas in Czech and English are comparable, Czech has roughly twice as many rare inflected forms as English (07).

(07) Figure: Rare word forms (Goldwater & McClosky 2005)
[Token counts]

Furthermore, features represented morphologically in Czech are often represented by other means in English (08).

(08) Table: Morphological correspondences
Czech English
a. plural
past tense
plural
past tense
b. person
negation
genitive case
instrumental case
pronouns
not
of
by or with
c. gender none
d. other case markings syntax
Approach

A variety of approaches were compared. These are illustrated in (09), and explained following.

(09) Figure: Transformations (Goldwater & McClosky 2005)
a. Original Pro nĕkoho by její provedení mĕlo smysl .
b. Lemmas pro nĕkdo být jeho provedení mít smysl .
c. Morpheme tokens pro nĕkdo být PER_3 jeho provedení mít PER_X smysl .
d. Modified lemmas pro nĕkdo být+PER_3 jeho provedení mít+PER_X smysl .
e. Vector (pro) (nĕkdo) (být PER_3) (jeho) (provedení) (mít PER_X) (smysl) (.)
Lemmatization

Replacing each word with its associated lemma reduces data sparseness, but also loses information (09.b). This approach is expected to help when morphological information is not carried over into English (08.c,d), but may hurt in other cases.

Several different approaches to lemmatization were tried.
Morpheme tokens

In this approach, words are replaced by token sequences, each token representing one morpheme (09.c). This reduces data sparseness, although not quite as much as lemmatization. This approach is expected to help most when the Czech morpheme corresponds to an English function word (08.b).

Modified lemmas

First each word is lemmatized, then tags representing morphemes of interest are concatenated to the lemma (09.d). This eliminates some variation among word forms, thus slightly reducing data sparseness. This approach is expected to help when morphological information is carried over into English (08.a).

Feature vectors

Each word is replaced by a feature vector including the lemma and tokens representing morphemes (09.e). This reduces data sparseness without loss of information, which should lead to optimum results. However, alignment is estimated using a variant of the EM algorithm, which may result in loss of accuracy.

Combined model

Based on the results of the above experiments (10) (11), they created a combined model using tokens for person and negation morphemes (08.b), and the modified lemma method for number and tense (08.a). The results are in (12).

Results

Experiments were with a word-to-word translation system. Explanation follows.

BLEU score difference of 00.9 is significant (p < .05).

(10) Table: BLEU scores for baseline & lemmatization
Dev Test
a. word-to-word 31.1 27.0
b. truncate (k=6) 35.3 28.3
c. lemmatize all 35.5 29.9
d.   except Pro 35.0
e.   except Pro, V, N 34.6
f.   n < 50 37.0 30.6
(11) Table: BLEU scores (dev set) for method & class of tag
Tokens Mod-Lem Vectors
a. PER 36.5 35.6 35.6
b. TEN 36.5 36.1 36.4
c. PER,TEN 35.5 36.2 35.5
d. NUM 35.4 36.7 36.1
e. CASE 35.3 34.0 33.7
f. NEG 35.7 35.6 35.3

(12) Table: BLEU scores for combined model
Dev Test
combined model 39.0 33.3

The best lemmatization results occurred with lemmatizing low-frequency words (10.f). Truncation (10.b) performed better than baseline (10.a), but worse than the best lemmatization.

Scores on other experiments (11) were reported on the dev set only. Surprisingly, none scored better than simple lemmatization.

Adding person-agreement tokens (PER) helped. Examination confirmed that they did, indeed, align with English pronouns. Inexplicably, tense-marking tokens (TEN) helped just as much, but using both PER and TEN canceled the benefits of either.

For modified lemmas, number (NUM) and tense tags helped, which makes sense since these are also marked in English. Case tags hurt, which also makes sense, since they increase data sparseness without providing any information marked in English.

Disappointingly, feature vectors never significantly outperformed morpheme tokens.

The combined model performed best of all, successfully combining the advantages of each method.

Conclusions

Young-Suk Lee (2004) addressed problems of translating Arabic, a templatic language, with English. He observes that in Arabic→English translation, one Arabic word frequently aligns with multiple English words, owing to functional affixes in Arabic, e.g.: llmEArDp"of the opposition". This one-to-many alignment poses technical challenges to PBSMT systems.

Lee's approach takes several steps:

  1. POS tag Arabic and English parallel text.
  2. Segment Arabic words into prefix(es)-stem-suffix(es) as distinct tokens.
  3. Align segmented Arabic text with English text.
  4. Determine translation probability of each Arabic stem/affix and POS into English POS.
  5. Evaluate each Arabic affix.
    1. If the affix robustly translates to an English POS, keep it as a token.
    2. If the most common translation for the affix is NULL, delete it.
    3. Else, merge the affix back into the stem.
  6. [PBSMT experiments only] Manually delete multiple definite determiners in a single Arabic phrase.

Lee evaluates the effects of his approach on BLEU score for corpora ranging from 3.5K to 3.3M words, using either IBM Model 1 (a WBSMT system) or phrase-based alignment (13).

(13) Table: BLEU-score evaluation of delete/merge (Lee 2004)
# sentences Model 1 PBSMT
base morph base morph
3.5K .10 .25 .17 .24
35K .14 .29 .24 .29
350K .18 .31 .32 .36
3.3M .18 .32 .36 .39

For Model 1, BLEU scores are vastly improved using Lee's method, roughly doubled. For PBSMT, the improvements are roughly equivalent to an order of magnitude of data.

1.3.3.9. Habash & Rambow (2005): Arabic morphological complexity

Habash & Rambow (2005) also looked at Arabic→English translation. They noted that English has 50 morphological classes, while Arabic has roughly 333,000 in theory, 2200 attested.

1.3.3.10. Isbihani et al. (2006): Segmenting Arabic

Isbihani et al. (2006) compared the impact on MT of various methods for segmenting Arabic source text. They found that the best results came from an unsupervised method using a finite-state automaton augmented with a memory of word properties.

1.3.4. Related work: Crosslingual word sense disambiguation

Crosslingual word sense disambiguation (CWSD) uses words of the target language as the sense inventory for words of the source language. For instance, if the source language word bank can translate to either banque or encaisser in the target language, then [banque, encaisser] are considered to be possible senses for bank. This nicely solves theoretical questions in WSD such as what level of granularity to use for word senses. If it leads to a different translation, it's a different sense; if it doesn't, it's not.

1.3.4.1. Vickrey et al. (2005)

Vickrey et al. (2005) describe CWSD results for data taken from the Europarl fr-en parallel corpus (Koehn 2005). Their baseline always guesses the most common sense for a given word, yielding an F-score of 51.1%. Their logistic regression model yields an F-score of 62.0%.

They also describe an experiment with filling in a missing ambiguous word in a target sentence. This was intended to prove the potential usefulness of WSD in machine translation.

1.3.5. Related work: WSD & machine translation

Attempts to use WSD as a preprocessor for MT have generally not produced encouraging results (Och 1999; Carpuat & Wu 2005; Cabezas & Resnik 2005). One explanation is that WSD artificially increases data sparseness, without focusing on senses that are relevant to translation (Yarowsky & Florian 2002). This is one motive for exploring crosslingual WSD, using target language words as the sense inventory for source language words (Resnik & Yarowsky 2000). In fact, one of the exercises in Senseval-3 was crosslingual (Mihalcea & Edmonds, ed. 2004).

Until recently, most research in WSD involved only annotated text. Some have begun looking into semi-supervised learning, augmenting a small amount of annotated text with a large amount of unannotated text, with optimistic results (Yarowsky 1995; Pham et al. 2005). Such an approach would be particularly useful when translating from a high-resource language into a low-resource language, since the amount of parallel text is likely to be small, while the amount of unannotated text in the source language is virtually without bound.

1.3.5.1. Carpuat & Wu (2005): Constrain MT using WSD

Carpuat & Wu (2005) used WSD to constrain possible translations, in contrast to our approach of providing probabilistic lexical selection recommendations. They concluded, "Even state-of-the-art WSD does not help BLEU score."

1.3.5.2. Cabezas & Resnik (2005): Improve MT using WSD

Cabezas & Resnik (2005) attempted to use crosslingual WSD to improve Spanish→English MT. They found no significant effect. They performed no lemmatization or morphological analysis; nor did they identify content words.

1.3.5.3. Pham et al. (2005)

Pham et al. (2005) used sgt_light (Joachims 2003) on tasks similar to ours, with positive results.

Pham et al. published their results just as I was beginning my research, after I had completed a literature review. I did not encounter their paper until after I had written up my first round of results. Hence the relative lack of prominence of their work in my write-up.

1.3.6. Related work: Phrase-based statistical machine translation (PBSMT)

1.3.6.1. Chiang (2005): Hiero

The University of Maryland currently boasts the state-of-the-art PBSMT system, Hiero (Chiang 2005).

Previous PBSMT systems (e.g. Koehn 2004; Och & Ney 2004) generated translation candidates at two levels, word-level and phrase-level, where a word is any contiguous sequence of characters delimited by white space (i.e. a token), and a phrase is any contiguous sequence of words. The primary innovation of Hiero is to generate translation candidates from a hierarchy of phrases of arbitrary depth (i.e. phrases made up of phrases).

One effect of PBSMT in general, and hierarchical PBSMT in particular, is to mitigate the impact of overzealous tokenization. Two "words" which should not have been separated are simply grouped together by the system into functional "phrases". Thus, we expect to benefit by using the most aggressive morphological analysis available.

Caveat: Both Pharaoh and Hiero use a configurable maximum number of "words" per "phrase". Breaking words into their component morphemes will increase the length of any given "phrase". The phrase-length limit will therefore need to be increased, which is likely to increase processing time.

Hiero, like Pharaoh before it (Koehn 2004), provides a probabilistic lexical selection mechanism. The preprocessor can specify one or more proposed translations for a word or phrase, each with a feature label (e.g. "WSD") and a specified probability or cost. Feature weights are then optimized based on a dev set.

1.3.7. Related work: Machine learning

1.3.7.1. Joachims (2003): Spectral Graph Transducer (SGT)

Joachims (2003) introduced the Spectral Graph Transducer (SGT) technology to perform semi-supervised machine learning using a small amount of annotated training data and a large amount of observed test data, so-called transductive learning. A trivial adaptation uses small amounts of test and training data and a large amount of unannotated data. Joachims points out that SGT subsumes various cotraining tasks (Yarowsky 1995; Blum & Mitchell 1998) as a special case. More specifically, SGT handles cotraining applications where the feature set for each view remains static across iterations.

In brief, SGT constructs a graph with examples as vertices and edge weights indicating similarity, and finds an approximate minimal cut between positive examples and negative examples using a spectral graph method.

Some advantages of SGT are:

1.3.8. Related work: Corpus linguistics

1.3.8.1. Babych & Hartley (2003): S-score

In a paper on named entity recognition in MT, Babych & Hartley (2003) introduced the s-score heuristic for identifying content words in a large corpus.

Todo: describe

2. Task I: Unsupervised morphological analysis

2.1.1. Task formulation

2.1.1.1. Hypothesis

These experiments were designed to test the hypothesis:

2.1.1.2. Objective

Naturally, the first step in designing a morphological analysis system is to answer the question: What is a morpheme?

We don't.

Or, rather, we adopt a utilitarian approach: A morpheme is whatever can be analyzed constructively as a morpheme for our intended purpose. If our purpose is word sense disambiguation, a morpheme is any substring of a word such that treating it as a morpheme improves the accuracy of word sense disambiguation.

As an example, take the English word implementation. An expert might say this is two morphemes: implement + -ation. Our system may infer, instead, that it consists of five morphemes: im- + ple + -ment + -a + -tion. (Note that im-, -ment, and -tion are all frequently attested.) If it turns out this analysis works well for our evaluation metric, we don't dwell unduly over the implausibility of ple as a stem.

Our approach may therefore benefit considerably from optimization based on our choice of evaluation metric(s).

2.1.1.3. Languages of interest

In selecting languages for experimentation, we wanted to cover as wide a range of morphological types as possible, including:

Listed in order of expected difficulty. Fully analytic languages are a moot point, since they have no morphology to speak of.

Disclaimers: It's been disputed whether German is truly agglutinative, as opposed to fusional (citation???). It appears to function so for our purposes. English is claimed by some to be virtually analytic (citation???).

For our initial experiments, we decided to begin with simpler languages (i.e. agglutinative and fusional), and move to more difficult languages (i.e. suppletive and templatic) later on. We were also constrained by pragmatic factors, e.g.:

We decided to begin with a selection of Europarl languages, using data available from the parallel corpus (Koehn 2005), Specifically, we examined:

By fortunate coincidence, WMT06 later used the same data source.

2.1.1.4. Preprocessing methods

Several simple preprocessing heuristics and algorithms were evaluated. These included:

Our primary algorithm was based on (Koehn & Knight 2003) as a starting point. We chose their approach on the premise that it could be equally applicable to most other languages (agglutinative, fusional, or suppletive).

A key difference between German and fusional (particularly Romance) languages quickly became apparent. A German noun compound is built from:

  1. an open class of unbound roots (nouns), each independently attested
  2. a small, closed class of bound fillers (e.g. -a-, -s)

Disclaimer: This description is based upon anecdotal observation, and may not be accurate.

In contrast, a word in a fusional language is built from:

  1. an open class of (mostly) unbound roots, most of which are independently attested
  2. a large (although probably closed) class of bound affixes
  3. a virtually endless array of morphosyntactic variations

One apparent exception to (a) in English is: ineptin- + ept (?), where in- is the frequently attested negative prefix, but ept is not attested in any other context that I am aware of. Presumably it was a word in some earlier incarnation of the language, perhaps a variation of apt.

As another example, note the apparent morphological pattern in: ascend, descend, transcend.

The end result is that morphological analysis in German is a fairly straightforward statistical analysis of observable events (i.e. known roots), while English analysis involves statistical analysis of events which are themselves hidden (i.e. proposed morphemes). This description suggests EM, which, indeed, is similar to the iterative approach we ended up using, and even more similar to other approaches we did not get to try.

2.1.1.4.1. Morphological analysis algorithm

Our algorithm is (roughly) as follows:

  1. Enumerate all words (i.e. word forms) from a corpus, with the number of occurrences of each word.
  2. Posit initial selection of morphemes.
    1. Store the counts of all attested words.
    2. Add counts for all likely affixes, taken as the first or last n characters, 1 ≤ n ≤ 3.
  3. Morphologically analyze each word in the corpus.
    1. Give the unsplit word a score based on log(count+1), modified for various configurable heuristics.
    2. Enumerate all possible binary splits.
    3. For each possible split, determine its score as the weighted mean of the scores of each half, again modified by heuristics. The weight reflects the number of morphemes posited for the substring.
    4. Choose the 1-best analysis as the hypothesized split for that word.
  4. Do it again.
    1. Use the hypothesized splits to generate new counts for each (newly) attested morpheme.
    2. Run (3) again using these new counts.
    3. Repeat until convergence, or maximum n iterations. (We used n=20.)

We experimented with various options, including:

  1. Posit initial selection of morphemes.
    1. Exclude words without any vowels.
    2. Exclude words below n characters long, 1 ≤ n ≤ 4.
    3. Use content words only, by excluding words below a configurable s-score threshold (Babych & Hartley 2003).
  2. Morphologically analyze each word in the corpus.
    1. Bias against an affix without a vowel.
    2. Bias against an affix with only one letter.
    3. Bias against novel affix (count=0).

Ideally, we should set up a procedure to optimize these configuration options (and others) based on development data and evaluation metric.

2.1.2. Evaluation

2.1.2.1. Complexity

Morphological analysis can be viewed as an exercise in the reduction of complexity without loss of information. Stand-alone methods of evaluation are designed to reflect this goal.

2.1.2.1.1. Rare events

Rare events are a bane of statistical applications, as indicated by (among others) Schrader (2006) for German, and Goldwater & McClosky (2005) for Czech. Prima facia, a statistical system has a 0% chance of processing hapax legomena correctly, since they never occur in both test and training data. While most languages are not as bad as German or Czech, Zipf's law applies. Morphological analysis can have a potentially profound impact by reducing these intractable or nearly intractable words into recognizable component morphemes.

The problem may not be as bad as it seems at first. Even if half the forms are hapax (as for German or Czech), the percentage of singleton tokens will be considerably lower, typically less than one percent of all tokens (14). On the other hand, these hapax tokens are likely to be content words, critical to the adequacy of translation (or whatever your end purpose), whereas the highest frequency forms are most likely to be function words.

In other words, rare events are far more critical to correct translation than is captured by their relative impact on BLEU score or similar automated measures.

We performed analysis similar to that in Schrader (2006) for our languages of interest, before and after breaking words into component morphemes. For the purpose of analysis, we considered "rare" any form that occurs 10 or fewer times in the corpus (14).

(14) Table: Rare events, by language, with and without morphological analysis


Total
Frequency 1 (hapax) Frequency 1-10
# Forms # Tokens # Tokens % Forms % Tokens # Tokens % Forms % Tokens
German Words 170,402 13,233,854 81,052 47.57% 0.613% 140,464 82.43% 2.377%
Morphemes 30,898 13,348,182 9,836 31.83% 0.074% 19,507 63.13% 0.385%
English Words 34,827 10,648,289 9,361 26.88%
0.088% 21,350 61.82% 0.587%
Morphemes 18,955
10,659,570 3,991 21.06%
0.038% 9,817 51.79% 0.285%
Spanish Words 77,526 12,495,786 25,772 33.24% 0.201% 54,908 70.83% 1.182%
Morphemes 32,407 12,530,435 8,102 25.00% 0.065% 18,334 56.57% 0.425%
French Words 55,985 12,036,764 16,404 29.30% 0.136% 37,045 66.17% 0.875%
Morphemes 25,880 12,058,591 5,756 22.24% 0.048% 13,524 52.26% 0.334%
Finnish Words 301,597 7,835,581 155,994 51.72% 1.990% 259,890 86.17% 7.102%
Morphemes 38,019 8,085,040 12,238 32.19% 0.151% 24,095 63.38% 0.788%

Todo: Add rows for naive methods.

Todo: Bar graphs.

As expected, the results are dramatic for German, a compounding language, where the number of distinct forms drops 82% from 170,402 to 30,898, and the number of rare tokens drops 86% from 140,464 to 19,507. The results are even more dramatic for Finnish, a morphologically rich language, where the number of distinct forms drops 87% from 301,597 to 38,019, and the number of rare tokens drops 91% from 259,890 to 24,095. The reduction is also considerable for other languages. Even English, a morphologically poor language, sees reductions of around 50%.

Note that the number of tokens necessarily goes up as words are split into component morphemes. However, the amount of increase is not as significant as one might expect. Presumably, this reflects a characteristic of language that most common words are morphologically simple (citation???). It also confirms that we are making informed decisions about splitting morphemes, doing so only where it reduces complexity.

Frankly, the increase in number of tokens seems ridiculously low. One would expect it to be at least the same magnitude as the decrease in forms, but this is not always the case.

Contrast the naive method of treating every character as a morpheme (not shown). We would predict the number of forms to drop down to the size of the alphabet, but the number of tokens to jump to the number of characters in the corpus.

Or contrast another naive method, truncating every word to n characters (also not shown). We would expect this to dramatically reduce the number of distinct forms, and virtually eliminate rare events, while keeping the number of tokens constant. This demonstrates the theoretical limitation of this evaluation metric, in that it does not account for the loss of information involved in truncation. It also provides a possible indication why truncation performs so well in many applications (citation???).

Schrader (2006) used a totally different approach, both to simplification and evaluation, making it difficult to compare results.

2.1.2.1.2. Complexity vs. information

Various metrics have been used to measure complexity vs. information in morphological analysis, e.g. saturation and informativeness (Baldwin 2005).

We did not have the opportunity to investigate these metrics.

2.1.2.2. Comparison against a standard

One can also evaluate against an established standard. Possible candidates include:

We did not have the opportunity to pursue any of these evaluation standards. We do not expect our method to outperform language-specific rule-based systems, although it would be interesting to see how closely they compare.

2.1.2.3. Downstream applications

Finally, one can evaluate the effects of morphological analysis on downstream applications. We looked at word sense disambiguation and machine translation (both described below), although results are incomplete.

3. Task II: Crosslingual word sense disambiguation

These sets of experiments endeavored to improve the quality of crosslingual word sense disambiguation (CWSD) using a variety of approaches, including:

3.1. Task II.A: WSD using unlabeled data

This set of experiments evaluated the effects on WSD results of supplementing annotated training data with unlabeled data, a semi-supervised approach known as transductive learning. The theory is that patterns in the unlabeled data can help elucidate patterns in the labeled data.

Historically, this was my first set of original experiments, in fall 2005. The results provided part of the impetus to pursue morphological analysis.

3.1.1. Task formulation

3.1.1.1. Hypothesis

These experiments were designed to test the hypothesis:

3.1.1.2. Software

For the core engine, we employed two machine learning packages, svm_light and sgt_light, both made publicly available by Thorsten Joachims.

svm_light is an implementation of support vector machines (Joachims 1999). Although svm_light includes a transductive learning option, it runs too slowly on large data sets to be of practical use.

Update: Joachims has since released a new version of svm_light which is supposed to perform much better on larger data sets, but I have not had the chance to try it out.

sgt_light is an implementation of spectral graph transducers (Joachims 2003), implementing binary (yes-no) classification only. We therefore trained a set of binary models for each word form, one model for each possible sense. We chose the best sense based on the highest-confidence model.

With sgt_light, the test data must be present when the model is built, so it is not possible to run experiments using training data only.

To align text, we used Pharaoh (Koehn 2004) wrapping GIZA++ (Och 2003).

3.1.1.3. Data

For experimental data, we used Senseval WSD evaluation sets, and also data constructed from parallel corpora.

interest & line

The initial set of experiments used the Senseval-2 data (available on Ted Pedersen's website) for performing (monolingual) WSD on the words interest and line. Each set was randomly divided into 75 test examples, 150 labeled training examples, and the remainder unlabeled examples, 2143 for interest and 3921 for line.

This was the reality-check stage. Since all the data came preprocessed from the same source, there was (presumably) no noise or comparability issue.

Spanish Lexical Sample

The next set of experiments used the (monolingual) SpanishLS data from Senseval-3. The data came preprocessed into 4195 test examples, 8430 labeled training examples, and 61252 unlabeled examples, covering a total of 46 lexical roots, ranging from 2 to 8 senses per root.

Europarl fr-en

To anticipate an MT scenario, we constructed a CWSD exercise from parallel text. We chose the Europarl fr-en corpus (Koehn 2002), since it is large enough to produce many words and examples, and we have at least a fighting chance of interpreting both the source and target language texts.

Starting with the intersected fr-en and en-fr alignments, we generated a list of all the French content words, as indicated by s-score, and their aligned English words. Any alignment that occurred fewer than 6 times was lumped into UNK. Potential words of interest were identified with at least 500 examples, between 3 and 7 senses, no more than 5% UNK alignments, and no more than 65% of examples for any one sense. A visual scan filtered out remaining candidates whose different alignments represented only affixation variants. (This predated our work on morphological analysis.) That left a total of 42 French words and their English alignments. For each of these 42 words, the examples were randomly divided into 50 test examples, 50 dev examples, 150 labeled training examples, and the remainder (at least 250) unlabeled examples.

Europarl en-fr

We again processed the parallel text from the Europarl fr-en experiment, only this time with English as the source language and French as the target language. The resulting data set contained 33 English words and their French alignments.

3.1.1.4. Feature Analysis

With each data set, we explored various methods of analyzing the data into features, optimizing for each set. Not all configuration options are described here. Comparable experiments are configured identically except as indicated.

Preprocessing

Preprocessing options included the following.

Feature Selection

Every experiment used at least local context and broad context options for feature selection. Later experiments also included other feature sets.

Following (Pham et al. 2005), we experimented with two alternatives for using multiple feature sets.

Counts

We explored different options for counting features, following (Cabezas & Resnik 2005).

Inverse Document Frequency (IDF) is essentially a hack that gives the system an extra hint as to which features are likely to be most significant. Various heuristics are possible, but they all follow the basic pattern of penalizing features that occur in examples from many different classes, and rewarding features that occur in only one or a few classes. We tried several heuristics, but do not discuss the results here.

It turned out that using a simple boolean for counts always produced the best results, but the optimal setting for IDF varied by data set.

3.1.1.5. Evaluation

F-score

We calculated the standard F-score for each experiment.

Baseline

We compared against two baselines:

The first (naive) baseline was just a sanity check, since the second was always higher, as expected. The second baseline provided the target to beat.

However, quirks of svm_light and sgt_light led to a practical problem:

We therefore could not use a single tool to get the direct head-to-head comparison that would most clearly demonstrate our hypothesis.

Filtering

Later experiments included filtering options, trading recall for precision.

The results with filtering are not presented in this paper. It generally had little effect on F-score.

3.1.2. Results: WSD using unlabeled data

F-score results are in (16), with explanations following.

(16) Table: F-score WSD results
Data available
training  + test  + unlabeled
interest baseline 50.7    
svm_light 69.3 72.0 81.3
sgt_light   76.0 85.3
sgt-co   80.0 85.3
line baseline 60.0    
svm_light 72.0 69.3 81.3
sgt_light   68.0 80.0
sgt-co   69.3 77.3
Spanish LS baseline 67.7    
svm_light 83.2 85.1
sgt_light   84.6 86.6
sgt-co   85.1 85.7
Europarl fr-en baseline 56.7