Edward Kenschaft, master's thesis based on pre-candidacy doctoral research, University of Maryland.
This page displays best under Firefox, but it should also be readable under Internet Explorer. Likewise, it makes use of color, but should be perfectly understandable in black & white.
This paper is also available in a PDF version.
My research at the University of Maryland involved three distinct but related areas within the field of computational linguistics:
We evaluated the following experimental hypotheses:
This paper summarizes all experimental results, with conclusions supporting hypotheses A, B, C and D.
Please note the partial glossary at the end of the paper.
Disclaimers: This paper is not intended for publication, but is rather a summary of my research efforts at UMD, from fall 2005 through 2006. It incorporates portions of several project reports, minimally edited, as well as descriptions of incomplete experiments. Summaries of related research efforts have generally not been updated since 2006.
Morphological analysis (or induction) is the process of breaking a word, or all the words of a given language, into component morphemes. The field has a substantial history, but efforts have typically been constrained in major ways:
The goal was an unsupervised morphological analysis (UMA) system that can be used to induce morphological structure in any language where written text is available, regardless of what else is known about the language.
Word sense disambiguation (WSD) is the process of determining which possible sense of a word is used in a given context. For instance, the English word fence has at least three senses:
Monolingual WSD is often
a poorly
defined problem, in that:
Various researchers have begun to explore crosslingual word sense disambiguation (CWSD), based on a corpus of sentences that have been translated between two languages. The possible translations of a word in the target language are taken to be its sense inventory. That is, if a word can be translated in three different ways, then these are its three "senses" – regardless of whether some other theoretical formalism would agree.
So far, research into CWSD has been constrained in several ways:
Building upon this tradition, we sought to:
Three of these approaches reduced data sparseness by introducing noisy data. Naturally, this involves trade-offs. However, results show that these approaches nonetheless helped in some cases. More research presumably could further improve noise filtering.
We attempted to evaluate UMA and CWSD in terms of their impact on machine translation (MT). However, we ran out of time before reaching the desired results.
Disclaimer: These highlights do not scratch the surface of a full literature survey on these topics.
Morphological analysis is the process of parsing a word into its component morphemes. This can be done manually or automatically, using either a rule-based or statistical system. Many of the most commonly studied languages have commercially available rule-based systems which achieve greater than 95% accuracy (citation???). However, developing such a system for a new language is time-intensive, as is annotating data for use by a supervised statistical system. We therefore focused our attention on unsupervised induction of morphological structure. Such a system could be created for a new language much faster than a rule-based or supervised system, as long as unannotated text is available.
This field has a long and rich history (e.g. Koskenniemi 1983; Kay 1987). We mention just a couple of efforts that most closely influenced our work.
Dayne Freitag (2005) offers a methodology for inferring morphosyntactic transformation rules, as opposed to morphemes per se. In a nutshell, his method involves the following steps:
Some advantages of Freitag's system:
Morphological analysis is related to a task known in the literature as (deep) lexical acquisition (DLA). Timothy Baldwin (2005) assumes a seed database, and describes a variety of methods for augmenting that database. His method for identifying morphemes is summarized as follows:
This method, I suspect, is theoretically inferior to Freitag's, since it does not share any of the advantages listed above, and arbitrarily limits the number of recognized morphemes. However, it should be significantly simpler and faster to implement.
Baldwin's non-morphological features depend on deeper semantic knowledge, which I do not presume to have.
Florian and Wicentowski (2002) used morphological analysis to improve WSD, referenced in (Wicentowski 2002).
I only discovered this paper recently (2008), and have not examined their work.
Philipp Koehn & Kevin Knight (2003) experimented with heuristics for dividing German compounds into component morphemes, by looking for known roots that occur as substrings of larger words. They evaluated the effects of their approach on word alignment (German↔English), word-based statistical machine translation (WBSMT), and phrase-based statistical machine translation (PBSMT).
Because German is agglutinative, each German word (particularly noun compound) may align with an arbitrarily long sequence of English words. This provides a significant challenge for word alignment, and for other applications that depend upon alignment, such as machine translation.
Koehn & Knight attempted to address this problem by learning rules for splitting German noun compounds into component parts. A part is either a subword or filler between subwords. They used only unannotated text for training.
Their algorithm involved two steps: generation and selection.
For generation, they looked at a new word, and listed every possible covering of that word with known words and fillers. For example, if the new word is aktionsplan, the four possible coverings are:
Their results are shown in (01).
| Splitting method | Alignment | BLEU | |||
|---|---|---|---|---|---|
| precision | recall | accuracy | WBSMT | PBSMT | |
| none | – | – | 94.2% | 29.1 | 30.5 |
| eager | 24.8% | 73.3% | 87.1% | 22.2 | 34.4 |
| frequency-based | 57.4% | 86.6% | 95.7% | 31.7 | 34.2 |
| parallel | 83.3% | 89.1% | 98.6% | 29.4 | 33.0 |
| parallel & POS | 93.8% | 90.1% | 99.1% | 30.6 | 32.6 |
Not surprisingly, the best word alignment, 99.1% accuracy, occurred when using all possible information, parallel text plus POS, up from 94.2% with unmodified text. The frequency-based metric produced only produced moderate improvements, to 95.7%. Eager splitting hurt alignment.
It is perhaps surprising that the best downstream results with MT, as measured by BLEU score, did not involve optimization with parallel text. For word-based MT, the frequency-based metric was by far the best performer, with a 2.6 and 1.6 improvement in BLEU score, respectively, over no splitting and splitting based on all information.
Even more interesting are the results with phrase-based MT. Here the best approach was dumb, eager splitting, with a whopping 3.9 BLEU improvement over no splitting (34.4 vs. 30.5). The frequency-based metric was a close second (34.2).
The success of eager splitting is attributed to the peculiar strength of PBSMT, which has no problem taking an arbitrary string of "words" and regrouping them into "phrases" as needed.
Nießen & Ney (2001a, 2004) introduced the idea of a hierarchical lexicon, where a word is represented at various levels of inflectional specificity, starting with the bare root. German↔English alignment is performed at each of these levels. The technique was adopted by various later researchers.
Nießen & Ney also employed various heuristics for syntactic restructuring, such as reordering questions and detaching verb prefixes, to make the German more like the English. Finally, they added other heuristics, such as detecting multiword phrases based on POS sequence.
Used all together, the techniques led to improvements comparable to an order of magnitude of data (02). This claim would be stronger if an additional order of magnitude were represented, e.g. from 500 to 5K.
Each column represents a different accuracy measure; lower is better.
| #sent | Method | 1-BLEU | mWER | SSER | ISER |
|---|---|---|---|---|---|
| 0 | baseline | 76.7 | 53.6 | 60.4 | 29.8 |
| restructure | 70.9 | 50.2 | 57.8 | 30.0 | |
| + all | 67.4 | 48.0 | 52.8 | 24.1 | |
| 5K | baseline | 52.6 | 38.0 | 37.3 | 17.4 |
| restructure | 47.9 | 34.7 | 33.6 | 15.2 | |
| + all | 47.1 | 33.9 | 31.8 | 13.7 | |
| 58K |
baseline | 46.3 | 34.1 | 30.2 | 14.1 |
| restructure | 43.7 | 32.5 | 26.6 | 12.8 | |
| + all | 42.9 | 31.8 | 26.3 | 11.8 |
Todo: Results with just hierarchical
lexicon, no restructuring?
Popović & Ney (2004a) built on the concept of the hierarchical lexicon (Nießen & Ney 2001a). They first parsed German words into roots and artificial inflectional morphemes, e.g. PRES for present tense. They then used a modified EM alignment algorithm to treat each of these complex "words" as a hierarchy, with alignment possible at any level. They evaluated effects on a variety of baseline systems, producing at least modest improvements in every case.
Popović et al. (2004b, 2005) used knowledge of the specific language pair to remove inflectional morphemes or function words that are not translatable, leading to a significantly reduced lexicon, and reduced alignment error rate.
I presume the alignment of function words to NULL has been handled in the mainstream alignment literature, although I haven't read up on it. Removing inflectional morphemes by hand seems (a) ad hoc, and (b) potentially detrimental to translation due to loss of information. Splitting apart but retaining all morphemes should provide the same benefits, without these objections.
Bettina Schrader (2004) also addressed German↔English alignment & translation (among other things), first looking at inflectional affixes, then noun compounding.
In experiments on three different data sets, Schrader removed inflectional affixes from the German, and aligned/translated only the lemmas. She evaluated on translation candidates generated using a bag-of-words dictionary generator (Hiemstra 1996). With all three data sets, recall went up with lemmatization. However, with the exception of one data set (Patente), precision went down. Schrader attributes this to a naive precision metric, which unduly penalizes the system for generating multiple translation candidates for the same term.
Using only the Patente data set, Schrader tried another experiment where she broke long compounds into component morphemes before alignment, e.g. Dämpfungsscheibenanordnung ('dampening disk assembly') → Dämpfung scheibe anordnung. Recall improved somewhat, while precision did not change significantly. Schrader noted that the system was severely hindered by its inability to handle alignments involving multi-word units. This suggests that results should be significantly better when evaluated according to a downstream PBSMT system.
Schrader (2006) addressed the problem of aligning German noun compounds with English word sequences. For data, she used the Europarl corpus (Koehn 2005) with both sides automatically POS tagged, and manually corrected alignments of ~100,000 German tokens to English counterparts.
Schrader was particularly interested in the problems posed by an inordinately large proportion of hapax legomena (i.e. word forms occurring only once) in the German data, as hapax legomena cannot be handled neatly by a statistical system. As shown in table (03), only 38% of English word forms are hapax, compared to 49% of German.
| Language | Tokens | Forms | Hapax | Other Rare | Frequent |
|---|---|---|---|---|---|
| English | 29,077,024 | 101,967 | 39,200 (38%) | 35,608 (35%) | 27,159 (27%) |
| German | 27,643,792 | 286,330 | 140,826 (49%) | 98,126 (34%) | 47,378 (17%) |
Of 512 German hapax legomena examined:
| English | German compound length | ||||
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | >4 | |
| 1 word | 59 | 2 | 1 | 0 | 3 |
| 2 words | 30 | 119 | 56 | 15 | 10 |
| 3 words | 2 | 20 | 15 | 8 | 10 |
| 4 words | 0 | 1 | 0 | 0 | 2 |
The English phrase length is often one more than the compound
length, due to the presence of a prepositional phrase. For example, Kongresh-vorlage 'submission to Congress'.
Observation: Table (04) only seems to
support this observation for German noun compounds of length 1 or 2.
For longer compounds especially, a more precise analysis might lead to
improved heuristics.
Instead of word-to-word mappings, Schrader mapped German compound nouns to English word sequences.
Schrader argues that German compounds should not be broken into components, since the meaning of the compound may not be compositional. For instance, the German word Personen-stand is made up of roots literally translated as 'personal status', but the meaning of the compound is 'marital status'. However, a modern PBSMT system will take this into account, translating the sequence as a non-compositional unit when possible, and as compositional components otherwise.
Goldwater & McClosky (2005) addressed the problems of translating from a morphologically rich language (Czech) to a morphologically poor language (English).
For data, they used:
Czech's morphological richness leads to high data sparseness. While the numbers of rare lemmas in Czech and English are comparable, Czech has roughly twice as many rare inflected forms as English (07).
![]() |
Furthermore, features represented morphologically in Czech are often represented by other means in English (08).
| Czech | English | |
|---|---|---|
| a. | plural past tense |
plural past tense |
| b. | person negation genitive case instrumental case |
pronouns not of by or with |
| c. | gender | none |
| d. | other case markings | syntax |
A variety of approaches were compared. These are illustrated in (09), and explained following.
| a. | Original | Pro nĕkoho by její provedení mĕlo smysl . |
| b. | Lemmas | pro nĕkdo být jeho provedení mít smysl . |
| c. | Morpheme tokens | pro nĕkdo být PER_3 jeho provedení mít PER_X smysl . |
| d. | Modified lemmas | pro nĕkdo být+PER_3 jeho provedení mít+PER_X smysl . |
| e. | Vector | (pro) (nĕkdo) (být PER_3) (jeho) (provedení) (mít PER_X) (smysl) (.) |
Replacing each word with its associated lemma reduces data sparseness, but also loses information (09.b). This approach is expected to help when morphological information is not carried over into English (08.c,d), but may hurt in other cases.
Several different approaches to lemmatization were tried.In this approach, words are replaced by token sequences, each token representing one morpheme (09.c). This reduces data sparseness, although not quite as much as lemmatization. This approach is expected to help most when the Czech morpheme corresponds to an English function word (08.b).
First each word is lemmatized, then tags representing morphemes of interest are concatenated to the lemma (09.d). This eliminates some variation among word forms, thus slightly reducing data sparseness. This approach is expected to help when morphological information is carried over into English (08.a).
Each word is replaced by a feature vector including the lemma and tokens representing morphemes (09.e). This reduces data sparseness without loss of information, which should lead to optimum results. However, alignment is estimated using a variant of the EM algorithm, which may result in loss of accuracy.
Based on the results of the above experiments (10) (11), they created a combined model using tokens for person and negation morphemes (08.b), and the modified lemma method for number and tense (08.a). The results are in (12).
Experiments were with a word-to-word translation system. Explanation follows.
BLEU score difference of 00.9 is significant (p < .05).
| Dev | Test | ||
|---|---|---|---|
| a. | word-to-word | 31.1 | 27.0 |
| b. | truncate (k=6) | 35.3 | 28.3 |
| c. | lemmatize all | 35.5 | 29.9 |
| d. | except Pro | 35.0 | – |
| e. | except Pro, V, N | 34.6 | – |
| f. | n < 50 | 37.0 | 30.6 |
| Tokens | Mod-Lem | Vectors | ||
|---|---|---|---|---|
| a. | PER | 36.5 | 35.6 | 35.6 |
| b. | TEN | 36.5 | 36.1 | 36.4 |
| c. | PER,TEN | 35.5 | 36.2 | 35.5 |
| d. | NUM | 35.4 | 36.7 | 36.1 |
| e. | CASE | 35.3 | 34.0 | 33.7 |
| f. | NEG | 35.7 | 35.6 | 35.3 |
| Dev | Test | |
|---|---|---|
| combined model | 39.0 | 33.3 |
The best lemmatization results occurred with lemmatizing low-frequency words (10.f). Truncation (10.b) performed better than baseline (10.a), but worse than the best lemmatization.
Scores on other experiments (11) were reported on the dev set only. Surprisingly, none scored better than simple lemmatization.
Adding person-agreement tokens (PER) helped. Examination confirmed that they did, indeed, align with English pronouns. Inexplicably, tense-marking tokens (TEN) helped just as much, but using both PER and TEN canceled the benefits of either.
For modified lemmas, number (NUM) and tense tags helped, which makes sense since these are also marked in English. Case tags hurt, which also makes sense, since they increase data sparseness without providing any information marked in English.
Disappointingly, feature vectors never significantly outperformed morpheme tokens.
The combined model performed best of all, successfully combining the advantages of each method.
Young-Suk Lee (2004) addressed problems of translating Arabic, a templatic language, with English. He observes that in Arabic→English translation, one Arabic word frequently aligns with multiple English words, owing to functional affixes in Arabic, e.g.: llmEArDp ↔ "of the opposition". This one-to-many alignment poses technical challenges to PBSMT systems.
Lee's approach takes several steps:
Lee evaluates the effects of his approach on BLEU score for corpora ranging from 3.5K to 3.3M words, using either IBM Model 1 (a WBSMT system) or phrase-based alignment (13).
| # sentences | Model 1 | PBSMT | ||
|---|---|---|---|---|
| base | morph | base | morph | |
| 3.5K | .10 | .25 | .17 | .24 |
| 35K | .14 | .29 | .24 | .29 |
| 350K | .18 | .31 | .32 | .36 |
| 3.3M | .18 | .32 | .36 | .39 |
For Model 1, BLEU scores are vastly improved using Lee's method, roughly doubled. For PBSMT, the improvements are roughly equivalent to an order of magnitude of data.
Habash & Rambow (2005) also looked at Arabic→English translation. They noted that English has 50 morphological classes, while Arabic has roughly 333,000 in theory, 2200 attested.
Isbihani et al. (2006) compared the impact on MT of various methods for segmenting Arabic source text. They found that the best results came from an unsupervised method using a finite-state automaton augmented with a memory of word properties.
Crosslingual word sense disambiguation (CWSD) uses words of the target language as the sense inventory for words of the source language. For instance, if the source language word bank can translate to either banque or encaisser in the target language, then [banque, encaisser] are considered to be possible senses for bank. This nicely solves theoretical questions in WSD such as what level of granularity to use for word senses. If it leads to a different translation, it's a different sense; if it doesn't, it's not.
Vickrey et al. (2005) describe CWSD results for data taken from the Europarl fr-en parallel corpus (Koehn 2005). Their baseline always guesses the most common sense for a given word, yielding an F-score of 51.1%. Their logistic regression model yields an F-score of 62.0%.
They also describe an experiment with filling in a missing ambiguous word in a target sentence. This was intended to prove the potential usefulness of WSD in machine translation.
Attempts to use WSD as a preprocessor for MT have generally not produced encouraging results (Och 1999; Carpuat & Wu 2005; Cabezas & Resnik 2005). One explanation is that WSD artificially increases data sparseness, without focusing on senses that are relevant to translation (Yarowsky & Florian 2002). This is one motive for exploring crosslingual WSD, using target language words as the sense inventory for source language words (Resnik & Yarowsky 2000). In fact, one of the exercises in Senseval-3 was crosslingual (Mihalcea & Edmonds, ed. 2004).
Until recently, most research in WSD involved only annotated text. Some have begun looking into semi-supervised learning, augmenting a small amount of annotated text with a large amount of unannotated text, with optimistic results (Yarowsky 1995; Pham et al. 2005). Such an approach would be particularly useful when translating from a high-resource language into a low-resource language, since the amount of parallel text is likely to be small, while the amount of unannotated text in the source language is virtually without bound.
Carpuat & Wu (2005) used WSD to constrain possible translations, in contrast to our approach of providing probabilistic lexical selection recommendations. They concluded, "Even state-of-the-art WSD does not help BLEU score."
Cabezas & Resnik (2005) attempted to use crosslingual WSD to improve Spanish→English MT. They found no significant effect. They performed no lemmatization or morphological analysis; nor did they identify content words.
Pham et al. (2005) used sgt_light (Joachims 2003) on tasks similar to ours, with positive results.
Pham et al. published their results just as I was beginning my research, after I had completed a literature review. I did not encounter their paper until after I had written up my first round of results. Hence the relative lack of prominence of their work in my write-up.
The University of Maryland currently boasts the state-of-the-art PBSMT system, Hiero (Chiang 2005).
Previous PBSMT systems (e.g. Koehn 2004; Och & Ney 2004) generated translation candidates at two levels, word-level and phrase-level, where a word is any contiguous sequence of characters delimited by white space (i.e. a token), and a phrase is any contiguous sequence of words. The primary innovation of Hiero is to generate translation candidates from a hierarchy of phrases of arbitrary depth (i.e. phrases made up of phrases).
One effect of PBSMT in general, and hierarchical PBSMT in particular, is to mitigate the impact of overzealous tokenization. Two "words" which should not have been separated are simply grouped together by the system into functional "phrases". Thus, we expect to benefit by using the most aggressive morphological analysis available.
Caveat: Both Pharaoh and Hiero use a configurable maximum number of "words" per "phrase". Breaking words into their component morphemes will increase the length of any given "phrase". The phrase-length limit will therefore need to be increased, which is likely to increase processing time.
Hiero, like Pharaoh before it (Koehn 2004), provides a probabilistic lexical selection mechanism. The preprocessor can specify one or more proposed translations for a word or phrase, each with a feature label (e.g. "WSD") and a specified probability or cost. Feature weights are then optimized based on a dev set.
Joachims (2003) introduced the Spectral Graph Transducer (SGT) technology to perform semi-supervised machine learning using a small amount of annotated training data and a large amount of observed test data, so-called transductive learning. A trivial adaptation uses small amounts of test and training data and a large amount of unannotated data. Joachims points out that SGT subsumes various cotraining tasks (Yarowsky 1995; Blum & Mitchell 1998) as a special case. More specifically, SGT handles cotraining applications where the feature set for each view remains static across iterations.
In brief, SGT constructs a graph with examples as vertices and edge weights indicating similarity, and finds an approximate minimal cut between positive examples and negative examples using a spectral graph method.
Some advantages of SGT are:
In a paper on named entity recognition in MT, Babych & Hartley (2003) introduced the s-score heuristic for identifying content words in a large corpus.
Todo: describe
These experiments were designed to test the hypothesis:
Naturally, the first step in designing a morphological analysis system is to answer the question: What is a morpheme?
We don't.
Or, rather, we adopt a utilitarian approach: A morpheme is whatever can be analyzed constructively as a morpheme for our intended purpose. If our purpose is word sense disambiguation, a morpheme is any substring of a word such that treating it as a morpheme improves the accuracy of word sense disambiguation.
As an example, take the English word implementation. An expert might say this is two morphemes: implement + -ation. Our system may infer, instead, that it consists of five morphemes: im- + ple + -ment + -a + -tion. (Note that im-, -ment, and -tion are all frequently attested.) If it turns out this analysis works well for our evaluation metric, we don't dwell unduly over the implausibility of ple as a stem.
Our approach may therefore benefit considerably from optimization based on our choice of evaluation metric(s).
In selecting languages for experimentation, we wanted to cover
as wide a range of morphological
types as possible, including:
Listed in order of expected difficulty. Fully analytic languages are a moot point, since they have no morphology to speak of.
Disclaimers: It's been disputed whether German is truly agglutinative, as opposed to fusional (citation???). It appears to function so for our purposes. English is claimed by some to be virtually analytic (citation???).
For our initial experiments, we decided to begin with simpler languages (i.e. agglutinative and fusional), and move to more difficult languages (i.e. suppletive and templatic) later on. We were also constrained by pragmatic factors, e.g.:
We decided to begin with a selection of Europarl languages, using data available from the parallel corpus (Koehn 2005), Specifically, we examined:
By fortunate coincidence, WMT06 later used the same data source.
Several simple preprocessing heuristics and algorithms were evaluated. These included:
Our primary algorithm was based on (Koehn & Knight 2003) as a starting point. We chose their approach on the premise that it could be equally applicable to most other languages (agglutinative, fusional, or suppletive).
A key difference between German and fusional (particularly Romance) languages quickly became apparent. A German noun compound is built from:
Disclaimer: This description is based upon anecdotal observation, and may not be accurate.
In contrast, a word in a fusional language is built from:
One apparent exception to (a) in English is: inept ↔ in- + ept (?), where in- is the frequently attested negative prefix, but ept is not attested in any other context that I am aware of. Presumably it was a word in some earlier incarnation of the language, perhaps a variation of apt.
As another example, note the apparent morphological pattern in: ascend, descend, transcend.
The end result is that morphological analysis in German is a fairly straightforward statistical analysis of observable events (i.e. known roots), while English analysis involves statistical analysis of events which are themselves hidden (i.e. proposed morphemes). This description suggests EM, which, indeed, is similar to the iterative approach we ended up using, and even more similar to other approaches we did not get to try.
Our algorithm is (roughly) as follows:
We experimented with various options, including:
Ideally, we should set up a procedure to optimize these configuration options (and others) based on development data and evaluation metric.
Morphological analysis can be viewed as an exercise in the reduction of complexity without loss of information. Stand-alone methods of evaluation are designed to reflect this goal.
Rare events are a bane of statistical applications, as indicated by (among others) Schrader (2006) for German, and Goldwater & McClosky (2005) for Czech. Prima facia, a statistical system has a 0% chance of processing hapax legomena correctly, since they never occur in both test and training data. While most languages are not as bad as German or Czech, Zipf's law applies. Morphological analysis can have a potentially profound impact by reducing these intractable or nearly intractable words into recognizable component morphemes.
The problem may not be as bad as it seems at first. Even if half the forms are hapax (as for German or Czech), the percentage of singleton tokens will be considerably lower, typically less than one percent of all tokens (14). On the other hand, these hapax tokens are likely to be content words, critical to the adequacy of translation (or whatever your end purpose), whereas the highest frequency forms are most likely to be function words.
In other words, rare events are far more critical to correct translation than is captured by their relative impact on BLEU score or similar automated measures.
We performed analysis similar to that in Schrader (2006) for our languages of interest, before and after breaking words into component morphemes. For the purpose of analysis, we considered "rare" any form that occurs 10 or fewer times in the corpus (14).
| Total |
Frequency 1 (hapax) | Frequency 1-10 | |||||||
|---|---|---|---|---|---|---|---|---|---|
| # Forms | # Tokens | # Tokens | % Forms | % Tokens | # Tokens | % Forms | % Tokens | ||
| German | Words | 170,402 | 13,233,854 | 81,052 | 47.57% | 0.613% | 140,464 | 82.43% | 2.377% |
| Morphemes | 30,898 | 13,348,182 | 9,836 | 31.83% | 0.074% | 19,507 | 63.13% | 0.385% | |
| English | Words | 34,827 | 10,648,289 | 9,361 | 26.88% |
0.088% | 21,350 | 61.82% | 0.587% |
| Morphemes | 18,955 |
10,659,570 | 3,991 | 21.06% |
0.038% | 9,817 | 51.79% | 0.285% | |
| Spanish | Words | 77,526 | 12,495,786 | 25,772 | 33.24% | 0.201% | 54,908 | 70.83% | 1.182% |
| Morphemes | 32,407 | 12,530,435 | 8,102 | 25.00% | 0.065% | 18,334 | 56.57% | 0.425% | |
| French | Words | 55,985 | 12,036,764 | 16,404 | 29.30% | 0.136% | 37,045 | 66.17% | 0.875% |
| Morphemes | 25,880 | 12,058,591 | 5,756 | 22.24% | 0.048% | 13,524 | 52.26% | 0.334% | |
| Finnish | Words | 301,597 | 7,835,581 | 155,994 | 51.72% | 1.990% | 259,890 | 86.17% | 7.102% |
| Morphemes | 38,019 | 8,085,040 | 12,238 | 32.19% | 0.151% | 24,095 | 63.38% | 0.788% | |
Todo: Add rows for naive methods.
Todo: Bar graphs.
As expected, the results are dramatic for German, a compounding language, where the number of distinct forms drops 82% from 170,402 to 30,898, and the number of rare tokens drops 86% from 140,464 to 19,507. The results are even more dramatic for Finnish, a morphologically rich language, where the number of distinct forms drops 87% from 301,597 to 38,019, and the number of rare tokens drops 91% from 259,890 to 24,095. The reduction is also considerable for other languages. Even English, a morphologically poor language, sees reductions of around 50%.
Note that the number of tokens necessarily goes up as words are split into component morphemes. However, the amount of increase is not as significant as one might expect. Presumably, this reflects a characteristic of language that most common words are morphologically simple (citation???). It also confirms that we are making informed decisions about splitting morphemes, doing so only where it reduces complexity.
Frankly, the increase in number of tokens seems ridiculously low. One would expect it to be at least the same magnitude as the decrease in forms, but this is not always the case.
Contrast the naive method of treating every character as a morpheme (not shown). We would predict the number of forms to drop down to the size of the alphabet, but the number of tokens to jump to the number of characters in the corpus.
Or contrast another naive method, truncating every word to n characters (also not shown). We would expect this to dramatically reduce the number of distinct forms, and virtually eliminate rare events, while keeping the number of tokens constant. This demonstrates the theoretical limitation of this evaluation metric, in that it does not account for the loss of information involved in truncation. It also provides a possible indication why truncation performs so well in many applications (citation???).
Schrader (2006) used a totally different approach, both to simplification and evaluation, making it difficult to compare results.
Various metrics have been used to measure complexity vs. information in morphological analysis, e.g. saturation and informativeness (Baldwin 2005).
We did not have the opportunity to investigate these metrics.
One can also evaluate against an established standard. Possible candidates include:
We did not have the opportunity to pursue any of these evaluation standards. We do not expect our method to outperform language-specific rule-based systems, although it would be interesting to see how closely they compare.
Finally, one can evaluate the effects of morphological analysis on downstream applications. We looked at word sense disambiguation and machine translation (both described below), although results are incomplete.
These sets of experiments endeavored to improve the quality of crosslingual word sense disambiguation (CWSD) using a variety of approaches, including:
This set of experiments evaluated the effects on WSD results of supplementing annotated training data with unlabeled data, a semi-supervised approach known as transductive learning. The theory is that patterns in the unlabeled data can help elucidate patterns in the labeled data.
Historically, this was my first set of original experiments, in fall 2005. The results provided part of the impetus to pursue morphological analysis.
These experiments were designed to test the hypothesis:
For the core engine, we employed two machine learning packages, svm_light and sgt_light, both made publicly available by Thorsten Joachims.
svm_light is an implementation of support vector machines (Joachims 1999). Although svm_light includes a transductive learning option, it runs too slowly on large data sets to be of practical use.
Update: Joachims has since released a new version of svm_light which is supposed to perform much better on larger data sets, but I have not had the chance to try it out.
sgt_light is an implementation of spectral graph transducers (Joachims 2003), implementing binary (yes-no) classification only. We therefore trained a set of binary models for each word form, one model for each possible sense. We chose the best sense based on the highest-confidence model.
With sgt_light, the test data must be present when the model is built, so it is not possible to run experiments using training data only.
To align text, we used Pharaoh (Koehn 2004) wrapping GIZA++ (Och 2003).
For experimental data, we used Senseval WSD evaluation sets, and also data constructed from parallel corpora.
The initial set of experiments used the Senseval-2 data (available on Ted Pedersen's website) for performing (monolingual) WSD on the words interest and line. Each set was randomly divided into 75 test examples, 150 labeled training examples, and the remainder unlabeled examples, 2143 for interest and 3921 for line.
This was the reality-check stage. Since all the data came preprocessed from the same source, there was (presumably) no noise or comparability issue.
The next set of experiments used the (monolingual) SpanishLS data from Senseval-3. The data came preprocessed into 4195 test examples, 8430 labeled training examples, and 61252 unlabeled examples, covering a total of 46 lexical roots, ranging from 2 to 8 senses per root.
To anticipate an MT scenario, we constructed a CWSD exercise from parallel text. We chose the Europarl fr-en corpus (Koehn 2002), since it is large enough to produce many words and examples, and we have at least a fighting chance of interpreting both the source and target language texts.
Starting with the intersected fr-en and en-fr alignments, we generated a list of all the French content words, as indicated by s-score, and their aligned English words. Any alignment that occurred fewer than 6 times was lumped into UNK. Potential words of interest were identified with at least 500 examples, between 3 and 7 senses, no more than 5% UNK alignments, and no more than 65% of examples for any one sense. A visual scan filtered out remaining candidates whose different alignments represented only affixation variants. (This predated our work on morphological analysis.) That left a total of 42 French words and their English alignments. For each of these 42 words, the examples were randomly divided into 50 test examples, 50 dev examples, 150 labeled training examples, and the remainder (at least 250) unlabeled examples.
We again processed the parallel text from the Europarl fr-en experiment, only this time with English as the source language and French as the target language. The resulting data set contained 33 English words and their French alignments.
With each data set, we explored various methods of analyzing the data into features, optimizing for each set. Not all configuration options are described here. Comparable experiments are configured identically except as indicated.
Preprocessing options included the following.
Every experiment used at least local context and broad context options for feature selection. Later experiments also included other feature sets.
Following (Pham et al. 2005), we experimented with two alternatives for using multiple feature sets.
We explored different options for counting features, following (Cabezas & Resnik 2005).
Inverse Document Frequency (IDF) is essentially a hack that gives the system an extra hint as to which features are likely to be most significant. Various heuristics are possible, but they all follow the basic pattern of penalizing features that occur in examples from many different classes, and rewarding features that occur in only one or a few classes. We tried several heuristics, but do not discuss the results here.
It turned out that using a simple boolean for counts always produced the best results, but the optimal setting for IDF varied by data set.
We calculated the standard F-score for each experiment.
We compared against two baselines:
The first (naive) baseline was just a sanity check, since the second was always higher, as expected. The second baseline provided the target to beat.
However, quirks of svm_light and sgt_light led to a practical problem:
We therefore could not use a single tool to get the direct head-to-head comparison that would most clearly demonstrate our hypothesis.
Later experiments included filtering options, trading recall for precision.
The results with filtering are not presented in this paper. It generally had little effect on F-score.
F-score results are in (16), with explanations following.
| Data available | ||||
|---|---|---|---|---|
| training | + test | + unlabeled | ||
| interest | baseline | 50.7 | ||
| svm_light | 69.3 | 72.0 | 81.3 | |
| sgt_light | 76.0 | 85.3 | ||
| sgt-co | 80.0 | 85.3 | ||
| line | baseline | 60.0 | ||
| svm_light | 72.0 | 69.3 | 81.3 | |
| sgt_light | 68.0 | 80.0 | ||
| sgt-co | 69.3 | 77.3 | ||
| Spanish LS | baseline | 67.7 | ||
| svm_light | 83.2 | 85.1 | – | |
| sgt_light | 84.6 | 86.6 | ||
| sgt-co | 85.1 | 85.7 | ||
| Europarl fr-en | baseline | 56.7 | ||
| svm_light | 70.7 | 66.0 | – | |
| sgt_light | 65.2 | 68.9 | ||
| sgt-co | 65.5 | 69.4 | ||
| Europarl en-fr | baseline | 51.3 | ||
| svm_light | 64.7 | 61.1 | – | |
| sgt_light | 60.4 | 62.8 | ||
| sgt-co | 61.2 | 63.5 | ||
The columns are:
The rows are:
Empty cells denote impossible experiments. Cells with a dash '–' denote theoretically possible experiments that were beyond the capability of the software.
The best-scoring experiment for each data set is highlighted. Experiments that indicate locally negative results (i.e. lower score than the cell to its immediate left) are in red bold. Experiments that indicate locally positive results (i.e. higher score than the cell to its immediate left) are in blue bold italic.
I have not determined statistical significance.
Providing additional unlabeled data produced considerably better results than train & test data alone, thus supporting our experimental hypothesis. One surprising result was that svm_light's accuracy on line data (and, later, the Europarl data sets) actually degraded when the test examples were made available.
My best guess is that the explanation lies in the implementation of the svm_light software. Including unlabeled data (e.g. test data) invokes the transductive learning option, which turns off some optimization routines. I suspect that these routines not only speed up performance, but actually enhance the accuracy. However, this hypothesis is not supported by the results reported in (Joachims 2003). Note that adding the rest of the unlabeled data did, in fact, improve the performance of svm_light over training data alone.
It made little difference whether local and broad context were combined into one model, or used to generate two separate models which were then cotrained. This contrasts with the findings in (Pham et al. 2005), which showed better results with cotrained models.
Our results for train & test data are comparable with those reported in (Pham et al. 2005), while our results with unlabeled data are much higher. Although unclear from their paper, it appears that (Pham et al. 2005) did not make use of unlabeled data other than the test data.
To the best of my knowledge, our results were higher than any reported in any research literature.
The svm_light baseline was much higher than with interest & line. Adding unlabeled data marginally improved the results. The results are consistent with our experimental hypothesis, but do not strongly support it. I have not determined statistical significance.
Note that our system outperformed the best system entered in Senseval-3 (Marquez et al. 2004). However, this is not a fair comparison, since we optimized directly on the test data. (This being my first original research effort, I made a beginner's mistake, optimizing on test data instead of held-out development data.) It would be interesting to run the experiments again with proper controls.
With both Europarl fr-en and en-fr, unlabeled data noticeably improved accuracy over just train & test data. However, even with all unlabeled data, the accuracy of sgt_light was less than the accuracy of the svm_light baseline. This highlights the practical problem of not being able to run a whole test suite using a single tool.
It should now be possible to run the entire set of experiments using just the latest version of svm_light, allowing direct comparison of results.
Error analysis suggested that one reason we didn't see stronger results was data sparseness introduced by morphological complexity. For instance, the English word student is considered to have four senses [étudiant, étudiante, étudiants, étudiantes], even though a human observer can easily see there is only one. A morphologically rich target language creates data sparseness by exploding the sense inventory, as in this example. A morphologically rich source language creates data sparseness by exploding the lexicon, e.g. storing [étudiant, étudiante, étudiants, étudiantes] all as distinct lexical entries.
A traditional statistical approach would address the problem by throwing more data at it. This might be feasible with resource-rich languages such as French and English, but will not be viable when working with less studied languages. Rather, this suggests the need for morphological analysis, at least on the side of the morphologically rich language. See: WSD using morphological analysis.
Yarowsky (1995) demonstrated that document-level information, even comparatively primitive, can significantly enhance the quality of unsupervised WSD, to levels comparable with supervised methods. However, data provided for Senseval (now Semeval) conferences have always involved individual sentences only. New data sets need to be constructed, including document-level context, to allow experimentation with document-level features.
These experiments were designed to test the hypothesis:
Incomplete. See: Future work related to WSD with document-level features.
These experiments were designed to test the hypothesis:
Early CWSD experiments involving English and French were hampered by data sparseness, caused, in large part, by morphological complexity.
For these experiments, data sets were constructed from the Europarl parallel corpus, as described above under Task II.A. Languages of interest are as described under Task I.
As a proof of concept, we first ran experiments using naive morphological preprocessing, such as is typical in other applications. Since our focus was source language morphology, we used our morphologically simplest language (English) as the target language. We specifically used:
Results are shown in (17), sorted by language in decreasing order of F-score. Blue is the baseline.
I have not determined statistical significance.
| Languages | Process | F-score | Forms | Senses | Examples | S/F | E/F | E/S |
|---|---|---|---|---|---|---|---|---|
| de-en | trunc6 | 57.5 | 99 | 1505 | 127071 | 15.20 | 1283.55 | 84.43 |
| trunc5 | 57.4 | 96 | 1961 | 155221 | 20.43 | 1616.89 | 79.15 | |
| diac | 55.9 | 101 | 598 | 45048 | 5.92 | 446.02 | 75.33 | |
| none | 55.8 | 101 | 587 | 44330 | 5.81 | 438.91 | 75.52 | |
| lc | 55.8 | 101 | 587 | 44330 | 5.81 | 438.91 | 75.52 | |
| es-en | trunc6 | 61.4 | 99 | 1006 | 101280 | 10.16 | 1023.03 | 100.68 |
| trunc5 | 60.8 | 98 | 1490 | 136166 | 15.20 | 1389.45 | 91.39 | |
| diac | 60.4 | 104 | 618 | 50637 | 5.94 | 486.89 | 81.94 | |
| none | 59.8 | 104 | 610 | 46811 | 5.87 | 450.11 | 76.74 | |
| lc | 59.8 | 104 | 610 | 46811 | 5.87 | 450.11 | 76.74 | |
| strip | 57.5 | 97 | 825 | 93046 | 8.51 | 959.24 | 112.78 | |
| fr-en | trunc6 | 59.8 | 75 | 930 | 79461 | 12.40 | 1059.48 | 85.44 |
| trunc5 | 59.3 | 74 | 1338 | 111140 | 18.08 | 1501.89 | 83.06 | |
| diac | 58.6 | 80 | 485 | 36139 | 6.06 | 451.74 | 74.51 | |
| none | 58.4 | 80 | 478 | 35857 | 5.98 | 448.21 | 75.01 | |
| lc | 58.3 | 80 | 478 | 35857 | 5.98 | 448.21 | 75.01 | |
| strip | 56.8 | 77 | 650 | 66794 | 8.44 | 867.45 | 102.76 | |
| fi-en | trunc6 | 56.0 | 60 | 522 | 37618 | 8.70 | 626.97 | 72.07 |
| trunc5 | 55.7 | 58 | 755 | 55885 | 13.02 | 963.53 | 74.02 | |
| none | 53.9 | 63 | 347 | 25411 | 5.51 | 403.35 | 73.23 | |
| lc | 53.9 | 63 | 347 | 25411 | 5.51 | 403.35 | 73.23 | |
| diac | 53.8 | 63 | 347 | 25411 | 5.51 | 403.35 | 73.23 |
Todo: bar graphs
The test set is consistent across all experiments in a given language, but the training set intentionally is not.
For example, aktionsplan might occur 500 times in the text. After removing 50 dev examples, 50 test examples, and 150 training examples, that leaves 250 unlabeled examples. Now let's say aktions occurs 2500 times in the text. When we run the trunc6 experiment, all training and unlabeled examples are grouped together for both words, along with all other words beginning aktion-. The example data set for the truncated form of aktionsplan is therefore considerably higher than for the original form. This also introduces considerable noise, as aktionsplan might now be mapped to any word which is a translation for any word beginning aktion-.
Certain observations stand out.
A next step would be to run CWSD on the Europarl data sets in (17).
As noted above, Vickrey et al. (2005) set up a Europarl fr-en CWSD experiment in which they achieved 62.0% F-score, compared to a 51.1% baseline (18). They graciously sent us their data so we could compare results. However, a direct comparison is uninformative, since they used a language-specific POS tagger, and we did not.
It would have been quite informative if we had beat them anyway, but we didn't. We could rerun the experiment using a POS tagger to get a fair comparison.
Instead, we used this set of experiments to illustrate the usefulness of another technique: classifier combination.
We ran experiments on a dev set using the following algorithms:
Then, for each word, we chose whichever classifier had the highest precision on the dev set, and ran that on the test set. As expected, this 1-best system outperformed any of the individual classifiers (18).
| Precision | # Best | |
|---|---|---|
| Vickrey baseline | 52.6 | |
| Vickrey best | 62.0 | |
| baseline | 52.9 | 595 |
| unmod | 55.6 | 540 |
| trunc5 | 50.0 | 214 |
| trunc6 | 51.5 | 228 |
| morph | 54.7 | 294 |
| 1-best | 57.7 |
The # Best column shows the number of base words for which each individual system proved to be the most precise. Some observations are striking.
Eyeballing a word-by-word breakdown of results (not shown) reveals that ultra-high-frequency words tend to get the best results with either the baseline or unmodified WSD. This is not surprising, given that morphological analysis is expected to help most with sparse data. It also explains the high overall scores of the simple systems.
Beyond this, it is not immediately clear how results are correlated. It's not even clear that the words where the morphological system performs best are those with the most morphological variation. For instance, love is one of these words, even though no other morphologically similar words occur in the corpus.
Other methods of classifier combination are also possible, of course, and should be examined. See: Other future work related to WSD.
These experiments were designed to test the hypothesis:
Previous related work suggests strong potential for using morphological analysis to improve the quality of machine translation. Some observations:
Most previous research involved techniques highly
customized for a specific language pair. We planned to
demonstrate utility across several language pairs,
representing various language
types.
Incomplete. See: Future work related to MT and morphological analysis.
These experiments were designed to test the hypothesis:
Previous related work has been less than encouraging when it comes to using CWSD to improve downstream machine translation. However, we suspect this reflects limitations in the approaches tried, rather than in the concept, itself. This suspicion was confirmed by (Pham et. al. 2005).
For CWSD to have a significant effect, it would presumably need to consider factors not already taken into account by the MT system. In our case, this includes:
The University of Maryland owns the state-of-the-art MT system, Hiero. This would be the obvious choice of platform for our experiments. In particular, Hiero offers an option for specifying translation candidates as features, which are then factored into the translation decisions. We planned to prime lexical selection by providing the results of CWSD through these features.
We planned to compare results using:
Training and test data are those used for the NAACL 2006 Workshop Shared Task: Exploiting Parallel Texts for Statistical Machine Translation (WMT06) (Koehn & Monz 2006). (19) lists the WMT06 results (test data only) for comparison, including:
Columns indicate results on the following data sets:
| Fr-En | Es-En | De-En | En-Fr | En-Es | En-De | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dev | In | Out | Dev | In | Out | Dev | In | Out | Dev | In | Out | Dev | In | Out | Dev | In | Out | ||
| WMT06 participants |
Highest | 31.94 | 22.50 | 32.37 | 28.35 | 27.30 | 18.87 | 33.66 | 25.26 | 31.85 | 27.76 | 18.85 | 11.82 | ||||||
| Lowest | 21.44 | 19.42 | 23.91 | 19.17 | 15.86 | 11.78 | 25.07 | 21.44 | 23.17 | 16.83 | 9.84 | 6.55 | |||||||
| Mean, all | 29.33 | 20.70 | 29.99 | 25.77 | 23.75 | 16.47 | 31.04 | 23.45 | 29.36 | 24.47 | 16.62 | 10.43 | |||||||
| Mean, pack | 30.42 | 21.23 | 30.65 | 26.51 | 25.10 | 17.07 | 31.74 | 23.74 | 30.10 | 25.39 | 18.87 | 10.90 | |||||||
| Baseline | Hiero/base | 25.77 | 25.67 | 24.25 | 21.01 | 20.94 | 15.69 | ||||||||||||
| Hiero/base (sub) | 27.14 | 27.08 | 24.14 | ||||||||||||||||
| Truncation | Hiero/trunc5 | ||||||||||||||||||
| Hiero/trunc6 | |||||||||||||||||||
| Morphology | Hiero/morph | ||||||||||||||||||
| WSD | Hiero/wsd | ||||||||||||||||||
| Hiero/morph/wsd |
|||||||||||||||||||
I only had a few months to work on these experiments, and It turned out our installation of Hiero first needed to be scaled up. By the time I finished that and ran a few baseline experiments, my time had run out. It was perhaps interesting to note that the Hiero baseline performed somewhat below the average of WMT06 participants.
Our experimental results support several conclusions.
There are a plethora of related research directions to pursue.
Definitions are intended for usefulness to an educated audience, not scientific precision.
1-best: When returning results, return only the top-scoring candidate. This is in contrast to n-best, where several high-scoring candidates might be returned, each with a probability (or other score).
accuracy: Usually used in its generic sense, equivalent to correctness; when indicated, a specific measure of a system's correctness. Contrast: precision, recall, F-score.
affix: A bound morpheme, i.e. one that cannot stand alone as a word, but must be attached to a root. Affixes are often differentiated between inflectional and derivational types.
agglutinative: A language type (see below).
alignment: The process of examining texts which are purported translations of each other, and determining which subparts are translations of each other. For instance, if the system is given aligned sentences, its goal is to identify aligned words. Word alignment is a subprocess of many machine translation algorithms.
analytic: A language type (see below).
annotated (or labeled): Data annotated with correct "answers" for whatever task is at hand. For instance, if the task is word sense disambiguation, each word in the annotated data comes labeled with its correct sense.
attested: Of data, occurring in an observed corpus. For instance, the word "reference" is attested in this document, as it appears in the following paragraph. The word "flabbergasted" is not attested in this document (or wasn't, until I wrote this sentence).
bleu: A popular metric for evaluating the accuracy of machine translation systems, by comparing the system's output with a set of "correct" reference translations. A higher BLEU score is presumably better. Scores are not comparable between test sets, so BLEU is only useful when comparing different systems using the same test set.
closed class: Refers to a class of objects (typically a part-of-speech) where the entire contents of the class are known and unlikely to change. Examples: English prepositions, articles, auxiliary verbs. Contrast: open class.
compounding: The process of combing multiple roots of the same part-of-speech into a single word. The meaning of the resulting word may have a meaning that is not determined directly from the meanings of its components. For instance, English manhole.
confidence: In a statistical system, confidence is the system's own estimate of the reliability of its results.
content word: A word that carries significant semantic meaning, such that any translation would have to include that meaning in order to be considered accurate. These are generally open class, such as English nouns and verbs. Contrast: function word.
corpus (plural corpora): A (usually large) body of data; in our case, written text.
cotraining: Loosely speaking, running two independent analyses of the same data, where the output of each analysis informs the other (Yarowsky 1995; Blum & Mitchell 1998).
crosslingual word sense disambiguation (cwsd): WSD in which a word's senses are taken to be its possible translations in another language.
cwsd: See: crosslingual word sense disambiguation.
derivational affix: An affix that is not entirely productive, may produce idiosyncratic changes in meaning, and may change the part of speech. For instance, the -er affix in English is derivational. The derived meanings of farm→farmer and ministry→minister are not predictable. Discarding a derivational affix almost always results in significant loss or change of meaning, unlike an inflectional affix.
development (dev) data: Annotated data set aside for optimizing a system's parameters. This is still considered a "fair" system, which would not be the case if the system were optimized directly on the test data.
diacritic: A symbol that appears over the top of a letter (or occasionally beneath it), such as an accent or umlaut.
em: See: expectation-maximization.
europarl corpus: A large parallel corpus covering 11 European languages, drawn from European Union parliamentary proceedings (Koehn 2005).
expectation-maximization (em) algorithm: A popular machine learning algorithm that iteratively uses current guesses about unknown values to make new guesses for the next iteration (Demster et al. 1977).
f-score: A common measure of a system's correctness, calculated as the harmonic mean of precision and recall.
form: A word form.
function word: A word that does not carry significant meaning, instead reflecting grammatical or syntactic characteristics. These are generally closed class, such as English prepositions. Contrast: content word.
fusional: A language type (see below).
hapax legomenon: A word form that occurs precisely once in a corpus.
inflectional affix: An affix that is highly productive and regular in its use. They generally add only grammatical information, and rarely change the part of speech. For instance, the -s affix in English is inflectional. An inflectional affix can often be discarded without significant loss of meaning, unlike a derivational affix.
irregular: See: regular.
labeled: See: annotated.
lemma: Another name for a root.
lemmatization: The process of replacing each word with its lemma(s), typically by identifying and discarding affixes. For example, lemmatizing the sentence, "the biggest dogs ate the smallest meals," yields, "the big dog eat the small meal".
machine translation (mt): The fully automated process of translating text from one language to another.
morpheme: Informally, the smallest discernible unit of meaning. A word is typically composed of one or more morphemes, differentiated between roots and affixes. For instance, the English word farmers can be decomposed into three morphemes: farm -er -s, where farm is the root, and -er and -s are affixes.
There is evidence that morphemes are composed of yet smaller units of meaning. However, just as many scientists are happy to act as though atoms are the smallest units of matter, so we will maintain our happy delusion that morphemes are atomic.
morphological analysis: The process of breaking a word, or all the words of a given language, into component morphemes.
morphologically poor: Having comparatively few inflectional affixes and/or irregular forms. English is morphologically poor, having comparatively few word endings.
morphologically rich: Having comparatively many inflectional affixes and/or irregular forms. Czech is morphologically rich, having different word endings for gender, person, number, case, and phase of the moon, in all different combinations. (Well, I'm not sure about the last one.)
mt: See: machine translation.
n-gram: A string of n characters.
noise: In the context of a statistical application, noise refers to annotated data where the annotations are less than 100% accurate – a virtually universal condition of real-world data. This can lead a naive system to draw incorrect conclusions. A robust system either filters out the noise, or otherwise avoids weighting it too heavily.
open class: Refers to a class of objects (typically a part-of-speech) where the contents of the class cannot be readily enumerated, and change frequently. Examples: English nouns, verbs, adjectives.
parallel corpus (or parallel text): A corpus of text that has been translated directly between two (or more) languages. Typically, such a corpus is initially aligned only at the document level. i.e. It is not marked which sentences or words are translations of each other.
pbsmt: See: phrase-based statistical machine translation.
phrase-based statistical machine translation (pbsmt): A form of machine translation which translates a sequence of words ("phrase") in the source language to a sequence of words in the target language as a single operation. Contrast: WBSMT.
pos: Part-of speech: noun, verb, adjective, etc.
precision: A common measure of a system's correctness, calculated as: (number of correct identifications) / (number of identifications produced by the system). So, if there were 100 possible correct answers, but the system only offered 10 guesses, of which 6 were correct, the precision is: (6 / 10) = 60%. Contrast: recall, F-score.
productive: Refers to a process that can be applied in most situations, even novel cases. For instance, in English, applying -s to form a plural noun or -ing to form a progressive verb are highly productive processes. Applying -tion to form a noun from a verb is not productive, applying only in a relatively small number of cases.
recall: A common measure of a system's correctness, calculated as: (number of correct identifications) / (number of potential correct identifications). So, if there were 100 possible correct answers, but the system only offered 10 guesses, of which 6 were correct, the recall is: (6 / 100) = 6%. Contrast: precision, F-score.
regular: Refers to a process that follows a typical productive pattern. For instance, in English, forming cat + -s -> cats is regular pluralization. Forming man + -s -> men is irregular pluralization. The most common words are the most likely to follow irregular patterns, while novel words almost exclusively follow regular patterns (Pinker 1999).
robust : See: noise.
root (or lemma): An unbound morpheme, i.e. one that can stand alone as a word. Contrast affix.
rule-based: A system built around rules written by experts, rather than statistical machine learning.
semi-supervised: A statistical system that relies for training on a small amount of annotated data and a large amount of unannotated data. Contrast: supervised, unsupervised.
sense: A possible meaning for a given word. For example, the English word fence has at least three senses:
senseval: Senseval (now Semeval) is a series of workshops designed to test and showcase WSD systems. The first workshop, Senseval-1, was held in 1998. The most recent, Semeval-1/Senseval-4, was held in 2006.
source text: In a translation exercise, the input text that needs to be translated. The source language is the language of this text. Contrast: target text.
sparse: Data that does not well represent all possibilities. Ideally, you want your data to contain an example of every possible event you might see. However, this is only possible in a finite domain. If you're looking, for instance, at all possible uses of the word fence, you want at least several hundred sentences, if not several thousand, to capture the variety of possible contexts. The fewer such sentences you have, the greater your data sparseness. A major challenge of designing a statistical system is dealing with data sparseness.
stem: In this discussion, a stem is any word without inflectional affixes, containing one or more roots and zero or more derivational affixes. Specifically, a word's stem refers to the largest possible stem in that word. So, for the English word farmers, its root is farm, and its stem is farmer.
stemming: The process of replacing each word with its stem (or root), typically by identifying and discarding inflectional (or all) affixes.
supervised: A statistical system that relies for training on large amounts of annotated data, often requiring considerable time and effort by experts to annotate the data. Contrast: unsupervised, semi-supervised.
suppletive: A language type (see below).
target text: The output of translation. The target language is the language of this text. Contrast: source text.
templatic: A language type (see below).
test data: Annotated data set aside for testing a system's performance, For fair results, the test set must not be observed while building the system. Contrast: dev data.
token: In this discussion, a token is any occurrence of a string of non-punctuation, non-space characters. The precise definition varies between applications.
training data: Annotated data used to train a statistical system.
uma: See: unsupervised morphological analysis.
unannotated: Raw text that has not been annotated for any particular purpose, such as might come directly from a news broadcast or internet post.
unsupervised: A statistical system that relies for training only on unannotated data, often readily available in large amounts. Contrast: supervised, semi-supervised.
unsupervised morphological analysis (uma): A system that performs morphological analysis without requiring experts to write rules or break words into morphemes by hand.
wbsmt: See: word-based statistical machine translation.
word: In this discussion, the term word typically refers to a word form, although it may sometimes be used to indicate a token. The intended usage should be obvious from context.
word form: A class of tokens with identical spelling. For instance, the sentence "the big cat chased the big dog" contains seven tokens, but only five word forms.
word sense disambiguation (wsd): The process of determining which possible sense of a word is used in a given context.
word-based statistical machine translation (wbsmt): A form of machine translation which translates each word in the source language to a word in the target language as a single operation. Contrast: PBSMT.
zipf's law: George Kingsley Zipf observed a property of language that the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. Thus, the most frequent word occurs approximately twice as often as the second most frequent word, which occurs approximately twice as often as the fourth most frequent word, and so on. At the other end of the curve, the "long tail", rare events are by far the highest class of word forms, with hapax legomena being the highest class of all. (I haven't been able to confirm whether the same ratio applies.) Zipf's law has since been observed in a large number of natural (and unnatural) distributions.
Languages are often grouped into three main classes.
Todo: examples of each
We use these additional subclassifications in our discussion.
In reality, a language rarely fits precisely into one class. For instance, English is predominately analytic ("have been going"), but has some fusional ("he dance-s"), suppletive ("he went"), and even agglutinative ("garbage-truck-windshield-wiper-cleaner"), characteristics.
Off-the-shelf morphological tools are available to the research community for a variety of common languages. These systems are mostly rule-based and language-specific. We would expect them to outperform anything we could induce automatically.
Galen Andrew, Trond Grenager, and Christopher Manning. 2004. Verb Sense and Subcategorization: Using Joint Inference to Improve Performance on Complementary Tasks. EMNLP 2004: 150-157. [ps, pdf]
R.H. Baayen, R. Piepenbrock, and L. Gulikers. 1995. The CELEX Lexical Database (CD-ROM). LDC, University of Pennsylvania, Philadelphia..
Bogdan Babych and Anthony Hartley. 2003. Selecting Translation Strategies in MT using Automatic Named Entity Recognition. European Association for Machine Translation.
Timothy Baldwin. 2005. Bootstrapping Deep Lexical Resources: Resources for Courses. Proceedings of the ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition: 67–76. Ann Arbor, USA. [pdf]
J. Bilmes and K. Kirchhoff. 2003. Factored Language Models and Generalized Parallel Backoff. Human Language Technology Conference. [pdf]
A. Blum and T. Mitchell. 1998. Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the 1998 Conference on Computational Learning Theory. July 1998. [ps]
Clara Cabezas and Philip Resnik. 2005. Using WSD Techniques for Lexical Selection in Statistical Machine Translation. Technical Report: LAMP-TR-124/CS-TR-4736/UMIACS-TR-2005-42. University of Maryland, College Park, July 2005. [pdf]
Marine Carpuat and Dekai Wu. 2007a. How Phrase Sense Disambiguation outperforms Word Sense Disambiguation for Statistical Machine Translation. 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007). Skovde: Sep 2007. [pdf]
Marine Carpuat and Dekai Wu. 2007b. Improving Statistical Machine Translation using Word Sense Disambiguation. 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007). Prague: Jun 2007. [pdf]
Marine Carpuat and Dekai Wu. 2005. Word Sense Disambiguation vs. Statistical Machine Translation. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005). Ann Arbor, MI, Jun 2005. [pdf]
Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word sense disambiguation improves statistical machine translation. Proc. ACL. [pdf]
David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proceedings of ACL 2005: 263–270. Best paper award. [pdf]
I. Chugur, J. Gonzalo, F. Verdejo. 2000. Sense distinctions in NLP applications. Proceedings of Ontolex. [ps]
Martin Čmejrek, Jan Cuřín, Jiří Havelka, Jan Hajič, Vladislav Kuboň. 2004. Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal. [pdf, ps, download]
Jan Daciuk. 2004. Finite-State Lexical Tools. BIS 2004, 7th International Conference on Business Information Systems: 373-380. Witold Abramowicz (ed.), Wydawnictwo Akademii Ekonomicznej w Poznaniu, Poznań, Poland, 21-23 April, 2004. [download, mirror, website]
Jan Daciuk and Gertjan van Noord. 2001. A Finite-State Library for NLP. CLIN 2001. University of Twente, Enschede, the Netherlands, November 2001.
A. de Gispert. 2005. Phrase Linguistic Classification and Generalization for Improving Statistical Machine Translation. Proc. of the ACL Student Research Workshop (ACL'05/SRW): 67-72. Ann Arbor (Michigan), June 2005. [pdf]
Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1–38.
R. Florian and R. Wicentowski. 2002. Unsupervised italian word sense disambiguation using wordnets and unlabeled corpora. Proceedings of SigLEX’02: 67–73.
Dayne Freitag. Morphology Induction from Term Clusters. Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005): 128-135. Ann Arbor, MI, June 2005. [pdf]
Sharon Goldwater, David McClosky. 2005. Improving Statistical MT through Morphological Analysis. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Vancouver. [pdf, ps]
Nizar Habash, Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. ACL-05. [pdf]
D. Hiemstra. 1996. Using Statistical Methods to Create a Bilingual Dictionary. Master's thesis. Universiteit Twente.
Anas E. Isbihani, Shahram Khadivi, Oliver Bender, and Hermann Ney. 2006. Morpho-syntactic Arabic Preprocessing for Arabic to English Statistical Machine Translation. Proceedings on the Workshop on Statistical Machine Translation: 15-22. New York City, June 2006.
T. Joachims. 2003. Transductive Learning via Spectral Graph Partitioning. International Conference on Machine Learning (ICML). [pdf, ps (gz)]
T. Joachims. 1999. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. B. Schölkopf and C. Burges and A. Smola (ed.). MIT-Press. [pdf, ps (gz)]
Martin Kay. 1987. Nonconcatenative Finite-State Morphology. EACL 1987: 2-10. [pdf]
Edward Kenschaft. 2005. Improving Crosslingual Word Sense Disambiguation using Unlabeled Monolingual Corpora. Unpublished. [html]
K. Kirchhoff and M. Yang. 2005. Improved Language Modeling for Statistical Machine Translation, Proceedings of the ACL Workshop on Building and Using Parallel Texts. [pdf]
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. MT Summit 2005. [pdf]
Philipp Koehn. 2004. Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. AMTA 2004. [pdf, ps, slides]
Philipp Koehn. 2002. Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Draft, unpublished. [ps, data]
Philipp Koehn and Kevin Knight. 2003. Empirical Methods for Compound Splitting. EACL 2003. [pdf, ps, abstract]
Philipp Koehn and Christof Monz. 2006. Manual and Automatic Evaluation of Machine Translation between European Languages. Proceedings on the Workshop on Statistical Machine Translation. June 2006.
Philipp Koehn, Franz Josef Och, Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of the Human Language Technology Conference 2003 (HLT-NAACL 2003). Edmonton, Canada, May 2003. [pdf]
Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications, p. 160. University of Helsinki, Department of General Linguistics.
Young-Suk Lee. Morphological Analysis for Statistical Machine Translation. HLT-NAACL 2004: Short Papers: 57–60. Susan Dumais, Daniel Marcu, Salim Roukos, eds. Boston, Massachusetts, May 2004. [pdf, ps]
Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, Hany Hassan. 2003. Language Model Based Arabic Word Segmentation. Proceedings of the 41st Annual Meeting of the ACL: 399-406. Sapporo, Japan. [pdf]
L. Màrquez, M. Taulé, M.A. Martí, M. García, N. Artigas, F. Real, D. Ferrés. 2004. Senseval-3: The Spanish Lexical Sample Task. Proceedings of the Senseval-3 ACL-SIGLEX Workshop. Barcelona, Spain. [pdf]
Martin, Mihalcea and Pedersen. 2005. Word Alignment for Languages with Scarce Resources. Proceedings of the ACL Workshop on Building and Using Parallel Texts. Ann Arbor, MI, June 29-30. [pdf]
Rada Mihalcea and Phil Edmonds, ed. 2004. Proceedings of Senseval-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. Association for Computational Linguistics: Barcelona, Spain, July, 2004. [website, data]
Mohammad and Pedersen. 2004. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation. Proceedings of the Conference on Computational Natural Language Learning (CoNLL). Boston, MA, May 6-7, 2004. [pdf]
Sonja Nießen, Hermann Ney. 2004. Statistical Machine Translation with Scarce Resources Using Morpho-Syntactic Information. Computational Linguistics, Volume 30, Number 2: 181-204, June 2004. [CogNet]
Sonja Nießen, Hermann Ney. 2001a. Toward hierarchical models for statistical machine translation of inflected languages. ACL-EACL-2001: 39th Annual Meeting of the Association for Computational Linguistics - joint with EACL 2001: Proceedings of the Workshop on Data-Driven Machine Translation: 47-54. Toulouse, France, July 2001. [ps]
Sonja Nießen, Hermann Ney. 2001b. Morpho-syntactic analysis for Reordering in Statistical Machine Translation. Proceedings of the MT Summit VIII: 247-252. Santiago de Compostela, Galicia, Spain, September 2001. [pdf, ps]
Sonja Nießen, Hermann Ney. 2000. Improving SMT Quality with Morpho-Syntactic Analysis. Proceedings of the 20th International Conference on Computational Linguistics: 1081-1085. Saarbrucken, Germany.
Franz Josef Och. 1999. An Efficient Method for Determining Bilingual Word Classes. Ninth Conf. of the Europ. Chapter of the Association for Computational Linguistics (EACL'99): 71-76. Bergen, Norway, June 1999. [ps]
Franz Josef Och and Hermann Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics 30: 417–449.
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics 29(1): 19-51. March 2003.
Thanh Phong Pham, Hwee Tou Ng, and Wee Sun Lee. 2005. Word Sense Disambiguation with Semi-Supervised Learning. Proceedings of the 20th National Conference on Artificial Intelligence (AAAwe 2005): 1093-1098. Pittsburgh, Pennsylvania, USA. [pdf]
Steven Pinker. 1999. Words and Rules: The Ingredients of Language. HarperCollins. [website]
Maja Popović and Hermann Ney. 2004a. Improving Word Alignment Quality using Morpho-syntactic Information. COLING04. [pdf]
Maja Popović and Hermann Ney. 2004b. Towards the Use of Word Stems and Suffixes for Statistical Machine Translation. Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC): 1585-1588. Lisbon, Portugal, May 2004. [ps]
Maja Popović, David Vilar, Hermann Ney, Slobodan Jovičić, Zoran Šarić. 2005. Augmenting a Small Parallel Text with Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation. Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond: 41-48. Ann Arbor, Michigan, June 2005. [pdf]
Philip Resnik and David Yarowsky. 2000. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Natural Language Engineering, 5(2): 113-133. [ps]
Bettina Schrader. 2006. Non-Probabilistic Alignment of Rare German and English Nominal Expressions. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC). Genoa, Italy, May 2006.
Bettina Schrader. 2004. Improving Word Alignment Quality Using Linguistic Knowledge. Proceedings of the Workshop on The Amazing Utility of Parallel and Comparable Corpora (LREC 2004 satellite event): 46-49. Lisbon, Portugal, May 2004. [pdf]
David
Vickrey, Luke Biewald,
Marc Teyssier, and Daphne
Koller. 2005. Word-Sense
Disambiguation
for Machine Translation. Proceedings
of Human Language Technology Conference and Conference on Empirical
Methods in Natural Language Processing:
771-778. Vancouver, British Columbia, Canada,
October 2005.
Richard Wicentowski. 2002. Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. Doctoral dissertation. Johns Hopkins University, Baltimore, Maryland, October 2002.
David Yarowsky. 2000. Hierarchical Decision Lists for Word Sense Disambiguation. Computers and the Humanities 34(2): 179-186. [pdf]
David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics: 189-196. Cambridge, MA. [ps]
David Yarowsky. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics: 88-95. Las Cruces, NM. [ps]
David Yarowsky and R. Florian. 2002. Evaluating Sense Disambiguation Performance Across Diverse Parameter Spaces. Journal of Natural Language Engineering. [ps]
David Yarowsky and Richard Wicentowski. 2001. Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-01.
David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-2000: 207-216. Hong Kong. [ps]