Edward Kenschaft, master's thesis based on pre-candidacy doctoral research, University of Maryland.
This page displays best under Firefox, but it should also be readable under Internet Explorer. Likewise, it makes use of color, but should be perfectly understandable in black & white.
This paper is also available in a PDF version.
My research at the University of Maryland involved three distinct but related areas within the field of computational linguistics:
We evaluated the following experimental hypotheses:
This paper summarizes all experimental results, with conclusions supporting hypotheses A, B, C and D.
Please note the partial glossary at the end of the paper.
Disclaimers: This paper is not intended for publication, but is rather a summary of my research efforts at UMD, from fall 2005 through 2006. It incorporates portions of several project reports, minimally edited, as well as descriptions of incomplete experiments. Summaries of related research efforts have generally not been updated since 2006.
Morphological analysis (or induction) is the process of breaking a word, or all the words of a given language, into component morphemes. The field has a substantial history, but efforts have typically been constrained in major ways:
The goal was an unsupervised morphological analysis (UMA) system that can be used to induce morphological structure in any language where written text is available, regardless of what else is known about the language.
Word sense disambiguation (WSD) is the process of determining which possible sense of a word is used in a given context. For instance, the English word fence has at least three senses:
Monolingual WSD is often
a poorly
defined problem, in that:
Various researchers have begun to explore crosslingual word sense disambiguation (CWSD), based on a corpus of sentences that have been translated between two languages. The possible translations of a word in the target language are taken to be its sense inventory. That is, if a word can be translated in three different ways, then these are its three "senses" – regardless of whether some other theoretical formalism would agree.
So far, research into CWSD has been constrained in several ways:
Building upon this tradition, we sought to:
Three of these approaches reduced data sparseness by introducing noisy data. Naturally, this involves trade-offs. However, results show that these approaches nonetheless helped in some cases. More research presumably could further improve noise filtering.
We attempted to evaluate UMA and CWSD in terms of their impact on machine translation (MT). However, we ran out of time before reaching the desired results.
Disclaimer: These highlights do not scratch the surface of a full literature survey on these topics.
Morphological analysis is the process of parsing a word into its component morphemes. This can be done manually or automatically, using either a rule-based or statistical system. Many of the most commonly studied languages have commercially available rule-based systems which achieve greater than 95% accuracy (citation???). However, developing such a system for a new language is time-intensive, as is annotating data for use by a supervised statistical system. We therefore focused our attention on unsupervised induction of morphological structure. Such a system could be created for a new language much faster than a rule-based or supervised system, as long as unannotated text is available.
This field has a long and rich history (e.g. Koskenniemi 1983; Kay 1987). We mention just a couple of efforts that most closely influenced our work.
Dayne Freitag (2005) offers a methodology for inferring morphosyntactic transformation rules, as opposed to morphemes per se. In a nutshell, his method involves the following steps:
Some advantages of Freitag's system:
Morphological analysis is related to a task known in the literature as (deep) lexical acquisition (DLA). Timothy Baldwin (2005) assumes a seed database, and describes a variety of methods for augmenting that database. His method for identifying morphemes is summarized as follows:
This method, I suspect, is theoretically inferior to Freitag's, since it does not share any of the advantages listed above, and arbitrarily limits the number of recognized morphemes. However, it should be significantly simpler and faster to implement.
Baldwin's non-morphological features depend on deeper semantic knowledge, which I do not presume to have.
Florian and Wicentowski (2002) used morphological analysis to improve WSD, referenced in (Wicentowski 2002).
I only discovered this paper recently (2008), and have not examined their work.
Philipp Koehn & Kevin Knight (2003) experimented with heuristics for dividing German compounds into component morphemes, by looking for known roots that occur as substrings of larger words. They evaluated the effects of their approach on word alignment (German↔English), word-based statistical machine translation (WBSMT), and phrase-based statistical machine translation (PBSMT).
Because German is agglutinative, each German word (particularly noun compound) may align with an arbitrarily long sequence of English words. This provides a significant challenge for word alignment, and for other applications that depend upon alignment, such as machine translation.
Koehn & Knight attempted to address this problem by learning rules for splitting German noun compounds into component parts. A part is either a subword or filler between subwords. They used only unannotated text for training.
Their algorithm involved two steps: generation and selection.
For generation, they looked at a new word, and listed every possible covering of that word with known words and fillers. For example, if the new word is aktionsplan, the four possible coverings are:
Their results are shown in (01).
| Splitting method | Alignment | BLEU | |||
|---|---|---|---|---|---|
| precision | recall | accuracy | WBSMT | PBSMT | |
| none | – | – | 94.2% | 29.1 | 30.5 |
| eager | 24.8% | 73.3% | 87.1% | 22.2 | 34.4 |
| frequency-based | 57.4% | 86.6% | 95.7% | 31.7 | 34.2 |
| parallel | 83.3% | 89.1% | 98.6% | 29.4 | 33.0 |
| parallel & POS | 93.8% | 90.1% | 99.1% | 30.6 | 32.6 |
Not surprisingly, the best word alignment, 99.1% accuracy, occurred when using all possible information, parallel text plus POS, up from 94.2% with unmodified text. The frequency-based metric produced only produced moderate improvements, to 95.7%. Eager splitting hurt alignment.
It is perhaps surprising that the best downstream results with MT, as measured by BLEU score, did not involve optimization with parallel text. For word-based MT, the frequency-based metric was by far the best performer, with a 2.6 and 1.6 improvement in BLEU score, respectively, over no splitting and splitting based on all information.
Even more interesting are the results with phrase-based MT. Here the best approach was dumb, eager splitting, with a whopping 3.9 BLEU improvement over no splitting (34.4 vs. 30.5). The frequency-based metric was a close second (34.2).
The success of eager splitting is attributed to the peculiar strength of PBSMT, which has no problem taking an arbitrary string of "words" and regrouping them into "phrases" as needed.
Nießen & Ney (2001a, 2004) introduced the idea of a hierarchical lexicon, where a word is represented at various levels of inflectional specificity, starting with the bare root. German↔English alignment is performed at each of these levels. The technique was adopted by various later researchers.
Nießen & Ney also employed various heuristics for syntactic restructuring, such as reordering questions and detaching verb prefixes, to make the German more like the English. Finally, they added other heuristics, such as detecting multiword phrases based on POS sequence.
Used all together, the techniques led to improvements comparable to an order of magnitude of data (02). This claim would be stronger if an additional order of magnitude were represented, e.g. from 500 to 5K.
Each column represents a different accuracy measure; lower is better.
| #sent | Method | 1-BLEU | mWER | SSER | ISER |
|---|---|---|---|---|---|
| 0 | baseline | 76.7 | 53.6 | 60.4 | 29.8 |
| restructure | 70.9 | 50.2 | 57.8 | 30.0 | |
| + all | 67.4 | 48.0 | 52.8 | 24.1 | |
| 5K | baseline | 52.6 | 38.0 | 37.3 | 17.4 |
| restructure | 47.9 | 34.7 | 33.6 | 15.2 | |
| + all | 47.1 | 33.9 | 31.8 | 13.7 | |
| 58K |
baseline | 46.3 | 34.1 | 30.2 | 14.1 |
| restructure | 43.7 | 32.5 | 26.6 | 12.8 | |
| + all | 42.9 | 31.8 | 26.3 | 11.8 |
Todo: Results with just hierarchical
lexicon, no restructuring?
Popović & Ney (2004a) built on the concept of the hierarchical lexicon (Nießen & Ney 2001a). They first parsed German words into roots and artificial inflectional morphemes, e.g. PRES for present tense. They then used a modified EM alignment algorithm to treat each of these complex "words" as a hierarchy, with alignment possible at any level. They evaluated effects on a variety of baseline systems, producing at least modest improvements in every case.
Popović et al. (2004b, 2005) used knowledge of the specific language pair to remove inflectional morphemes or function words that are not translatable, leading to a significantly reduced lexicon, and reduced alignment error rate.
I presume the alignment of function words to NULL has been handled in the mainstream alignment literature, although I haven't read up on it. Removing inflectional morphemes by hand seems (a) ad hoc, and (b) potentially detrimental to translation due to loss of information. Splitting apart but retaining all morphemes should provide the same benefits, without these objections.
Bettina Schrader (2004) also addressed German↔English alignment & translation (among other things), first looking at inflectional affixes, then noun compounding.
In experiments on three different data sets, Schrader removed inflectional affixes from the German, and aligned/translated only the lemmas. She evaluated on translation candidates generated using a bag-of-words dictionary generator (Hiemstra 1996). With all three data sets, recall went up with lemmatization. However, with the exception of one data set (Patente), precision went down. Schrader attributes this to a naive precision metric, which unduly penalizes the system for generating multiple translation candidates for the same term.
Using only the Patente data set, Schrader tried another experiment where she broke long compounds into component morphemes before alignment, e.g. Dämpfungsscheibenanordnung ('dampening disk assembly') → Dämpfung scheibe anordnung. Recall improved somewhat, while precision did not change significantly. Schrader noted that the system was severely hindered by its inability to handle alignments involving multi-word units. This suggests that results should be significantly better when evaluated according to a downstream PBSMT system.
Schrader (2006) addressed the problem of aligning German noun compounds with English word sequences. For data, she used the Europarl corpus (Koehn 2005) with both sides automatically POS tagged, and manually corrected alignments of ~100,000 German tokens to English counterparts.
Schrader was particularly interested in the problems posed by an inordinately large proportion of hapax legomena (i.e. word forms occurring only once) in the German data, as hapax legomena cannot be handled neatly by a statistical system. As shown in table (03), only 38% of English word forms are hapax, compared to 49% of German.
| Language | Tokens | Forms | Hapax | Other Rare | Frequent |
|---|---|---|---|---|---|
| English | 29,077,024 | 101,967 | 39,200 (38%) | 35,608 (35%) | 27,159 (27%) |
| German | 27,643,792 | 286,330 | 140,826 (49%) | 98,126 (34%) | 47,378 (17%) |
Of 512 German hapax legomena examined:
| English | German compound length | ||||
|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | >4 | |
| 1 word | 59 | 2 | 1 | 0 | 3 |
| 2 words | 30 | 119 | 56 | 15 | 10 |
| 3 words | 2 | 20 | 15 | 8 | 10 |
| 4 words | 0 | 1 | 0 | 0 | 2 |
The English phrase length is often one more than the compound
length, due to the presence of a prepositional phrase. For example, Kongresh-vorlage 'submission to Congress'.
Observation: Table (04) only seems to
support this observation for German noun compounds of length 1 or 2.
For longer compounds especially, a more precise analysis might lead to
improved heuristics.
Instead of word-to-word mappings, Schrader mapped German compound nouns to English word sequences.
Schrader argues that German compounds should not be broken into components, since the meaning of the compound may not be compositional. For instance, the German word Personen-stand is made up of roots literally translated as 'personal status', but the meaning of the compound is 'marital status'. However, a modern PBSMT system will take this into account, translating the sequence as a non-compositional unit when possible, and as compositional components otherwise.
Goldwater & McClosky (2005) addressed the problems of translating from a morphologically rich language (Czech) to a morphologically poor language (English).
For data, they used:
Czech's morphological richness leads to high data sparseness. While the numbers of rare lemmas in Czech and English are comparable, Czech has roughly twice as many rare inflected forms as English (07).
![]() |
Furthermore, features represented morphologically in Czech are often represented by other means in English (08).
| Czech | English | |
|---|---|---|
| a. | plural past tense |
plural past tense |
| b. | person negation genitive case instrumental case |
pronouns not of by or with |
| c. | gender | none |
| d. | other case markings | syntax |
A variety of approaches were compared. These are illustrated in (09), and explained following.
| a. | Original | Pro nĕkoho by její provedení mĕlo smysl . |
| b. | Lemmas | pro nĕkdo být jeho provedení mít smysl . |
| c. | Morpheme tokens | pro nĕkdo být PER_3 jeho provedení mít PER_X smysl . |
| d. | Modified lemmas | pro nĕkdo být+PER_3 jeho provedení mít+PER_X smysl . |
| e. | Vector | (pro) (nĕkdo) (být PER_3) (jeho) (provedení) (mít PER_X) (smysl) (.) |
Replacing each word with its associated lemma reduces data sparseness, but also loses information (09.b). This approach is expected to help when morphological information is not carried over into English (08.c,d), but may hurt in other cases.
Several different approaches to lemmatization were tried.In this approach, words are replaced by token sequences, each token representing one morpheme (09.c). This reduces data sparseness, although not quite as much as lemmatization. This approach is expected to help most when the Czech morpheme corresponds to an English function word (08.b).
First each word is lemmatized, then tags representing morphemes of interest are concatenated to the lemma (09.d). This eliminates some variation among word forms, thus slightly reducing data sparseness. This approach is expected to help when morphological information is carried over into English (08.a).
Each word is replaced by a feature vector including the lemma and tokens representing morphemes (09.e). This reduces data sparseness without loss of information, which should lead to optimum results. However, alignment is estimated using a variant of the EM algorithm, which may result in loss of accuracy.
Based on the results of the above experiments (10) (11), they created a combined model using tokens for person and negation morphemes (08.b), and the modified lemma method for number and tense (08.a). The results are in (12).
Experiments were with a word-to-word translation system. Explanation follows.
BLEU score difference of 00.9 is significant (p < .05).
| Dev | Test | ||
|---|---|---|---|
| a. | word-to-word | 31.1 | 27.0 |
| b. | truncate (k=6) | 35.3 | 28.3 |
| c. | lemmatize all | 35.5 | 29.9 |
| d. | except Pro | 35.0 | – |
| e. | except Pro, V, N | 34.6 | – |
| f. | n < 50 | 37.0 | 30.6 |
| Tokens | Mod-Lem | Vectors | ||
|---|---|---|---|---|
| a. | PER | 36.5 | 35.6 | 35.6 |
| b. | TEN | 36.5 | 36.1 | 36.4 |
| c. | PER,TEN | 35.5 | 36.2 | 35.5 |
| d. | NUM | 35.4 | 36.7 | 36.1 |
| e. | CASE | 35.3 | 34.0 | 33.7 |
| f. | NEG | 35.7 | 35.6 | 35.3 |
| Dev | Test | |
|---|---|---|
| combined model | 39.0 | 33.3 |
The best lemmatization results occurred with lemmatizing low-frequency words (10.f). Truncation (10.b) performed better than baseline (10.a), but worse than the best lemmatization.
Scores on other experiments (11) were reported on the dev set only. Surprisingly, none scored better than simple lemmatization.
Adding person-agreement tokens (PER) helped. Examination confirmed that they did, indeed, align with English pronouns. Inexplicably, tense-marking tokens (TEN) helped just as much, but using both PER and TEN canceled the benefits of either.
For modified lemmas, number (NUM) and tense tags helped, which makes sense since these are also marked in English. Case tags hurt, which also makes sense, since they increase data sparseness without providing any information marked in English.
Disappointingly, feature vectors never significantly outperformed morpheme tokens.
The combined model performed best of all, successfully combining the advantages of each method.
Young-Suk Lee (2004) addressed problems of translating Arabic, a templatic language, with English. He observes that in Arabic→English translation, one Arabic word frequently aligns with multiple English words, owing to functional affixes in Arabic, e.g.: llmEArDp ↔ "of the opposition". This one-to-many alignment poses technical challenges to PBSMT systems.
Lee's approach takes several steps:
Lee evaluates the effects of his approach on BLEU score for corpora ranging from 3.5K to 3.3M words, using either IBM Model 1 (a WBSMT system) or phrase-based alignment (13).
| # sentences | Model 1 | PBSMT | ||
|---|---|---|---|---|
| base | morph | base | morph | |
| 3.5K | .10 | .25 | .17 | .24 |
| 35K | .14 | .29 | .24 | .29 |
| 350K | .18 | .31 | .32 | .36 |
| 3.3M | .18 | .32 | .36 | .39 |
For Model 1, BLEU scores are vastly improved using Lee's method, roughly doubled. For PBSMT, the improvements are roughly equivalent to an order of magnitude of data.
Habash & Rambow (2005) also looked at Arabic→English translation. They noted that English has 50 morphological classes, while Arabic has roughly 333,000 in theory, 2200 attested.
Isbihani et al. (2006) compared the impact on MT of various methods for segmenting Arabic source text. They found that the best results came from an unsupervised method using a finite-state automaton augmented with a memory of word properties.
Crosslingual word sense disambiguation (CWSD) uses words of the target language as the sense inventory for words of the source language. For instance, if the source language word bank can translate to either banque or encaisser in the target language, then [banque, encaisser] are considered to be possible senses for bank. This nicely solves theoretical questions in WSD such as what level of granularity to use for word senses. If it leads to a different translation, it's a different sense; if it doesn't, it's not.
Vickrey et al. (2005) describe CWSD results for data taken from the Europarl fr-en parallel corpus (Koehn 2005). Their baseline always guesses the most common sense for a given word, yielding an F-score of 51.1%. Their logistic regression model yields an F-score of 62.0%.
They also describe an experiment with filling in a missing ambiguous word in a target sentence. This was intended to prove the potential usefulness of WSD in machine translation.
Attempts to use WSD as a preprocessor for MT have generally not produced encouraging results (Och 1999; Carpuat & Wu 2005; Cabezas & Resnik 2005). One explanation is that WSD artificially increases data sparseness, without focusing on senses that are relevant to translation (Yarowsky & Florian 2002). This is one motive for exploring crosslingual WSD, using target language words as the sense inventory for source language words (Resnik & Yarowsky 2000). In fact, one of the exercises in Senseval-3 was crosslingual (Mihalcea & Edmonds, ed. 2004).
Until recently, most research in WSD involved only annotated text. Some have begun looking into semi-supervised learning, augmenting a small amount of annotated text with a large amount of unannotated text, with optimistic results (Yarowsky 1995; Pham et al. 2005). Such an approach would be particularly useful when translating from a high-resource language into a low-resource language, since the amount of parallel text is likely to be small, while the amount of unannotated text in the source language is virtually without bound.
Carpuat & Wu (2005) used WSD to constrain possible translations, in contrast to our approach of providing probabilistic lexical selection recommendations. They concluded, "Even state-of-the-art WSD does not help BLEU score."
Cabezas & Resnik (2005) attempted to use crosslingual WSD to improve Spanish→English MT. They found no significant effect. They performed no lemmatization or morphological analysis; nor did they identify content words.
Pham et al. (2005) used sgt_light (Joachims 2003) on tasks similar to ours, with positive results.
Pham et al. published their results just as I was beginning my research, after I had completed a literature review. I did not encounter their paper until after I had written up my first round of results. Hence the relative lack of prominence of their work in my write-up.
The University of Maryland currently boasts the state-of-the-art PBSMT system, Hiero (Chiang 2005).
Previous PBSMT systems (e.g. Koehn 2004; Och & Ney 2004) generated translation candidates at two levels, word-level and phrase-level, where a word is any contiguous sequence of characters delimited by white space (i.e. a token), and a phrase is any contiguous sequence of words. The primary innovation of Hiero is to generate translation candidates from a hierarchy of phrases of arbitrary depth (i.e. phrases made up of phrases).
One effect of PBSMT in general, and hierarchical PBSMT in particular, is to mitigate the impact of overzealous tokenization. Two "words" which should not have been separated are simply grouped together by the system into functional "phrases". Thus, we expect to benefit by using the most aggressive morphological analysis available.
Caveat: Both Pharaoh and Hiero use a configurable maximum number of "words" per "phrase". Breaking words into their component morphemes will increase the length of any given "phrase". The phrase-length limit will therefore need to be increased, which is likely to increase processing time.
Hiero, like Pharaoh before it (Koehn 2004), provides a probabilistic lexical selection mechanism. The preprocessor can specify one or more proposed translations for a word or phrase, each with a feature label (e.g. "WSD") and a specified probability or cost. Feature weights are then optimized based on a dev set.
Joachims (2003) introduced the Spectral Graph Transducer (SGT) technology to perform semi-supervised machine learning using a small amount of annotated training data and a large amount of observed test data, so-called transductive learning. A trivial adaptation uses small amounts of test and training data and a large amount of unannotated data. Joachims points out that SGT subsumes various cotraining tasks (Yarowsky 1995; Blum & Mitchell 1998) as a special case. More specifically, SGT handles cotraining applications where the feature set for each view remains static across iterations.
In brief, SGT constructs a graph with examples as vertices and edge weights indicating similarity, and finds an approximate minimal cut between positive examples and negative examples using a spectral graph method.
Some advantages of SGT are:
In a paper on named entity recognition in MT, Babych & Hartley (2003) introduced the s-score heuristic for identifying content words in a large corpus.
Todo: describe
These experiments were designed to test the hypothesis:
Naturally, the first step in designing a morphological analysis system is to answer the question: What is a morpheme?
We don't.
Or, rather, we adopt a utilitarian approach: A morpheme is whatever can be analyzed constructively as a morpheme for our intended purpose. If our purpose is word sense disambiguation, a morpheme is any substring of a word such that treating it as a morpheme improves the accuracy of word sense disambiguation.
As an example, take the English word implementation. An expert might say this is two morphemes: implement + -ation. Our system may infer, instead, that it consists of five morphemes: im- + ple + -ment + -a + -tion. (Note that im-, -ment, and -tion are all frequently attested.) If it turns out this analysis works well for our evaluation metric, we don't dwell unduly over the implausibility of ple as a stem.
Our approach may therefore benefit considerably from optimization based on our choice of evaluation metric(s).
In selecting languages for experimentation, we wanted to cover
as wide a range of morphological
types as possible, including:
Listed in order of expected difficulty. Fully analytic languages are a moot point, since they have no morphology to speak of.
Disclaimers: It's been disputed whether German is truly agglutinative, as opposed to fusional (citation???). It appears to function so for our purposes. English is claimed by some to be virtually analytic (citation???).
For our initial experiments, we decided to begin with simpler languages (i.e. agglutinative and fusional), and move to more difficult languages (i.e. suppletive and templatic) later on. We were also constrained by pragmatic factors, e.g.:
We decided to begin with a selection of Europarl languages, using data available from the parallel corpus (Koehn 2005), Specifically, we examined:
By fortunate coincidence, WMT06 later used the same data source.
Several simple preprocessing heuristics and algorithms were evaluated. These included:
Our primary algorithm was based on (Koehn & Knight 2003) as a starting point. We chose their approach on the premise that it could be equally applicable to most other languages (agglutinative, fusional, or suppletive).
A key difference between German and fusional (particularly Romance) languages quickly became apparent. A German noun compound is built from:
Disclaimer: This description is based upon anecdotal observation, and may not be accurate.
In contrast, a word in a fusional language is built from:
One apparent exception to (a) in English is: inept ↔ in- + ept (?), where in- is the frequently attested negative prefix, but ept is not attested in any other context that I am aware of. Presumably it was a word in some earlier incarnation of the language, perhaps a variation of apt.
As another example, note the apparent morphological pattern in: ascend, descend, transcend.
The end result is that morphological analysis in German is a fairly straightforward statistical analysis of observable events (i.e. known roots), while English analysis involves statistical analysis of events which are themselves hidden (i.e. proposed morphemes). This description suggests EM, which, indeed, is similar to the iterative approach we ended up using, and even more similar to other approaches we did not get to try.
Our algorithm is (roughly) as follows:
We experimented with various options, including:
Ideally, we should set up a procedure to optimize these configuration options (and others) based on development data and evaluation metric.
Morphological analysis can be viewed as an exercise in the reduction of complexity without loss of information. Stand-alone methods of evaluation are designed to reflect this goal.
Rare events are a bane of statistical applications, as indicated by (among others) Schrader (2006) for German, and Goldwater & McClosky (2005) for Czech. Prima facia, a statistical system has a 0% chance of processing hapax legomena correctly, since they never occur in both test and training data. While most languages are not as bad as German or Czech, Zipf's law applies. Morphological analysis can have a potentially profound impact by reducing these intractable or nearly intractable words into recognizable component morphemes.
The problem may not be as bad as it seems at first. Even if half the forms are hapax (as for German or Czech), the percentage of singleton tokens will be considerably lower, typically less than one percent of all tokens (14). On the other hand, these hapax tokens are likely to be content words, critical to the adequacy of translation (or whatever your end purpose), whereas the highest frequency forms are most likely to be function words.
In other words, rare events are far more critical to correct translation than is captured by their relative impact on BLEU score or similar automated measures.
We performed analysis similar to that in Schrader (2006) for our languages of interest, before and after breaking words into component morphemes. For the purpose of analysis, we considered "rare" any form that occurs 10 or fewer times in the corpus (14).
| Total |
Frequency 1 (hapax) | Frequency 1-10 | |||||||
|---|---|---|---|---|---|---|---|---|---|
| # Forms | # Tokens | # Tokens | % Forms | % Tokens | # Tokens | % Forms | % Tokens | ||
| German | Words | 170,402 | 13,233,854 | 81,052 | 47.57% | 0.613% | 140,464 | 82.43% | 2.377% |
| Morphemes | 30,898 | 13,348,182 | 9,836 | 31.83% | 0.074% | 19,507 | 63.13% | 0.385% | |
| English | Words | 34,827 | 10,648,289 | 9,361 | 26.88% |
0.088% | 21,350 | 61.82% | 0.587% |
| Morphemes | 18,955 |
10,659,570 | 3,991 | 21.06% |
0.038% | 9,817 | 51.79% | 0.285% | |
| Spanish | Words | 77,526 | 12,495,786 | 25,772 | 33.24% | 0.201% | 54,908 | 70.83% | 1.182% |
| Morphemes | 32,407 | 12,530,435 | 8,102 | 25.00% | 0.065% | 18,334 | 56.57% | 0.425% | |
| French | Words | 55,985 | 12,036,764 | 16,404 | 29.30% | 0.136% | 37,045 | 66.17% | 0.875% |
| Morphemes | 25,880 | 12,058,591 | 5,756 | 22.24% | 0.048% | 13,524 | 52.26% | 0.334% | |
| Finnish | Words | 301,597 | 7,835,581 | 155,994 | 51.72% | 1.990% | 259,890 | 86.17% | 7.102% |
| Morphemes | 38,019 | 8,085,040 | 12,238 | 32.19% | 0.151% | 24,095 | 63.38% | 0.788% | |
Todo: Add rows for naive methods.
Todo: Bar graphs.
As expected, the results are dramatic for German, a compounding language, where the number of distinct forms drops 82% from 170,402 to 30,898, and the number of rare tokens drops 86% from 140,464 to 19,507. The results are even more dramatic for Finnish, a morphologically rich language, where the number of distinct forms drops 87% from 301,597 to 38,019, and the number of rare tokens drops 91% from 259,890 to 24,095. The reduction is also considerable for other languages. Even English, a morphologically poor language, sees reductions of around 50%.
Note that the number of tokens necessarily goes up as words are split into component morphemes. However, the amount of increase is not as significant as one might expect. Presumably, this reflects a characteristic of language that most common words are morphologically simple (citation???). It also confirms that we are making informed decisions about splitting morphemes, doing so only where it reduces complexity.
Frankly, the increase in number of tokens seems ridiculously low. One would expect it to be at least the same magnitude as the decrease in forms, but this is not always the case.
Contrast the naive method of treating every character as a morpheme (not shown). We would predict the number of forms to drop down to the size of the alphabet, but the number of tokens to jump to the number of characters in the corpus.
Or contrast another naive method, truncating every word to n characters (also not shown). We would expect this to dramatically reduce the number of distinct forms, and virtually eliminate rare events, while keeping the number of tokens constant. This demonstrates the theoretical limitation of this evaluation metric, in that it does not account for the loss of information involved in truncation. It also provides a possible indication why truncation performs so well in many applications (citation???).
Schrader (2006) used a totally different approach, both to simplification and evaluation, making it difficult to compare results.
Various metrics have been used to measure complexity vs. information in morphological analysis, e.g. saturation and informativeness (Baldwin 2005).
We did not have the opportunity to investigate these metrics.
One can also evaluate against an established standard. Possible candidates include:
We did not have the opportunity to pursue any of these evaluation standards. We do not expect our method to outperform language-specific rule-based systems, although it would be interesting to see how closely they compare.
Finally, one can evaluate the effects of morphological analysis on downstream applications. We looked at word sense disambiguation and machine translation (both described below), although results are incomplete.
These sets of experiments endeavored to improve the quality of crosslingual word sense disambiguation (CWSD) using a variety of approaches, including:
This set of experiments evaluated the effects on WSD results of supplementing annotated training data with unlabeled data, a semi-supervised approach known as transductive learning. The theory is that patterns in the unlabeled data can help elucidate patterns in the labeled data.
Historically, this was my first set of original experiments, in fall 2005. The results provided part of the impetus to pursue morphological analysis.
These experiments were designed to test the hypothesis:
For the core engine, we employed two machine learning packages, svm_light and sgt_light, both made publicly available by Thorsten Joachims.
svm_light is an implementation of support vector machines (Joachims 1999). Although svm_light includes a transductive learning option, it runs too slowly on large data sets to be of practical use.
Update: Joachims has since released a new version of svm_light which is supposed to perform much better on larger data sets, but I have not had the chance to try it out.
sgt_light is an implementation of spectral graph transducers (Joachims 2003), implementing binary (yes-no) classification only. We therefore trained a set of binary models for each word form, one model for each possible sense. We chose the best sense based on the highest-confidence model.
With sgt_light, the test data must be present when the model is built, so it is not possible to run experiments using training data only.
To align text, we used Pharaoh (Koehn 2004) wrapping GIZA++ (Och 2003).
For experimental data, we used Senseval WSD evaluation sets, and also data constructed from parallel corpora.
The initial set of experiments used the Senseval-2 data (available on Ted Pedersen's website) for performing (monolingual) WSD on the words interest and line. Each set was randomly divided into 75 test examples, 150 labeled training examples, and the remainder unlabeled examples, 2143 for interest and 3921 for line.
This was the reality-check stage. Since all the data came preprocessed from the same source, there was (presumably) no noise or comparability issue.
The next set of experiments used the (monolingual) SpanishLS data from Senseval-3. The data came preprocessed into 4195 test examples, 8430 labeled training examples, and 61252 unlabeled examples, covering a total of 46 lexical roots, ranging from 2 to 8 senses per root.
To anticipate an MT scenario, we constructed a CWSD exercise from parallel text. We chose the Europarl fr-en corpus (Koehn 2002), since it is large enough to produce many words and examples, and we have at least a fighting chance of interpreting both the source and target language texts.
Starting with the intersected fr-en and en-fr alignments, we generated a list of all the French content words, as indicated by s-score, and their aligned English words. Any alignment that occurred fewer than 6 times was lumped into UNK. Potential words of interest were identified with at least 500 examples, between 3 and 7 senses, no more than 5% UNK alignments, and no more than 65% of examples for any one sense. A visual scan filtered out remaining candidates whose different alignments represented only affixation variants. (This predated our work on morphological analysis.) That left a total of 42 French words and their English alignments. For each of these 42 words, the examples were randomly divided into 50 test examples, 50 dev examples, 150 labeled training examples, and the remainder (at least 250) unlabeled examples.
We again processed the parallel text from the Europarl fr-en experiment, only this time with English as the source language and French as the target language. The resulting data set contained 33 English words and their French alignments.
With each data set, we explored various methods of analyzing the data into features, optimizing for each set. Not all configuration options are described here. Comparable experiments are configured identically except as indicated.
Preprocessing options included the following.
Every experiment used at least local context and broad context options for feature selection. Later experiments also included other feature sets.
Following (Pham et al. 2005), we experimented with two alternatives for using multiple feature sets.
We explored different options for counting features, following (Cabezas & Resnik 2005).
Inverse Document Frequency (IDF) is essentially a hack that gives the system an extra hint as to which features are likely to be most significant. Various heuristics are possible, but they all follow the basic pattern of penalizing features that occur in examples from many different classes, and rewarding features that occur in only one or a few classes. We tried several heuristics, but do not discuss the results here.
It turned out that using a simple boolean for counts always produced the best results, but the optimal setting for IDF varied by data set.
We calculated the standard F-score for each experiment.
We compared against two baselines:
The first (naive) baseline was just a sanity check, since the second was always higher, as expected. The second baseline provided the target to beat.
However, quirks of svm_light and sgt_light led to a practical problem:
We therefore could not use a single tool to get the direct head-to-head comparison that would most clearly demonstrate our hypothesis.
Later experiments included filtering options, trading recall for precision.
The results with filtering are not presented in this paper. It generally had little effect on F-score.
F-score results are in (16), with explanations following.
| Data available | ||||
|---|---|---|---|---|
| training | + test | + unlabeled | ||
| interest | baseline | 50.7 | ||
| svm_light | 69.3 | 72.0 | 81.3 | |
| sgt_light | 76.0 | 85.3 | ||
| sgt-co | 80.0 | 85.3 | ||
| line | baseline | 60.0 | ||
| svm_light | 72.0 | 69.3 | 81.3 | |
| sgt_light | 68.0 | 80.0 | ||
| sgt-co | 69.3 | 77.3 | ||
| Spanish LS | baseline | 67.7 | ||
| svm_light | 83.2 | 85.1 | – | |
| sgt_light | 84.6 | 86.6 | ||
| sgt-co | 85.1 | 85.7 | ||
| Europarl fr-en | baseline | 56.7 | ||