Improving Lexical Selection using Morphological

Edward Kenschaft
LING895 Doctoral Candidacy Research

See also reading group notes.

Abstract

Word sense disambiguation (WSD) is the process of determining which sense of an ambiguous word is intended in context. The task has received considerable study, typically in a monolingual environment.

Crosslingual WSD treats each possible translation of a word into a target language as a distinct sense. WSD techniques can then be used to aid in the process of translation – specifically, in lexical selection.

Morphological analysis is the process of breaking a word into its component parts. This has also been studied in depth, but rarely in the context of WSD or translation. Unsupervised morphological induction can be used to induce morphological structure in any language where written text is available, regardless of what else is known about the language.

We demonstrate that unsupervised morphological induction can improve crosslingual WSD, which in turn improves lexical selection for machine translation.

Introduction

???

Related Work

Chiang 2005

The University of Maryland currently boasts the state-of-the-art phrase-based statistical MT (PBSMT) system, Hiero (Chiang 2005).

Previous PBSMT systems (e.g. Koehn 2004; Och & Ney 2004) generated translation candidates at two levels, word-level and phrase-level, where a word is any contiguous sequence of characters delimited by white space (i.e. a token), and a phrase is any contiguous sequence of words. The primary innovation of Hiero is to generate translation candidates from a hierarchy of phrases of arbitrary depth.

One effect of PBSMT in general, and hierarchical PBSMT in particular, is to mitigate the impact of overzealous tokenization. Two "words" which should not have been separated are simply grouped together by the system into functional phrases. Thus, we expect to benefit by using the most aggressive morphological analysis available.

Caveat: Both Pharaoh and Hiero impose a configurable maximum on the number of "words" in a phrase. Breaking words into their component morphemes will increase the length of any given "phrase". The phrase-length limit will therefore need to be increased, which is likely to increase processing time.

Hiero, like Pharaoh before it (Koehn 2004), provides a probabilistic lexical selection mechanism. The preprocessor can specify one or more proposed translations for a word or phrase, each with a feature label (e.g. "WSD") and a specified probability or cost. For any group of identical feature labels, the minimum error rate training (MERT) process then determines empirically how much weight to assign to that feature. The result is a better-informed system, where the worst that can happen is that the weight of this new feature is reduced to 0.
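The role of the feature weight can be illustrated with a small Python sketch. It assumes the usual log-linear formulation, in which a candidate's score is a weighted sum of feature values; the feature names, values, and weights below are hypothetical, and the weight tuning itself (MERT) is not shown.

```python
def model_score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the candidate translation with the highest model score."""
    return max(candidates, key=lambda c: model_score(candidates[c], weights))


# Hypothetical log-probability features for two translations of "bank".
candidates = {
    "banque": {"tm": -1.0, "lm": -2.0, "WSD": -0.2},
    "encaisser": {"tm": -0.9, "lm": -2.0, "WSD": -3.0},
}

# With the WSD weight at 0, the ranking is exactly that of the baseline system.
without_wsd = best_candidate(candidates, {"tm": 1.0, "lm": 1.0, "WSD": 0.0})
# With a positive WSD weight, the WSD recommendation can change the choice.
with_wsd = best_candidate(candidates, {"tm": 1.0, "lm": 1.0, "WSD": 0.5})
print(without_wsd, with_wsd)
```

Setting the weight to 0 removes the feature's influence entirely, which is why adding the feature should not hurt a properly tuned system.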

Crosslingual WSD

Do we need a separate section for WSD???

Crosslingual word sense disambiguation (WSD) uses words of the target language as the sense inventory for words of the source language. For instance, if the source language word bank can translate to either banque or encaisser in the target language, then [banque, encaisser] are considered to be possible senses for bank. This nicely solves theoretical questions in WSD such as what level of granularity to use for word senses. If it leads to a different translation, it's a different sense; if it doesn't, it's not.
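As a rough illustration, the sense inventory can be collected directly from word-aligned text. The Python sketch below assumes that (source, target) word pairs have already been extracted from an aligned parallel corpus; the function names and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_sense_inventory(aligned_pairs):
    """Map each source word to a Counter of the target words it aligned to."""
    inventory = defaultdict(Counter)
    for source_word, target_word in aligned_pairs:
        inventory[source_word][target_word] += 1
    return inventory


def most_common_sense(inventory, source_word):
    """Return the most frequent translation, or None for an unseen word."""
    senses = inventory.get(source_word)
    if not senses:
        return None
    return senses.most_common(1)[0][0]


# Toy aligned pairs, for illustration only.
pairs = [("bank", "banque"), ("bank", "banque"), ("bank", "encaisser")]
inventory = build_sense_inventory(pairs)
print(sorted(inventory["bank"]))
print(most_common_sense(inventory, "bank"))
```

The `most_common_sense` function corresponds to the simplest possible disambiguator: always guess the most frequent translation.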

Vickrey et al. 2005

Vickrey et al. (2005) describe crosslingual WSD results for data taken from the English→French Europarl parallel corpus (Koehn 2002). Their baseline always guesses the most common sense for a given word, yielding an accuracy of 51.1%. Their logistic regression model yields an accuracy of 62.0%, a considerable improvement.

Kenschaft 2005

Experiments suggest that one of the major obstacles to widespread implementation of crosslingual WSD is the morphological complexity of either or both languages (Kenschaft 2005). A morphologically complex source language leads to data sparseness by exploding the lexicon. A morphologically complex target language leads to data sparseness by exploding the sense inventory. For instance, looking at English-French translation, a naïve system considers the English student to have four senses [étudiant, étudiante, étudiants, étudiantes], even though a human observer can easily see there is only one.

We anticipate the greatest benefit to WSD will result from breaking down both source language and target language into morphemes and throwing away inflectional affixes, looking only at the stems. Thus student has only one sense [étudiant] and the French affixes -e and -s are ignored.
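The effect on the sense inventory can be sketched in a few lines of Python. The suffix stripper below is a deliberately crude, hypothetical stand-in for real morphological analysis; the suffix list and the minimum stem length are assumptions made for this example.

```python
def strip_inflection(word, suffixes=("s", "e"), min_stem=3):
    """Repeatedly strip the listed suffixes, keeping at least min_stem characters."""
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
    return word


# The four surface forms that a naive system treats as distinct senses.
senses = ["étudiant", "étudiante", "étudiants", "étudiantes"]
collapsed = sorted({strip_inflection(sense) for sense in senses})
print(collapsed)
```

On this example the four surface forms collapse to a single stem, so the training examples for all four are pooled under one sense.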

WSD & MT

Carpuat & Wu 2005

Carpuat & Wu (2005) concluded, "Even state-of-the-art WSD does not help BLEU score." Their experiments used WSD to constrain the possible translations, in contrast to our approach of providing probabilistic lexical selection recommendations.

Cabezas & Resnik 2005

Cabezas & Resnik (2005) attempted to use crosslingual WSD to improve Spanish→English MT. They found no significant effect. They performed no lemmatization or morphological analysis of any sort; nor did they limit their attempts to content words.

Morphological analysis is the process of parsing a word into its component morphemes. This can be done manually, which is obviously quite time-consuming, or automatically, using either a rule-based or statistical system. Many of the most commonly studied languages have commercially available rule-based systems which achieve greater than 95% accuracy (???). However, developing such a system for a new language is understandably time-intensive, as is annotating data for use by a supervised statistical system. We will therefore be focusing our attention on unsupervised (or semi-supervised???) induction of morphological structure, an area which has been studied in depth for many years (e.g. Koskenniemi 1983; Kay 1987; Koehn & Knight 2003).

Side note: The name Kenschaft would fit nicely with this list of contributors.

Freitag (2005) offers a methodology for inferring morphosyntactic transformation rules, as opposed to morphemes per se. In a nutshell, his method involves the following steps, explained in more detail below???.

  1. Group all tokens into cooccurrence classes.
  2. Identify Perl-like transformations between members of one cooccurrence class to members of another.
  3. Cull transforms to minimal set.

Some advantages of Freitag's system are as follows.

Morphological analysis is related to a task known in the literature as (deep) lexical acquisition (DLA) (Baldwin 2005). Baldwin assumes a seed database, and describes a variety of methods for augmenting that database. His method for identifying morphemes is summarized as follows.

  1. Generate all 1- to 6-grams.
  2. Filter out n-grams with fewer than (n=3) occurrences.
  3. Filter out any n-gram with same frequency as supersequence.
  4. Select (k=3900) n-grams with highest saturation (= most complete coverage?).
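Steps 1 to 3 can be sketched in Python as follows. This is a reading of the summary above, not Baldwin's implementation: it assumes the n-grams are character n-grams within words, and it omits step 4 because the saturation measure is not defined here. The function name and the toy word list are hypothetical.

```python
from collections import Counter


def candidate_ngrams(words, max_n=6, min_count=3):
    """Count character n-grams, then apply the two frequency filters."""
    # Step 1: generate all 1- to max_n-grams within each word.
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1

    # Step 2: drop n-grams with fewer than min_count occurrences.
    frequent = {gram: c for gram, c in counts.items() if c >= min_count}

    # Step 3: drop an n-gram if a generated n-gram one character longer
    # contains it and has the same frequency. Counts cannot increase as
    # n-grams grow, so checking immediate supersequences is sufficient
    # for every supersequence up to max_n characters.
    redundant = set()
    for gram, count in frequent.items():
        if len(gram) < 2:
            continue
        for sub in (gram[:-1], gram[1:]):
            if frequent.get(sub) == count:
                redundant.add(sub)
    return {gram: c for gram, c in frequent.items() if gram not in redundant}


words = ["walking", "talking", "walked", "talked", "walks", "talks"]
result = candidate_ngrams(words)
print(sorted(result.items()))
```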

This method, I suspect, is inferior to Freitag's, since it does not share any of the advantages listed above. However, it should be significantly faster to implement.

Baldwin's non-morphological features depend on deeper semantic knowledge, which I do not presume to have.

To the best of my knowledge, no one has researched the effects of morphological analysis on word sense disambiguation (WSD), either monolingual or crosslingual. As noted above, previous experiments provide anecdotal evidence that it should help dramatically.

Nießen & Ney (2000, 2001b, 2004) explore morphosyntactic restructuring to improve English-German alignment. Their experiments show significant improvement in translation quality, although the most effective techniques were syntactic rather than morphological. Their techniques are specific to this language pair.

Nießen & Ney (2001a) introduce the idea of a hierarchical lexicon, where a word is represented at various levels of inflectional specificity, starting with the bare stem. The technique yielded a slight reduction in subjective semantic error rate. The technique was further developed by various later researchers.

Koehn & Knight (2003) experiment with heuristics for dividing German compounds into their component parts, evaluating the effect on alignment (German-English), word-based MT, and phrase-based MT. The technique produced satisfying improvements (0.026 BLEU) with word-based MT, and impressive improvements (0.039 BLEU) with phrase-based MT. This is explained by the observation that overzealousness in dividing a word into alleged morphemes is no problem for phrase-based MT, which simply regroups morphemes into phrases as necessary.

I expect Hiero to show even greater improvements, particularly with overzealous tokenization.

Improvements in alignment were most dramatic when information from the parallel text was taken into account. Basically, every possible split is checked against the aligned text to see if it corresponds with English words, and discarded if it does not.

Lee (2004) observes that in Arabic-English translation, one Arabic word frequently aligns with multiple English words, owing to functional affixes in Arabic, e.g. llmEArDp → of the opposition. This one-to-many alignment poses technical challenges to PBSMT systems.

Lee addresses this problem with a several-step process:

  1. POS tag Arabic and English parallel text.
  2. Segment Arabic words into prefix(es)-stem-suffix(es).
  3. Align segmented Arabic text with English text.
  4. Determine translation probability of each Arabic stem/affix and POS into English POS.
  5. Evaluate each Arabic affix.
    1. If the affix robustly translates to an English POS, keep it as a token.
    2. If the most common translation for the affix is NULL, delete it.
    3. Else, merge the affix back into the stem.
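The decision in step 5 can be sketched in Python as below. The summary above does not say what counts as "robustly" translating, so the probability threshold is an assumption, as are the function name and the example distributions.

```python
def decide_affix(translation_probs, keep_threshold=0.5):
    """Decide what to do with one Arabic affix.

    translation_probs maps each English POS tag, or "NULL", to the
    probability that the affix aligns to it.
    """
    pos_probs = {tag: p for tag, p in translation_probs.items() if tag != "NULL"}
    # 5.1: the affix robustly translates to an English POS, so keep it as a token.
    if pos_probs and max(pos_probs.values()) >= keep_threshold:
        return "keep"
    # 5.2: the most common translation is NULL, so delete the affix.
    if max(translation_probs, key=translation_probs.get) == "NULL":
        return "delete"
    # 5.3: otherwise merge the affix back into the stem.
    return "merge"


print(decide_affix({"IN": 0.8, "NULL": 0.1, "DT": 0.1}))
print(decide_affix({"NULL": 0.7, "DT": 0.3}))
print(decide_affix({"NN": 0.4, "VB": 0.35, "NULL": 0.25}))
```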

He evaluates the effects on BLEU score for corpora ranging from 3.5K to 3.3M words, using either Model 1 or phrase-based alignment. For Model 1, the BLEU score roughly doubles in every case. For PBSMT, the improvements are more modest, but significant, e.g. 0.36→0.39 for the 3.3M corpus.

The PBSMT scores include a manual deletion of multiple definite determiners in a single Arabic phrase.

Popović & Ney (2004a) build on the concept of the hierarchical lexicon (Nießen & Ney 2001a). They first parse German words into stems and artificial inflectional morphemes, e.g. PRES for present tense. They then use a modified EM alignment algorithm to treat each of these complex "words" as a hierarchy, with alignment possible at any level. They evaluate effects on a variety of baseline systems, producing at least modest improvements in every case.

Popović et al. (2004b, 2005) use knowledge of the specific language pair to remove inflectional morphemes or function words that are not translatable, leading to a significantly reduced lexicon, and reduced AER.

I presume the alignment of function words to NULL has been handled in the mainstream alignment literature, although I haven't read up on it. Removing inflectional morphemes by hand seems (a) ad hoc, and (b) potentially detrimental to translation due to loss of information. Splitting apart but retaining all morphemes should provide the same benefits, without these objections.

Schrader (2004) performed experiments to determine the impact of morphological analysis (among other things) on German-English alignment & translation. In experiments on three different data sets, she removed inflectional affixes from the German, and aligned/translated only the roots (lemmas). She evaluated on translation candidates generated from tokens of interest in new sentences, using a bag-of-words dictionary generator (Hiemstra 1996).

With all three data sets, recall went up with lemmatization. However, with the exception of one data set (Patente), precision went down. Schrader attributes this to a naïve precision metric, which unduly penalizes the system for generating multiple translation candidates for the same term.

Using only the Patente data set, Schrader tried another experiment where she broke long compounds into component morphemes before alignment, e.g. Dämpfungsscheibenanordnung ('dampening disk assembly') → Dämpfung scheibe anordnung. Recall improved somewhat, but precision did not change significantly. Schrader noted that the system was severely hindered by its inability to handle alignments involving multi-word units. This suggests that results should be significantly better with a modern PBSMT system.

Schrader (2006) addresses the problem of aligning German noun compounds with English multi-word units. She identifies a compound as any word of 12 or more characters, marked by a POS tagger as a noun. When a compound noun is identified, she invokes a special procedure for identifying potential English multi-word units with which it might align. Overall, the process resulted in 70% recall.

No baseline for comparison exists, confirmed in email from Bettina.

Schrader argues that compounds should not be broken into components, since the meaning of the compound may not be compositional. However, a modern PBSMT system will take this into account, translating the sequence as a unit when possible, and as individual components when not.

Goldwater & McClosky (2005) provide a good summary of the issues motivating morphology in MT, and the history of related research. They also set up a variety of experiments in Czech-English MT, resulting in a BLEU score improvement from 0.270 to 0.333.

Their last experimental setup is supposed to use full morphemes. However, they are not what I conceive of as morphemes. First they lemmatize each word and attach pseudowords (cars → car+PL), then they fiddle with the alignment algorithm to treat the word as a complex structure. As discussed above, this has potential advantages over treating morphemes as words (cars → car +s) and using GIZA++ for alignment, but is also more challenging to implement.

A quick glance at their sample Czech data suggests that it is highly fusional and suppletive. This suggests that I would be better off postponing work with Czech until I have made progress with a cleaner agglutinative language, such as German.

Isbihani et al. 2006

Isbihani et al. (2006) compared the impact on MT of various methods for segmenting Arabic source text. They found that the best results came from an unsupervised method using an FSA augmented with a memory of word properties.

???

Corpus Linguistics

Babych & Hartley 2003

As part of a paper on ???, Babych & Hartley (2003) introduced the s-score heuristic for identifying content words in a large corpus.

Task Formulation

Hypothesis

This set of experiments is intended to support the following (complex) hypothesis:

  1. Unsupervised morphological induction (UMI) can improve the quality of crosslingual WSD.
  2. Crosslingual WSD can improve the quality of phrase-based statistical MT.

Resources

Software

For WSD we will use the system described in (Kenschaft 2005).

For MT we will use Hiero for original experiments. The WSD system will provide lexical selection recommendations.

Languages of Interest

Choices of language are driven by several factors:

  1. morphological complexity & variation
  2. ease of experimentation
    1. availability of parallel text
    2. readability to native English speaker (me)
  3. demand within the broader research community

We mostly limit our experiments to language pairs with English as the target language, because English is relatively simple morphologically. We can therefore hope to get away with doing morphological analysis only on the source side, postponing various challenges introduced by doing similar analysis on the target side.

For source languages, we will include high-resource languages covering all the major language types.

The types are presented in order of the expected degree of challenge to an unsupervised morphological induction system.

Preprocessing Methods

Several different preprocessing heuristics and algorithms will be evaluated. The simplest will be to truncate each word to n characters, with n ranging from 4 to 7, a la (Och ???). Other simple heuristics will include stripping off inflectional endings known to be common, such as -e and -s in French.
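The truncation heuristic is simple enough to state in a few lines of Python. This is a minimal sketch assuming whitespace tokenization; the function name and the sample sentence are hypothetical.

```python
def truncate_tokens(sentence, n):
    """Truncate every whitespace-delimited token to at most n characters."""
    return " ".join(token[:n] for token in sentence.split())


# Show the effect for each truncation length under consideration.
for n in range(4, 8):
    print(n, truncate_tokens("die Regierungen verhandelten", n))
```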

A more sophisticated UMI algorithm will be based on (Koehn & Knight 2003). In a nutshell, we will use known root words to posit structure for longer words containing those roots. Koehn & Knight successfully used this approach with German complex nouns. We predict that it will also be of benefit for more inflectional, less agglutinative languages, such as the Romance languages.
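The core idea of positing structure from known roots can be sketched as follows. This Python example is only an illustration under simplifying assumptions: it uses a hypothetical set of known words, prefers the longest known prefix at each step, and ignores linking elements and frequency information, so it is not the algorithm of Koehn & Knight.

```python
def split_compound(word, known_words, min_part=3):
    """Return a list of known parts covering word, or [word] if none is found."""
    word = word.lower()

    def helper(rest):
        if not rest:
            return []
        # Try the longest known prefix first, then shorter ones.
        for end in range(len(rest), min_part - 1, -1):
            prefix = rest[:end]
            if prefix in known_words:
                tail = helper(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no full cover of the remaining characters exists

    parts = helper(word)
    return parts if parts else [word]


# Hypothetical vocabulary of known root words.
known = {"haus", "tür", "schlüssel"}
print(split_compound("Haustürschlüssel", known))
print(split_compound("xyz", known))
```

A word that is itself in the known vocabulary is returned whole, since the longest prefix is tried first.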

If we have time, we will also implement other sophisticated heuristics, such as those described in (Freitag 2005) and (Baldwin 2005).

Evaluation

We will evaluate our approaches against the following state-of-the-art systems.

Vickrey et al. 2005

Vickrey et al. (2005) describe crosslingual WSD results for data taken from the English→French Europarl parallel corpus (Koehn 2002). They graciously sent us a copy of their data so we could replicate their baseline experiment and compare our results to theirs. We will also show results for all other Europarl languages translating into English.

Vickrey et al. also describe an experiment with filling in a missing ambiguous word in a target sentence. This was intended to prove the potential usefulness of WSD in machine translation. We will repeat their experiment using our own WSD system.

Cabezas & Resnik 2005

Cabezas & Resnik (2005) attempted to use crosslingual WSD to improve Spanish→English MT, without significant success. We will use the same data, but using morphological analysis, and limiting our efforts to content words.

Koehn & Knight 2003

Koehn & Knight (2003) used morphological analysis of German compound nouns to improve German→English MT, with significant success. We will attempt to get a copy of their data, or else construct a comparable framework from the Europarl corpus. We may use Pharaoh instead of Hiero for this set of experiments, in order to most closely mimic their environment. We will show results using:

  1. just the original text (baseline)
  2. morphological analysis alone
  3. morphological analysis and lexical selection using crosslingual WSD

Since our focus is on lexical selection rather than morphology alone, we will be less concerned with whether our result (2) improves upon (Koehn & Knight 2003), and more concerned with the improvement from (2) to (3).

If time permits, we will also implement Koehn & Knight's algorithm for identifying crosslingually relevant morphology.

Hiero

The UMD MT team continues to develop and enhance Hiero. We will evaluate using a recent stable version of the system, Europarl language data, and GALE Arabic→English data. We will compare results using:

  1. Hiero alone
  2. morphological analysis on source text
  3. WSD for lexical selection.

Again, we will be more interested in the improvement from (2) to (3).

Isbihani et al. 2006

I am not inclined to pursue any comparisons with this paper, except perhaps as a follow-up exercise.

Agenda

Our experiments will follow roughly the following plan.

The complete long-term agenda can be found under Future Work.

Observations and Results

WSD

Preliminary

The number of possible experiments based on (I) above is obviously huge. We ran several representative experiments on small data sets with various languages, using English translations as a sense inventory. Some results are listed in Table 1, using the following key.

Experiments are sorted by language in decreasing order of accuracy. Blue is the baseline.

Table 1: Naïve WSD results
Lang   Process  Dev    Types  Senses  Examples  S/T    E/T      E/S
de-en  trunc6   0.575  99     1505    127071    15.20  1283.55  84.43
       trunc5   0.574  96     1961    155221    20.43  1616.89  79.15
       diac     0.559  101    598     45048     5.92   446.02   75.33
       none     0.558  101    587     44330     5.81   438.91   75.52
       lc       0.558  101    587     44330     5.81   438.91   75.52
es-en  trunc6   0.614  99     1006    101280    10.16  1023.03  100.68
       trunc5   0.608  98     1490    136166    15.20  1389.45  91.39
       diac     0.604  104    618     50637     5.94   486.89   81.94
       none     0.598  104    610     46811     5.87   450.11   76.74
       lc       0.598  104    610     46811     5.87   450.11   76.74
       strip    0.575  97     825     93046     8.51   959.24   112.78
fi-en  trunc6   0.560  60     522     37618     8.70   626.97   72.07
       trunc5   0.557  58     755     55885     13.02  963.53   74.02
       lc       0.539  63     347     25411     5.51   403.35   73.23
       none     0.539  63     347     25411     5.51   403.35   73.23
       diac     0.538  63     347     25411     5.51   403.35   73.23
fr-en  trunc6   0.598  75     930     79461     12.40  1059.48  85.44
       trunc5   0.593  74     1338    111140    18.08  1501.89  83.06
       diac     0.586  80     485     36139     6.06   451.74   74.51
       none     0.584  80     478     35857     5.98   448.21   75.01
       lc       0.583  80     478     35857     5.98   448.21   75.01
       strip    0.568  77     650     66794     8.44   867.45   102.76

graphs???

Certain observations jump out across all languages tested.

  1. Of these naïve heuristics, only truncation has a significant positive effect. This suggests that truncation alone might help MT, as suggested in (Och ???).
  2. Lowercasing and removing diacritics have negligible effect. We held these constant for later experiments.
  3. Stripping common inflectional endings has a significant negative effect. This suggests that those endings carry more semantic weight than we may have thought. The final -s, in particular, might carry information of benefit to lexical selection.
  4. There is no obvious correlation of accuracy with any of the statistics noted, unless we throw out "strip" experiments as an outlier. In that case, there is a correlation with E/S, the average number of examples per sense. In other words, if we can increase the amount of available data without losing important information or increasing the complexity of the problem, we win.
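The correlation mentioned in observation 4 can be checked directly from the Dev and E/S columns of Table 1. The Python sketch below computes a plain Pearson coefficient with and without the "strip" rows; pooling all four language pairs into one sample is an assumption made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (process, dev accuracy, E/S) for every row of Table 1.
rows = [
    ("trunc6", 0.575, 84.43), ("trunc5", 0.574, 79.15), ("diac", 0.559, 75.33),
    ("none", 0.558, 75.52), ("lc", 0.558, 75.52),
    ("trunc6", 0.614, 100.68), ("trunc5", 0.608, 91.39), ("diac", 0.604, 81.94),
    ("none", 0.598, 76.74), ("lc", 0.598, 76.74), ("strip", 0.575, 112.78),
    ("trunc6", 0.560, 72.07), ("trunc5", 0.557, 74.02), ("lc", 0.539, 73.23),
    ("none", 0.539, 73.23), ("diac", 0.538, 73.23),
    ("trunc6", 0.598, 85.44), ("trunc5", 0.593, 83.06), ("diac", 0.586, 74.51),
    ("none", 0.584, 75.01), ("lc", 0.583, 75.01), ("strip", 0.568, 102.76),
]

r_all = pearson([es for _, _, es in rows], [acc for _, acc, _ in rows])
kept = [(acc, es) for proc, acc, es in rows if proc != "strip"]
r_no_strip = pearson([es for _, es in kept], [acc for acc, _ in kept])
print(f"all rows: r = {r_all:.2f}; without strip: r = {r_no_strip:.2f}")
```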

Morphology

It is difficult to evaluate a system for morphological analysis directly. However, it is possible to produce anecdotal evidence to highlight the potential benefits, and to demonstrate its impact on downstream applications where evaluation metrics are available.

It should also be possible to compare results against rule-based systems for well-studied languages, but I have not tried this.

Rare Events

Schrader (2006) observes that compounding languages such as German have an inordinately high proportion of rare events. Indeed, roughly half of the token types are hapax. Presumably, a statistical system has a zero-percent chance of translating these correctly, since the ones that occur in the test set never occur in training. Likewise, the frequency of rare events (however defined) will adversely affect the quality of translation. While most languages are not as bad as German, Zipf's law suggests that rare events will always have a comparatively high frequency. Morphological analysis can have a potentially profound impact by reducing these impossible or nearly impossible cases to component morphemes which the system recognizes.

The problem may not be as bad as it seems at first. Even if half the token types are hapax, the percentage of singleton tokens will be considerably smaller, often less than one percent of all tokens. On the other hand, these hapax tokens are likely to be critical to the adequacy of translation, whereas the highest frequency types are most likely to be function words.

In other words, rare events are far more critical to correct translation than is captured by their relative impact on BLEU score or other automated measures. Note for future work: Develop an evaluation metric for MT which weights the contribution of a word on the overall score based on its s-score.

We performed analysis similar to that in Schrader (2006) for German, English, French, Spanish, and Finnish corpora, both before and after breaking words into component morphemes. For the purpose of analysis, we considered "rare" any word that occurs 10 or fewer times in the corpus. The results are in Table 2.
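The statistics in question can be computed with a short Python function. This is a minimal sketch assuming a non-empty, pre-tokenized corpus; the function name and the toy sentence are hypothetical, and the code is not the script used to produce the table.

```python
from collections import Counter


def rare_event_stats(tokens, max_freq):
    """Count the types occurring at most max_freq times and their share of the corpus."""
    counts = Counter(tokens)
    rare_types = [t for t, c in counts.items() if c <= max_freq]
    rare_tokens = sum(counts[t] for t in rare_types)
    return {
        "types": len(counts),
        "tokens": len(tokens),
        "rare_types": len(rare_types),
        "pct_types": 100.0 * len(rare_types) / len(counts),
        "pct_tokens": 100.0 * rare_tokens / len(tokens),
    }


tokens = "the cat sat on the mat the end".split()
hapax = rare_event_stats(tokens, max_freq=1)   # words occurring exactly once
rare = rare_event_stats(tokens, max_freq=10)   # words occurring 10 or fewer times
print(hapax)
print(rare)
```

Running the same function on the word-level and morpheme-level versions of a corpus gives the before and after figures for that language.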

Table 2: Rare events

                    Total                 Frequency 1                  Frequency 1-10
                    # Types  # Tokens     # Types  % Types  % Tokens   # Types  % Types  % Tokens
English  Words      34,827   10,648,289   9,361    26.88%   0.088%     21,350   61.82%   0.587%
         Morphemes  18,955   10,659,570   3,991    21.06%   0.038%     9,817    51.79%   0.285%
Finnish  Words      301,597  7,835,581    155,994  51.72%   1.990%     259,890  86.17%   7.102%
         Morphemes  38,019   8,085,040    12,238   32.19%   0.151%     24,095   63.38%   0.788%
French   Words      55,985   12,036,764   16,404   29.30%   0.136%     37,045   66.17%   0.875%
         Morphemes  25,880   12,058,591   5,756    22.24%   0.048%     13,524   52.26%   0.334%
German   Words      170,402  13,233,854   81,052   47.57%   0.613%     140,464  82.43%   2.377%
         Morphemes  30,898   13,348,182   9,836    31.83%   0.074%     19,507   63.13%   0.385%
Spanish  Words      77,526   12,495,786   25,772   33.24%   0.201%     54,908   70.83%   1.182%
         Morphemes  32,407   12,530,435   8,102    25.00%   0.065%     18,334   56.57%   0.425%

bar graphs ???

As expected, the results are dramatic for German, a compounding language, where the number of distinct types drops 82% from 170,402 to 30,898, and the number of rare types (those occurring 10 or fewer times) drops 86% from 140,464 to 19,507. The results are even more dramatic for Finnish, a highly inflected language, where the number of distinct types drops 87% from 301,597 to 38,019, and the number of rare types drops 91% from 259,890 to 24,095. We expect to see the most impact on translation quality in these languages.

However, the reduction in complexity is considerable for all languages, even English, a mostly isolating language. We can reasonably expect improvements in translation and other downstream applications in all these languages.

Note that the number of tokens necessarily goes up as words are broken into component morphemes, but the amount of increase is not as significant as one might expect. Presumably, this reflects the characteristic of language that the most common words are likely to be morphologically simple (or irregular, in the case of common verbs).

To be honest, the amount of increase seems ridiculously low. You would expect the increase in tokens to be at least the same magnitude as the decrease in types, but this is not always the case. I will need to double check the code, in case I missed something in the counting. But I expect the conclusions to remain essentially the same.

Schrader (2006) used a totally different approach, both to simplification and evaluation. It is difficult to compare results.

Morphology & WSD

This set of experiments evaluates the use of morphological analysis to improve WSD results.

The analysis involves the following systems:

  1. baseline – always choose the most common sense
  2. sgt.unmod – run the WSD system on unmodified data
  3. sgt.trunc5 – run the WSD system with all words truncated to 5 characters
  4. sgt.trunc6 – run the WSD system with all words truncated to 6 characters
  5. sgt.morph – run the WSD system using full morphological analysis
  6. best – for each word, evaluate each of the above 5 systems on a dev set, and use the best-performing system to make predictions on the test set
  7. best (no morph) – same as (6), but omitting sgt.morph
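The per-word selection used by the "best" systems can be sketched in Python as follows. The data structures, function names, and toy systems are hypothetical; the sketch assumes that each system's dev-set accuracy has already been measured for every word.

```python
def select_best_systems(dev_accuracy):
    """For each word, pick the system with the highest dev-set accuracy.

    dev_accuracy maps word -> {system name: accuracy on the dev set}.
    Ties go to the system listed first.
    """
    return {word: max(scores, key=scores.get) for word, scores in dev_accuracy.items()}


def predict(word, context, systems, best_system, default="baseline"):
    """Disambiguate word with its selected system, or the default for unseen words."""
    name = best_system.get(word, default)
    return systems[name](word, context)


# Toy dev-set accuracies and stand-in systems, for illustration only.
dev_accuracy = {
    "love": {"baseline": 0.50, "sgt.unmod": 0.55, "sgt.morph": 0.60},
    "bank": {"baseline": 0.70, "sgt.unmod": 0.65, "sgt.morph": 0.60},
}
systems = {
    "baseline": lambda word, context: "most-common-sense",
    "sgt.unmod": lambda word, context: "unmod-prediction",
    "sgt.morph": lambda word, context: "morph-prediction",
}
best_system = select_best_systems(dev_accuracy)
print(best_system)
print(predict("love", [], systems, best_system))
```

System (7) follows from the same function by leaving sgt.morph out of the dev-set scores.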

Vickrey

Table 3: WSD results on Vickrey English data (as of 9/27/06)

                  Precision  # Best
baseline          52.94%     595
sgt.unmod         55.59%     540
sgt.trunc5        49.96%     214
sgt.trunc6        51.53%     228
sgt.morph         54.66%     294
best              57.73%
Vickrey baseline  52.6%
Vickrey best      62.0%

The overall results of applying morphological analysis for WSD on the Vickrey data set were negative. Table 3 shows the overall precision of each system, and the number of base words for which each individual system proved to be the best.

There are several reasons why these results may not be indicative.

  1. The training set is unusually small, with sparse data. More than 25% of words have 60 or fewer examples, with the median at 164 examples.
  2. The source language is English, which we know to be morphologically poor.
  3. The baseline is remarkably high, greater than 50% simply from guessing the most common sense for each word.

Note that our system does not beat Vickrey's best. This could be explained in part because Vickrey uses a supervised component, a POS tagger, while we use exclusively unsupervised approaches. A direct comparison would require us to integrate Vickrey's POS tagger.

The # Best column provides useful insights. Although the unmodified WSD system is the best overall performer, the baseline is the best performer on more words. Each of the other three systems is the best performer on a non-negligible minority. This suggests there might be significant benefits from using a dev set to determine which system performs the best on a given word, and using that system on the actual test data.

Eyeballing a word-by-word breakdown of results (not included) does not shed much light. The ultra-high-frequency words tend to work best with either the baseline or unmodified WSD, which helps explain the high overall scores of those systems. Beyond this, it is not immediately clear how results are correlated. It's not even clear that the words where the morphological system performs best are those with the most morphological variation. For instance, love is one of these words, where no other morphologically similar words occur in the corpus.

Machine Translation

Training and test data are those used for the NAACL 2006 Workshop Shared Task: Exploiting Parallel Texts for Statistical Machine Translation (WMT06) (Koehn & Monz 2006). Table 4 lists the WMT06 results (test data only) for comparison, including:

Columns indicate results on the following data sets:

Table 4: WMT06 results (as of 1/08/2007)
Fr-En Es-En De-En En-Fr En-Es En-De
Dev In Out Dev In Out Dev In Out Dev In Out Dev In Out Dev In Out
WMT06 Highest 31.94 22.50 32.37 28.35 27.30 18.87 33.66 25.26 31.85 27.76 18.85 11.82
Lowest 21.44 19.42 23.91 19.17 15.86 11.78 25.07 21.44 23.17 16.83 9.84 6.55
Mean, all 29.33 20.70 29.99 25.77 23.75 16.47 31.04 23.45 29.36 24.47 16.62 10.43
Mean, pack 30.42 21.23 30.65 26.51 25.10 17.07 31.74 23.74 30.10 25.39 18.87 10.90
Baseline Hiero/base 25.77 25.67 24.25 21.01 15.69
Hiero/base (sub) 27.14 27.08 24.14
Truncation Hiero/trunc5
Hiero/trunc6
Morphology Hiero/morph
WSD Hiero/wsd
Hiero/morph/wsd

Conclusions

.


Appendices

Long-Term Agenda

out of date ???

  1. Determine the impact of unsupervised morphological induction (UMI) on crosslingual WSD. 
    1. Parameters
      1. Language alternatives
        1. Target language
          1. Use consistently preprocessed English unless specified otherwise.
        2. Source languages
          1. All EU languages.
          2. Arabic.
          3. (optional) Other languages as available.
      2. Preprocessing alternatives
        1. Heuristics
          1. None (baseline).
          2. Truncate to n characters, 4 ≤ n ≤ 7.
          3. Lowercase all.
          4. Eliminate diacritics.
          5. Strip common inflectional endings (e.g. -e and -s in French).
          6. Tokenize common inflectional endings into distinct tokens.
          7. Representative combinations of the above.
        2. Algorithms
          1. Universal (i.e. non-language-specific) morphological analyzer based on (Koehn & Knight 2003).
          2. (optional) Universal MA based on (Freitag 2005).
          3. (optional) Universal MA based on (Baldwin 2005).
          4. Language-specific MA tools, at least for Arabic, optionally other languages.
      3. Feature analysis
        1. No preprocessing (baseline).
        2. Apply preprocessing to all tokens in source text.
        3. Apply preprocessing to all content words.
        4. Apply preprocessing to head word only.
    2. Evaluation
      1. WSD framework (Kenschaft 2005).
        1. Evaluate using consistent dev/test sets derived automatically from content words in parallel corpora.
        2. Establish accuracy baselines using:
          1. Always select most common sense.
          2. WSD without morphological analysis.
        3. Establish results using representative UMI options.
        4. Correlate results with observable characteristics of data, e.g.
          1. Number of distinct types (i.e. word forms) being disambiguated.
          2. Number of distinct types in entire corpus.
          3. Number of possible senses for each type.
          4. Number of training examples for each type.
      2. English→French Word Translation task (Vickrey et al. 2005).
        1. Replicate the baseline experiment using their data, always choosing the most common sense.
        2. Train and test our system on precisely the same data using representative UMI options.
        3. Train and test our system using the same data augmented with further training samples from the Europarl corpus made possible by the specific preprocessing alternative.
        4. Compare results.
  2. Determine the impact of crosslingual WSD (lexical selection) on MT.
    1. Parameters
      1. Language alternatives
        1. Target language
          1. Consistently preprocessed English.
        2. Source languages
          1. Arabic.
          2. At least one Romance language.
          3. At least one largely agglutinative language.
          4. (optional) Other languages as available.
      2. Preprocessing alternatives – same as in I.A.2.
      3. Feature analysis – same as in I.A.3.
      4. Lexical selection
        1. No lexical selection (baseline).
        2. Apply lexical selection to all words.
        3. Apply lexical selection to all content words.
        4. Apply lexical selection to all content words for which a dev set predicts positive results.
    2. Evaluation
      1. English→French Blank-Filling task (Vickrey et al. 2005).
        1. Establish accuracy baseline using language model alone.
        2. Train and test our system using their data.
        3. Compare results.
      2. Spanish→English WSD & MT (Cabezas & Resnik 2005).
        1. Establish BLEU baseline running Pharaoh on their data.
        2. Use best WSD system to provide lexical selection recommendations to Pharaoh.
        3. Compare results.
      3. German→English MT (Koehn & Knight 2003).
        1. Establish BLEU baseline running Pharaoh on their data.
        2. Run morphological analysis on source text.
        3. Use best WSD system to provide lexical selection recommendations to Pharaoh.
        4. Compare results.
      4. Hiero MT, various language pairs.
        1. Establish baseline (BLEU? TER?) running Hiero alone.
        2. Run morphological analysis on source text.
        3. Use best WSD system to provide lexical selection recommendations to Hiero.
        4. Compare results.
  3. Determine the impact of supervised morphological analysis (SMA) on WSD, alignment and translation.
    1. Determine the impact of SMA on monolingual and crosslingual WSD.
      1. Identify languages of interest with SMA available.
      2. Evaluate effect of SMA on monolingual WSD for each language that has Senseval-3 test data available.
        1. Run baselines.
          1. Run monolingual WSD with no morphological analysis, using unmodified tokens (Kenschaft 2005).
          2. Run monolingual WSD using head words truncated to first 4 characters. (Och???)
          3. Run monolingual WSD using truncated words (head and otherwise).
        2. Run tests.
          1. Run monolingual WSD using SMA constituent morphemes in place of head words.
          2. Run monolingual WSD using SMA constituent morphemes in place of all words.
        3. Compare all scores.
      3. Evaluate effect of SMA on crosslingual WSD for representative language pairs with parallel text available.
        1. Run baselines.
          1. Run crosslingual WSD with no morphological analysis, using unmodified tokens (Kenschaft 2005).
          2. Run crosslingual WSD using truncated source language head words.
          3. Run crosslingual WSD using truncated source language words (head and otherwise).
          4. Run crosslingual WSD using truncated source language/target language words. This should yield higher scores, since the number of target language senses will be reduced.
        2. Run tests.
          1. Run crosslingual WSD using SMA constituent morphemes in place of source language head words.
          2. Run crosslingual WSD using SMA constituent morphemes in place of all source language words.
          3. Run crosslingual WSD using SMA constituent morphemes in place of all source language/target language words. This should yield scores comparable with baseline 4 above.
        3. Compare all scores.
        4. Perform cursory error analysis.
    2. Determine the impact of SMA on alignment and translation.
      1. Evaluate effect of SMA on alignment for representative language pairs.
        1. Run baselines.
          1. Run alignment with no morphological analysis, using unmodified tokens.
          2. Run alignment using truncated source language words.
          3. Run alignment using truncated source language/target language words.
        2. Run tests.
          1. Run alignment using just SMA root words in place of all source language/target language words.
          2. Run alignment using SMA constituent morphemes in place of all source language/target language words. (Effect on scores???)
        3. Compare all AER scores.
        4. Perform cursory error analysis.
      2. Evaluate effect of SMA and WSD on Pharaoh MT for representative language pairs. (Hiero is currently too slow to run many successive experiments, so we will do most of our testing with Pharaoh. If Hiero's performance speed improves sufficiently, this step could be skipped.)
        1. Run baselines.
          1. Run Pharaoh translation with no morphological analysis, using unmodified tokens.
          2. Run Pharaoh translation using truncated source language words.
          3. Run Pharaoh translation using truncated source language/target language words. (Rebuild target language language model with truncated stems??? Effect on scores???)
          4. Run Pharaoh translation using SMA constituent morphemes in place of all source language/target language words.
        2. Test SMA & MT.
          1. Run Pharaoh translation using SMA constituent morphemes in place of all source language words.
          2. Run Pharaoh translation using SMA constituent morphemes in place of all source language/target language words. (Rebuild target language language model with morphemes??? Effect on scores???)
        3. Test WSD & MT.
          1. Run Pharaoh translation with no morphological analysis, using WSD to prime lexical selection.
          2. Run Pharaoh translation using SMA constituent morphemes in place of all source language words, and using WSD to prime lexical selection.
          3. Run Pharaoh translation using SMA constituent morphemes in place of all source language/target language words, and using WSD to prime lexical selection.
        4. Compare all scores.
        5. Perform cursory error analysis.
      3. Evaluate effect of SMA & WSD on Hiero MT for a few indicative language pairs.
        1. Run baselines.
          1. Run Hiero translation with no morphological analysis, using unmodified tokens.
          2. Run Hiero translation using truncated source language words.
          3. Run Hiero translation using truncated source language/target language words.
        2. Test SMA & MT.
          1. Run Hiero translation using SMA constituent morphemes in place of all source language words.
          2. Run Hiero translation using SMA constituent morphemes in place of all source language/target language words.
        3. Test WSD & MT.
          1. Run Hiero translation with no morphological analysis, using WSD to prime lexical selection.
          2. Run Hiero translation using SMA constituent morphemes in place of all source language words, and using WSD to prime lexical selection.
          3. Run Hiero translation using SMA constituent morphemes in place of all source language/target language words, and using WSD to prime lexical selection.
        4. Compare all scores.
        5. Perform cursory error analysis.
  4. Determine the impact of unsupervised morphological induction (UMI) on WSD, alignment and translation.
    1. Determine the impact of using UMI on monolingual and crosslingual WSD.
      1. Identify representative languages of interest, including at least those languages from (I.A.1).
      2. Develop unsupervised morphological analyzer.
        1. Implement UMI, probably based on (Freitag 2005).
        2. Evaluate against output of SMA where available.
        3. If necessary, compare to other UMI proposals.
      3. Repeat (I.A.2) monolingual WSD experiments with UMI.
      4. Repeat (I.A.3) crosslingual WSD experiments with UMI.
    2. Determine the impact of UMI on alignment and translation.
      1. Repeat (I.B.1) alignment experiments with UMI.
      2. Repeat (I.B.2) Pharaoh MT experiments with UMI.
      3. Repeat (I.B.3) Hiero MT experiments with UMI.
  5. Determine the impact of alternative resources.
    1. Introduce other source language resources, and evaluate their effects, in the spirit of (Yarowsky & Florian 2002). e.g.:
      1. POS tagging
      2. semantic role labeling
      3. syntactic relationships
    2. Perform representative experiments using bilingual sense clustering (Och 1999) in place of crosslingual WSD.
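
The "always select most common sense" baseline named in the evaluation plan above can be sketched in a few lines. This is only an illustration: the function names and the toy French→English examples are hypothetical, and each "sense" is simply the English translation observed for a source word.

```python
from collections import Counter, defaultdict

def train_mfs(examples):
    """Map each source word type to its most frequent sense (translation)."""
    counts = defaultdict(Counter)
    for word, sense in examples:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, examples):
    """Fraction of examples whose sense matches the model's prediction."""
    if not examples:
        return 0.0
    correct = sum(1 for word, sense in examples if model.get(word) == sense)
    return correct / len(examples)

# Hypothetical toy data: (source head word, translation treated as its sense).
train = [("avocat", "lawyer"), ("avocat", "lawyer"), ("avocat", "avocado"),
         ("banque", "bank")]
test = [("avocat", "lawyer"), ("avocat", "avocado"), ("banque", "bank")]

mfs = train_mfs(train)
print(mfs["avocat"])                   # lawyer
print(round(accuracy(mfs, test), 3))   # 0.667
```

Any WSD configuration would need to beat this number on the same dev/test split to show a gain.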

Terminology

Morphology

Morphemes

Informally, a morpheme can be understood as "the smallest unit of meaning". A word is typically composed of one or more morphemes.

Morphemes can be loosely classified into roots and affixes. Generally speaking, a root can stand alone, while an affix must be attached to a root.

So, for instance, the English word farmers can be decomposed into three morphemes: farm +er +s, where farm is the root, and +er and +s are affixes.

There is evidence that morphemes are composed of yet smaller units of meaning. However, just as many scientists are happy to act as though atoms are the smallest units of matter, so we will maintain our happy delusion that morphemes are atomic.

Affixes are often divided into inflectional and derivational types.

Inflectional affixes are highly productive and regular, generally add only grammatical information, and rarely change the part of speech. For instance, the +s affix in English is inflectional.

Derivational affixes are less productive, may produce idiosyncratic changes in meaning, and often change the part of speech. For instance, the +er affix in English is derivational. The derived meanings of farm→farmer and ministry→minister are not predictable.

Inflectional affixes can often be discarded without relevant loss of meaning. Discarding derivational affixes almost always results in significant loss or change of meaning.

While the use of the word stem varies, in this discussion we will use it to refer to (root + derivational affixes), with inflectional affixes discarded. So, for the English word farmers, its root is farm, and its stem is farmer.
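
As a minimal sketch of the root/stem distinction just defined, the following hypothetical `Analysis` class stores a root with its derivational and inflectional suffixes. It assumes suffix-only, purely concatenative morphology, which is enough for the farmers example but not for languages with prefixes, infixes or stem changes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Analysis:
    """One word as a root plus ordered suffixes (suffix-only sketch)."""
    root: str
    derivational: List[str] = field(default_factory=list)
    inflectional: List[str] = field(default_factory=list)

    def morphemes(self) -> List[str]:
        """All morphemes in surface order."""
        return [self.root] + self.derivational + self.inflectional

    def stem(self) -> str:
        """Root plus derivational affixes; inflectional affixes are discarded."""
        return self.root + "".join(a.lstrip("+") for a in self.derivational)

farmers = Analysis(root="farm", derivational=["+er"], inflectional=["+s"])
print(farmers.morphemes())  # ['farm', '+er', '+s']
print(farmers.stem())       # farmer
```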

Language Types

Languages are often grouped into three main classes.

examples of each???

I will use these additional subclassifications in my discussion.

In reality, a language rarely fits precisely into one class. For instance, English is predominantly analytic ("have been going"), but has some fusional ("he dance+s"), suppletive ("he went"), and even agglutinative ("garbage truck windshield wiper cleaner") characteristics.

Lexicon

A lexicon for a language typically consists of all the words attested in a training corpus. Failure to account for morphological complexity results in an artificially large lexicon, hence unnecessary problems with data sparseness.

For instance, a Romance language such as French shows productive and regular inflection for both gender and number. A naïve lexicon, ignoring morphology, will contain the four words [étudiant, étudiante, étudiants, étudiantes] 'student'. Breaking the words into morphemes yields instead the single word [étudiant] and the two inflectional affixes [+e, +s]. For this single example, the savings is slight – 3 lexemes instead of 4. The reduction will be much more dramatic once we add all the words that use these same affixes, and even greater when dealing with highly inflectional languages.
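
The count for the étudiant example can be reproduced with a small sketch. The helper `split_inflection` is hypothetical and assumes the set of stems is already known, which is precisely what a real morphological analyzer would have to supply or induce.

```python
def split_inflection(word, stems, suffixes=("e", "s")):
    """Peel known inflectional suffixes off the end of a word until a known stem remains."""
    original, affixes = word, []
    while word not in stems:
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                affixes.insert(0, "+" + suf)
                word = word[: -len(suf)]
                break
        else:
            return [original]  # no analysis found: keep the word whole
    return [word] + affixes

naive_lexicon = {"étudiant", "étudiante", "étudiants", "étudiantes"}
stems = {"étudiant"}  # assumed to be known in advance

morpheme_lexicon = set()
for word in naive_lexicon:
    morpheme_lexicon.update(split_inflection(word, stems))

print(split_inflection("étudiantes", stems))       # ['étudiant', '+e', '+s']
print(len(naive_lexicon), len(morpheme_lexicon))   # 4 3
```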

Czech example??? Quantify savings???

Agglutinative languages typically allow productive compounding of nouns. For instance, German allows the compound noun [Dämpfungsscheibenanordnung] 'dampening disk assembly', constructed from the three morphemes [Dämpfung Scheibe Anordnung] (Schrader 2004). A long compound is likely to be exceedingly rare, even unique (a.k.a. hapax), even though its component morphemes are quite common.
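
A frequency-based splitter, loosely in the spirit of Koehn & Knight (2003), illustrates how such a compound could be broken down. Everything concrete here is an assumption for illustration: the corpus counts are invented, the list of linking elements is a guess, and candidates are scored by the geometric mean of their parts' frequencies.

```python
from math import prod  # Python 3.8+

FILLERS = ("", "s", "n", "es", "en")  # assumed linking elements between parts
MIN_PART = 3                           # ignore very short parts

def candidate_splits(word, counts):
    """Yield every way to cover the word with known parts, allowing a filler after each non-final part."""
    if word in counts:
        yield [word]
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head, rest = word[:i], word[i:]
        if head not in counts:
            continue
        for filler in FILLERS:
            if not rest.startswith(filler):
                continue
            tail = rest[len(filler):]
            if len(tail) < MIN_PART:
                continue
            for split in candidate_splits(tail, counts):
                yield [head] + split

def best_split(word, counts):
    """Pick the candidate whose parts have the highest geometric mean frequency."""
    def geo_mean(parts):
        return prod(counts.get(p, 0) for p in parts) ** (1.0 / len(parts))
    candidates = list(candidate_splits(word, counts)) or [[word]]
    return max(candidates, key=geo_mean)

# Hypothetical lowercased corpus counts; the full compound is a hapax.
counts = {"dämpfung": 20, "scheibe": 30, "anordnung": 50,
          "dämpfungsscheibenanordnung": 1}

print(best_split("dämpfungsscheibenanordnung", counts))
# ['dämpfung', 'scheibe', 'anordnung']
```

The unsplit word competes as a candidate too, so a frequent lexicalized compound would be left intact.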

Schrader (2006) analyzed the Europarl English-German parallel corpus, and discovered that German has almost three times as many lexemes as English (286,330 vs. 101,967), and well over three times as many hapax legomena (140,826 vs. 39,200). Roughly two-thirds of these are compound nouns aligning to English multi-word units.
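
The ratios quoted above follow directly from the reported counts, and the same statistics are easy to compute for any tokenized corpus. In this sketch the function name and the toy token list are hypothetical; only the four Europarl figures come from the text.

```python
from collections import Counter

def type_and_hapax_counts(tokens):
    """Return (number of distinct types, number of types seen exactly once)."""
    counts = Counter(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)
    return len(counts), hapax

# Toy illustration: three of the five types occur only once.
tokens = "der ausschuss und der unterausschuss und der haushaltsausschuss".split()
print(type_and_hapax_counts(tokens))  # (5, 3)

# Figures reported for Europarl German vs. English.
print(round(286330 / 101967, 2))  # 2.81 times as many lexemes
print(round(140826 / 39200, 2))   # 3.59 times as many hapax legomena
```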

Quantify improvements, particularly in number of hapax legomena ???

Word Alignment

In word alignment, source language text is compared to target language text to find statistically frequent cooccurrences of words (or multi-word units), which are then considered to be aligned. Since accuracy in alignment depends on frequency of cooccurrence, data sparseness is a major problem.

This is a gross oversimplification, of course. For more detailed exposition, see, e.g. (Och & Ney 2003).
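
To make the role of cooccurrence counts concrete, here is a deliberately simple association measure, the Dice coefficient over sentence-level cooccurrence. This is a stand-in for illustration only and is not the model used by GIZA++; the toy German-English sentence pairs are invented.

```python
from collections import Counter
from itertools import product

def dice_scores(bitext):
    """Score each (source, target) word pair by 2 * cooccurrences / (source count + target count)."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in bitext:
        src_types, tgt_types = set(src_sent.split()), set(tgt_sent.split())
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))
    return {(s, t): 2 * c / (src_count[s] + tgt_count[t])
            for (s, t), c in pair_count.items()}

def best_targets(scores):
    """For each source word, keep the target word with the highest score."""
    best = {}
    for (s, t), score in scores.items():
        if s not in best or score > scores[(s, best[s])]:
            best[s] = t
    return best

bitext = [("das haus", "the house"), ("das buch", "the book"), ("ein buch", "a book")]
scores = dice_scores(bitext)
print(scores[("buch", "book")])       # 1.0
print(scores[("das", "book")])        # 0.5
print(best_targets(scores)["buch"])   # book
```

With so few observations per word the scores are fragile, which is the data sparseness problem in miniature.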

With agglutinative languages, drastic improvements should be realized by breaking down compounds, thus eliminating many hapax legomena. With inflectional or fusional languages, we might anticipate at first that the greatest benefit would result from throwing away inflectional affixes and aligning only root words and/or derivational affixes. However, Goldwater & McClosky (2005) point out that inflectional affixes often align with function words in a more analytic language, such as of or to in English. Retaining the affixes may therefore provide value.

quantify???


confirm???

Statistical Machine Translation

Statistical MT systems look at aligned text to predict translations of source language phrases, and use these predictions to translate previously unobserved sentences.

Again, this is a gross simplification. See, e.g. (Koehn et al. 2003) for a lengthier discussion.

MT relies heavily on word alignment, and likewise suffers from data sparseness introduced by complex morphology. We predict significant gains in translation accuracy, through:

However, using morphological analysis on the target language will also produce new challenges. Inflectional morphemes cannot simply be discarded, since they carry information which must be realized in the translation. This means our translation system will be generating morphemes, which must then be combined in a postprocessor into complete words. This introduces all the problems of natural language generation, a whole separate subdiscipline.

Another challenge will be reconstructing complex target language morphology that has no counterpart in the source language. For instance, in translating student from English to French, morphological analysis should make the system more reliable at identifying the French root étudiant, but it will still need to determine the appropriate inflectional affixes. We might anticipate a larger proportion of errors where the inflectional affix is simply omitted, but this remains to be seen empirically.

Depending on the model we adopt, we might also introduce reordering errors which end up attaching an affix to the wrong root. For instance, let's say we're translating into English a sentence that means, "My dog has fleas." Presumably, the individual morphemes would translate into "My dog has flea +PL". If the +PL morpheme were somehow reordered into the wrong position, we might end up with the translation, "My dogs have (a) flea."
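
A toy postprocessor shows how a misplaced affix propagates into the output. The functions `recombine` and `realize` are hypothetical, and the only realization rule is the regular English plural, so verb agreement is not repaired.

```python
def realize(stem, affix):
    """Toy English realization: only the regular plural is handled specially."""
    if affix == "+PL":
        return stem + "s"
    return stem + affix.lstrip("+")

def recombine(morphemes):
    """Attach each affix token (marked with a leading '+') to the word before it."""
    words = []
    for m in morphemes:
        if m.startswith("+") and words:
            words[-1] = realize(words[-1], m)
        else:
            words.append(m)
    return " ".join(words)

print(recombine("My dog has flea +PL".split()))  # My dog has fleas
print(recombine("My dog +PL has flea".split()))  # My dogs has flea
```

The second call shows the reordering error: the postprocessor has no way of knowing that +PL belonged to flea.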

Yuval Marton (personal communication) and others have suggested aligning and translating feature vectors rather than atomic tokens. This would require nontrivial enhancements to existing PBSMT systems, but would avoid the reordering problems, and would allow for inclusion of any number of additional features, such as syntactic or semantic labels. This will, in fact, be the subject of the JHU 2006 Summer Workshop.
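
One way to picture the feature-vector idea is a token type that carries its root, affixes and part of speech together. This is only a sketch of a possible data structure, not the representation any existing system uses; the field names and tag values are assumptions.

```python
from typing import List, NamedTuple, Tuple

class Token(NamedTuple):
    """A word represented as a bundle of features rather than an atomic string."""
    surface: str
    root: str
    affixes: Tuple[str, ...]
    pos: str

def project(sentence: List[Token], feature: str) -> List:
    """View the sentence through a single feature, e.g. for alignment on roots only."""
    return [getattr(tok, feature) for tok in sentence]

sentence = [
    Token("My", "my", (), "PRP$"),
    Token("dog", "dog", (), "NN"),
    Token("has", "have", ("+3SG",), "VBZ"),
    Token("fleas", "flea", ("+PL",), "NNS"),
]

print(project(sentence, "root"))   # ['my', 'dog', 'have', 'flea']
print(sentence[3].affixes)         # ('+PL',)
```

Because the affix travels inside the token, reordering tokens cannot detach +PL from flea.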

Additional Resources

Rule-Based Morphological Analysis

Off-the-shelf morphological tools are available to the research community for a variety of common languages. These systems are mostly rule-based and language-specific. We would expect them to outperform anything we could induce automatically.

Parallel Text

Parallel corpora are crucial for both MT and crosslingual WSD.

Senseval Data

Monolingual WSD experiments are not essential, but since the data is available, we might as well take advantage of it.

Alignment

UMD has a couple of people working on improved word alignment. However, until they have packaged their work for common use, the best I know of is still Pharaoh (Koehn 2004) wrapping GIZA++. It might be worth experimenting with a few different heuristics, such as INTERSECTION vs. GROW_DIAG_FINAL_AND.
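
The difference between symmetrization heuristics can be sketched with plain sets of alignment links. The function below is hypothetical and computes only the two extremes, intersection and union; as I understand it, GROW_DIAG_FINAL_AND starts from the intersection and selectively adds links from the union, so its output lies between the two.

```python
def symmetrize(src2tgt, tgt2src):
    """Return (intersection, union) of two directional word alignments.

    src2tgt holds (source_index, target_index) links from one alignment run;
    tgt2src holds (target_index, source_index) links from the reverse run.
    """
    flipped = {(i, j) for (j, i) in tgt2src}
    return src2tgt & flipped, src2tgt | flipped

# Toy links for a three-word sentence pair.
src2tgt = {(0, 0), (1, 1), (2, 1)}
tgt2src = {(0, 0), (1, 1), (2, 2)}

intersection, union = symmetrize(src2tgt, tgt2src)
print(sorted(intersection))  # [(0, 0), (1, 1)]
print(sorted(union))         # [(0, 0), (1, 1), (2, 1), (2, 2)]
```

Intersection favors precision and union favors recall, which is why the choice may interact with how aggressively words are split into morphemes.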

Machine Translation

MT Metrics

Other Experiments

Senseval

Table 4 shows the results of some early experiments applying various preprocessing heuristics to Senseval WSD data.

Table 4: WSD results on Senseval data
(S/T = senses per type, E/T = examples per type, E/S = examples per sense)

Lang   Process  Precision  Types  Senses  Examples    S/T      E/T     E/S
de-en  trunc6       0.575     99    1505    127071  15.20  1283.55   84.43
de-en  trunc5       0.574     96    1961    155221  20.43  1616.89   79.15
de-en  diac         0.559    101     598     45048   5.92   446.02   75.33
de-en  none         0.558    101     587     44330   5.81   438.91   75.52
de-en  lc           0.558    101     587     44330   5.81   438.91   75.52
es-en  trunc6       0.614     99    1006    101280  10.16  1023.03  100.68
es-en  trunc5       0.608     98    1490    136166  15.20  1389.45   91.39
es-en  diac         0.604    104     618     50637   5.94   486.89   81.94
es-en  none         0.598    104     610     46811   5.87   450.11   76.74
es-en  lc           0.598    104     610     46811   5.87   450.11   76.74
es-en  strip        0.575     97     825     93046   8.51   959.24  112.78
fi-en  trunc6       0.560     60     522     37618   8.70   626.97   72.07
fi-en  trunc5       0.557     58     755     55885  13.02   963.53   74.02
fi-en  lc           0.539     63     347     25411   5.51   403.35   73.23
fi-en  none         0.539     63     347     25411   5.51   403.35   73.23
fi-en  diac         0.538     63     347     25411   5.51   403.35   73.23
fr-en  trunc6       0.598     75     930     79461  12.40  1059.48   85.44
fr-en  trunc5       0.593     74    1338    111140  18.08  1501.89   83.06
fr-en  diac         0.586     80     485     36139   6.06   451.74   74.51
fr-en  none         0.584     80     478     35857   5.98   448.21   75.01
fr-en  lc           0.583     80     478     35857   5.98   448.21   75.01
fr-en  strip        0.568     77     650     66794   8.44   867.45  102.76
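
The preprocessing heuristics compared in Table 4 are simple string transformations. The sketch below is an assumption about what each Process label does (lc = lowercase, diac = remove diacritics, truncN = keep the first N characters, strip = remove a common ending); the actual scripts behind the table may differ, and the ending list is illustrative.

```python
import unicodedata

def lowercase(token):
    return token.lower()

def strip_diacritics(token):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def truncate(token, n):
    """Keep only the first n characters."""
    return token[:n]

def strip_endings(token, endings=("es", "e", "s"), min_stem=3):
    """Remove the first matching ending, leaving at least min_stem characters."""
    for end in endings:
        if token.endswith(end) and len(token) - len(end) >= min_stem:
            return token[: -len(end)]
    return token

# Assumed mapping from the Process labels in Table 4 to transformations.
PROCESSES = {
    "none": lambda t: t,
    "lc": lowercase,
    "diac": strip_diacritics,
    "trunc5": lambda t: truncate(t, 5),
    "trunc6": lambda t: truncate(t, 6),
    "strip": strip_endings,
}

token = strip_diacritics(lowercase("Étudiantes"))
print(token)                       # etudiantes
print(PROCESSES["trunc6"](token))  # etudia
print(PROCESSES["strip"](token))   # etudiant
```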



References

See additional resources.

Factored Translation Models

J Bilmes & K Kirchhoff (2003). "Factored Language Models and Generalized Parallel Backoff". Human Language Technology Conference.

K. Kirchhoff and M. Yang, "Improved Language Modeling for Statistical Machine Translation", Proceedings of the ACL Workshop on Building and Using Parallel Texts, 2005. [pdf]

Morphology – Language-Specific

Martin Čmejrek, Jan Cuřín, Jiří Havelka, Jan Hajič, Vladislav Kuboň. 2004. Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. In 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal. [pdf, ps, download]

Nizar Habash, Owen Rambow. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. ACL-05. [pdf]

Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, Hany Hassan. 2003. Language Model Based Arabic Word Segmentation. In Proceedings of the 41st Annual Meeting of the ACL, 399-406. Sapporo, Japan. [pdf]

Morphology – Supervised

Jan Daciuk. Finite-State Lexical Tools. BIS 2004, 7th International Conference on Business Information Systems, pp. 373-380. Witold Abramowicz (ed.), Wydawnictwo Akademii Ekonomicznej w Poznaniu, Poznań, Poland, 21-23 April, 2004. [download, mirror]

Jan Daciuk, Gertjan van Noord. A Finite-State Library for NLP. CLIN 2001. University of Twente, Enschede, the Netherlands, November 2001.

Martin Kay: Nonconcatenative Finite-State Morphology. EACL 1987: 2-10. [pdf]

Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production, Publications, p. 160, University of Helsinki, Department of General Linguistics (1983).

Morphology – Unsupervised

Baldwin, Timothy (2005) Bootstrapping Deep Lexical Resources: Resources for Courses, In Proceedings of the ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition, Ann Arbor, USA, pp. 67–76. [pdf]

Dayne Freitag. Morphology Induction from Term Clusters. Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pp. 128-135. Ann Arbor, MI, June 2005. [pdf]

Morphology & MT

de Gispert, A. Phrase Linguistic Classification and Generalization for Improving Statistical Machine Translation. Proc. of the ACL Student Research Workshop (ACL'05/SRW), pp. 67-72. Ann Arbor (Michigan), June 2005. [pdf]

Sharon Goldwater, David McClosky. Improving Statistical MT through Morphological Analysis. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Vancouver, 2005. [pdf, ps]

Philipp Koehn, Kevin Knight. Empirical Methods for Compound Splitting. EACL 2003. [pdf, ps, abstract]

Young-Suk Lee. Morphological Analysis for Statistical Machine Translation. In Susan Dumais, Daniel Marcu, Salim Roukos, eds., HLT-NAACL 2004: Short Papers, 57–60, Boston, Massachusetts, May 2004. [pdf, ps]

Sonja Nießen, Hermann Ney. 2000. Improving SMT Quality with Morpho-Syntactic Analysis. In Proceedings of the 20th International Conference on Computational Linguistics, 1081-1085. Saarbrücken, Germany.

Sonja Nießen, Hermann Ney. Morpho-syntactic analysis for Reordering in Statistical Machine Translation. In Proceedings of the MT Summit VIII, pp. 247-252. Santiago de Compostela, Galicia, Spain, September 2001. [pdf (45 kb), ps]

Sonja Nießen, Hermann Ney. Statistical Machine Translation with Scarce Resources Using Morpho-Syntactic Information. In Computational Linguistics, Volume 30, Number 2, pp. 181-204, June 2004. [CogNet]

M. Popović, Hermann Ney. Improving Word Alignment Quality using Morpho-syntactic Information. COLING04. [pdf]

M. Popović and Hermann Ney. Towards the Use of Word Stems and Suffixes for Statistical Machine Translation. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), 1585-1588. Lisbon, Portugal, May 2004. [ps]

M. Popović, D. Vilar, Hermann Ney, S. Jovičić, Z. Šarić. Augmenting a Small Parallel Text with Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, 41-48. Ann Arbor, Michigan, June 2005. [pdf]

Bettina Schrader. Non-Probabilistic Alignment of Rare German and English Nominal Expressions. To appear in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), May 2006, Genoa, Italy.

Bettina Schrader. Improving Word Alignment Quality Using Linguistic Knowledge. Proceedings of the Workshop on The Amazing Utility of Parallel and Comparable Corpora (LREC 2004 satellite event). pp 46-49. May 2004, Lisbon, Portugal. [pdf]

Translation – Statistical

Chiang, David. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of ACL 2005, pages 263–270. Best paper award. [pdf]

Philipp Koehn. Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. AMTA 2004. [pdf, ps, slides]

Philipp Koehn, Franz Josef Och, Daniel Marcu. Statistical Phrase-Based Translation. In Proceedings of the Human Language Technology Conference 2003 (HLT-NAACL 2003), Edmonton, Canada, May 2003. [pdf]

Franz Josef Och and Hermann Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics 30:417–449.

Word Alignment

D. Hiemstra. 1996. Using Statistical Methods to Create a Bilingual Dictionary. Master's thesis, Universiteit Twente.

Philipp Koehn. Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Draft, Unpublished 2002. [ps, data]

Franz Josef Och, Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, volume 29, number 1, pp. 19-51. March 2003.

WSD

Edward Kenschaft. Improving Crosslingual Word Sense Disambiguation using Unlabeled Monolingual Corpora. (unpublished ms) December 2005. [html]

Rada Mihalcea and Phil Edmonds, ed. Proceedings of Senseval-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. Association for Computational Linguistics: Barcelona, Spain, July, 2004. [website, data]

Franz Josef Och. An Efficient Method for Determining Bilingual Word Classes. Ninth Conf. of the Europ. Chapter of the Association for Computational Linguistics (EACL'99), 71-76. Bergen, Norway, June 1999. [ps]

Misc

Babych & Hartley 2003.