Readings on Morphology & MT

MT Reading Group, April 19, 2006

Goldwater & McClosky (2005)

(01) Table: Morphological correspondences
Czech English
a. plural
past tense
plural
past tense
b. person
negation
genitive case
instrumental case
pronouns
not
of
by or with
c. gender none
d. other case markings syntax

Introduction

Methods

3.1 Lemmas

Morpheme Tokens (3.2 Pseudowords)

3.3 Modified Lemmas

Feature Vectors (3.4 Morphemes)

(03) Figure: Transformations
a. Original: Pro nĕkoho by její provedení mĕlo smysl .
b. Lemmas: pro nĕkdo být jeho provedení mít smysl .
c. Morpheme Tokens: pro nĕkdo být PER_3 jeho provedení mít PER_X smysl .
d. Modified Lemmas: pro nĕkdo být+PER_3 jeho provedení mít+PER_X smysl .
e. Vector: (pro) (nĕkdo) (být PER_3) (jeho) (provedení) (mít PER_X) (smysl) (.)

4.5 Combined Model

Based on the results of the above experiments, they created a combined model using tokens for person and negation morphemes, and the modified lemma method for number and tense.

Data

Roughly same number of rare lemmas in English and Czech, but roughly twice as many rare inflected wordforms in Czech.

(05) Table: Occurrences of each class of tag
Tag Class Count
a. PER 49700
b. TEN 47744
c.   past 22544
d.   present 20291
e.   future 1707
f.   other 3202
g. NUM 151646
h. CASE 151646
i. NEG 3326

(04) Figure: Number of items y with token count of x.

[Token counts]

Results

Experiments were with a word-to-word (not phrase-based) translation system.

BLEU score difference of .009 is significant (p < .05).

(06) Table: BLEU scores for baseline & lemmatization
Dev Test
a. word-to-word .311 .270
b. truncate (k=6) .353 .283
c. lemmatize all .355 .299
d.   except Pro .350
e.   except Pro, V, N .346
f.   n < 50 .370 .306
(07) Table: BLEU scores (Dev set) with different methods & classes of tags
Tokens Mod-Lem Vectors
a. PER .365 .356 .356
b. TEN .365 .361 .364
c. PER,TEN .355 .362 .355
d. NUM .354 .367 .361
e. CASE .353 .340 .337
f. NEG .357 .356 .353

(08)
Dev Test
combined model .390 .333

The best lemmatization results occurred with lemmatizing low-frequency words (06f).  Truncation (06b) performed significantly better than baseline (06a), but significantly worse than lemmatization.

Scores on other experiments (07) are reported on the Dev set only.  None scored better than lemmatization (!).

Adding person-agreement tokens (PER) helped, and examination confirmed that they did, indeed, align with English pronouns.  Inexplicably, tense-marking tokens (TEN) helped just as much, but using both PER and TEN canceled the benefits of either.

For modified lemmas, number (NUM) and tense tags helped, which makes sense since these are also marked in English.  Case tags hurt, which also makes sense, since they increase data sparseness without providing any information corresponding with English.

Feature vectors did not perform significantly better than morpheme tokens in any experiment (!).

The combined model performed best of all, successfully combining the advantages of each method.

Conclusions

  1. Morphological analysis does help, through the reduction of data sparseness.
  2. The greatest improvement comes from changing the source text to reflect the morphological characteristics of the target text.

Schrader (2006)

Introduction

German→English alignment is hindered by long German compounds aligning with English phrases, and a large proportion of hapax legomena.

This system aligns based on word length and POS patterns, without relying on frequency counts.

Nießen & Ney (2001) and Tschorn & Lüdeling (2003) split German compounds into components.  This only works if the meaning of the compound is compositional (09).

(09) Personen-stand 'marital status'
personal-status

This criticism suggests word-based translation.

Data

(10) Table: Corpus characteristics
Language Tokens Types Hapax Legomena Other Rare Frequent
English 29,077,024 101,967 39,200 (38%) 35,608 (35%) 27,159 (27%)
German 27,643,792 286,330 140,826 (49%) 98,126 (34%) 47,378 (17%)

Of 512 German hapax legomena examined:

(11) Table: German compound length vs. English phrase length
German compound length
English 1 2 3 4 >4
1 word 59 2 1 0 3
2 words 30 119 56 15 10
3 words 2 20 15 8 10
4 words 0 1 0 0 2

The English phrase length is often 1 more due to PP (12).

(12) Kongresh-vorlage
submission to Congress

Compound Alignment Strategy

What happens if there are multiple qualifying candidates?

Results

Future Work

Koehn & Knight (2003)

(13) Table: Evaluation of compound-splitting methods
Metrics BLEU
prec recall acc WBSMT PBSMT
none 0.0% 94.2% .291 .305
eager 24.8% 73.3% 87.1% .222 .344
frequency based 57.4% 86.6% 95.7% .317 .342
parallel 83.3% 89.1% 98.6% .294 .330
parallel & POS 93.8% 90.1% 99.1% .306 .326

Lee (2004)

(14) Table: BLEU-score evaluation of delete/merge
Model 1 PBSMT
# sentences base morph base morph
3.5K .10 .25 .17 .24
35K .14 .29 .24 .29
350K .18 .31 .32 .36
3.3M .18 .32 .36 .39

Habash & Rambow (2005)

Nießen & Ney (2001,2004)

(15) Table: Evaluation of methods (lower is better)
#sent Method 1-BLEU mWER SSER ISER
0 baseline 76.7 53.6 60.4 29.8
restructure 70.9 50.2 57.8 30.0
  + all 67.4 48.0 52.8 24.1
5K baseline 52.6 38.0 37.3 17.4
restructure 47.9 34.7 33.6 15.2
  + all 47.1 33.9 31.8 13.7
58K baseline 46.3 34.1 30.2 14.1

restructure 43.7 32.5 26.6 12.8

  + all 42.9 31.8 26.3 11.8

General Discussion

  1. data sparseness is a real problem, even with large corpora
  2. morphological analysis helps, up to an order of magnitude more data
  3. most experiments so far involve techniques highly customized for specific language pair
  4. most significant improvements are from changes which take the structure of the target language into consideration
  5. PBSMT overcomes overeager tokenization, may counteract (D)
  6. improvements are expected to correlate with inverse size of training corpus
  7. Future: factored translation models

References

Sharon Goldwater, David McClosky. Improving Statistical MT through Morphological Analysis. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Vancouver, 2005. [pdf, ps]

Nizar Habash, Owen Rambow. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. ACL-05. [pdf]

Philipp Koehn, Kevin Knight. Empirical Methods for Compound Splitting. EACL 2003. [pdf, ps, abstract]

Young-Suk Lee. Morphological Analysis for Statistical Machine Translation. In Susan Dumais, Daniel Marcu, Salim Roukos, eds., HLT-NAACL 2004: Short Papers, 57–60, Boston, Massachusetts, May 2004. [pdf, ps]

Sonja Nießen, Hermann Ney. Toward hierarchical models for statistical machine translation of inflected languages. In ACL-EACL-2001: 39th Annual Meeting of the Association for Computational Linguistics – joint with EACL 2001: Proceedings of the Workshop on Data-Driven Machine Translation, 47-54. Toulouse, France, July 2001. [ps]

Sonja Nießen, Hermann Ney. Statistical Machine Translation with Scarce Resources Using Morpho-Syntactic Information. In Computational Linguistics, Volume 30, Number 2, pp. 181-204, June 2004. [CogNet]

Maja Popović, David Vilar, Hermann Ney, Slobodan Jovičić, Zoran Šarić. Augmenting a Small Parallel Text with Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, 41-48. Ann Arbor, Michigan, June 2005. [pdf]

Bettina Schrader. Non-Probabilistic Alignment of Rare German and English Nominal Expressions. To appear in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), May 2006, Genoa, Italy.