Improving Crosslingual Word Sense Disambiguation using Unlabeled Monolingual Corpora

Edward Kenschaft
Seminar in Computational Linguistics, LING879, Fall 2004

This is a work in progress, containing sections which were not in my original class paper.

Abstract

Word sense disambiguation (WSD), determining the intended sense of a particular word, is a difficult problem in its own right.  Various people have begun to explore crosslingual WSD – using the words of a target language as the sense inventory for words in the source language – as a potential component of other cross-linguistic systems, such as machine translation (MT).  Comparatively little work has been done on the potential value of using other source language resources, particularly unlabeled data, to improve the accuracy of crosslingual WSD.  This paper describes the results of such an approach using the Spectral Graph Transducer (SGT) technology (Joachims, 2003), in which we demonstrate that a moderate amount of unlabeled data can improve WSD accuracy.

Background

WSD

The WSD problem is often poorly defined, in that it is rarely obvious which sense inventory is most felicitous (Resnik & Yarowsky, 2000).  The required granularity of senses is subject to dispute, and may vary depending on the end purpose (Chugur, et al., 2000).

Attempts to use WSD as a preprocessor for MT have generally not produced encouraging results (Och, 1999; Carpuat & Wu, 2005; Cabezas & Resnik, 2005).  One explanation is that WSD artificially increases data sparseness, without focusing on senses that are relevant to translation (Tim Sibley, 2005, personal communication; Yarowsky & Florian, 2002).  To address this problem, some have begun to explore crosslingual WSD, using target language words as the sense inventory for source language words (Resnik & Yarowsky, 2000).  In fact, one of the exercises in Senseval-3 was crosslingual.

Until recently, most research in WSD involved only annotated text.  Some have begun looking into semi-supervised learning, augmenting a small amount of labeled text with a large amount of unlabeled text, so-called transductive learning, with optimistic results (Yarowsky, 1995; Pham, et al., 2005).  Such an approach would be particularly useful when translating from a rich-resource language into a scarce-resource language, since the amount of parallel text is likely to be small, while the amount of unannotated text in the source language is virtually without bound.

Joachims developed SGT (Joachims, 2003) to solve precisely this sort of transductive learning task, which made it an appealing tool for semi-supervised WSD.  Apparently, Pham, et al. (2005) thought so, too, since they recently published results using the same tool on similar tasks.

Spectral Graph Transducer

Joachims introduced the SGT technology to perform semi-supervised machine learning using a small amount of labeled training data and a large amount of observed test data (Joachims, 2003).  A trivial adaptation uses small amounts of test and training data and a large amount of unlabeled data.  Joachims points out that SGT subsumes various cotraining tasks (Blum & Mitchell, 1998) as a special case.  More specifically, SGT handles cotraining applications where the feature set for each view remains static across iterations.

In brief, SGT constructs a graph with examples as vertices and edge weights indicating similarity, and finds an approximate minimal cut between positive examples and negative examples using a spectral graph method.  See (Joachims, 2003) for a more detailed exposition.

Some advantages of SGT are:

Plan

Hypothesis

For this paper, the primary goal was to confirm or refute the following hypothesis:

Using a moderate-to-large quantity of unlabeled source language data can significantly improve the accuracy of a WSD system.

A secondary goal was to identify feature sets that optimize WSD results for a given experiment, although those results are not discussed at length here.

Data Sources

For experimental data, we used various publicly available WSD evaluation sets, and also some fresh data constructed from parallel corpora.  Each data source is explained further under experiments.  The data sets included the following.

Several experiments used augmented data from the English Gigaword corpus.

Feature Analysis

With each set of source data, we explored various methods of analyzing the data into features, searching for the configuration that gave the optimum performance on each data set.  We were particularly interested in discovering whether the optimum configuration carried over to the experiments involving unlabeled data, although a detailed analysis is beyond the scope of this paper.

Preprocessing

Preprocessing options included the following.

Feature Selection

Every experiment used at least the local context and broad context options for feature selection.  Later experiments also included other feature sets.

Following (Pham, et al., 2005), we experimented with two alternatives for using multiple feature sets.

Counts

We also explored different options for counting features, following (Cabezas & Resnik, 2005).

Inverse Document Frequency (IDF) is essentially a hack that gives the system an extra hint as to which features are likely to be most significant.  Various heuristics are possible, but they all follow the basic pattern of penalizing features that occur in examples from many different classes, and rewarding features that occur in only one or a few classes.  We actually tried several heuristics, but do not report the differences here.  For the results reported below, all the experiments on a given data set used the same IDF setting.

Putting these two options together, ignoring log counts, yields the following four possibilities.

Counts
No Yes
IDF No 1|0 TF
Yes IDF|0 TF.IDF

It turned out that using a simple boolean for counts always produced the best results, but the optimal setting for IDF varied.

Implementation

Software

For the core engine, we used sgt_light, an implementation of SGT made publicly available by Thorsten Joachims.  Since sgt_light implements binary classification only, we trained a set of binary models for each word, one model per sense, and chose a sense based on the highest-confidence binary model.  We wrapped the executables in Perl modules and a driver script.

With sgt_light, the test data must be present when the model is built, so it is not possible to run experiments using training data only.

Scores with sgt_light can vary with the setting of parameter k – the valency of the nearest-neighbor graph – and the optimum setting is not predictable in advance.  All of the results reported for any given data source use the same value for k, generally the optimum setting for that data set.

With small data sets, we also compared against svm_light, an SVM implementation by Joachims (1999).  Although svm_light includes a transductive learning option, it runs too slowly on large data sets to be of practical use.

To align text, we used Pharaoh (Koehn, 2004) wrapping GIZA++ (Och, 2003).

Evaluation

We calculated the standard F-score (P+R)/2 for each experiment.  In addition, based on the hypothesis that precision will be more important than recall for alignment/translation, we calculated the weighted F-score (2P+R)/3.  The difference was only relevant when using filtering options, since P=R in other cases.

Baseline

As a baseline for comparison, we scored the accuracy of always predicting the most frequently occurring sense.

Filtering

Later experiments included filtering options, trading recall for precision.

The results with filtering are not presented in this paper.  It generally had little effect.

Results

F-score results are found in the following table, with explanations to follow.

Cells labeled 'N/A' denote logically impossible experiments, while cells with a dash '–' denote possible experiments that have not been performed.

The meanings of the column headers are explained for each experiment below.  In general, the data in each column include the data in all previous columns.

Cells that indicate negative results are highlighted in red italic.  The best-scoring experiment for each data set is highlighted in purple bold.

Table 1: F-score Results
Training data
training only train & test unlabeled augmented
interest baseline 50.7 N/A N/A N/A
svm_light 69.3 72.0 81.3
sgt_light N/A 76.0 85.3
sgt-co N/A 80.0 85.3
line baseline 60.0 N/A N/A N/A
svm_light 72.0 69.3 81.3
sgt_light N/A 68.0 80.0
sgt-co N/A 69.3 77.3
Spanish LS baseline 67.7 N/A N/A N/A
svm_light 83.2 85.1
sgt_light N/A 84.6 86.6
sgt-co N/A 85.1 85.7
English LS baseline 55.2 N/A N/A N/A
svm_light 69.4 66.9 N/A
sgt_light N/A 68.6 N/A 67.6
sgt-co N/A 69.4 N/A 67.9
Multilingual LS baseline 51.8 N/A N/A N/A
svm_light 62.0 57.1 N/A
sgt_light N/A 60.6 N/A 61.7
sgt-co N/A 61.8 N/A 59.9
Europarl fr-en baseline 56.7 N/A N/A N/A
svm_light 70.7 66.0
sgt_light N/A 65.2 68.9
sgt-co N/A 65.5 69.4
Europarl en-fr baseline 51.3 N/A N/A N/A
svm_light 64.7 61.1
sgt_light N/A 60.4 62.8 62.9
sgt-co N/A 61.2 63.5 62.8

Experiments

interest & line

The initial set of experiments used the Senseval-2 interest and line data available on Ted Pedersen's website.  These data sets are strictly monolingual.  Each was randomly divided into 75 test examples, 150 labeled training examples, and the remainder unlabeled examples, 2143 for interest and 3921 for line.

This was the reality-check stage.  Since all the data came preprocessed from the same source, there was (presumably) no noise or comparability issue.

All experiments used the following options.

Training data varied between the following options.

Four different algorithms were compared.

The train & test and unlabeled experiments demonstrate that svm_light and sgt_light produce comparable results.  Although execution times are not listed here, svm_light took orders of magnitude longer than sgt_light on the unlabeled experiments.

In general, having unlabeled data available produced considerably better results than train & test data alone, thus supporting our experimental hypothesis.  The one surprising result was that svm_light's accuracy on the line data actually degraded when the test examples were made available.  This observation was repeated in several later experiments.

My best guess is that the explanation lies in the implementation of the svm_light software.  Including unlabeled data (e.g. test data) invokes the transductive learning option, which turns off some optimization routines.  I suspect that these routines not only speed up performance, but actually enhance the accuracy.  However, this hypothesis is not supported by the results reported in (Joachims, 2003).  Note that adding the rest of the unlabeled data did, in fact, improve the performance of svm_light over training data alone, even in the line exercise.

It made little difference whether local and broad context were combined into one model, or used to generate two separate models which were then cotrained.  This observation was also repeated in later experiments.  This contrasts with the findings in (Pham, et al., 2005).

The results reported in (Pham, et al., 2005) are comparable with our results for train & test data, while our results with unlabeled data are much higher.  I can only guess that the results in (Pham, et al., 2005) do not make use of unlabeled data other than the test data.

Spanish Lexical Sample

The next set of experiments used the SpanishLS data from Senseval-3.  Again, this data set is strictly monolingual.  The data came preprocessed into 4195 test examples, 8430 labeled training examples, and 61252 unlabeled examples, covering a total of 46 lexical roots, ranging from 2 to 8 senses per root.  The options were the same as the interest & line experiments, except that diacritics were replaced with standard characters.

The starting accuracy with just training data was much higher than interest & line.  Adding unlabeled data improved the results, but not by much.  The results are consistent with our experimental hypothesis, but do not strongly support it.

Note that our system outperformed the best system entered in Senseval-3 (Marquez et al., 2004).  However, this is not a fair comparison, since we optimized directly on the test data.

English Lexical Sample

The EnglishLS experiment was similar to SpanishLS, except that diacritics were not an issue, the baseline was much lower, and unlabeled examples were not provided.  Instead, we augmented the data with examples harvested from the English Gigaword corpus.

This turned out to be a bigger challenge than expected, since the Senseval-3 data classified a variety of forms as the same word, e.g. eat, eating, eaten, eats, ate.  A naive heuristic for harvesting new examples led to extremely noisy data, which caused a severe degradation in accuracy.  An improved harvesting heuristic led to less disastrous results, but still negative.

Multilingual Lexical Sample

The MultilingualLS exercise was crosslingual, with the source examples in English, and the sense inventory taken from Hindi.  Since the Senseval-3 data did not include unlabeled examples, we augmented the data with examples harvested from the English Gigaword corpus.  This led to the same difficulties as the EnglishLS exercise.

Europarl fr-en

In an attempt to establish a more realistic MT scenario, we constructed a crosslingual WSD exercise from parallel text.  We chose the Europarl fr-en corpus (Koehn, 2002), since it is large enough to produce many words and examples, and we have at least a fighting chance of understanding both the source and target language examples.

Starting with the intersected fr-en and en-fr alignments, we generated a list of all the French words and their aligned English words.  Any alignment that occurred fewer than 6 times was lumped into UNK.  We then found all the words that had at least 500 examples, between 3 and 7 senses, no more than 5% UNK alignments, and no more than 65% of examples for any one sense.  Finally, we manually scanned through the remaining candidates to include only those where the different alignments indicated truly different senses, not just different affixation.  That left a total of 42 French words and their English alignments.  For each of these 42 words, the examples were randomly divided into 50 test examples, 50 dev examples, 150 labeled training examples, and the remainder (at least 250) unlabeled examples.

Results are reported with diacritics converted to standard characters and no IDF multipliers (sgt_light only).  Interestingly, this was one of the few cases where IDF hurt accuracy.

Unlabeled data noticeably improved accuracy over just train & test data.  However, even with all unlabeled data, the accuracy of sgt_light was less than the accuracy of svm_light with training data alone.

As discussed under interest & line, I can only guess why svm_light's performance deteriorates with more data.  With sgt_light, I suspect that the reason we do not see more dramatic improvement may have to do with morphology, as I will discuss in the next section.

Europarl en-fr

This used the same parallel text as the Europarl fr-en experiment, only this time with English as the source language and French as the target language.  The experimental data contained 33 English words and their French alignments.  Additional unlabeled examples were harvested from the English Gigaword corpus.

The unlabeled examples from the parallel text improved accuracy, although not dramatically.  The additional Gigaword examples either had no effect or had a negative impact.

The en-fr direction is more difficult than the reverse, because French's rich morphology yields a deceptively large number of "senses" for each word.  For instance, the English word relationship ends up with the following six French senses: lien, liens, rapport, rapports, relation, relations.  Obviously, these are not really six senses, but only three, each with singular and plural forms.  Many words have masculine and feminine forms, as well.  Whereas in the fr-en experiment we could manually screen out words with many alleged senses that differed only by an affix, if we did the same in the en-fr direction, we would hardly have any words left.

A traditional statistical approach would address the problem by throwing more data at it.  This might be feasible with rich-resource languages such as French and English, but will not be a viable approach when working with less studied languages.

At least two possible explanations for the lack of improvement with the augmented examples present themselves:

Teasing apart which explanation is correct will require more work.

Observations

Interestingly enough, using boolean "bag of words" generally outperformed full counts.  Much to my disappointment, using IDF usually helped, as well.

Scalability was a consideration for software engineering, but sgt_light performed admirably, even with huge data sets.

While I have not documented this, it appears that unlabeled examples are more likely to help when the base accuracy is already reasonably good.  The cases where unlabeled data significantly hurt appear to be words which started with an accuracy below 50%.  This suggests that simple filtering heuristics might have a positive effect.

Conclusions

These experiments tentatively support our experimental hypothesis that unlabeled data can improve accuracy, but with reservations.  The smallest, cleanest experiments (interest & line) showed the greatest improvement.  Augmenting with examples from the Gigaword corpus rarely helped, and even when it did, never brought the accuracy up to the level of svm_light trained on only the training examples.

In practice, only a small fraction of words proved suitable for crosslingual WSD.  Morphology proved to be a major obstacle, even when working with a morphologically simple language such as English.

Our results suggest that the technique shows promise, but more work will need to be done to overcome the technical challenges before it can be used widely with other applications.

Future Work

Two major areas for further exploration are:

  1. incorporate document-level features;
  2. overcome challenges with the use of a large unannotated corpus;
  3. incorporate morphological analysis.

Document-level Features

This is mostly an issue of finding an appropriate data source. The Europarl corpus is a promising possibility.

Large Unannotated Corpus

Some detailed error analysis is in order, to determine whether:

Morphological Analysis

Possible approaches include:

A starting point for research on morphology might be the work that has been done on English-Inuktitut alignment (e.g. Martin et al., 2005).

Some smaller, incremental steps we could take from this point include:

References

Monolingual WSD

Galen Andrew, Trond Grenager, and Christopher Manning. Verb Sense and Subcategorization: Using Joint Inference to Improve Performance on Complementary Tasks. EMNLP 2004, pp. 150-157. 2004. ps pdf

I. Chugur, J. Gonzalo, F. Verdejo. Sense distinctions in NLP applications. Proceedings of Ontolex. 2000.

L. Màrquez, M. Taulé, M.A. Martí, M. García, N. Artigas, F. Real and D. Ferrés. Senseval-3: The Spanish Lexical Sample Task. To appear in Proceedings of the Senseval-3 ACL-SIGLEX Workshop. Barcelona, Spain, 2004.

Mohammad and Pedersen. Combining Lexical and Syntactic Features for Supervised Word Sense Disambiguation. Proceedings of the Conference on Computational Natural Language Learning (CoNLL). Boston, MA, May 6-7, 2004.

Thanh Phong Pham, Hwee Tou Ng, & Wee Sun Lee. Word Sense Disambiguation with Semi-Supervised Learning. Proceedings of the 20th National Conference on Artificial Intelligence (AAAwe 2005), pp. 1093-1098. Pittsburgh, Pennsylvania, USA, 2005.

D. Yarowsky. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, NM, pp. 88-95. 1994.

D. Yarowsky. Hierarchical Decision Lists for Word Sense Disambiguation. Computers and the Humanities, 34(2):179-186. 2000.

David Yarowsky. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189-196. Cambridge, MA, 1995.

D. Yarowsky and R. Florian. Evaluating Sense Disambiguation Performance Across Diverse Parameter Spaces. To appear in Journal of Natural Language Engineering, 2002. ps

Crosslingual WSD

Clara Cabezas and Philip Resnik. Using WSD Techniques for Lexical Selection in Statistical Machine Translation. Technical Report: LAMP-TR-124/CS-TR-4736/UMIACS-TR-2005-42, University of Maryland, College Park, July 2005. (PDF)

Marine Carpuat and Dekai Wu. Word Sense Disambiguation vs. Statistical Machine Translation. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005). Ann Arbor, MI, Jun 2005.

Franz Josef Och. An Efficient Method for Determining Bilingual Word Classes. Ninth Conf. of the Europ. Chapter of the Association for Computational Linguistics (EACL'99), pp. 71-76. Bergen, Norway, June 1999.

Philip Resnik and David Yarowsky. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Natural Language Engineering, 5(2), pp. 113-133, 2000.

Alignment

Philipp Koehn. Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Draft, Unpublished 2002. ps.

Philipp Koehn. Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. AMTA 2004. pdf, ps, slides.

Martin, Mihalcea and Pedersen. Word Alignment for Languages with Scarce Resources. Proceedings of the ACL Workshop on Building and Using Parallel Texts. Ann Arbor, MI, June 29-30, 2005.

Franz Josef Och, Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, volume 29, number 1, pp. 19-51. March 2003.

Machine Learning

A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the 1998 Conference on Computational Learning Theory. July 1998.

T. Joachims. Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.). MIT-Press, 1999. [PDF][Postscript (gz)]

T. Joachims. Transductive Learning via Spectral Graph Partitioning. International Conference on Machine Learning (ICML). 2003. [pdf] [Postscript (gz)]