Recruiting Online Volunteers for Linguistic Knowledge Acquisition

Job talk, May 13, 2008, 45 minutes [Powerpoint]

Abstract

The internet is essentially unregulated, enormous, and growing exponentially.

Terrorist groups use the internet to recruit and train operatives. Intelligence agencies work to track down these groups, but the sheer vastness and mutability of the internet make this a daunting challenge.

Various computational linguistics subdisciplines can aid in this goal. One of these is opinion detection, which is used to scan web pages and identify terrorist propaganda.

Most state-of-the-art computational linguistics systems, including opinion detection, are statistically based. To do their task, they require large amounts of "gold" training data, such as texts labeled with opinions. Such data is rare, even in English, let alone in less studied languages. Typically, training data is generated over weeks or months by highly trained (and expensive) specialists.

The internet offers an alternative. For many purposes, the task of labeling data can be performed by an educated lay person with a minimum of training – the same sort of person who does crossword puzzles or sudoku. Offered as a game, the task can attract potentially thousands of volunteers who do it for free. This approach has been used with great success on a small but growing number of research projects, from classifying galaxies to transcribing notations on 19th-century plant archives.

I propose to identify an opinion-labeling task that can be performed by lay people, and to offer this task as a game to the internet community. I anticipate acquiring tens of thousands of high-quality annotations with just a few person-weeks of work. If this succeeds, the approach could be adapted to a wide variety of other related tasks.

Outline

Challenge: Internet

Source (user info): www.internetworldstats.com/
Copyright © 2008, Miniwatts Marketing Group

Source (domain info): www.domaintools.com/internet-statistics/

Internet Users

[world internet users]
[world internet growth]

Primary Languages Spoken by Internet Users

[internet languages]
[internet growth by language]

Source: www.internetworldstats.com
Copyright © 2008, Miniwatts Marketing Group

Web Content Languages

[web content by language]

Source: global-reach.biz/globstats/refs.php3
Last revised: September 30, 2004

Summary

Challenge: Use of Internet for Global Terrorism

Terrorist Websites

Purposes

Summary

Source: "A world wide web of terror". The Economist. July 12, 2007.

How can we identify terrorist websites?

Digression: Humans and Computers are Different

Examples

Crossover

Opinion Detection

Problem

(01) "America is a mistake, admittedly a gigantic mistake, but a mistake nevertheless." (Sigmund Freud)
SPEAKER DISLIKES America
(02) "The United States of America is a threat to world peace." (Nelson Mandela)
SPEAKER DISLIKES United States of America
(03) "Every Muslim, from the moment they realize the distinction in their hearts, hates Americans, hates Jews and hates Christians." (Osama bin Laden)
Every Muslim DISLIKES Americans
Every Muslim DISLIKES Jews
Every Muslim DISLIKES Christians
(04) "We are determined to undertake jihad for Allah's sake and to take the battle inside damaged America, Allah willing." (intercepted email)
We INTENDS undertake jihad
SPEAKER DISLIKES America
(05) "Mr. McGee, don't make me angry. You wouldn't like me when I'm angry." (David Banner)
Mr. McGee SHOULDN'T make me angry
Mr. McGee DISLIKES me when I'm angry
(06) "All I want for Christmas is my two front teeth."
SPEAKER WANTS my two front teeth

Sources: (01-03) en.wikiquote.org/wiki/America. (04) "A world wide web of terror".
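
The labels under examples (01) to (06) all follow a holder, attitude, target pattern, with the attitude written in capitals. The sketch below is one possible way to store such labels; the names OpinionLabel and parse_label are hypothetical and do not come from the talk, and the parsing rule (split at the first all-capitals word after the first word) is an assumption that happens to fit these six examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpinionLabel:
    """Hypothetical record for one label line such as 'SPEAKER DISLIKES America'."""
    holder: str    # who holds the attitude, e.g. "SPEAKER" or "Every Muslim"
    attitude: str  # the capitalized relation, e.g. "DISLIKES", "WANTS"
    target: str    # what the attitude is directed at


def parse_label(line: str) -> OpinionLabel:
    """Split a label line at the first all-capitals word after position 0.

    Position 0 is skipped so that the holder "SPEAKER" is not mistaken
    for the attitude. Single letters such as "I" are ignored as well.
    """
    words = line.split()
    for i, word in enumerate(words):
        if i > 0 and len(word) > 1 and word.isupper():
            return OpinionLabel(
                holder=" ".join(words[:i]),
                attitude=word,
                target=" ".join(words[i + 1:]),
            )
    raise ValueError(f"no attitude word found in: {line!r}")


if __name__ == "__main__":
    for text in ("SPEAKER DISLIKES America",
                 "Mr. McGee SHOULDN'T make me angry"):
        print(parse_label(text))
```

A fixed record like this would make it straightforward to compare the labels entered by different volunteers for the same sentence.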

Resources

TREC 2006 Blog (Opinion Retrieval) Track

Problem
Examples
Opinionated
Skype 2.0 eats its young
The elaborate press release and WSJ review while impressive don’t help mask the fact that, Skype is short on new ground breaking ideas. Personalization via avatars and ring-tones... big new idea? Not really. Phil Wolff over on Skype Journal puts it nicely when he writes, “If you’ve been using Skype, the Beta version of Skype 2.0 for Windows won’t give you a new Wow! experience.” ...
Non-opinionated
Skype Launches Skype 2.0 Features Skype Video
Skype released the beta version of Skype 2.0, the newest version of its software that allows anyone with an Internet connection to make free Internet calls. The software is designed for greater ease of use, integrated video calling, and ...
Results
Best and median MAP (mean average precision) results of 57 submissions
          Topic relevance   Opinion finding
Best      42.19%            30.04%
Median    16.99%            10.59%

Source: Ounis et al. 2006
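
For readers unfamiliar with the MAP figures above, the following is a minimal sketch of how mean average precision is commonly computed from ranked result lists. It is not the TREC evaluation code; the function names and the toy document IDs are made up for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision for one topic.

    ranked   -- list of document IDs in the order the system returned them
    relevant -- set of document IDs judged relevant for the topic
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    # Dividing by all relevant documents penalizes ones never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(topics):
    """Mean of average precision over (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


if __name__ == "__main__":
    toy_topics = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # hypothetical topic 1
        (["d5", "d6"], {"d6"}),                    # hypothetical topic 2
    ]
    print(f"MAP = {mean_average_precision(toy_topics):.4f}")
```

Because a system is rewarded only for ranking relevant documents near the top, even the best scores in the table leave considerable room for improvement.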

Where can we get training data?

Volunteer projects

Examples

Non-computational

[galaxy image]

Galaxy Zoo

Problem
Volunteers
Results

www.galaxyzoo.org/

Stardust@home

Problem
Volunteers
Results

stardustathome.ssl.berkeley.edu/

Herbaria@home

[transcriptions by volunteer]
Problem
Volunteers
Results

herbariaunited.org/atHome/

Open Mind Word Expert

Problem
(07)
He boarded the plane from gate 53.
The ball is not in play until it crosses the plane.
Volunteers
Results

Source: Mihalcea and Chklovski, "Open Mind Word Expert"

Summary

What makes for a successful project?

Games

"In every job that must be done, there is an element of fun. You find the fun, and – snap! – the job's a game." (Mary Poppins)

Source: Luis von Ahn, "Human Computation"

Examples

ESP Game

[ESP Game]
Problem
Game
Results

Source: www.espgame.org/
Copyright © 2005 Carnegie Mellon University, all rights reserved. Patent Pending.

Peekaboom

Problem
Game
[Peekaboom]

Source: www.peekaboom.org/
Copyright © 2005 Carnegie Mellon University, all rights reserved. Patent Pending.

Results

Verbosity (proposed)

Problem
Game

Toolkits

Amazon Mechanical Turk

Examples

www.mturk.com/

Bossa

boinc.berkeley.edu/trac/wiki/BossaIntro/

Facebook

www.facebook.com/developers/

Linguist@home (a.k.a. That's Your Opinion)

Problem

(04) "We are determined to undertake jihad for Allah's sake and to take the battle inside damaged America, Allah willing."

1-player game

[screenshot]

How can we assure that answers are valid?

2-player game
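
One way a 2-player game can validate answers is output agreement, the mechanism used by the ESP Game: two players who cannot communicate label the same item, and a label is accepted only when both enter it. The sketch below assumes that the 2-player version works along these lines, which the outline does not confirm; the function name agreed_labels and the normalization rule are hypothetical.

```python
def normalize(label: str) -> str:
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def agreed_labels(player_a, player_b):
    """Return the set of labels that both players entered for the same item.

    Labels are compared after normalization, so differences in case or
    spacing do not block a match.
    """
    labels_a = {normalize(label) for label in player_a}
    labels_b = {normalize(label) for label in player_b}
    return labels_a & labels_b


if __name__ == "__main__":
    # Both players see sentence (04) and type labels independently.
    a = ["SPEAKER DISLIKES America", "We INTENDS undertake jihad"]
    b = ["speaker dislikes  america", "SPEAKER LIKES jihad"]
    print(agreed_labels(a, b))
```

Since neither player knows what the other will type, the easiest way to agree is to enter a label that actually fits the sentence, which is what makes agreement a plausible signal of validity.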

Future Work

References

------. Amazon Mechanical Turk, website. [link]

------. Bossa, website. [link]

------. The ESP Game, website. [link]

------. Galaxy Zoo, website. [link]

------. Herbaria@home. Botanical Society of the British Isles, website. [link]

------. Internet World Stats. Miniwatts Marketing Group, website. [link]

------. Peekaboom, website. [link]

------. Playing or processing. The Economist. Dec 6, 2007. [link]

------. Spreading the load. The Economist. Dec 6, 2007. [link]

------. Stardust@home, website. [link]

------. A world wide web of terror. The Economist. July 12, 2007. [link]

Amir Alexander. Aerogel: The "Frozen Smoke" that Made Stardust Possible. The Planetary Society. November 8, 2006. [link]

Nathaniel Ayewah, Rada Mihalcea, and Vivi Nastase. Building Multilingual Semantic Networks with Non-Expert Contributions over the Web. Proceedings of the KCAP 2003 Workshop on Distributed and Collaborative Knowledge Capture. Sanibel Island, Florida, November 2003. [pdf] [ps] [demo]

Timothy Chklovski. 2005. Designing interfaces for guided collection of knowledge about everyday objects from volunteers. In Proceedings of the 10th International Conference on Intelligent User Interfaces (San Diego, California, USA, January 10 - 13, 2005). IUI '05. ACM, New York, NY, 311-313.

Timothy Chklovski. Using Analogy to Acquire Commonsense Knowledge from Human Contributors. MIT Artificial Intelligence Laboratory technical report AITR-2003-002, February 2003. [pdf]

Timothy Chklovski and Rada Mihalcea. Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003). Borovetz, Bulgaria, September 2003. [pdf] [ps] [demo] [data]

Kate Land, Anze Slosar, Chris Lintott, Dan Andreescu, Steven Bamford, Phil Murray, Robert Nichol, M. Jordan Raddick, Kevin Schawinski, Alex Szalay, Daniel Thomas, Jan van den Berg. Galaxy Zoo: The large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey. Submitted March 22, 2008. [link]

Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C. Nichol, Alex Szalay, Dan Andreescu, Phil Murray, Jan van den Berg. Galaxy Zoo: Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Submitted to MNRAS, April 29, 2008. [link]

Rada Mihalcea and Timothy Chklovski. Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users' Help. Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). Budapest, April 2003. [pdf] [ps] [data]

Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, Ian Soboroff. Overview of the TREC-2006 Blog Track. TREC 2006.

Luis von Ahn. Games With a Purpose. IEEE Computer Magazine, vol. 39, no. 6, pp. 92-94, June 2006. [pdf]

Luis von Ahn. Human Computation. Google Tech Talks. July 26, 2006. [video]

Luis von Ahn, Ruoran Liu and Manuel Blum. Peekaboom: A Game for Locating Objects in Images. ACM CHI 2006. [pdf]

Luis von Ahn, S. Ginosar, M. Kedia, R. Liu and M. Blum. Improving Accessibility of the Web with a Computer Game. ACM CHI 2006. [pdf]

Luis von Ahn, Mihir Kedia and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. ACM CHI 2006. [pdf]

A. J. Westphal, C. C. Allen, R. Bastien, J. Borg, F. Brenker, J. C. Bridges, D. E. Brownlee, A. L. Butterworth, C. Floss, G. J. Flynn, D. Frank, Z. Gainsforth, E. Gruen, P. Hoppe, A. T. Kearsley, H. Leroux, L. R. Nittler, S. A. Sandford, A. Simionovici, F. J. Stadermann, R. M. Stroud, P. Tsou, T. Tyliszczak, J. Warren, M. E. Zolensky. Preliminary Examination of the Interstellar Collector of Stardust. 39th Lunar and Planetary Science Conference (2008), Abstract #1855. [pdf]

Nicholos Wethington. Galaxy Zoo Gets a Makeover. Universe Today. April 23, 2008. [link]

Nicholos Wethington. Galaxy Zoo Results Show that the Universe Isn't 'Lopsided'. Universe Today. March 28, 2008. [link]

Hui Yang, Luo Si, Jamie Callan. Knowledge Transfer and Opinion Detection in the TREC2006 Blog Track. TREC 2006.