Recruiting Online Volunteers for Linguistic Knowledge Acquisition
Job talk, May 13, 2008, 45 minutes [Powerpoint]
Abstract
The internet is essentially unregulated, immense, and growing
exponentially. Terrorist groups use it to recruit and train
operatives. Intelligence agencies work to track down these groups, but
the sheer vastness and mutability of the internet make this a daunting
challenge.
Various computational linguistics subdisciplines can aid in this goal.
One of these is opinion detection, which can be used to scan web pages
and identify terrorist propaganda.
Most state-of-the-art computational linguistics systems, including
opinion detection systems, are statistically based. To do their task,
they require large amounts of "gold" training data, such as texts
labeled with opinions. Such data is rare even in English, let alone in
less-studied languages. Typically, training data is generated over
weeks or months by highly trained (and expensive) specialists.
The internet offers an alternative. For many purposes, the task of
labeling data can be performed by an educated layperson with a minimum
of training, the same sort of person who does crossword puzzles or
sudoku. By offering the task as a game, potentially thousands of
volunteers can be recruited to do it for free. This approach has been
used with great success on a small but growing number of research
projects, from classifying galaxies to transcribing notations on
19th-century plant archives.
I propose to identify an opinion-labeling task that can be performed by
laypeople, and to offer this task as a game to the internet community.
I anticipate acquiring tens of thousands of high-quality annotations
with just a few person-weeks of work. If this succeeds, the approach
could be adapted to a wide variety of related tasks.
Outline
- The internet is essentially unregulated, immense, and growing
exponentially.
- Terrorist groups use the internet for recruiting and
training.
- Computational linguistics subdisciplines such as opinion
detection can be used to identify terrorist websites.
- Most such systems require training data which is not
readily available.
- Other research projects have had success recruiting
internet volunteers for comparably difficult tasks.
- I propose to do the same with opinion labeling, and then
extend to other related areas.
Challenge: Internet
- 1.36 billion users (Q1 2008)
- across entire populated world and diverse language groups
- 20.7% annual growth (December 2006 to December 2007)
- 103,160,364 active domains (May 03, 2008)
- 332,840,730 deleted domains
- 648,853 new domains in past 24 hours (May 03, 2008)
Source (user info): www.internetworldstats.com/
Copyright © 2008, Miniwatts Marketing Group
Source (domain info): www.domaintools.com/internet-statistics/
Internet Users
- most users in Asia, Europe, and North America
- fastest growth in Middle East, Africa, and Latin America
Primary Languages Spoken by Internet Users
- largely European, East Asian, and Arabic
- fastest growth in Arabic
Source: www.internetworldstats.com
Copyright © 2008, Miniwatts Marketing Group
Web Content Languages
- outdated and suspect data
- vast majority English – for now
Source: global-reach.biz/globstats/refs.php3
Last revised: September 30, 2004
Summary
- internet is vast, with tremendously fast growth
- most content and users are from developed nations and
well-studied languages
- fastest growth is in developing nations and less-studied
languages
- analysts and technology need to keep up
Challenge: Use of Internet for Global Terrorism
Terrorist Websites
- several thousand, increasing exponentially
Purposes
- propaganda – worldwide, anonymous
- "The Global Islamic Call to Resistance", 1600 pages, call
for self-starting terrorist cells
- "Questions and Uncertainties Concerning the Mujahideen
and their Operations", doctrinal justifications
- news bulletins
- videos of American soldiers being blown up
- video statements on recent events
- video game, "Night of Bush Capturing"
- training manuals, e.g. assassination, manufacturing
poisons/explosives
- "Encyclopedia of Preparation", huge & growing
online manual
- coordinate attacks between individuals or groups
- internet jihadist Irhabi007 helped plan attacks by two men from
Atlanta, GA, on Washington, DC, targets
- "... networks within networks, connections within connections and
links between individuals that cross local, national and
international boundaries." (Peter Clarke, head of the
counter-terrorism branch of London's Metropolitan Police)
- raise funds through identity theft
Summary
- "The radicalisation process is occurring more quickly, more
widely and more anonymously in the internet age, raising the likelihood
of surprise attacks by unknown groups whose members and supporters may
be difficult to pinpoint." (National Intelligence Estimate, USA, 2006)
- "We have to find a way to stanch the flow. The internet creates a
constant reservoir of radicalised people which terrorist groups and
networks can draw upon." (Professor Bruce Hoffman, terrorism expert,
Georgetown University)
Source: "A world wide web of terror". The Economist. July 12, 2007.
How can we identify terrorist websites?
Digression: Humans and Computers are Different
- computers can do many things that humans can't do (well)
- humans can do many things that computers can't do (well)
Examples
- computers only
- find new prime numbers
- scan the entire web for "Osama bin Laden"
- humans only
- recognize emotions from facial expressions
- captcha
![[captcha]](linguistathome_files/Modern-captcha.jpg)
- both
Crossover
- humans can impersonate computers
- computers can impersonate humans
- Eliza – requires clever rules, limited domain
- machine learning – requires lots of data
Opinion Detection
Problem
- identify opinions and attitudes in texts (more generally,
modalities)
- humans are very good at it, computers are not
(01) "America is a mistake, admittedly a gigantic mistake, but a
mistake nevertheless." (Sigmund Freud)
- SPEAKER DISLIKES America
(02) "The United States of America is a threat to world peace."
(Nelson Mandela)
- SPEAKER DISLIKES United States of America
(03) "Every Muslim, from the moment they realize the distinction in
their hearts, hates Americans, hates Jews and hates Christians."
(Osama bin Laden)
- Every Muslim DISLIKES Americans
- Every Muslim DISLIKES Jews
- Every Muslim DISLIKES Christians
(04) "We are determined to undertake jihad for Allah's sake and to take
the battle inside damaged America, Allah willing." (intercepted email)
- We INTENDS undertake jihad
- SPEAKER DISLIKES America
(05) "Mr. McGee, don't make me angry. You wouldn't like me when I'm
angry." (David Banner)
- Mr. McGee SHOULDN'T make me angry
- Mr. McGee DISLIKES me when I'm angry
(06) "All I want for Christmas is my two front teeth."
- SPEAKER WANTS my two front teeth
Sources: (01-03) en.wikiquote.org/wiki/America.
(04) "A world wide web of terror".
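One way to store annotations like these is as holder/relation/target triples attached to a sentence. The sketch below is only an illustration: the class name, the field names, and the small relation set are assumptions drawn from examples (01) to (06), not part of any existing system.

```python
from dataclasses import dataclass

# Relation labels seen in the examples above. A real template inventory
# would presumably be larger (assumption).
RELATIONS = {"DISLIKES", "INTENDS", "WANTS", "SHOULDN'T"}


@dataclass(frozen=True)
class OpinionAnnotation:
    """One filled-in template of the form HOLDER RELATION TARGET."""
    holder: str
    relation: str
    target: str

    def __post_init__(self):
        # Reject labels outside the agreed template inventory.
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")

    def __str__(self):
        return f"{self.holder} {self.relation} {self.target}"


# Example (05): one sentence can carry several annotations.
annotations = [
    OpinionAnnotation("Mr. McGee", "SHOULDN'T", "make me angry"),
    OpinionAnnotation("Mr. McGee", "DISLIKES", "me when I'm angry"),
]
for annotation in annotations:
    print(annotation)
```

Keeping the relation set closed is a design choice: it makes answers from different annotators directly comparable, at the cost of expressiveness.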
Resources
- humans can do this task well, but not fast enough
- computers are moderately successful in limited domains
- accuracy of computers depends on availability of training
data
TREC 2006 Blog (Opinion Retrieval) Track
Problem
- given a blog entry and a topic, identify whether:
- the entry is relevant to that topic
- the entry expresses an opinion on the topic
- the opinion is positive, negative, or mixed
- no training data provided
- CMU used ~10,000 training examples from movie and product
reviews (Yang et al 2006)
Examples
Opinionated
| Skype 2.0 eats its young |
| The elaborate press release and WSJ review while impressive don’t
help mask the fact that, Skype is short on new ground breaking ideas.
Personalization via avatars and ring-tones... big new idea? Not really.
Phil Wolff over on Skype Journal puts it nicely when he writes,
“If you’ve been using Skype, the Beta version of Skype 2.0 for Windows
won’t give you a new Wow! experience.” ... |
Non-opinionated
| Skype Launches Skype 2.0 Features Skype Video |
| Skype released the beta version of Skype 2.0, the newest version of
its software that allows anyone with an Internet connection to make
free Internet calls. The software is designed for greater ease of use,
integrated video calling, and ... |
Results
Best and median MAP (mean average precision) results
of 57 submissions
| | Topic relevance | Opinion finding |
| Best | 42.19% | 30.04% |
| Median | 16.99% | 10.59% |
Source: Ounis et al 2006
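For readers unfamiliar with the metric, mean average precision can be sketched as follows. This is the usual textbook definition applied to made-up toy data; the function names and document ids are illustrative, and the track's official evaluation may differ in details such as ranking cutoffs.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: two topics with hypothetical document ids.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d7"}),              # AP = (1/3) / 1
]
map_score = mean_average_precision(runs)
print(f"MAP = {map_score:.4f}")
```

Because precision is averaged over every relevant document, a system is penalized both for ranking relevant entries low and for missing them altogether.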
Where can we get training data?
Volunteer projects
- Enlist online volunteers
- Provide minimal training
- Optionally, frame as a competitive game
- "The easiest part is getting the public involved. Most
volunteer-computing projects can draw on tens of thousands of people
with practically no advertising, relying on word of mouth. The problem
is usually keeping these eager amateurs busy." ("Spreading the load", The Economist)
Examples
Non-computational
- amateur bird-watchers track bird migrations
- amateur astronomers spot new comets
Galaxy Zoo
Problem
- roughly a million galaxies from Sloan Digital Sky Survey
- classify
- elliptical
- clockwise spiral
- anticlockwise spiral
- unclear
- identify interactions between galaxies, real or illusory
Volunteers
- 100,000+ volunteers within a few months
- 30 volunteers classify each galaxy
- peak load 70,000 per hour
- final datasets
- 34,617,406 analyses
- 82,931 users
- filter unreliable volunteers using known test cases
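The combination of redundant classification and known test cases could be implemented along these lines. This is a minimal sketch, not Galaxy Zoo's actual pipeline: the function names, the 0.8 accuracy threshold, and the toy votes are all assumptions.

```python
from collections import Counter, defaultdict


def reliable_volunteers(votes, gold, min_accuracy=0.8):
    """Return volunteers whose accuracy on known test cases meets the threshold.

    votes: list of (volunteer, item, label); gold: {item: correct label}.
    Volunteers who never saw a test case are excluded, since nothing is
    known about their reliability.
    """
    correct = Counter()
    seen = Counter()
    for volunteer, item, label in votes:
        if item in gold:
            seen[volunteer] += 1
            correct[volunteer] += (label == gold[item])
    return {v for v in seen if correct[v] / seen[v] >= min_accuracy}


def majority_labels(votes, trusted, gold=()):
    """Majority label per non-test item, counting only trusted volunteers."""
    tallies = defaultdict(Counter)
    for volunteer, item, label in votes:
        if volunteer in trusted and item not in gold:
            tallies[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in tallies.items()}


# Toy data: "test1" is a known test case, "g1" and "g2" are real galaxies.
votes = [
    ("ann", "test1", "elliptical"), ("ann", "g1", "clockwise spiral"),
    ("bob", "test1", "elliptical"), ("bob", "g1", "clockwise spiral"),
    ("eve", "test1", "unclear"),    ("eve", "g1", "elliptical"),
    ("ann", "g2", "elliptical"),    ("eve", "g2", "unclear"),
]
gold = {"test1": "elliptical"}
trusted = reliable_volunteers(votes, gold)
labels = majority_labels(votes, trusted, gold)
print(sorted(trusted), labels)
```

With 30 classifications per galaxy, a simple majority is usually decisive; ties would need an explicit rule, which this sketch does not define.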
Results
- unexpected source of error: users are biased toward
anticlockwise spirals
- 2 papers submitted for publication
- currently over 20 projects underway using resulting data
- future work
- phase two: more detailed questions
- phase three: more image sources
www.galaxyzoo.org/
Stardust@home
Problem
- aerogel sent seven years and 3 billion km through space
- identify tracks of microparticles in gel
Volunteers
- 24,000 participants
- 40 million searches in under a year
Results
- 50 candidate dust particles, each identified by hundreds of
participants
- featured in seven conference papers
stardustathome.ssl.berkeley.edu/
Herbaria@home
Problem
- thousands of 19th-century plant specimens with handwritten
notes
- read notes and enter information into database
Volunteers
- 162 volunteers, Zipfian distribution
- 68 volunteers transcribed 10 or more entries
- 24 volunteers transcribed 100 or more entries
- 7 volunteers transcribed 1000 or more entries
Results
- 22702 specimens documented (May 5, 2008)
- no redundancy
herbariaunited.org/atHome/
Open Mind Word Expert
Problem
- word sense disambiguation
(07)
| He boarded the plane from gate 53. |
| The ball is not in play until it crosses the plane. |
- systems need training data
Volunteers
- 90,000 sense taggings over four months
- 240 words, 87 examples each on average
Results
- inter-annotator agreement: 66.56%
- 66.23% precision, vs. 63.32% baseline
- best precision for words with most training examples
Source: Mihalcea and Chklovski, "Open Mind Word Expert"
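A simple way to measure agreement among volunteers is the fraction of annotator pairs that chose the same sense. The sketch below uses that pairwise measure on toy data; the sense labels "aircraft" and "surface" are made up for illustration, and the measure used in the paper may be defined differently.

```python
from itertools import combinations


def pairwise_agreement(taggings):
    """Fraction of annotator pairs, pooled over items, that chose the same sense.

    taggings: {item: [sense chosen by each annotator]}.
    """
    agree = 0
    total = 0
    for senses in taggings.values():
        for a, b in combinations(senses, 2):
            total += 1
            agree += (a == b)
    return agree / total if total else 0.0


# Toy taggings of "plane" in the two sentences of example (07).
taggings = {
    "He boarded the plane from gate 53.":
        ["aircraft", "aircraft", "aircraft"],
    "The ball is not in play until it crosses the plane.":
        ["surface", "aircraft"],
}
agreement = pairwise_agreement(taggings)
print(f"agreement = {agreement:.2%}")
```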
Summary
- projects get anywhere between 100+ and 100,000+
volunteers
- Zipfian distribution of contributions by volunteers
What makes for a successful project?
Games
"In every job that must be done,
there is an element of fun. You find
the fun, and – snap! – the job's a game." (Mary Poppins)
- 9 billion human-hours of solitaire were played in 2003
- 7 million human-hours to build the Empire State Building,
equivalent to 6.8 hours of the world's solitaire playing in 2003
- 20 million human-hours to build the Panama Canal, equivalent to
less than one day of the world's solitaire playing in 2003
Source: Luis von Ahn, "Human Computation"
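The comparison follows from spreading the 9 billion solitaire hours evenly over the year. A quick check of the arithmetic, using only the figures quoted above (the variable and function names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 2003 was not a leap year

solitaire_hours_2003 = 9e9
# Human-hours of solitaire played worldwide per clock hour of 2003.
solitaire_rate = solitaire_hours_2003 / HOURS_PER_YEAR


def clock_hours_of_solitaire(project_human_hours):
    """Clock time in which solitaire players logged this many human-hours."""
    return project_human_hours / solitaire_rate


empire_state = clock_hours_of_solitaire(7e6)   # about 6.8 hours
panama_canal = clock_hours_of_solitaire(20e6)  # about 19.5 hours, under a day
print(f"{empire_state:.1f} h, {panama_canal:.1f} h")
```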
Examples
ESP Game
Problem
- label images with words/captions
- purposes
- index images for search
- provide captions for visually impaired
Game
- two people, strangers
- type whatever you think the other player is typing
- get points whenever you agree
- timed
- only store solutions when n pairs are recorded
- taboo words from previous solutions
- random test images to catch cheaters
- symmetric verification game
- both players get same input and give same output
- each player verifies the other
- suitable when number of valid answers is small
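The agreement, taboo-word, and n-pairs rules above could be sketched as follows. This is a simplified illustration, not the ESP Game's implementation: it ignores timing, and the function name, class name, and image id are hypothetical.

```python
from collections import Counter


def first_agreement(guesses_a, guesses_b, taboo=()):
    """Return the first word in A's list that B also typed, skipping taboo words.

    Simplification: guesses are plain ordered lists with no timestamps.
    """
    blocked = {w.lower() for w in taboo}
    typed_by_b = {w.lower() for w in guesses_b} - blocked
    for word in guesses_a:
        if word.lower() in typed_by_b:
            return word.lower()
    return None


class LabelStore:
    """Keep a label for an image only after n_pairs independent pairs agree."""

    def __init__(self, n_pairs=2):
        self.n_pairs = n_pairs
        self.counts = Counter()  # (image, word) -> number of agreeing pairs

    def record(self, image, word):
        self.counts[(image, word)] += 1

    def accepted(self, image):
        return sorted(w for (img, w), c in self.counts.items()
                      if img == image and c >= self.n_pairs)


store = LabelStore(n_pairs=2)
rounds = [
    (["dog", "grass", "ball"], ["puppy", "ball", "dog"]),
    (["animal", "dog"], ["dog", "pet"]),
    (["grass"], ["lawn", "grass"]),
]
for guesses_a, guesses_b in rounds:
    word = first_agreement(guesses_a, guesses_b)
    if word is not None:
        store.record("img1", word)

labels = store.accepted("img1")
# Accepted labels become taboo, pushing later pairs toward new words.
next_word = first_agreement(["dog", "grass"], ["grass", "dog"], taboo=labels)
print(labels, next_word)
```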
Results
- 75,000 players
- many people play over 20 hours per week
- 15 million agreements
- highly accurate
- highly complete
- large part of appeal is relation with anonymous partner
Source: www.espgame.org/
Copyright: © 2005 Carnegie Mellon University, all rights reserved. Patent Pending.
Peekaboom
Problem
- images with object labels, e.g. output of ESP Game
- need to locate objects in images
- used, e.g., for training computer vision
Game
- player A sees image
- player B has to guess object in image
- player A clicks on image, revealing small area to player B
- asymmetric verification game
- player A gets input, which player B has to guess
- player B verifies player A's analysis
- suitable when number of valid answers is large
Source: www.peekaboom.org/
Copyright © 2005 Carnegie Mellon University, all rights reserved. Patent Pending.
Results
- 27,000 players in first four months
- 2,100,000 object locations
- many people averaged over 12 hours per day for first 10 days
Verbosity (proposed)
Problem
- input common sense facts, e.g. "cereal is eaten with milk"
Game
- player A sees word
- player B has to guess word
- player A gets to fill in various templates
- e.g. "object is typically near ____"
- asymmetric verification game
Toolkits
Amazon Mechanical Turk
- paid service
- requester posts task online, along with instructions and pay rate
- worker views available tasks and selects those of interest
Examples
- examine an image and click on specified objects, $0.05 per object
- evaluate relevance of search results, $0.02 per evaluation
www.mturk.com/
Bossa
- open source, Linux
- developer provides task-specific PHP scripts
- system rates volunteer skill, evaluates agreement among volunteers
- pointer to Bolt, open source tutorial builder
boinc.berkeley.edu/trac/wiki/BossaIntro/
Facebook
- install customized apps
- take advantage of social networks
www.facebook.com/developers/
Linguist@home (a.k.a. That's Your Opinion)
Problem
- annotate sentences with opinions
- make it fun
(04) "We are determined to undertake jihad
for Allah's sake and to take the battle inside damaged America, Allah
willing."
- We INTENDS undertake jihad
- SPEAKER DISLIKES America
1-player game
- display sentence
- allow user to pick template and fill in participants
![[screenshot]](linguistathome_files/linguistathome-screenshot.png)
- display list of templates
- determined by expert consultant
- highlight eligible participants
- allow multiple answers
- 10 points for first, 20 for second, 30 for third, etc.
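The escalating point scheme can be written as a simple linear rule. This sketch assumes that "etc." continues in steps of 10 and that points for successive answers on the same sentence add up; neither is stated explicitly above, and the function names are illustrative.

```python
def answer_points(k, step=10):
    """Points for the k-th accepted answer on one sentence (k starts at 1)."""
    return step * k


def sentence_score(n_answers, step=10):
    """Total points for giving n_answers accepted answers on one sentence."""
    return sum(answer_points(k, step) for k in range(1, n_answers + 1))


print([answer_points(k) for k in (1, 2, 3)])  # points per answer
print(sentence_score(3))                      # cumulative total
```

Rewarding later answers more heavily encourages players to look past the most obvious opinion in a sentence.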
How can we ensure that answers are valid?
2-player game
- symmetric verification, suitable because the number of valid answers
is small
Future Work
- extend to other linguistic subdisciplines, e.g. topic
classification
- extend to other widely used & studied languages,
e.g. German, Chinese
- extend to fastest growing languages, e.g. Arabic
References
------. Amazon Mechanical Turk, website. [link]
------. Bossa, website. [link]
------. The ESP Game, website. [link]
------. Galaxy Zoo, website. [link]
------. Herbaria@home. Botanical Society of the British Isles, website. [link]
------. Internet World Stats. Miniwatts Marketing Group, website. [link]
------. Peekaboom, website. [link]
------. Playing or processing. The Economist. Dec 6, 2007. [link]
------. Spreading the load. The Economist. Dec 6, 2007. [link]
------. Stardust@home, website. [link]
------. A world wide web of terror. The Economist. July 12, 2007. [link]
Amir Alexander. Aerogel: The "Frozen Smoke" that Made Stardust Possible. The Planetary Society. November 8, 2006. [link]
Nathaniel Ayewah, Rada Mihalcea, and Vivi Nastase. Building Multilingual Semantic Networks with Non-Expert Contributions over the Web. Proceedings of the KCAP 2003 Workshop on Distributed and Collaborative Knowledge Capture. Sanibel Island, Florida, November 2003. [pdf] [ps] [demo]
Timothy Chklovski. Designing interfaces for guided collection of knowledge about everyday objects from volunteers. In Proceedings of the 10th International Conference on Intelligent User Interfaces (San Diego, California, USA, January 10-13, 2005). IUI '05. ACM, New York, NY, 311-313.
Timothy Chklovski. Using Analogy to Acquire Commonsense Knowledge from Human Contributors. MIT Artificial Intelligence Laboratory technical report AITR-2003-002, February 2003. [pdf]
Timothy Chklovski and Rada Mihalcea. Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003). Borovetz, Bulgaria, September 2003. [pdf] [ps] [demo] [data]
Kate Land, Anze Slosar, Chris Lintott, Dan Andreescu, Steven Bamford, Phil Murray, Robert Nichol, M. Jordan Raddick, Kevin Schawinski, Alex Szalay, Daniel Thomas, Jan van den Berg. Galaxy Zoo: The large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey. Submitted March 22, 2008. [link]
Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C. Nichol, Alex Szalay, Dan Andreescu, Phil Murray, Jan van den Berg. Galaxy Zoo: Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Submitted to MNRAS, April 29, 2008. [link]
Rada Mihalcea and Timothy Chklovski. Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users' Help. Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). Budapest, April 2003. [pdf] [ps] [data]
Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, Ian Soboroff. Overview of the TREC-2006 Blog Track. TREC 2006.
Luis von Ahn. Games With a Purpose. IEEE Computer Magazine, vol. 39, no. 6, pp. 92-94, June 2006. [pdf]
Luis von Ahn. Human Computation. Google Tech Talks. July 26, 2006. [video]
Luis von Ahn, Ruoran Liu and Manuel Blum. Peekaboom: A Game for Locating Objects in Images. ACM CHI 2006. [pdf]
Luis von Ahn, S. Ginosar, M. Kedia, R. Liu and M. Blum. Improving Accessibility of the Web with a Computer Game. ACM CHI 2006. [pdf]
Luis von Ahn, Mihir Kedia and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. ACM CHI 2006. [pdf]
A. J. Westphal, C. C. Allen, R. Bastien, J. Borg, F. Brenker, J. C. Bridges, D. E. Brownlee, A. L. Butterworth, C. Floss, G. J. Flynn, D. Frank, Z. Gainsforth, E. Gruen, P. Hoppe, A. T. Kearsley, H. Leroux, L. R. Nittler, S. A. Sandford, A. Simionovici, F. J. Stadermann, R. M. Stroud, P. Tsou, T. Tyliszczak, J. Warren, M. E. Zolensky. Preliminary Examination of the Interstellar Collector of Stardust. 39th Lunar and Planetary Science Conference (2008), Abstract #1855. [pdf]
Nicholos Wethington. Galaxy Zoo Gets a Makeover. Universe Today. April 23, 2008. [link]
Nicholos Wethington. Galaxy Zoo Results Show that the Universe Isn't 'Lopsided'. Universe Today. March 28, 2008. [link]
Hui Yang, Luo Si, Jamie Callan. Knowledge Transfer and Opinion Detection in the TREC2006 Blog Track. TREC 2006.