Description
Knowledge bases provide a vital piece of infrastructure for many
language processing applications. They describe the inventory of word
meanings that a system might encounter, relationships between
concepts, and facts about how the world is and how it evolves. This
knowledge is put to a variety of uses in language processing tasks,
but how is the knowledge acquired and stored? Early hand-built
resources, such as WordNet, FrameNet, Cyc, and labelled text
collections such as the Penn Treebank, proved immensely useful to
language technology, but were costly and labor-intensive to construct.
In recent years, attention has focused on automatically
populating knowledge resources that provide benefits similar to those of
hand-constructed collections, but at much reduced cost. The knowledge
in these resources is captured from sources such as recorded
call-center conversations, web pages, Wikipedia, labelled/controlled text
corpora, and even human volunteers entering common sense facts on a
web page. Automatically built knowledge bases have a variety of
advantages: they can extend the coverage of hand-constructed resources
or contain detailed coverage of a particular domain, they are easier
to keep up to date as the world evolves, and they can be built for
human languages that do not have hand-built resources. However,
potential users are often concerned about the quality of data
represented in automatically built resources (even though
hand-constructed data also contains its share of noise), so a
particularly important theme among engineers working in this area is
assessing the quality and utility of automatically acquired knowledge
resources as compared to manually constructed ones.
This seminar will introduce students to current research on
automatically building and using knowledge resources for language
processing tasks. Sample topics include:
- Uses for knowledge resources in language technology (overview)
- Populating concept hierarchies and thesauruses
- Mining the web for individuals (such as people, companies, geographic regions) and their relations
- Collecting data from volunteers (OpenMind and its variants)
- Automatic construction of frames and predicate-argument structure
- Lexical and concept acquisition for specialized domains, such as bioinformatics
- Acquiring behaviors such as dialog moves and dialog strategies
- Collecting domain-specific resources for language modelling
The seminar is targeted at students working in Computational
Linguistics, Automated Reasoning, Language Acquisition or Data Mining
who want to gain familiarity with these techniques. The readings will
come primarily from recent Computational Linguistics and Artificial
Intelligence publications, but seminar participants will be encouraged
to suggest additional topics or readings from other fields.
Interested students and colleagues should enroll in the course
regardless of their level of background in this topic. It will be
our goal as a group to make the material understandable to all members
of the class.