CSE 788R04: Knowledge Bootstrapping for Language Processing Applications

Professor: Donna Byron
Office hours: Monday 11:00 - 12:30, Thursday 11:00 - 12:00, Dreese 583

Class meeting time: TR 12:30 pm - 1:48 pm
Meeting place: Dreese 357
Newsgroup: cse.course.cse788R04

Thursday, November 30th

Here are a two-column LaTeX style, a bibliography style, and a template paper for your project writeups.



Wednesday, September 27th

The reading for tomorrow is "Automatic Acquisition of Hyponyms from Large Text Corpora" (pdf).



Course Description


Knowledge bases provide a vital piece of infrastructure for many language processing applications. They describe the inventory of word meanings that a system might encounter, relationships between concepts, and facts about how the world is and how it evolves. This knowledge is put to a variety of uses in language processing tasks, but how is the knowledge acquired and stored? Early hand-built resources such as WordNet, FrameNet, and Cyc, along with labelled text collections such as the Penn Treebank, proved immensely useful to language technology but were costly and labor-intensive to construct.

In recent years, attention has focused on automatically populating knowledge resources that provide benefits similar to those of hand-constructed collections, but at much lower cost. The knowledge in these resources is captured from sources such as recorded call-center conversations, web pages, Wikipedia, labelled or controlled text corpora, and even human volunteers entering common-sense facts on a web page. Automatically built knowledge bases have several advantages: they can extend the coverage of hand-constructed resources or provide detailed coverage of a particular domain, they are easier to keep up to date as the world evolves, and they can be built for human languages that lack hand-built resources. However, potential users are often concerned about the quality of the data in automatically built resources, even though hand-constructed data also contains its share of noise. A particularly important theme among engineers working in this area is therefore assessing the quality and utility of automatically acquired knowledge resources as compared to manually constructed ones.

This seminar will introduce students to current research on automatically building and using knowledge resources for language processing tasks. Sample topics include:

  • Uses for knowledge resources in language technology (overview)
  • Populating concept hierarchies and thesauruses
  • Mining the web for individuals (such as people, companies, geographic regions) and their relations
  • Collecting data from volunteers (OpenMind and its variants)
  • Automatic construction of frames and predicate-argument structure
  • Lexical and concept acquisition for specialized domains, such as bioinformatics
  • Acquiring behaviors such as dialog moves and dialog strategies
  • Collecting domain-specific resources for language modelling

The seminar is targeted at students working in Computational Linguistics, Automated Reasoning, Language Acquisition, or Data Mining who want to gain familiarity with these techniques. The readings will come primarily from recent Computational Linguistics and Artificial Intelligence publications, but seminar participants will be encouraged to suggest additional topics or readings from other fields. Interested students and colleagues should enroll in the course regardless of their background in this topic. Our goal as a group will be to make the material understandable to all members of the class.