2011 FaMAF Intro to NLG
This course covers three topics that are usually taught in separate courses: the automatic construction of text starting from structured data, the automatic construction of summaries (not the full summarization system, just the text construction bit), and the text creation in the target language during machine translation.
Graduate students, advanced undergraduates.
Natural Language Generation (NLG)
Textbook: Building Natural Language Generation Systems (2000) Reiter&Dale (ISBN 0521620368)
- (NLG1) Introduction to NLG. NLG systems architecture. NLG systems examples. Input data to NLG systems.
- (NLG1.1) E. Reiter, Has a Consensus NLG Architecture Appeared, and is it Psycholinguistically Plausible?, IWNLG '94. PDF (9 pages)
- (NLG2) Content planning. Content selection. Content selection via spreading activation. Content ordering. Domain Communicative Knowledge. Statistical methods for content selection and ordering.
- (NLG2.1) Mellish&al '98, An architecture for opportunistic text generation. PDF (10 pages)
- See also, Cox&al., '99. Dynamic versus static hypermedia in museum education: an evaluation of ILEX, the intelligent labelling explorer. PDF (8 pages)
- (NLG2.2) Kittredge&al, '91. On the need for domain communication knowledge. (10 pages)
- (NLG2.3) Duboue&McKeown '03, Statistical Acquisition of Content Selection Rules for Natural Language Generation. PDF (8 pages)
- (NLG2.4) Barzilay&Lapata, '08. Modeling Local Coherence: An Entity-based Approach. PDF (34 pages)
- (NLG3) Sentence planning. Lexicalization. Aggregation. Pronominalization. Generation of referring expressions. Statistical methods for referring expressions.
- (NLG3.1) Reape&Mellish, '98. Just What is Aggregation Anyway? PS (10 pages)
- (NLG3.2) Dale, '09. Referring expression generation through attribute-based heuristics. PDF (8 pages)
- (NLG3.3) Areces&al. '08. Referring Expressions as Formulas of Description Logic. PDF (8 pages)
- See also: Belz, '08. Intrinsic vs. extrinsic evaluation measures for referring expression generation. PDF (4 pages)
- (NLG4) Surface generation. Functional unification grammar methods. FUF. Statistical surface generation. Nitrogen. SPUD.
- (NLG4.1) Elhadad&Robin, '99. SURGE: a comprehensive plug-in syntactic realization component for text generation. PDF (44 pages)
- (NLG4.2) Langkilde&Knight, '98. The practical value of n-grams in generation. PDF (8 pages)
- (NLG4.3) Stone&al, '03. Microplanning with communicative intentions: The SPUD system. PDF (72 pages)
- (NLG5) Multi-lingual generation. Component re-use. NLG systems evaluation. Clinical trials. Gold standards.
- (NLG5.1) Callaway, '05. Automatic cinematography and multilingual NLG for generating video documentaries. (33 pages)
- (NLG5.2) Reiter&Robertson, '03. Lessons from a failure: Generating tailored smoking cessation letters. PDF (18 pages)
- (NLG5.3) Reiter, '02. Should corpora texts be gold standards for NLG? PDF (8 pages)
Summarization (Summ)
Textbook: Advances in Automatic Text Summarization (1999). Mani&Maybury (ISBN 0262133598)
- (Summ1) Introduction to automatic text summarization. Single vs. multi-document summarization. Example systems.
- (Summ2) Single document summarization. Knowledge-based systems. Statistical systems. Ultra summarization. Headline generation.
- (Summ3) Multi-document summarization. Extractive systems. Sentence-fusion systems. Multi-lingual summarization. Summary updates.
- (Summ3.1) Radev&al. '04. Centroid-based summarization of multiple documents. Information Processing and Management, 40:919–938, December 2004 PDF (20 pages)
- (Summ3.2) Barzilay, '05. Sentence fusion for multidocument news summarization. PDF (32 pages)
- (Summ3.3) Saggion&al, '02. Developing infrastructure for the evaluation of single and multi-document summarization systems in a cross-lingual environment. PDF (8 pages)
- (Summ4) Summary evaluation. Intrinsic and extrinsic evaluation. Pyramid method. ROUGE.
- (Summ4.1) Dorr&al, '05. A methodology for extrinsic evaluation of text summarization: does ROUGE correlate? PDF (8 pages)
- (Summ4.2) Lin, '04. Rouge: A package for automatic evaluation of summaries. PDF (8 pages)
- (Summ4.3) Nenkova&al, '07. The Pyramid Method: Incorporating human content selection variation in summarization evaluation. PDF (23 pages)
Machine Translation (MT)
Textbook: Statistical Machine Translation (2010). Koehn (ISBN 0521874157)
- (MT1) Introduction to Machine Translation. Text production during machine translation.
- (MT2) Knowledge-based systems. Transfer rules. Bidirectional grammars.
- (MT3) Statistical methods. Statistical decoding. Beam search.
- (MT4) MT evaluation. BLEU.
- (MT4.1) Papineni&al '01. Bleu: a Method for Automatic Evaluation of Machine Translation. PDF (10 pages)
- Romina Altamirano: Using automatically induced subcategorization frames for Spanish verbs in NLG
- Matias Bordese: Verbalization of software patches
- Matias Bordone: Generation of cluster labels
- Julio Castillo: RTE and NLG
- Sergio Canchi: Sign language generation
- Marina Cárdenas: Medical records verbalization
- Raul Fervari: Large scale evaluation of description logic GRE algorithms.
- RC: report generation
- Hérnan Casalánguida: UML verbalization
- Fabian Pacheco: Large scale evaluation of classic GRE algorithms
- Pablo Perez De Angelis: authoring tool for game narratives
These are just sample project topics. Contact me for discussion of any specifics.
Projects are individual. The sample projects described below are general areas rather than precise projects. The details of any project must be negotiated beforehand with the instructor. Most NLP projects involve partial implementations.
Projects in NLG
- Report generation
- A very strong area for NLG is the automatic construction of reports. The idea behind this project is to work closely with research centers that perform data acquisition (or to use already published values), for example the data published by the Observatorio Ambiental de Cordoba. This project can be done jointly with a research lab in the school or externally.
- Generation of referring expressions / instructions in virtual environments
- There is an international competition for the evaluation of NLG systems: the GIVE Challenge. Prof. Benotti has participated in previous years and plans to participate again in 2011. Contributing to this effort can make an interesting term project.
- Input Data Expressivity
- NLG systems use structured data as input, expressed in different formalisms. Many such formalisms are extremely simple (e.g., attribute-value pairs), while others are more interesting (e.g., second order logic). Different formalisms are needed to build different types of text. In this project, the idea is to explore the expressiveness of the input semantics with respect to the generated output text.
- Better compilation errors
- In this project, you will modify the source code of an open source compiler so that it generates longer and more relevant error messages. Ideally, we can use a programming language already used in the department for teaching introductory programming courses.
- Software Patches Verbalization
- In this project, you will generate a succinct description of the changes made to the different files touched by a software patch. This is an ongoing project between me and Annie Ying; some student projects can be made part of this effort.
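To make the attribute-value input mentioned under Input Data Expressivity more concrete, here is a minimal Python sketch of a report generator in miniature: a content selection step followed by template-based realization. The record, its field names, the threshold rule and the templates are all invented for illustration; they are not taken from any real data source or from the textbook.

```python
from typing import Dict, List

# Hypothetical input: one sensor reading expressed as attribute-value pairs.
# All names and numbers below are made up for the example.
READING: Dict[str, object] = {
    "station": "Centro",
    "pollutant": "PM10",
    "value": 62.0,
    "unit": "ug/m3",
    "limit": 50.0,
}

# One fixed template per message type (assumed, deliberately simplistic).
TEMPLATES: Dict[str, str] = {
    "measurement": "The {station} station measured {value:g} {unit} of {pollutant}.",
    "exceeds_limit": "This is above the limit of {limit:g} {unit}.",
}


def select_content(record: Dict[str, object]) -> List[str]:
    """Content selection: always report the measurement, and mention
    the limit only when the measured value exceeds it."""
    messages = ["measurement"]
    if record["value"] > record["limit"]:
        messages.append("exceeds_limit")
    return messages


def generate(record: Dict[str, object]) -> str:
    """Realization: fill the template of each selected message, in order."""
    return " ".join(TEMPLATES[m].format(**record) for m in select_content(record))


if __name__ == "__main__":
    print(generate(READING))
```

With such a flat input, the generator can only say what the fixed templates anticipate; richer input formalisms are what allow richer texts, which is the question the Input Data Expressivity project explores.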
Projects in Summarization
- Sentence fusion
- A classic problem in summarization is joining two sentences with similar meaning. In this project, you can explore generalizing existing techniques from English to Spanish.
- Clause contextualization
- When a salient sentence is extracted, the context is lost. In this project, we seek to modify the original sentence, augmenting it with contextual information. This could be particularly useful for the media tracking project at Grupo PLN FaMAF.
Projects in MT
- Improving translation quality using automatically induced grammars
- An MT system that produces grammatically correct sentences is to be preferred over one that does not. This intuition has driven a significant amount of research in the field over the last 5+ years. In this project, you will relate that body of work to the research done on unsupervised parsing at the Grupo PLN FaMAF. The goal is to enrich a regular statistical MT system with unsupervised grammars, so as to obtain better quality translations.
- Improving treatment of verbs for Spanish / English language pair
- The subtleties of the Spanish tense and modal verb system are difficult for a statistical MT system to capture. In this project, you can build a component that handles Spanish verb conjugations for MT systems with Spanish as their target language.
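As a starting point for the verb component described in the last project, the sketch below conjugates regular Spanish verbs in the present indicative only. It is an assumption-laden toy: the function name and person labels are hypothetical, and irregular verbs, stem changes, other tenses and moods (the actual difficulty of the project) are not handled.

```python
from typing import Dict, Tuple

# Person/number labels used by this sketch (hypothetical naming).
PERSONS: Tuple[str, ...] = ("1sg", "2sg", "3sg", "1pl", "2pl", "3pl")

# Present indicative endings for the three regular conjugation classes.
PRESENT_ENDINGS: Dict[str, Tuple[str, ...]] = {
    "ar": ("o", "as", "a", "amos", "áis", "an"),
    "er": ("o", "es", "e", "emos", "éis", "en"),
    "ir": ("o", "es", "e", "imos", "ís", "en"),
}


def conjugate_present(infinitive: str, person: str) -> str:
    """Conjugate a REGULAR verb in the present indicative.

    Raises ValueError for unknown infinitive endings or person labels.
    Irregular verbs (e.g. 'ser', 'tener') will yield wrong forms.
    """
    stem, ending = infinitive[:-2], infinitive[-2:]
    if ending not in PRESENT_ENDINGS:
        raise ValueError(f"not an -ar/-er/-ir infinitive: {infinitive!r}")
    if person not in PERSONS:
        raise ValueError(f"unknown person label: {person!r}")
    return stem + PRESENT_ENDINGS[ending][PERSONS.index(person)]


if __name__ == "__main__":
    for verb in ("hablar", "comer", "vivir"):
        print(verb, [conjugate_present(verb, p) for p in PERSONS])
```

In an MT setting, a component like this would sit after decoding, taking a lemma plus the desired person, number and tense and producing the inflected surface form.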
This is a tentative schedule. Codes refer to the Content section. Papers in italics are to be presented by the instructor; other papers are to be presented by the students. In the two project lectures, students will present their project topics and the work done during the class.
Tue Mar 7
- Thu Mar 10: Course introduction, NLG1, NLG1.1
- Tue Mar 15: NLG2, NLG2.1, NLG2.3
- Thu Mar 17: NLG2.2, NLG2.4
- Tue Mar 22: NLG3, NLG3.1
Thu Mar 24
- Tue Mar 29: NLG3.2, NLG3.3
- Thu Mar 31: NLG4, NLG4.1
- Tue Apr 5: NLG4.2, NLG4.3
- Thu Apr 7: NLG5, NLG5.3
- Tue Apr 12: NLG5.1, NLG5.2
- Thu Apr 14: NLG wrap-up
- Tue Apr 19: Projects showcase
Thu Apr 21
- Tue Apr 26: Summ1
- Thu Apr 28: Summ2
- Tue May 3: Summ2.1
- Thu May 5: Summ2.2
- Tue May 10: Summ3, Summ3.1
- Thu May 12: Summ3.2, Summ3.3
- Tue May 17: Summ4, Summ4.2
- Thu May 19: Summ4.1, Summ4.3
- Tue May 24: Summarization wrap-up
- Thu May 26: MT1
- Tue May 31: MT2, MT2.1
- Thu Jun 2: MT2.2
- Tue Jun 7: MT3, MT3.1
- Thu Jun 9: MT3.2
- Tue Jun 14: MT4, MT4.1
- Thu Jun 17: Projects defense
- Thu Mar 17
- NLG2.2: Julio Castillo
- NLG2.3: (long) Matias Bordone
- Tue Mar 29
- NLG3.2: Fabian Pacheco
- NLG3.3: Raul Fervari
- Tue Apr 5
- NLG4.2: RC
- NLG4.3: Romina Altamirano
- Tue Apr 12
- NLG5.1: (long) Pablo Perez De Angelis
- NLG5.2: (long) Marina Cárdenas
- Thu May 12
- Summ3.2: (long) RC
- Summ3.3: Sergio Canchi
- Thu May 19
- Summ4.1: Hérnan Casalánguida
- Summ4.3: Matias Bordone
- Thu Jun 2
- MT2.2: (long) Romina Altamirano
- Thu Jun 9
- MT3.2: Matias Bordese
- Upper Model '89
- NLG systems, incl. timeline of NLG systems
- Downloadable NLG systems at the ACL wiki
- SenSem (semantically annotated Spanish corpus)
- Charles Callaway resource page, including a Spanish version of SURGE
- RealPro, includes the manual for RealPro
- scene generation
- STANDUP, joke generation
- MEAD public domain summarizer