Steven Bird

University of Pennsylvania

NLTK: The Natural Language Toolkit

May 17, 2002

Edward Loper, Steven Bird

Abstract: NLTK, the Natural Language Toolkit, is a suite of open source program modules, tutorials and problem sets, providing ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical natural language processing, and is interfaced to annotated corpora. Students augment and replace existing components, learn structured programming by example, and manipulate sophisticated models from the outset.

* 8 pages, 1 figure, Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia, July 2002, Association for Computational Linguistics

Querying Databases of Annotated Speech

Apr 11, 2002

Steve Cassidy, Steven Bird

Abstract: Annotated speech corpora are databases consisting of signal data along with time-aligned symbolic 'transcriptions'. Such databases are typically multidimensional, heterogeneous and dynamic. These properties present a number of tough challenges for representation and query. The temporal nature of the data adds an additional layer of complexity. This paper presents and harmonises two independent efforts to model annotated speech databases, one at Macquarie University and one at the University of Pennsylvania. Various query languages are described, along with illustrative applications to a variety of analytical problems. The research reported here forms a part of several ongoing projects to develop platform-independent open-source tools for creating, browsing, searching, querying and transforming linguistic databases, and to disseminate large linguistic databases over the internet.

* Database Technologies: Proceedings of the Eleventh Australasian Database Conference, pp. 12-20, IEEE Computer Society, 2000
* 9 pages, 4 figures
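
To make "time-aligned symbolic transcriptions" concrete, the following is a minimal sketch of a temporal containment query over labelled time intervals. It is not any of the query languages described in the paper; the `Segment` class, the `contained_in` helper, the tier names and the sample timings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One time-aligned label on a named tier (hypothetical structure)."""
    tier: str
    label: str
    start: float  # seconds
    end: float    # seconds


def contained_in(segments, tier, outer):
    """Return segments on `tier` whose interval lies inside `outer`."""
    return [
        s for s in segments
        if s.tier == tier and outer.start <= s.start and s.end <= outer.end
    ]


# Invented sample data: two words and their phoneme-level segments.
corpus = [
    Segment("word", "cat", 0.00, 0.45),
    Segment("phoneme", "k", 0.00, 0.10),
    Segment("phoneme", "ae", 0.10, 0.30),
    Segment("phoneme", "t", 0.30, 0.45),
    Segment("word", "sat", 0.45, 0.90),
    Segment("phoneme", "s", 0.45, 0.60),
]

word = corpus[0]
phonemes = [s.label for s in contained_in(corpus, "phoneme", word)]
print(phonemes)
```

A flat list with a linear scan keeps the sketch short; a real speech database would need indexing over time and over the hierarchy of tiers.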

Phonology

Apr 11, 2002

Steven Bird

Abstract: Phonology is the systematic study of the sounds used in language, their internal structure, and their composition into syllables, words and phrases. Computational phonology is the application of formal and computational techniques to the representation and processing of phonological information. This chapter will present the fundamentals of descriptive phonology along with a brief overview of computational phonology.

* In Ruslan Mitkov (ed) (2002). Oxford Handbook of Computational Linguistics
* 27 pages

Computational Phonology

Apr 10, 2002

Steven Bird

Abstract: Phonology, as it is practiced, is deeply computational. Phonological analysis is data-intensive and the resulting models are nothing other than specialized data structures and algorithms. In the past, phonological computation - managing data and developing analyses - was done manually with pencil and paper. Increasingly, with the proliferation of affordable computers, IPA fonts and drawing software, phonologists are seeking to move their computation work online. Computational Phonology provides the theoretical and technological framework for this migration, building on methodologies and tools from computational linguistics. This piece consists of an apology for computational phonology, a history, and an overview of current research.

* Oxford International Encyclopedia of Linguistics, 2nd Edition, 2002
* 4 pages

Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development

Apr 10, 2002

Christopher Cieri, Steven Bird

Abstract: Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development for each of several areas, it proposes a common framework for future tool development, data annotation and resource sharing based upon annotation graphs and servers.

* Proceedings of ACL Workshop on Sharing Tools and Resources for Research and Education, Toulouse, July 2001, pp 23-30
* 8 pages, 6 figures

Seven Dimensions of Portability for Language Documentation and Description

Apr 10, 2002

Steven Bird, Gary Simons

Abstract: The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.

* Proceedings of the Workshop on Portability Issues in Human Language Technologies, Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002
* 8 pages

An Integrated Framework for Treebanks and Multilayer Annotations

Apr 03, 2002

Scott Cotton, Steven Bird

Abstract: Treebank formats and associated software tools are proliferating rapidly, with little consideration for interoperability. We survey a wide variety of treebank structures and operations, and show how they can be mapped onto the annotation graph model, leading to an integrated framework encompassing tree and non-tree annotations alike. This development opens up new possibilities for managing and exploiting multilayer annotations.

* Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002
* 8 pages
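
As a rough illustration of mapping a tree onto a graph of labelled arcs, the sketch below flattens a small phrase-structure tree into arcs over word-boundary nodes. This is a simplification and not the encoding defined in the paper: the `tree_to_arcs` function, the tuple-based tree format and the arc types `"W"` and `"CAT"` are assumptions made for this example.

```python
def tree_to_arcs(tree, start=0):
    """Flatten a tree into arcs (from_node, to_node, type, label).

    `tree` is either a string (a word) or a tuple (label, child, ...).
    Nodes are integer word boundaries; returns (arcs, end_node).
    """
    if isinstance(tree, str):
        # A word spans exactly one step between adjacent boundary nodes.
        return [(start, start + 1, "W", tree)], start + 1

    label, *children = tree
    arcs = []
    pos = start
    for child in children:
        child_arcs, pos = tree_to_arcs(child, pos)
        arcs.extend(child_arcs)
    # A constituent spans from its first child's start to its last child's end.
    arcs.append((start, pos, "CAT", label))
    return arcs, pos


# Hypothetical example sentence with a conventional bracketing.
tree = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "him")))
arcs, n = tree_to_arcs(tree)
for arc in arcs:
    print(arc)
```

Two constituents covering the same span (a unary chain, for instance) produce arcs with identical endpoints, so span information alone does not recover dominance; this sketch does not address that case.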

Creating Annotation Tools with the Annotation Graph Toolkit

Apr 03, 2002

Kazuaki Maeda, Steven Bird, Xiaoyi Ma, Haejoong Lee

Abstract: The Annotation Graph Toolkit is a collection of software supporting the development of annotation tools based on the annotation graph model. The toolkit includes application programming interfaces for manipulating annotation graph data and for importing data from other formats. There are interfaces for the scripting languages Tcl and Python, a database interface, specialized graphical user interfaces for a variety of annotation tasks, and several sample applications. This paper describes all the toolkit components for the benefit of would-be application developers.

* Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002
* 8 pages, 12 figures

Models and Tools for Collaborative Annotation

Apr 03, 2002

Xiaoyi Ma, Haejoong Lee, Steven Bird, Kazuaki Maeda

Abstract: The Annotation Graph Toolkit (AGTK) is a collection of software which facilitates development of linguistic annotation tools. AGTK provides a database interface which allows applications to use a database server for persistent storage. This paper discusses various modes of collaborative annotation and how they can be supported with tools built using AGTK and its database interface. We describe the relational database schema and API, and describe a version of the TableTrans tool which supports collaborative annotation. The remainder of the paper discusses a high-level query language for annotation graphs, along with optimizations, in support of expressive and efficient access to the annotations held on a large central server. The paper demonstrates that it is straightforward to support a variety of different levels of collaborative annotation with existing AGTK-based tools, with a minimum of additional programming effort.

* Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002
* 8 pages, 6 figures

TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit

Apr 03, 2002

Steven Bird, Kazuaki Maeda, Xiaoyi Ma, Haejoong Lee, Beth Randall, Salim Zayat

Abstract: Four diverse tools built on the Annotation Graph Toolkit are described. Each tool associates linguistic codes and structures with time-series data. All are based on the same software library and tool architecture. TableTrans is for observational coding, using a spreadsheet whose rows are aligned to a signal. MultiTrans is for transcribing multi-party communicative interactions recorded using multi-channel signals. InterTrans is for creating interlinear text aligned to audio. TreeTrans is for creating and manipulating syntactic trees. This work demonstrates that the development of diverse tools and re-use of software components is greatly facilitated by a common high-level application programming interface for representing the data and managing input/output, together with a common architecture for managing the interaction of multiple components.

* Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002
* 7 pages, 7 figures
