However, not all natural language processing (NLP) applications require a complete syntactic analysis. A full parse often provides more information than is needed, and sometimes less. For example, in Information Retrieval it may be enough to find simple NPs (noun phrases) and VPs (verb phrases). In Information Extraction, Summary Generation, and Question Answering, we are especially interested in specific syntactico-semantic relations such as agent, object, location, and time (basically, who did what to whom, when, where, and why), rather than in elaborate configurational syntactic analyses.
Partial or shallow parsing, the task of recovering only a limited amount of syntactic information from natural language sentences, has proved to be a useful technology for written and spoken language domains. For example, within the Verbmobil project, shallow parsers were used to add robustness to a large speech-to-speech translation system [Wahlster(2000)]. Shallow parsers are also typically used to reduce the search space for full-blown, `deep' parsers [Collins(1996)]. Yet another application of shallow parsing is question answering on the World Wide Web, where there is a need to efficiently process large quantities of (potentially) ill-formed documents [Buchholz and Daelemans(2001), Srihari and Li(1999)]. More generally, it is useful for text mining applications, for example in biology [Sekimizu et al.(1998)].
[Abney(1991)] is credited with being the first to argue for the relevance of shallow parsing, both from the point of view of psycholinguistic evidence and from the point of view of practical applications. His own approach used hand-crafted cascaded finite-state transducers to produce a shallow parse.
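To make the idea concrete, here is a minimal sketch of such a cascade, implemented as successive regular-expression rewrites over part-of-speech tag sequences. This is our illustration, not Abney's actual system: the tag inventory, the patterns, and the function names are simplifying assumptions.

import re

# Each level rewrites the output of the previous one; later levels can
# therefore build on the chunks found earlier (hence `cascaded').
LEVELS = [
    ("NP", re.compile(r"(DT )?(JJ )*(NNS? )+")),            # det? adj* noun+
    ("PP", re.compile(r"IN NP ")),                          # preposition + NP
    ("VP", re.compile(r"(MD )?(VB[DGNPZ]? )+(NP |PP )*")),  # verb group + arguments
]

def cascade(pos_tags):
    # Represent the sentence as a space-delimited string of Penn
    # Treebank POS tags and rewrite it level by level.
    s = " ".join(pos_tags) + " "
    for label, pattern in LEVELS:
        s = pattern.sub(label + " ", s)
    return s.split()

# `The old man ate an apple' -> DT JJ NN VBD DT NN
print(cascade(["DT", "JJ", "NN", "VBD", "DT", "NN"]))
# -> ['NP', 'VP']  (the object NP is absorbed into the VP at the last level)

A real system would keep the word/label alignment and use far richer patterns, but the control structure, a pipeline of finite-state recognizers, is the same.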
Typical modules within a shallow parser architecture include the following: a part-of-speech tagger, a chunker that finds the boundaries of non-recursive phrases such as NPs and VPs, and a component that assigns grammatical relations (such as subject and object) between chunks.
Because shallow parsers have to deal with natural languages in their entirety, they are large, and frequently contain thousands of rules (or rule analogues). For example, a rule might state that determiners (words such as `the') are good predictors of noun phrases. These rule sets also tend to be largely `soft', in that exceptions abound. Continuing with our example, in the phrase

...fatalities on non-interstate roads were about the same

the word `the' is instead within the adjectival phrase `about the same'. This example was taken from the Parsed Wall Street Journal [Marcus et al.(1993)].
Building shallow parsers by hand is therefore a labour-intensive task. Unsurprisingly, shallow parsers are now usually built automatically, using techniques originating in the machine learning (or statistical) community.
The work by [Ramshaw and Marcus(1995)] proved to be an important source of inspiration for this work. By formulating the task of NP chunking as a tagging task, a large number of machine learning techniques suddenly became available to solve the problem. In this approach, each word is associated with one of three tags: I (for a word inside an NP), O (for a word outside any NP), and B (for the first word of an NP that immediately follows another NP). The classification task can easily be extended to other types of chunks and, with some effort, even to finding relations [Buchholz et al.(1999)]. For an extension of an HMM approach from tagging to chunking, see [Skut and Brants(1998)].
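To illustrate the encoding, the following sketch (a hypothetical helper, not Ramshaw and Marcus's implementation) converts NP chunk spans into the three tags. Note that B is only needed where two NPs are adjacent; everywhere else, I and O suffice to recover the chunk boundaries.

def chunks_to_iob(n_words, np_spans):
    # n_words: sentence length; np_spans: sorted, non-overlapping
    # (start, end) word indices, with end exclusive.
    tags = ["O"] * n_words
    prev_end = -1
    for start, end in np_spans:
        for i in range(start, end):
            tags[i] = "I"
        # B marks a word that starts an NP directly after another NP;
        # a plain I here would merge the two chunks into one.
        if start == prev_end:
            tags[start] = "B"
        prev_end = end
    return tags

# `He reckons the current account deficit will narrow .'
print(chunks_to_iob(9, [(0, 1), (2, 6)]))
# -> ['I', 'O', 'I', 'I', 'I', 'I', 'O', 'O', 'O']
print(chunks_to_iob(4, [(0, 2), (2, 4)]))   # two adjacent NPs
# -> ['I', 'I', 'B', 'I']

Once chunking is phrased this way, any classifier that can tag words, from decision trees to memory-based learners to HMMs, can in principle be used to chunk.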
Readers are encouraged to visit the Computational Natural Language Learning (CoNLL) shared task websites,

http://lcg-www.uia.ac.be/conll2000/chunking/
http://lcg-www.uia.ac.be/conll2001/clauses/

for background reading, data sets, and the results of more than 20 shallow parsing systems.
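As a point of reference for the data sets behind these links, the CoNLL-2000 chunking data use a simple column format: one word per line, followed by its part-of-speech tag and a chunk tag, with sentences separated by blank lines. The chunk tags are a variant of the scheme above in which B-X marks the first word of a chunk of type X and I-X marks the remaining words. A fragment looks like this:

He      PRP  B-NP
reckons VBZ  B-VP
the     DT   B-NP
current JJ   I-NP
account NN   I-NP
deficit NN   I-NP
will    MD   B-VP
narrow  VB   I-VP
.       .    O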
Applying learning techniques is, however, not necessarily straightforward.
Shallow parsing, like much of natural language processing, is therefore a challenging domain for machine learning research.
Note that shallow parsing does not refer to a single technique. It is better understood as a family of related methods, all of which attempt to recover some syntactic information at the possible expense of ignoring the rest.