Web N-Grams as a Resource for Corpus Linguistics
In recent years, the rapid growth of the World Wide Web has enabled research in computational linguistics to scale up to Web-derived corpora a thousand times the size of the British National Corpus (BNC) and more. These huge text collections open up entirely new possibilities for training statistical models and unsupervised learning algorithms. With the release of Google's Web 1T 5-gram database (Brants & Franz 2006), a corpus on the teraword scale came within reach of the general research community for the first time, in the form of n-gram frequency tables. Since then, the Web1T5 database has been applied to a wide range of natural language processing tasks. In addition to the obvious use as training data for broad-coverage n-gram models (e.g. as part of a machine translation or speech recognition system), the database has been used for spelling correction, as a convenient replacement for online Web queries e.g. in knowledge mining, and even for the prediction of fMRI neural activation associated with concrete nouns (Mitchell et al. 2008). Computer scientists have also developed specialized indexing engines that allow fast interactive queries to the database, impressively demonstrated e.g. by http://www.netspeak.org/ (Stein et al. 2010).
In my talk, I explore the usefulness of Web1T5 and similar n-gram databases as a resource for corpus linguistic studies, despite their well-known shortcomings: the inevitable frequency thresholds, a genre composition dominated by computer science, porn and advertising, an abundance of text duplicates and boilerplate, as well as a complete lack of linguistic annotation (lemmatization and part-of-speech tagging). As an example, I show how three essential types of corpus analysis -- word and phrase frequencies, collocational profiles, and distributional semantics -- can be carried out on Web1T5.
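To illustrate the second type of analysis, a collocational profile can be computed directly from n-gram frequency tables by comparing observed co-occurrence counts against chance expectation. The sketch below uses pointwise mutual information as the association measure; the counts are invented for illustration and do not come from Web1T5 itself.

```python
import math

# Hypothetical unigram and bigram counts in the style of a Web1T5-like
# frequency table (invented numbers, for illustration only).
N = 1_000_000_000          # assumed total token count of the corpus
unigrams = {"strong": 500_000, "powerful": 400_000, "tea": 300_000}
bigrams = {("strong", "tea"): 6_000, ("powerful", "tea"): 200}

def pmi(w1, w2):
    """Pointwise mutual information of a word pair, a standard
    association measure used in collocational profiles."""
    p_joint = bigrams[(w1, w2)] / N
    p_indep = (unigrams[w1] / N) * (unigrams[w2] / N)
    return math.log2(p_joint / p_indep)

# The classic collocation "strong tea" scores higher than the
# anti-collocation "powerful tea".
print(pmi("strong", "tea") > pmi("powerful", "tea"))  # True
```

The same counts-based recipe extends to other association measures (e.g. log-likelihood), which differ only in how observed and expected frequencies are combined.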
A prerequisite for more widespread adoption of n-gram databases in corpus linguistics is the availability of open-source indexing software that is flexible enough to support these types of corpus analysis, fast enough for interactive exploration of the database, and able to run on off-the-shelf desktop hardware. I present a simple and convenient solution building on SQLite (an embedded relational database engine), Perl and the statistical software package R (Evert 2010).
The last part of my talk attempts an evaluation of Web1T5 as a linguistic resource. For this purpose, frequency counts for words and n-grams are compared with the BNC and other standard corpora, and Web1T5 is applied to several collocation extraction and semantic similarity tasks. A closer look at the evaluation results reveals some fundamental differences between a Web-based n-gram database and traditional corpora. In this way, I hope to shed new light on the question of whether more data are really always better data (Church & Mercer 1993).
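The core idea behind such an indexing solution can be sketched in a few lines: n-grams are loaded into a relational table whose primary-key index supports fast interactive collocate lookups. The schema and example rows below are assumptions for illustration, not the actual Web1T5 layout or the Perl/R implementation described in the talk; Python's built-in `sqlite3` module stands in for the Perl binding.

```python
import sqlite3

# A minimal sketch of SQLite-based n-gram indexing (hypothetical schema
# and invented counts, for illustration only).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE bigrams (
        w1 TEXT, w2 TEXT, f INTEGER,
        PRIMARY KEY (w1, w2)   -- index enables fast prefix lookups on w1
    )
""")
con.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", [
    ("strong", "tea", 6000),
    ("strong", "coffee", 4500),
    ("powerful", "tea", 200),
])

# Interactive collocate query: all right-hand neighbours of "strong",
# ranked by frequency.
for w2, f in con.execute(
        "SELECT w2, f FROM bigrams WHERE w1 = ? ORDER BY f DESC",
        ("strong",)):
    print(w2, f)
```

Because the query only touches the index range for one `w1` value, response times stay interactive even when the table holds hundreds of millions of rows on ordinary desktop hardware.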
Prof. Dr. Stefan Evert is professor of English Computational Corpus Linguistics at Technische Universität Darmstadt, Germany. He earned a PhD in Computational Linguistics from the University of Stuttgart in 2004 and was an assistant professor of Computational Linguistics at the Institute of Cognitive Science, University of Osnabrück from 2005 to 2011. His main interests lie at the boundary between linguistic research, statistical corpus analysis and natural language processing. Current research topics include the methodological foundations of corpus linguistics, collocations and multiword expressions, distributional semantics and multi-dimensional analysis of language variation.