Language Generalization and Corpora

In the previous parts you may already have gained some experience with linguistic databases, or corpora as they are commonly called. Basically, a corpus is just a (large) collection of texts (a newspaper could therefore already be called a corpus), but to make the texts useful for linguistic research they are usually structured in some way. They may also contain extra information that is helpful to linguists. In these exercises you are given two well-known corpora to use in answering the questions. One is the Brown corpus, the first of its kind, which many others have followed. It consists of approximately one million words. Compiled in the 1960s, its texts were carefully selected samples of then-current American English, taken from sources such as newspapers, books on religion, science and hobbies, and different kinds of fiction. The version you are given is a large part of it that is also part-of-speech (POS) tagged. The tagset used for this can be viewed by clicking the 'explanation' button. It is a good idea to keep this list at hand when working with the Brown corpus.

You may search the Brown corpus using the interface below. The 'start at' and 'end at' fields refer to sentence numbers, so you can use them to select a specific section of the corpus, or you can note down the numbers of interesting sentences so that you can quickly return to them later. Setting both fields to the same value will return only one sentence. If you fill in a word in the 'Word filter' field, only sentences containing that word will be returned. If you fill in several words separated by commas, only sentences that contain all of them will be returned. If you fill in several words separated by bars (|), sentences that contain at least one of them will be returned. A star (*) functions as a wildcard: for instance, entering 'the * car' will return strings like 'the red car', 'the blue car', 'the old car', etc. You may also attach the star to a word: 'car*', for instance, could return such things as 'carrier', 'carts' and 'caring'. In the 'Categorial filter' field you can enter tag names. Comma, bar and star work here in the same way. Note that you can combine information from all fields for a more specific search!

Mode/Analysis:
Start at:
End at:
Word filter:
This can be a comma-separated list
Note you can use *
Categorial filter:
This can be a comma-separated list
Note you can use *
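
To make the filter rules described above more concrete, here is a minimal Python sketch of how such a search could work. It is not the code behind the actual interface. The toy sentences, the function names (pattern_to_regex, matches, search) and several details are assumptions made for illustration: sentence numbers are taken to start at 1, matching is case-sensitive, a comma is applied before a bar when both occur in one filter, and a free-standing star stands for exactly one word.

```python
import re

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
SENTENCES = [
    [("the", "AT"), ("red", "JJ"), ("car", "NN"), ("stopped", "VBD")],
    [("a", "AT"), ("carrier", "NN"), ("arrived", "VBD")],
    [("they", "PPSS"), ("bought", "VBD"), ("the", "AT"), ("old", "JJ"), ("car", "NN")],
    [("dogs", "NNS"), ("bark", "VB")],
]


def pattern_to_regex(pattern):
    """Turn a filter such as 'the * car' or 'car*' into a compiled regex."""
    parts = []
    for token in pattern.split():
        if token == "*":
            parts.append(r"\S+")  # a free-standing star: exactly one word
        else:
            # a star attached to a word: any (possibly empty) run of characters
            parts.append(re.escape(token).replace(r"\*", r"\S*"))
    # The lookarounds make sure only whole words (or whole tags) are matched.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")


def matches(items, filter_text):
    """Check one sentence (given as a list of words or of tags) against a filter."""
    if not filter_text.strip():
        return True  # an empty filter accepts every sentence
    text = " ".join(items)
    for part in filter_text.split(","):  # comma: every part must be found
        alternatives = [a.strip() for a in part.split("|") if a.strip()]
        # bar: one matching alternative is enough
        if not any(pattern_to_regex(a).search(text) for a in alternatives):
            return False
    return True


def search(sentences, start=1, end=None, words="", tags=""):
    """Return the numbers of the sentences in start..end that pass both filters."""
    end = len(sentences) if end is None else min(end, len(sentences))
    hits = []
    for number in range(max(start, 1), end + 1):
        sentence = sentences[number - 1]
        if matches([w for w, _ in sentence], words) and matches([t for _, t in sentence], tags):
            hits.append(number)
    return hits


print(search(SENTENCES, words="the * car"))     # [1, 3]
print(search(SENTENCES, words="car*"))          # [1, 2, 3]
print(search(SENTENCES, words="car, stopped"))  # [1]
print(search(SENTENCES, words="carrier|dogs"))  # [2, 4]
print(search(SENTENCES, start=2, end=4, words="car", tags="JJ"))  # [3]
```

In this sketch the word filter and the categorial filter are checked independently of each other, so a sentence is returned when it passes both. Whether the real interface ties a tag to one particular word is something you can easily find out by trying a few searches.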

A second corpus that is available is a special one: the Penn Treebank. Treebanks are corpora in which each sentence has been annotated with structural information. The usual way to represent structural information in linguistics is as a tree. The Penn Treebank can therefore supply you with a little extra information compared to a corpus with only a POS tag annotation layer (such as the Brown corpus). It is up to you, however, to decide whether you actually need this extra information (and extra complexity!). The Penn Treebank has a tagset of its own.

The Penn Treebank may be searched with the form at the bottom. Note that apart from searching for words with the filter, you can also search for (sub)trees in bracket notation. For instance, let's say you want all sentences with the verb join, but only where it is an infinitive (the same word form could also be finite, for instance any present plural form of join). To achieve this, set the 'Filter Application' field to 'Tree representation' and enter the string '(VB join)' in the Filter field. You can find the 'VB' tag by taking a look at the tagset. Note that you can use comma, dot and bar just as with the Brown corpus.

Filter:
Filter Application:
View mode:
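
The difference between a word search and a tree search, as described above, can be illustrated with a small Python sketch. It does not show how the search form itself works. The three example trees are made up for illustration (they are not quoted from the treebank), the function names (parse, subtrees, leaves, contains) are hypothetical, and the sketch assumes that a tree filter has to be identical to some subtree of the sentence.

```python
import re

# Hypothetical example trees in bracket notation (not taken from the treebank).
TREES = [
    "(S (NP (PRP They)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))))",
    "(S (NP (PRP They)) (VP (VBP join) (NP (DT the) (NN club))))",
    "(S (NP (PRP She)) (VP (VBD left)))",
]


def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token != "(":
            return token  # a bare word, i.e. a leaf of the tree
        node = [tokens[position]]  # the label follows the opening bracket
        position += 1
        while tokens[position] != ")":
            node.append(read())
        position += 1  # skip the closing bracket
        return node

    return read()


def subtrees(node):
    """Yield the node itself and every labelled node below it."""
    if isinstance(node, list):
        yield node
        for child in node[1:]:
            yield from subtrees(child)


def leaves(node):
    """Return the words of a tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def contains(tree_text, query_text):
    """True if the query tree is identical to some subtree of the sentence tree."""
    query = parse(query_text)
    return any(sub == query for sub in subtrees(parse(tree_text)))


# A plain word search finds 'join' in both the first and the second tree ...
print([n for n, t in enumerate(TREES, 1) if "join" in leaves(parse(t))])  # [1, 2]
# ... but the tree filter '(VB join)' keeps only the infinitive.
print([n for n, t in enumerate(TREES, 1) if contains(t, "(VB join)")])    # [1]
print([n for n, t in enumerate(TREES, 1) if contains(t, "(VBP join)")])   # [2]
```

The same function also accepts larger filters: in this sketch, '(NP (DT the) (NN board))' is found in the first tree only.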

Try any of the exercises you like, or turn to the Playground sections to play with the tools without having to bother about exercise restrictions. New users are advised to start there, because the Playground sections contain short explanations of how the tools work, though we have tried to design the tools so that most of you will have no trouble finding things out on your own. Both students and teachers can use the contact form (at the bottom of the menu) to give us their opinion on these tools. You may tell us anything: what you like and don't like about the tools, what you would like to see included, what could use a better explanation, etc. Only with your feedback will it be possible to improve things in future upgrades.