Friday, October 16, 2020

An Introduction to Corpus Linguistics


The word corpus derives from a Latin word meaning ‘body’ or ‘mass’, and the Oxford Dictionary defines the term as “a collection of written or spoken texts.” A corpus, then, is a collection of written or spoken text, especially the entire works of a particular writer or a body of writing on a particular subject. Linguistics is the scientific study of language and its structure; corpus linguistics is the scientific study of language based on text corpora. This introduction to corpus linguistics examines various aspects of corpora and of the field itself.

Language scholars and linguists have used text collections for their analyses since ancient times. In the Middle Ages, people made lists of the words in a particular text, together with their contexts. Some scholars also compiled lists of the most frequently used words. ‘Early corpus linguistics’ is a term used by McEnery and Wilson for all corpus-based work in linguistics before Chomsky. Such work was carried out up to the 1950s, when structuralism was the dominant approach in linguistics. Well-known structuralist linguists such as Bloomfield, Sapir, Harris, and Boas used methods of language analysis that could be called corpus-based, although the term corpus linguistics was not in use at that time and was adopted only later. Today, corpus-based methods serve as the basis for a variety of linguistic studies. Researchers collect and analyze naturally occurring linguistic data to describe and document language, to trace language change, to study language acquisition, or to test linguistic hypotheses. This approach treats language form not as a matter of subjective speculation but as an objective fact.

According to the linguists of that time, the sentences of a natural language are finite, so the sentences of a naturally occurring language can be collected and enumerated. On this view, it is possible to gather every occurring sentence of a language in a corpus. McEnery and Wilson also describe this point.

Corpus Linguistics

The status of corpora in this earliest phase is illustrated by Käding, who in 1897 used a large corpus of approximately 11 million words of German to compare the frequency distributions of letters and letter sequences and to derive spelling conventions from them. Managing so many words without any technical aid seems very difficult from today’s perspective. In the same period, from 1876 to 1926, language acquisition research was based mainly on diaries in which parents recorded the language of their children. Fries and Traver (1940) and Bongers (1947) used corpus-based data for research on foreign language pedagogy. The vocabulary lists used at that time were derived from corpora, based on the research of Thorndike (1921) and West and Palmer (1933). In the 1920s, there was a tradition of counting words in texts to find the words and grammatical structures most frequent in pedagogical usage, for language teaching purposes. These studies were carried out manually and followed the direction of comparative stylistic analysis, or quantitative comparison of the varieties of English.

Eaton (1940) describes word frequencies in German, Dutch, French, and Italian, and the study is still considered sophisticated today. Lodge followed Eaton’s example and used semantic frequency lists. Fries presented an early descriptive grammar based on telephone conversations in 1952. This grammar is notable for not relying on written data, and its first release remains inspiring.


Chomsky’s Criticism of Corpus Linguistics

Various linguists have criticized corpus linguistics. Noam Chomsky, in particular, questioned what findings drawn from corpus work can reveal about the nature of language. His reasoning and influence changed the linguistic paradigm of the time, and the field of corpus linguistics declined for a period. Later it was reborn, and it now flourishes with new zeal as a branch of linguistics.

Corpus linguistics is a relatively new branch of linguistics whose primary function is to observe and study samples of naturally occurring language for a better understanding of language use. With the help of a computer, it is easy to analyze and examine millions of language samples. Computational technologies give us information about frequency of use, vocabulary, and sentence structure, about how words occur in various sentence structures, and about the language of different genres. The results are valuable to language scholars in both teaching and research, who can obtain information relevant to their own domains.


Corpus Linguistics: A Methodology or a Field of Study?

Some linguists consider corpus linguistics a methodology and focus on the group of methods it offers for organizing linguistic data and studying a language. On this view, it is a means of becoming more rigorous in a specific area of study. Most people, however, regard it simply as the investigation of linguistic phenomena using spoken and written corpora. It is involved in the compilation of dictionaries and the collection of texts, as well as the use of various software for collecting and managing language data.

Corpus linguistics is not a monolithic but a heterogeneous field of study. It does not always follow agreed methods and procedures for the exploration of language, so even generalizations about what counts as corpus linguistics need qualification. Its subfields also differ from one another and vary in their approaches to the use of corpora.

But what generalization can be made about the term corpus linguistics? Linguists tend to define it as the study of language through some set of machine-readable texts that is considered an appropriate basis on which to study a specific set of research questions. These corpora are typically so large that they defy analysis by hand and eye alone within a reasonable time frame, which is why the texts must be machine-readable: a computer is used to read, search, and manipulate the data. Working through such a vast database manually is not feasible, because a human analyst or a team of analysts would have to search through the text, and it is difficult to search a large corpus by hand without error.


Collections of spoken and written texts are called corpora. Corpora are compiled for linguistic study and investigation and follow specific criteria that make them relevant for general or particular research on language. They are stored in digital form so that the data can be searched and analyzed quickly. The very first corpora of written language contained about one million words, and spoken corpora were generally much smaller. Modern corpora handle far larger numbers of words efficiently. For example, the British National Corpus (BNC) contains about 100 million words and includes both written and spoken language.

Corpora are mainly used through software tools that let users search them rapidly and reliably. These tools allow users to look for words in context and to produce and check frequency lists, which show how many times a word occurs in a text, a section, or the whole corpus. Concordances support both qualitative and quantitative analysis, and both are equally important to corpus linguistics.
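The frequency lists mentioned above can be illustrated with a minimal Python sketch. The tokenizer, the function names, and the sample sentence are assumptions made for illustration only; real corpus tools use far more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text, top_n=5):
    """Return the top_n most frequent words together with their counts."""
    return Counter(tokenize(text)).most_common(top_n)

# Hypothetical two-sentence "corpus" used only for demonstration.
sample = "The cat sat on the mat. The dog sat on the log."
for word, count in frequency_list(sample, top_n=3):
    print(word, count)
```

On a real corpus the same idea applies, only with millions of tokens and usually with extra steps such as lemmatization.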


Corpus linguistics is not the study of one particular aspect of language, which makes it a distinctive type of research in linguistics. It is a field still in its developing phase, and it concentrates on a set of methods and procedures for studying language. As students of linguistics, we also observe that one school of thought considers it a methodology for organizing linguistic data while another group opposes this view. Its procedures and mechanisms are still under development and remain unclearly delineated. This does not mean that everything in corpus linguistics is unclear or ambiguous: the concordance, for example, is a well-established core of the approach. It follows that corpus-based studies can be conducted in many areas of linguistics. Corpus linguistics can also reorganize our whole approach to the study of language, and it may redefine and refine different theories of language. Moreover, its development has provided new opportunities and facilities for exploring novel approaches in the field of communication.


Methods in Corpus Linguistics

Various research methods have emerged with the help of corpus linguistics. These methods try to trace a path from data to theory. Wallis and Nelson introduced the 3A perspective of annotation, abstraction, and analysis. Annotation is the application of a scheme to a text; it includes parsing, part-of-speech tagging, and structural markup, as well as various other representations. Abstraction is the translation of the terms in the scheme to terms in a theoretically motivated model or data set; it typically involves a linguist-directed search and may also include rule learning by a program. Analysis consists of statistically probing, manipulating, and generalizing from the data set; it may include optimization of a rule base, knowledge discovery methods, and statistical evaluation. Most modern lexical corpora are part-of-speech tagged (POS-tagged). Even corpus linguists who work with ‘plain unannotated text’ necessarily apply some method to isolate salient features. In such cases, annotation and abstraction are combined in a lexical search.
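The three steps of annotation, abstraction, and analysis can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the hand-made LEXICON, the tag names, and the three sample sentences are all hypothetical, and a real project would use a trained part-of-speech tagger rather than a dictionary lookup.

```python
from collections import Counter

# Hypothetical toy lexicon; a real project would use a trained POS tagger.
LEXICON = {"the": "DET", "a": "DET", "big": "ADJ", "large": "ADJ",
           "dog": "NOUN", "house": "NOUN", "barked": "VERB", "stood": "VERB"}

def annotate(sentence):
    """Annotation: attach a POS tag to each token (UNK if not in the lexicon)."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in sentence.lower().split()]

def abstract(tagged):
    """Abstraction: reduce tagged text to the adjective-noun pairs it contains."""
    return [(w1, w2) for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if t1 == "ADJ" and t2 == "NOUN"]

def analyze(pairs):
    """Analysis: count how often each adjective directly precedes a noun."""
    return Counter(adj for adj, _ in pairs)

# Hypothetical three-sentence corpus used only for demonstration.
corpus = ["The big dog barked", "A large house stood", "The big house stood"]
pairs = [pair for sentence in corpus for pair in abstract(annotate(sentence))]
print(analyze(pairs))
```

Keeping the three steps as separate functions mirrors the idea that each stage can be refined independently, for example by swapping in a better tagger without touching the analysis.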

Discourse Analysis in Corpus Linguistics

Hassan (2004) emphasizes the importance of data-driven research that investigates the relationship between language and society. This is the methodological approach of corpus linguistics, based on collecting and documenting real-life language data. Modern work of this kind is often traced to the classic work of Henry Kučera and Nelson Francis, who published “Computational Analysis of Present-Day American English” on the basis of the Brown Corpus. Corpus linguistics throws light on the importance of studying patterns of real language use in linguistic research. Corpus linguists prefer observation and analysis of language grounded in large amounts of authentic text. Corpora help in deriving empirical knowledge about language use that can supplement information from reference sources and introspection.


Quantitative and Qualitative Analysis in Corpus Linguistics

Corpus-based studies rely on quantitative techniques. For example, someone who wants to learn how the words large and big are used needs to know how many times each word appears in the corpus, how many different words co-occur with each of these adjectives (their collocations), and how common each of those collocations is. These are all quantitative measurements. “A crucial part of the corpus-based approach is going beyond the quantitative patterns to propose functional interpretations explaining why the patterns exist. As a result, a large amount of effort in corpus-based studies is devoted to explaining and exemplifying quantitative patterns.” (Douglas Biber, Susan Conrad, and Randi Reppen, Corpus Linguistics: Investigating Language Structure and Use, Cambridge University Press, 2004)
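The big/large comparison described above can be tried out with a short Python sketch. The function name, the tokenizer, and the sample text are assumptions made for illustration; the sketch counts only the word immediately to the right of each adjective and, for simplicity, ignores sentence boundaries.

```python
import re
from collections import Counter

def right_collocates(text, node):
    """Count the words that occur immediately after `node` in the text.

    Simplification: sentence boundaries are ignored, so a word that starts
    the next sentence would also be counted as a collocate.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)

# Hypothetical mini-corpus used only for demonstration.
sample = ("A big deal was made of a large number of errors. "
          "It was a big mistake, and a large amount of work remained. "
          "A large number of students made a big deal of it.")

for adjective in ("big", "large"):
    counts = right_collocates(sample, adjective)
    # Total occurrences followed by a word, then the two most common collocates.
    print(adjective, sum(counts.values()), counts.most_common(2))
```

The counts are the quantitative step; deciding why big pairs with deal while large pairs with number is the interpretive step that the quotation from Biber, Conrad, and Reppen calls for.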

Corpus linguistics uses quantitative and qualitative approaches, mostly in combination. It is characteristic of corpus linguistics to start with quantitative results and move toward qualitative findings. The procedure has a cyclic element: quantitative findings are subjected to qualitative scrutiny, which helps explain why a specific frequency pattern occurs. On the other hand, a qualitative analysis may be the means for classifying examples in a particular corpus by their meanings, and this qualitative analysis may then be the input to further quantitative analysis, one based on meaning (Geoffrey Leech, Marianne Hundt, Christian Mair, and Nicholas Smith, Change in Contemporary English: A Grammatical Study. Cambridge University Press, 2012).


One of the fundamental tasks of corpus linguistics is searching text corpora to retrieve linguistic samples and their features. There are multiple computational methods for retrieving and processing this data. Concordance programs are used to present the results. A concordance is a listing of each occurrence of a particular word or term, together with its line number and the name of the file from which it was retrieved. The usual format for the listing is keyword in context (KWIC): the search term is shown in the center of the output window, with a context of fixed or variable length on both sides of it.
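A keyword-in-context listing of the kind described above can be sketched in Python as follows. The function name `kwic`, the `width` parameter, and the sample text are assumptions made for illustration, and the context here is a fixed number of characters rather than words; real concordancers also report the file name and line number of each hit.

```python
import re

def kwic(text, keyword, width=20):
    """Return keyword-in-context lines with the keyword aligned in one column.

    `width` is the number of characters of context kept on each side.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Hypothetical sample text used only for demonstration.
sample = "The corpus is large. A corpus can be small. Every corpus has limits."
for line in kwic(sample, "corpus", width=15):
    print(line)
```

Aligning the keyword in a single column is what lets a reader scan down the list and notice recurring patterns to the left and right of the search term.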



In conclusion, although corpus linguistics is considered a new field of study, a close look reveals traces of it in ancient times. Under different names, it has accompanied the study of language from early on. We find vestiges of its early existence in collections of verbs, phrases, nouns, adjectives, and other elements of language from ancient times. Corpus linguistics provides details about word usage, verbs, adjectives, their frequencies, and so on. With the development of computational devices, it has become a more efficient and dominant branch of linguistics. With its help, researchers can now find what they are searching for within seconds, and work that was once exhausting has become easy thanks to computers and digital devices. Corpus linguistics is in continual development as new areas are brought into the study of language.


Written & Reviewed by: Muhammad Asad Kasra, M.Phil Scholar, Riphah Institute of Language and Literature, Riphah International University Lahore, Pakistan.

