Where Linguistics meets Natural Language Processing – Part I

For anyone working with Natural Language Processing/Understanding (NLP/NLU), a formal understanding of how languages are structured, beyond simply being able to speak and understand them, is highly valuable. Part I of this two-part article gives a simple explanation of the basic concepts of linguistics, how they evolved, and how language is studied. Part II will describe how linguistics connects to the NLP/NLU world.

Part I: about linguistics

At the beginning of the 20th century, between 1906 and 1911, Ferdinand de Saussure, a Swiss linguist and semiotician, lectured on linguistics at the University of Geneva. In 1916, after Saussure’s death, his former students Charles Bally and Albert Sechehaye published Course in General Linguistics, based on notes taken at Saussure’s lectures. This is the moment when modern linguistics, specifically structural linguistics, was born.

Before Saussure, languages and their structures were studied by many people in different ways. The earliest known record comes from India, around the 6th century BC, when Panini formulated roughly 4,000 rules of Sanskrit morphology.

In Ancient Greece, Plato, in his dialogue Cratylus, describes words as eternal concepts that exist in the world of ideas. When the University of Alexandria was founded in 280 BC and Greek was taught to speakers of other languages, the word grammar (“téchnē grammatikḗ”, Τέχνη Γραμματική) was first used; it meant “the art of writing”.

In the Middle Ages, languages were studied under the name of philology. In the 19th century, Jacob Grimm wrote Deutsche Grammatik, often considered the first great scientific work of linguistics. Wilhelm von Humboldt defined human language as a rule-governed system in which a finite set of grammatical rules can produce an infinite number of sentences.
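Humboldt's idea of finite rules producing unlimited sentences can be illustrated with a small sketch. The grammar below is a hypothetical toy invented for this example, not a description of English or of Humboldt's own formalism. Because a noun phrase may contain a verb phrase, which may in turn contain another noun phrase, five rule sets are enough to produce sentences of any length; the `max_depth` parameter is an assumption added only so that the demonstration always terminates.

```python
import random

# Hypothetical toy grammar (an assumption for illustration only).
# Each symbol maps to a list of possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"], ["the", "N", "that", "VP"]],  # second rule is recursive
    "VP": [["V"], ["V", "NP"]],                        # second rule is recursive
    "N": [["linguist"], ["student"], ["teacher"]],
    "V": [["reads"], ["teaches"], ["knows"]],
}


def generate(symbol, rng, depth=0, max_depth=5):
    """Expand a symbol into a list of words by applying rules recursively."""
    if symbol not in GRAMMAR:
        return [symbol]  # a terminal word: nothing left to expand
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, use only the first (non-recursive) rule
        # so that the expansion is guaranteed to stop.
        options = options[:1]
    words = []
    for part in rng.choice(options):
        words.extend(generate(part, rng, depth + 1, max_depth))
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same sentences
for _ in range(3):
    print(" ".join(generate("S", rng)))
```

Raising `max_depth` allows longer and more deeply nested sentences without adding a single rule, which is the point of the finite-rules observation.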

Returning to modern linguistics, Saussure established the essential concept of the sign, which is the combination of a signified and a signifier.

  • The signified is the idea or concept.
  • The signifier is the means of expressing the signified.

Put simply, the signified is what lies behind the word (the concept or idea), and the signifier is how we express it, for example written with letters or pronounced with sounds.

For Saussure, signs should be studied synchronically, that is, as a system at a single point in time, because a sign can be defined only in contrast with other signs.

After structuralism, many different linguistic movements developed. Let’s look at two of the most important ones:

  • Generativism

In the second half of the 20th century, Noam Chomsky founded generativism. Generativism centers on syntax, but it also addresses other aspects of language structure, such as phonology and morphology.

According to Chomsky, the first task of linguists is to describe universal grammar, a set of syntactic rules shared by all humans and underlying the grammars of all human languages. Generativism holds that universal grammar is innate to the human brain.

  • Functionalism

In contrast to Chomsky, Michael Halliday published An Introduction to Functional Grammar in 1985. Halliday’s studies are based on social interaction and cover a broad range of subjects, linguistics among them.

Halliday defines a language as a social semiotic system. It evolves as a system of meaning potential, that is, as a set of resources that shape what a speaker can do with language in a particular social context. In other words, a language is more than a set of sentences: it is the exchange of meaning in interpersonal (social) contexts, based on choices made by the speakers.

The five levels of language analysis

Languages are usually analyzed at five levels. We will describe them in order, from the smallest unit of study to the largest.

  1. Phonetics and phonology study the sounds of human speech. Phonetics studies the sounds themselves: how they are produced, transmitted, and received. Phonology classifies those sounds within the system of a particular language. Its unit of study is the phoneme, defined as a perceptually distinct unit of sound that distinguishes one word from another in a specific language.
  2. Moving up to the next unit of language, morphology is the study of words: how they are formed and how they relate to each other. According to morphology, words are built from morphemes, the smallest units of meaning. For example, “worker” is formed from the root “work” plus the suffix “er”, which turns it into a noun.
  3. Syntax is the branch of linguistics that studies phrases, specifically the set of rules and principles that govern the structure of phrases and sentences. The study of syntax looks at the ways in which words can be ordered and combined to convey meaning.
  4. Semantics deals with how meaning works in human languages at a basic level: it is concerned mainly with the literal meaning of words as parts of the language system.
  5. Finally, pragmatics concentrates on how that basic meaning is used in practice and how context affects it. In this way, pragmatics explains how speakers overcome ambiguities, since meaning also depends on the manner, place, time, previous knowledge, etc. of a communication.
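To make the morphology level more concrete, here is a minimal sketch of how the “work” plus “er” example could be expressed in code. The suffix table, the root list, and the function name `split_morphemes` are all hypothetical and chosen only for illustration. Real morphological analysis relies on full lexicons and has to deal with spelling changes and irregular forms, which this sketch ignores.

```python
# Toy suffix table: an illustrative assumption, not a real morphological lexicon.
SUFFIXES = {
    "er": "agent noun (one who performs the action)",
    "ing": "progressive or gerund form",
    "ed": "past tense",
}


def split_morphemes(word, known_roots):
    """Return (root, suffix) if the word is a known root plus a known suffix.

    Returns (word, None) when no such split is found.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            # Accept the split only if the remainder is a known root,
            # so that "water" is not read as "wat" + "er".
            if root in known_roots:
                return root, suffix
    return word, None


roots = {"work", "teach", "walk"}
print(split_morphemes("worker", roots))  # ('work', 'er')
print(split_morphemes("walked", roots))  # ('walk', 'ed')
print(split_morphemes("water", roots))   # ('water', None)
```

Checking the remainder against a list of known roots is what separates a morpheme boundary from a coincidental string ending, which is why “water” is left unsplit here.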

