!pip install datasets wiktionaryparser spacy -Uq
Language Hacking: Using spaCy for Morphosyntactic Analysis#
This workshop is inspired by this article written by Tufts professor, Dr. Gregory Crane. Crane teaches in the Classical Studies department, meaning that he studies historical languages like Ancient Greek and Latin, so you might be wondering how natural language processing could be relevant to such a discipline. The answer is in language hacking, the process of taking a language which you may or may not know and using pretrained language models to give you a deeper understanding of the text.
Professor Crane is a ‘digital philologist.’ Philology (from φιλολογία, the “love of words”) is the study of language in historical sources, which can include everything from ancient literature to contemporary song lyrics. As a result, philologists are interested in sources in many different languages, but no one, no matter how good a philologist, can learn every language they might be interested in.
This is where language hacking comes in. Deep learning models like the ones we’ll play around with today offer new opportunities for research, language-learning and cross-cultural exchange.
What is spaCy?#
Above I mentioned that we would be using a language model to tell us about the meaning of words in languages we don’t know. In this lesson, we’ll download these models from a Python package called spaCy. spaCy is a very powerful package with a lot of functionality. We’ll only be using a small part of what it offers: its pretrained language models.
Data and Model Preparation#
Before we can start language hacking, we need to set up our texts and models. For this example, I’ll be using the original French version of Alexandre Dumas’ The Three Musketeers (Les trois mousquetaires). As a result, we’ll also be using the French spaCy model, which we’ll need to download.
from datasets import load_dataset
dataset = load_dataset("pnadel/les_trois_mousquetaires")
data = dataset["train"].to_pandas()
data
| | chapter | text |
|---|---|---|
0 | DANS LAQUELLE IL EST ÉTABLI QUE, MALGRÉ LEURS ... | Il y a un an à peu près qu’en faisant à la Bib... |
1 | LES TROIS PRÉSENTS DE M. D’ARTAGNAN PÈRE | Le premier lundi du mois d’avril 1625, le bour... |
2 | L’ANTICHAMBRE DE M. DE TRÉVILLE | M. de Troisville, comme s’appelait encore sa f... |
3 | L’AUDIENCE | M. de Tréville était pour le moment de fort mé... |
4 | L’ÉPAULE D’ATHOS, LE BAUDRIER DE PORTHOS, | D’Artagnan, furieux, avait traversé l’anticham... |
... | ... | ... |
63 | L’HOMME AU MANTEAU ROUGE | Le désespoir d’Athos avait fait place à une do... |
64 | JUGEMENT | C’était une nuit orageuse et sombre, de gros n... |
65 | L’EXÉCUTION | Il était minuit à peu près; la lune, échancrée... |
66 | CONCLUSION | Le 6 du mois suivant, le roi, tenant la promes... |
67 | ÉPILOGUE | La Rochelle, privée du secours de la flotte an... |
68 rows × 2 columns
data.iloc[0]['text'][:1000]
'Il y a un an à peu près qu’en faisant à la Bibliothèque\r\nroyale des recherches pour mon histoire de Louis XIV, je tombai\r\npar hasard sur les Mémoires de M. d’Artagnan, imprimés,—comme\r\nla plus grande partie des ouvrages de cette époque,\r\noù les auteurs tenaient à dire la vérité sans aller faire un tour\r\nplus ou moins long à la Bastille,—à Amsterdam, chez Pierre\r\nXVIII\r\nRouge. Le titre me séduisit: je les emportai chez moi, avec la\r\npermission de M. le conservateur, bien entendu, et je les dévorai.Mon intention n’est pas de faire ici une analyse de ce curieux\r\nouvrage, et je me contenterai d’y renvoyer ceux de mes lecteurs\r\nqui apprécient les tableaux d’époque. Ils y trouveront des portraits\r\ncrayonnés de main de maître; et, quoique ces esquisses\r\nsoient, pour la plupart du temps, tracées sur des portes de caserne\r\net sur des murs de cabaret, ils n’y reconnaîtront pas\r\nmoins, aussi ressemblantes que dans l’histoire de M. Anquetil,\r\nles images de Louis XIII, d’Anne d’Autriche, de Richeli'
# downloading the French model from spaCy
!python -m spacy download fr_core_news_md
# loading the model
import spacy
nlp = spacy.load("fr_core_news_md")
nlp # working!
Apply the Model#
Now that we have our data and our model, we can apply the one to the other. The nlp object that we made above can be called like a function on some text. See below for a simple example.
Introduction to Morphosyntax#
This section serves as an introduction to some important NLP vocabulary and Python syntax for using spaCy.
# this is a spaCy `Doc` object
example = nlp("Je m'appelle Peter. J'aime les jeux vidéos.") # "My name is Peter. I like video games"
type(example)
spacy.tokens.doc.Doc
# .text just gives us the text as a string
example.text, type(example.text)
("Je m'appelle Peter. J'aime les jeux vidéos.", str)
# iterable to access each sentence in the original text
example.sents, type(example.sents)
(<generator at 0x793cd347f880>, generator)
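Because `sents` is a generator, it can only be consumed once; if you need to index into the sentences or loop over them more than once, wrap it in `list()` first. A quick plain-Python sketch of that exhaustion behavior, using a hypothetical `sentences` generator as a stand-in:

```python
def sentences():
    # a stand-in for example.sents: any generator behaves this way
    yield "Je m'appelle Peter."
    yield "J'aime les jeux vidéos."

gen = sentences()
first_pass = list(gen)   # consumes the generator: two sentences
second_pass = list(gen)  # the generator is now exhausted: empty list
```

This is why later in the notebook we write `list(first_chapter_doc.sents)[0]` to grab a first sentence; `first_chapter_doc.sents[0]` would raise a `TypeError`, since generators do not support indexing.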
# iterating through each sentence
for sent in example.sents:
print(sent.text)
Je m'appelle Peter.
J'aime les jeux vidéos.
# iterating through each token
for token in example:
print(token.text)
Je
m'
appelle
Peter
.
J'
aime
les
jeux
vidéos
.
# iterating through each token
# AND getting some morphosyntactic information
for token in example:
print(token.text, token.pos_, token.dep_, token.lemma_, sep='\t\t')
Je PRON nsubj je
m' PRON expl:comp me
appelle VERB ROOT appeler
Peter PROPN xcomp Peter
. PUNCT punct .
J' PRON nsubj je
aime VERB ROOT aimer
les DET det le
jeux NOUN obj jeu
vidéos PROPN amod vidéo
. PUNCT punct .
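Beyond `pos_`, `dep_` and `lemma_`, each token also carries morphological features through `token.morph`, which renders as a pipe-separated string of `Feature=Value` pairs (e.g. `Mood=Ind|Number=Sing|Person=3` for a conjugated verb, as we will see later on). To make concrete what such an annotation contains, here is a small hypothetical helper of my own that unpacks one of these strings into a dict, with no model required:

```python
def parse_morph(morph_str):
    """Split a spaCy-style morphology string, e.g.
    'Mood=Ind|Number=Sing|Person=3', into a feature dict."""
    if not morph_str:
        return {}
    return dict(pair.split("=", 1) for pair in morph_str.split("|"))

features = parse_morph("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
```

In practice, spaCy can do this for you: `token.morph.to_dict()` returns the same kind of feature dict directly.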
See below for a glossary of relevant terms for the rest of this notebook.
Lemma: A lemma is the root form of a word. For example, the words “ran”, “running” and “runs” all come from the root “run”. In this case, “run” would be the lemma of “ran”, “running” and “runs” (accessed through the `lemma_` property).

Part of speech: You might be familiar with part of speech as the function that a word takes in a sentence, but there are a couple of different standards for representing this information.

UPOS or Universal part of speech: This is the “normal” part of speech that you likely saw while learning English (accessed through the `pos_` property).

XPOS or Language-specific part of speech: These are part-of-speech tags that might change depending on language. Despite the name, they are not always unique to one language; in fact, they can be shared between languages, but they are often much more specific about the part of speech of a word (accessed through the `tag_` property).

Morphology: To quote the spaCy docs, “Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech.” So, where a lemma is the root form of a word, morphological features are what are added to the lemma to create grammatically correct variants of the same lemma (accessed through the `morph` property).

Sentence relation or Sentence dependency: Each word, in addition to its part of speech and lemma, can be identified by what words it depends on or what words depend on it (or both). For example, the word “apple” is a NOUN, yet it can be used either as a subject (“The apple is red”) or an object (“I ate the apple”). The UPOS tag would be the same for each, but the sentence relation, that is, what the word is doing in the sentence, would be different (accessed through the `dep_` property). You can find a list of dependencies here: English or French.

Treebanks: A sentence is made up of a series of words which are dependent on one another. This understanding allows us to construct tree-like structures of sentences. This is similar to diagramming sentences, if you have ever done that. It can be quite useful to use this model when language hacking, as it will give you a better idea about how to progress through a sentence. Below you can see the treebank that spaCy created for the first sentence.
from spacy import displacy
displacy.render(list(example.sents)[0], style='dep', jupyter=True)
As in all treebanks, the verb (“appelle”) depends on only one thing: root, a placeholder which represents the semantic beginning of the sentence. From here, all words depend on the main verb of the sentence. Take a look at the next example to see a variation.
# copulative example
displacy.render(nlp("Je suis fatigué"), style='dep', jupyter=True) # "I am tired" in English
In this treebank, there is no word with the VERB tag. Instead, the root word is “fatigué” or “tired” in English. This is because the grammatical verb, “suis” or “am” in English, is tagged as a `cop` or copulative (this is sometimes called a ‘linking’ verb in English education, as copula > Lat. co-, together and apere, fasten). These verbs only join a subject to an adjective but do not indicate any action. It is for this reason that in many languages they are left out (cf. A. Gk. “μακρός ὁ οἴκος”, meaning “the house is large”, but literally “the house large”). Even in English, certain dialects like African American Vernacular English (AAVE) sometimes do not express copular verbs. For all of these reasons, they are marked differently than other verbs in treebanks.
Using the Data#
Now that we have a grasp on the core vocabulary, we can begin to delve into real French literature.
first_chapter = data.iloc[0]['text']
first_chapter_doc = nlp(first_chapter)
first_sentence = list(first_chapter_doc.sents)[0]
first_sentence.text
'Il y a un an à peu près qu’en faisant à la Bibliothèque\r\nroyale des recherches pour mon histoire de Louis XIV, je tombai\r\npar hasard sur les Mémoires de M. d’Artagnan, imprimés,—comme\r\nla plus grande partie des ouvrages de cette époque,\r\noù les auteurs tenaient à dire la vérité sans aller faire un tour\r\nplus ou moins long à la Bastille,—à Amsterdam, chez Pierre\r\nXVIII\r\nRouge.'
In English: A short time ago, while making researches in the Royal Library for my History of Louis XIV., I stumbled by chance upon the Memoirs of M. d’Artagnan, printed—as were most of the works of that period, in which authors could not tell the truth without the risk of a residence, more or less long, in the Bastille—at Amsterdam, by Pierre Rouge.
displacy.render(first_sentence, style='dep', jupyter=True)
This tree is very complex, so let’s break it down using spaCy’s `head`, `rights`, `lefts` and `children` functionalities.
# each token has a "head" word
# this is the word that it depends on
first_sentence[10], first_sentence[10].head
# the head word of faisant is tombai
(faisant, tombai)
# picking out the root verb
# looking for a token whose head is itself
# (spaCy also exposes this directly as `first_sentence.root`)
root = [token for token in first_sentence if token.head == token][0]
root.text, root.dep_, root.pos_
('a', 'ROOT', 'VERB')
# `lefts` returns a generator of all words to the left of a given token
# which have that given token as their head
print('LEFTS')
for t in root.lefts:
print(t.text, t.dep_, t.pos_, sep='\t\t')
print()
# `rights` returns a generator of all words to the right of a given token
# which have that given token as their head
print('RIGHTS')
for t in root.rights:
print(t.text, t.dep_, t.pos_, sep='\t\t')
LEFTS
Il expl:subj PRON
y expl:comp PRON
RIGHTS
an obj NOUN
à advmod ADP
tombai ccomp VERB
. punct PUNCT
# `children` returns a generator of the direct dependents of a given word
# (its lefts and rights combined)
for child in root.children:
    print(child.text, child.dep_, child.pos_, sep='\t\t')
Il expl:subj PRON
y expl:comp PRON
an obj NOUN
à advmod ADP
tombai ccomp VERB
. punct PUNCT
Trying to Read in a Language We Don’t Know#
We can now move on to using these morphosyntactic annotations to read text in a language we don’t know.
# first let's apply the model to the whole text
from tqdm import tqdm
tqdm.pandas() # for a progress bar
data['spacy_docs'] = data['text'].progress_apply(nlp)
100%|██████████| 68/68 [00:51<00:00, 1.32it/s]
# using the Wiktionary API as our dictionary
from wiktionaryparser import WiktionaryParser
parser = WiktionaryParser()
word = parser.fetch('hasard', 'french')
word[0]['definitions'][0]['text'][1:]
['(usually in the singular) (random) chance',
'a coincidence',
'hazard',
'(golf) hazard']
def look_up_definition(word):
    parser = WiktionaryParser()
    entry = parser.fetch(word, 'french')
    try:
        return entry[0]['definitions'][0]['text'][1:]
    except (IndexError, KeyError):
        return 'No definition for this word. It is likely a proper noun.'
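One practical note: every call to `look_up_definition` makes a network request to Wiktionary, and common lemmas like “le” or “être” recur constantly in real text. Memoizing the function with `functools.lru_cache` avoids repeat requests. Here is a sketch of the pattern using a stand-in lookup (no network) so the caching behavior is visible; in practice you would put the decorator on `look_up_definition` itself:

```python
from functools import lru_cache

calls = []  # track how many "expensive" lookups actually run

@lru_cache(maxsize=None)
def cached_lookup(word):
    calls.append(word)               # stand-in for the Wiktionary request
    return f"definition of {word}"   # stand-in return value

cached_lookup("hasard")
cached_lookup("hasard")  # second call is served from the cache
cached_lookup("vérité")
```

The only requirement is that the arguments are hashable, which is no problem for the lemma strings we pass here.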
import pprint
pp = pprint.PrettyPrinter(indent=4)
example_sentence = list(data.iloc[-1].spacy_docs.sents)[4] # picking an easy example sentence, but feel free to alter the index to get a more complex sentence
pp.pprint(example_sentence.text)
'Il entra par le faubourg\r\nSaint-Jacques dans un magnifique apparat.'
# treebank
displacy.render(example_sentence, style='dep', jupyter=True)
Steps to get started with language hacking:
Find the root. This will usually be the verb.
Look at what directly depends on the root. Begin translation.
For each dependent word, look at what depends directly on it. Continue translation.
Make observations on word usage and syntax.
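The steps above amount to a walk outward from the root, visiting each token’s dependents in turn. Here is a spaCy-free sketch of that traversal with a hand-built toy tree; the `Tok` class is hypothetical and merely stands in for spaCy tokens, which expose the same `text`, `dep_` and `children` attributes:

```python
from dataclasses import dataclass, field

@dataclass
class Tok:
    text: str
    dep_: str
    children: list = field(default_factory=list)

# hand-built miniature of "Il entra ... apparat": root first, then dependents
root = Tok("entra", "ROOT", [
    Tok("Il", "nsubj"),
    Tok("faubourg", "obl:arg", [Tok("par", "case"), Tok("le", "det")]),
    Tok("apparat", "obl:mod", [Tok("dans", "case"), Tok("un", "det")]),
])

def walk(token, depth=0):
    """Yield indented 'text/dep' lines, root outward -- the reading
    order suggested by the steps above."""
    yield "  " * depth + f"{token.text}/{token.dep_}"
    for child in token.children:
        yield from walk(child, depth + 1)

lines = list(walk(root))
```

With a real spaCy sentence, calling `walk` on the root token works the same way, since spaCy tokens also provide `text`, `dep_` and `children`.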
# find root and look it up
root = [token for token in example_sentence if token.head == token][0]
root.text, root.dep_, root.pos_, root.morph, look_up_definition(root.lemma_)
('entra',
'ROOT',
'VERB',
Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin,
['(intransitive) to enter'])
From this information, we can see that the root of this sentence is “entra”, that it means “enter” and that it is a 3rd person singular, indicative, past tense verb, which would translate to “entered” in English. Now we can look at the `lefts` and `rights` to find what depends on the root.
print(f'LEFTS: {[r for r in root.lefts]}')
first_dep = [r for r in root.lefts][0]
first_dep.text, first_dep.dep_, first_dep.pos_, first_dep.morph, look_up_definition(first_dep.lemma_)
LEFTS: [Il]
('Il',
'nsubj',
'PRON',
Gender=Masc|Number=Sing|Person=3,
['he (third-person singular masculine subject pronoun for human subject)',
'it (third-person singular subject pronoun for grammatically masculine objects)',
'(impersonal pronoun) Impersonal subject; it'])
From the ‘nsubj’ tag, we can tell that this word is the subject of the verb “entra” and it means “he”. In the context of the story, this “he” refers to King Louis XIII.
print(f'RIGHTS: {[r for r in root.rights]}')
next_dep = [r for r in root.rights][0]
next_dep.text, next_dep.dep_, next_dep.pos_, next_dep.morph, look_up_definition(next_dep.lemma_)
RIGHTS: [faubourg, apparat, .]
('faubourg', 'obl:arg', 'NOUN', Gender=Masc|Number=Sing, ['suburb'])
From the “obl:arg” tag, we can see that the king entered some kind of suburb. Let’s explore this word’s dependencies to find out more.
for descendant in next_dep.children:
if descendant.dep_ != 'dep':
print(descendant.text, descendant.dep_, descendant.pos_, look_up_definition(descendant.lemma_), sep='\t\t')
par case ADP ['through', 'by (used to introduce a means; used to introduce an agent in a passive construction)', 'over (used to express direction)', 'from (used to describe the origin of something, especially a view or movement)', 'around, round (inside of)', 'on (situated on, used in certain phrases)', 'on, at, in (used to denote a time when something occurs)', 'in', 'per, a, an', 'out of (used to describe the reason for something)', 'for']
le det DET ['the (definite article)', 'Used before abstract nouns; not translated in English.', 'Used before the names of most countries, many subnational regions, and other geographical names including names of lakes and streets; not translated into English in most cases.', '(before parts of the body) the; my, your, etc.', '(before units) a, an, per', '(before dates) on']
Saint-Jacques nmod PROPN ['Santiago, Santiago de Compostela (the capital city of Galicia, Spain)']
From this information, we can see that “he [the king] entered by the Saint-Jacques suburb.” We’re almost there!
last_dep = [r for r in root.rights][1]
last_dep.text, last_dep.dep_, last_dep.pos_, last_dep.morph, look_up_definition(last_dep.lemma_)
('apparat', 'obl:mod', 'NOUN', Gender=Masc|Number=Sing, ['pomp, ceremony'])
for descendant in last_dep.children:
if descendant.dep_ != 'dep':
print(descendant.text, descendant.dep_, descendant.pos_, look_up_definition(descendant.lemma_), sep='\t\t')
dans case ADP ['(literal, figurative) in, inside (enclosed in a physical space, a group, a state)', 'to (indicates direction towards certain large subdivisions, see usage notes)', 'in, within (a longer period of time)', '(with respect to time) during', 'out of, from', '(metonymically) in; in the works of', '(colloquial) Used in dans les (“about, around”)']
un det DET ['an, a']
magnifique amod ADJ ['magnificent, splendid, superb']
Perfect! Now we have enough to create a translation of the whole sentence:
“He [the king] entered by the Saint-Jacques suburb in a splendid ceremony.”
Feel free to go back up and follow the same procedure with a different sentence.
Limitations of Language Hacking#
Language hacking is a very useful paradigm for reading text in languages you either don’t know or are learning. It allows scholars to explore traditions and cultures that they would have been excluded from in the past. That said, it comes with some key limitations.
Dictionaries: As we saw above, we relied heavily on the open-source Wiktionary as our dictionary. This will work fine for a language like French with millions of speakers, but for languages that no one or very few people speak, a specialized dictionary will be necessary.

Available language models: The point above about dictionaries also holds for pretrained language models. spaCy has pretrained language models for a number of languages, but there are many, many more languages that they do not support. Training a new model for a new language is possible, but very time-consuming.

Inexact translations: Language hacking is by no means a substitute for a prepared translation. Translators can provide a much more competent rendering of the original language, but language hacking gives scholars an additional method to interrogate linguistic questions in texts from other languages.
Please contact me at peter.nadel@tufts.edu for any questions.