{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A Gentle Introduction to Natural Language Process\n",
"Today, we will take a text, *War and Peace* by Leo Tolstoy, and try to get verify or refute the claims of an outside source, *Tolstoy's Phoenix: From Method to Meaning in War and Peace* by George R. Clay (1998). This is a very common task both in and outside of the digital humanities and will introduce you to the popular NLP Python package, the Natural Language Toolkit (NLTK), and expose you to common methodologies for wrangling textual data. \n",
"\n",
"## Our overall task\n",
"\n",
"We will start by downloading the book and then we will learn how to clean the text, perform basic statistics, create visualizations, and discuss what we found and how to present those results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Goals\n",
"* Understand the rights we have to access books on Project Gutenberg\n",
"* Read in a text from Project Gutenberg\n",
"* Clean textual data using regular expressions\n",
"* Perform basic word frequency statistics\n",
"* Create visualziations of these statistics \n",
"* Discuss how to communicte thses results\n",
"* Return to our research question"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## General methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read in our data\n",
"We will start with a url from Project Gutenberg. All of the texts from Gutenberg are in the public domain, so we won't have to worry about rights, but be aware of who own the intellectual property to a text before you scrape it. In this section, we will break the text up by chapter division. Later we'll do the same but by book division. \n",
"\n",
"#### The requests library\n",
"Here, we are using a library called `requests`. This library is great for HTTP requests, which are like asking for a specific action from the internet. \n",
"\n",
"More details below:\n",
"https://pypi.org/project/requests/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"url = 'https://www.gutenberg.org/cache/epub/2600/pg2600.txt'\n",
"text = requests.get(url).text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8250"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## IMPORTANT: this pipeline will work for all text, but you will need to understand your text\n",
"## There is no one-size-fits-all for text cleaning, especially for extra information, like tables\n",
"## of content and disclaimers\n",
"\n",
"# In this case, we found the first line that states the start of the book. We found this by looking at the actual book.\n",
"# Link to book: https://www.gutenberg.org/cache/epub/2600/pg2600.txt\n",
"\n",
"start = text.find('BOOK ONE: 1805\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n')\n",
"start"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3274769"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"end = text.find('END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE')\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Using the start and end points that we found above, we can filter out all of the text before the start of the book\n",
"## and after its end.\n",
"\n",
"## In Python, we can use square brackets to delimit the new start and new end that we want. As we see below:\n",
"\n",
"text = text[start:end]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a text that we are confident is nothing but the text of the book, we can begin to dissect it into its component parts. First, let's break it up into chapters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To do so, I am using a very versatile package called `re` or regular expressions (regex) to do some advanced string parsing. This regex function, finditer, takes in a pattern and a text and returns all of the times that pattern occurs in the text. These patterns can look very complicated, but, in this case, it is '(Chapter [A-Z]+)', which means: *find all of the times when there is the word 'CHAPTER' followed by a space and then any amount of captial letters*.\n",
"This pattern fits the roman numeral counting of the chapters (ex. CHAPTER I or CHAPTER XII)\n",
"
\n",
"\n",
"Finditer also return the indices where the chapter title begins and ends. So, we can use the ending of one chapter title and the beginning of the next one to get all of the text in between the two chapters. This text is, by definition, the text in that chapter.\n",
"\n",
"You can read more about regex [here](https://librarycarpentry.org/lc-data-intro-archives/04-regular-expressions/index.html) and you can play around with your own regex at [regex101](https://regex101.com/). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"ch_dict = {}\n",
"ch_list = list(re.finditer(r'(CHAPTER [A-Z]+)',text))\n",
"for i, m in enumerate(ch_list):\n",
" if i < len(ch_list)-1:\n",
" ch_dict[f'Chapter {i+1}'] = (m.end(0),ch_list[i+1].start(0)-1)\n",
" else:\n",
" ch_dict[f'Chapter {i+1}'] = (m.end(0), len(text)-1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Chapter 1': (35, 11716),\n",
" 'Chapter 2': (11727, 19652),\n",
" 'Chapter 3': (19664, 28409),\n",
" 'Chapter 4': (28420, 36599),\n",
" 'Chapter 5': (36609, 47877),\n",
" 'Chapter 6': (47888, 55785),\n",
" 'Chapter 7': (55797, 61618),\n",
" 'Chapter 8': (61631, 68528),\n",
" 'Chapter 9': (68539, 80778),\n",
" 'Chapter 10': (80788, 90636),\n",
" 'Chapter 11': (90647, 95709),\n",
" 'Chapter 12': (95721, 103553),\n",
" 'Chapter 13': (103566, 107749),\n",
" 'Chapter 14': (107761, 116408),\n",
" 'Chapter 15': (116419, 125129),\n",
" 'Chapter 16': (125141, 135663),\n",
" 'Chapter 17': (135676, 139826),\n",
" 'Chapter 18': (139840, 153522),\n",
" 'Chapter 19': (153534, 160235),\n",
" 'Chapter 20': (160246, 172138),\n",
" 'Chapter 21': (172150, 187834),\n",
" 'Chapter 22': (187847, 198012),\n",
" 'Chapter 23': (198026, 208352),\n",
" 'Chapter 24': (208365, 217847),\n",
" 'Chapter 25': (217859, 238302),\n",
" 'Chapter 26': (238315, 249685),\n",
" 'Chapter 27': (249699, 258953),\n",
" 'Chapter 28': (258968, 277450),\n",
" 'Chapter 29': (277460, 287532),\n",
" 'Chapter 30': (287543, 303366),\n",
" 'Chapter 31': (303378, 317732),\n",
" 'Chapter 32': (317743, 335287),\n",
" 'Chapter 33': (335297, 342076),\n",
" 'Chapter 34': (342087, 347628),\n",
" 'Chapter 35': (347640, 357850),\n",
" 'Chapter 36': (357863, 376832),\n",
" 'Chapter 37': (376843, 387567),\n",
" 'Chapter 38': (387577, 399996),\n",
" 'Chapter 39': (400007, 405255),\n",
" 'Chapter 40': (405267, 416777),\n",
" 'Chapter 41': (416790, 429464),\n",
" 'Chapter 42': (429476, 436196),\n",
" 'Chapter 43': (436207, 448576),\n",
" 'Chapter 44': (448588, 453728),\n",
" 'Chapter 45': (453741, 464522),\n",
" 'Chapter 46': (464536, 474352),\n",
" 'Chapter 47': (474364, 486140),\n",
" 'Chapter 48': (486151, 499062),\n",
" 'Chapter 49': (499074, 515816),\n",
" 'Chapter 50': (515826, 535228),\n",
" 'Chapter 51': (535239, 554482),\n",
" 'Chapter 52': (554494, 572446),\n",
" 'Chapter 53': (572457, 589592),\n",
" 'Chapter 54': (589602, 601924),\n",
" 'Chapter 55': (601935, 614563),\n",
" 'Chapter 56': (614575, 633355),\n",
" 'Chapter 57': (633368, 643577),\n",
" 'Chapter 58': (643588, 657352),\n",
" 'Chapter 59': (657362, 668384),\n",
" 'Chapter 60': (668395, 677348),\n",
" 'Chapter 61': (677360, 691174),\n",
" 'Chapter 62': (691187, 703199),\n",
" 'Chapter 63': (703211, 714894),\n",
" 'Chapter 64': (714905, 727688),\n",
" 'Chapter 65': (727700, 736157),\n",
" 'Chapter 66': (736170, 747241),\n",
" 'Chapter 67': (747255, 761993),\n",
" 'Chapter 68': (762005, 771296),\n",
" 'Chapter 69': (771306, 788002),\n",
" 'Chapter 70': (788013, 801020),\n",
" 'Chapter 71': (801032, 812316),\n",
" 'Chapter 72': (812327, 823160),\n",
" 'Chapter 73': (823170, 828195),\n",
" 'Chapter 74': (828206, 838301),\n",
" 'Chapter 75': (838313, 845370),\n",
" 'Chapter 76': (845383, 854584),\n",
" 'Chapter 77': (854595, 859683),\n",
" 'Chapter 78': (859693, 867016),\n",
" 'Chapter 79': (867027, 872301),\n",
" 'Chapter 80': (872313, 879240),\n",
" 'Chapter 81': (879253, 885833),\n",
" 'Chapter 82': (885845, 891870),\n",
" 'Chapter 83': (891881, 899229),\n",
" 'Chapter 84': (899241, 906388),\n",
" 'Chapter 85': (906398, 913648),\n",
" 'Chapter 86': (913659, 926615),\n",
" 'Chapter 87': (926627, 941734),\n",
" 'Chapter 88': (941745, 949937),\n",
" 'Chapter 89': (949947, 953867),\n",
" 'Chapter 90': (953878, 963628),\n",
" 'Chapter 91': (963640, 966960),\n",
" 'Chapter 92': (966973, 976016),\n",
" 'Chapter 93': (976027, 987789),\n",
" 'Chapter 94': (987799, 998043),\n",
" 'Chapter 95': (998054, 1015627),\n",
" 'Chapter 96': (1015639, 1023407),\n",
" 'Chapter 97': (1023420, 1031896),\n",
" 'Chapter 98': (1031908, 1036698),\n",
" 'Chapter 99': (1036709, 1046151),\n",
" 'Chapter 100': (1046163, 1057473),\n",
" 'Chapter 101': (1057486, 1065212),\n",
" 'Chapter 102': (1065226, 1071207),\n",
" 'Chapter 103': (1071219, 1080110),\n",
" 'Chapter 104': (1080121, 1088277),\n",
" 'Chapter 105': (1088289, 1098614),\n",
" 'Chapter 106': (1098627, 1099622),\n",
" 'Chapter 107': (1099632, 1105115),\n",
" 'Chapter 108': (1105126, 1110849),\n",
" 'Chapter 109': (1110861, 1116058),\n",
" 'Chapter 110': (1116069, 1122794),\n",
" 'Chapter 111': (1122804, 1134448),\n",
" 'Chapter 112': (1134459, 1141243),\n",
" 'Chapter 113': (1141255, 1151007),\n",
" 'Chapter 114': (1151020, 1157898),\n",
" 'Chapter 115': (1157909, 1163462),\n",
" 'Chapter 116': (1163472, 1174306),\n",
" 'Chapter 117': (1174317, 1182434),\n",
" 'Chapter 118': (1182446, 1187852),\n",
" 'Chapter 119': (1187865, 1195693),\n",
" 'Chapter 120': (1195705, 1203595),\n",
" 'Chapter 121': (1203606, 1209863),\n",
" 'Chapter 122': (1209875, 1218066),\n",
" 'Chapter 123': (1218079, 1222459),\n",
" 'Chapter 124': (1222473, 1232788),\n",
" 'Chapter 125': (1232800, 1236768),\n",
" 'Chapter 126': (1236779, 1243765),\n",
" 'Chapter 127': (1243777, 1250307),\n",
" 'Chapter 128': (1250320, 1258390),\n",
" 'Chapter 129': (1258404, 1270856),\n",
" 'Chapter 130': (1270869, 1277248),\n",
" 'Chapter 131': (1277260, 1285069),\n",
" 'Chapter 132': (1285082, 1292585),\n",
" 'Chapter 133': (1292595, 1302392),\n",
" 'Chapter 134': (1302403, 1306648),\n",
" 'Chapter 135': (1306660, 1312807),\n",
" 'Chapter 136': (1312818, 1325196),\n",
" 'Chapter 137': (1325206, 1335367),\n",
" 'Chapter 138': (1335378, 1350083),\n",
" 'Chapter 139': (1350095, 1368974),\n",
" 'Chapter 140': (1368987, 1375976),\n",
" 'Chapter 141': (1375987, 1384187),\n",
" 'Chapter 142': (1384197, 1402209),\n",
" 'Chapter 143': (1402220, 1410852),\n",
" 'Chapter 144': (1410864, 1417337),\n",
" 'Chapter 145': (1417350, 1424143),\n",
" 'Chapter 146': (1424153, 1435423),\n",
" 'Chapter 147': (1435434, 1443815),\n",
" 'Chapter 148': (1443827, 1456028),\n",
" 'Chapter 149': (1456039, 1461442),\n",
" 'Chapter 150': (1461452, 1471642),\n",
" 'Chapter 151': (1471653, 1478639),\n",
" 'Chapter 152': (1478651, 1486770),\n",
" 'Chapter 153': (1486783, 1495198),\n",
" 'Chapter 154': (1495209, 1506519),\n",
" 'Chapter 155': (1506529, 1513793),\n",
" 'Chapter 156': (1513804, 1519102),\n",
" 'Chapter 157': (1519114, 1526244),\n",
" 'Chapter 158': (1526257, 1532999),\n",
" 'Chapter 159': (1533011, 1539464),\n",
" 'Chapter 160': (1539475, 1550543),\n",
" 'Chapter 161': (1550555, 1561717),\n",
" 'Chapter 162': (1561730, 1566915),\n",
" 'Chapter 163': (1566929, 1573960),\n",
" 'Chapter 164': (1573972, 1581957),\n",
" 'Chapter 165': (1581968, 1588633),\n",
" 'Chapter 166': (1588645, 1597177),\n",
" 'Chapter 167': (1597190, 1603512),\n",
" 'Chapter 168': (1603522, 1613939),\n",
" 'Chapter 169': (1613950, 1622742),\n",
" 'Chapter 170': (1622754, 1631118),\n",
" 'Chapter 171': (1631129, 1640271),\n",
" 'Chapter 172': (1640281, 1645270),\n",
" 'Chapter 173': (1645281, 1661005),\n",
" 'Chapter 174': (1661017, 1668134),\n",
" 'Chapter 175': (1668147, 1681915),\n",
" 'Chapter 176': (1681926, 1698654),\n",
" 'Chapter 177': (1698664, 1706589),\n",
" 'Chapter 178': (1706600, 1718176),\n",
" 'Chapter 179': (1718188, 1727716),\n",
" 'Chapter 180': (1727729, 1734132),\n",
" 'Chapter 181': (1734144, 1741147),\n",
" 'Chapter 182': (1741158, 1748429),\n",
" 'Chapter 183': (1748441, 1755235),\n",
" 'Chapter 184': (1755248, 1763140),\n",
" 'Chapter 185': (1763154, 1775008),\n",
" 'Chapter 186': (1775020, 1783412),\n",
" 'Chapter 187': (1783423, 1796121),\n",
" 'Chapter 188': (1796133, 1808030),\n",
" 'Chapter 189': (1808043, 1820752),\n",
" 'Chapter 190': (1820766, 1824958),\n",
" 'Chapter 191': (1824968, 1837385),\n",
" 'Chapter 192': (1837396, 1846970),\n",
" 'Chapter 193': (1846982, 1852552),\n",
" 'Chapter 194': (1852563, 1875958),\n",
" 'Chapter 195': (1875968, 1891721),\n",
" 'Chapter 196': (1891732, 1901053),\n",
" 'Chapter 197': (1901065, 1908516),\n",
" 'Chapter 198': (1908529, 1927307),\n",
" 'Chapter 199': (1927318, 1938576),\n",
" 'Chapter 200': (1938586, 1949815),\n",
" 'Chapter 201': (1949826, 1955504),\n",
" 'Chapter 202': (1955516, 1960165),\n",
" 'Chapter 203': (1960178, 1968513),\n",
" 'Chapter 204': (1968525, 1979128),\n",
" 'Chapter 205': (1979139, 1993246),\n",
" 'Chapter 206': (1993258, 2000801),\n",
" 'Chapter 207': (2000814, 2010415),\n",
" 'Chapter 208': (2010429, 2021112),\n",
" 'Chapter 209': (2021124, 2032585),\n",
" 'Chapter 210': (2032596, 2040742),\n",
" 'Chapter 211': (2040754, 2050948),\n",
" 'Chapter 212': (2050961, 2058827),\n",
" 'Chapter 213': (2058841, 2062981),\n",
" 'Chapter 214': (2062994, 2069063),\n",
" 'Chapter 215': (2069075, 2086019),\n",
" 'Chapter 216': (2086032, 2094717),\n",
" 'Chapter 217': (2094731, 2102215),\n",
" 'Chapter 218': (2102230, 2108511),\n",
" 'Chapter 219': (2108524, 2114563),\n",
" 'Chapter 220': (2114575, 2121489),\n",
" 'Chapter 221': (2121502, 2138539),\n",
" 'Chapter 222': (2138553, 2142699),\n",
" 'Chapter 223': (2142714, 2150076),\n",
" 'Chapter 224': (2150090, 2160598),\n",
" 'Chapter 225': (2160611, 2169695),\n",
" 'Chapter 226': (2169709, 2181032),\n",
" 'Chapter 227': (2181047, 2187682),\n",
" 'Chapter 228': (2187698, 2195247),\n",
" 'Chapter 229': (2195261, 2201533),\n",
" 'Chapter 230': (2201543, 2209307),\n",
" 'Chapter 231': (2209318, 2216996),\n",
" 'Chapter 232': (2217008, 2223189),\n",
" 'Chapter 233': (2223200, 2231358),\n",
" 'Chapter 234': (2231368, 2237418),\n",
" 'Chapter 235': (2237429, 2245212),\n",
" 'Chapter 236': (2245224, 2254116),\n",
" 'Chapter 237': (2254129, 2258748),\n",
" 'Chapter 238': (2258759, 2265657),\n",
" 'Chapter 239': (2265667, 2273244),\n",
" 'Chapter 240': (2273255, 2278030),\n",
" 'Chapter 241': (2278042, 2286957),\n",
" 'Chapter 242': (2286970, 2294811),\n",
" 'Chapter 243': (2294823, 2301093),\n",
" 'Chapter 244': (2301104, 2308533),\n",
" 'Chapter 245': (2308545, 2319288),\n",
" 'Chapter 246': (2319301, 2329361),\n",
" 'Chapter 247': (2329375, 2336641),\n",
" 'Chapter 248': (2336653, 2346282),\n",
" 'Chapter 249': (2346293, 2351634),\n",
" 'Chapter 250': (2351646, 2357547),\n",
" 'Chapter 251': (2357560, 2362664),\n",
" 'Chapter 252': (2362678, 2371927),\n",
" 'Chapter 253': (2371940, 2380747),\n",
" 'Chapter 254': (2380759, 2401423),\n",
" 'Chapter 255': (2401436, 2413736),\n",
" 'Chapter 256': (2413750, 2422780),\n",
" 'Chapter 257': (2422795, 2429139),\n",
" 'Chapter 258': (2429152, 2451042),\n",
" 'Chapter 259': (2451054, 2454769),\n",
" 'Chapter 260': (2454782, 2464557),\n",
" 'Chapter 261': (2464571, 2478191),\n",
" 'Chapter 262': (2478206, 2492662),\n",
" 'Chapter 263': (2492676, 2502383),\n",
" 'Chapter 264': (2502393, 2510871),\n",
" 'Chapter 265': (2510882, 2516536),\n",
" 'Chapter 266': (2516548, 2522656),\n",
" 'Chapter 267': (2522667, 2534007),\n",
" 'Chapter 268': (2534017, 2541217),\n",
" 'Chapter 269': (2541228, 2550621),\n",
" 'Chapter 270': (2550633, 2560555),\n",
" 'Chapter 271': (2560568, 2569428),\n",
" 'Chapter 272': (2569439, 2574361),\n",
" 'Chapter 273': (2574371, 2582900),\n",
" 'Chapter 274': (2582911, 2591361),\n",
" 'Chapter 275': (2591373, 2603577),\n",
" 'Chapter 276': (2603590, 2610037),\n",
" 'Chapter 277': (2610049, 2621780),\n",
" 'Chapter 278': (2621791, 2630559),\n",
" 'Chapter 279': (2630571, 2642981),\n",
" 'Chapter 280': (2642991, 2650259),\n",
" 'Chapter 281': (2650270, 2655080),\n",
" 'Chapter 282': (2655092, 2661472),\n",
" 'Chapter 283': (2661483, 2665803),\n",
" 'Chapter 284': (2665813, 2669053),\n",
" 'Chapter 285': (2669064, 2676866),\n",
" 'Chapter 286': (2676878, 2681741),\n",
" 'Chapter 287': (2681754, 2686389),\n",
" 'Chapter 288': (2686400, 2694508),\n",
" 'Chapter 289': (2694518, 2702171),\n",
" 'Chapter 290': (2702182, 2711476),\n",
" 'Chapter 291': (2711488, 2717734),\n",
" 'Chapter 292': (2717747, 2725063),\n",
" 'Chapter 293': (2725075, 2735213),\n",
" 'Chapter 294': (2735224, 2740707),\n",
" 'Chapter 295': (2740719, 2746541),\n",
" 'Chapter 296': (2746554, 2753110),\n",
" 'Chapter 297': (2753124, 2757166),\n",
" 'Chapter 298': (2757178, 2761937),\n",
" 'Chapter 299': (2761947, 2768946),\n",
" 'Chapter 300': (2768957, 2774372),\n",
" 'Chapter 301': (2774384, 2780726),\n",
" 'Chapter 302': (2780737, 2788496),\n",
" 'Chapter 303': (2788506, 2796405),\n",
" 'Chapter 304': (2796416, 2801417),\n",
" 'Chapter 305': (2801429, 2809444),\n",
" 'Chapter 306': (2809457, 2814587),\n",
" 'Chapter 307': (2814598, 2821733),\n",
" 'Chapter 308': (2821743, 2830641),\n",
" 'Chapter 309': (2830652, 2837084),\n",
" 'Chapter 310': (2837096, 2844273),\n",
" 'Chapter 311': (2844286, 2851082),\n",
" 'Chapter 312': (2851094, 2854450),\n",
" 'Chapter 313': (2854461, 2859825),\n",
" 'Chapter 314': (2859837, 2863545),\n",
" 'Chapter 315': (2863558, 2867438),\n",
" 'Chapter 316': (2867452, 2871291),\n",
" 'Chapter 317': (2871303, 2881724),\n",
" 'Chapter 318': (2881734, 2890648),\n",
" 'Chapter 319': (2890659, 2895611),\n",
" 'Chapter 320': (2895623, 2901600),\n",
" 'Chapter 321': (2901611, 2908666),\n",
" 'Chapter 322': (2908676, 2916155),\n",
" 'Chapter 323': (2916166, 2922637),\n",
" 'Chapter 324': (2922649, 2927977),\n",
" 'Chapter 325': (2927990, 2936152),\n",
" 'Chapter 326': (2936163, 2941344),\n",
" 'Chapter 327': (2941354, 2953701),\n",
" 'Chapter 328': (2953712, 2957335),\n",
" 'Chapter 329': (2957347, 2963825),\n",
" 'Chapter 330': (2963838, 2975628),\n",
" 'Chapter 331': (2975640, 2980768),\n",
" 'Chapter 332': (2980779, 2987775),\n",
" 'Chapter 333': (2987787, 2993311),\n",
" 'Chapter 334': (2993324, 3004375),\n",
" 'Chapter 335': (3004389, 3015383),\n",
" 'Chapter 336': (3015395, 3019531),\n",
" 'Chapter 337': (3019542, 3022806),\n",
" 'Chapter 338': (3022816, 3029539),\n",
" 'Chapter 339': (3029550, 3033720),\n",
" 'Chapter 340': (3033732, 3043990),\n",
" 'Chapter 341': (3044001, 3049520),\n",
" 'Chapter 342': (3049530, 3056726),\n",
" 'Chapter 343': (3056737, 3065865),\n",
" 'Chapter 344': (3065877, 3073225),\n",
" 'Chapter 345': (3073238, 3081182),\n",
" 'Chapter 346': (3081193, 3093226),\n",
" 'Chapter 347': (3093236, 3104287),\n",
" 'Chapter 348': (3104298, 3111479),\n",
" 'Chapter 349': (3111491, 3121881),\n",
" 'Chapter 350': (3121894, 3129357),\n",
" 'Chapter 351': (3129369, 3140886),\n",
" 'Chapter 352': (3140897, 3151252),\n",
" 'Chapter 353': (3151264, 3162117),\n",
" 'Chapter 354': (3162127, 3172699),\n",
" 'Chapter 355': (3172710, 3182727),\n",
" 'Chapter 356': (3182739, 3187844),\n",
" 'Chapter 357': (3187855, 3202014),\n",
" 'Chapter 358': (3202024, 3208550),\n",
" 'Chapter 359': (3208561, 3216644),\n",
" 'Chapter 360': (3216656, 3224192),\n",
" 'Chapter 361': (3224205, 3234131),\n",
" 'Chapter 362': (3234142, 3246514),\n",
" 'Chapter 363': (3246524, 3257927),\n",
" 'Chapter 364': (3257938, 3261527),\n",
" 'Chapter 365': (3261539, 3266518)}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ch_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have all the text extracted! But... dictionaries are not the most useful data structure. They can be difficult to query and to get basic statisics from. So we will convert our dictionary into a more robust data structure, a dataframe from the `pandas` package."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | Chapter 1 | \n", "Chapter 2 | \n", "Chapter 3 | \n", "Chapter 4 | \n", "Chapter 5 | \n", "Chapter 6 | \n", "Chapter 7 | \n", "Chapter 8 | \n", "Chapter 9 | \n", "Chapter 10 | \n", "... | \n", "Chapter 356 | \n", "Chapter 357 | \n", "Chapter 358 | \n", "Chapter 359 | \n", "Chapter 360 | \n", "Chapter 361 | \n", "Chapter 362 | \n", "Chapter 363 | \n", "Chapter 364 | \n", "Chapter 365 | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "35 | \n", "11727 | \n", "19664 | \n", "28420 | \n", "36609 | \n", "47888 | \n", "55797 | \n", "61631 | \n", "68539 | \n", "80788 | \n", "... | \n", "3182739 | \n", "3187855 | \n", "3202024 | \n", "3208561 | \n", "3216656 | \n", "3224205 | \n", "3234142 | \n", "3246524 | \n", "3257938 | \n", "3261539 | \n", "
1 | \n", "11716 | \n", "19652 | \n", "28409 | \n", "36599 | \n", "47877 | \n", "55785 | \n", "61618 | \n", "68528 | \n", "80778 | \n", "90636 | \n", "... | \n", "3187844 | \n", "3202014 | \n", "3208550 | \n", "3216644 | \n", "3224192 | \n", "3234131 | \n", "3246514 | \n", "3257927 | \n", "3261527 | \n", "3266518 | \n", "
2 rows × 365 columns
\n", "\n", " | chapter | \n", "start_idx | \n", "end_idx | \n", "
---|---|---|---|
0 | \n", "Chapter 1 | \n", "35 | \n", "11716 | \n", "
1 | \n", "Chapter 2 | \n", "11727 | \n", "19652 | \n", "
2 | \n", "Chapter 3 | \n", "19664 | \n", "28409 | \n", "
3 | \n", "Chapter 4 | \n", "28420 | \n", "36599 | \n", "
4 | \n", "Chapter 5 | \n", "36609 | \n", "47877 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
360 | \n", "Chapter 361 | \n", "3224205 | \n", "3234131 | \n", "
361 | \n", "Chapter 362 | \n", "3234142 | \n", "3246514 | \n", "
362 | \n", "Chapter 363 | \n", "3246524 | \n", "3257927 | \n", "
363 | \n", "Chapter 364 | \n", "3257938 | \n", "3261527 | \n", "
364 | \n", "Chapter 365 | \n", "3261539 | \n", "3266518 | \n", "
365 rows × 3 columns
\n", "\n", " | chapter | \n", "text | \n", "
---|---|---|
0 | \n", "Chapter 1 | \n", "“well, prince, so genoa and lucca are now jus... | \n", "
1 | \n", "Chapter 2 | \n", "anna pávlovna’s drawing room was gradually fi... | \n", "
2 | \n", "Chapter 3 | \n", "anna pávlovna’s reception was in full swing. ... | \n", "
3 | \n", "Chapter 4 | \n", "just then another visitor entered the drawing... | \n", "
4 | \n", "Chapter 5 | \n", "“and what do you think of this latest comedy,... | \n", "
... | \n", "... | \n", "... | \n", "
360 | \n", "Chapter 361 | \n", "if history dealt only with external phenomena... | \n", "
361 | \n", "Chapter 362 | \n", "for the solution of the question of free will... | \n", "
362 | \n", "Chapter 363 | \n", "thus our conception of free will and inevitab... | \n", "
363 | \n", "Chapter 364 | \n", "history examines the manifestations of man’s ... | \n", "
364 | \n", "Chapter 365 | \n", "from the time the law of copernicus was disco... | \n", "
365 rows × 2 columns
\n", "\n", " | chapter | \n", "sents | \n", "
---|---|---|
0 | \n", "Chapter 1 | \n", "“well, prince, so genoa and lucca are now jus... | \n", "
1 | \n", "Chapter 1 | \n", "but i warn you, if you don’t tell me that this... | \n", "
2 | \n", "Chapter 1 | \n", "but how do you do? | \n", "
3 | \n", "Chapter 1 | \n", "i see i have frightened you—sit down and tell ... | \n", "
4 | \n", "Chapter 1 | \n", "with these words she greeted prince vasíli kur... | \n", "
... | \n", "... | \n", "... | \n", "
26256 | \n", "Chapter 365 | \n", "as in the question of astronomy then, so in th... | \n", "
26257 | \n", "Chapter 365 | \n", "in astronomy it was the immovability of the ea... | \n", "
26258 | \n", "Chapter 365 | \n", "as with astronomy the difficulty of recognizin... | \n", "
26259 | \n", "Chapter 365 | \n", "but as in astronomy the new view said: “it is ... | \n", "
26260 | \n", "Chapter 365 | \n", "*** | \n", "
26261 rows × 2 columns
\n", "\n", " | chapter | \n", "sents | \n", "word_tokenized | \n", "
---|---|---|---|
0 | \n", "Chapter 1 | \n", "“well, prince, so genoa and lucca are now jus... | \n", "[well, prince, so, genoa, and, lucca, are, now... | \n", "
1 | \n", "Chapter 1 | \n", "but i warn you, if you don’t tell me that this... | \n", "[but, i, warn, you, if, you, don’t, tell, me, ... | \n", "
2 | \n", "Chapter 1 | \n", "but how do you do? | \n", "[but, how, do, you, do] | \n", "
3 | \n", "Chapter 1 | \n", "i see i have frightened you—sit down and tell ... | \n", "[i, see, i, have, frightened, you, sit, down, ... | \n", "
4 | \n", "Chapter 1 | \n", "with these words she greeted prince vasíli kur... | \n", "[with, these, words, she, greeted, prince, vas... | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
26256 | \n", "Chapter 365 | \n", "as in the question of astronomy then, so in th... | \n", "[as, in, the, question, of, astronomy, then, s... | \n", "
26257 | \n", "Chapter 365 | \n", "in astronomy it was the immovability of the ea... | \n", "[in, astronomy, it, was, the, immovability, of... | \n", "
26258 | \n", "Chapter 365 | \n", "as with astronomy the difficulty of recognizin... | \n", "[as, with, astronomy, the, difficulty, of, rec... | \n", "
26259 | \n", "Chapter 365 | \n", "but as in astronomy the new view said: “it is ... | \n", "[but, as, in, astronomy, the, new, view, said,... | \n", "
26260 | \n", "Chapter 365 | \n", "*** | \n", "[] | \n", "
26261 rows × 3 columns
\n", "