{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Gentle Introduction to Natural Language Processing\n", "Today, we will take a text, *War and Peace* by Leo Tolstoy, and try to verify or refute the claims of an outside source, *Tolstoy's Phoenix: From Method to Meaning in War and Peace* by George R. Clay (1998). This is a very common task both in and outside of the digital humanities. It will introduce you to a popular Python package for NLP, the Natural Language Toolkit (NLTK), and expose you to common methodologies for wrangling textual data. \n", "\n", "## Our overall task\n", "\n", "We will start by downloading the book. Then we will learn how to clean the text, compute basic statistics, create visualizations, and discuss what we found and how to present those results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goals\n", "* Understand the rights we have to access books on Project Gutenberg\n", "* Read in a text from Project Gutenberg\n", "* Clean textual data using regular expressions\n", "* Perform basic word frequency statistics\n", "* Create visualizations of these statistics \n", "* Discuss how to communicate these results\n", "* Return to our research question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## General methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read in our data\n", "We will start with a URL from Project Gutenberg. The texts on Project Gutenberg are in the public domain in the United States, so we won't have to worry about rights here, but be aware of who owns the intellectual property to a text before you scrape it. In this section, we will break the text up by chapter division. Later we'll do the same but by book division. \n", "\n", "#### The requests library\n", "Here, we are using a library called `requests`. This library is great for HTTP requests, which are how a program asks a web server for something, such as the contents of a page. 
\n", "\n", "More details below:\n", "https://pypi.org/project/requests/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "url = 'https://www.gutenberg.org/cache/epub/2600/pg2600.txt'\n", "text = requests.get(url).text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8250" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## IMPORTANT: this general approach works for any text, but the details depend on your text\n", "## There is no one-size-fits-all for text cleaning, especially for extra information, like tables\n", "## of contents and disclaimers\n", "\n", "# In this case, we found the first line that marks the start of the book. We found this by looking at the actual book.\n", "# Link to book: https://www.gutenberg.org/cache/epub/2600/pg2600.txt\n", "# Note: str.find returns -1 when the marker is not found, so check the value printed below.\n", "\n", "start = text.find('BOOK ONE: 1805\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n')\n", "start" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3274769" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "end = text.find('END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE')\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Using the start and end points that we found above, we can filter out all of the text before the start of the book\n", "## and after its end.\n", "\n", "## In Python, we can slice a string with square brackets, giving the new start and the new end that we want:\n", "\n", "text = text[start:end]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a text that we are confident is nothing but the text of the book, we can begin to dissect it into its component parts. First, let's break it up into chapters." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do so, I am using a very versatile standard-library module called `re`, short for regular expressions (regex), to do some advanced string parsing. The regex function `finditer` takes in a pattern and a text and returns every place that the pattern occurs in the text. These patterns can look very complicated, but, in this case, it is `(CHAPTER [A-Z]+)`, which means: *find all of the times when there is the word 'CHAPTER' followed by a space and then one or more capital letters*.\n", "This pattern fits the Roman numeral numbering of the chapters (e.g. CHAPTER I or CHAPTER XII).\n", "
\n", "\n", "`finditer` also returns the indices where each chapter title begins and ends. So, we can use the end of one chapter title and the beginning of the next one to get all of the text in between the two titles. This text is, by definition, the text of that chapter.\n", "\n", "You can read more about regex [here](https://librarycarpentry.org/lc-data-intro-archives/04-regular-expressions/index.html) and you can play around with your own regex at [regex101](https://regex101.com/). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "# Map each chapter label to the (start, end) character indices of its body\n", "ch_dict = {}\n", "ch_list = list(re.finditer(r'(CHAPTER [A-Z]+)',text))\n", "for i, m in enumerate(ch_list):\n", "    if i < len(ch_list)-1:\n", "        # The body runs from the end of this title to just before the next title\n", "        ch_dict[f'Chapter {i+1}'] = (m.end(0),ch_list[i+1].start(0)-1)\n", "    else:\n", "        # The last chapter runs to the end of the text\n", "        ch_dict[f'Chapter {i+1}'] = (m.end(0), len(text)-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Chapter 1': (35, 11716),\n", " 'Chapter 2': (11727, 19652),\n", " 'Chapter 3': (19664, 28409),\n", " 'Chapter 4': (28420, 36599),\n", " 'Chapter 5': (36609, 47877),\n", " 'Chapter 6': (47888, 55785),\n", " 'Chapter 7': (55797, 61618),\n", " 'Chapter 8': (61631, 68528),\n", " 'Chapter 9': (68539, 80778),\n", " 'Chapter 10': (80788, 90636),\n", " 'Chapter 11': (90647, 95709),\n", " 'Chapter 12': (95721, 103553),\n", " 'Chapter 13': (103566, 107749),\n", " 'Chapter 14': (107761, 116408),\n", " 'Chapter 15': (116419, 125129),\n", " 'Chapter 16': (125141, 135663),\n", " 'Chapter 17': (135676, 139826),\n", " 'Chapter 18': (139840, 153522),\n", " 'Chapter 19': (153534, 160235),\n", " 'Chapter 20': (160246, 172138),\n", " 'Chapter 21': (172150, 187834),\n", " 'Chapter 22': (187847, 198012),\n", " 'Chapter 23': (198026, 208352),\n", " 'Chapter 24': (208365, 217847),\n", " 'Chapter 25': (217859, 238302),\n", " 'Chapter 26': (238315, 249685),\n", " 'Chapter 27': (249699, 258953),\n", " 'Chapter 28': 
(258968, 277450),\n", " 'Chapter 29': (277460, 287532),\n", " 'Chapter 30': (287543, 303366),\n", " 'Chapter 31': (303378, 317732),\n", " 'Chapter 32': (317743, 335287),\n", " 'Chapter 33': (335297, 342076),\n", " 'Chapter 34': (342087, 347628),\n", " 'Chapter 35': (347640, 357850),\n", " 'Chapter 36': (357863, 376832),\n", " 'Chapter 37': (376843, 387567),\n", " 'Chapter 38': (387577, 399996),\n", " 'Chapter 39': (400007, 405255),\n", " 'Chapter 40': (405267, 416777),\n", " 'Chapter 41': (416790, 429464),\n", " 'Chapter 42': (429476, 436196),\n", " 'Chapter 43': (436207, 448576),\n", " 'Chapter 44': (448588, 453728),\n", " 'Chapter 45': (453741, 464522),\n", " 'Chapter 46': (464536, 474352),\n", " 'Chapter 47': (474364, 486140),\n", " 'Chapter 48': (486151, 499062),\n", " 'Chapter 49': (499074, 515816),\n", " 'Chapter 50': (515826, 535228),\n", " 'Chapter 51': (535239, 554482),\n", " 'Chapter 52': (554494, 572446),\n", " 'Chapter 53': (572457, 589592),\n", " 'Chapter 54': (589602, 601924),\n", " 'Chapter 55': (601935, 614563),\n", " 'Chapter 56': (614575, 633355),\n", " 'Chapter 57': (633368, 643577),\n", " 'Chapter 58': (643588, 657352),\n", " 'Chapter 59': (657362, 668384),\n", " 'Chapter 60': (668395, 677348),\n", " 'Chapter 61': (677360, 691174),\n", " 'Chapter 62': (691187, 703199),\n", " 'Chapter 63': (703211, 714894),\n", " 'Chapter 64': (714905, 727688),\n", " 'Chapter 65': (727700, 736157),\n", " 'Chapter 66': (736170, 747241),\n", " 'Chapter 67': (747255, 761993),\n", " 'Chapter 68': (762005, 771296),\n", " 'Chapter 69': (771306, 788002),\n", " 'Chapter 70': (788013, 801020),\n", " 'Chapter 71': (801032, 812316),\n", " 'Chapter 72': (812327, 823160),\n", " 'Chapter 73': (823170, 828195),\n", " 'Chapter 74': (828206, 838301),\n", " 'Chapter 75': (838313, 845370),\n", " 'Chapter 76': (845383, 854584),\n", " 'Chapter 77': (854595, 859683),\n", " 'Chapter 78': (859693, 867016),\n", " 'Chapter 79': (867027, 872301),\n", " 'Chapter 80': (872313, 879240),\n", " 
'Chapter 81': (879253, 885833),\n", " 'Chapter 82': (885845, 891870),\n", " 'Chapter 83': (891881, 899229),\n", " 'Chapter 84': (899241, 906388),\n", " 'Chapter 85': (906398, 913648),\n", " 'Chapter 86': (913659, 926615),\n", " 'Chapter 87': (926627, 941734),\n", " 'Chapter 88': (941745, 949937),\n", " 'Chapter 89': (949947, 953867),\n", " 'Chapter 90': (953878, 963628),\n", " 'Chapter 91': (963640, 966960),\n", " 'Chapter 92': (966973, 976016),\n", " 'Chapter 93': (976027, 987789),\n", " 'Chapter 94': (987799, 998043),\n", " 'Chapter 95': (998054, 1015627),\n", " 'Chapter 96': (1015639, 1023407),\n", " 'Chapter 97': (1023420, 1031896),\n", " 'Chapter 98': (1031908, 1036698),\n", " 'Chapter 99': (1036709, 1046151),\n", " 'Chapter 100': (1046163, 1057473),\n", " 'Chapter 101': (1057486, 1065212),\n", " 'Chapter 102': (1065226, 1071207),\n", " 'Chapter 103': (1071219, 1080110),\n", " 'Chapter 104': (1080121, 1088277),\n", " 'Chapter 105': (1088289, 1098614),\n", " 'Chapter 106': (1098627, 1099622),\n", " 'Chapter 107': (1099632, 1105115),\n", " 'Chapter 108': (1105126, 1110849),\n", " 'Chapter 109': (1110861, 1116058),\n", " 'Chapter 110': (1116069, 1122794),\n", " 'Chapter 111': (1122804, 1134448),\n", " 'Chapter 112': (1134459, 1141243),\n", " 'Chapter 113': (1141255, 1151007),\n", " 'Chapter 114': (1151020, 1157898),\n", " 'Chapter 115': (1157909, 1163462),\n", " 'Chapter 116': (1163472, 1174306),\n", " 'Chapter 117': (1174317, 1182434),\n", " 'Chapter 118': (1182446, 1187852),\n", " 'Chapter 119': (1187865, 1195693),\n", " 'Chapter 120': (1195705, 1203595),\n", " 'Chapter 121': (1203606, 1209863),\n", " 'Chapter 122': (1209875, 1218066),\n", " 'Chapter 123': (1218079, 1222459),\n", " 'Chapter 124': (1222473, 1232788),\n", " 'Chapter 125': (1232800, 1236768),\n", " 'Chapter 126': (1236779, 1243765),\n", " 'Chapter 127': (1243777, 1250307),\n", " 'Chapter 128': (1250320, 1258390),\n", " 'Chapter 129': (1258404, 1270856),\n", " 'Chapter 130': (1270869, 1277248),\n", 
" 'Chapter 131': (1277260, 1285069),\n", " 'Chapter 132': (1285082, 1292585),\n", " 'Chapter 133': (1292595, 1302392),\n", " 'Chapter 134': (1302403, 1306648),\n", " 'Chapter 135': (1306660, 1312807),\n", " 'Chapter 136': (1312818, 1325196),\n", " 'Chapter 137': (1325206, 1335367),\n", " 'Chapter 138': (1335378, 1350083),\n", " 'Chapter 139': (1350095, 1368974),\n", " 'Chapter 140': (1368987, 1375976),\n", " 'Chapter 141': (1375987, 1384187),\n", " 'Chapter 142': (1384197, 1402209),\n", " 'Chapter 143': (1402220, 1410852),\n", " 'Chapter 144': (1410864, 1417337),\n", " 'Chapter 145': (1417350, 1424143),\n", " 'Chapter 146': (1424153, 1435423),\n", " 'Chapter 147': (1435434, 1443815),\n", " 'Chapter 148': (1443827, 1456028),\n", " 'Chapter 149': (1456039, 1461442),\n", " 'Chapter 150': (1461452, 1471642),\n", " 'Chapter 151': (1471653, 1478639),\n", " 'Chapter 152': (1478651, 1486770),\n", " 'Chapter 153': (1486783, 1495198),\n", " 'Chapter 154': (1495209, 1506519),\n", " 'Chapter 155': (1506529, 1513793),\n", " 'Chapter 156': (1513804, 1519102),\n", " 'Chapter 157': (1519114, 1526244),\n", " 'Chapter 158': (1526257, 1532999),\n", " 'Chapter 159': (1533011, 1539464),\n", " 'Chapter 160': (1539475, 1550543),\n", " 'Chapter 161': (1550555, 1561717),\n", " 'Chapter 162': (1561730, 1566915),\n", " 'Chapter 163': (1566929, 1573960),\n", " 'Chapter 164': (1573972, 1581957),\n", " 'Chapter 165': (1581968, 1588633),\n", " 'Chapter 166': (1588645, 1597177),\n", " 'Chapter 167': (1597190, 1603512),\n", " 'Chapter 168': (1603522, 1613939),\n", " 'Chapter 169': (1613950, 1622742),\n", " 'Chapter 170': (1622754, 1631118),\n", " 'Chapter 171': (1631129, 1640271),\n", " 'Chapter 172': (1640281, 1645270),\n", " 'Chapter 173': (1645281, 1661005),\n", " 'Chapter 174': (1661017, 1668134),\n", " 'Chapter 175': (1668147, 1681915),\n", " 'Chapter 176': (1681926, 1698654),\n", " 'Chapter 177': (1698664, 1706589),\n", " 'Chapter 178': (1706600, 1718176),\n", " 'Chapter 179': (1718188, 
1727716),\n", " 'Chapter 180': (1727729, 1734132),\n", " 'Chapter 181': (1734144, 1741147),\n", " 'Chapter 182': (1741158, 1748429),\n", " 'Chapter 183': (1748441, 1755235),\n", " 'Chapter 184': (1755248, 1763140),\n", " 'Chapter 185': (1763154, 1775008),\n", " 'Chapter 186': (1775020, 1783412),\n", " 'Chapter 187': (1783423, 1796121),\n", " 'Chapter 188': (1796133, 1808030),\n", " 'Chapter 189': (1808043, 1820752),\n", " 'Chapter 190': (1820766, 1824958),\n", " 'Chapter 191': (1824968, 1837385),\n", " 'Chapter 192': (1837396, 1846970),\n", " 'Chapter 193': (1846982, 1852552),\n", " 'Chapter 194': (1852563, 1875958),\n", " 'Chapter 195': (1875968, 1891721),\n", " 'Chapter 196': (1891732, 1901053),\n", " 'Chapter 197': (1901065, 1908516),\n", " 'Chapter 198': (1908529, 1927307),\n", " 'Chapter 199': (1927318, 1938576),\n", " 'Chapter 200': (1938586, 1949815),\n", " 'Chapter 201': (1949826, 1955504),\n", " 'Chapter 202': (1955516, 1960165),\n", " 'Chapter 203': (1960178, 1968513),\n", " 'Chapter 204': (1968525, 1979128),\n", " 'Chapter 205': (1979139, 1993246),\n", " 'Chapter 206': (1993258, 2000801),\n", " 'Chapter 207': (2000814, 2010415),\n", " 'Chapter 208': (2010429, 2021112),\n", " 'Chapter 209': (2021124, 2032585),\n", " 'Chapter 210': (2032596, 2040742),\n", " 'Chapter 211': (2040754, 2050948),\n", " 'Chapter 212': (2050961, 2058827),\n", " 'Chapter 213': (2058841, 2062981),\n", " 'Chapter 214': (2062994, 2069063),\n", " 'Chapter 215': (2069075, 2086019),\n", " 'Chapter 216': (2086032, 2094717),\n", " 'Chapter 217': (2094731, 2102215),\n", " 'Chapter 218': (2102230, 2108511),\n", " 'Chapter 219': (2108524, 2114563),\n", " 'Chapter 220': (2114575, 2121489),\n", " 'Chapter 221': (2121502, 2138539),\n", " 'Chapter 222': (2138553, 2142699),\n", " 'Chapter 223': (2142714, 2150076),\n", " 'Chapter 224': (2150090, 2160598),\n", " 'Chapter 225': (2160611, 2169695),\n", " 'Chapter 226': (2169709, 2181032),\n", " 'Chapter 227': (2181047, 2187682),\n", " 'Chapter 228': 
(2187698, 2195247),\n", " 'Chapter 229': (2195261, 2201533),\n", " 'Chapter 230': (2201543, 2209307),\n", " 'Chapter 231': (2209318, 2216996),\n", " 'Chapter 232': (2217008, 2223189),\n", " 'Chapter 233': (2223200, 2231358),\n", " 'Chapter 234': (2231368, 2237418),\n", " 'Chapter 235': (2237429, 2245212),\n", " 'Chapter 236': (2245224, 2254116),\n", " 'Chapter 237': (2254129, 2258748),\n", " 'Chapter 238': (2258759, 2265657),\n", " 'Chapter 239': (2265667, 2273244),\n", " 'Chapter 240': (2273255, 2278030),\n", " 'Chapter 241': (2278042, 2286957),\n", " 'Chapter 242': (2286970, 2294811),\n", " 'Chapter 243': (2294823, 2301093),\n", " 'Chapter 244': (2301104, 2308533),\n", " 'Chapter 245': (2308545, 2319288),\n", " 'Chapter 246': (2319301, 2329361),\n", " 'Chapter 247': (2329375, 2336641),\n", " 'Chapter 248': (2336653, 2346282),\n", " 'Chapter 249': (2346293, 2351634),\n", " 'Chapter 250': (2351646, 2357547),\n", " 'Chapter 251': (2357560, 2362664),\n", " 'Chapter 252': (2362678, 2371927),\n", " 'Chapter 253': (2371940, 2380747),\n", " 'Chapter 254': (2380759, 2401423),\n", " 'Chapter 255': (2401436, 2413736),\n", " 'Chapter 256': (2413750, 2422780),\n", " 'Chapter 257': (2422795, 2429139),\n", " 'Chapter 258': (2429152, 2451042),\n", " 'Chapter 259': (2451054, 2454769),\n", " 'Chapter 260': (2454782, 2464557),\n", " 'Chapter 261': (2464571, 2478191),\n", " 'Chapter 262': (2478206, 2492662),\n", " 'Chapter 263': (2492676, 2502383),\n", " 'Chapter 264': (2502393, 2510871),\n", " 'Chapter 265': (2510882, 2516536),\n", " 'Chapter 266': (2516548, 2522656),\n", " 'Chapter 267': (2522667, 2534007),\n", " 'Chapter 268': (2534017, 2541217),\n", " 'Chapter 269': (2541228, 2550621),\n", " 'Chapter 270': (2550633, 2560555),\n", " 'Chapter 271': (2560568, 2569428),\n", " 'Chapter 272': (2569439, 2574361),\n", " 'Chapter 273': (2574371, 2582900),\n", " 'Chapter 274': (2582911, 2591361),\n", " 'Chapter 275': (2591373, 2603577),\n", " 'Chapter 276': (2603590, 2610037),\n", " 
'Chapter 277': (2610049, 2621780),\n", " 'Chapter 278': (2621791, 2630559),\n", " 'Chapter 279': (2630571, 2642981),\n", " 'Chapter 280': (2642991, 2650259),\n", " 'Chapter 281': (2650270, 2655080),\n", " 'Chapter 282': (2655092, 2661472),\n", " 'Chapter 283': (2661483, 2665803),\n", " 'Chapter 284': (2665813, 2669053),\n", " 'Chapter 285': (2669064, 2676866),\n", " 'Chapter 286': (2676878, 2681741),\n", " 'Chapter 287': (2681754, 2686389),\n", " 'Chapter 288': (2686400, 2694508),\n", " 'Chapter 289': (2694518, 2702171),\n", " 'Chapter 290': (2702182, 2711476),\n", " 'Chapter 291': (2711488, 2717734),\n", " 'Chapter 292': (2717747, 2725063),\n", " 'Chapter 293': (2725075, 2735213),\n", " 'Chapter 294': (2735224, 2740707),\n", " 'Chapter 295': (2740719, 2746541),\n", " 'Chapter 296': (2746554, 2753110),\n", " 'Chapter 297': (2753124, 2757166),\n", " 'Chapter 298': (2757178, 2761937),\n", " 'Chapter 299': (2761947, 2768946),\n", " 'Chapter 300': (2768957, 2774372),\n", " 'Chapter 301': (2774384, 2780726),\n", " 'Chapter 302': (2780737, 2788496),\n", " 'Chapter 303': (2788506, 2796405),\n", " 'Chapter 304': (2796416, 2801417),\n", " 'Chapter 305': (2801429, 2809444),\n", " 'Chapter 306': (2809457, 2814587),\n", " 'Chapter 307': (2814598, 2821733),\n", " 'Chapter 308': (2821743, 2830641),\n", " 'Chapter 309': (2830652, 2837084),\n", " 'Chapter 310': (2837096, 2844273),\n", " 'Chapter 311': (2844286, 2851082),\n", " 'Chapter 312': (2851094, 2854450),\n", " 'Chapter 313': (2854461, 2859825),\n", " 'Chapter 314': (2859837, 2863545),\n", " 'Chapter 315': (2863558, 2867438),\n", " 'Chapter 316': (2867452, 2871291),\n", " 'Chapter 317': (2871303, 2881724),\n", " 'Chapter 318': (2881734, 2890648),\n", " 'Chapter 319': (2890659, 2895611),\n", " 'Chapter 320': (2895623, 2901600),\n", " 'Chapter 321': (2901611, 2908666),\n", " 'Chapter 322': (2908676, 2916155),\n", " 'Chapter 323': (2916166, 2922637),\n", " 'Chapter 324': (2922649, 2927977),\n", " 'Chapter 325': (2927990, 
2936152),\n", " 'Chapter 326': (2936163, 2941344),\n", " 'Chapter 327': (2941354, 2953701),\n", " 'Chapter 328': (2953712, 2957335),\n", " 'Chapter 329': (2957347, 2963825),\n", " 'Chapter 330': (2963838, 2975628),\n", " 'Chapter 331': (2975640, 2980768),\n", " 'Chapter 332': (2980779, 2987775),\n", " 'Chapter 333': (2987787, 2993311),\n", " 'Chapter 334': (2993324, 3004375),\n", " 'Chapter 335': (3004389, 3015383),\n", " 'Chapter 336': (3015395, 3019531),\n", " 'Chapter 337': (3019542, 3022806),\n", " 'Chapter 338': (3022816, 3029539),\n", " 'Chapter 339': (3029550, 3033720),\n", " 'Chapter 340': (3033732, 3043990),\n", " 'Chapter 341': (3044001, 3049520),\n", " 'Chapter 342': (3049530, 3056726),\n", " 'Chapter 343': (3056737, 3065865),\n", " 'Chapter 344': (3065877, 3073225),\n", " 'Chapter 345': (3073238, 3081182),\n", " 'Chapter 346': (3081193, 3093226),\n", " 'Chapter 347': (3093236, 3104287),\n", " 'Chapter 348': (3104298, 3111479),\n", " 'Chapter 349': (3111491, 3121881),\n", " 'Chapter 350': (3121894, 3129357),\n", " 'Chapter 351': (3129369, 3140886),\n", " 'Chapter 352': (3140897, 3151252),\n", " 'Chapter 353': (3151264, 3162117),\n", " 'Chapter 354': (3162127, 3172699),\n", " 'Chapter 355': (3172710, 3182727),\n", " 'Chapter 356': (3182739, 3187844),\n", " 'Chapter 357': (3187855, 3202014),\n", " 'Chapter 358': (3202024, 3208550),\n", " 'Chapter 359': (3208561, 3216644),\n", " 'Chapter 360': (3216656, 3224192),\n", " 'Chapter 361': (3224205, 3234131),\n", " 'Chapter 362': (3234142, 3246514),\n", " 'Chapter 363': (3246524, 3257927),\n", " 'Chapter 364': (3257938, 3261527),\n", " 'Chapter 365': (3261539, 3266518)}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ch_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the start and end indices of every chapter! But... dictionaries are not the most convenient data structure for this kind of work. They can be awkward to query and to get basic statistics from. 
So we will convert our dictionary into a more robust data structure, a dataframe from the `pandas` package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Chapter 1 Chapter 2 Chapter 3 Chapter 4 Chapter 5 Chapter 6 \\\n", "0 35 11727 19664 28420 36609 47888 \n", "1 11716 19652 28409 36599 47877 55785 \n", "\n", " Chapter 7 Chapter 8 Chapter 9 Chapter 10 ... Chapter 356 Chapter 357 \\\n", "0 55797 61631 68539 80788 ... 3182739 3187855 \n", "1 61618 68528 80778 90636 ... 3187844 3202014 \n", "\n", " Chapter 358 Chapter 359 Chapter 360 Chapter 361 Chapter 362 \\\n", "0 3202024 3208561 3216656 3224205 3234142 \n", "1 3208550 3216644 3224192 3234131 3246514 \n", "\n", " Chapter 363 Chapter 364 Chapter 365 \n", "0 3246524 3257938 3261539 \n", "1 3257927 3261527 3266518 \n", "\n", "[2 rows x 365 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd ## You will almost always see pandas imported this way.\n", "## Let's see what happens when we input the dictionary directly\n", "pd.DataFrame(ch_dict)\n", "## it's close but not quite what we wanted" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " chapter start_idx end_idx\n", "0 Chapter 1 35 11716\n", "1 Chapter 2 11727 19652\n", "2 Chapter 3 19664 28409\n", "3 Chapter 4 28420 36599\n", "4 Chapter 5 36609 47877\n", ".. ... ... ...\n", "360 Chapter 361 3224205 3234131\n", "361 Chapter 362 3234142 3246514\n", "362 Chapter 363 3246524 3257927\n", "363 Chapter 364 3257938 3261527\n", "364 Chapter 365 3261539 3266518\n", "\n", "[365 rows x 3 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## pandas is programmed to look for numerical indices, which dictionaries (because they're an unordered data type) do not have\n", "## we can coerse it though to accept string values as the index with the 'from_dict' method and the 'orient' keyword argument\n", "## the 'reset_index' method will then turn our index into a column and give us an index for the rows\n", "ch_df = pd.DataFrame.from_dict(ch_dict, orient='index')\n", "ch_df = ch_df.reset_index()\n", "ch_df = ch_df.rename(columns={'index':'chapter',0:'start_idx',1:'end_idx'})\n", "ch_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the dataframe is in the correct orientation, let's use the indices that we got to select the texts of each chapter. All we need to do is input `start_idx:end_idx` into square brackets next to the `text` object. Below is an example. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\r\\n\\r\\n“Well, Prince, so Genoa and Lucca are now just family estates of the\\r\\nBuonapartes. But I warn you, if you don’t tell me that this means war,\\r\\nif you still try to defend the infamies and horrors perpetrated by that\\r\\nAntichrist—I really believe he is Antichrist—I will have nothing\\r\\nmore to do with you and you are no longer my friend, no longer my\\r\\n‘faithful slave,’ as you call yourself! But how do you do? 
I see I\\r\\nhave frightened you—sit down and tell me all the news.”\\r\\n\\r\\nIt was in July, 1805, and the speaker was the well-known Anna Pávlovna\\r\\nSchérer, maid of honor and favorite of the Empress Márya Fëdorovna.\\r\\nWith these words she greeted Prince Vasíli Kurágin, a man of high\\r\\nrank and importance, who was the first to arrive at her reception. Anna\\r\\nPávlovna had had a cough for some days. She was, as she said, suffering\\r\\nfrom la grippe; grippe being then a new word in St. Petersburg, used\\r\\nonly by the elite.\\r\\n\\r\\nAll her invitations without exception, written in French, and delivered\\r\\nby a scarlet-liveried footman that morning, ran as follows:\\r\\n\\r\\n“If you have nothing better to do, Count (or Prince), and if the\\r\\nprospect of spending an evening with a poor invalid is not too terrible,\\r\\nI shall be very charmed to see you tonight between 7 and 10—Annette\\r\\nSchérer.”\\r\\n\\r\\n“Heavens! what a virulent attack!” replied the prince, not in the\\r\\nleast disconcerted by this reception. He had just entered, wearing an\\r\\nembroidered court uniform, knee breeches, and shoes, and had stars on\\r\\nhis breast and a serene expression on his flat face. He spoke in that\\r\\nrefined French in which our grandfathers not only spoke but thought, and\\r\\nwith the gentle, patronizing intonation natural to a man of importance\\r\\nwho had grown old in society and at court. He went up to Anna Pávlovna,\\r\\nkissed her hand, presenting to her his bald, scented, and shining head,\\r\\nand complacently seated himself on the sofa.\\r\\n\\r\\n“First of all, dear friend, tell me how you are. Set your friend’s\\r\\nmind at rest,” said he without altering his tone, beneath the\\r\\npoliteness and affected sympathy of which indifference and even irony\\r\\ncould be discerned.\\r\\n\\r\\n“Can one be well while suffering morally? Can one be calm in times\\r\\nlike these if one has any feeling?” said Anna Pávlovna. 
“You are\\r\\nstaying the whole evening, I hope?”\\r\\n\\r\\n“And the fete at the English ambassador’s? Today is Wednesday. I\\r\\nmust put in an appearance there,” said the prince. “My daughter is\\r\\ncoming for me to take me there.”\\r\\n\\r\\n“I thought today’s fete had been canceled. I confess all these\\r\\nfestivities and fireworks are becoming wearisome.”\\r\\n\\r\\n“If they had known that you wished it, the entertainment would have\\r\\nbeen put off,” said the prince, who, like a wound-up clock, by force\\r\\nof habit said things he did not even wish to be believed.\\r\\n\\r\\n“Don’t tease! Well, and what has been decided about Novosíltsev’s\\r\\ndispatch? You know everything.”\\r\\n\\r\\n“What can one say about it?” replied the prince in a cold, listless\\r\\ntone. “What has been decided? They have decided that Buonaparte has\\r\\nburnt his boats, and I believe that we are ready to burn ours.”\\r\\n\\r\\nPrince Vasíli always spoke languidly, like an actor repeating a stale\\r\\npart. Anna Pávlovna Schérer on the contrary, despite her forty years,\\r\\noverflowed with animation and impulsiveness. To be an enthusiast had\\r\\nbecome her social vocation and, sometimes even when she did not\\r\\nfeel like it, she became enthusiastic in order not to disappoint the\\r\\nexpectations of those who knew her. The subdued smile which, though it\\r\\ndid not suit her faded features, always played round her lips expressed,\\r\\nas in a spoiled child, a continual consciousness of her charming defect,\\r\\nwhich she neither wished, nor could, nor considered it necessary, to\\r\\ncorrect.\\r\\n\\r\\nIn the midst of a conversation on political matters Anna Pávlovna burst\\r\\nout:\\r\\n\\r\\n“Oh, don’t speak to me of Austria. Perhaps I don’t understand\\r\\nthings, but Austria never has wished, and does not wish, for war. She\\r\\nis betraying us! Russia alone must save Europe. Our gracious sovereign\\r\\nrecognizes his high vocation and will be true to it. 
That is the one\\r\\nthing I have faith in! Our good and wonderful sovereign has to perform\\r\\nthe noblest role on earth, and he is so virtuous and noble that God will\\r\\nnot forsake him. He will fulfill his vocation and crush the hydra of\\r\\nrevolution, which has become more terrible than ever in the person of\\r\\nthis murderer and villain! We alone must avenge the blood of the just\\r\\none.... Whom, I ask you, can we rely on?... England with her commercial\\r\\nspirit will not and cannot understand the Emperor Alexander’s\\r\\nloftiness of soul. She has refused to evacuate Malta. She wanted to\\r\\nfind, and still seeks, some secret motive in our actions. What answer\\r\\ndid Novosíltsev get? None. The English have not understood and cannot\\r\\nunderstand the self-abnegation of our Emperor who wants nothing for\\r\\nhimself, but only desires the good of mankind. And what have they\\r\\npromised? Nothing! And what little they have promised they will not\\r\\nperform! Prussia has always declared that Buonaparte is invincible, and\\r\\nthat all Europe is powerless before him.... And I don’t believe a\\r\\nword that Hardenburg says, or Haugwitz either. This famous Prussian\\r\\nneutrality is just a trap. I have faith only in God and the lofty\\r\\ndestiny of our adored monarch. He will save Europe!”\\r\\n\\r\\nShe suddenly paused, smiling at her own impetuosity.\\r\\n\\r\\n“I think,” said the prince with a smile, “that if you had been\\r\\nsent instead of our dear Wintzingerode you would have captured the King\\r\\nof Prussia’s consent by assault. You are so eloquent. Will you give me\\r\\na cup of tea?”\\r\\n\\r\\n“In a moment. À propos,” she added, becoming calm again, “I am\\r\\nexpecting two very interesting men tonight, le Vicomte de Mortemart, who\\r\\nis connected with the Montmorencys through the Rohans, one of the best\\r\\nFrench families. He is one of the genuine émigrés, the good ones. And\\r\\nalso the Abbé Morio. 
Do you know that profound thinker? He has been\\r\\nreceived by the Emperor. Had you heard?”\\r\\n\\r\\n“I shall be delighted to meet them,” said the prince. “But\\r\\ntell me,” he added with studied carelessness as if it had only just\\r\\noccurred to him, though the question he was about to ask was the chief\\r\\nmotive of his visit, “is it true that the Dowager Empress wants\\r\\nBaron Funke to be appointed first secretary at Vienna? The baron by all\\r\\naccounts is a poor creature.”\\r\\n\\r\\nPrince Vasíli wished to obtain this post for his son, but others were\\r\\ntrying through the Dowager Empress Márya Fëdorovna to secure it for\\r\\nthe baron.\\r\\n\\r\\nAnna Pávlovna almost closed her eyes to indicate that neither she nor\\r\\nanyone else had a right to criticize what the Empress desired or was\\r\\npleased with.\\r\\n\\r\\n“Baron Funke has been recommended to the Dowager Empress by her\\r\\nsister,” was all she said, in a dry and mournful tone.\\r\\n\\r\\nAs she named the Empress, Anna Pávlovna’s face suddenly assumed an\\r\\nexpression of profound and sincere devotion and respect mingled with\\r\\nsadness, and this occurred every time she mentioned her illustrious\\r\\npatroness. She added that Her Majesty had deigned to show Baron Funke\\r\\nbeaucoup d’estime, and again her face clouded over with sadness.\\r\\n\\r\\nThe prince was silent and looked indifferent. But, with the womanly and\\r\\ncourtierlike quickness and tact habitual to her, Anna Pávlovna\\r\\nwished both to rebuke him (for daring to speak as he had done of a man\\r\\nrecommended to the Empress) and at the same time to console him, so she\\r\\nsaid:\\r\\n\\r\\n“Now about your family. Do you know that since your daughter came\\r\\nout everyone has been enraptured by her? 
They say she is amazingly\\r\\nbeautiful.”\\r\\n\\r\\nThe prince bowed to signify his respect and gratitude.\\r\\n\\r\\n“I often think,” she continued after a short pause, drawing nearer\\r\\nto the prince and smiling amiably at him as if to show that political\\r\\nand social topics were ended and the time had come for intimate\\r\\nconversation—“I often think how unfairly sometimes the joys of life\\r\\nare distributed. Why has fate given you two such splendid children?\\r\\nI don’t speak of Anatole, your youngest. I don’t like him,” she\\r\\nadded in a tone admitting of no rejoinder and raising her eyebrows.\\r\\n“Two such charming children. And really you appreciate them less than\\r\\nanyone, and so you don’t deserve to have them.”\\r\\n\\r\\nAnd she smiled her ecstatic smile.\\r\\n\\r\\n“I can’t help it,” said the prince. “Lavater would have said I\\r\\nlack the bump of paternity.”\\r\\n\\r\\n“Don’t joke; I mean to have a serious talk with you. Do you know\\r\\nI am dissatisfied with your younger son? Between ourselves” (and her\\r\\nface assumed its melancholy expression), “he was mentioned at Her\\r\\nMajesty’s and you were pitied....”\\r\\n\\r\\nThe prince answered nothing, but she looked at him significantly,\\r\\nawaiting a reply. He frowned.\\r\\n\\r\\n“What would you have me do?” he said at last. “You know I did all\\r\\na father could for their education, and they have both turned out fools.\\r\\nHippolyte is at least a quiet fool, but Anatole is an active one. That\\r\\nis the only difference between them.” He said this smiling in a way\\r\\nmore natural and animated than usual, so that the wrinkles round\\r\\nhis mouth very clearly revealed something unexpectedly coarse and\\r\\nunpleasant.\\r\\n\\r\\n“And why are children born to such men as you? 
If you were not a\\r\\nfather there would be nothing I could reproach you with,” said Anna\\r\\nPávlovna, looking up pensively.\\r\\n\\r\\n“I am your faithful slave and to you alone I can confess that my\\r\\nchildren are the bane of my life. It is the cross I have to bear. That\\r\\nis how I explain it to myself. It can’t be helped!”\\r\\n\\r\\nHe said no more, but expressed his resignation to cruel fate by a\\r\\ngesture. Anna Pávlovna meditated.\\r\\n\\r\\n“Have you never thought of marrying your prodigal son Anatole?” she\\r\\nasked. “They say old maids have a mania for matchmaking, and though I\\r\\ndon’t feel that weakness in myself as yet, I know a little person who\\r\\nis very unhappy with her father. She is a relation of yours, Princess\\r\\nMary Bolkónskaya.”\\r\\n\\r\\nPrince Vasíli did not reply, though, with the quickness of memory and\\r\\nperception befitting a man of the world, he indicated by a movement of\\r\\nthe head that he was considering this information.\\r\\n\\r\\n“Do you know,” he said at last, evidently unable to check the sad\\r\\ncurrent of his thoughts, “that Anatole is costing me forty thousand\\r\\nrubles a year? And,” he went on after a pause, “what will it be in\\r\\nfive years, if he goes on like this?” Presently he added: “That’s\\r\\nwhat we fathers have to put up with.... Is this princess of yours\\r\\nrich?”\\r\\n\\r\\n“Her father is very rich and stingy. He lives in the country. He is\\r\\nthe well-known Prince Bolkónski who had to retire from the army under\\r\\nthe late Emperor, and was nicknamed ‘the King of Prussia.’ He is\\r\\nvery clever but eccentric, and a bore. The poor girl is very unhappy.\\r\\nShe has a brother; I think you know him, he married Lise Meinen lately.\\r\\nHe is an aide-de-camp of Kutúzov’s and will be here tonight.”\\r\\n\\r\\n“Listen, dear Annette,” said the prince, suddenly taking Anna\\r\\nPávlovna’s hand and for some reason drawing it downwards. 
“Arrange\\r\\nthat affair for me and I shall always be your most devoted slave-slafe\\r\\nwith an f, as a village elder of mine writes in his reports. She is rich\\r\\nand of good family and that’s all I want.”\\r\\n\\r\\nAnd with the familiarity and easy grace peculiar to him, he raised the\\r\\nmaid of honor’s hand to his lips, kissed it, and swung it to and fro\\r\\nas he lay back in his armchair, looking in another direction.\\r\\n\\r\\n“Attendez,” said Anna Pávlovna, reflecting, “I’ll speak to\\r\\nLise, young Bolkónski’s wife, this very evening, and perhaps the\\r\\nthing can be arranged. It shall be on your family’s behalf that I’ll\\r\\nstart my apprenticeship as old maid.”\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n\\r'" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = ch_df.iloc[0]['start_idx'] ## .iloc[0] gives us the first row of the dataframe\n", "e = ch_df.iloc[0]['end_idx']\n", "text[s:e]\n", "## the first full chapter of War and Peace\n", "## you can change the number after iloc to change which chapter you see" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is great, but how do we do the same for every chapter in the dataframe? \n", "\n", "We can use the `.apply` method from `pandas`, which applies a single function to every cell in a column or, with `axis=1`, to every row of a dataframe. In this case, we need values from two different columns, so we will use a `lambda` expression. `lambda` expressions are a way to build very simple, one-line functions in Python. Although they are limited to a single expression, that expression can include conditional logic and calls to other functions, such as regular expression functions. [Here](https://www.w3schools.com/python/python_lambda.asp) is a useful starter guide to `lambda` from W3Schools.\n", "\n", "For us, we need a `lambda` expression that pulls the start and end index from each row and returns the text between those two indices. Note the `axis=1` keyword argument. This tells `pandas` to pass each row to the function, rather than each column, so both index values are available at once. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ch_df['text'] = ch_df.apply(lambda x: text[x['start_idx']:x['end_idx']],axis=1)\n", "ch_df['text'] = ch_df['text'].apply(lambda x: x.replace('\\r','').replace('\\n',' ').replace('  ', ' ').lower()) ## using lambda again for some simple cleaning with the replace and lower methods\n", "ch_df = ch_df.drop(['start_idx','end_idx'],axis=1) ## and now we can drop our indices just so they don't clutter up our dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " chapter text\n", "0 Chapter 1 “well, prince, so genoa and lucca are now jus...\n", "1 Chapter 2 anna pávlovna’s drawing room was gradually fi...\n", "2 Chapter 3 anna pávlovna’s reception was in full swing. ...\n", "3 Chapter 4 just then another visitor entered the drawing...\n", "4 Chapter 5 “and what do you think of this latest comedy,...\n", ".. ... ...\n", "360 Chapter 361 if history dealt only with external phenomena...\n", "361 Chapter 362 for the solution of the question of free will...\n", "362 Chapter 363 thus our conception of free will and inevitab...\n", "363 Chapter 364 history examines the manifestations of man’s ...\n", "364 Chapter 365 from the time the law of copernicus was disco...\n", "\n", "[365 rows x 2 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ch_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization\n", "Now that we have read in our data and begun some cleaning, we will tokenize our text. Tokenization is the process of splitting each chapter into relevant units. 
There are two main types of tokenization:\n", "* Sentence-level tokenization: splitting up the text into sentences\n", "* Word-level tokenization: splitting up the text into words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /Users/pnadel01/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## The Natural Language Toolkit (NLTK) provides a lot of very useful utilities for cleaning and analyzing textual data.\n", "## It requires this 'punkt' download for us to be able to tokenize\n", "import nltk\n", "nltk.download('punkt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentence-level tokenization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ch_df['sents'] = ch_df.text.apply(nltk.sent_tokenize)\n", "type(ch_df['sents'].iloc[0]) ## each element of the 'sents' column is a list of the sentences in that chapter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " chapter sents\n", "0 Chapter 1 “well, prince, so genoa and lucca are now jus...\n", "1 Chapter 1 but i warn you, if you don’t tell me that this...\n", "2 Chapter 1 but how do you do?\n", "3 Chapter 1 i see i have frightened you—sit down and tell ...\n", "4 Chapter 1 with these words she greeted prince vasíli kur...\n", "... ... ...\n", "26256 Chapter 365 as in the question of astronomy then, so in th...\n", "26257 Chapter 365 in astronomy it was the immovability of the ea...\n", "26258 Chapter 365 as with astronomy the difficulty of recognizin...\n", "26259 Chapter 365 but as in astronomy the new view said: “it is ...\n", "26260 Chapter 365 ***\n", "\n", "[26261 rows x 2 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Now we can 'explode' these lists, so that each sentence has its own cell in the 'sents' column.\n", "sent_explode = ch_df.explode('sents').drop(['text'],axis=1).reset_index(drop=True)\n", "sent_explode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word-level tokenization\n", "To conduct word-level tokenization, we'll be using regex again, but this time we'll use a tokenizer class in NLTK, `RegexpTokenizer`, that takes a regular expression and returns every stretch of text matching that pattern as a token. This is in contrast to so-called 'white-space tokenization', where words are delimited by spaces. White-space tokenization can be useful in some cases, but for most natural language it is insufficient, because punctuation stays attached to the words around it. That's why this NLTK tokenizer is so helpful. All we need to do is input the pattern below and we will get very good English tokenization.\n", "\n", "Note that tokenization is language dependent. A method that works for English will not necessarily work for other languages." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import RegexpTokenizer\n", "tokenizer = RegexpTokenizer(r\"(?x)\\w+(?:[-’]\\w+)*\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "anna pávlovna had had a cough for some days.\n" ] }, { "data": { "text/plain": [ "['anna', 'pávlovna', 'had', 'had', 'a', 'cough', 'for', 'some', 'days']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ex_sentence = sent_explode.sents.iloc[5]\n", "print(ex_sentence)\n", "tokenizer.tokenize(ex_sentence)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " chapter sents \\\n", "0 Chapter 1 “well, prince, so genoa and lucca are now jus... \n", "1 Chapter 1 but i warn you, if you don’t tell me that this... \n", "2 Chapter 1 but how do you do? \n", "3 Chapter 1 i see i have frightened you—sit down and tell ... \n", "4 Chapter 1 with these words she greeted prince vasíli kur... \n", "... ... ... \n", "26256 Chapter 365 as in the question of astronomy then, so in th... \n", "26257 Chapter 365 in astronomy it was the immovability of the ea... \n", "26258 Chapter 365 as with astronomy the difficulty of recognizin... \n", "26259 Chapter 365 but as in astronomy the new view said: “it is ... \n", "26260 Chapter 365 *** \n", "\n", " word_tokenized \n", "0 [well, prince, so, genoa, and, lucca, are, now... \n", "1 [but, i, warn, you, if, you, don’t, tell, me, ... \n", "2 [but, how, do, you, do] \n", "3 [i, see, i, have, frightened, you, sit, down, ... \n", "4 [with, these, words, she, greeted, prince, vas... \n", "... ... \n", "26256 [as, in, the, question, of, astronomy, then, s... \n", "26257 [in, astronomy, it, was, the, immovability, of... \n", "26258 [as, with, astronomy, the, difficulty, of, rec... \n", "26259 [but, as, in, astronomy, the, new, view, said,... \n", "26260 [] \n", "\n", "[26261 rows x 3 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sent_explode['word_tokenized'] = sent_explode['sents'].apply(tokenizer.tokenize)\n", "sent_explode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization\n", "We can now create a general function to visualize word use over the course of the book. For this task, we are more interested in general trends, so let's zoom back out to the book level.\n", "\n", "For more information on how to create visualizations in Python, please go through [this training](https://tuftsdatalab.github.io/uep239-data-analysis/). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## creating a new dataframe organized by book\n", "## this follows the same steps as the chapter-level code above\n", "book_dict = {}\n", "book_list = list(re.finditer(r'(BOOK [A-Z]+:)',text))\n", "for i, m in enumerate(book_list):\n", "    if i < len(book_list)-1:\n", "        book_dict[f'Book {i+1}'] = (m.end(0),book_list[i+1].start(0)-1)\n", "    else:\n", "        book_dict[f'Book {i+1}'] = (m.end(0), len(text)-1)\n", "\n", "book_df = pd.DataFrame.from_dict(book_dict, orient='index').reset_index().rename(columns={'index':'book',0:'start_idx',1:'end_idx'})\n", "\n", "book_df['text'] = book_df.apply(lambda x: text[x['start_idx']:x['end_idx']],axis=1)\n", "book_df['text'] = book_df['text'].apply(lambda x: x.replace('\\r','').replace('\\n',' ').replace('  ', ' ').lower())\n", "book_df = book_df.drop(['start_idx','end_idx'],axis=1)\n", "\n", "book_df['word_tokenized'] = book_df.text.apply(tokenizer.tokenize)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAigAAAHbCAYAAADh42GvAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/av/WaAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA7XUlEQVR4nO3de1gV9aL/8c9agoAoICg3BbHSRDMzLUPdZkpeMzXbbT2Ut7ZmeUnJvFRqdsNMzTTTva287CyrU1pqWf6wJBVRMcuU8BIpZWAnAwIDUeb3h8d1WokWuRZ8gffreeZ5WjOzZj7LcPFx5jszNsuyLAEAABjEXtEBAAAAfo+CAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHI+KDvBXlJSU6Pjx46pTp45sNltFxwEAAH+CZVn65ZdfFB4eLrv90sdIKmVBOX78uCIiIio6BgAA+AsyMzPVsGHDS65TKQtKnTp1JJ37gH5+fhWcBgAA/Bl5eXmKiIhw/B6/lEpZUM6f1vHz86OgAABQyfyZ4RkMkgUAAMahoAAAAONQUAAAgHEq5RgUAED5Onv2rIqLiys6Bgzn6empGjVquGRbFBQAwEVZlqWsrCzl5ORUdBRUEgEBAQoNDb3s+5RRUAAAF3W+nAQHB6tWrVrcHBMXZVmWTp06pRMnTkiSwsLCLmt7FBQAQKnOnj3rKCdBQUEVHQeVgI+PjyTpxIkTCg4OvqzTPQySBQCU6vyYk1q1alVwElQm539eLnfMEgUFAHBJnNZBWbjq54WCAgAAjENBAQAAxinzINmkpCQ999xzSk1N1Q8//KA1a9aoX79+TuukpaVp8uTJ2rJli86cOaPmzZvrnXfeUWRkpCSpsLBQDz30kFavXq2ioiJ1795dL730kkJCQlzyoQAA7hU1ZUO57evbWb3LbV/usnz5co0fP57LtcugzEdQCgoK1KpVKy1atKjU5UeOHFHHjh3VrFkzffrpp/ryyy81bdo0eXt7O9aZMGGC1q1bp7fffltbtmzR8ePHdccdd/z1TwEAANzOZrNp7dq15bKvMh9B6dmzp3r27HnR5Y8++qh69eql2bNnO+ZdeeWVjv/Ozc3VK6+8otdff11dunSRJC1btkzR0dHasWOHbrrpprJGAgCgwpw+fVo1a9as6BhVjkvHoJSUlGjDhg1q2rSpunfvruDgYLVr186pbaWmpqq4uFixsbGOec2aNVNkZKSSk5NL3W5RUZHy8vKcJgAALua///u/1bJlS/n4+CgoKEixsbEqKCiQJL388suKjo6Wt7e3mjVrppdeesnpvZMnT1bTpk1Vq1YtXXHFFZo2bZrTJbOPP/64rrvuOr388stq3Lix4wxBTk6O7rvvPoWEhMjb21vXXHON1q9f77Ttjz76SNHR0apdu7Z69OihH3744U9/pldffVUtWrSQl5eXwsLCNGbMGMeyY8eOqW/fvqpdu7b8/Px01113KTs727F86NChFwzHGD9+vDp37ux43blzZ40bN06TJk1SYGCgQkND9fjjjzuWR0VFSZL69+8vm83meO0uLr1R24kTJ5Sfn69Zs2bpqaee0rPPPquNGzfqjjvu0CeffKKbb75ZWVlZqlmzpgICApzeGxISoqysrFK3m5CQoJkzZ7oyKgBUOFeM46gK4zNc7YcfftCgQYM0e/Zs9e/fX7/88os+++wzWZalVatWafr06XrxxRfVunVrff755xoxYoR8fX01ZMgQSVKdOnW0fPlyhYeHa9++fRoxYoTq1KmjSZMmOfZx+PBhvfPOO3r33XdVo0YNlZSUqGfPnvrll1/02muv6corr9SBAwecblR26tQpzZkzR//5z39kt9t19913a+LEiVq1atUffqbFixcrPj5es2bNUs+ePZWbm6tt27ZJOndw4Hw5OT/2c/To0frHP/6hTz/9tEx/ditWrFB8fLxSUlKUnJysoUOHqkOHDrr11lu1a9cuBQcHa9myZer
Ro4fLnrlzMS4tKCUlJZKkvn37asKECZKk6667Ttu3b9eSJUt08803/6XtTp06VfHx8Y7XeXl5ioiIuPzAAIAq54cfftCZM2d0xx13qFGjRpKkli1bSpJmzJihuXPnOsY9Nm7cWAcOHNC//vUvR0F57LHHHNuKiorSxIkTtXr1aqeCcvr0aa1cuVL169eXJH388cfauXOn0tLS1LRpU0nSFVdc4ZSruLhYS5YscQx7GDNmjJ544ok/9ZmeeuopPfTQQ3rwwQcd82644QZJUmJiovbt26eMjAzH78aVK1eqRYsW2rVrl2O9P+Paa6/VjBkzJElNmjTRiy++qMTERN16662Oz3r+WTvu5tKCUq9ePXl4eKh58+ZO86Ojo7V161ZJUmhoqE6fPq2cnBynoyjZ2dkX/cBeXl7y8vJyZVQAQBXVqlUrde3aVS1btlT37t3VrVs33XnnnapZs6aOHDmie++9VyNGjHCsf+bMGfn7+ztev/nmm1qwYIGOHDmi/Px8nTlzRn5+fk77aNSokeMXtiTt3btXDRs2dJST0tSqVctpTGZYWJjjuTWXcuLECR0/flxdu3YtdXlaWpoiIiKc/uHevHlzBQQEKC0trcwF5bf+bEZ3cOkYlJo1a+qGG25Qenq60/yDBw86WmybNm3k6empxMREx/L09HQdO3ZMMTExrowDAKiGatSooU2bNunDDz9U8+bNtXDhQl199dX66quvJElLly7V3r17HdNXX32lHTt2SJKSk5MVFxenXr16af369fr888/16KOP6vTp00778PX1dXp9/hk0l+Lp6en02mazybKsP3zfn9n2H7Hb7Rfsq7Rb0ZeW8fzZkfJW5iMo+fn5Onz4sON1RkaG9u7dq8DAQEVGRurhhx/WP/7xD3Xq1Em33HKLNm7cqHXr1jnOg/n7++vee+9VfHy8AgMD5efnp7FjxyomJoYreAAALmGz2dShQwd16NBB06dPV6NGjbRt2zaFh4frm2++UVxcXKnv2759uxo1aqRHH33UMe/o0aN/uL9rr71W3333nQ4ePHjJoyh/RZ06dRQVFaXExETdcsstFyyPjo5WZmamMjMzHUdRDhw4oJycHMcZjfr16zsK2nl79+69oJD8EU9PT509e/YvfpKyKXNB2b17t9Mf0PmxIUOGDNHy5cvVv39/LVmyRAkJCRo3bpyuvvpqvfPOO+rYsaPjPc8//7zsdrsGDBjgdKM2AAAuV0pKihITE9WtWzcFBwcrJSVFP/74o6KjozVz5kyNGzdO/v7+6tGjh4qKirR79279/PPPio+PV5MmTXTs2DGtXr1aN9xwgzZs2KA1a9b84T5vvvlmderUSQMGDNC8efN01VVX6euvv5bNZlOPHj0u+zM9/vjjGjVqlIKDgx2Dcbdt26axY8cqNjZWLVu2VFxcnObPn68zZ87ogQce0M0336y2bdtKkrp06aLnnntOK1euVExMjF577TV99dVXat26dZlynC9KHTp0kJeXl+rWrXvZn+1iylxQOnfu/IeHpIYPH67hw4dfdLm3t7cWLVp00Zu9AQDMZvLVQ35+fkpKStL8+fOVl5enRo0aae7cuY57eNWqVUvPPfecHn74Yfn6+qply5YaP368JOn222/XhAkTNGbMGBUVFal3796aNm2a0+W2F/POO+9o4sSJGjRokAoKCnTVVVdp1qxZLvlMQ4YMUWFhoZ5//nlNnDhR9erV05133inp3NGi9957T2PHjlWnTp1kt9vVo0cPLVy40PH+7t27a9q0aZo0aZIKCws1fPhwDR48WPv27StTjrlz5yo+Pl5Lly5VgwYN9O2337rk85XGZv2ZE2CGycvLk7+/v3Jzcy8YuAQAlYXplxkXFhYqIyPD6V4fwB+51M9NWX5/87BAAABgHAoKAAAVrHbt2hedPvvss4qOVyFceh8UAABQdnv37r3osgYNGpRfEINQUAAAqGBXXXVVRUcwDqd4AACXVFE36kLl5KqfF46
gAABKVbNmTdntdh0/flz169dXzZo1ZbPZKjoWDGVZlk6fPq0ff/xRdrtdNWvWvKztUVAAAKWy2+1q3LixfvjhBx0/fryi46CSqFWrliIjI2W3X95JGgoKAOCiatasqcjISJ05c6bcbnGOyqtGjRry8PBwyZE2CgoA4JJsNps8PT3L/NwW4HIwSBYAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGCcMheUpKQk9enTR+Hh4bLZbFq7du1F1x01apRsNpvmz5/vNP/kyZOKi4uTn5+fAgICdO+99yo/P7+sUQAAQBVV5oJSUFCgVq1aadGiRZdcb82aNdqxY4fCw8MvWBYXF6f9+/dr06ZNWr9+vZKSkjRy5MiyRgEAAFWUR1nf0LNnT/Xs2fOS63z//fcaO3asPvroI/Xu3dtpWVpamjZu3Khdu3apbdu2kqSFCxeqV69emjNnTqmFBgAAVC8uH4NSUlKie+65Rw8//LBatGhxwfLk5GQFBAQ4yokkxcbGym63KyUlpdRtFhUVKS8vz2kCAABVl8sLyrPPPisPDw+NGzeu1OVZWVkKDg52mufh4aHAwEBlZWWV+p6EhAT5+/s7poiICFfHBgAABnFpQUlNTdULL7yg5cuXy2azuWy7U6dOVW5urmPKzMx02bYBAIB5XFpQPvvsM504cUKRkZHy8PCQh4eHjh49qoceekhRUVGSpNDQUJ04ccLpfWfOnNHJkycVGhpa6na9vLzk5+fnNAEAgKqrzINkL+Wee+5RbGys07zu3bvrnnvu0bBhwyRJMTExysnJUWpqqtq0aSNJ2rx5s0pKStSuXTtXxgEAAJVUmQtKfn6+Dh8+7HidkZGhvXv3KjAwUJGRkQoKCnJa39PTU6Ghobr66qslSdHR0erRo4dGjBihJUuWqLi4WGPGjNHAgQO5ggcAAEj6C6d4du/erdatW6t169aSpPj4eLVu3VrTp0//09tYtWqVmjVrpq5du6pXr17q2LGj/v3vf5c1CgAAqKLKfASlc+fOsizrT6//7bffXjAvMDBQr7/+ell3DQAAqgmexQMAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjFPmgpKUlKQ+ffooPDxcNptNa9eudSwrLi7W5MmT1bJlS/n6+io8PFyDBw/W8ePHnbZx8uRJxcXFyc/PTwEBAbr33nuVn59/2R8GAABUDWUuKAUFBWrVqpUWLVp0wbJTp05pz549mjZtmvbs2aN3331X6enpuv32253Wi4uL0/79+7Vp0yatX79eSUlJGjly5F//FAAAoEqxWZZl/eU322xas2aN+vXrd9F1du3apRtvvFFHjx5VZGSk0tLS1Lx5c+3atUtt27aVJG3cuFG9evXSd999p/Dw8D/cb15envz9/ZWbmys/P7+/Gh8AKlTUlA2XvY1vZ/V2QRKgfJTl97fbx6Dk5ubKZrMpICBAkpScnKyAgABHOZGk2NhY2e12paSklLqNoqIi5eXlOU0AAKDqcmtBKSws1OTJkzVo0CBHU8rKylJwcLDTeh4eHgoMDFRWVlap20lISJC/v79jioiIcGdsAABQwdxWUIqLi3XXXXfJsiwtXrz4srY1depU5ebmOqbMzEwXpQQAACbycMdGz5eTo0e
PavPmzU7nmUJDQ3XixAmn9c+cOaOTJ08qNDS01O15eXnJy8vLHVEBAICBXH4E5Xw5OXTokP7f//t/CgoKcloeExOjnJwcpaamOuZt3rxZJSUlateunavjAACASqjMR1Dy8/N1+PBhx+uMjAzt3btXgYGBCgsL05133qk9e/Zo/fr1Onv2rGNcSWBgoGrWrKno6Gj16NFDI0aM0JIlS1RcXKwxY8Zo4MCBf+oKHgAAUPWVuaDs3r1bt9xyi+N1fHy8JGnIkCF6/PHH9f7770uSrrvuOqf3ffLJJ+rcubMkadWqVRozZoy6du0qu92uAQMGaMGCBX/xIwAAgKqmzAWlc+fOutStU/7MbVUCAwP1+uuvl3XXAACgmuBZPAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOGUuKElJSerTp4/Cw8Nls9m0du1ap+WWZWn69OkKCwuTj4+PYmNjdejQIad1Tp48qbi4OPn5+SkgIED33nuv8vPzL+uDAACAqqPMBaWgoECtWrXSokWLSl0+e/ZsLViwQEuWLFFKSop8fX3VvXt3FRYWOtaJi4vT/v37tWnTJq1fv15JSUkaOXLkX/8UAACgSvEo6xt69uypnj17lrrMsizNnz9fjz32mPr27StJWrlypUJCQrR27VoNHDhQaWlp2rhxo3bt2qW2bdtKkhYuXKhevXppzpw5Cg8Pv4yPAwAAqgKXjkHJyMhQVlaWYmNjHfP8/f3Vrl07JScnS5KSk5MVEBDgKCeSFBsbK7vdrpSUlFK3W1RUpLy8PKcJAABUXWU+gnIpWVlZkqSQkBCn+SEhIY5lWVlZCg4Odg7h4aHAwEDHOr+XkJCgmTNnujIqgGouasqGy97Gt7N6uyAJgNJUiqt4pk6dqtzcXMeUmZlZ0ZEAAIAbubSghIaGSpKys7Od5mdnZzuWhYaG6sSJE07Lz5w5o5MnTzrW+T0vLy/5+fk5TQAAoOpyaUFp3LixQkNDlZiY6JiXl5enlJQUxcTESJJiYmKUk5Oj1NRUxzqbN29WSUmJ2rVr58o4AACgkirzGJT8/HwdPnzY8TojI0N79+5VYGCgIiMjNX78eD311FNq0qSJGjdurGnTpik8PFz9+vWTJEVHR6tHjx4aMWKElixZouLiYo0ZM0YDBw7kCh4AACDpLxSU3bt365ZbbnG8jo+PlyQNGTJEy5cv16RJk1RQUKCRI0cqJydHHTt21MaNG+Xt7e14z6pVqzRmzBh17dpVdrtdAwYM0IIFC1zwcQAAQFVgsyzLqugQZZWXlyd/f3/l5uYyHgXAX2LCVTwmZADKU1l+f1eKq3gAAED1QkEBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIzj8oJy9uxZTZs2TY0bN5aPj4+uvPJKPfnkk7Isy7GOZVmaPn26wsLC5OPjo9jYWB06dMjVUQAAQCXl8oLy7LPPavHixXrxxReVlpamZ599VrNnz9bChQsd68yePVsLFizQkiVLlJKSIl9fX3Xv3l2FhYWujgMAACohD1dvcPv27erbt6969+4tSYqKitIbb7yhnTt
3Sjp39GT+/Pl67LHH1LdvX0nSypUrFRISorVr12rgwIGujgQAACoZlx9Bad++vRITE3Xw4EFJ0hdffKGtW7eqZ8+ekqSMjAxlZWUpNjbW8R5/f3+1a9dOycnJpW6zqKhIeXl5ThMAAKi6XH4EZcqUKcrLy1OzZs1Uo0YNnT17Vk8//bTi4uIkSVlZWZKkkJAQp/eFhIQ4lv1eQkKCZs6c6eqoAADAUC4/gvLWW29p1apVev3117Vnzx6tWLFCc+bM0YoVK/7yNqdOnarc3FzHlJmZ6cLEAADANC4/gvLwww9rypQpjrEkLVu21NGjR5WQkKAhQ4YoNDRUkpSdna2wsDDH+7Kzs3XdddeVuk0vLy95eXm5OioAADCUy4+gnDp1Sna782Zr1KihkpISSVLjxo0VGhqqxMREx/K8vDylpKQoJibG1XEAAEAl5PIjKH369NHTTz+tyMhItWjRQp9//rnmzZun4cOHS5JsNpvGjx+vp556Sk2aNFHjxo01bdo0hYeHq1+/fq6OAwAAKiGXF5SFCxdq2rRpeuCBB3TixAmFh4frvvvu0/Tp0x3rTJo0SQUFBRo5cqRycnLUsWNHbdy4Ud7e3q6OAwAAKiGb9dtbvFYSeXl58vf3V25urvz8/Co6DoBKKGrKhsvexrezelf6DEB5Ksvvb57FAwAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACM45aC8v333+vuu+9WUFCQfHx81LJlS+3evdux3LIsTZ8+XWFhYfLx8VFsbKwOHTrkjigAAKAScnlB+fnnn9WhQwd5enrqww8/1IEDBzR37lzVrVvXsc7s2bO1YMECLVmyRCkpKfL19VX37t1VWFjo6jgAAKAS8nD1Bp999llFRERo2bJljnmNGzd2/LdlWZo/f74ee+wx9e3bV5K0cuVKhYSEaO3atRo4cKCrIwEAgErG5UdQ3n//fbVt21Z///vfFRwcrNatW2vp0qWO5RkZGcrKylJsbKxjnr+/v9q1a6fk5ORSt1lUVKS8vDynCQAAVF0uP4LyzTffaPHixYqPj9cjjzyiXbt2ady4capZs6aGDBmirKwsSVJISIjT+0JCQhzLfi8hIUEzZ850dVQAAPAbUVM2XPY2vp3V2wVJ3HAEpaSkRNdff72eeeYZtW7dWiNHjtSIESO0ZMmSv7zNqVOnKjc31zFlZma6MDEAADCNywtKWFiYmjdv7jQvOjpax44dkySFhoZKkrKzs53Wyc7Odiz7PS8vL/n5+TlNAACg6nJ5QenQoYPS09Od5h08eFCNGjWSdG7AbGhoqBITEx3L8/LylJKSopiYGFfHAQAAlZDLx6BMmDBB7du31zPPPKO77rpLO3fu1L///W/9+9//liTZbDaNHz9eTz31lJo0aaLGjRtr2rRpCg8PV79+/VwdBwAAVEIuLyg33HCD1qxZo6lTp+qJJ55Q48aNNX/+fMXFxTnWmTRpkgoKCjRy5Ejl5OSoY8eO2rhxo7y9vV0dBwAAVEIuLyiSdNttt+m222676HKbzaYnnnhCTzzxhDt2DwAAKjmexQMAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMbxqOgAAABUtKgpGy57G9/
O6u2CJDiPIygAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMZxe0GZNWuWbDabxo8f75hXWFio0aNHKygoSLVr19aAAQOUnZ3t7igAAKCScGtB2bVrl/71r3/p2muvdZo/YcIErVu3Tm+//ba2bNmi48eP64477nBnFAAAUIm4raDk5+crLi5OS5cuVd26dR3zc3Nz9corr2jevHnq0qWL2rRpo2XLlmn79u3asWOHu+IAAIBKxG0FZfTo0erdu7diY2Od5qempqq4uNhpfrNmzRQZGank5ORSt1VUVKS8vDynCQAAVF1uudX96tWrtWfPHu3ateuCZVlZWapZs6YCAgKc5oeEhCgrK6vU7SUkJGjmzJnuiAoAAAzk8iMomZmZevDBB7Vq1Sp5e3u7ZJtTp05Vbm6uY8rMzHTJdgEAgJlcXlBSU1N14sQJXX/99fLw8JCHh4e2bNmiBQsWyMPDQyEhITp9+rRycnKc3pedna3Q0NBSt+nl5SU/Pz+nCQAAVF0uP8XTtWtX7du3z2nesGHD1KxZM02ePFkRERHy9PRUYmKiBgwYIElKT0/XsWPHFBMT4+o4AACgEnJ5QalTp46uueYap3m+vr4KCgpyzL/33nsVHx+vwMBA+fn5aezYsYqJidFNN93k6jgAAKAScssg2T/y/PPPy263a8CAASoqKlL37t310ksvVUQUAABgoHIpKJ9++qnTa29vby1atEiLFi0qj90DAIBKhmfxAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwDgUFAAAYh4ICAACMQ0EBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYBwKCgAAMI5HRQeA+0VN2XBZ7/92Vm8XJQEA4M/hCAoAADAOBQUAABiHUzyoNjjVBQCVB0dQAACAcSgoAADAOBQUAABgHMagAEA1drljsyTGZ8E9OIICAACMQ0EBAADGoaAAAADjUFAAAIBxXF5QEhISdMMNN6hOnToKDg5Wv379lJ6e7rROYWGhRo8eraCgINWuXVsDBgxQdna2q6MAAIBKyuUFZcuWLRo9erR27NihTZs2qbi4WN26dVNBQYFjnQkTJmjdunV6++23tWXLFh0/flx33HGHq6MAAIBKyuWXGW/cuNHp9fLlyxUcHKzU1FR16tRJubm5euWVV/T666+rS5cukqRly5YpOjpaO3bs0E033eTqSAAAoJJx+xiU3NxcSVJgYKAkKTU1VcXFxYqNjXWs06xZM0VGRio5ObnUbRQVFSkvL89pAgAAVZdbC0pJSYnGjx+vDh066JprrpEkZWVlqWbNmgoICHBaNyQkRFlZWaVuJyEhQf7+/o4pIiLCnbEBAEAFc2tBGT16tL766iutXr36srYzdepU5ebmOqbMzEwXJQQAACZy263ux4wZo/Xr1yspKUkNGzZ0zA8NDdXp06eVk5PjdBQlOztboaGhpW7Ly8tLXl5e7ooKAAAM4/IjKJZlacyYMVqzZo02b96sxo0bOy1v06aNPD09lZiY6JiXnp6uY8eOKSYmxtVxAABAJeTyIyijR4/W66+/rvfee0916tRxjCvx9/eXj4+P/P39de+99yo+Pl6BgYHy8/PT2LFjFRMTwxU8AABAkhsKyuLFiyVJnTt3dpq/bNkyDR06VJL0/PPPy263a8CAASoqKlL37t310ksvuToKAACopFxeUCzL+sN1vL29tWjRIi1atMjVuwcAAFUAz+IBAADGoaAAAADjUFAAAIBxKCgAAMA4FBQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDgUFAAAYByXP83YFFFTNlz2Nr6d1dsFSQAA+GP83nLGERQAAGAcCgoAADAOBQUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKAAAwDg
UFAAAYBwKCgAAMA4FBQAAGIeCAgAAjFNlHxYIwFyX+1C0qvRANACl4wgKAAAwDgUFAAAYh4ICAACMwxgUN+NcO36LnwfgQpf790Li70ZVxBEUAABgHI6goFxw5AAAUBYVegRl0aJFioqKkre3t9q1a6edO3dWZBwAAGCICisob775puLj4zVjxgzt2bNHrVq1Uvfu3XXixImKigQAAAxRYQVl3rx5GjFihIYNG6bmzZtryZIlqlWrll599dWKigQAAAxRIWNQTp8+rdTUVE2dOtUxz263KzY2VsnJyResX1RUpKKiIsfr3NxcSVJeXt5F91FSdOqyc15q+3/W5eYgAxlcncEEJvw5mPAdQQYyVLcM55dZlvXHG7IqwPfff29JsrZv3+40/+GHH7ZuvPHGC9afMWOGJYmJiYmJiYmpCkyZmZl/2BUqxVU8U6dOVXx8vON1SUmJTp48qaCgINlstr+0zby8PEVERCgzM1N+fn6uikoGMpCBDGSoYhlMyVEVMliWpV9++UXh4eF/uG6FFJR69eqpRo0ays7OdpqfnZ2t0NDQC9b38vKSl5eX07yAgACXZPHz86vQH3oykIEMZCBD5chgSo7KnsHf3/9PrVchg2Rr1qypNm3aKDEx0TGvpKREiYmJiomJqYhIAADAIBV2iic+Pl5DhgxR27ZtdeONN2r+/PkqKCjQsGHDKioSAAAwRIUVlH/84x/68ccfNX36dGVlZem6667Txo0bFRISUi779/Ly0owZMy44dVSeyEAGMpCBDOZnMCVHdctgs6w/c60PAABA+eFhgQAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCggqXkZGhM2fOVHSMCsefwf9h7D5woer296LaF5QjR46oS5cubt/PDz/8oNdee00ffPCBTp8+7bSsoKBATzzxhNszbNq0STNmzNDmzZslSUlJSerZs6e6dOmiZcuWuX3/F3P11Vfr0KFDFbLv48ePa8aMGYqLi9PEiRP19ddfu32fGzdu1L59+ySdu0Hhk08+qQYNGsjLy0sNGzbUrFmz3P5F1KdPH/3nP//Rr7/+6tb9XEpRUZEmTpyoTp066dlnn5UkPfXUU6pdu7bq1Kmj//qv/yqXhyN+8cUXGjx4sK644gr5+PjI19dXLVu21LRp08rt4Yz/8z//o9mzZ6t///6KiYlRTEyM+vfvr+eee04//vhjuWS4lMzMTA0fPtzt+/n111+1detWHThw4IJlhYWFWrlypdszSFJaWpqWLVvm+D74+uuvdf/992v48OGO78+K4OXlpbS0tArZd0FBgZYtW6ZHH31UL774on766Se377PaX2b8xRdf6Prrr9fZs2fdto9du3apW7duKikpUXFxsRo0aKC1a9eqRYsWks7d4j88PNytGV577TUNGzZM1157rQ4ePKiFCxdqwoQJuvPOO1VSUqLXXntNq1at0p133um2DHfccUep89977z116dJFderUkSS9++67bstQq1YtHT16VPXr19eBAwfUvn171a9fX61bt9a+fft07NgxJScn69prr3VbhmbNmmnp0qX629/+poSEBM2dO1ePPvqooqOjlZ6eroSEBE2YMEGTJ092Wwa73a4aNWrI19dXgwYN0j//+U+1adPGbfsrTXx8vN58800NGjRIH3zwgW655RatX79ezzzzjOx2u6ZPn66ePXtqwYIFbsvw0UcfqX///urVq5d8fHz07rvvavjw4fL19dU777wjy7K0devWUh/B4Sq7du1S9+7dVatWLcXGxjruBZWdna3ExESdOnVKH330kdq2beu2DH+kPL4nDx48qG7duunYsWOy2Wzq2LGjVq9erbCwMEnl8z0pnfsHRN++fVW7dm2dOnVKa9as0eDBg9WqVSuVlJRoy5Yt+vjjj936D9vfPnvut1544QXdfffdCgoKkiTNmzfPbRmaN2+urVu3KjAwUJmZmerUqZN+/vlnNW3aVEeOHJGHh4d27Nihxo0buy1
DlS8of/Tl9v3332vOnDlu/aG/9dZbFRERoZdfflkFBQWaPHmy3nrrLW3atEmtW7cul794rVu31rBhwzRu3DglJiaqT58+evrppzVhwgRJ0ty5c7VmzRpt3brVbRnsdrs6dep0wQ/0ypUrdfvttzuer+TOozl2u11ZWVkKDg5Wv379VFJSonfffVceHh4qKSlRXFyc8vPztW7dOrdl8Pb21sGDBxUZGamWLVtq+vTp+vvf/+5YvmHDBo0fP96tR5Xsdru++uorffzxx3r11Ve1f/9+tWzZUv/85z8VFxenunXrum3f50VGRurVV19VbGysvvnmGzVp0kTvvvuu+vbtK+ncEb8RI0bo22+/dVuG1q1b67777tOoUaMc+xw3bpzS0tJUXFysnj17KiIiwq0/kzfddJNatWqlJUuWXPDwU8uyNGrUKH355ZdKTk52W4b333//ksu/+eYbPfTQQ279jurfv7+Ki4u1fPly5eTkaPz48Tpw4IA+/fRTRUZGlltBad++vbp06aKnnnpKq1ev1gMPPKD7779fTz/9tKRzD69NTU3Vxx9/7LYMdrtdrVq1uuCZc1u2bFHbtm3l6+srm83m1qM5v/2uvPvuu5WRkaEPPvhA/v7+ys/PV//+/VW/fn29/vrrbsugP3zecSVns9ms8PBwKyoqqtQpPDzcstvtbs1Qt25dKz093WleQkKCVbduXWvnzp1WVlaW2zP4+vpa33zzjeO1p6en9cUXXzhep6WlWUFBQW7N8MYbb1gNGza0Xn31Vaf5Hh4e1v79+9267/NsNpuVnZ1tWZZlRUREWElJSU7L9+zZY4WFhbk1Q1hYmJWcnGxZlmWFhIRYe/bscVp+8OBBy8fHx60ZfvvnYFmWlZKSYo0cOdLy9/e3fHx8rEGDBlmJiYluzeDj42MdPXrU8drT09P66quvHK8zMjKsWrVquTWDt7e3lZGR4XhdUlJieXp6WsePH7csy7KSkpKs+vXruz1DWlraRZenpaVZ3t7ebs1gs9ksu91u2Wy2i07u/o4KDg62vvzyS8frkpISa9SoUVZkZKR15MiRcvmetCzL8vPzsw4dOmRZlmWdPXvW8vDwcPo7um/fPiskJMStGRISEqzGjRtf8Hewor4rr7jiCuvjjz92Wr5t2zYrIiLCrRmq/BiURo0a6fnnn1dGRkap04YNG8olR2FhodPrKVOm6JFHHlG3bt20fft2t+/f09PTaeyLl5eXateu7fTa3eMRBg4cqM8++0yvvPKKBgwYoJ9//tmt+yuNzWZz/CvVbrdf8FTNgIAAt+fq37+/nn76aZ09e1Z9+/bVSy+95DTmZOHChbruuuvcmuH3brzxRv3rX//S8ePH9dJLLykzM1O33nqrW/cZGRnpOCqwa9cu2Ww27dy507E8JSVFDRo0cGuGBg0aKD093fH6yJEjKikpcRxCb9iwofLz892aITQ01Olz/97OnTvd/giQsLAwvfvuuyopKSl12rNnj1v3L50bf+Lh8X9PX7HZbFq8eLH69Omjm2++WQcPHnR7ht/uWzr3HeHt7e30PVGnTh3l5ua6df9TpkzRm2++qfvvv18TJ05UcXGxW/d3Mef/HAoLCx2n2s5r0KCB28dHVdizeMpLmzZtlJqaqrvuuqvU5Tabze0DEq+55hpt3779gnENEydOVElJiQYNGuTW/UvSVVddpa+//lpXX321pHOnts6P+ZDOfTE3bNjQ7TmioqKUlJSkmTNnqlWrVlq6dOkFh7XdybIsNW3aVDabTfn5+fryyy+d/r8cPnzYreMNJOmZZ55RbGysmjVrppiYGL399tvatGmTmjZtqsOHD+vkyZP66KOP3JrhYmrVqqWhQ4dq6NChbv+FMGrUKA0dOlQvv/yyUlNTNWfOHD3yyCP6+uuvZbfbtXjxYj300ENuzTB48GD985//1KOPPiovLy/NmzdPt99+u2rWrClJ2rt3r1vPsUvnvgdGjhyp1NRUde3a9YIxKEu
XLtWcOXPcmuH89+T502u/Vx7fk82aNdPu3bsVHR3tNP/FF1+UJN1+++1u3f95UVFROnTokK688kpJUnJysiIjIx3Ljx07dsEva3e44YYblJqaqtGjR6tt27ZatWpVuX5XSlLXrl3l4eGhvLw8paen65prrnEsO3r0qKPIu0uVLyhPPPGETp06ddHlzZs3V0ZGhlszDB48WFu2bHGc5/6tSZMmybIsLVmyxK0ZHnnkEadxBX5+fk7Ld+/efdES52p2u10zZ87UrbfeqsGDB7v9nPJv/X4swVVXXeX0eseOHerfv79bM/j7+2v79u165ZVXtG7dOkVFRamkpESnT5/WoEGDdP/997u9LN58882OX8IX07RpU7dmGD9+vIKDg5WcnKzhw4dr0KBBjjE5p06d0oQJE/Too4+6NcMjjzyigoICPfnkkyoqKlL37t31wgsvOJY3aNBAixcvdmuG0aNHq169enr++ef10ksvOf4+1KhRQ23atNHy5cvd/nfz4YcfVkFBwUWXX3XVVfrkk0/cmqF///564403dM8991yw7MUXX1RJSYnbvycl6f7773f6TvrtL2VJ+vDDD8vlyk9Jql27tlasWKHVq1crNja2XL8rZ8yYcUGW31q3bp3+9re/uTVDlR8kC7Pl5+fryJEjio6O/sNfmEBVV1xcrP/5n/+RJNWrV0+enp4VnAim+O6775SamqrY2Fj5+vpWdJxyQUEBAADGqfKDZAGgMiuvm0mSofLkqC4ZKCgAYLD8/Hxt2bKFDAZkMCVHdclQ5QfJAoDJ/szNJMlQPhlMyUGGc6rNGJTvvvvuoldG7NixQzfddBMZyEAGMpR7BrvdrrCwsIsOEj99+rSysrLcegUHGczKQYb/5dbbwBkkOjra+umnny6Yv3XrVsvf358MZCADGSokQ1RUlPXmm29edPnnn3/u9juoksGsHGQ4p9qMQbnpppvUrVs3/fLLL455SUlJ6tWr1wXXe5OBDGQgQ3llOH+TtIspj5ukkcGsHGT4X26tPwY5e/as1b9/f+vmm2+2CgsLrc2bN1u1a9e25s+fTwYykIEMFZZh//791q5duy66/PTp09a3335LhnLIYEoOMpxTbQqKZVlWUVGRFRsba7Vv396qXbu2tXDhQjKQgQxkMCYDgP9TpQfJfvnllxfM++WXXzRo0CD17t1b999/v2P+75+TQwYykIEM7s4A4OKqdEGx2+0XnCf77evz/22z2dw2EpkMZCADGQCUXZW+D4q7HwJIBjKQgQwA3KNKH0EBAACVU7W5zFg69+yAsWPHKjY2VrGxsRo3bpyOHDlCBjKQgQwVnuG777676LIdO3aQoRwzmJKjumeoNgXlo48+UvPmzbVz505de+21uvbaa5WSkqIWLVpo06ZNZCADGchQoRm6deumkydPXjB/27Zt6tGjBxnKMYMpOap9BrdfJ2SI6667zpo8efIF8ydPnmy1bt2aDGQgAxkqNMOwYcOsNm3aWHl5eY55W7Zssfz8/Kx58+aRoRwzmJKjumeoNgXFy8vLOnjw4AXz09PTLS8vLzKQgQxkqNAM1f2GdSZlMCVHdc9QbU7x1K9fX3v37r1g/t69exUcHEwGMpCBDBWawW63a/Xq1fL09FSXLl10++23KyEhQQ8++GC57J8M5uWo7hmq9GXGvzVixAiNHDlS33zzjdq3by/p3Dm0Z599VvHx8WQgAxnIUO4ZSrtZ3OOPP65Bgwbp7rvvVqdOnRzrlOcN66pjBlNykOE33H6MxhAlJSXWvHnzrAYNGlg2m82y2WxWgwYNrPnz51slJSVkIAMZyFDuGWw2m2W32x37/P3r8//tzqfGksGsHGT4TQ7Lqn73QTn/xNI6deqQgQxkIEOFZTh69OifXrdRo0ZkcGMGU3KQ4f9Uu4Ly448/Kj09XZLUrFkz1atXjwxkIAMZjMkA4H+59fiMQfLz861hw4ZZNWrUcBym8vDwsIYPH24VFBSQgQxkIEOFZrA
syzp8+LA1ZswYq2vXrlbXrl2tsWPHWocPHy63/ZPBvBzVOUO1uYonPj5eW7Zs0bp165STk6OcnBy999572rJlix566CEykIEMZKjQDCbcLI4MZuWo9hncXoEMERQUZH3yyScXzN+8ebNVr149MpCBDGSo0Awm3CyODGblqO4Zqk1B8fHxsQ4cOHDB/K+++sqqVasWGchABjJUaAYTbhZHBrNyVPcM1eYUT0xMjGbMmKHCwkLHvF9//VUzZ85UTEwMGchABjJUaAYTbhZHBrNyVPcM1eZGbS+88IK6d++uhg0bqlWrVpKkL774Qt7e3vroo4/IQAYykKFCM1TnG9aZlsGUHNU9Q7W6zPjUqVNatWqVvv76a0lSdHS04uLi5OPjQwYykIEMFZrBsizNnz9fc+fO1fHjxyVJ4eHhevjhhzVu3DjZbDYylFMGU3JU9wzVqqAAQGVQ3W5YZ3IGU3JUxwzV5hTPTz/9pKCgIElSZmamli5dql9//VV9+vRRp06dyEAGMpChQjOcZ8LN4shgVo5qm8GtQ3AN8OWXX1qNGjWy7Ha7dfXVV1uff/65FRISYtWuXdvy8/OzatSoYa1Zs4YMZCADGSokw3km3CyODGblqO4ZqnxB6dGjh3XbbbdZW7dute677z6rQYMG1vDhw62zZ89aZ8+etR544AGrXbt2ZCADGchQIRnOGzlypHXFFVdYH3zwgZWbm2vl5uZaGzZssK688kpr1KhRZCjHDKbkqO4ZqnxBCQoKsr744gvLsizrl19+sWw2m7V7927H8rS0NMvf358MZCADGSokw2+zVPTN4shgVo7qnqHK3wfl5MmTCg0NlSTVrl1bvr6+qlu3rmN53bp1HQN/yEAGMpChvDOcd+rUKYWEhFwwPzg4WKdOnSJDOWYwJUd1z1DlC4qkCy6DKq/L1MhABjKQ4c8y4WZxZDArR3XPUC2u4hk6dKi8vLwkSYWFhRo1apR8fX0lSUVFRWQgAxnIUKEZJDNuFkcGs3JU9wxV/j4ow4YN+1PrLVu2jAxkIAMZyj3Db1X0zeLIYF6O6pyhyhcUAABQ+VSLUzwAYDoTbhZHBrNyVPsMbr1GCABwSSbcLI4MZuUgwznV4ioeADDVpEmT1LJlSyUlJalz58667bbb1Lt3b+Xm5urnn3/Wfffdp1mzZpGhHDKYkoMM/8ut9QcAcEkm3CyODGblIMM5HEEBgApkws3iyGBWDjKcQ0EBgApmws3iyGBWDjJwFQ8AVDgTbhZHBrNykIH7oABAhTLhZnFkMCsHGc6hoAAAAOMwBgUAABiHggIAAIxDQQEAAMahoAAAAONQUAAAgHEoKABcrnPnzho/frxb9xEVFaX58+e7dR8AKg4FBQAAGIeCAgAAjENBAeAWZ86c0ZgxY+Tv76969epp2rRpOn9fyJ9//lmDBw9W3bp1VatWLfXs2VOHDh1yev8777yjFi1ayMvLS1FRUZo7d+4l9/fyyy8rICBAiYmJbvtMAMoPBQWAW6xYsUIeHh7auXOnXnjhBc2bN08vv/yypHPP+Ni9e7fef/99JScny7Is9erVS8XFxZKk1NRU3XXXXRo4cKD27dunxx9/XNOmTdPy5ctL3dfs2bM1ZcoUffzxx+ratWt5fUQAbsSt7gG4XOfOnXXixAnt37/f8QTUKVOm6P3339d7772npk2batu2bWrfvr0k6aefflJERIRWrFihv//974qLi9OPP/6ojz/+2LHNSZMmacOGDdq/f7+kc4Nkx48frx9++EH/+c9/tGnTJrVo0aL8PywAt+AICgC3uOmmm5wezx4TE6NDhw7pwIED8vDwULt27RzLgoKCdPXVVystLU2SlJaWpg4dOjhtr0OHDjp06JDOnj3rmDd37lwtXbpUW7dupZwAVQwFBUCl9be//U1nz57VW2+9VdFRALgYBQWAW6SkpDi93rFjh5o0aaLmzZvrzJkzTst/+uknpaenq3nz5pKk6Ohobdu2zen927Z
tU9OmTVWjRg3HvBtvvFEffvihnnnmGc2ZM8eNnwZAeaOgAHCLY8eOKT4+Xunp6XrjjTe0cOFCPfjgg2rSpIn69u2rESNGaOvWrfriiy909913q0GDBurbt68k6aGHHlJiYqKefPJJHTx4UCtWrNCLL76oiRMnXrCf9u3b64MPPtDMmTO5cRtQhXhUdAAAVdPgwYP166+/6sYbb1SNGjX04IMPauTIkZKkZcuW6cEHH9Rtt92m06dPq1OnTvrggw/k6ekpSbr++uv11ltvafr06XryyScVFhamJ554QkOHDi11Xx07dtSGDRvUq1cv1ahRQ2PHji2vjwnATbiKBwAAGIdTPAAAwDgUFAAAYBwKCgAAMA4FBQAAGIeCAgAAjENBAQAAxqGgAAAA41BQAACAcSgoAADAOBQUAABgHAoKAAAwzv8HR9iJ42ckcz4AAAAASUVORK5CYII=", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "search = 'napoleon'\n", "book_df['search_count'] = book_df['word_tokenized'].apply(lambda x: x.count(search.lower()))\n", "\n", "book_df.plot(x='book',y='search_count', kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalizing\n", "Once I am happy with how my pipeline is functioning, I like to then try to write a function that is as general as possible. In this case, we started with a Gutenberg URL to a `txt` file, so let's take that as the starting place for our general function. Please take some time to develop a function which will return a `DataFrame` which we can use to generate a similar visualization as above from any Gutenbery URL. Be careful to note where, in the code above, some methods are specific to *War and Peace* and which are general. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_gutenberg_data(\n", " url, # the Gutenberg URL\n", " # you may need more arguments for your function \n", " ):\n", " \"\"\" \n", " Takes in a Gutenberg URL \n", " Returns a DataFrame of relevant information\n", " \"\"\"\n", " ## GET TEXT FROM URL WITH A REQUEST\n", " \n", " ## TRIM YOUR TEXT BASED ON SUBSTRINGS\n", "\n", " content_dict = {}\n", " content_list = list(re.finditer(r'YOUR REGEX HERE!!'), text)\n", "\n", " ## CREATE A DICTIONARY FOR THE INDEX POSITONS YOUR CONTENT\n", " \n", " ## CONVERT YOUR DICTIONARY INTO A DATAFRAME\n", "\n", " ## CREATE A TEXT COLUMN IN YOUR DATAFRAME\n", "\n", " ## CLEAN YOUR TEXT\n", "\n", " ## TOKENIZE (SENTENCE AND WORD LEVEL)\n", "\n", " ## RETURN YOUR DATAFRAME" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "search = '' ## INPUT SEARCH TERM HERE\n", "\n", "## ADD A COUNTS COLUMN \n", "\n", "## PLOT A GRAPH" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specific Application" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "From *Tolstoy's Phoenix: From Method to Meaning in War and Peace* by George R. Clay (1998), page 7:\n", "
\n", ">In the most direct use of Tolstoy's techniques, he categorizes in his own voice. Instead of \"The princess smiled, thinking she knew more about the subject than Prince Vasili,\" he will write: \"The princess smiled, *as people do* who think they know more about the subject under discussion than those they are talking with\" (1:2, p.77; emphasis added). The first version expresses a private thought, but the one Tolstoy used implies (as R. F. Christian phrased it) that \"there is a basic denominator of human experience\" -- a sameness about our pattern of behavior, so that we all know what kind of smile it is... By writing \"as people do,\" Tolstoy isn't telling us what this smile is like, he is assuming that we *know* what it is like: that we have seen it many times before..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to find more examples and see if they fall in line with Clay's analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the phrases Clay identifies as relevant\n", "phrases = [\n", " 'as people do',\n", " 'of one who',\n", " 'peculiar to',\n", " 'as with everyone',\n", " 'as one who',\n", " 'only used by persons who',\n", " 'of a man who',\n", " 'which a man has',\n", " 'in the way',\n", " 'as is usually the case',\n", " 'such as often occurs',\n", " 'which usually follows'\n", "]\n", "def printExcerpt(row):\n", " print(row['chapter'], row['sents'], sep='\\t')\n", " print('\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AS PEOPLE DO\n", "Chapter 21\tdo you understand that in consideration of the count’s services, his request would be granted?...” the princess smiled as people do who think they know more about the subject under discussion than those they are talking with.\n", "\n", "\n", "-----\n", "OF ONE WHO\n", "Chapter 6\the put on the air of one who finds it 
impossible to reply to such nonsense, but it would in fact have been difficult to give any other answer than the one prince andrew gave to this naïve question.\n", "\n", "\n", "Chapter 10\t“i can just imagine what a funny figure that policeman cut!” and as he waved his arms to impersonate the policeman, his portly form again shook with a deep ringing laugh, the laugh of one who always eats well and, in particular, drinks well.\n", "\n", "\n", "Chapter 21\tshe had the air of one who has suddenly lost faith in the whole human race.\n", "\n", "\n", "Chapter 113\twhen he had joined the freemasons he had experienced the feeling of one who confidently steps onto the smooth surface of a bog.\n", "\n", "\n", "Chapter 150\tbut in spite of that she seemed to be disillusioned about everything and told everyone that she did not believe either in friendship or in love, or any of the joys of life, and expected peace only “yonder.” she adopted the tone of one who has suffered a great disappointment, like a girl who has either lost the man she loved or been cruelly deceived by him.\n", "\n", "\n", "Chapter 170\tthe emperor, with the agitation of one who has been personally affronted, was finishing with these words: “to enter russia without declaring war!\n", "\n", "\n", "-----\n", "PECULIAR TO\n", "Chapter 1\tshe is rich and of good family and that’s all i want.” and with the familiarity and easy grace peculiar to him, he raised the maid of honor’s hand to his lips, kissed it, and swung it to and fro as he lay back in his armchair, looking in another direction.\n", "\n", "\n", "Chapter 51\tprince vasíli frowned, twisting his mouth, his cheeks quivered and his face assumed the coarse, unpleasant expression peculiar to him.\n", "\n", "\n", "Chapter 75\ti will follow.” when princess mary returned from her father, the little princess sat working and looked up with that curious expression of inner, happy calm peculiar to pregnant women.\n", "\n", "\n", "Chapter 150\tjulie on the 
contrary accepted his attentions readily, though in a manner peculiar to herself.\n", "\n", "\n", "-----\n", "AS WITH EVERYONE\n", "Chapter 18\tberg, oblivious of irony or indifference, continued to explain how by exchanging into the guards he had already gained a step on his old comrades of the cadet corps; how in wartime the company commander might get killed and he, as senior in the company, might easily succeed to the post; how popular he was with everyone in the regiment, and how satisfied his father was with him.\n", "\n", "\n", "Chapter 25\tas with everyone, her face assumed a forced unnatural expression as soon as she looked in a glass.\n", "\n", "\n", "-----\n", "AS ONE WHO\n", "Chapter 8\thalfway through supper prince andrew leaned his elbows on the table and, with a look of nervous agitation such as pierre had never before seen on his face, began to talk—as one who has long had something on his mind and suddenly determines to speak out.\n", "\n", "\n", "-----\n", "ONLY USED BY PERSONS WHO\n", "Chapter 17\they, who’s there?” he called out in a tone only used by persons who are certain that those they call will rush to obey the summons.\n", "\n", "\n", "-----\n", "OF A MAN WHO\n", "Chapter 5\t“if buonaparte remains on the throne of france a year longer,” the vicomte continued, with the air of a man who, in a matter with which he is better acquainted than anyone else, does not listen to others but follows the current of his own thoughts, “things will have gone too far.\n", "\n", "\n", "Chapter 10\tas soon as he had seen a visitor off he returned to one of those who were still in the drawing room, drew a chair toward him or her, and jauntily spreading out his legs and putting his hands on his knees with the air of a man who enjoys life and knows how to live, he swayed to and fro with dignity, offered surmises about the weather, or touched on questions of health, sometimes in russian and sometimes in very bad but self-confident french; then again, like a man 
weary but unflinching in the fulfillment of duty, he rose to see some visitors off and, stroking his scanty gray hairs over his bald patch, also asked them to dinner.\n", "\n", "\n", "Chapter 37\treviewing his impressions of the recent battle, picturing pleasantly to himself the impression his news of a victory would create, or recalling the send-off given him by the commander in chief and his fellow officers, prince andrew was galloping along in a post chaise enjoying the feelings of a man who has at length begun to attain a long-desired happiness.\n", "\n", "\n", "Chapter 37\this face took on the stupid artificial smile (which does not even attempt to hide its artificiality) of a man who is continually receiving many petitioners one after another.\n", "\n", "\n", "Chapter 43\t“now what does this mean, gentlemen?” said the staff officer, in the reproachful tone of a man who has repeated the same thing more than once.\n", "\n", "\n", "Chapter 46\tit expressed the concentrated and happy resolution you see on the face of a man who on a hot day takes a final run before plunging into the water.\n", "\n", "\n", "Chapter 72\the was entirely absorbed by two considerations: his wife’s guilt, of which after his sleepless night he had not the slightest doubt, and the guiltlessness of dólokhov, who had no reason to preserve the honor of a man who was nothing to him.... 
“i should perhaps have done the same thing in his place,” thought pierre.\n", "\n", "\n", "Chapter 177\the said a few words to prince andrew and chernýshev about the present war, with the air of a man who knows beforehand that all will go wrong, and who is not displeased that it should be so.\n", "\n", "\n", "Chapter 258\ti must tell you, mon cher,” he continued in the sad and measured tones of a man who intends to tell a long story, “that our name is one of the most ancient in france.” and with a frenchman’s easy and naïve frankness the captain told pierre the story of his ancestors, his childhood, youth, and manhood, and all about his relations and his financial and family affairs, “ma pauvre mère” playing of course an important part in the story.\n", "\n", "\n", "Chapter 358\tthe replies this theory gives to historical questions are like the replies of a man who, watching the movements of a herd of cattle and paying no attention to the varying quality of the pasturage in different parts of the field, or to the driving of the herdsman, should attribute the direction the herd takes to what animal happens to be at its head.\n", "\n", "\n", "Chapter 363\tso that if we examine the case of a man whose connection with the external world is well known, where the time between the action and its examination is great, and where the causes of the action are most accessible, we get the conception of a maximum of inevitability and a minimum of free will.\n", "\n", "\n", "-----\n", "WHICH A MAN HAS\n", "Chapter 173\tnapoleon was in that state of irritability in which a man has to talk, talk, and talk, merely to convince himself that he is in the right.\n", "\n", "\n", "Chapter 334\tnow that he was telling it all to natásha he experienced that pleasure which a man has when women listen to him—not clever women who when listening either try to remember what they hear to enrich their minds and when opportunity offers to retell it, or who wish to adopt it to some thought of their 
own and promptly contribute their own clever comments prepared in their little mental workshop—but the pleasure given by real women gifted with a capacity to select and absorb the very best a man shows of himself.\n", "\n", "\n", "-----\n", "IN THE WAY\n", "Chapter 18\the was in the way and was the only one who did not notice the fact.\n", "\n", "\n", "Chapter 56\tberg put on the cleanest of coats, without a spot or speck of dust, stood before a looking glass and brushed the hair on his temples upwards, in the way affected by the emperor alexander, and, having assured himself from the way rostóv looked at it that his coat had been noticed, left the room with a pleasant smile.\n", "\n", "\n", "Chapter 128\tsónya was afraid to leave natásha and afraid of being in the way when she was with them.\n", "\n", "\n", "Chapter 130\tthe old countess sighed as she looked at them; sónya was always getting frightened lest she should be in the way and tried to find excuses for leaving them alone, even when they did not wish it.\n", "\n", "\n", "Chapter 152\tshe knew what she ought to have said to natásha, but she had been unable to say it because mademoiselle bourienne was in the way, and because, without knowing why, she felt it very difficult to speak of the marriage.\n", "\n", "\n", "Chapter 208\tthe stout man rose, frowned, shrugged his shoulders, and evidently trying to appear firm began to pull on his jacket without looking about him, but suddenly his lips trembled and he began to cry, in the way full-blooded grown-up men cry, though angry with himself for doing so.\n", "\n", "\n", "Chapter 221\tonce or twice he was shouted at for being in the way.\n", "\n", "\n", "Chapter 234\tthose who went away, taking what they could and abandoning their houses and half their belongings, did so from the latent patriotism which expresses itself not by phrases or by giving one’s children to save the fatherland and similar unnatural exploits, but unobtrusively, simply, organically, and 
therefore in the way that always produces the most powerful results.\n", "\n", "\n", "Chapter 282\tin all these plottings the subject of intrigue was generally the conduct of the war, which all these men believed they were directing; but this affair of the war went on independently of them, as it had to go: that is, never in the way people devised, but flowing always from the essential attitude of the masses.\n", "\n", "\n", "-----\n", "AS IS USUALLY THE CASE\n", "Chapter 95\tas is usually the case with people meeting after a prolonged separation, it was long before their conversation could settle on anything.\n", "\n", "\n", "-----\n", "SUCH AS OFTEN OCCURS\n", "Chapter 139\tafter a casual pause, such as often occurs when receiving friends for the first time in one’s own house, “uncle,” answering a thought that was in his visitors’ minds, said: “this, you see, is how i am finishing my days... death will come.\n", "\n", "\n", "-----\n", "WHICH USUALLY FOLLOWS\n", "Chapter 334\tthey all three of them now experienced that feeling of awkwardness which usually follows after a serious and heartfelt talk.\n", "\n", "\n", "-----\n" ] } ], "source": [ "for phrase in phrases:\n", " print(phrase.upper())\n", " sub_df = sent_explode.loc[sent_explode['sents'].str.contains(phrase)]\n", " sub_df.apply(printExcerpt,axis=1)\n", " print('-----')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! Now we can see all of the times that Tolstoy uses these particular phrases, and we can begin to see what Clay is saying. In these lines, Tolstoy is not giving us a direct characterization or description of a person or thing. Instead he is telling us about the type of person or thing it is. Not every match fits, though: because we search for plain substrings, hits such as \"he was in the way\" (Chapter 18) or \"of a man whose\" (Chapter 363) are false positives, so read each excerpt before counting it. The next step would then be to see whether there is a trend in the people or things he refers to in this manner or whether this is, as Clay contends, a general feature of Tolstoy's prose. At every step in that process, we must make sure to compare our results with Clay and Tolstoy himself. 
It can be easy to trick yourself into a discovery, and the only way to avoid that is to stay close to the data itself. Always take time to play around with your data, even after you think it's 100% clean, to see what it hides." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewing what we learned\n", "* How to take advantage of free repositories of text like Project Gutenberg\n", "* Taking a URL to a text and reading it into our Python runtime\n", "* Cleaning our data so that we can find some general statistics\n", "* How to create visualizations of these statistics\n", "* Applying simple methodologies to a complex question\n", "* Outlining a method we could use to answer such a question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a challenge, now that we have a good baseline for picking out the phrases Clay points us to, can you create a dataframe that stores the sentence, which phrase from Clay it uses, and which chapter it is in? Be sure to start with a dictionary, as we did when we created a dataframe together, and then convert it into a dataframe using `from_dict`. When you're done, you can export it as a .csv file with the method `whatever_your_df_is_called.to_csv('name_of_file.csv')` so you can use it later. This way you can share your work with other scholars and later deploy it to a website or another publication. If you have trouble or just want to show off how you did it, feel free to reach out and let me know at peter.nadel@tufts.edu " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Thanks for reading" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.8 ('webscrape_env')", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "9fab2b31d7f94569bb304ed7220415dbab9e94896d45f4079aae060fbbe4f8bf" } } }, "nbformat": 4, "nbformat_minor": 2 }