{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# The Art of Webscraping III: Scraping Reddit\n", "\n", "In the past two notebooks in this webscraping series, we saw how we could use Python to automate getting data from websites. First `beautifulsoup` gave us a method of navigating the HTML of a static webpage, and then `selenium` allowed us to parse dynamically generated pages.\n", "\n", "In this notebook, we'll look at a specialized source of data that we can pull from: Reddit. Reddit.com is a collection of forums where users can discuss topics of shared interest. A lot of people use Reddit, so many NLP researchers use it as a place to gather data for novel datasets. In this example, we'll collect a swathe of text from [r/latin](https://www.reddit.com/r/latin) and save it as a CSV.\n", "\n", "Over the past few years, Reddit has made it very difficult to get large chunks of their data. That said, another group, pullpush.io, has saved and hosted terabytes of historical Reddit data for public use. We'll be using their API in this notebook.\n", "\n", "**Nota Bene**: pullpush's API service is designed for academic use only! It is an incredible resource, especially because it is free. Please do not abuse it!" ], "metadata": { "id": "o0UGl8wMCxS6" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cqa0Igx3CrH5" }, "outputs": [], "source": [ "# imports\n", "import requests\n", "from datetime import datetime, timezone, timedelta\n", "from tqdm import tqdm\n", "import random\n", "random.seed(42)" ] }, { "cell_type": "markdown", "source": [ "Because pullpush is an API service, we will ask for data using a URL. This URL will include information like what subreddit we want to search through, the dates we want to search in, and how the results should be ordered. 
See below for an example." ], "metadata": { "id": "URee4NQWF9Cj" } }, { "cell_type": "code", "source": [ "ex_url = \"https://api.pullpush.io/reddit/search/submission/?subreddit=latin\"" ], "metadata": { "id": "z059Q9V5FKuF" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Let's break that down into its component parts:\n", "\n", "\n", "* *https://api.pullpush.io/reddit/search/*: This part of the URL should never change. This is the base URL that we'll be adding to depending on our purposes.\n", "* *submission/*: This addition tells pullpush that we want to search posts (submissions) and not comments. As we will see later, there is a different string that we can use instead that will allow us to search for comments.\n", "* *?*: This question mark is the start of our specific query. It tells pullpush that we are going to be giving it instructions about what data we are expecting to get.\n", "* *subreddit=latin*: This section tells pullpush we want data from the r/latin subreddit. 
This is very simple, but there are many ways we can refine this query significantly.\n", "\n", "This is the simplest type of query we can run, so let's see what it gives us.\n", "\n" ], "metadata": { "id": "cnNNRPLRGlws" } }, { "cell_type": "code", "source": [ "# using the requests library to make a GET request\n", "response = requests.get(ex_url)\n", "print(response.status_code) # status code 200 means it worked" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uw8hwlsiGlON", "outputId": "c838b231-2e04-4738-8001-97bd234b0710" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "200\n" ] } ] }, { "cell_type": "code", "source": [ "data = response.json()['data']\n", "len(data), type(data[0]) # 100 dictionary responses" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "E5LZj7jcjPIk", "outputId": "94394e63-b6b8-41ce-ed22-263bfb78fa45" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(100, dict)" ] }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "code", "source": [ "data[0]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "jq38YL31jU5H", "outputId": "95919387-2048-4c1c-f9c1-c19f02f8caa2" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'approved_at_utc': None,\n", " 'subreddit': 'latin',\n", " 'selftext': 'Share here your funny stories about people trying to test your abilities in Latin, such as trying to make you translate gibberish ',\n", " 'author_fullname': 't2_ckg1oxkg',\n", " 'saved': False,\n", " 'mod_reason_title': None,\n", " 'gilded': 0,\n", " 'clicked': False,\n", " 'title': 'Has anyone ever pulled the “dolor ipsum” text for you to translate?',\n", " 'link_flair_richtext': [],\n", " 'subreddit_name_prefixed': 'r/latin',\n", " 'hidden': False,\n", " 'pwls': 6,\n", " 'link_flair_css_class': '',\n", " 'downs': 0,\n", " 
'thumbnail_height': None,\n", " 'top_awarded_type': None,\n", " 'hide_score': True,\n", " 'name': 't3_1czonxu',\n", " 'quarantine': False,\n", " 'link_flair_text_color': 'light',\n", " 'upvote_ratio': 1.0,\n", " 'author_flair_background_color': None,\n", " 'subreddit_type': 'public',\n", " 'ups': 1,\n", " 'total_awards_received': 0,\n", " 'media_embed': {},\n", " 'thumbnail_width': None,\n", " 'author_flair_template_id': None,\n", " 'is_original_content': False,\n", " 'user_reports': [],\n", " 'secure_media': None,\n", " 'is_reddit_media_domain': False,\n", " 'is_meta': False,\n", " 'category': None,\n", " 'secure_media_embed': {},\n", " 'link_flair_text': 'Humor',\n", " 'can_mod_post': False,\n", " 'score': 1,\n", " 'approved_by': None,\n", " 'is_created_from_ads_ui': False,\n", " 'author_premium': False,\n", " 'thumbnail': 'self',\n", " 'edited': False,\n", " 'author_flair_css_class': None,\n", " 'author_flair_richtext': [],\n", " 'gildings': {},\n", " 'content_categories': None,\n", " 'is_self': True,\n", " 'mod_note': None,\n", " 'created': 1716567641.0,\n", " 'link_flair_type': 'text',\n", " 'wls': 6,\n", " 'removed_by_category': None,\n", " 'banned_by': None,\n", " 'author_flair_type': 'text',\n", " 'domain': 'self.latin',\n", " 'allow_live_comments': False,\n", " 'selftext_html': '<!-- SC_OFF --><div class=\"md\"><p>Share here your funny stories about people trying to test your abilities in Latin, such as trying to make you translate gibberish </p>\\n</div><!-- SC_ON -->',\n", " 'likes': None,\n", " 'suggested_sort': None,\n", " 'banned_at_utc': None,\n", " 'view_count': None,\n", " 'archived': False,\n", " 'no_follow': True,\n", " 'is_crosspostable': False,\n", " 'pinned': False,\n", " 'over_18': False,\n", " 'all_awardings': [],\n", " 'awarders': [],\n", " 'media_only': False,\n", " 'link_flair_template_id': '4d1e6ea6-f46d-11e9-833d-0e06f385cf88',\n", " 'can_gild': False,\n", " 'spoiler': False,\n", " 'locked': False,\n", " 'author_flair_text': None,\n", " 
'treatment_tags': [],\n", " 'visited': False,\n", " 'removed_by': None,\n", " 'num_reports': None,\n", " 'distinguished': None,\n", " 'subreddit_id': 't5_2qloa',\n", " 'author_is_blocked': False,\n", " 'mod_reason_by': None,\n", " 'removal_reason': None,\n", " 'link_flair_background_color': '#982133',\n", " 'id': '1czonxu',\n", " 'is_robot_indexable': True,\n", " 'report_reasons': None,\n", " 'author': 'Stoirelius',\n", " 'discussion_type': None,\n", " 'num_comments': 0,\n", " 'send_replies': True,\n", " 'whitelist_status': 'all_ads',\n", " 'contest_mode': False,\n", " 'mod_reports': [],\n", " 'author_patreon_flair': False,\n", " 'author_flair_text_color': None,\n", " 'permalink': '/r/latin/comments/1czonxu/has_anyone_ever_pulled_the_dolor_ipsum_text_for/',\n", " 'parent_whitelist_status': 'all_ads',\n", " 'stickied': False,\n", " 'url': 'https://www.reddit.com/r/latin/comments/1czonxu/has_anyone_ever_pulled_the_dolor_ipsum_text_for/',\n", " 'subreddit_subscribers': 96669,\n", " 'created_utc': 1716567641.0,\n", " 'num_crossposts': 0,\n", " 'media': None,\n", " 'is_video': False}" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "markdown", "source": [ "## Collecting Submissions (Posts)" ], "metadata": { "id": "2ubHOIi_xMw-" } }, { "cell_type": "markdown", "source": [ "As we can see, each one of these dictionaries from the `data` list holds the information from an individual post. But why are there only 100? Pullpush only allows users to get 100 posts per request, meaning that we'll have to get creative with how we request data from Pullpush.\n", "\n", "To do so, we'll have to take advantage of the other request modifiers besides just \"subreddit.\" A list of all of these can be found [here](https://pullpush.io/#docs).\n", "\n", "One method that we can try is using timestamps to segment the data into chunks of 100 posts or fewer. Pullpush allows us to ask for posts given a specific time block. 
We can then loop through these time blocks until we get the data that we want.\n", "\n", "Below I'll walk through getting data for a single day. According to the pullpush documentation, there is a \"before\" and an \"after\" modifier, but these only accept an \"Epoch value\". What does that mean?" ], "metadata": { "id": "qoWnw_u4j_tn" } }, { "cell_type": "code", "source": [ "# normal python date\n", "_date = datetime(2022, 1, 1)\n", "_date, type(_date)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NmMtDH55jb-4", "outputId": "5e1f8ff6-ba2a-457c-b899-d905d0aa1c88" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(datetime.datetime(2022, 1, 1, 0, 0), datetime.datetime)" ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "code", "source": [ "# epoch value\n", "dt_with_timezone = _date.replace(tzinfo=timezone.utc)\n", "int(dt_with_timezone.timestamp())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7tHNKlDEmfkV", "outputId": "c6b805dd-d8c0-4c66-a8bf-f617c99a5b69" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1640995200" ] }, "metadata": {}, "execution_count": 19 } ] }, { "cell_type": "markdown", "source": [ "An \"epoch value\" or Unix timestamp is a special method of encoding dates for computers. It is a standard which represents dates as the number of seconds that have elapsed since January 1, 1970 (UTC). This might seem arbitrary, and that's because it is! That said, we can create a few functions to make translating between normal Python datetime objects and epoch values easier."
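, "\n",
"\n",
"For example, midnight on January 1, 1970 (UTC) is exactly zero seconds into the epoch. A quick sanity check we can run in a cell (this only assumes the standard-library `datetime` module, already imported above):\n",
"\n",
"```python\n",
"from datetime import datetime, timezone\n",
"\n",
"# the Unix epoch starts at midnight, January 1, 1970, UTC\n",
"epoch_start = datetime(1970, 1, 1, tzinfo=timezone.utc)\n",
"print(int(epoch_start.timestamp()))  # 0\n",
"```"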
], "metadata": { "id": "2dr7o8kHnBSq" } }, { "cell_type": "code", "source": [ "def convert_utc_to_date(ts):\n", " '''\n", " Converts a UTC timestamp to a local, human-readable date string.\n", " '''\n", " utc_datetime = datetime.fromtimestamp(ts, tz=timezone.utc) # timezone-aware UTC datetime\n", " local_datetime = utc_datetime.astimezone()\n", " return local_datetime.strftime('%Y-%m-%d %H:%M:%S')" ], "metadata": { "id": "77dAh0Oemtpe" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def convert_date_to_utc(dt):\n", " '''\n", " Treats a naive datetime object as UTC and converts it to an epoch timestamp.\n", " '''\n", " dt_with_timezone = dt.replace(tzinfo=timezone.utc)\n", " return int(dt_with_timezone.timestamp())" ], "metadata": { "id": "Hm4r0wO2n8Eu" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "print(convert_utc_to_date(convert_date_to_utc(_date))) # should print 2022-01-01 00:00:00" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "E5ketugHoBGi", "outputId": "06f6b55c-2146-4d98-cb1d-218e14b0c19f" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "2022-01-01 00:00:00\n" ] } ] }, { "cell_type": "markdown", "source": [ "Now let's try adding this to our request URL and retrieve a day's worth of posts. A day is 86400 seconds, so all we need to do is convert our datetime object to UTC and then add 86400."
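, "\n",
"\n",
"Where does 86400 come from? It is just 24 hours x 60 minutes x 60 seconds, which we can sanity-check with `timedelta` (already imported at the top of this notebook):\n",
"\n",
"```python\n",
"from datetime import timedelta\n",
"\n",
"# 24 * 60 * 60 seconds in one day\n",
"seconds_in_day = int(timedelta(days=1).total_seconds())\n",
"print(seconds_in_day)  # 86400\n",
"```"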
], "metadata": { "id": "qAdLFMLjoPug" } }, { "cell_type": "code", "source": [ "start_date = datetime(2024, 5, 23)\n", "utc_ts = convert_date_to_utc(start_date)\n", "\n", "url_query = f\"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit=latin\" # can use & to join modifiers\n", "url_query" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "DnmIz__goGci", "outputId": "c1bdd7f3-1ac5-4669-c433-8aa249b692c5" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'https://api.pullpush.io/reddit/search/submission/?after=1716422400&before=1716508800&subreddit=latin'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 24 } ] }, { "cell_type": "code", "source": [ "res = requests.get(url_query)\n", "if res.status_code == 200:\n", " data = res.json()['data']\n", " print(f\"Number of posts: {len(data)}\")\n", " print(f\"Most recent post: {convert_utc_to_date(data[0]['created_utc'])}, {data[0]['title']}\")\n", " print(f\"Least recent post: {convert_utc_to_date(data[-1]['created_utc'])}, {data[-1]['title']}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cifkfn_mpOHc", "outputId": "6f4e1295-91aa-4e22-fb62-2b17d0543aa0" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Number of posts: 19\n", "Most recent post: 2024-05-23 22:53:50, Latin Crochet Tapestry\n", "Least recent post: 2024-05-23 00:29:14, What (in your opinion) is the most boring part about Latin?\n" ] } ] }, { "cell_type": "markdown", "source": [ "Wonderful! Now we have a way to get all of the posts for a single day. Next, we can create a loop that goes through every day between a start date and an end date, collecting all of the data in between. 
To facilitate this, we are going to create a generator that does so.\n", "\n", "Generators look like functions, but they're slightly different. Instead of using the `return` keyword, a generator uses the `yield` keyword, which hands back one value at a time as the generator is iterated over. Refer to the example below." ], "metadata": { "id": "V_EgQ5NbqCkU" } }, { "cell_type": "code", "source": [ "# generating squares\n", "def gen_squares(n):\n", " for i in range(n): # loop through each number in range 0 to n\n", " yield i, i*i # yield the number and the number's square\n", "\n", "for i in gen_squares(5): # computation only occurs here\n", " print(i)\n", "print() # prints empty line\n", "\n", "type(gen_squares(5)) # type = generator" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "H32x8SpepYXb", "outputId": "e81e387b-d723-45ef-8724-d710f552c1e4" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "(0, 0)\n", "(1, 1)\n", "(2, 4)\n", "(3, 9)\n", "(4, 16)\n", "\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "generator" ] }, "metadata": {}, "execution_count": 40 } ] }, { "cell_type": "code", "source": [ "# our date generator\n", "def date_range_generator(start_date, end_date):\n", " current = start_date # sets current to the start date\n", " total_days = (end_date - start_date).days +1 # defines the amount of days we want to loop through\n", "\n", " for _ in range(total_days): # for each day\n", " yield current # give back the current date\n", " current += timedelta(days=1) # add a day to the current date, move to the next day" ], "metadata": { "id": "6RcaUFZqt8V3" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# one last thing... 
adding a progress bar\n", "def date_range_generator(start_date, end_date):\n", " current = start_date # sets current to the start date\n", " total_days = (end_date - start_date).days +1 # defines the amount of days we want to loop through\n", "\n", " for _ in tqdm(range(total_days), desc=\"Processing Days\", unit=\"day\"): # for each day, now with a progress bar\n", " yield current # give back the current date\n", " current += timedelta(days=1) # add a day to the current date, move to the next day" ], "metadata": { "id": "6jJc21scu_u9" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# giving it a try!\n", "start_date = datetime(2024, 5, 1)\n", "end_date = datetime(2024, 5, 7) # just a week of data\n", "\n", "data = []\n", "for day in date_range_generator(start_date, end_date):\n", " utc_ts = convert_date_to_utc(day)\n", " url_query = f\"https://api.pullpush.io/reddit/search/submission/?after={utc_ts}&before={utc_ts+86400}&subreddit=latin\"\n", " res = requests.get(url_query)\n", " if res.status_code == 200:\n", " for post in res.json()['data']: # loop through each post and...\n", " if post not in data: # checking if it is already in our data list\n", " data.append(post) # if not, then we can add it" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YSW3zaCPvYfI", "outputId": "49ac7aca-cac1-49de-cc3b-035214d4cd95" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Processing Days: 100%|██████████| 7/7 [00:07<00:00, 1.08s/day]\n" ] } ] }, { "cell_type": "code", "source": [ "len(data) # more than 100!!" 
], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-iS81BKWv14P", "outputId": "fbc1da3a-6f71-4c62-8eac-acff2f1f70c4" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "126" ] }, "metadata": {}, "execution_count": 77 } ] }, { "cell_type": "markdown", "source": [ "Now that we have a good way of getting our data, we can dump it into a DataFrame and save it as a CSV. Most of the information here is either repetitive or not useful, so we can select only a subset of the most valuable data." ], "metadata": { "id": "lLlt35-hwjvk" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "\n", "cols_of_interest = [\n", " 'author', # username\n", " 'created_utc', # when it was posted\n", " 'id', # id of thread, useful for comments\n", " 'num_comments', # number of comments\n", " 'score', # upvotes - downvotes\n", " 'selftext', # text of the post\n", " 'title', # title of the post\n", " 'url' # url to the thread\n", "]\n", "\n", "df = pd.DataFrame(data)\n", "df = df[cols_of_interest]\n", "df['created_utc'] = df['created_utc'].astype(int)\n", "df['date'] = df['created_utc'].apply(convert_utc_to_date) # convert utc to normal date format\n", "df.to_csv('r_latin20240501to20240507.csv')" ], "metadata": { "id": "x3qI6ZyEv3Lj" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 380 }, "id": "9qfGxT4Uwz3B", "outputId": "a58e5e24-42e4-4d61-9f70-b044aedda54a" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " author created_utc id num_comments score \\\n", "0 LeYGrec 1714599117 1chxw8d 1 1 \n", "1 chmendez 1714594619 1chw2n9 0 1 \n", "2 pacemqueamorem 1714594094 1chvuy0 0 1 \n", "3 nondescriptredditer1 1714585634 1chscrw 0 1 \n", "4 Robastion404 1714583448 1chrgtn 0 1 \n", "\n", " selftext \\\n", "0 Salvēte Latīnistae omnēs,\\n\\nI 
wonder what ... \n", "1 \n", "2 Hello,\\nmy best friend's birthday is coming up... \n", "3 Hello,\\n\\nCan someone please type out for me, ... \n", "4 I’m getting my first tattoo in August and I ne... \n", "\n", " title \\\n", "0 Classical Latin, Late Latin Medieval Latin Pho... \n", "1 The start phrase of the \"The First Catilinarian\" \n", "2 Quote about friendship? \n", "3 Transcription Request re a letter from 1323 \n", "4 Need this translated for a tattoo \n", "\n", " url date \n", "0 https://www.reddit.com/r/latin/comments/1chxw8... 2024-05-01 21:31:57 \n", "1 /r/ancientrome/comments/1chw0hn/todays_roman_q... 2024-05-01 20:16:59 \n", "2 https://www.reddit.com/r/latin/comments/1chvuy... 2024-05-01 20:08:14 \n", "3 https://www.reddit.com/r/latin/comments/1chscr... 2024-05-01 17:47:14 \n", "4 https://www.reddit.com/r/latin/comments/1chrgt... 2024-05-01 17:10:48 " ], "text/html": [ "\n", "
\n", " | author | \n", "created_utc | \n", "id | \n", "num_comments | \n", "score | \n", "selftext | \n", "title | \n", "url | \n", "date | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | \n", "LeYGrec | \n", "1714599117 | \n", "1chxw8d | \n", "1 | \n", "1 | \n", "Salvēte Latīnistae omnēs,\\n\\nI wonder what ... | \n", "Classical Latin, Late Latin Medieval Latin Pho... | \n", "https://www.reddit.com/r/latin/comments/1chxw8... | \n", "2024-05-01 21:31:57 | \n", "
1 | \n", "chmendez | \n", "1714594619 | \n", "1chw2n9 | \n", "0 | \n", "1 | \n", "\n", " | The start phrase of the \"The First Catilinarian\" | \n", "/r/ancientrome/comments/1chw0hn/todays_roman_q... | \n", "2024-05-01 20:16:59 | \n", "
2 | \n", "pacemqueamorem | \n", "1714594094 | \n", "1chvuy0 | \n", "0 | \n", "1 | \n", "Hello,\\nmy best friend's birthday is coming up... | \n", "Quote about friendship? | \n", "https://www.reddit.com/r/latin/comments/1chvuy... | \n", "2024-05-01 20:08:14 | \n", "
3 | \n", "nondescriptredditer1 | \n", "1714585634 | \n", "1chscrw | \n", "0 | \n", "1 | \n", "Hello,\\n\\nCan someone please type out for me, ... | \n", "Transcription Request re a letter from 1323 | \n", "https://www.reddit.com/r/latin/comments/1chscr... | \n", "2024-05-01 17:47:14 | \n", "
4 | \n", "Robastion404 | \n", "1714583448 | \n", "1chrgtn | \n", "0 | \n", "1 | \n", "I’m getting my first tattoo in August and I ne... | \n", "Need this translated for a tattoo | \n", "https://www.reddit.com/r/latin/comments/1chrgt... | \n", "2024-05-01 17:10:48 | \n", "