{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping Twitter using Selenium" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to this notebook, which is part of an introduction to web scraping with Selenium. Specifically, we are going to scrape tweets about bitcoin.\n", "\n", "Disclaimer:\n", "- Many improvements could be made to this code that would significantly improve the quality of the data obtained. This notebook has only one purpose: to explain the basics of web scraping with Selenium.\n", "\n", "- For some parts, I have used Izzy Analytics on YouTube as inspiration. I recommend watching his video: https://www.youtube.com/watch?v=3KaffTIZ5II&t=289s " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 1: Collecting our ingredients: (Guided)\n", "\n", "You need:\n", "- A Python environment with Selenium.\n", "- Google Chrome.\n", "- ChromeDriver (Chromium).\n", "- A Twitter account.\n", "\n", "How to get these is described in the presentation PDF, which is also in this repo.\n", "\n", "We also need to import the following:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from time import sleep  # Will come in handy for pausing between actions\n", "from getpass import getpass  # For logging in to Twitter through Python\n", "from selenium import webdriver  # Our WebDriver\n", "\n", "# Other necessary imports:\n", "from selenium.webdriver.common.by import By  # For crawling (locating elements)\n", "from selenium.webdriver.common.keys import Keys  # For crawling (keyboard input)\n", "from selenium.webdriver.chrome.options import (\n", "    Options,\n", ")  # For setting some options for the driver, see Appendix.\n", "from selenium.common.exceptions import NoSuchElementException  # Avoiding ads\n", "from selenium.webdriver.support import expected_conditions as EC  # Conditions\n", "from selenium.webdriver.support.ui import (\n", "    WebDriverWait,\n", ")  # 
Make sure the element is loaded" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 2: Setting up and starting our driver: (Guided)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 3: Open Twitter and provide the notebook with your login: (Guided)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "my_username = input(\"Provide a username: \")\n", "my_password = getpass()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Extra: HTML and XPATH\n", "\n", "What makes Selenium very powerful compared to more traditional web scraping frameworks is that we can easily extract just the parts of the HTML we want. This makes it easy to get clean data sets from the start.\n", "\n", "Instead of downloading whole HTML pages and then cleaning the data, we can tell Selenium where the elements we want are located and extract only that information.\n", "\n", "*HTML*, which stands for HyperText Markup Language, is the foundation of every website you see on the internet. It is a simple and powerful language used to create the structure and content of web pages. Think of HTML as the skeleton that gives a web page its shape.\n", "\n", "Example:\n", "\n", "
    <div>\n", "        First div\n", "        <div>\n", "            Second div\n", "            <input placeholder='Middle input'>\n", "        </div>\n", "    </div>\n", "    <div>\n", "        Third div\n", "    </div>
\n", "\n", "*XPath* is a query language used to navigate and select elements from an HTML document. It provides a concise way to locate specific elements or extract data based on their element structure, attributes, or content.\n", "\n", "To get the input element in the code above, we would have to feed Selenium with\n", "\n", "    /html/body/div[1]/div/input[@placeholder='Middle input']\n", "\n", "##### In our case...\n", "\n", "The location of the element where you provide your username on Twitter, as a full XPath:\n", "\n", "    \"/html/body/div[1]/div/div/div[1]/div/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/\n", "    div[5]/label/div/div[2]/div/input\"\n", "\n", "But this also works:\n", "\n", "    \"//input[@name='text']\"\n", "\n", "This works because its name is unique in the whole HTML document. As we see, getting the right identifier takes some practice.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 4: Our first crawling by logging in: (Guided)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 5: Our second crawling: (Try yourself) - 10 min" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 5: Search for tweets mentioning \"bitcoin\": (Guided)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note: If you make the browser window narrower, the element is no longer there." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Extra: Twitter Advanced Search\n", "\n", "Using Selenium enables us to navigate pages, but it also forces us to think smart. 
We want our code to do as little as possible to save time. Take this example:\n", "\n", "Bitcoin was exchanged at about 50'000 dollars in October 2021.\n", "Bitcoin was exchanged at about 20'000 dollars in October 2022.\n", "\n", "To search for particular dates, we can search for:\n", "\n", "```\"bitcoin\" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies```\n", "\n", "and\n", "\n", "```\"bitcoin\" lang:en until:2022-10-15 since:2022-10-14 -filter:links -filter:replies```\n", "\n", "These queries also filter the results so that we get *only English tweets*, *no links* and *no replies*.\n", "\n", "The same could be achieved by clicking \"advanced search\" and then ticking the boxes we want. Typing the query directly into the search box saves a lot of time." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Question: The code snippet above might not work. Why?" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 6: Click on Latest (Homework)\n", "We want to look at the latest tweets. To click the \"Latest\" tab:\n", "1. Locate the element\n", "2. 
Use element.click()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If you have more time, try clicking \"Top\" again, or try to click on the \"Tweet\" button." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 7: Scraping tweets by locating the tweets (cards), collecting them, and combining them into a deck of \"cards\": (Guided)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "So far, the cards are WebElements. We can pick one card and dig a bit deeper." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 8: Finding the Twitter Handle (Name of Twitter Account, not username): (Guided)\n", "\n", "NOTE: Once we have selected an element, any XPath we evaluate relative to it has to start with \".\". Otherwise the search starts from the root of the whole document instead of from the selected element." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 9: We can also find username and date: (Homework)\n", "\n", "First, try it yourself. The username is a bit easier than the date. *Hint*: Try to look for a unique identifier / tag. 
\n", "\n", "Selenium has the following ways of identifying elements:\n", "\n", "    driver.find_element(By.ID, \"id\")\n", "    driver.find_element(By.NAME, \"name\")\n", "    driver.find_element(By.XPATH, \"xpath\")\n", "    driver.find_element(By.LINK_TEXT, \"link text\")\n", "    driver.find_element(By.PARTIAL_LINK_TEXT, \"partial link text\")\n", "    driver.find_element(By.TAG_NAME, \"tag name\")\n", "    driver.find_element(By.CLASS_NAME, \"class name\")\n", "    driver.find_element(By.CSS_SELECTOR, \"css selector\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 10: Finally, let's collect the tweet itself (this is a bit more complicated):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's extend our collection from one to several tweets.\n", "\n", "##### Wrapping up: Make a function that executes all the steps above and stores each tweet's collected information in a tuple" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We need to scroll, which can be done with:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "driver.execute_script(\"window.scroll(0,document.body.scrollHeight);\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The last part is inspired by @israel-dryer (GitHub) and updated to fit our case. 
\n", "\n", "- In particular, the call\n", "\n", "  ```\n", "  driver.find_elements(By.XPATH, '//article[@data-testid=\"tweet\"]')\n", "  ```\n", "\n", "  is replaced by\n", "\n", "  ```\n", "  WebDriverWait(driver, 10).until(\n", "      EC.presence_of_element_located((By.XPATH, \"//input[@name='text']\"))\n", "  )\n", "  ```\n", "\n", "- I have also added a loading bar." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = my_scraper(DRIVER_PATH, options, max_tweets=17)\n", "data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some options worth mentioning:\n", "\n", "options.add_experimental_option(\n", "    \"prefs\",\n", "    {\n", "        \"download.default_directory\": PLACE_YOUR_DESIRED_PATH,\n", "        \"download.prompt_for_download\": False,\n", "        \"download.directory_upgrade\": True,\n", "        \"safebrowsing.enabled\": True,\n", "    },\n", ")\n", "# setDownloadPreferences: Sets the download preferences for the browser.\n", "# Here, it specifies the default download directory, disables the download prompt,\n", "# enables directory upgrade, and enables safe browsing.\n", "\n", "options.add_argument(\"--headless=new\")\n", "# setHeadlessMode: Sets the browser in headless mode, which means it runs without a\n", "# graphical user interface.\n", "\n", "options.add_argument(\"--disable-gpu\")\n", "# disableGPU: Disables the use of the GPU (graphics processing unit) in the browser.\n", "\n", "options.add_argument(\"--no-sandbox\")\n", "# disableSandbox: Disables the sandbox mode, which provides an extra layer of security for the browser.\n", "\n", "options.add_argument(\"--disable-dev-shm-usage\")\n", "# disableDevShmUsage: Disables the use of /dev/shm temporary storage in the browser.\n", 
"\n", "options.add_argument(\"--log-level=3\")\n", "# setLogLevel: Sets the logging level for the browser. Here, level 3 means that only\n", "# fatal errors are logged, which is the least verbose setting.\n", "\n", "options.add_argument(\"--silent\")\n", "# setSilentMode: Sets the browser in silent mode, which suppresses most browser notifications and prompts." ] } ], "metadata": { "kernelspec": { "display_name": "final_project", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 2 }