{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping Twitter using Selenium" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to this notebook, which is part of an introduction to web scraping with Selenium. Specifically, we are going to scrape tweets about bitcoin.\n", "\n", "Disclaimer: \n", "- Many improvements could be made to this code that would significantly improve the quality of the data obtained. This notebook has only one purpose: to explain the basics of web scraping with Selenium.\n", "\n", "- For some parts, I have used Izzy Analytics on YouTube as inspiration. I recommend giving him a watch: https://www.youtube.com/watch?v=3KaffTIZ5II&t=289s " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 1: Collecting our ingredients: (Guided) \n", "\n", "You need \n", "- A Python environment with Selenium.\n", "- Google Chrome.\n", "- ChromeDriver (Chromium)\n", "- A Twitter account\n", "\n", "How to obtain these is described in the presentation PDF, which is also in this repo.\n", "\n", "Also, we need to import the following:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from time import sleep # Will come in handy\n", "from getpass import getpass # For logging in to Twitter through Python\n", "from selenium import webdriver # Our WebDriver\n", "\n", "# other, but necessary:\n", "from selenium.webdriver.common.by import By # Locator strategies for finding elements\n", "from selenium.webdriver.common.keys import Keys # Special keyboard keys, e.g. Enter\n", "from selenium.webdriver.chrome.options import (\n", " Options,\n", ") # For setting some options for the driver, see Appendix.\n", "from selenium.common.exceptions import NoSuchElementException # Avoiding ads\n", "from selenium.webdriver.support import expected_conditions as EC # Conditions\n", "from selenium.webdriver.support.ui import (\n", " WebDriverWait,\n", ") # 
Make sure the element is loaded" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 2: Setting up, and starting our driver: (Guided)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "##### Task 3: Open Twitter, and provide the notebook with your login: (Guided)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "my_username = input(\"Provide a username: \")\n", "my_password = getpass()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Extra: HTML and XPATH\n", "\n", "What makes Selenium very powerful compared to more traditional web scraping frameworks is that we can easily extract just the parts of the HTML we want. This makes it easy to get clean data sets from the start.\n", "\n", "Instead of downloading whole HTML pages and then cleaning the data, we can tell Selenium where the elements we want are located and extract only that information.\n", "\n", "*HTML*, which stands for HyperText Markup Language, is the foundation of every website you see on the internet. It is a simple and powerful language used to create the structure and content of web pages. Think of HTML as the skeleton that gives a web page its shape.\n", "\n", "Example:\n", "\n", "