
Scraping Twitter using Selenium (solution)#

Welcome to this notebook which is part of an introduction to web scraping with Selenium. Specifically, we are going to scrape tweets about bitcoin.

Disclaimer: Many improvements could be made to this code that would significantly improve the quality of the data obtained. This notebook has only one purpose: to explain the basics of web scraping with Selenium.

For some parts, I have used Izzy Analytics on YouTube as inspiration. I recommend watching his video: https://www.youtube.com/watch?v=3KaffTIZ5II&t=289s

Task 1: Collecting our ingredients: (Guided)#

You need:

- A Python environment with Selenium
- Google Chrome
- ChromeDriver (Chromium)
- A Twitter account

How to obtain these is described in the presentation PDF, which is also in this repo.

Also, we need to import the following:

[1]:

from time import sleep  # Will come in handy
from getpass import getpass  # For logging in to Twitter through Python
from selenium import webdriver  # Our WebDriver

# other, but necessary:
from selenium.webdriver.common.by import By  # For Crawling
from selenium.webdriver.common.keys import Keys  # For Crawling
from selenium.webdriver.chrome.options import (
    Options,
)  # For setting some options for the driver, see Appendix.
from selenium.common.exceptions import NoSuchElementException  # Skipping ads
from selenium.webdriver.support import expected_conditions as EC  # Conditions
from selenium.webdriver.support.ui import (
    WebDriverWait,
)  # Make sure the element is loaded

Task 2: Setting up, and starting our driver: (Guided)#

[ ]:

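from selenium.webdriver.chrome.service import Service  # Wraps the driver path

# Path configurations:
DRIVER_PATH = "C:\\Program Files (x86)\\chromedriver.exe"

# Set some options: default
options = Options()

# Start driver. Selenium 4 expects the driver path wrapped in a Service object;
# recent releases no longer accept the path as a positional argument.
driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH), options=options)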

Task 3: Open Twitter, and provide the notebook with your login: (Guided)#

[12]:

web_site = "https://twitter.com/home"
driver.get(web_site)

[4]:

my_username = input("Provide a username: ")
my_password = getpass()

Extra: HTML and XPATH#

What makes Selenium very powerful compared to more traditional web scraping frameworks is that we can easily extract exactly the parts of the HTML we want. This makes it easy to get clean data sets from the start.

Instead of downloading whole HTML pages and then cleaning the data, we can tell Selenium where the elements we want are located and extract only that information.

HTML, which stands for HyperText Markup Language, is the foundation of every website you see on the internet. It is a simple and powerful language used to create the structure and content of web pages. Think of HTML as the skeleton that gives a web page its shape.

Example:

<div>
    First div
    <div>
        Second div
        <input type="text" placeholder="Middle input" />
    </div>
</div>
<div>
    Third div
</div>

XPath is a query language used to navigate and select elements in an HTML document. It provides a concise way to locate specific elements or extract data based on their position in the document structure, their attributes, or their content.

To get the input element in the code above, we would give Selenium the following XPath:

/html/body/div[1]/div/input[@placeholder='Middle input']
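To see how such a path selects an element without opening a browser, here is a minimal sketch that evaluates it with Python's standard-library `xml.etree.ElementTree`. This is an illustration only: ElementTree supports just a subset of XPath and does not accept absolute paths on an element, so the path is written relative to the `<html>` root, and the surrounding `<html><body>` wrapper is an assumption added to make the fragment a complete document. Selenium itself hands the XPath to the browser's own engine.

```python
import xml.etree.ElementTree as ET

# The example fragment from above, wrapped in <html><body> (assumed wrapper)
PAGE = """
<html><body>
<div>
    First div
    <div>
        Second div
        <input type="text" placeholder="Middle input" />
    </div>
</div>
<div>
    Third div
</div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Step-by-step path, equivalent to /html/body/div[1]/div/input[...]
full_path = root.find("./body/div[1]/div/input[@placeholder='Middle input']")

# Shorter path: any <input> below the root with that placeholder
short_path = root.find(".//input[@placeholder='Middle input']")

print(full_path.get("placeholder"))  # Middle input
print(full_path is short_path)       # True
```

Both expressions land on the same element, which is why a short path built on a distinctive attribute is usually preferable to a long positional one.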

In our case…#

The location of the element where you provide your username on Twitter, as a full XPath:

"/html/body/div[1]/div/div/div[1]/div/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div[5]/label/div/div[2]/div/input"

But this also works:

"//input[@name='text']"

This works because the element’s name attribute is unique in the whole HTML document. As we can see, choosing the right identifier takes some practice.

Task 4: Our first crawling by logging in: (Guided)#

[13]:

username = driver.find_element(By.XPATH, "//input[@name='text']")
username.send_keys(my_username)
username.send_keys(Keys.RETURN)

Task 5: Our second crawling: (Try yourself) - 10 min#

[14]:

password = driver.find_element(By.XPATH, "//input[@name='password']")
password.send_keys(my_password)
password.send_keys(Keys.RETURN)

Task 5: Search for tweets mentioning “bitcoin”: (Guided)#

[15]:

search_box = driver.find_element(By.XPATH, "//input[@aria-label='Search query']")
search_box.send_keys("bitcoin")
search_box.send_keys(Keys.RETURN)

Note: If you make the browser window narrower, the search box is no longer part of the page, so this lookup fails.

Extra: Twitter Advanced Search#

Using Selenium enables us to navigate pages, but it also forces us to think smart. We want our code to do as little as possible to save time. Take this example:

Bitcoin was trading at about 50,000 dollars in October 2021. Bitcoin was trading at about 20,000 dollars in October 2022.

To search for particular dates, we can search for:

"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies

and

"bitcoin" lang:en until:2022-10-15 since:2022-10-14 -filter:links -filter:replies

These queries also filter the results so that we get only English tweets, without links and without replies.

The same could be achieved by clicking “Advanced search” and ticking the boxes we want. Typing the query directly into the search box saves a lot of time.
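If you want to run the same search for several dates, the query string can be assembled programmatically. The helper below is a small sketch; `build_search_query` and its parameters are hypothetical names introduced here, and it only reproduces the operators used in the queries above.

```python
def build_search_query(term, since, until, lang="en",
                       exclude_links=True, exclude_replies=True):
    """Assemble a search string using the operators shown in the text."""
    parts = [f'"{term}"', f"lang:{lang}", f"until:{until}", f"since:{since}"]
    if exclude_links:
        parts.append("-filter:links")
    if exclude_replies:
        parts.append("-filter:replies")
    return " ".join(parts)


query = build_search_query("bitcoin", since="2021-10-14", until="2021-10-15")
print(query)
# "bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies
```

The resulting string can be passed to `search_box.send_keys(...)` in the same way as a hand-written query.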

[16]:

search_box = driver.find_element(By.XPATH, "//input[@aria-label='Search query']")
search_box.send_keys(
    '"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies'
)
search_box.send_keys(Keys.RETURN)

Question: The code snippet above might not work. Why?

Task 6: Click on Latest (Homework)#

We want to look at the latest tweets. Try to click the “Latest” tab by:

1. Locating the element
2. Using element.click()

[17]:

driver.find_element(By.LINK_TEXT, "Latest").click()

If you have more time, try clicking “Top” again, or try clicking the “Tweet” button.

Task 7: Scraping tweets by locating tweets (cards), collecting them, and combining them into a deck of “cards”: (Guided)#

[18]:

cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')

[ ]:

for card in cards:
    print(card)

So far, the cards are just WebElement objects. We can pick one card and go a bit deeper.

[ ]:

card = cards[0]
card.text

Task 8: Finding the Twitter Handle (Name of Twitter Account, not username): (Guided)#

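NOTE: Once we have selected an element, any XPath we run on it has to start with “.” so that the search is relative to that element. An XPath starting with “//” would search the whole page instead.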

[ ]:

handle = card.find_element(By.XPATH, ".//a/div/div[1]/span/span").text
print(handle)

Task 9: We can also find username and date: (Homework)#

First, try it yourself. The username is a bit easier than the date. Hint: look for a unique identifier or tag.

Selenium has the following ways of identifying elements:

driver.find_element(By.ID, "id")
driver.find_element(By.NAME, "name")
driver.find_element(By.XPATH, "xpath")
driver.find_element(By.LINK_TEXT, "link text")
driver.find_element(By.PARTIAL_LINK_TEXT, "partial link text")
driver.find_element(By.TAG_NAME, "tag name")
driver.find_element(By.CLASS_NAME, "class name")
driver.find_element(By.CSS_SELECTOR, "css selector")

[22]:

username = card.find_element(By.XPATH, ".//span[contains(text(),'@')]").text
date = card.find_element(By.XPATH, ".//time").get_attribute(
    "datetime"
)  # Sponsored Content does not have this
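The `datetime` attribute comes back as a plain string. If you want to sort or filter tweets by time, it helps to turn it into a `datetime` object. The sketch below assumes the attribute is an ISO 8601 UTC timestamp with milliseconds and a trailing `Z`; the sample value is made up, and `parse_tweet_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_tweet_time(raw):
    """Parse a string such as '2021-10-14T23:15:42.000Z' into an aware datetime."""
    # Assumed format: date, 'T', time with fractional seconds, literal 'Z' for UTC
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


example = parse_tweet_time("2021-10-14T23:15:42.000Z")  # made-up sample value
print(example.date())  # 2021-10-14
```

If the real attribute uses a different format, `strptime` raises a `ValueError`, which makes a mismatch easy to spot.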

Task 10: Finally, let’s collect the tweet itself (this is a bit more complicated):#

[23]:

tweet_body = card.find_elements(By.XPATH, ".//div/div[2]/div[2]/div[2]/div/span")
text_list = [span.text for span in tweet_body]
tweet_text = " ".join(text_list)  # Join the text of all spans with spaces

Let’s extend our collection from one tweet to several.

Wrapping up: Make a function that executes all the steps above, and makes each tweet and the collected information into a tuple#

[24]:

def collect_tweet(card):
    try:
        date = card.find_element(By.XPATH, ".//time").get_attribute(
            "datetime"
        )  # Sponsored Content does not have this
    except NoSuchElementException:
        return False

    handle = card.find_element(By.XPATH, ".//a/div/div[1]/span/span").text
    username = card.find_element(By.XPATH, ".//span[contains(text(),'@')]").text

    tweet_text = _collect_text(card)

    tweet = (handle, username, date, tweet_text)
    return tweet


def _collect_text(card):
    # The tweet body is split over several spans; join their text with spaces
    tweet_body = card.find_elements(By.XPATH, ".//div/div[2]/div[2]/div[2]/div/span")
    return " ".join(span.text for span in tweet_body)

[ ]:

tweets = []
for card in cards:
    tweet = collect_tweet(card)
    if tweet:
        tweets.append(tweet)

tweets
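Once the tweets are collected as tuples, a natural next step is to store them. The following sketch writes tuples in the same `(handle, username, date, tweet_text)` layout to CSV using only the standard library. The rows are made-up sample data, and `tweets_to_csv` is a hypothetical helper name; an in-memory buffer stands in for a file so the example runs anywhere.

```python
import csv
import io

# Made-up rows in the same (handle, username, date, tweet_text) layout
sample_tweets = [
    ("Alice", "@alice", "2021-10-14T10:00:00.000Z", "bitcoin is up, again"),
    ("Bob", "@bob", "2021-10-14T11:30:00.000Z", 'He said "hodl"'),
]


def tweets_to_csv(tweets, target):
    """Write a header row followed by one row per tweet tuple."""
    writer = csv.writer(target)
    writer.writerow(["handle", "username", "date", "tweet_text"])
    writer.writerows(tweets)


buffer = io.StringIO()
tweets_to_csv(sample_tweets, buffer)

# Read the CSV back to check that commas and quotes survive the round trip
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows))  # 3
```

To write to disk instead, pass a file opened with `open("tweets.csv", "w", newline="", encoding="utf-8")` as `target`.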

We need to scroll, which can be done by:

[32]:

driver.execute_script("window.scroll(0,document.body.scrollHeight);")
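A single scroll is not enough for an endless feed: we have to keep scrolling and notice when the page stops moving. The sketch below isolates that logic in a function that only relies on `driver.execute_script`. `scroll_down` and `FakeDriver` are hypothetical names; `FakeDriver` is a stand-in that imitates a page of fixed height, so the logic can be tried without a browser.

```python
from time import sleep


def scroll_down(driver, max_attempts=3, pause=2.0):
    """Scroll to the bottom; return False if the page no longer moves."""
    last_position = driver.execute_script("return window.pageYOffset;")
    for _ in range(max_attempts):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to load new content
        if driver.execute_script("return window.pageYOffset;") != last_position:
            return True
    return False


class FakeDriver:
    """Hypothetical stand-in for a browser showing a page of fixed height."""

    def __init__(self, page_height):
        self.page_height = page_height
        self.position = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.position
        # Each scroll moves 1000 px down until the bottom is reached
        self.position = min(self.position + 1000, self.page_height)
        return None


fake = FakeDriver(page_height=2500)
moves = 0
while scroll_down(fake, pause=0):
    moves += 1
print(moves)  # 3
```

With a real driver, the pause should stay at a second or two so that new tweets have time to load before the position is compared.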

The last part is inspired by @israel-dryer (GitHub) and adapted to our case.

In particular, direct lookups during login and search, such as

driver.find_element(By.XPATH, "//input[@name='text']")

are replaced by explicit waits, so that the script only continues once the element has loaded:

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//input[@name='text']"))
)

I have also added a loading bar.
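To build some intuition for what an explicit wait does, here is a simplified pure-Python stand-in: it polls a condition until it returns something truthy or a timeout expires. This is not Selenium's implementation. The real `WebDriverWait` additionally passes the driver to the condition, ignores `NoSuchElementException` while polling, and raises Selenium's own `TimeoutException`. The names `wait_until` and `element_appears` are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Hypothetical condition that only succeeds on the third call,
# like an element that takes a moment to load
calls = {"count": 0}


def element_appears():
    calls["count"] += 1
    return "input element" if calls["count"] >= 3 else None


found = wait_until(element_appears, timeout=2.0, poll_interval=0.01)
print(found, calls["count"])  # input element 3
```

Compared to a fixed `sleep`, polling continues as soon as the element is there and fails loudly if it never shows up.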

[26]:

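def my_scraper(DRIVER_PATH, options, max_tweets):
    # Recent Selenium 4 releases expect the driver path wrapped in a Service object
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH), options=options)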
    web_site = "https://twitter.com/home"
    driver.get(web_site)

    # Crawl:

    username = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@name='text']"))
    )
    username.send_keys(my_username)
    username.send_keys(Keys.RETURN)

    password = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//input[@name='password']"))
    )
    password.send_keys(my_password)
    password.send_keys(Keys.RETURN)

    search_box = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.XPATH, "//input[@aria-label='Search query']")
        )
    )
    search_box.send_keys(
        '"bitcoin" lang:en until:2021-10-15 since:2021-10-14 -filter:links -filter:replies'
    )
    search_box.send_keys(Keys.RETURN)

    latest = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Latest"))
    )
    latest.click()

    # Scrape:

    data = []
    tweet_ids = set()  # In order to not collect duplicates
    last_position = driver.execute_script("return window.pageYOffset;")
    scrolling = True

    while scrolling:
        page_cards = driver.find_elements(By.XPATH, '//article[@data-testid="tweet"]')
        for card in page_cards[-15:]:
            tweet = collect_tweet(card)

            if tweet:
                tweet_id = "".join(tweet)

                if tweet_id not in tweet_ids:
                    tweet_ids.add(tweet_id)
                    data.append(tweet)

        # Progress indicator: percentage of max_tweets collected so far (capped at 100)
        percent_done = min(100, int((len(data) / max_tweets) * 100))
        print(f"{percent_done}% ", end="", flush=True)

        scroll_attempt = 0

        while True:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(2)
            curr_position = driver.execute_script("return window.pageYOffset;")

            if last_position == curr_position:
                scroll_attempt += 1

                # end of scroll region
                if scroll_attempt >= 3:
                    scrolling = False
                    break

                else:
                    sleep(2)  # attempt another scroll

            else:
                last_position = curr_position
                break

        if len(data) >= max_tweets:
            scrolling = False

    # Quit the web driver (closes the browser and ends the ChromeDriver process)
    driver.quit()
    return data

[ ]:

data = my_scraper(DRIVER_PATH, options, max_tweets=17)
data
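A side note on the duplicate check inside `my_scraper`: joining the fields with `"".join(tweet)` can, in rare cases, produce the same key for two different tweets. Since tuples of strings are hashable, the tuple itself can serve as the key. The sketch below is a variation on the approach above, not a replacement prescribed by the original code; `dedupe` is a hypothetical helper and the tuples are made-up data.

```python
def dedupe(tweets):
    """Return the tweets in original order, keeping only the first of each."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet not in seen:  # the whole tuple is the key
            seen.add(tweet)
            unique.append(tweet)
    return unique


# Two different made-up tweets whose joined strings collide
a = ("ab", "c", "2021-10-14", "hi")
b = ("a", "bc", "2021-10-14", "hi")
print("".join(a) == "".join(b))  # True

unique = dedupe([a, b, a])
print(len(unique))  # 2
```

The joined-string key would have dropped `b` as a duplicate of `a`, while the tuple key keeps both and still removes the repeated `a`.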

[ ]:

# Some options worth mentioning:

options.add_experimental_option(
    "prefs",
    {
        "download.default_directory": PLACE_YOUR_DESIRED_PATH,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    },
)
# setDownloadPreferences: Sets the download preferences for the browser.
# Here, it specifies the default download directory, disables the download prompt,
# enables directory upgrade, and enables safe browsing.

options.add_argument("--headless=new")
# setHeadlessMode: Sets the browser in headless mode, which means it runs without a
# graphical user interface.

options.add_argument("--disable-gpu")
# disableGPU: Disables the use of the GPU (graphics processing unit) in the browser.

options.add_argument("--no-sandbox")
# disableSandbox: Disables the sandbox mode, which provides an extra layer of security for the browser.

options.add_argument("--disable-dev-shm-usage")
# disableDevShmUsage: Disables the use of /dev/shm temporary storage in the browser.

options.add_argument("--log-level=3")
# setLogLevel: Sets the minimum log level for the browser. Level 3 (FATAL) is the least verbose setting: only fatal errors are logged.

options.add_argument("--silent")
# setSilentMode: Sets the browser in silent mode, which suppresses most browser notifications and prompts.