BeautifulSoup – parsing the clutch.co site and respecting the rules in its robots.txt

I want to use Python with BeautifulSoup to scrape information from the Clutch.co website.

I want to collect data from the companies listed on clutch.co. Let's take, for example, the IT agencies from Israel that are visible on clutch.co:

https://clutch.co/il/agencies/digital

My approach:

import requests
from bs4 import BeautifulSoup
import time

def scrape_clutch_digital_agencies(url):
    # Set a User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Create a session to handle cookies
    session = requests.Session()

    # Check the robots.txt file
    robots_url = urljoin(url, '/robots.txt')
    robots_response = session.get(robots_url, headers=headers)

    # Print robots.txt content (for informational purposes)
    print("Robots.txt content:")
    print(robots_response.text)

    # Wait for a few seconds before making the first request
    time.sleep(2)

    # Send an HTTP request to the URL
    response = session.get(url, headers=headers)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the elements containing agency names (adjust this based on the website structure)
        agency_name_elements = soup.select('.company-info .company-name')

        # Extract and print the agency names
        agency_names = [element.get_text(strip=True) for element in agency_name_elements]

        print("Digital Agencies in Israel:")
        for name in agency_names:
            print(name)
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")

# Example usage
url="https://clutch.co/il/agencies/digital"
scrape_clutch_digital_agencies(url)

Well, to be frank, I struggle with the conditions. I run this in Google Colab, and it throws back the following in the developer console:

NameError                                 Traceback (most recent call last)

<ipython-input-1-cd8d48cf2638> in <cell line: 47>()
     45 # Example usage
     46 url="https://clutch.co/il/agencies/digital"
---> 47 scrape_clutch_digital_agencies(url)

<ipython-input-1-cd8d48cf2638> in scrape_clutch_digital_agencies(url)
     13 
     14     # Check the robots.txt file
---> 15     robots_url = urljoin(url, '/robots.txt')
     16     robots_response = session.get(robots_url, headers=headers)
     17 

NameError: name 'urljoin' is not defined

Well, I need to get more insight here. I am pretty sure that I will get around the robots.txt impact; that file is the target of a lot of interest, so I need to account for the things that affect my tiny bs4 script.
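For the robots.txt part, here is a minimal sketch of how the rules could be checked programmatically, using urllib.robotparser from the standard library (the wildcard user agent is just a placeholder; a real scraper should identify itself):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent='*'):
    # Resolve /robots.txt against the site root
    robots_url = urljoin(url, '/robots.txt')
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse the rules
    return parser.can_fetch(user_agent, url)

print(is_allowed('https://clutch.co/il/agencies/digital'))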

Update: dear Hedgehog, I tried to cope with the robots.txt, but it's hard; I can open a new thread if needed.

Therefore I used the Selenium approach:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

def scrape_clutch_digital_agencies_with_selenium(url):
    # Set up Chrome options for headless browsing
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # Run Chrome in headless mode

    # Create a Chrome webdriver instance
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Visit the URL
        driver.get(url)

        # Wait for the JavaScript challenge to be completed (adjust sleep time if needed)
        time.sleep(5)

        # Get the page source after JavaScript has executed
        page_source = driver.page_source

        # Parse the HTML content of the page
        soup = BeautifulSoup(page_source, 'html.parser')

        # Find the elements containing agency names (adjust this based on the website structure)
        agency_name_elements = soup.select('.company-info .company-name')

        # Extract and print the agency names
        agency_names = [element.get_text(strip=True) for element in agency_name_elements]

        print("Digital Agencies in Israel:")
        for name in agency_names:
            print(name)
    finally:
        # Close the webdriver even if something above raises
        driver.quit()

# Example usage
url="https://clutch.co/il/agencies/digital"
scrape_clutch_digital_agencies_with_selenium(url)
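If the fixed time.sleep(5) turns out to be flaky, an explicit wait is usually more robust. A sketch under the same assumption that the result cards match .company-info (adjust the selector to the real markup):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

try:
    driver.get("https://clutch.co/il/agencies/digital")
    # Block until at least one company card is present, up to 15 seconds,
    # instead of sleeping a fixed 5 seconds
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.company-info'))
    )
    page_source = driver.page_source
finally:
    driver.quit()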

You have to import urljoin from the corresponding module to use urljoin(url, '/robots.txt') in your code:

from urllib.parse import urljoin

Note that urljoin(url, '/robots.txt') resolves the absolute path against the site root, so it already points at the right place; the robots.txt is located under https://clutch.co/robots.txt, and the missing import was the only problem with that line.
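A quick check of what urljoin returns here:

from urllib.parse import urljoin

print(urljoin("https://clutch.co/il/agencies/digital", "/robots.txt"))
# prints: https://clutch.co/robots.txt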
