I want to use Python with BeautifulSoup to scrape information from the Clutch.co website. I want to collect data from companies that are listed on clutch.co; let's take, for example, the IT agencies from Israel that are visible on clutch.co:
https://clutch.co/il/agencies/digital

My approach:
import requests
from bs4 import BeautifulSoup
import time

def scrape_clutch_digital_agencies(url):
    # Set a User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Create a session to handle cookies
    session = requests.Session()

    # Check the robots.txt file
    robots_url = urljoin(url, '/robots.txt')
    robots_response = session.get(robots_url, headers=headers)

    # Print robots.txt content (for informational purposes)
    print("Robots.txt content:")
    print(robots_response.text)

    # Wait for a few seconds before making the first request
    time.sleep(2)

    # Send an HTTP request to the URL
    response = session.get(url, headers=headers)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the elements containing agency names (adjust this based on the website structure)
        agency_name_elements = soup.select('.company-info .company-name')

        # Extract and print the agency names
        agency_names = [element.get_text(strip=True) for element in agency_name_elements]
        print("Digital Agencies in Israel:")
        for name in agency_names:
            print(name)
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")

# Example usage
url = "https://clutch.co/il/agencies/digital"
scrape_clutch_digital_agencies(url)
Well, to be frank, I struggle with the conditions the site throws back. I run this in Google Colab, and it throws the following back in the developer console on Colab:
NameError Traceback (most recent call last)
<ipython-input-1-cd8d48cf2638> in <cell line: 47>()
45 # Example usage
46 url="https://clutch.co/il/agencies/digital"
---> 47 scrape_clutch_digital_agencies(url)
<ipython-input-1-cd8d48cf2638> in scrape_clutch_digital_agencies(url)
13
14 # Check the robots.txt file
---> 15 robots_url = urljoin(url, '/robots.txt')
16 robots_response = session.get(robots_url, headers=headers)
17
NameError: name 'urljoin' is not defined
Well, I need to get more insights. I am pretty sure that I will get around the robots.txt impact; robots.txt is the target of many, many interests. So I need to add the things that impact my tiny bs4 script.
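For context, here is a minimal sketch of how I imagine checking robots.txt programmatically with the standard library's urllib.robotparser; this is just an assumption on my side, not part of the script above:

from urllib.robotparser import RobotFileParser

# robots.txt sits at the site root (assumption: https://clutch.co/robots.txt)
rp = RobotFileParser()
rp.set_url("https://clutch.co/robots.txt")
rp.read()

# Ask whether a generic user agent may fetch the listing page
print(rp.can_fetch('*', "https://clutch.co/il/agencies/digital"))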
Update: dear Hedgehog, I tried to cope with the robots.txt, but it's hard (I can open a new thread if needed). To cope with the robots.txt, I therefore used the Selenium approach:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

def scrape_clutch_digital_agencies_with_selenium(url):
    # Set up Chrome options for headless browsing
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # Run Chrome in headless mode

    # Create a Chrome webdriver instance
    driver = webdriver.Chrome(options=chrome_options)

    # Visit the URL
    driver.get(url)

    # Wait for the JavaScript challenge to be completed (adjust sleep time if needed)
    time.sleep(5)

    # Get the page source after JavaScript has executed
    page_source = driver.page_source

    # Parse the HTML content of the page
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find the elements containing agency names (adjust this based on the website structure)
    agency_name_elements = soup.select('.company-info .company-name')

    # Extract and print the agency names
    agency_names = [element.get_text(strip=True) for element in agency_name_elements]
    print("Digital Agencies in Israel:")
    for name in agency_names:
        print(name)

    # Close the webdriver
    driver.quit()

# Example usage
url = "https://clutch.co/il/agencies/digital"
scrape_clutch_digital_agencies_with_selenium(url)
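If it helps, here is a variation I was considering where the fixed time.sleep is replaced by an explicit wait. The 15-second timeout is my guess, and the '.company-info' selector is just the one from my script above, so both may need adjusting to the actual page structure:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

def scrape_with_explicit_wait(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        # Wait until at least one company card is present instead of sleeping a fixed time
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '.company-info'))
        )
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for element in soup.select('.company-info .company-name'):
            print(element.get_text(strip=True))
    finally:
        driver.quit()

scrape_with_explicit_wait("https://clutch.co/il/agencies/digital")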
You have to import from the corresponding module to use urljoin(url, '/robots.txt') in your code:

from urllib.parse import urljoin

However, be aware that you will still get an error; the robots.txt is located under https://clutch.co/robots.txt.
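For illustration, a minimal check of what the join resolves to once the import is in place (the listing URL is just the example from the question):

from urllib.parse import urljoin

url = "https://clutch.co/il/agencies/digital"
robots_url = urljoin(url, '/robots.txt')

# The leading slash makes urljoin replace the whole path, so this points at the root robots.txt
print(robots_url)  # https://clutch.co/robots.txt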