I’m trying to scrape the links located in the middle of this webpage, under the title "By City", using the requests module.
The script fails miserably: although the status code is 200, I end up getting nothing on the console.
Here is what I’ve tried:
import requests
from bs4 import BeautifulSoup
link = 'https://www.nursinghomes.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Referer': 'https://www.nursinghomes.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9'
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
for item in soup.select("li > strong > a[class^='text-link']"):
    print(item.get("href"))
Expected results (partial):
/tx/austin/
/md/baltimore/
/ny/bronx/
/ny/brooklyn/
/il/chicago/
If you add print(res.text)
you’ll see the following HTML response:
... Request unsuccessful. Incapsula incident ID: 15190006 ...
The website detects that the request is not coming from a real browser and rejects it, so bs4 never receives any useful HTML to parse.
You’ll need to find a way to make the website trust the request.
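One way to fail fast instead of silently printing nothing is to check the response body for the Incapsula block page before handing it to BeautifulSoup. Here is a minimal sketch; the marker string is taken from the error text shown above, and the helper name is just an illustration:

```python
def is_incapsula_blocked(body: str) -> bool:
    """Return True if the response body looks like an Incapsula block page."""
    return "Request unsuccessful. Incapsula incident ID" in body

# Sample bodies: the block page you got, and a fragment of the HTML you expected.
blocked_page = "... Request unsuccessful. Incapsula incident ID: 15190006 ..."
real_page = '<li><strong><a class="text-link" href="/tx/austin/">Austin</a></strong></li>'

print(is_incapsula_blocked(blocked_page))  # True
print(is_incapsula_blocked(real_page))     # False
```

If the check fires, sending more browser-like headers alone is unlikely to help, since Incapsula typically runs JavaScript challenges; a real browser context (e.g. Selenium or Playwright) or an official API, if the site offers one, would be the usual next step.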