Unable to scrape a manga page with Playwright

Question 1

I am scraping a manga page, https://battwo.com/title/75019-blue-lock, but I can’t get the info I want. These are the issues:

The page opens but never closes. I already tried page.setDefaultNavigationTimeout(timeout, 7000) (failed many times 🤣)
console.log doesn’t print the title constant 😒
I am using try/catch for errors, but am I missing something? 🤔

I’ve been using Playwright and Puppeteer documentation for the code.

I want to scrape the following elements from the page:

title
image
status
year 👈 this is what I’m hoping to get…
genres
synopsis
artists
authors
uploaders

To get just one element, I’m trying to use await page.$eval.
To get multiple elements, I’m planning to use await page.$$eval, then map the elements to get an array of all the content. When the scrape is finished, I want to write the data to a CSV file, then convert the CSV to an Excel or Google sheet.

This is the code I’ve built so far:

import playwright from 'playwright';

(async () => {
    try {
        
    // Start the browser, observe the process. // Or 'chromium' or 'webkit'.
    const browser = await playwright.firefox.launch({ headless: false }); 

    // Create a new incognito browser context. Set User Agent Method (Avoid block requests).
    const context = await browser.newContext({
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' +
            ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        });
    
    // Create a new page in a pristine context. Set ViewportSize. 
    const page = await context.newPage();
    await page.setViewportSize
        ({width: 640,
        height: 480,});
    
    // Page to get data from
    await page.goto('https://battwo.com/title/75019-blue-lock');
    page.setDefaultNavigationTimeout(timeout,7000) 
    // Configure the Main Selector.   
    await page.waitForSelector('main');
    
    // Configure the const Selector.  
    // Extract the required info.
    await page.waitForSelector('#text "Blue Lock"');
    const title = await page.$eval('.#text "Blue Lock" h3', element => element.innerText);
    console.log(title); 

    // Close all process
    await context.close();
    await browser.close();
}   catch (error) {

}
}) ();

Answer 1

There’s a major problem in your code:

try {
  // all of your logic
} catch (error) {
  // do nothing!?
}

This Pokemon exception handler swallows all of your exceptions, giving you no feedback about your execution and leading you into thinking the problem is something to do with the navigation. Always log your errors. Adding console.error(error) into the catch block immediately tells you what you need to do to start fixing the script:

ReferenceError: timeout is not defined
    at /home/greg/programming/scraping/a.js:23:38

The line in question is

page.setDefaultNavigationTimeout(timeout,7000)

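You probably meant to call it like this, and before page.goto() rather than after it, since the default timeout only applies to navigations started after it is set: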

page.setDefaultNavigationTimeout(7000)

Before going further: the process hangs because browser.close() is never reached. The exception thrown earlier in the try block jumps straight to catch, skipping the cleanup lines. Always put browser.close() in a finally block so your process can exit normally whether or not an error occurs.
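
As a rough sketch of that cleanup pattern in try/catch/finally form: runWithCleanup is a hypothetical helper name, not part of Playwright, and it assumes only that launch() resolves to an object with a close() method.

```javascript
// Hypothetical helper (not a Playwright API): `launch` resolves to a
// browser-like object with close(), and `task` does the actual scraping.
async function runWithCleanup(launch, task) {
  let browser;
  try {
    browser = await launch();
    return await task(browser);
  } catch (error) {
    // Log instead of swallowing, so failures are visible.
    console.error(error);
  } finally {
    // Runs on success and on failure. Optional chaining covers the case
    // where launch() itself threw and `browser` is still undefined.
    await browser?.close();
  }
}
```

With Playwright, this could be called as runWithCleanup(() => playwright.firefox.launch(), async browser => { ... }). On error the helper logs and resolves to undefined, which is a design choice of this sketch rather than a requirement.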

Next bug: '#text "Blue Lock"' is not a CSS selector. Logs to the rescue once again:

page.waitForSelector: Unexpected token ""Blue Lock"" while parsing selector "#text "Blue Lock""
    at /home/greg/programming/scraping/a.js:28:16 {
  name: 'Error'

Try using locators, and avoid hardcoding the text you want into the selector (you don’t know that in advance, right?).

Here’s my solution:

const playwright = require("playwright"); // ^1.38.0

const url = "<Your URL>";
const userAgent =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

let browser;
(async () => {
  browser = await playwright.firefox.launch();
  const context = await browser.newContext({userAgent});
  const page = await context.newPage();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const title = await page.locator("h3:visible").textContent();
  console.log(title); // => Blue Lock
})()
  .catch(err => console.error(err))
  .finally(() => browser.close());

Now that you’re back on track, I’ll leave it as an exercise to grab the rest of the data you want. Warning: the page is a bit tricky to scrape. For example, here’s my first attempt at extracting genres, which works, but could be cleaned up a bit:

const genres = await page
  .getByText("Genres:")
  .evaluate(el =>
    [...el.parentElement.querySelectorAll("span")]
      .map(e => e.textContent)
      .filter(e => e.length > 1)
      .filter((_, i) => i % 2 === 0)
  );
console.log(genres); // => [ 'Shounen(B),', 'Action,', 'Drama,', 'Sports' ]

Description/synopsis is easier:

const description = await page
  .locator('meta[name="description"]')
  .getAttribute("content");
console.log(description); // => The story begins with Japan’s elimination ...

By the way, a lot of the data is available statically, so you can get it faster and easier without browser automation using fetch and Cheerio.

If you run into problems getting any of the data, please open a new, specific question focused on a narrow goal.
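
For the CSV step mentioned in the question, here is a minimal sketch that needs no extra library. The names toCsv and csvField, and the choice to join array values with "; ", are assumptions made for illustration. The quoting follows the usual CSV convention: fields containing a comma, double quote, or line break are wrapped in double quotes, with embedded quotes doubled.

```javascript
// Hypothetical helpers for turning scraped records into CSV text.
function csvField(value) {
  // Arrays (e.g. genres) are flattened into one cell; null/undefined become "".
  const text = Array.isArray(value) ? value.join("; ") : String(value ?? "");
  // Quote only when needed, doubling any embedded double quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(rows, columns) {
  const lines = [columns.map(csvField).join(",")]; // header row
  for (const row of rows) {
    lines.push(columns.map(column => csvField(row[column])).join(","));
  }
  return lines.join("\r\n");
}

const csv = toCsv(
  [{ title: "Blue Lock", genres: ["Shounen(B)", "Action", "Drama", "Sports"] }],
  ["title", "genres"]
);
console.log(csv);
```

The resulting string can be written out with fs.writeFileSync, and both Excel and Google Sheets can import a CSV file directly.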

Answer 2

I suggest using the wpscript ultimate manga scraper code. It uses a WordPress scraper as well as Puppeteer to scrape manga from mangafox.