PDF shows as a Buffer Object in node.js. How do I parse it into text?

I am working on handling files on the backend. I am trying to send a PDF to the backend, and then I want to parse the PDF so that I can read its text. It seems that I am able to send the PDF over to the backend. However, I don’t know how to read the PDF text after I get it on the backend. Here is my post request:

app.post("/submitPDF", (request, response) => {
  console.log("Made a post request>>>", request.body);

  // if (!request.files && !request.files.pdfFile) {
  //   console.log("No file received");
  //   response.status(400);
  //   response.end();
  // }
  pdfParse(request.body.pdfFile).then((result) => {
    console.log(result.text);
  });
  response.status(201).send({ message: "File upload successful" });
});

Here is my API POST request just to show how I sent the PDF. I created a FormData object, appended my PDF, and then sent it in my post request:

export const fetchPDF = (value) => {
  console.log("The value>>>", value);
  const formData = new FormData();
  formData.append('pdfFile', value);
  console.log(Object.fromEntries(formData.entries())) // this is how to console log items in FormData

  return fetch(`${baseURL}/submitPDF`, {
    method: 'POST',
    headers: {
      'Content-Type': 'multipart/form-data', // had to change content-type to accept pdfs. this fixed the cors error
    },
    body: formData
  })
    .then((response) => {
      if (response.ok) {
        console.log("The response is ok");
        return response;
      } else {
        // If not successful, handle the error
        console.log("the response is not ok", response);
        throw new Error(`Error: ${response.status} - ${response.statusText}`);
      }
    })
    .catch((error) => {
      console.log("There is an error>>>", error.message);
    })
}

When I console log the request.body, which contains the PDF, I get some buffer object like this:

Made a post request>>> <Buffer 2d 2d 2d 2d 2d 2d 57 65 62 4b 69 74 46 6f 72 6d 42 6f 75 6e 64 61 72 79 35 57 69 37 45 36 4f 31 49 36 37 45 6f 53 42 32 0d 0a 43 6f 6e 74 65 6e 74 2d ... 120 more bytes>

I tried to parse my PDF using pdf-parse like this:

pdfParse(request.body.pdfFile).then((result) => {
  console.log(result.text);
});

But I get these 2 errors:

throw new Error('Invalid parameter in getDocument, ' + 'need either Uint8Array, string or a parameter object');

Error: Invalid parameter in getDocument, need either Uint8Array, string or a parameter object

It seems I have to parse the buffer object, but I’m not sure exactly how to do that. Would I have to convert the buffer object into a string? If so, how? And would I then use pdf-parse to read the PDF’s text?

"had to change content-type to accept pdfs. this fixed the cors error": the content type has little to do with CORS, but it may be one of the things breaking your process. The buffer starts with ------WebKitFormBoundary5Wi7E6O1I67EoSB2 (you can check with an online converter such as rapidtables.com/convert/number/hex-to-ascii.html), and that is certainly not what a PDF decoder expects.
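
You can also check this in Node itself instead of using an online converter. The snippet below is only a sketch: it decodes the bytes copied from the log output in the question (the variable names are mine, not from the original code) and compares the start of the buffer with the "%PDF-" signature that a PDF file begins with.

```javascript
// First bytes of the buffer, copied from the log output in the question.
const hex =
  '2d 2d 2d 2d 2d 2d 57 65 62 4b 69 74 46 6f 72 6d 42 6f 75 6e 64 61 72 79 ' +
  '35 57 69 37 45 36 4f 31 49 36 37 45 6f 53 42 32 0d 0a 43 6f 6e 74 65 6e 74 2d';

// Buffer.from(string, 'hex') expects the hex digits without spaces.
const head = Buffer.from(hex.replace(/\s+/g, ''), 'hex');
const text = head.toString('latin1');

// JSON.stringify makes the carriage return / line feed visible.
console.log(JSON.stringify(text));
// prints "------WebKitFormBoundary5Wi7E6O1I67EoSB2\r\nContent-"

// A PDF file starts with the signature "%PDF-"; this buffer does not.
const looksLikePdf = head.subarray(0, 5).toString('latin1') === '%PDF-';
console.log(looksLikePdf); // prints false
```

So request.body holds the raw multipart envelope (boundary line, part headers, then the file), not the PDF bytes themselves, which would explain why the PDF parser rejects it.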

You need middleware to handle the file upload, because Express does not parse multipart/form-data request bodies on its own.

Multer is a commonly recommended option:
https://github.com/expressjs/multer

Then update your code to something like:

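const express = require('express')
const fs = require('fs')
const multer  = require('multer')
const pdfParse = require('pdf-parse')
const upload = multer({ dest: 'uploads/' })

const app = express()

// 'pdfFile' must match the field name used in formData.append() on the client
app.post('/submitPDF', upload.single('pdfFile'), function (req, res, next) {
    // req.file describes the uploaded `pdfFile` file (not req.body)
    // req.body will hold the text fields, if there were any
    if (!req.file) {
        return res.status(400).send({ message: 'No file received' })
    }
    // With `dest`, multer saves the upload to disk, so read it back as a Buffer
    fs.promises.readFile(req.file.path)
        .then((buffer) => pdfParse(buffer))
        .then((result) => {
            console.log(result.text)
            res.status(201).send({ message: 'File upload successful' })
        })
        .catch(next)
})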
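
One more thing worth checking on the client: when the body is a FormData object, the Content-Type header normally should not be set by hand. A hand-written 'multipart/form-data' value has no boundary parameter, and the multipart parser on the server needs that parameter to split the body. The sketch below assumes a runtime with global fetch, FormData, Request and Blob (a browser, or Node 18+). The baseURL value and the buildUploadRequest helper are placeholders of mine, not part of the original code.

```javascript
// Hypothetical backend address; replace with your own.
const baseURL = 'http://localhost:3000';

// Build the upload request without setting Content-Type by hand.
// The runtime fills in "multipart/form-data; boundary=..." from the FormData body.
function buildUploadRequest(file) {
  const formData = new FormData();
  // The field name must match the one passed to upload.single() on the server.
  formData.append('pdfFile', file);
  return new Request(`${baseURL}/submitPDF`, { method: 'POST', body: formData });
}

// Same flow as fetchPDF in the question, minus the manual header.
function fetchPDF(file) {
  return fetch(buildUploadRequest(file)).then((response) => {
    if (!response.ok) {
      throw new Error(`Error: ${response.status} - ${response.statusText}`);
    }
    return response;
  });
}

// Inspect the generated header without sending anything over the network.
const demoFile = new Blob(['%PDF-1.4'], { type: 'application/pdf' });
const demoRequest = buildUploadRequest(demoFile);
console.log(demoRequest.headers.get('content-type'));
```

The logged header should start with multipart/form-data and carry a boundary=... parameter. The boundary string itself differs between runtimes and between requests.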
