How do I extract text from a PDF or an image?

Tuesday, 10 December 2024

Extract Text from an Image or PDF Using Node.js

In this article, we'll demonstrate how to extract text from PDFs and images and save it to a Microsoft Word file. Optical Character Recognition (OCR) technology is used to recognize text within images or documents, and many mobile and web applications rely on it to scan documents and extract text: for instance, you might need to extract a vehicle's number from an image, or convert text from a PDF to Word and then store the details in a database. Various OCR tools and scripts are available for extracting text from images and PDFs and saving the results in a DOCX file. Here, we'll focus on the Google Drive API (v2) and guide you through using and configuring it to convert scanned PDFs and images to Word. We'll assume you already know how to set up npm and Node.js on your server or localhost.

If you're wondering how to extract text from a PDF or an image using Node.js in an MVC application like Laravel or CodeIgniter, here's a straightforward method:

Start by creating a route and a controller in your Laravel or CodeIgniter application. This controller will manage the task of interacting with the Node.js server. Inside the controller, utilize Laravel's HTTP client to send a request to the endpoint on the Node.js server.

The Node.js server will process the PDF or image, extract the text, and send it back as a JSON response. Once your application receives this response, it can process the extracted text, for example to identify a vehicle number, as sketched below.
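
For example, once the JSON response is parsed, a simple pattern match over the returned lines can pick out a vehicle number. The snippet below is only an illustration; the sample lines and the Indian-style plate pattern are assumptions, not part of the server code:

// Illustration: find an Indian-style vehicle number in extracted text lines.
const rawdata = ['Registration Certificate', 'Regn. No: DL01AB1234', 'Owner: ...']; // sample data
const plateRegex = /[A-Z]{2}\s?\d{1,2}\s?[A-Z]{1,3}\s?\d{4}/;
const match = rawdata.map(line => line.match(plateRegex)).find(Boolean);
console.log(match ? match[0] : 'No vehicle number found');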

To ensure this process works smoothly, you can test it using Postman. Postman will send your PDF or image document to the Node.js server. The server will scan the document, extract the text, and return it in JSON format.

By following these steps, you can efficiently extract text from images or PDFs using Node.js in your Laravel or CodeIgniter application.
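
Before wiring this into a Laravel or CodeIgniter controller, it can help to hit the endpoint from a small standalone script. Below is a minimal sketch, assuming Node 18+ (for the built-in fetch, FormData, and Blob globals); the URL, api_key header, and myrequestFile field name match the app.js server shown later in this article, while sample.pdf is just a placeholder filename:

// test-request.js - sends a PDF to the OCR endpoint and prints the JSON reply.
const fs = require('fs');

async function main() {
    const form = new FormData();
    // The field name must match what app.js reads from req.files.
    const pdf = new Blob([fs.readFileSync('sample.pdf')], { type: 'application/pdf' });
    form.append('myrequestFile', pdf, 'sample.pdf');

    const response = await fetch('http://127.0.0.1:8080/do_ocr', {
        method: 'POST',
        headers: { api_key: 'NTVhNmFmNWMzZDZmNzg1MzY4NzNkZWE2NGY' },
        body: form
    });
    console.log(await response.json());
}

main().catch(console.error);

A Laravel or CodeIgniter controller would send the same shape of request: a multipart body with a myrequestFile file field plus the api_key header.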

First, log in to your Google Cloud console account and create a new project. In the left-hand panel you will see the APIs & Services section; below it, click the Library link and enable the Google Drive API and the Google+ API. You can also search Google for "How to enable Google Drive API and get client credentials".

Click Create Credentials > OAuth client ID.
Click Application type > Desktop app.
In the Name field, type a name for the credential. This name is only shown in the Google Cloud console.
Click Create. The OAuth client created screen appears, showing your new Client ID and Client secret.
Click OK. The newly created credential appears under OAuth 2.0 Client IDs.
Save the downloaded JSON file as credentials.json, and move the file to your working directory.										
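
For a Desktop app credential, the downloaded credentials.json typically has the shape below (every value here is a placeholder, not a real credential); the app.js shown later reads the installed (or web) key from this file:

{
  "installed": {
    "client_id": "1234567890-abc123.apps.googleusercontent.com",
    "project_id": "your-project-id",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "client_secret": "your-client-secret",
    "redirect_uris": ["http://localhost"]
  }
}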

After completing the above steps, open your command prompt on Windows; if you are on Linux or Ubuntu, open PuTTY and log in there. I am using the latest version of WampServer on Windows, and all my projects live inside D:\wamp64\www\; on Ubuntu/Linux/WHM the root directory is the public_html folder. Go to the root folder of your server using the command prompt or PuTTY and follow the steps below to configure and run Node.js.

Create a directory and go inside it (the same applies over PuTTY if you are not working on localhost). Then create a package.json file: npm will ask you a few configuration questions, and at the end the file will be created. If you prefer, you can skip that and copy the package.json shown below directly into the ocr9 folder.

D:\wamp64\www> mkdir ocr9
D:\wamp64\www> cd ocr9

Now create a package.json file inside the ocr9 folder, copy and paste the content below into it, and save it in the ocr9 directory. This package.json lists all the packages needed for OCR.

{
  "name": "mine",
  "version": "1.0.0",
  "description": "gtest",
  "main": "app.js",
  "scripts": {
    "test": "test"
  },
  "keywords": [
    "test"
  ],
  "author": "rohit",
  "license": "ISC",
  "dependencies": {
    "@google-cloud/local-auth": "^2.1.0",
    "cookie-session": "^2.0.0",
    "cookies": "^0.8.0",
    "csv-parser": "^3.0.0",
    "express": "^4.18.2",
    "express-fileupload": "^1.4.0",
    "formidable": "^2.1.1",
    "fs-extra": "^11.1.0",
    "googleapis": "^105.0.0",
    "mammoth": "^1.5.1",
    "mv": "^2.1.1",
    "node-storage": "^0.0.9",
    "pdf-lib": "^1.17.1",
    "pdfkit": "^0.13.0",
    "uuid": "^9.0.0"
  }
}

Create a file named app.js and paste the following code into it. Requests are accepted at the /do_ocr endpoint, and the response is a JSON object. To access the API, you must send an api_key header whose value matches the data_key_en constant. When a request arrives (for example from Postman) and passes validation, the drive.files.insert function is called: your document is converted to DOCX format and stored in Google Drive, then exported and saved to your local directory. Since we only need the extracted text from PDF and image files, the deleteFile function removes both the uploaded document and the exported DOCX from the local directory after scanning. Please note that this code does not remove the converted document from Google Drive; you can write a function for that if you want to save space (a sketch is given later in this article).

If you comment out both deleteFile lines and then run the Node program, you will see both the uploaded file and the exported DOCX inside the ocr9 directory. If you open the converted DOCX, you will find the image on the first page and the extracted text on the second page.

#!/usr/bin/env node
const fs = require('fs').promises;  // promise-based fs API
const path = require('path');
const process = require('process');
const { authenticate } = require('@google-cloud/local-auth');
const { google } = require('googleapis');
const fsall = require('fs');        // callback/stream fs API
const mammoth = require('mammoth');

const express = require('express');
const fileUpload = require('express-fileupload');
const app = express();

const tempFileDir = __dirname + '/';
const SCOPES = ['https://www.googleapis.com/auth/drive'];
const TOKEN_PATH = path.join(process.cwd(), 'token.json');
const CREDENTIALS_PATH = path.join(process.cwd(), 'credentials.json');
// Shared secret that clients must send in the api_key request header.
const data_key_en = 'NTVhNmFmNWMzZDZmNzg1MzY4NzNkZWE2NGY';

/**
 * Reads previously authorized credentials from the saved file.
 *
 * @return {Promise<OAuth2Client|null>}
 */
async function loadSavedCredentialsIfExist() {
    try {
        const content = await fs.readFile(TOKEN_PATH);
        const credentials = JSON.parse(content);
        return google.auth.fromJSON(credentials);
    } catch (err) {
        return null;
    }
}

/**
 * Serializes credentials to a file compatible with GoogleAuth.fromJSON.
 *
 * @param {OAuth2Client} client
 * @return {Promise<void>}
 */
async function saveCredentials(client) {
    const content = await fs.readFile(CREDENTIALS_PATH);
    const keys = JSON.parse(content);
    const key = keys.installed || keys.web;
    const payload = JSON.stringify({
        type: 'authorized_user',
        client_id: key.client_id,
        client_secret: key.client_secret,
        refresh_token: client.credentials.refresh_token,
    });
    await fs.writeFile(TOKEN_PATH, payload);
}

/**
 * Load or request authorization to call APIs.
 *
 */
async function authorize() {
    let client = await loadSavedCredentialsIfExist();
    if (client) {
        return client;
    }
    client = await authenticate({
        scopes: SCOPES,
        keyfilePath: CREDENTIALS_PATH,
    });
    if (client.credentials) {
        await saveCredentials(client);
    }
    return client;
}

app.use(fileUpload({
    defCharset: 'utf8',
    defParamCharset: 'utf8',
    useTempFiles: true,
    tempFileDir: tempFileDir,
    //safeFileNames:true,
    limits: { fileSize: 5000000 },
}));

app.post('/', function(req, res) {
    res.setHeader('Content-Type', 'application/json');
    return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'You are welcome.' }));
});

app.post('/do_ocr', function(req, res) {

    let myrequestFile;
    let uploadPath;

    if (req.headers.api_key != data_key_en) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'Invalid api key.' }));
    }

    if (!req.files || Object.keys(req.files).length === 0) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'Request file can not be empty.' }));
    }

    if (!req.files.myrequestFile) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'myrequestFile field is required.' }));
    }

    // Check the size of the uploaded file itself, not the req.files container.
    if (req.files.myrequestFile.size > 5000000) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'Size limit has been exceeded.' }));
    }

    myrequestFile = req.files.myrequestFile;
    uploadPath = tempFileDir + myrequestFile.name;
    myrequestFile.mv(uploadPath, function(err) {
        if (err) {
            return res.status(500).send(JSON.stringify({ 'status': false, 'msg': 'Unable to process try again !!!' }));
        }
        
        return authorize().then((authClient) => {
            (async () => {
                const drive = google.drive({ version: 'v2', auth: authClient });
                // ID of the Google Drive folder the upload goes into; use your own.
                var folderId = '1Mv34Mdbpd5R4dUpYRu4y39';
                // Drive API v2 expects 'title' and parent references (v3 uses 'name').
                var fileMetadata = {
                    title: myrequestFile.name,
                    parents: [{ id: folderId }]
                };

                var media = {
                    mimeType: req.files.myrequestFile.mimetype,
                    // Stream the upload from the path the file was moved to above.
                    body: fsall.createReadStream(uploadPath)
                };
                drive.files.insert({
                    convert: true,
                    ocr: true,
                    resource: fileMetadata,
                    media: media,
                    fields: 'id'
                }, function(err, file) {
                    if (err) {
                        console.log(err);
                    } else {
                        const driveexp = google.drive({ version: "v2" });
                        driveexp.files.export({
                                auth: authClient,
                                fileId: file.data.id,
                                mimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
                            }, { responseType: "arraybuffer" },
                            // Named exportRes so it does not shadow the Express 'res'.
                            (err, exportRes) => {
                                if (err) {
                                    console.log(err);
                                } else {
                                    // 'fs' is the promise API, so use callback-style 'fsall',
                                    // and write the DOCX where mammoth will look for it below.
                                    fsall.writeFile(tempFileDir + file.data.id + ".docx", Buffer.from(exportRes.data), function(err) {
                                        if (err) {
                                            return console.log(err);
                                        }
                                    });
                                }
                            }
                        );
                        
                        // Give the export a few seconds to finish writing the DOCX
                        // to disk before extracting its text (a simple heuristic).
                        setTimeout(function() {
                            mammoth.extractRawText({ path: tempFileDir + file.data.id + ".docx" })
                            .then(function(result) {
                                var text = result.value; // The raw text
                                var textLines = text.split("\n");
                                let dataArray = [];

                                // Keep only the non-empty lines of extracted text.
                                for (var i = 0; i < textLines.length; i++) {
                                    if (textLines[i].length !== 0) {
                                        dataArray.push(textLines[i]);
                                    }
                                }
                                // Clean up the exported DOCX and the uploaded file locally.
                                deleteFile(tempFileDir + file.data.id + ".docx").then(x => console.log("res", x)).catch(err => console.log(err.message));
                                deleteFile(tempFileDir + myrequestFile.name).then(x => console.log("res", x)).catch(err => console.log(err.message));
                                res.status(200).send(JSON.stringify({ 'status': true, 'msg': 'File has been scanned successfully.', 'rawdata': dataArray }));
                            })
                            .done();

                        }, 10000);
                    }
                })
              })();
           
        });
       
    });
});

function deleteFile(file) {
    return new Promise((resolve, reject) => {
        // 'fs' above is the promise API; use the callback-style 'fsall' here.
        fsall.unlink(file, (err) => {
            if (err) return reject(err);
            resolve(`Deleted ${file}`);
        });
    });
}

var server = app.listen(8080, '127.0.0.1', function () {
    var host = server.address().address;
    var port = server.address().port;

    console.log('App listening at http://%s:%s', host, port);
});

Now run the command below; it will install all the required Node.js packages.

npm install

Finally, run the node app.js command shown below. The first time you run it, it will automatically open your default browser and redirect you to the Google account login screen; log in with the account for which you created or enabled API access. Once sign-in completes, you will be redirected to your configured redirect URL with the message "Authentication successful! Please return to the console." This process creates a file called token.json inside the ocr9 folder (an authorized_user record holding your client_id, client_secret, and refresh token, written by saveCredentials), so if you run the command again it will not ask you to log in, and you can run the program and scan documents. If you change the redirect URL or any other settings, though, you need to manually remove the token.json file and rerun the program.

When you send a request from Postman to http://127.0.0.1:8080/do_ocr, the file is first uploaded into the directory via the mv (move) call, then authorize() is called, then drive.files.insert, which converts the document and uploads it to Google Drive. The export function is called after that because we want the converted file in our local folder, so we have to export it from Google Drive. You can see the export mimeType is application/vnd.openxmlformats-officedocument.wordprocessingml.document; if you want a PDF instead, change it accordingly (see the sketch below). mammoth then runs inside the setTimeout callback, because we don't need the document itself, only the extracted text.
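
For example, to pull the converted file down as a PDF instead of a DOCX, only the export call changes. The sketch below is a variation on the app.js code, reusing its driveexp, authClient, file, fsall, and tempFileDir names:

// Variation on app.js: export the converted Drive document as a PDF.
driveexp.files.export({
    auth: authClient,
    fileId: file.data.id,
    mimeType: "application/pdf"
}, { responseType: "arraybuffer" },
(err, exportRes) => {
    if (err) return console.log(err);
    fsall.writeFile(tempFileDir + file.data.id + ".pdf", Buffer.from(exportRes.data), function(err) {
        if (err) console.log(err);
    });
});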

If you look in your Google Drive account, you will see the converted document: the first page contains the image and the second page contains the extracted text. You can also remove the document from Google Drive after successful extraction to reduce space, just as deleteFile is used here to reduce local space; a rough sketch of that cleanup follows.
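
As a rough sketch of that cleanup (not part of the original app.js), a Drive API v2 files.delete call could remove the converted document once the export has finished; drive and file.data.id are the objects already available inside the insert callback:

// Sketch: delete the converted document from Google Drive after exporting it.
drive.files.delete({ fileId: file.data.id }, function(err) {
    if (err) {
        console.log('Could not delete Drive file:', err);
    } else {
        console.log('Deleted Drive file', file.data.id);
    }
});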

node app.js

Open Postman, set the headers below, select form-data in the body, and add "myrequestFile" as a key of type File. Use http://127.0.0.1:8080/do_ocr to access the endpoint.

encType:multipart/related
api_key:NTVhNmFmNWMzZDZmNzg1MzY4NzNkZWE2NGY
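
If everything is configured correctly, a successful scan returns JSON shaped like the example below; the rawdata lines are only illustrative, since they depend on your document:

{
    "status": true,
    "msg": "File has been scanned successfully.",
    "rawdata": ["Registration Certificate", "Regn. No: DL01AB1234"]
}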