How do I extract text from a PDF or an image?

Tuesday, 10 December 2024

Extract Text from an Image or PDF Using Node.js

In this article, we'll demonstrate how to extract text from PDFs and images and save it to a Microsoft Word file. Optical Character Recognition (OCR) technology is used to recognize text within images or documents, and many mobile and web applications rely on it to scan documents and extract text: for instance, you might need to extract a vehicle's number from an image, or convert text from a PDF to Word and then store the details in a database. Various OCR tools and scripts are available for extracting text from images and PDFs and saving the results in a DOCX file. Here, we'll focus on the Google Drive API (v2) and guide you through using and configuring it to convert scanned PDFs and images to Word. We'll assume you already know how to set up npm and Node.js on your server or localhost.

If you're wondering how to extract text from a PDF or an image using Node.js in an MVC application like Laravel or CodeIgniter, here's a straightforward method:

Start by creating a route and a controller in your Laravel or CodeIgniter application. This controller will manage the task of interacting with the Node.js server. Inside the controller, utilize Laravel's HTTP client to send a request to the endpoint on the Node.js server.

The Node.js server will process the PDF or image, extract the text, and send it back as a JSON response. Once your application receives this response, it can process the extracted text, for example to identify a vehicle number, as sketched below.
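
For example, once the JSON response is parsed, a simple pattern match over the returned lines can pick out a vehicle number. The snippet below is only an illustration; the sample lines and the Indian-style plate pattern are assumptions, not part of the server code:

// Illustration: find an Indian-style vehicle number in extracted text lines.
const rawdata = ['Registration Certificate', 'Regn. No: DL01AB1234', 'Owner: ...']; // sample data
const plateRegex = /[A-Z]{2}\s?\d{1,2}\s?[A-Z]{1,3}\s?\d{4}/;
const match = rawdata.map(line => line.match(plateRegex)).find(Boolean);
console.log(match ? match[0] : 'No vehicle number found');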

To ensure this process works smoothly, you can test it using Postman. Postman will send your PDF or image document to the Node.js server. The server will scan the document, extract the text, and return it in JSON format.

By following these steps, you can efficiently extract text from images or PDFs using Node.js in your Laravel or CodeIgniter application.
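
Before wiring this into a Laravel or CodeIgniter controller, it can help to hit the endpoint from a small standalone script. Below is a minimal sketch, assuming Node 18+ (for the built-in fetch, FormData, and Blob globals); the URL, api_key header, and myrequestFile field name match the app.js server shown later in this article, while sample.pdf is just a placeholder filename:

// test-request.js - sends a PDF to the OCR endpoint and prints the JSON reply.
const fs = require('fs');

async function main() {
    const form = new FormData();
    // The field name must match what app.js reads from req.files.
    const pdf = new Blob([fs.readFileSync('sample.pdf')], { type: 'application/pdf' });
    form.append('myrequestFile', pdf, 'sample.pdf');

    const response = await fetch('http://127.0.0.1:8080/do_ocr', {
        method: 'POST',
        headers: { api_key: 'NTVhNmFmNWMzZDZmNzg1MzY4NzNkZWE2NGY' },
        body: form
    });
    console.log(await response.json());
}

main().catch(console.error);

A Laravel or CodeIgniter controller would send the same shape of request: a multipart body with a myrequestFile file field plus the api_key header.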

First, log in to your Google Cloud console account and create a new project. In the left-hand panel you will see the APIs & Services section; below it, click the Library link and enable the Google Drive API and the Google+ API. You can also search Google for "How to enable Google Drive API and get client credentials".

Click Create Credentials > OAuth client ID.
Click Application type > Desktop app.
In the Name field, type a name for the credential. This name is only shown in the Google Cloud console.
Click Create. The OAuth client created screen appears, showing your new Client ID and Client secret.
Click OK. The newly created credential appears under OAuth 2.0 Client IDs.
Save the downloaded JSON file as credentials.json, and move the file to your working directory.										
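
For a Desktop app credential, the downloaded credentials.json typically has the shape below (every value here is a placeholder, not a real credential); the app.js shown later reads the installed (or web) key from this file:

{
  "installed": {
    "client_id": "1234567890-abc123.apps.googleusercontent.com",
    "project_id": "your-project-id",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "client_secret": "your-client-secret",
    "redirect_uris": ["http://localhost"]
  }
}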

After completing the above steps, open your command prompt on Windows; if you are on Linux or Ubuntu, open PuTTY and log in there. I am using the latest version of WampServer on Windows, and all my projects live inside D:\wamp64\www\; on Ubuntu/Linux/WHM the root directory is the public_html folder. Go to the root folder of your server using the command prompt or PuTTY and follow the steps below to configure and run Node.js.

Create a directory and go inside it (the same applies over PuTTY if you are not working on localhost). Then create a package.json file: npm will ask you a few configuration questions, and at the end the file will be created. If you prefer, you can skip that and copy the package.json shown below directly into the ocr9 folder.

D:\wamp64\www> mkdir ocr9
D:\wamp64\www> cd ocr9

Now create a package.json file inside the ocr9 folder, copy and paste the content below into it, and save it in the ocr9 directory. This package.json lists all the packages needed for OCR.

{
  "name": "mine",
  "version": "1.0.0",
  "description": "gtest",
  "main": "app.js",
  "scripts": {
    "test": "test"
  },
  "keywords": [
    "test"
  ],
  "author": "rohit",
  "license": "ISC",
  "dependencies": {
    "@google-cloud/local-auth": "^2.1.0",
    "cookie-session": "^2.0.0",
    "cookies": "^0.8.0",
    "csv-parser": "^3.0.0",
    "express": "^4.18.2",
    "express-fileupload": "^1.4.0",
    "formidable": "^2.1.1",
    "fs-extra": "^11.1.0",
    "googleapis": "^105.0.0",
    "mammoth": "^1.5.1",
    "mv": "^2.1.1",
    "node-storage": "^0.0.9",
    "pdf-lib": "^1.17.1",
    "pdfkit": "^0.13.0",
    "uuid": "^9.0.0"
  }
}

Create a file named app.js and paste the following code into it. Requests are accepted at the /do_ocr endpoint, and the response is a JSON object. To access the API, you must send an api_key header whose value matches the data_key_en constant. When a request arrives (for example from Postman) and passes validation, the drive.files.insert function is called: your document is converted to DOCX format and stored in Google Drive, then exported and saved to your local directory. Since we only need the extracted text from PDF and image files, the deleteFile function removes both the uploaded document and the exported DOCX from the local directory after scanning. Please note that this code does not remove the converted document from Google Drive; you can write a function for that if you want to save space (a sketch is given later in this article).

If you comment out both deleteFile lines and then run the Node program, you will see both the uploaded file and the exported DOCX inside the ocr9 directory. If you open the converted DOCX, you will find the image on the first page and the extracted text on the second page.

#!/usr/bin/env node
const fs = require('fs').promises;  // promise-based fs API
const path = require('path');
const process = require('process');
const { authenticate } = require('@google-cloud/local-auth');
const { google } = require('googleapis');
const fsall = require('fs');        // callback/stream fs API
const mammoth = require('mammoth');

const express = require('express');
const fileUpload = require('express-fileupload');
const app = express();

const tempFileDir = __dirname + '/';
const SCOPES = ['https://www.googleapis.com/auth/drive'];
const TOKEN_PATH = path.join(process.cwd(), 'token.json');
const CREDENTIALS_PATH = path.join(process.cwd(), 'credentials.json');
// Shared secret that clients must send in the api_key request header.
const data_key_en = 'NTVhNmFmNWMzZDZmNzg1MzY4NzNkZWE2NGY';

/**
 * Reads previously authorized credentials from the saved file.
 *
 * @return {Promise<OAuth2Client|null>}
 */
async function loadSavedCredentialsIfExist() {
    try {
        const content = await fs.readFile(TOKEN_PATH);
        const credentials = JSON.parse(content);
        return google.auth.fromJSON(credentials);
    } catch (err) {
        return null;
    }
}

/**
 * Serializes credentials to a file compatible with GoogleAuth.fromJSON.
 *
 * @param {OAuth2Client} client
 * @return {Promise<void>}
 */
async function saveCredentials(client) {
    const content = await fs.readFile(CREDENTIALS_PATH);
    const keys = JSON.parse(content);
    const key = keys.installed || keys.web;
    const payload = JSON.stringify({
        type: 'authorized_user',
        client_id: key.client_id,
        client_secret: key.client_secret,
        refresh_token: client.credentials.refresh_token,
    });
    await fs.writeFile(TOKEN_PATH, payload);
}

/**
 * Load or request authorization to call APIs.
 *
 */
async function authorize() {
    let client = await loadSavedCredentialsIfExist();
    if (client) {
        return client;
    }
    client = await authenticate({
        scopes: SCOPES,
        keyfilePath: CREDENTIALS_PATH,
    });
    if (client.credentials) {
        await saveCredentials(client);
    }
    return client;
}

app.use(fileUpload({
    defCharset: 'utf8',
    defParamCharset: 'utf8',
    useTempFiles: true,
    tempFileDir: tempFileDir,
    //safeFileNames:true,
    limits: { fileSize: 5000000 },
}));

app.post('/', function(req, res) {
    res.setHeader('Content-Type', 'application/json');
    return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'You are welcome.' }));
});

app.post('/do_ocr', function(req, res) {

    let myrequestFile;
    let uploadPath;

    if (req.headers.api_key != data_key_en) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'Invalid api key.' }));
    }

    if (!req.files || Object.keys(req.files).length === 0) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'Request file can not be empty.' }));
    }

    if (!req.files.myrequestFile) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'myrequestFile field is required.' }));
    }

    // Check the size of the uploaded file itself, not the req.files container.
    if (req.files.myrequestFile.size > 5000000) {
        res.setHeader('Content-Type', 'application/json');
        return res.status(400).send(JSON.stringify({ 'status': false, 'msg': 'Size limit has been exceeded.' }));
    }

    myrequestFile = req.files.myrequestFile;
    uploadPath = tempFileDir + myrequestFile.name;
    myrequestFile.mv(uploadPath, function(err) {
        if (err) {
            return res.status(500).send(JSON.stringify({ 'status': false, 'msg': 'Unable to process try again !!!' }));
        }
        
        return authorize().then((authClient) => {
            (async () => {
                const drive = google.drive({ version: 'v2', auth: authClient });
                // ID of the Google Drive folder the upload goes into; use your own.
                var folderId = '1Mv34Mdbpd5R4dUpYRu4y39';
                // Drive API v2 expects 'title' and parent references (v3 uses 'name').
                var fileMetadata = {
                    title: myrequestFile.name,
                    parents: [{ id: folderId }]
                };

                var media = {
                    mimeType: req.files.myrequestFile.mimetype,
                    // Stream the upload from the path the file was moved to above.
                    body: fsall.createReadStream(uploadPath)
                };
                drive.files.insert({
                    convert: true,
                    ocr: true,
                    resource: fileMetadata,
                    media: media,
                    fields: 'id'
                }, function(err, file) {
                    if (err) {
                        console.log(err);
                    } else {
                        const driveexp = google.drive({ version: "v2" });
                        driveexp.files.export({
                                auth: authClient,
                                fileId: file.data.id,
                                mimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
                            }, { responseType: "arraybuffer" },
                            // Named exportRes so it does not shadow the Express 'res'.
                            (err, exportRes) => {
                                if (err) {
                                    console.log(err);
                                } else {
                                    // 'fs' is the promise API, so use callback-style 'fsall',
                                    // and write the DOCX where mammoth will look for it below.
                                    fsall.writeFile(tempFileDir + file.data.id + ".docx", Buffer.from(exportRes.data), function(err) {
                                        if (err) {
                                            return console.log(err);
                                        }
                                    });
                                }
                            }
                        );
                        
                        // Give the export a few seconds to finish writing the DOCX
                        // to disk before extracting its text (a simple heuristic).
                        setTimeout(function() {
                            mammoth.extractRawText({ path: tempFileDir + file.data.id + ".docx" })
                            .then(function(result) {
                                var text = result.value; // The raw text
                                var textLines = text.split("\n");
                                let dataArray = [];

                                // Keep only the non-empty lines of extracted text.
                                for (var i = 0; i < textLines.length; i++) {
                                    if (textLines[i].length !== 0) {
                                        dataArray.push(textLines[i]);
                                    }
                                }
                                // Clean up the exported DOCX and the uploaded file locally.
                                deleteFile(tempFileDir + file.data.id + ".docx").then(x => console.log("res", x)).catch(err => console.log(err.message));
                                deleteFile(tempFileDir + myrequestFile.name).then(x => console.log("res", x)).catch(err => console.log(err.message));
                                res.status(200).send(JSON.stringify({ 'status': true, 'msg': 'File has been scanned successfully.', 'rawdata': dataArray }));
                            })
                            .done();

                        }, 10000);
                    }
                })
              })();
           
        });
       
    });
});

function deleteFile(file) {
    return new Promise((resolve, reject) => {
        // 'fs' above is the promise API; use the callback-style 'fsall' here.
        fsall.unlink(file, (err) => {
            if (err) return reject(err);
            resolve(`Deleted ${file}`);
        });
    });
}

var server = app.listen(8080, '127.0.0.1', function () {
    var host = server.address().address;
    var port = server.address().port;

    console.log('App listening at http://%s:%s', host, port);
});

Now run the command below; it will install all the required Node.js packages.

npm install

Finally, run the node app.js command shown below. The first time you run it, it will automatically open your default browser and redirect you to the Google account login screen; log in with the account for which you created or enabled API access. Once sign-in completes, you will be redirected to your configured redirect URL with the message "Authentication successful! Please return to the console." This process creates a file called token.json inside the ocr9 folder (an authorized_user record holding your client_id, client_secret, and refresh token, written by saveCredentials), so if you run the command again it will not ask you to log in, and you can run the program and scan documents. If you change the redirect URL or any other settings, though, you need to manually remove the token.json file and rerun the program.

When you send a request from Postman to http://127.0.0.1:8080/do_ocr, the file is first uploaded into the directory via the mv (move) call, then authorize() is called, then drive.files.insert, which converts the document and uploads it to Google Drive. The export function is called after that because we want the converted file in our local folder, so we have to export it from Google Drive. You can see the export mimeType is application/vnd.openxmlformats-officedocument.wordprocessingml.document; if you want a PDF instead, change it accordingly (see the sketch below). mammoth then runs inside the setTimeout callback, because we don't need the document itself, only the extracted text.
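
For example, to pull the converted file down as a PDF instead of a DOCX, only the export call changes. The sketch below is a variation on the app.js code, reusing its driveexp, authClient, file, fsall, and tempFileDir names:

// Variation on app.js: export the converted Drive document as a PDF.
driveexp.files.export({
    auth: authClient,
    fileId: file.data.id,
    mimeType: "application/pdf"
}, { responseType: "arraybuffer" },
(err, exportRes) => {
    if (err) return console.log(err);
    fsall.writeFile(tempFileDir + file.data.id + ".pdf", Buffer.from(exportRes.data), function(err) {
        if (err) console.log(err);
    });
});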

If you look in your Google Drive account, you will see the converted document: the first page contains the image and the second page contains the extracted text. You can also remove the document from Google Drive after successful extraction to reduce space, just as deleteFile is used here to reduce local space; a rough sketch of that cleanup follows.
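
As a rough sketch of that cleanup (not part of the original app.js), a Drive API v2 files.delete call could remove the converted document once the export has finished; drive and file.data.id are the objects already available inside the insert callback:

// Sketch: delete the converted document from Google Drive after exporting it.
drive.files.delete({ fileId: file.data.id }, function(err) {
    if (err) {
        console.log('Could not delete Drive file:', err);
    } else {
        console.log('Deleted Drive file', file.data.id);
    }
});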

node app.js

Open Postman, set the headers below, select form-data in the body, and add "myrequestFile" as a key of type File. Use http://127.0.0.1:8080/do_ocr to access the endpoint.

encType:multipart/related
api_key:NTVhNmFmNWMzZDZmNzg1MzY4NzNkZWE2NGY
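
If everything is configured correctly, a successful scan returns JSON shaped like the example below; the rawdata lines are only illustrative, since they depend on your document:

{
    "status": true,
    "msg": "File has been scanned successfully.",
    "rawdata": ["Registration Certificate", "Regn. No: DL01AB1234"]
}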