Manipulating PDFs in Apps Script

If you’ve used my Convert any file with Apps Script library, you’ll know that it can take a wide range of files (currently 53 different mimeTypes, 42 types of imports, and 26 kinds of exports) and convert from one format to another. It uses the import and export functionality of the Drive API to find a route (sometimes involving multiple conversions) to, for example, turn a Microsoft Word file into a pdf.

However, it can only create image files from other image files. This has been a problem when using the free tier Gemini Flash API. At the time of writing, Gemini only supports media input in image format.

Since the Convert any file with Apps Script library can already convert most types to pdf, it seemed to me that a new library to convert pdfs to images would be handy. This article is about that. There’ll be future articles about applying the results to the Gemini API.

1 PDF to image conversions approach

2 Asynchronicity and Promises

3 Getting started

3.1 Splitting files into individual pages

3.2 Combining pdfs

3.3 Rearranging pdfs

4 Accessing all of pdflib methods

5 Bulk processing

6 Exports

7 Setup

7.1 Enabling Drive REST API

7.2 Updating Appsscript.json

7.3 Using API central with PdfFiddler

7.3.1 Splitting up a folder full of pdfs into single pages

7.4 Getting thumbnails

8 Next

9 Links

PDF to image conversions approach

I guess the main reason that Drive (and therefore my bmWeConvertAnyFile library) doesn’t support pdf to image conversions is that a pdf has multiple pages, so it’s more of a fan out to 1 image per page than a conversion.

However, once you’ve split the pdf into separate pages, you could use the Drive thumbnail link of each new single page pdf as an image and then we’d have a route from anything to image via a pdf, to single page pdfs, then to a file thumbnail.

There is an excellent open source Node library, pdf-lib, which allows you to do all sorts of things to pdf files. I’ve ported it to Apps Script, exposed it so you have access to all its functionality, and wrapped a few of the useful features specific to this conversion problem in an easy to use class.

Asynchronicity and Promises

As with most Node/JavaScript libraries, pdf-lib is largely asynchronous. Apps Script runs in a blocking mode, but it does more or less support Promises and the async/await syntax, so with a little bit of care we can still use all of the pdf-lib features. For how I handle this in Apps Script, see Promises and async class constructors in Apps Script.
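As a rough illustration of the pattern in plain JavaScript (nothing here is specific to bmPDF): an async function returns a Promise, await unwraps it inside another async function, and Promise.all waits for a batch. The fakeLoad function is a hypothetical stand-in for an asynchronous pdf-lib call.

```javascript
// hypothetical stand-in for an asynchronous library call such as loading a pdf
const fakeLoad = async (name) => ({ name, pages: 3 })

// await unwraps the promise inside an async function
const loadOne = async (name) => {
  const doc = await fakeLoad(name)
  return `${doc.name} has ${doc.pages} pages`
}

// Promise.all waits for a whole batch of promises to settle
const loadMany = (names) => Promise.all(names.map(loadOne))

// nothing runs until the function is actually called
loadMany(['a.pdf', 'b.pdf']).then(results => results.forEach(r => console.log(r)))
```

The examples in the rest of this article follow the same shape: each one is an async function that awaits the PdfFiddler calls in turn.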

Getting started

You’ll need the bmPDF library (1Aq5ReP4_Ghk4zLKnJvtyPO717UlBqT0YqyCnpoWonxrIiI6xMa8g0sf2) as a minimum. There are some sample pdfs you can use in this pdf folder.

Splitting files into individual pages

Here’s a hello world to take a pdf file and split it into multiple single page pdfs. Since this is initially the most common use case for this library, let’s start there.

const helloworldsplit = async () => {
  const id = "1-bOAhC0riDmlLFb7Kc_ypBJV6K8K8RZN"
  const blob = DriveApp.getFileById(id).getBlob()

  // get an instance
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs: blob, name: "helloworld-split"
  })

  // split them up into separate blobs
  const results = await pf.splitBlob()

  // now write these blobs to drive
  const files = results.splits.map(f => DriveApp.createFile(f.blob))

  files.forEach(f =>
    console.log('...created file', f.getName())
  )

}

hello world to split a pdf into single pages

And the result will be 3 new pdf files, 1 for each page.

11:01:31 AM	Info	...created file helloworld-split-0.pdf
11:01:31 AM	Info	...created file helloworld-split-1.pdf
11:01:31 AM	Info	...created file helloworld-split-2.pdf

result of splitting files
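For the curious, here is a minimal sketch of how a split like this could be done directly with pdf-lib. It is not the PdfFiddler source, just an assumption about the general technique: load the document, then copy each page into a new single page document. The splitPages name is hypothetical, and PDFDocument is passed in as an argument so it can be supplied from bmPDF.Exports.PDFLib.

```javascript
// sketch only: split pdf bytes into one set of pdf bytes per page
// PDFDocument is pdf-lib's class, for example bmPDF.Exports.PDFLib.PDFDocument
const splitPages = async (PDFDocument, bytes) => {
  const source = await PDFDocument.load(bytes)
  const singles = []
  for (const index of source.getPageIndices()) {
    // each output is a brand new document holding a copy of one page
    const single = await PDFDocument.create()
    const [page] = await single.copyPages(source, [index])
    single.addPage(page)
    singles.push(await single.save())
  }
  return singles
}
```

In Apps Script the input bytes would probably come from something like new Uint8Array(blob.getBytes()), but the wrapper takes care of that for you.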

Combining pdfs

You may have noticed that the blobs argument to PdfFiddler.build({blobs}) is plural. That’s because combining pdf blobs is implicitly built in. To combine pdfs, just specify an array of blobs.

const helloworldcombine = async () => {

  const ids = [
    "1v5kJ5SOY2nu3DI1LKwALb3seaBpF3kWu",
    "1rYYtwARFe9X9nZWxkaTpsFoI-zKouPus"
  ]
  const blobs = ids.map(id=>DriveApp.getFileById(id).getBlob())


  // get an instance - this will combine all the blobs provided
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs, 
    name: "helloworld-combine.pdf"
  })


  // now write the combined blob to drive
  const file = DriveApp.createFile(pf.blob)
  console.log('...created file', file.getName())
  
}

It’s that simple! Here’s the result

11:47:04 AM	Info	...created file helloworld-combine.pdf

combine pdf files
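Combining is the mirror image of splitting. A minimal sketch of how it could be done with pdf-lib directly follows; again this is an assumption about the technique rather than the PdfFiddler source, and combinePdfs is a hypothetical name.

```javascript
// sketch only: merge several pdfs (each given as bytes) into one pdf
// PDFDocument is pdf-lib's class, for example bmPDF.Exports.PDFLib.PDFDocument
const combinePdfs = async (PDFDocument, bytesList) => {
  const merged = await PDFDocument.create()
  for (const bytes of bytesList) {
    const source = await PDFDocument.load(bytes)
    // copy every page from this source, preserving page order
    const pages = await merged.copyPages(source, source.getPageIndices())
    pages.forEach(page => merged.addPage(page))
  }
  // save resolves to the bytes of the merged pdf
  return merged.save()
}
```

The inputs are processed one at a time so that the pages end up in the same order as the array of inputs.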

Rearranging pdfs

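Let’s say you want to combine pdfs, then split the result into 2 separate pdfs: odd and even pages. Note that the filters below use the zero-based array index, so ‘evens’ starts with the first page.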

const helloworldrearrange = async () => {

  const ids = [
    "1v5kJ5SOY2nu3DI1LKwALb3seaBpF3kWu",
    "1rYYtwARFe9X9nZWxkaTpsFoI-zKouPus"
  ]
  const blobs = ids.map(id=>DriveApp.getFileById(id).getBlob())

  // first combine them
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs, 
    name: "helloworld-combine.pdf"
  })

  // now we can split them into single pages
  const {splits} = await pf.splitBlob()

  // and combine the ones we want 
  const odds = await  bmPDF.Exports.PdfFiddler.build({
    blobs: splits.filter((_,i)=> i % 2).map(f=>f.blob), 
    name: "helloworld-odds.pdf"
  })

  const evens = await  bmPDF.Exports.PdfFiddler.build({
    blobs: splits.filter((_,i)=> !(i % 2)).map(f=>f.blob),  
    name: "helloworld-evens.pdf"
  })


  // now write these blobs to drive
  const evenFile = DriveApp.createFile(evens.blob)
  console.log('...created file', evenFile.getName())
  const oddFile = DriveApp.createFile(odds.blob)
  console.log('...created file', oddFile.getName())
  
}

combine then rearrange

And here’s the result of this one

12:00:15 PM	Info	...created file helloworld-evens.pdf
12:00:16 PM	Info	...created file helloworld-odds.pdf

rearrange result
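Because the filters work on zero-based array positions, it’s worth seeing exactly which pages land where. This small sketch uses page labels in place of blobs; partitionByParity is a hypothetical helper, not part of bmPDF.

```javascript
// split an array into the items at even and odd zero-based positions
const partitionByParity = (items) => ({
  evens: items.filter((_, i) => i % 2 === 0),
  odds: items.filter((_, i) => i % 2 === 1)
})

// page labels stand in for the single page blobs
const { evens, odds } = partitionByParity(['page1', 'page2', 'page3', 'page4', 'page5'])
console.log(evens) // [ 'page1', 'page3', 'page5' ]
console.log(odds)  // [ 'page2', 'page4' ]
```

If you’d rather think in 1-based page numbers, swap the two filters.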

Accessing all of pdflib methods

So far we’ve looked at some of the simplified methods built in to PdfFiddler, but all of the pdf-lib document methods and properties are accessible via the pdfFiddler.doc property, and all of pdf-lib is accessible via pdfFiddler.PDFLib. Here’s an example of loading a pdf, setting a few properties, and reading them back from a new copy.

const helloworldvarious= async () => {
  const id = "1-bOAhC0riDmlLFb7Kc_ypBJV6K8K8RZN"
  const blob = DriveApp.getFileById(id).getBlob()

  // get an instance
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs: blob, 
    name: "helloworld-various"
  })

  // the pdf file is in the doc property
  pf.doc.setAuthor ("Bruce Mcpherson")
  pf.doc.setTitle ("Testing various pdf properties")

  // we've changed the document, so get a new blob 
  const newBlob = await pf.getBlob()


  // build a new instance from the new blob and check the properties were saved
  const withprops = await bmPDF.Exports.PdfFiddler.build({
    blobs: newBlob,
    name: "helloworld-various-withprops.pdf"
  })

  console.log ('author:', withprops.doc.getAuthor())
  console.log ('title:', withprops.doc.getTitle())
}

the result

12:54:50 PM	Info	author: Bruce Mcpherson
12:54:50 PM	Info	title: Testing various pdf properties

result of setting some document properties
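The same doc property gives you the read side of pdf-lib too. Here’s a small sketch that summarises a document using the pdf-lib methods getTitle, getAuthor, getPageCount, getPages and getSize. The describeDoc name is hypothetical, and it assumes the object passed in is a pdf-lib PDFDocument such as pf.doc.

```javascript
// sketch: summarise a pdf-lib document, for example describeDoc(pf.doc)
const describeDoc = (doc) => ({
  title: doc.getTitle(),
  author: doc.getAuthor(),
  pageCount: doc.getPageCount(),
  // getSize returns the width and height of a page in points
  sizes: doc.getPages().map(page => page.getSize())
})
```

Title and author will be undefined if the pdf doesn’t have those properties set.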

Bulk processing

So far we’ve used DriveApp to read and write Drive files. I find it more convenient to use the Drv class from my bmApiCentral library, as it supports Linux style paths, has built-in caching, and uses the Drive REST API directly rather than Apps Script services. We’ll also need it later to extract thumbnails from Drive files.

You don’t have to use that of course, but all the examples following assume you do.

Exports

All my libraries and scripts nowadays feature an Exports object which not only abstracts library sources but also fixes some of the loading order problems you get with Apps Script, and has a built-in property access checker. For more info on this approach see Fix Apps Script file order problems with Exports.

The following examples assume you have this Exports file in your script. All the code for these examples is available at testBmPdf, or you can just copy it in from here.

var Exports = {

  /**
   * @param {object} p params
   * @param {function} tokenService how to get a token
   * @param {function} fetch how to fetch
   */
  Init(...args) {
    return this.Deps.init(...args)
  },

  get ApiCentral() {
    return bmApiCentral.Exports
  },

  get Deps() {
    return this.ApiCentral.Deps
  },

  get pdfLibImport () {
    return bmPDF.Exports
  },

  get PdfFiddler() {
    return this.pdfLibImport.PdfFiddler
  },

  get PDFLib() {
    return this.pdfLibImport.PDFLib
  },

  get PDFDocument() {
    return this.PDFLib.PDFDocument
  },

  // alias for ApiCentral
  get libExports() {
    return this.ApiCentral
  },

  /**
   * Drv instance with validation
   * @param {...*} args
   * @return {Drv} a proxied instance of Drv with property checking enabled
   */
  newDrv(...args) {
    return this.ApiCentral.newDrv(...args)
  },

  /**
   * Utils namespace
   * @return {Utils} 
   */
  get Utils() {
    return this.ApiCentral.Utils
  },

  // used to trap access to unknown properties
  guard(target) {
    return new Proxy(target, this.validateProperties)
  },

  /**
   * for validating attempts to access non existent properties
   */
  get validateProperties() {
    return {
      get(target, prop, receiver) {
        // typeof and console use the inspect prop
        if (
          typeof prop !== 'symbol' &&
          prop !== 'inspect' &&
          !Reflect.has(target, prop)
        ) throw `guard detected attempt to get non-existent property ${prop}`

        return Reflect.get(target, prop, receiver)
      },

      set(target, prop, value, receiver) {
        if (!Reflect.has(target, prop)) throw `guard attempt to set non-existent property ${prop}`
        return Reflect.set(target, prop, value, receiver)
      }
    }
  }

}

exports
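To see what the property checker buys you, here’s a cut down, standalone version of the guard idea. It’s a sketch of the same Proxy technique rather than a copy of the Exports code above, and the settings object is just made up for illustration.

```javascript
// a Proxy handler that throws when code reads a property that doesn't exist
const guardHandler = {
  get(target, prop, receiver) {
    if (typeof prop !== 'symbol' && prop !== 'inspect' && !Reflect.has(target, prop)) {
      throw new Error(`attempt to get non-existent property ${prop}`)
    }
    return Reflect.get(target, prop, receiver)
  }
}
const guard = (target) => new Proxy(target, guardHandler)

// hypothetical settings object used only for this demonstration
const settings = guard({ inputPath: '/public/samplepdfs' })
console.log(settings.inputPath) // /public/samplepdfs

try {
  // a typo in the property name fails loudly instead of quietly returning undefined
  console.log(settings.inputpath)
} catch (err) {
  console.log(err.message) // attempt to get non-existent property inputpath
}
```

Without the guard, the misspelled property would just be undefined and the error would surface much later, somewhere less obvious.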

Setup

Since we’ll now be using the Drive REST API, you’ll need to enable it and update your appsscript.json with some OAuth scopes.

Enabling Drive REST API

The simplest way to do this is just to enable the Drive Advanced service in your editor. This will automatically enable the Drive API (since the advanced service uses that anyway). Alternatively, if you are using a standard cloud project (instead of the default Apps Script managed one), you can head over to the cloud console and enable it there.

Updating Appsscript.json

If appsscript.json is not visible in your editor, go to your project settings and click the checkbox to show it, then update it with these scopes:

  "oauthScopes": [
    "https://www.googleapis.com/auth/script.external_request",
    "https://www.googleapis.com/auth/drive"
  ]

Using API central with PdfFiddler

bmApiCentral is dependency free, so you first need to pass it a couple of functions from your script that it can use. Then you can get an instance of Drv.

  Exports.Deps.init({
    tokenService: ScriptApp.getOAuthToken,
    fetch: UrlFetchApp.fetch
  })

  // this is an enhanced drv client with built in caching and support for unix style paths on drive
  const drv = Exports.newDrv()

Splitting up a folder full of pdfs into single pages

First copy the sample files to a folder of your choice on your Drive, and update the code below with the paths to your input and output folders.

  const inputPath = '/public/samplepdfs'
  const outputPath = '/public/splitpdfs'

Next, get the input files and the output folder (creating it if necessary), and download the file content. We can use the Drive query language to narrow down the files in the input folder. In this case we’re selecting all files that have a mimeType of pdf.


  // limit list to just pdf files
  // you could add additional drive queries here - for example "and name = 'my.pdf'"
  const mime = "application/pdf"
  const query = `mimeType = '${mime}'`

  // get all the pdf files in the given directory
  const inputs = drv.getFilesInFolder({ path: inputPath, query })

  // we'll just put the split files in a subfolder
  // if it's not there we'll create it
  const outputFolder = drv.getFolder({
    path: outputPath,
    createIfMissing: true
  })

  // do the whole thing - these will be the blobs for each of the input files
  const inputFiles = inputs.data.files.map(file => drv.download(file))

input/output
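If you find yourself adding more conditions to the query, a tiny helper keeps the quoting straight. This is a sketch with hypothetical names (escapeValue, buildQuery) that only handles simple equality tests joined with and. It escapes backslashes and single quotes with a backslash, which is how I understand the Drive query syntax to work.

```javascript
// escape backslashes first, then single quotes, for use inside a quoted query value
const escapeValue = (value) => String(value).replace(/\\/g, '\\\\').replace(/'/g, "\\'")

// turn { field: value } pairs into "field = 'value' and field = 'value'"
const buildQuery = (conditions) =>
  Object.entries(conditions)
    .map(([field, value]) => `${field} = '${escapeValue(value)}'`)
    .join(' and ')

console.log(buildQuery({ mimeType: 'application/pdf', name: "bruce's.pdf" }))
// mimeType = 'application/pdf' and name = 'bruce\'s.pdf'
```

The resulting string can be used as the query value passed to drv.getFilesInFolder in the same way as the hand written one above.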

Next we convert each of the blobs to PdfFiddlers, and split them into single pages

  // the pdfs they create will be quasi async
  const inputPdfs = await Promise.all(inputFiles.map(f => Exports.PdfFiddler.build({ blobs: f.blob, name: f.data.name})))

  // now split each of those into single page pdfs
  const singlePages = await Promise.all(inputPdfs.map(f => f.splitBlob()))

make single pages

Finally upload the single pages to the output folder

  // now upload everything
  const outputs = singlePages.map(input => {
    const uploads = input.splits.map(f => drv.upload({
      blob: f.blob,
      parentId: outputFolder.data.id
    }))
    return {
      input,
      uploads
    }
  })

  console.log('Finished:uploaded these files')
  outputs.forEach(f => {
    console.log('...from', f.input.source.name)
    f.uploads.forEach(s => console.log('.... ', s.data.name, 'size', s.data.size))
  })

upload single pages

The result log

2:58:43 PM	Info	Finished:uploaded these files
2:58:43 PM	Info	...from flyer.pdf
2:58:43 PM	Info	....  flyer.pdf-0.pdf size 306456
2:58:43 PM	Info	...from example.pdf
2:58:43 PM	Info	....  example.pdf-0.pdf size 58894
2:58:43 PM	Info	....  example.pdf-1.pdf size 210226
2:58:43 PM	Info	....  example.pdf-2.pdf size 32597
2:58:43 PM	Info	...from drylab.pdf
2:58:44 PM	Info	....  drylab.pdf-0.pdf size 666278
2:58:44 PM	Info	....  drylab.pdf-1.pdf size 112603
2:58:44 PM	Info	....  drylab.pdf-2.pdf size 639816
2:58:44 PM	Info	...from somatosensory.pdf
2:58:44 PM	Info	....  somatosensory.pdf-0.pdf size 92685
2:58:44 PM	Info	....  somatosensory.pdf-1.pdf size 73875
2:58:44 PM	Info	....  somatosensory.pdf-2.pdf size 51826
2:58:44 PM	Info	....  somatosensory.pdf-3.pdf size 70866

result

Getting thumbnails

Now that we have a script to convert a collection of pdfs into a collection of single page pdfs, we can complete the final step and make images from each page.


  // now get all the thumbnails
  const imagePath = '/public/splitimages'
  // if it's not there we'll create it
  const imageFolder = drv.getFolder({
    path: imagePath,
    createIfMissing: true
  })
  const files = outputs
    .map(f=>f.uploads)
    .flat(Infinity)
    .map(f=>drv.get({id: f.data.id}, [{fields: 'thumbnailLink'}]))
  
  const images = UrlFetchApp
    .fetchAll(files.map(file=>({
      url: file.data.thumbnailLink
    })))
    .map ((f,i)=>f.getBlob().setName(`${files[i].data.name}.png`))
    .map (blob=>drv.upload ({
      blob,
      parentId: imageFolder.data.id
    }))
  
  console.log ('Finished:created this images in', imageFolder.data.name)
  images.forEach(f=>console.log('...',f.data.name))

get all the thumbnails and write them to drive

Result of creating images

3:43:33 PM	Info	Finished:created this images in splitimages
3:43:33 PM	Info	... flyer.pdf-0.pdf.png
3:43:33 PM	Info	... example.pdf-0.pdf.png
3:43:33 PM	Info	... example.pdf-1.pdf.png
3:43:33 PM	Info	... example.pdf-2.pdf.png
3:43:33 PM	Info	... drylab.pdf-0.pdf.png
3:43:33 PM	Info	... drylab.pdf-1.pdf.png
3:43:33 PM	Info	... drylab.pdf-2.pdf.png
3:43:33 PM	Info	... somatosensory.pdf-0.pdf.png
3:43:33 PM	Info	... somatosensory.pdf-1.pdf.png
3:43:33 PM	Info	... somatosensory.pdf-2.pdf.png
3:43:33 PM	Info	... somatosensory.pdf-3.pdf.png

image uploads
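Drive thumbnails are fairly small by default. Thumbnail links are commonly reported to end with a size suffix such as =s220, and changing that number is said to return a larger image. That behaviour is an assumption here rather than something taken from the Drive documentation, so check it against your own links before relying on it. The withThumbnailSize helper is hypothetical.

```javascript
// sketch: request a different thumbnail size by rewriting the trailing size suffix
// assumes the link ends with something like "=s220"; any other link is returned unchanged
const withThumbnailSize = (thumbnailLink, size) =>
  thumbnailLink.replace(/=s\d+$/, `=s${size}`)

console.log(withThumbnailSize('https://example.com/thumb=s220', 1000))
// https://example.com/thumb=s1000
```

If that holds for your files, you could apply it to each file.data.thumbnailLink before the fetchAll step above.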

That’s all for now. You can find all the code for these examples at testBmPdf. In the next article on this topic we’ll look at how to use Convert any file with Apps Script in conjunction with PdfFiddler to create an image from any kind of input.