If you’ve used my Convert any file with Apps Script you’ll know that it can take a wide range of files (currently 53 different mimeTypes, 42 types of imports, and 26 kinds of exports) and convert from one format to another. It uses the import and export functionality of the Drive API to find a route (sometimes involving multiple conversions) to, for example, turn a Microsoft Word file into a pdf.
However, it can only make image files from files that are already in some other image format. This has been a problem when using the free tier of the Gemini Flash API. At the time of writing, Gemini only supports media input in image format.
Since the Convert any file with Apps Script library can already convert most types to pdf, it seemed to me that a new library to convert pdfs to images would be handy. This article is about that. There’ll be future articles about applying the results to the Gemini API.
PDF to image conversion approach
I guess the main reason that Drive (and therefore my bmWeConvertAnyFile library) doesn’t support pdf to image conversions is that a pdf has multiple pages, so it’s more of a fan out to 1 image per page than a conversion.
However, once you’ve split the pdf into separate pages, you could use the Drive thumbnail link of each new single page pdf as an image and then we’d have a route from anything to image via a pdf, to single page pdfs, then to a file thumbnail.
There is an excellent node library, pdf-lib (an open source project created by Andrew Dillon), which allows you to do all sorts of things to pdf files. I’ve ported it to Apps Script, exposed it so you can have access to all its functionality, and wrapped a few of the useful features specific to this conversion problem in an easy to use class.
Asynchronicity and Promises
As with most Node/JavaScript libraries, pdf-lib is largely asynchronous. Apps Script runs in a blocking mode, but it does more or less support Promises and async/await syntax, so with a little bit of care we can still use all of the pdf-lib features. For how I handle this in Apps Script see Promises and async class constructors in Apps Script.
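Since a constructor can’t be async, one common way to wrap asynchronous setup in a class is a static factory method, which is the shape of the PdfFiddler.build() calls used throughout this article. The following is only a minimal sketch of that pattern, not the PdfFiddler source: Loader, fakeLoad and the pages field are hypothetical names invented for the illustration.

```javascript
// A minimal sketch of the async factory pattern (all names are hypothetical).
// fakeLoad stands in for any promise-returning call, such as loading a document.
const fakeLoad = (name) => Promise.resolve({ name, pages: 3 })

class Loader {
  // the constructor stays synchronous and just stores what it is given
  constructor(doc) {
    this.doc = doc
  }
  // the asynchronous work happens here, before the instance is created
  static async build({ name }) {
    const doc = await fakeLoad(name)
    return new Loader(doc)
  }
}

// an async function always returns a Promise, so callers need to await it
const demo = async () => {
  const loader = await Loader.build({ name: 'example' })
  return loader.doc.pages
}
```

The caller never uses new directly, so there is no way to get hold of a half initialized instance.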
Getting started
You’ll need the bmPDF library (1Aq5ReP4_Ghk4zLKnJvtyPO717UlBqT0YqyCnpoWonxrIiI6xMa8g0sf2) as a minimum. There are some sample pdfs you can use in this pdf folder
Splitting files into individual pages
Here’s a hello world to take a pdf file and split it into multiple single page pdfs. Since this is initially the most common use case for this library, let’s start there.
const helloworldsplit = async () => {
  const id = "1-bOAhC0riDmlLFb7Kc_ypBJV6K8K8RZN"
  const blob = DriveApp.getFileById(id).getBlob()
  // get an instance
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs: blob, name: "helloworld-split"
  })
  // split them up into separate blobs
  const results = await pf.splitBlob()
  // now write these blobs to drive
  const files = results.splits.map(f => DriveApp.createFile(f.blob))
  files.forEach(f =>
    console.log('...created file', f.getName())
  )
}
hello world to split a pdf into single pages
And the result will be 3 new pdf blobs – 1 for each page
11:01:31 AM Info ...created file helloworld-split-0.pdf
11:01:31 AM Info ...created file helloworld-split-1.pdf
11:01:31 AM Info ...created file helloworld-split-2.pdf
result of splitting files
Combining pdfs
You may have noticed that one of the arguments to PdfFiddler.build({blobs}) is the plural ‘blobs’. That’s because combining pdf blobs is implicitly built in. To combine pdfs, just specify an array of blobs.
const helloworldcombine = async () => {
  const ids = [
    "1v5kJ5SOY2nu3DI1LKwALb3seaBpF3kWu",
    "1rYYtwARFe9X9nZWxkaTpsFoI-zKouPus"
  ]
  const blobs = ids.map(id => DriveApp.getFileById(id).getBlob())
  // get an instance - this will combine all the blobs provided
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs,
    name: "helloworld-combine.pdf"
  })
  // now write these blobs to drive
  const file = DriveApp.createFile(pf.blob)
  console.log('...created file', file.getName())
}
It’s that simple! Here’s the result
11:47:04 AM Info ...created file helloworld-combine.pdf
combine pdf files
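The two examples so far pass either a single blob or an array of blobs in the same blobs argument. One simple way for a function to accept both forms is to normalize the argument to an array on the way in. This is just a sketch of that idea; toArray is a hypothetical helper and I’m not claiming this is how the library handles it internally.

```javascript
// Hypothetical helper: accept either a single item or an array of items
const toArray = (blobs) => (Array.isArray(blobs) ? blobs : [blobs])

// plain strings stand in for Drive blobs here
const single = toArray('a.pdf')
const many = toArray(['a.pdf', 'b.pdf'])

console.log(single.length, many.length) // 1 2
```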
Rearranging pdfs
Let’s say you want to combine pdfs, then split them into 2 separate pdfs, one with the odd pages and one with the even pages.
const helloworldrearrange = async () => {
  const ids = [
    "1v5kJ5SOY2nu3DI1LKwALb3seaBpF3kWu",
    "1rYYtwARFe9X9nZWxkaTpsFoI-zKouPus"
  ]
  const blobs = ids.map(id => DriveApp.getFileById(id).getBlob())
  // first combine them
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs,
    name: "helloworld-combine.pdf"
  })
  // now we can split them into single pages
  const { splits } = await pf.splitBlob()
  // and combine the ones we want
  const odds = await bmPDF.Exports.PdfFiddler.build({
    blobs: splits.filter((_, i) => i % 2).map(f => f.blob),
    name: "helloworld-odds.pdf"
  })
  const evens = await bmPDF.Exports.PdfFiddler.build({
    blobs: splits.filter((_, i) => !(i % 2)).map(f => f.blob),
    name: "helloworld-evens.pdf"
  })
  // now write these blobs to drive
  const evenFile = DriveApp.createFile(evens.blob)
  console.log('...created file', evenFile.getName())
  const oddFile = DriveApp.createFile(odds.blob)
  console.log('...created file', oddFile.getName())
}
combine then rearrange
And here’s the result of this one
12:00:15 PM Info ...created file helloworld-evens.pdf
12:00:16 PM Info ...created file helloworld-odds.pdf
rearrange result
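One thing to watch in the rearrange example is that the filters work on zero-based array indices, so ‘evens’ starts with the first page of the combined pdf and ‘odds’ starts with the second. Here’s a small sketch, with strings standing in for the single page blobs, that shows which pages land where.

```javascript
// strings stand in for the single page blobs returned by splitBlob()
const splits = ['page-0', 'page-1', 'page-2', 'page-3', 'page-4']

// i % 2 is 1 (truthy) for indices 1, 3, ... so these are the odd indices
const odds = splits.filter((_, i) => i % 2)
// !(i % 2) is true for indices 0, 2, 4, ...
const evens = splits.filter((_, i) => !(i % 2))

console.log(odds)  // [ 'page-1', 'page-3' ]
console.log(evens) // [ 'page-0', 'page-2', 'page-4' ]
```

If you think of pages as numbered from 1, swap the two filters.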
Accessing all of the pdf-lib methods
So far we’ve looked at some of the simplified methods built in to PdfFiddler, but all of the pdf-lib document methods and properties are accessible via the pdfFiddler.doc property, and all of pdf-lib is accessible via pdfFiddler.PDFLib. Here’s an example of making a copy of a pdf and setting a few properties.
const helloworldvarious = async () => {
  const id = "1-bOAhC0riDmlLFb7Kc_ypBJV6K8K8RZN"
  const blob = DriveApp.getFileById(id).getBlob()
  // get an instance
  const pf = await bmPDF.Exports.PdfFiddler.build({
    blobs: blob,
    name: "helloworld-various"
  })
  // the pdf file is in the doc property
  pf.doc.setAuthor("Bruce Mcpherson")
  pf.doc.setTitle("Testing various pdf properties")
  // we've changed the document, so get a new blob
  const newBlob = await pf.getBlob()
  // load the new blob into another instance and check the properties were set
  const withprops = await bmPDF.Exports.PdfFiddler.build({
    blobs: newBlob,
    name: "helloworld-various-withprops.pdf"
  })
  console.log('author:', withprops.doc.getAuthor())
  console.log('title:', withprops.doc.getTitle())
}
the result
12:54:50 PM Info author: Bruce Mcpherson
12:54:50 PM Info title: Testing various pdf properties
result of setting some document properties
Bulk processing
So far we’ve used DriveApp to read files from Drive and write them back. I find it more convenient to use my bmApiCentral Drv API, as it supports linux style paths, has built in caching and uses the Drive REST API directly rather than Apps Script services. We’ll also need it later to extract thumbnails from Drive files.
You don’t have to use that of course, but all the examples following assume you do.
Exports
All my libraries and scripts nowadays feature an Exports object which not only abstracts library sources but also fixes some of the loading order problems you get with Apps Script, and has a built in property access checker. For more info on this approach see Fix Apps Script file order problems with Exports.
The following examples assume you have this Exports file in your script. All the code for these examples is available at testBmPdf, or you can just copy it in from here.
var Exports = {
  /**
   * @param {object} p params
   * @param {function} tokenService how to get a token
   * @param {function} fetch how to fetch
   */
  Init(...args) {
    return this.Deps.init(...args)
  },
  get ApiCentral() {
    return bmApiCentral.Exports
  },
  get Deps() {
    return this.ApiCentral.Deps
  },
  get pdfLibImport() {
    return bmPDF.Exports
  },
  get PdfFiddler() {
    return this.pdfLibImport.PdfFiddler
  },
  get PDFLib() {
    return this.pdfLibImport.PDFLib
  },
  get PDFDocument() {
    return this.PDFLib.PDFDocument
  },
  get libExports() {
    return bmApiCentral.Exports
  },
  /**
   * Drv instance with validation
   * @param {...*} args
   * @return {Drv} a proxied instance of Drv with property checking enabled
   */
  newDrv(...args) {
    return this.ApiCentral.newDrv(...args)
  },
  /**
   * Utils namespace
   * @return {Utils}
   */
  get Utils() {
    return this.ApiCentral.Utils
  },
  // used to trap access to unknown properties
  guard(target) {
    return new Proxy(target, this.validateProperties)
  },
  /**
   * for validating attempts to access non existent properties
   */
  get validateProperties() {
    return {
      get(target, prop, receiver) {
        // typeof and console use the inspect prop
        if (
          typeof prop !== 'symbol' &&
          prop !== 'inspect' &&
          !Reflect.has(target, prop)
        ) throw `guard detected attempt to get non-existent property ${prop}`
        return Reflect.get(target, prop, receiver)
      },
      set(target, prop, value, receiver) {
        if (!Reflect.has(target, prop)) throw `guard attempt to set non-existent property ${prop}`
        return Reflect.set(target, prop, value, receiver)
      }
    }
  }
}
exports
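To show what the property access checker buys you, here’s a cut down, standalone version of the guard idea from the Exports object above. The handler is reduced to the getter and throws an Error rather than a string; settings is a hypothetical object used only for the demonstration.

```javascript
// the same idea as Exports.validateProperties, reduced to the getter
const handler = {
  get(target, prop, receiver) {
    if (typeof prop !== 'symbol' && prop !== 'inspect' && !Reflect.has(target, prop)) {
      throw new Error(`attempt to get non-existent property ${String(prop)}`)
    }
    return Reflect.get(target, prop, receiver)
  }
}
const guard = (target) => new Proxy(target, handler)

// settings is a hypothetical object used only for this demonstration
const settings = guard({ inputPath: '/public/samplepdfs' })
console.log(settings.inputPath) // /public/samplepdfs

let message = ''
try {
  // wrong case, so the guard throws instead of quietly returning undefined
  settings.inputpath
} catch (err) {
  message = err.message
}
console.log(message) // attempt to get non-existent property inputpath
```

Without the guard, the misspelled property would just be undefined and the mistake would surface somewhere much harder to trace.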
Setup
Since we’ll now be using the Drive REST API, you’ll need to enable it and update your appsscript.json with some OAuth scopes.
Enabling Drive REST API
The simplest way to do this is just to enable the Drive Advanced service in your editor. This will automatically enable the Drive API (since the advanced service uses that anyway). Alternatively, if you are using a standard cloud project (instead of the default Apps Script managed one), you can head over to the cloud console and enable it there.
Updating appsscript.json
If it’s not visible in your editor, go to your project settings and click the checkbox to show it, then update it with these scopes.
"oauthScopes": [
  "https://www.googleapis.com/auth/script.external_request",
  "https://www.googleapis.com/auth/drive"
]
Using API central with PdfFiddler
bmApiCentral is dependency free, so you’ll first need to pass it a couple of functions from your script that it can use. Then you can get an instance of Drv to work with.
Exports.Deps.init({
  tokenService: ScriptApp.getOAuthToken,
  fetch: UrlFetchApp.fetch
})
// this is an enhanced drv client with built in caching and support for unix style paths on drive
const drv = Exports.newDrv()
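If the idea of handing functions over to a library is new to you, here’s a rough sketch of how a dependency free client could use an injected token service and fetch function. It’s an illustration of the technique only and not the bmApiCentral source: makeClient, its get method and the fake functions are all hypothetical.

```javascript
// Hypothetical dependency free client (illustrative only, not bmApiCentral's code)
const makeClient = () => {
  let deps = null
  return {
    // the host script hands over the functions the library doesn't call directly
    init({ tokenService, fetch }) {
      deps = { tokenService, fetch }
    },
    // every request gets its token and its transport from the injected functions
    get(url) {
      if (!deps) throw new Error('call init() first')
      return deps.fetch(url, {
        headers: { Authorization: `Bearer ${deps.tokenService()}` }
      })
    }
  }
}

// outside Apps Script we can inject fakes, which also makes testing easy
const client = makeClient()
client.init({
  tokenService: () => 'fake-token',
  fetch: (url, options) => ({ url, auth: options.headers.Authorization })
})
const response = client.get('https://example.com/files')
console.log(response.auth) // Bearer fake-token
```

In a real script the two injected functions would be ScriptApp.getOAuthToken and UrlFetchApp.fetch, as in the init call above.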
Splitting up a folder full of pdfs into single pages
First copy the sample files to a folder of your choice on your Drive, and update the code below with paths to your input and output folders
const inputPath = '/public/samplepdfs'
const outputPath = '/public/splitpdfs'
Next, get the input files and output folder – creating it as necessary and downloading the file content. We can use Drive query language to narrow down the files in the input folder. In this case we’re selecting all files that have a mimeType of pdf.
// limit list to just pdf files
// you could add additional drive queries here - for example "and name = 'my.pdf'"
const mime = "application/pdf"
const query = `mimeType = '${mime}'`
// get all the pdf files in the given directory
const inputs = drv.getFilesInFolder({ path: inputPath, query })
// we'll just put the split files in a subfolder
// if it's not there we'll create it
const outputFolder = drv.getFolder({
  path: outputPath,
  createIfMissing: true
})
// download the content of each input file - these will be the blobs for each of the input files
const inputFiles = inputs.data.files.map(file => drv.download(file))
input/output
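If you want to narrow the list further, as the comment in the code above suggests, you can join extra clauses with ‘and’. Here’s a small sketch of a query builder that does that; buildQuery and quote are hypothetical helpers of my own, and the only escaping handled is the single quote, which the Drive query language expects to be written as \'.

```javascript
// escape single quotes the way the Drive query language expects (\')
const quote = (value) => `'${String(value).replace(/'/g, "\\'")}'`

// hypothetical helper: turn { field: value } pairs into "field = 'value'" clauses joined by "and"
const buildQuery = (clauses) =>
  Object.entries(clauses)
    .map(([field, value]) => `${field} = ${quote(value)}`)
    .join(' and ')

const query = buildQuery({ mimeType: 'application/pdf', name: "bruce's.pdf" })
console.log(query)
// mimeType = 'application/pdf' and name = 'bruce\'s.pdf'
```

The resulting string can be passed as the query argument in the same way as the simpler mimeType only version.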
Next we convert each of the blobs to PdfFiddlers, and split them into single pages
// building a PdfFiddler is async, so collect the promises and wait for them all
const inputPdfs = await Promise.all(
  inputFiles.map(f => Exports.PdfFiddler.build({ blobs: f.blob, name: f.data.name }))
)
// now split each of those into single page pdfs
const singlePages = await Promise.all(inputPdfs.map(f => f.splitBlob()))
make single pages
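The map plus Promise.all combination used here keeps the results in the same order as the inputs, which is what lets us match each set of single pages back to its source file later. Here’s a sketch of that with a fake split function. fakeSplit is hypothetical, and the name-index.pdf naming is only my reading of the log output further down, not a statement of how splitBlob is implemented.

```javascript
// hypothetical stand-in for an async split that names one result per page
const fakeSplit = (name, pages) =>
  Promise.resolve({
    name,
    splits: Array.from({ length: pages }, (_, i) => `${name}-${i}.pdf`)
  })

const inputs = [
  { name: 'flyer.pdf', pages: 1 },
  { name: 'example.pdf', pages: 3 }
]

const run = async () => {
  // map produces an array of promises; Promise.all resolves them in input order
  const results = await Promise.all(inputs.map(f => fakeSplit(f.name, f.pages)))
  return results.map(r => r.splits)
}

run().then(pages => console.log(pages))
```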
Finally upload the single pages to the output folder
// now upload everything
const outputs = singlePages.map(input => {
  const uploads = input.splits.map(f => drv.upload({
    blob: f.blob,
    parentId: outputFolder.data.id
  }))
  return {
    input,
    uploads
  }
})
console.log('Finished:uploaded these files')
outputs.forEach(f => {
  console.log('...from', f.input.source.name)
  f.uploads.forEach(s => console.log('.... ', s.data.name, 'size', s.data.size))
})
upload single pages
The result log
2:58:43 PM Info Finished:uploaded these files
2:58:43 PM Info ...from flyer.pdf
2:58:43 PM Info .... flyer.pdf-0.pdf size 306456
2:58:43 PM Info ...from example.pdf
2:58:43 PM Info .... example.pdf-0.pdf size 58894
2:58:43 PM Info .... example.pdf-1.pdf size 210226
2:58:43 PM Info .... example.pdf-2.pdf size 32597
2:58:43 PM Info ...from drylab.pdf
2:58:44 PM Info .... drylab.pdf-0.pdf size 666278
2:58:44 PM Info .... drylab.pdf-1.pdf size 112603
2:58:44 PM Info .... drylab.pdf-2.pdf size 639816
2:58:44 PM Info ...from somatosensory.pdf
2:58:44 PM Info .... somatosensory.pdf-0.pdf size 92685
2:58:44 PM Info .... somatosensory.pdf-1.pdf size 73875
2:58:44 PM Info .... somatosensory.pdf-2.pdf size 51826
2:58:44 PM Info .... somatosensory.pdf-3.pdf size 70866
result
Getting thumbnails
Now that we have a script to convert a collection of pdfs into a collection of single page pdfs, we can complete the final step and make images from each page.
// now get all the thumbnails
const imagePath = '/public/splitimages'
// if it's not there we'll create it
const imageFolder = drv.getFolder({
  path: imagePath,
  createIfMissing: true
})
const files = outputs
  .map(f => f.uploads)
  .flat(Infinity)
  .map(f => drv.get({ id: f.data.id }, [{ fields: 'thumbnailLink' }]))
const images = UrlFetchApp
  .fetchAll(files.map(file => ({
    url: file.data.thumbnailLink
  })))
  .map((f, i) => f.getBlob().setName(`${files[i].data.name}.png`))
  .map(blob => drv.upload({
    blob,
    parentId: imageFolder.data.id
  }))
console.log('Finished:created this images in', imageFolder.data.name)
images.forEach(f => console.log('...', f.data.name))
get all the thumbnails and write them to drive
Result of creating images
3:43:33 PM Info Finished:created this images in splitimages
3:43:33 PM Info ... flyer.pdf-0.pdf.png
3:43:33 PM Info ... example.pdf-0.pdf.png
3:43:33 PM Info ... example.pdf-1.pdf.png
3:43:33 PM Info ... example.pdf-2.pdf.png
3:43:33 PM Info ... drylab.pdf-0.pdf.png
3:43:33 PM Info ... drylab.pdf-1.pdf.png
3:43:33 PM Info ... drylab.pdf-2.pdf.png
3:43:33 PM Info ... somatosensory.pdf-0.pdf.png
3:43:33 PM Info ... somatosensory.pdf-1.pdf.png
3:43:33 PM Info ... somatosensory.pdf-2.pdf.png
3:43:33 PM Info ... somatosensory.pdf-3.pdf.png
image uploads
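The thumbnail code above assumes every file comes back with a thumbnailLink. As far as I can tell that isn’t guaranteed, for example if Drive hasn’t generated a thumbnail for a newly uploaded file yet, so a more defensive version might filter first and keep each image name alongside its request. This is only a sketch: planThumbnails is a hypothetical helper, the metadata is made up, and the shape of each file follows the { data: { name, thumbnailLink } } pattern used above.

```javascript
// made up file metadata in the shape used in the article
const files = [
  { data: { name: 'flyer.pdf-0.pdf', thumbnailLink: 'https://example.com/thumb/1' } },
  { data: { name: 'pending.pdf-0.pdf' } } // no thumbnail available
]

// hypothetical helper: pair each usable thumbnail link with the image name it will get
const planThumbnails = (items) =>
  items
    .filter(file => file.data.thumbnailLink)
    .map(file => ({
      name: `${file.data.name}.png`,
      // muteHttpExceptions stops a single failed fetch from throwing
      request: { url: file.data.thumbnailLink, muteHttpExceptions: true }
    }))

const plan = planThumbnails(files)
console.log(plan.map(p => p.name)) // [ 'flyer.pdf-0.pdf.png' ]
```

In Apps Script you would then pass plan.map(p => p.request) to UrlFetchApp.fetchAll and, assuming the responses come back in request order, name each blob from the matching plan entry.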
Next
That’s all for now. You can find all the code for these examples at testBmPdf. In the next article on this topic we’ll look at how to use Convert any file with Apps Script in conjunction with PdfFiddler to create an image from any kind of input.
Links
bmPDF library (1Aq5ReP4_Ghk4zLKnJvtyPO717UlBqT0YqyCnpoWonxrIiI6xMa8g0sf2) (github)
bmApiCentral library (1L4pGblikbjQLQp8nQCdmCfyCxmF3MIShzsK8yy_mJ9_2YMdanXQA75vI) (github)
testBmPdf (github)
sample pdf folder , Gormenghast folder
pdf-lib https://pdf-lib.js.org/
Convert any file with Apps Script
Promises and async class constructors in Apps Script
Fix Apps Script file order problems with Exports
bmWeConvertAnyFile – 1dZjyNnLCMS_2oEcFRjOf9oYu0qNHkfkfQovycSytn1FhsABTXy0Wnp4z (github)