How to plan your extractor

First, there are a few files that you should become familiar with, as they'll help you out a ton. Skim these when you're first getting started, and if you have trouble figuring out what's going on, this is where you should start:

text.py - mikf has put together some handy functions for working with HTML and URLs. These save you from having to write regexes or manipulate strings by hand for every little thing. Here are a few examples:

split_html - Pass it HTML code. It returns a list of the text pieces found between the HTML tags. Great for finding text fields in an HTML page.
root_from_url - Pass it a full URL, and it returns the scheme and domain of the site. Pass it (http://www.cw.com/file/storage/image.jpg) and it will return http://www.cw.com, with no trailing slash. Handy when you want to build other URLs for the same site.
ext_from_url - Pass it a full URL, and it returns the filename extension. Pass it (http://www.cw.com/file/storage/image.jpg) and it will return jpg, without the leading dot. You'll need this to write your filename.
nameext_from_url - Pass it a URL, and it returns a dictionary with the filename and extension taken from the last part of the path. Anything after the ? is ignored, and percent-encoded characters such as %20 in the name are decoded. Pass it (http://www.cw.com/file/image.php?search=Pretty%20Pictures) and you get back a filename of image and an extension of php; the search parameter is not part of the result. Use this to fill in the data that names your file.
extract - Extract text between two points of text. If your site text has a piece of data you want to extract, you send this function the full text, the beginning, and the end. For example, if the page says "Here is our image of the day: Image123 And here is our favorite of the week" and you want the data at "Image123", you would call extract with "day:" as the beginning and "And here is" as the end. It returns the text in between (including any surrounding spaces) together with the position where it stopped, so you can keep searching from there.
extract_all - Like extract, but you pass it a list of the fields you're looking for, each with its own beginning and end, and it returns them all in one dictionary. Example here.
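
To make the behaviour of these helpers a bit more concrete, here is a rough stand-in written with only the standard library. This is not the code in text.py: the names extract_between and name_and_ext are made up for this sketch, and edge cases may be handled differently in the real functions.

```python
from urllib.parse import unquote, urlsplit


def extract_between(txt, begin, end, pos=0):
    """Return (text between begin and end, position just after end).

    Hypothetical stand-in for the idea behind text.extract().
    Returns (None, pos) when either marker is missing.
    """
    try:
        first = txt.index(begin, pos) + len(begin)
        last = txt.index(end, first)
    except ValueError:
        return None, pos
    return txt[first:last], last + len(end)


def name_and_ext(url):
    """Split the last path segment of a URL into filename and extension.

    Hypothetical stand-in for the idea behind text.nameext_from_url().
    The query string is dropped by urlsplit() before splitting.
    """
    last_segment = unquote(urlsplit(url).path.rpartition("/")[2])
    name, _, ext = last_segment.rpartition(".")
    if name:
        return {"filename": name, "extension": ext.lower()}
    # No usable dot: treat the whole segment as the filename
    return {"filename": last_segment, "extension": ""}


page = "Here is our image of the day: Image123 And here is our favorite of the week"
value, pos = extract_between(page, "day:", "And here is")
# The raw value keeps the spaces around it, so strip them before use
print(repr(value.strip()))

print(name_and_ext("http://www.cw.com/file/storage/image.jpg"))
```

Returning the position along with the value is what lets you chain several extractions over the same page without rescanning from the top.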

common.py - These are the common base classes you'll want to build on.

Extractor - The base class your extractor inherits from. This is what you use to download an individual file.
GalleryExtractor - A base class for albums. It shows how you can find the URLs in an album and send those on to be downloaded.

message.py - This is how you communicate with gallery-dl.

You yield a Message from your extractor with the URL of your file, along with information gallery-dl needs for the database like the site, filename, etc.

There are some methods built into the Extractor you can use as well. To test these, you can add the following inside your items() method:

    # Save the full text of the HTML page to the 'page' variable:
    page = self.request(text.ensure_http_scheme(self.url)).text
    # Output it to the screen for debugging:
    print(f'**DEBUG OUTPUT::: page = {page}')

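And some variables of interest you can reference in your methods: self.match (the regex match object produced when your pattern matched the URL) and self.match.groups() (the capture groups from that pattern).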

Let's go back and look at the sample code in our extractor:

from .common import Extractor, Message
from .. import text
Here we import those files I mentioned earlier. This is how we call them.

class CWExampleExtractor(Extractor):
Now we're setting up the class. It doesn't matter what you call it, but most people pick a name that relates to the site and what the extractor does.

category = "cw"
This has to align with the site abbreviation we've been using all along.

subcategory = "test"
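This names the kind of thing this particular extractor handles on the site, such as "gallery", "user", or "image". A site often has several extractors that share one category and differ only in subcategory.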

pattern = r"(?:https?://)?contosoweb\.com"
This is how gallery-dl determines when to use this extractor. It's very important to get your regex right on this. It also has to be unique among all of the other extractors.
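
You can try a pattern out with plain Python before putting it in your extractor. The sketch below assumes the pattern is applied from the start of the URL, the way re.match works. The second pattern, with its /gallery/<name> URL layout and the name gallery_pattern, is invented for this example to show where the values in self.match.groups() come from.

```python
import re

# The pattern from the sample extractor
pattern = re.compile(r"(?:https?://)?contosoweb\.com")

# Hypothetical variant: allows an optional "www." and captures a gallery name.
# The /gallery/<name> path is made up for illustration.
gallery_pattern = re.compile(
    r"(?:https?://)?(?:www\.)?contosoweb\.com/gallery/([^/?#]+)"
)

# Matches: scheme followed directly by the domain
print(bool(pattern.match("https://contosoweb.com/anything")))

# Does not match: the sample pattern has no allowance for "www."
print(bool(pattern.match("https://www.contosoweb.com/anything")))

# Capture groups end up in match.groups()
match = gallery_pattern.match("https://www.contosoweb.com/gallery/sunsets")
print(match.groups())
```

Testing like this is a quick way to catch a pattern that is too strict (missing the www. form of your site) or too loose (matching URLs that belong to another extractor).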

def items(self):
An items() method is required. This is how you set up the URLs that interact with gallery-dl.

url = "https://www.contosoweb.com/_img/gallery/2022/fancy-image.jpg"
You need a url variable, as this is what gallery-dl is ultimately here for - downloading this file. We won't keep it hard-coded in our sample code for long, but it's good for getting started.

data = text.nameext_from_url(url)
You need a data variable, which is a dictionary of information about your file.

yield Message.Directory, data
yield Message.Url, url, data
And here is where we finally pass the url and data to Message, which tells gallery-dl to download the file and where to store it, and lets it do its tracking magic.
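
If the yield lines feel a bit magical, it may help to see the same flow outside of gallery-dl. The sketch below is a toy model only: the Message class here is a stand-in with placeholder values, and run() is a made-up consumer that just records what a downloader would do with each message.

```python
class Message:
    """Stand-in for gallery-dl's Message; these values are placeholders."""
    Directory = "directory"
    Url = "url"


def items():
    """Mimics the sample items() method as a plain generator function."""
    url = "https://www.contosoweb.com/_img/gallery/2022/fancy-image.jpg"

    # Split "fancy-image.jpg" into name and extension by hand
    name, _, ext = url.rpartition("/")[2].rpartition(".")
    data = {"filename": name, "extension": ext}

    yield Message.Directory, data
    yield Message.Url, url, data


def run(messages):
    """Toy consumer: records an action for each message it receives."""
    actions = []
    for msg in messages:
        if msg[0] == Message.Directory:
            actions.append(("prepare directory", msg[1]))
        elif msg[0] == Message.Url:
            _, url, data = msg
            target = f"{data['filename']}.{data['extension']}"
            actions.append(("download", url, target))
    return actions


for action in run(items()):
    print(action)
```

The point to take away is that items() never downloads anything itself. It only describes the work, one message at a time, and whatever is consuming the generator decides what to do with each message.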

One other tidbit: I like to add some log lines to help me debug what's going on. Inside my items() method, I'll put a line like this:

self.log.debug(f"Logging in with URL set to {url}")

That way I can see what the variables are set to. Debug messages show up when you run gallery-dl with -v (--verbose). This helps me work through the kinks and see what's going on.
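
Assuming self.log behaves like a regular Python logging.Logger, the sketch below shows why a debug line can seem to do nothing: it is dropped unless the logger's level is set low enough. The logger name "cw" and the output format are only illustrative.

```python
import io
import logging

# Stand-in for self.log; the logger name "cw" is only illustrative
log = logging.getLogger("cw")
log.propagate = False  # keep this example's output in our own handler

# Collect the output in memory so we can inspect it afterwards
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("[%(name)s][%(levelname)s] %(message)s"))
log.addHandler(handler)

url = "https://www.contosoweb.com/_img/gallery/2022/fancy-image.jpg"

# At INFO level, debug messages are silently discarded
log.setLevel(logging.INFO)
log.debug("hidden at INFO level: url = %s", url)

# At DEBUG level, the same kind of call produces output
log.setLevel(logging.DEBUG)
log.debug("visible at DEBUG level: url = %s", url)

print(buffer.getvalue(), end="")
```

Compared to print(), log lines can stay in your extractor once it works, since they stay quiet unless someone asks for debug output.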

In the next (and last) installment, I'll show you how to modify this for your site.
