Tokenize entire <cite> element in line of text #3306

starsbit · 2024-05-30T13:46:55Z

Hello,

I am trying to build a custom tokenizer because I want to achieve the following behavior:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce erat quam, ultrices at diam a, ullamcorper commodo sapien. Curabitur vestibulum auctor massa sed placerat. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus sed consequat orci. <cite>marked</cite>

The current HTML tokenizer emits two tokens here for the HTML renderer: <cite> and </cite>.
I want <cite>marked</cite> to be recognized as a single token instead.

My current approach is this:

export class CiteTokenizer extends Tokenizer {
  constructor(options?: MarkedOptions) {
    super(options);
  }

  override html(src: string): Tokens.HTML | undefined {
    const match = /^<cite>([\s\S]+?)<\/cite>/.exec(src);

    if (match) {
      return {
        type: 'html',
        pre: false,
        text: match[0],
        block: false,
        raw: match[0],
      };
    }

    return super.html(src);
  }
}

This works when a line consists only of a cite element, but not when other text precedes it. Nothing I have tried so far has produced the behavior I want.

Is this even possible with a custom tokenizer, or would a different approach be better?

Answered by UziTech

UziTech · 2024-05-30T14:23:48Z

Maintainer

You will need to write a custom extension.

Something like:

const citeExtension = {
  name: 'cite',
  level: 'inline',
  start(src) { return src.indexOf("<cite"); },
  tokenizer(src, tokens) {
    const rule = /^<cite>([\s\S]+?)<\/cite>/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'html',
        pre: false,
        text: match[0],
        block: false,
        raw: match[0],
      };
    }
  }
};
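
To illustrate why the extension pairs a `start` function with a `^`-anchored rule: as far as the marked extension docs describe it, `start` tells the lexer where plain text should be cut off so that the tokenizer is tried at the candidate position. The sketch below is a simplified stand-alone illustration of that idea, not marked's actual lexer; `splitAtCite` and `CiteToken` are hypothetical names introduced only for this example.

```typescript
// Simplified illustration only: this is NOT marked's lexer, just the idea
// behind pairing a `start` hint with a `^`-anchored tokenizer rule.
interface CiteToken {
  type: 'cite';
  raw: string;  // full matched source, tags included
  text: string; // inner content only
}

const citeRule = /^<cite>([\s\S]+?)<\/cite>/;

// Same shape as the extension's `start`: index of the next candidate, or -1.
function start(src: string): number {
  return src.indexOf('<cite');
}

// Same shape as the extension's `tokenizer`: matches only at position 0.
function tokenizer(src: string): CiteToken | undefined {
  const match = citeRule.exec(src);
  if (!match) return undefined;
  return { type: 'cite', raw: match[0], text: match[1] };
}

// Hypothetical helper: separate the leading text from the first cite element.
function splitAtCite(src: string): { before: string; cite: CiteToken | undefined } {
  const index = start(src);
  if (index < 0) return { before: src, cite: undefined };
  return { before: src.slice(0, index), cite: tokenizer(src.slice(index)) };
}

const input = 'Vivamus sed consequat orci. <cite>marked</cite>';

// Undefined: the rule is anchored, so it cannot match in the middle of a line.
const atStart = tokenizer(input);

// Once the leading text is cut off at the `start` index, the rule matches.
const result = splitAtCite(input);
```

This is also consistent with the behavior described in the question: an anchored rule only fires when the remaining source begins with `<cite>`, so something has to consume the preceding text first.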

2 replies

starsbit May 30, 2024
Author

Thank you very much for this answer! Sadly, I cannot get this extension to trigger. Neither the start nor the tokenizer function gets called in my code, while my previous tokenizer does get called.

For now I am generating my options like this using the extension:

export const markedOptionsFactory = (): MarkedOptions => {
  return {
    extensions: [citeExtension] as any,
  };
};

const citeExtension = {
  name: 'cite',
  level: 'inline',
  start(src: any) {
    console;
    return src.indexOf('<cite');
  },
  tokenizer(src: any, tokens: any) {
    const rule = /^<cite>([\s\S]+?)<\/cite>/;
    const match = rule.exec(src);
    console.log(src, tokens);
    if (match) {
      return {
        type: 'html',
        pre: false,
        text: match[0],
        block: false,
        raw: match[0],
      };
    }
    return;
  },
  renderer(token: any) {
    return `<cite>${token.text}</cite>`;
  },
};

Am I missing something for it to be recognized? Thank you very much again!!

UziTech Jun 2, 2024
Maintainer

This seems to work:

import { Marked } from 'marked';

const marked = new Marked();

const markedOptionsFactory = () => {
  return {
    extensions: [citeExtension]
  };
};

const citeExtension = {
  name: 'cite',
  level: 'inline',
  start(src) {
    return src.indexOf('<cite');
  },
  tokenizer(src, tokens) {
    const rule = /^<cite>([\s\S]+?)<\/cite>/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'cite',
        raw: match[0],
        text: match[0]
      };
    }
  },
  renderer(token) {
    return `<cite id="test">${token.text}</cite>`;
  }
};

marked.use(markedOptionsFactory());

console.log(marked.parse('<cite>foo</cite>'));
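
One detail worth noting in the snippet above: `token.text` is set to `match[0]`, which includes the `<cite>` tags, so the renderer as written would appear to wrap one `<cite>` inside another. A possible variant, sketched below under that assumption, stores only the captured inner content in `text`. The `CiteToken` interface is a name made up for this sketch, and the tokenizer and renderer are called directly here so the example runs without marked installed; registration would presumably stay the same as above.

```typescript
// Variant of the extension above; `CiteToken` is a hypothetical type name.
interface CiteToken {
  type: 'cite';
  raw: string;  // full source consumed by the token, tags included
  text: string; // inner content only, without the <cite> tags
}

const citeExtension = {
  name: 'cite',
  level: 'inline' as const,
  // Index of the next possible cite element, or -1 if there is none.
  start(src: string): number {
    return src.indexOf('<cite');
  },
  // Anchored rule: matches only when src begins with a cite element.
  tokenizer(src: string): CiteToken | undefined {
    const match = /^<cite>([\s\S]+?)<\/cite>/.exec(src);
    if (!match) return undefined;
    return { type: 'cite', raw: match[0], text: match[1] };
  },
  // Re-wraps the inner content; note the text is inserted unescaped.
  renderer(token: CiteToken): string {
    return `<cite id="test">${token.text}</cite>`;
  },
};

// Registration would be the same as in the snippet above, e.g.:
//   marked.use({ extensions: [citeExtension] });

// Calling the two functions directly to show what each one produces.
const token = citeExtension.tokenizer('<cite>foo</cite> and more text');
const html = token ? citeExtension.renderer(token) : '';
// html is '<cite id="test">foo</cite>'
```

Keeping `raw` as the full match matters because `raw` tells the lexer how much source the token consumed, while `text` is free to hold whatever the renderer needs.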
