Prompt Generator

Overview

This part contains TemplateGenerator and VerbalizerGenerator. Both follow the implementation in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020). Each one automatically generates its target, a hard template or a verbalizer, based on the given counterpart.

Base Classes

Any prompt generator that uses the LM-BFF method can be built on the two base classes by implementing their abstract methods. The provided implementations, T5TemplateGenerator and RobertaVerbalizerGenerator, inherit from TemplateGenerator and VerbalizerGenerator, respectively.

class TemplateGenerator(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, tokenizer_wrapper: tokenizers.Tokenizer, verbalizer: openprompt.prompt_base.Verbalizer, max_length: Optional[int] = 20, target_number: Optional[int] = 2, beam_width: Optional[int] = 100, length_limit: Optional[List[int]] = None, forbidden_word_ids: Optional[List[int]] = [], config: Optional[yacs.config.CfgNode] = None)[source]

This is the automatic template search implementation for LM-BFF. It uses a generation model to generate the multi-part text that fills in the template. By jointly considering all samples in the dataset, it uses beam search decoding to generate a designated number of templates with the highest probability. Each generated template can then be used uniformly for all samples in the dataset.

Parameters
  • model (PretrainedModel) – A pretrained model for generation.

  • tokenizer (PretrainedTokenizer) – A tokenizer of the corresponding type.

  • tokenizer_wrapper (TokenizerWrapper) – A tokenizer wrapper class of the corresponding type.

  • max_length (Optional[int]) – The maximum total length of the generated template. Defaults to 20.

  • target_number (Optional[int]) – The number of separate parts to generate, e.g. in T5, every <extra_id_{}> token stands for one part. Defaults to 2.

  • beam_width (Optional[int]) – The beam search width. Defaults to 100.

  • length_limit (Optional[List[int]]) – The length limit for each generated part. If None, there is no limit. If not None, the list should have a length equal to target_number. Defaults to None.

  • forbidden_word_ids (Optional[List[int]]) – Tokenizer-specific token ids that must not be generated. Defaults to [], i.e. all tokens in the vocabulary are allowed in the generated template.
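The joint beam search described above can be illustrated with a toy, self-contained sketch. It is not the library's implementation: joint_beam_search, toy_model, and the choice to sum token log-probabilities across samples are assumptions made for illustration, and a real generator would query the generation model instead of a lookup table.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Prefix = Tuple[str, ...]
# Hypothetical model interface: (sample, prefix) -> {next_token: log-probability}.
NextLogProbs = Callable[[str, Prefix], Dict[str, float]]


def joint_beam_search(
    samples: Sequence[str],
    next_log_probs: NextLogProbs,
    beam_width: int = 2,
    max_length: int = 4,
    end_token: str = "</s>",
) -> List[Tuple[Prefix, float]]:
    """Return finished sequences, best first, scored jointly over all samples."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_length):
        candidates: List[Tuple[Prefix, float]] = []
        for prefix, score in beams:
            # Sum each token's log-probability over every sample (assumed aggregation).
            totals: Dict[str, float] = {}
            for sample in samples:
                for token, log_prob in next_log_probs(sample, prefix).items():
                    totals[token] = totals.get(token, 0.0) + log_prob
            for token, total in totals.items():
                candidates.append((prefix + (token,), score + total))
        # Keep only the beam_width best extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished


def toy_model(sample: str, prefix: Prefix) -> Dict[str, float]:
    """Toy distribution that ignores the sample; a real model would not."""
    if not prefix:
        probs = {"it": 0.6, "was": 0.3, "</s>": 0.1}
    elif prefix[-1] == "it":
        probs = {"it": 0.1, "was": 0.7, "</s>": 0.2}
    else:
        probs = {"it": 0.1, "was": 0.1, "</s>": 0.8}
    return {token: math.log(p) for token, p in probs.items()}


results = joint_beam_search(["sample one", "sample two"], toy_model)
print(results[0][0])  # best sequence: ('it', 'was', '</s>')
```

A larger beam_width explores more candidate templates at each step at the cost of more model calls, which is the trade-off the beam_width parameter controls.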

property device

Returns the device of the model.

abstract get_part_token_id(part_id: int) → int[source]

Get the start token id for the current part. It should be specified according to the specific model type. For a T5 model, for example, the start token for part_id=0 is <extra_id_0>, and this method should return the corresponding token_id.

Parameters

part_id (int) – The current part id (starts with 0).

Returns

The corresponding start token_id.

Return type

token_id (int)
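As a rough illustration of the mapping from part id to start token id, the sketch below looks up the T5-style sentinel token for a given part. ToyTokenizer is a hypothetical stand-in so the example runs without downloading a model, and the free function get_part_token_id is an assumed shape for the method body, not the library's code. With a Hugging Face tokenizer, the same convert_tokens_to_ids call would be used.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a T5 tokenizer with sentinel tokens."""

    def __init__(self, vocab: List[str]) -> None:
        self.vocab: Dict[str, int] = {token: i for i, token in enumerate(vocab)}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.vocab[token]


def get_part_token_id(tokenizer: ToyTokenizer, part_id: int) -> int:
    # Part 0 starts at <extra_id_0>, part 1 at <extra_id_1>, and so on.
    return tokenizer.convert_tokens_to_ids(f"<extra_id_{part_id}>")


tokenizer = ToyTokenizer(["it", "is", "<extra_id_0>", "<extra_id_1>"])
print(get_part_token_id(tokenizer, 0))  # 2
print(get_part_token_id(tokenizer, 1))  # 3
```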

convert_template(generated_template: List[str], original_template: List[Dict]) → str[source]

Given the original template used for template generation, convert the generated template into a standard template for the downstream prompt model and return it as a str.

Example:

  generated_template: ['<extra_id_0>', 'it', 'is', '<extra_id_1>', 'one', '</s>']

  original_template: [{'add_prefix_space': '', 'placeholder': 'text_a'}, {'add_prefix_space': ' ', 'mask': None}, {'add_prefix_space': ' ', 'meta': 'labelword'}, {'add_prefix_space': ' ', 'mask': None}, {'add_prefix_space': '', 'text': '.'}]

  return: "{'placeholder':'text_a'} it is {"mask"} one."
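The conversion in the example above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: split_parts and convert_template_sketch are hypothetical helpers, tokens are joined with single spaces instead of being detokenized, and sentinel tokens are recognized by the <extra_id_ prefix.

```python
from typing import Dict, List


def split_parts(generated: List[str], end_token: str = "</s>") -> List[List[str]]:
    """Group generated tokens into parts, one per sentinel token."""
    parts: List[List[str]] = []
    for token in generated:
        if token == end_token:
            break
        if token.startswith("<extra_id_"):
            parts.append([])  # a sentinel opens a new part
        elif parts:
            parts[-1].append(token)
    return parts


def convert_template_sketch(generated: List[str], original: List[Dict]) -> str:
    parts = iter(split_parts(generated))
    pieces: List[str] = []
    for item in original:
        prefix = item.get("add_prefix_space", "")
        if "placeholder" in item:
            pieces.append(prefix + "{'placeholder':'%s'}" % item["placeholder"])
        elif "mask" in item:
            # Each mask slot in the original template receives one generated part.
            pieces.append(prefix + " ".join(next(parts, [])))
        elif item.get("meta") == "labelword":
            # The label word position becomes the mask of the final template.
            pieces.append(prefix + '{"mask"}')
        elif "text" in item:
            pieces.append(prefix + item["text"])
    return "".join(pieces)


generated_template = ["<extra_id_0>", "it", "is", "<extra_id_1>", "one", "</s>"]
original_template = [
    {"add_prefix_space": "", "placeholder": "text_a"},
    {"add_prefix_space": " ", "mask": None},
    {"add_prefix_space": " ", "meta": "labelword"},
    {"add_prefix_space": " ", "mask": None},
    {"add_prefix_space": "", "text": "."},
]
print(convert_template_sketch(generated_template, original_template))
# {'placeholder':'text_a'} it is {"mask"} one.
```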

classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]
Returns

template_generator (TemplateGenerator)

generate(dataset: List[openprompt.data_utils.utils.InputExample])[source]
Parameters

dataset (List[InputExample]) – The dataset based on which the template is to be generated.

Returns

The generated template text

Return type

template_text (List[str])

class VerbalizerGenerator(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, candidate_num: Optional[int] = 100, label_word_num_per_class: Optional[int] = 100)[source]

This is the automatic label word search implementation in LM-BFF.

Parameters
  • model (PretrainedModel) – A pre-trained model for label word generation.

  • tokenizer (PretrainedTokenizer) – The corresponding tokenizer.

  • candidate_num (Optional[int]) – The number of label word combinations to generate. Validation will then be performed on each combination. Defaults to 100.

  • label_word_num_per_class (Optional[int]) – The number of candidate label words per class. Defaults to 100.

abstract post_process(word: str)[source]

Post-processing for a generated label word.

Parameters

word (str) – The original word token.

Returns

The post-processed token.

Return type

processed_word (str)

abstract invalid_label_word(word: str)[source]

Decide whether the generated token is invalid as a label word. A heuristic strategy can be implemented here, e.g. requiring that a label word be the start token of a word.

Parameters

word (str) – The token.

Returns

True if it cannot be a label word.

Return type

is_invalid (bool)
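One way the two abstract methods might be filled in is sketched below, assuming a byte-level BPE vocabulary such as RoBERTa's, where a token that begins a new word carries a leading "Ġ" marker. The function bodies are illustrative assumptions written as free functions, not a copy of the library code.

```python
# "Ġ" marks a token that starts a new word in byte-level BPE vocabularies.
WORD_START = "Ġ"


def invalid_label_word(word: str) -> bool:
    """Return True if the token cannot be a label word (it does not start a word)."""
    return not word.startswith(WORD_START)


def post_process(word: str) -> str:
    """Strip the word-start marker so the plain word remains."""
    return word.lstrip(WORD_START)


for token in ["Ġgreat", "ible", "Ġterrible"]:
    if not invalid_label_word(token):
        print(post_process(token))  # prints "great", then "terrible"
```

Filtering on the word-start marker rejects sub-word fragments such as "ible", which would not read as a standalone label word.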

classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]
Returns

verbalizer_generator (VerbalizerGenerator)

generate()[source]

Generate label words.

Returns

A list of generated label words.

Return type

label_words (List[List[str]])
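To make the roles of label_word_num_per_class and candidate_num concrete, the sketch below builds label word combinations from per-class scores. Everything here is an assumption for illustration: candidate_combinations is a hypothetical helper, the scores are made up, each inner list is assumed to hold one word per class, and ranking combinations by the sum of their scores is only one plausible choice.

```python
from itertools import product
from typing import Dict, List


def candidate_combinations(
    scores_per_class: List[Dict[str, float]],
    label_word_num_per_class: int,
    candidate_num: int,
) -> List[List[str]]:
    # Keep the best-scoring words for each class.
    top_words: List[List[str]] = []
    for scores in scores_per_class:
        ranked = sorted(scores, key=lambda word: scores[word], reverse=True)
        top_words.append(ranked[:label_word_num_per_class])

    def combo_score(combo) -> float:
        return sum(scores_per_class[i][word] for i, word in enumerate(combo))

    # One word per class; skip combinations that reuse a word for two classes.
    combos = [c for c in product(*top_words) if len(set(c)) == len(c)]
    combos.sort(key=combo_score, reverse=True)
    return [list(c) for c in combos[:candidate_num]]


toy_scores = [
    {"great": 0.9, "good": 0.7, "bad": 0.1},      # class 0 (e.g. positive)
    {"terrible": 0.8, "bad": 0.5, "good": 0.2},   # class 1 (e.g. negative)
]
print(candidate_combinations(toy_scores, label_word_num_per_class=2, candidate_num=3))
# [['great', 'terrible'], ['good', 'terrible'], ['great', 'bad']]
```

Each surviving combination would then be validated, as the candidate_num description notes.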

T5TemplateGenerator

class T5TemplateGenerator(model: transformers.models.t5.modeling_t5.T5ForConditionalGeneration, tokenizer: transformers.models.t5.tokenization_t5.T5Tokenizer, tokenizer_wrapper: tokenizers.Tokenizer, verbalizer: openprompt.prompt_base.Verbalizer, max_length: Optional[int] = 20, target_number: Optional[int] = 2, beam_width: Optional[int] = 100, length_limit: Optional[List[int]] = None, forbidden_word_ids: Optional[List[int]] = [3, 19794, 22354], config: Optional[yacs.config.CfgNode] = None)[source]

Automatic template search using T5 model. This class inherits from TemplateGenerator.

get_part_token_id(part_id)[source]

Get the start token id for the current part. It should be specified according to the specific model type. For a T5 model, for example, the start token for part_id=0 is <extra_id_0>, and this method should return the corresponding token_id.

Parameters

part_id (int) – The current part id (starts with 0).

Returns

The corresponding start token_id.

Return type

token_id (int)

RobertaVerbalizerGenerator

class RobertaVerbalizerGenerator(model: transformers.models.roberta.modeling_roberta.RobertaForMaskedLM, tokenizer: transformers.models.roberta.tokenization_roberta.RobertaTokenizer, candidate_num: Optional[int] = 100, label_word_num_per_class: Optional[int] = 100)[source]
invalid_label_word(word: str)[source]

Decide whether the generated token is invalid as a label word. A heuristic strategy can be implemented here, e.g. requiring that a label word be the start token of a word.

Parameters

word (str) – The token.

Returns

True if it cannot be a label word.

Return type

is_invalid (bool)

post_process(word: str)[source]

Post-processing for a generated label word.

Parameters

word (str) – The original word token.

Returns

The post-processed token.

Return type

processed_word (str)