Prompt Generator¶
Overview¶
This part contains TemplateGenerator and VerbalizerGenerator. Both follow the implementation in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020) and automatically generate a hard template or a verbalizer based on its given counterpart.
Base Classes¶
Any prompt generator using the LM-BFF method can be realized from the two base classes by simply re-implementing their abstract methods. The provided implementations T5TemplateGenerator and RobertaVerbalizerGenerator inherit from TemplateGenerator and VerbalizerGenerator, respectively.
- class TemplateGenerator(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, tokenizer_wrapper: tokenizers.Tokenizer, verbalizer: openprompt.prompt_base.Verbalizer, max_length: Optional[int] = 20, target_number: Optional[int] = 2, beam_width: Optional[int] = 100, length_limit: Optional[List[int]] = None, forbidden_word_ids: Optional[List[int]] = [], config: Optional[yacs.config.CfgNode] = None)[source]¶
This is the automatic template search implementation for LM-BFF. It uses a generation model to generate multi-part text to fill in the template. By jointly considering all samples in the dataset, it uses beam search decoding to generate a designated number of templates with the highest probability. The generated template can then be used uniformly for all samples in the dataset.
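The dataset-level beam search can be sketched in plain Python. This is a toy illustration, not the actual implementation: the hypothetical `score_fn` stands in for the generation model's log-probabilities, and all names here are illustrative.

```python
# Toy sketch of dataset-level beam search for template generation.
# Each beam candidate is a partial template; its score is the SUM of
# per-sample log-probabilities over the whole dataset, so the search
# jointly considers all samples (as TemplateGenerator does).
import math
from typing import Callable, List, Sequence, Tuple


def beam_search_templates(
    vocab: Sequence[str],
    dataset: Sequence[str],
    score_fn: Callable[[List[str], str], float],
    max_length: int = 3,
    beam_width: int = 2,
) -> List[Tuple[List[str], float]]:
    """Grow templates token by token, keeping the `beam_width`
    candidates with the highest summed score over `dataset`."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_length):
        candidates = []
        for tokens, _ in beams:
            for tok in vocab:
                new_tokens = tokens + [tok]
                # Jointly consider every sample: sum per-sample scores.
                total = sum(score_fn(new_tokens, x) for x in dataset)
                candidates.append((new_tokens, total))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical scorer standing in for a T5 model: it simply prefers
# templates containing the token "is".
def toy_score(tokens: List[str], sample: str) -> float:
    return sum(math.log(2.0) if t == "is" else math.log(1.001) for t in tokens)


best = beam_search_templates(["it", "is", "great"], ["sample1", "sample2"], toy_score)
```

In the real class, `beam_width` and `max_length` correspond to the constructor arguments of the same names, and the scorer is the generation model rather than a hand-written function.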
- Parameters
  - model (PretrainedModel) – A pretrained model for generation.
  - tokenizer (PretrainedTokenizer) – A tokenizer of the corresponding type.
  - tokenizer_wrapper (TokenizerWrapper) – A tokenizer wrapper class of the corresponding type.
  - verbalizer (Verbalizer) – The verbalizer whose label words are used during template generation.
  - max_length (Optional[int]) – The maximum length of the total generated template. Defaults to 20.
  - target_number (Optional[int]) – The number of separate parts to generate, e.g. in T5, every <extra_id_{}> token stands for one part. Defaults to 2.
  - beam_width (Optional[int]) – The beam search width. Defaults to 100.
  - length_limit (Optional[List[int]]) – The length limit for each generated part. If None, there is no limit. If not None, the list should have a length equal to target_number. Defaults to None.
  - forbidden_word_ids (Optional[List[int]]) – Tokenizer-specific token ids that should be prevented from being generated. Defaults to [], i.e. all tokens in the vocabulary are allowed in the generated template.
- property device¶
  Return the device of the model.
- abstract get_part_token_id(part_id: int) → int [source]¶
  Get the start token id of the given part. It should be specified according to the specific model type. For a T5 model, for example, the start token for part_id=0 is <extra_id_0>, and this method should return the corresponding token_id.
  - Parameters
    - part_id (int) – The current part id (starting from 0).
  - Returns
    - The corresponding start token_id.
  - Return type
    - token_id (int)
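A concrete override for a T5-style model could look like the following sketch. The `_MockT5Tokenizer` and its token ids are made-up stand-ins for illustration; a real subclass would call the actual tokenizer's `convert_tokens_to_ids`.

```python
# Sketch of get_part_token_id for a T5-style generator: part i begins at
# the sentinel token <extra_id_i>, so the method just looks up that
# sentinel's id in the tokenizer vocabulary.
class _MockT5Tokenizer:
    """Illustrative stand-in for a T5 tokenizer (ids are hypothetical)."""

    def __init__(self):
        # Map sentinel tokens to descending ids, mimicking T5's layout.
        self._vocab = {f"<extra_id_{i}>": 32099 - i for i in range(100)}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


def get_part_token_id(tokenizer, part_id: int) -> int:
    """Return the id of the sentinel token that starts part `part_id`."""
    return tokenizer.convert_tokens_to_ids(f"<extra_id_{part_id}>")


tok = _MockT5Tokenizer()
```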
- convert_template(generated_template: List[str], original_template: List[Dict]) → str [source]¶
  Given the original template used for template generation, convert the generated template into a standard template for the downstream prompt model, returned as a str.
  Example:
  generated_template: ['<extra_id_0>', 'it', 'is', '<extra_id_1>', 'one', '</s>']
  original_template: [{'add_prefix_space': '', 'placeholder': 'text_a'}, {'add_prefix_space': ' ', 'mask': None}, {'add_prefix_space': ' ', 'meta': 'labelword'}, {'add_prefix_space': ' ', 'mask': None}, {'add_prefix_space': '', 'text': '.'}]
  return: "{'placeholder':'text_a'} it is {"mask"} one."
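The documented example can be reproduced with a pure-Python sketch of the conversion logic. This is a simplified illustration of the idea, not the library's implementation (which handles more template keys):

```python
from typing import Dict, List


def convert_template(generated: List[str], original: List[Dict]) -> str:
    # Split the generated tokens into parts, one part per <extra_id_*>
    # sentinel, stopping at the end-of-sequence token.
    parts: List[List[str]] = []
    current = None
    for tok in generated:
        if tok.startswith("<extra_id_"):
            if current is not None:
                parts.append(current)
            current = []
        elif tok == "</s>":
            break
        elif current is not None:
            current.append(tok)
    if current is not None:
        parts.append(current)

    # Rebuild the downstream template: placeholders and plain text are
    # kept, each {'mask': None} slot is filled with the next generated
    # part, and the label-word position becomes the downstream {"mask"}.
    pieces, i = [], 0
    for item in original:
        prefix = item.get("add_prefix_space", "")
        if "placeholder" in item:
            pieces.append(prefix + "{'placeholder':'%s'}" % item["placeholder"])
        elif item.get("meta") == "labelword":
            pieces.append(prefix + '{"mask"}')
        elif "mask" in item:
            pieces.append(prefix + " ".join(parts[i]))
            i += 1
        elif "text" in item:
            pieces.append(prefix + item["text"])
    return "".join(pieces)
```

Running it on the documented inputs yields exactly the documented output string.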
- classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]¶
  - Returns
    - template_generator (TemplateGenerator)
- class VerbalizerGenerator(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, candidate_num: Optional[int] = 100, label_word_num_per_class: Optional[int] = 100)[source]¶
This is the automatic label word search implementation in LM-BFF.
- Parameters
  - model (PretrainedModel) – A pretrained model for label word generation.
  - tokenizer (PretrainedTokenizer) – A tokenizer of the corresponding type.
  - candidate_num (Optional[int]) – The number of label word combinations to generate. Validation will then be performed on each combination. Defaults to 100.
  - label_word_num_per_class (Optional[int]) – The number of candidate label words per class. Defaults to 100.
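The core of the label-word search can be sketched without any model: for each class, rank tokens by their summed mask-position log-probability over that class's samples, and keep the top `label_word_num_per_class`. The scoring table below is made up for illustration; the real class obtains these scores from a masked language model such as RoBERTa.

```python
# Library-free sketch of LM-BFF-style label word search: aggregate
# mask-position log-probabilities per class, then keep the top-k tokens.
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def top_label_words(
    mask_logprobs: Sequence[Tuple[int, Dict[str, float]]],
    label_word_num_per_class: int = 2,
) -> Dict[int, List[str]]:
    """`mask_logprobs` holds (label, token -> log-prob) per sample."""
    totals: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for label, scores in mask_logprobs:
        for token, lp in scores.items():
            totals[label][token] += lp  # sum over the class's samples
    return {
        label: sorted(scores, key=scores.get, reverse=True)[:label_word_num_per_class]
        for label, scores in totals.items()
    }


# Hypothetical per-sample scores for a binary sentiment task.
samples = [
    (0, {"great": -0.1, "bad": -3.0}),
    (0, {"great": -0.2, "bad": -2.0}),
    (1, {"great": -2.5, "bad": -0.3}),
]
words = top_label_words(samples, label_word_num_per_class=1)
```

In the real pipeline, the top words per class are then combined into `candidate_num` verbalizer candidates and re-ranked by validation performance.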
- abstract invalid_label_word(word: str)[source]¶
Decide whether the generated token is a valid label word. Heuristic strategies can be implemented here, e.g. requiring that a label word must be the start token of a word.
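An override might implement the word-start heuristic mentioned above. This sketch assumes a RoBERTa-style BPE vocabulary, where a leading "Ġ" (U+0120) marks the start of a word; the exact check is illustrative.

```python
def invalid_label_word(word: str) -> bool:
    # Reject a candidate unless it starts a new word (leading "Ġ" in a
    # RoBERTa-style BPE vocabulary) and the remainder is purely alphabetic.
    # Returns True for INVALID words, matching the method's name.
    return not (word.startswith("\u0120") and word[1:].isalpha())
```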
- classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]¶
  - Returns
    - verbalizer_generator (VerbalizerGenerator)
T5TemplateGenerator¶
- class T5TemplateGenerator(model: transformers.models.t5.modeling_t5.T5ForConditionalGeneration, tokenizer: transformers.models.t5.tokenization_t5.T5Tokenizer, tokenizer_wrapper: tokenizers.Tokenizer, verbalizer: openprompt.prompt_base.Verbalizer, max_length: Optional[int] = 20, target_number: Optional[int] = 2, beam_width: Optional[int] = 100, length_limit: Optional[List[int]] = None, forbidden_word_ids: Optional[List[int]] = [3, 19794, 22354], config: Optional[yacs.config.CfgNode] = None)[source]¶
Automatic template search using a T5 model. This class inherits from TemplateGenerator.
- get_part_token_id(part_id)[source]¶
  Get the start token id of the given part. It should be specified according to the specific model type. For a T5 model, for example, the start token for part_id=0 is <extra_id_0>, and this method should return the corresponding token_id.
  - Parameters
    - part_id (int) – The current part id (starting from 0).
  - Returns
    - The corresponding start token_id.
  - Return type
    - token_id (int)
RobertaVerbalizerGenerator¶
- class RobertaVerbalizerGenerator(model: transformers.models.roberta.modeling_roberta.RobertaForMaskedLM, tokenizer: transformers.models.roberta.tokenization_roberta.RobertaTokenizer, candidate_num: Optional[int] = 100, label_word_num_per_class: Optional[int] = 100)[source]¶
Automatic label word search using a RoBERTa model. This class inherits from VerbalizerGenerator.