Prompt Generator¶
Overview¶
This part contains TemplateGenerator and VerbalizerGenerator. Both follow the implementation in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020) and automatically generate a hard template or a verbalizer based on its given counterpart.
Base Classes¶
Any prompt generator using the LM-BFF method can be realized from the two base classes by simply re-implementing their abstract methods. The provided implementations T5TemplateGenerator and RobertaVerbalizerGenerator inherit from TemplateGenerator and VerbalizerGenerator, respectively.
- class TemplateGenerator(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, tokenizer_wrapper: tokenizers.Tokenizer, verbalizer: openprompt.prompt_base.Verbalizer, max_length: Optional[int] = 20, target_number: Optional[int] = 2, beam_width: Optional[int] = 100, length_limit: Optional[List[int]] = None, forbidden_word_ids: Optional[List[int]] = [], config: Optional[yacs.config.CfgNode] = None)[source]¶
This is the automatic template search implementation for LM-BFF. It uses a generation model to generate multi-part text to fill in the template. By jointly considering all samples in the dataset, it uses beam search decoding to generate a designated number of templates with the highest probability. The generated template can then be used uniformly for all samples in the dataset.
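The dataset-level beam search can be sketched in plain Python. This is a toy illustration, not the actual implementation: the hypothetical `score_fn` stands in for the generation model's log-probabilities, and all names here are illustrative.

```python
# Toy sketch of dataset-level beam search for template generation.
# Each beam candidate is a partial template; its score is the SUM of
# per-sample log-probabilities over the whole dataset, so the search
# jointly considers all samples (as TemplateGenerator does).
import math
from typing import Callable, List, Sequence, Tuple


def beam_search_templates(
    vocab: Sequence[str],
    dataset: Sequence[str],
    score_fn: Callable[[List[str], str], float],
    max_length: int = 3,
    beam_width: int = 2,
) -> List[Tuple[List[str], float]]:
    """Grow templates token by token, keeping the `beam_width`
    candidates with the highest summed score over `dataset`."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_length):
        candidates = []
        for tokens, _ in beams:
            for tok in vocab:
                new_tokens = tokens + [tok]
                # Jointly consider every sample: sum per-sample scores.
                total = sum(score_fn(new_tokens, x) for x in dataset)
                candidates.append((new_tokens, total))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical scorer standing in for a T5 model: it simply prefers
# templates containing the token "is".
def toy_score(tokens: List[str], sample: str) -> float:
    return sum(math.log(2.0) if t == "is" else math.log(1.001) for t in tokens)


best = beam_search_templates(["it", "is", "great"], ["sample1", "sample2"], toy_score)
```

In the real class, `beam_width` and `max_length` correspond to the constructor arguments of the same names, and the scorer is the generation model rather than a hand-written function.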
- Parameters
  - model (PretrainedModel) – A pretrained model for generation.
  - tokenizer (PretrainedTokenizer) – A tokenizer of the corresponding type.
  - tokenizer_wrapper (TokenizerWrapper) – A tokenizer wrapper class of the corresponding type.
  - verbalizer (Verbalizer) – The verbalizer whose label words are used during template generation.
  - max_length (Optional[int]) – The maximum length of the total generated template. Defaults to 20.
  - target_number (Optional[int]) – The number of separate parts to generate, e.g. in T5, every <extra_id_{}> token stands for one part. Defaults to 2.
  - beam_width (Optional[int]) – The beam search width. Defaults to 100.
  - length_limit (Optional[List[int]]) – The length limit for each generated part. If None, there is no limit. If not None, the list should have a length equal to target_number. Defaults to None.
  - forbidden_word_ids (Optional[List[int]]) – Tokenizer-specific token ids that should be prevented from being generated. Defaults to [], i.e. all tokens in the vocabulary are allowed in the generated template.
- property device¶
  Return the device of the model.
- abstract get_part_token_id(part_id: int) → int [source]¶
  Get the start token id of the given part. It should be specified according to the specific model type. For a T5 model, for example, the start token for part_id=0 is <extra_id_0>, and this method should return the corresponding token_id.
  - Parameters
    - part_id (int) – The current part id (starting from 0).
  - Returns
    - The corresponding start token_id.
  - Return type
    - token_id (int)
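A concrete override for a T5-style model could look like the following sketch. The `_MockT5Tokenizer` and its token ids are made-up stand-ins for illustration; a real subclass would call the actual tokenizer's `convert_tokens_to_ids`.

```python
# Sketch of get_part_token_id for a T5-style generator: part i begins at
# the sentinel token <extra_id_i>, so the method just looks up that
# sentinel's id in the tokenizer vocabulary.
class _MockT5Tokenizer:
    """Illustrative stand-in for a T5 tokenizer (ids are hypothetical)."""

    def __init__(self):
        # Map sentinel tokens to descending ids, mimicking T5's layout.
        self._vocab = {f"<extra_id_{i}>": 32099 - i for i in range(100)}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]


def get_part_token_id(tokenizer, part_id: int) -> int:
    """Return the id of the sentinel token that starts part `part_id`."""
    return tokenizer.convert_tokens_to_ids(f"<extra_id_{part_id}>")


tok = _MockT5Tokenizer()
```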
- convert_template(generated_template: List[str], original_template: List[Dict]) → str [source]¶
  Given the original template used for template generation, convert the generated template into a standard template for the downstream prompt model, returned as a str.
  Example:
  generated_template: ['<extra_id_0>', 'it', 'is', '<extra_id_1>', 'one', '</s>']
  original_template: [{'add_prefix_space': '', 'placeholder': 'text_a'}, {'add_prefix_space': ' ', 'mask': None}, {'add_prefix_space': ' ', 'meta': 'labelword'}, {'add_prefix_space': ' ', 'mask': None}, {'add_prefix_space': '', 'text': '.'}]
  return: "{'placeholder':'text_a'} it is {"mask"} one."
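The documented example can be reproduced with a pure-Python sketch of the conversion logic. This is a simplified illustration of the idea, not the library's implementation (which handles more template keys):

```python
from typing import Dict, List


def convert_template(generated: List[str], original: List[Dict]) -> str:
    # Split the generated tokens into parts, one part per <extra_id_*>
    # sentinel, stopping at the end-of-sequence token.
    parts: List[List[str]] = []
    current = None
    for tok in generated:
        if tok.startswith("<extra_id_"):
            if current is not None:
                parts.append(current)
            current = []
        elif tok == "</s>":
            break
        elif current is not None:
            current.append(tok)
    if current is not None:
        parts.append(current)

    # Rebuild the downstream template: placeholders and plain text are
    # kept, each {'mask': None} slot is filled with the next generated
    # part, and the label-word position becomes the downstream {"mask"}.
    pieces, i = [], 0
    for item in original:
        prefix = item.get("add_prefix_space", "")
        if "placeholder" in item:
            pieces.append(prefix + "{'placeholder':'%s'}" % item["placeholder"])
        elif item.get("meta") == "labelword":
            pieces.append(prefix + '{"mask"}')
        elif "mask" in item:
            pieces.append(prefix + " ".join(parts[i]))
            i += 1
        elif "text" in item:
            pieces.append(prefix + item["text"])
    return "".join(pieces)
```

Running it on the documented inputs yields exactly the documented output string.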
- classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]¶
  - Returns
    - template_generator (TemplateGenerator)
- class VerbalizerGenerator(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, candidate_num: Optional[int] = 100, label_word_num_per_class: Optional[int] = 100)[source]¶
This is the automatic label word search implementation in LM-BFF.
- Parameters
  - model (PretrainedModel) – A pretrained model for label word generation.
  - tokenizer (PretrainedTokenizer) – A tokenizer of the corresponding type.
  - candidate_num (Optional[int]) – The number of label word combinations to generate. Validation will then be performed on each combination. Defaults to 100.
  - label_word_num_per_class (Optional[int]) – The number of candidate label words per class. Defaults to 100.
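The core of the label-word search can be sketched without any model: for each class, rank tokens by their summed mask-position log-probability over that class's samples, and keep the top `label_word_num_per_class`. The scoring table below is made up for illustration; the real class obtains these scores from a masked language model such as RoBERTa.

```python
# Library-free sketch of LM-BFF-style label word search: aggregate
# mask-position log-probabilities per class, then keep the top-k tokens.
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def top_label_words(
    mask_logprobs: Sequence[Tuple[int, Dict[str, float]]],
    label_word_num_per_class: int = 2,
) -> Dict[int, List[str]]:
    """`mask_logprobs` holds (label, token -> log-prob) per sample."""
    totals: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for label, scores in mask_logprobs:
        for token, lp in scores.items():
            totals[label][token] += lp  # sum over the class's samples
    return {
        label: sorted(scores, key=scores.get, reverse=True)[:label_word_num_per_class]
        for label, scores in totals.items()
    }


# Hypothetical per-sample scores for a binary sentiment task.
samples = [
    (0, {"great": -0.1, "bad": -3.0}),
    (0, {"great": -0.2, "bad": -2.0}),
    (1, {"great": -2.5, "bad": -0.3}),
]
words = top_label_words(samples, label_word_num_per_class=1)
```

In the real pipeline, the top words per class are then combined into `candidate_num` verbalizer candidates and re-ranked by validation performance.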
- abstract invalid_label_word(word: str)[source]¶
Decide whether the generated token is a valid label word. Heuristic strategies can be implemented here, e.g. requiring that a label word must be the start token of a word.
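An override might implement the word-start heuristic mentioned above. This sketch assumes a RoBERTa-style BPE vocabulary, where a leading "Ġ" (U+0120) marks the start of a word; the exact check is illustrative.

```python
def invalid_label_word(word: str) -> bool:
    # Reject a candidate unless it starts a new word (leading "Ġ" in a
    # RoBERTa-style BPE vocabulary) and the remainder is purely alphabetic.
    # Returns True for INVALID words, matching the method's name.
    return not (word.startswith("\u0120") and word[1:].isalpha())
```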
- classmethod from_config(config: yacs.config.CfgNode, **kwargs)[source]¶
  - Returns
    - verbalizer_generator (VerbalizerGenerator)
T5TemplateGenerator¶
- class T5TemplateGenerator(model: transformers.models.t5.modeling_t5.T5ForConditionalGeneration, tokenizer: transformers.models.t5.tokenization_t5.T5Tokenizer, tokenizer_wrapper: tokenizers.Tokenizer, verbalizer: openprompt.prompt_base.Verbalizer, max_length: Optional[int] = 20, target_number: Optional[int] = 2, beam_width: Optional[int] = 100, length_limit: Optional[List[int]] = None, forbidden_word_ids: Optional[List[int]] = [3, 19794, 22354], config: Optional[yacs.config.CfgNode] = None)[source]¶
Automatic template search using a T5 model. This class inherits from TemplateGenerator.
- get_part_token_id(part_id)[source]¶
  Get the start token id of the given part. It should be specified according to the specific model type. For a T5 model, for example, the start token for part_id=0 is <extra_id_0>, and this method should return the corresponding token_id.
  - Parameters
    - part_id (int) – The current part id (starting from 0).
  - Returns
    - The corresponding start token_id.
  - Return type
    - token_id (int)
RobertaVerbalizerGenerator¶
- class RobertaVerbalizerGenerator(model: transformers.models.roberta.modeling_roberta.RobertaForMaskedLM, tokenizer: transformers.models.roberta.tokenization_roberta.RobertaTokenizer, candidate_num: Optional[int] = 100, label_word_num_per_class: Optional[int] = 100)[source]¶
Automatic label word search using a RoBERTa model. This class inherits from VerbalizerGenerator.