Templates

Overview

The template is one of the most important modules in prompt-learning: it wraps the original input with a textual or soft-encoding sequence.

We implement common template classes in OpenPrompt.

Manual Template

The basic manually defined textual template.

class ManualTemplate(tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, text: Optional[str] = None, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})[source]
Parameters
  • tokenizer (PreTrainedTokenizer) – A tokenizer to specify the vocabulary and the tokenization strategy.

  • text (Optional[str], optional) – The manual template text. Defaults to None.

  • placeholder_mapping (dict) – A placeholder to represent the original input text. Defaults to {'<text_a>': 'text_a', '<text_b>': 'text_b'}

on_text_set()[source]

A hook called when the template text is set:

  1. parse the text
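
A minimal usage sketch (the checkpoint name, template text, and example sentence are illustrative; load_plm and InputExample come from openprompt.plms and openprompt.data_utils):

from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate
from openprompt.data_utils import InputExample

# Load a PLM together with its tokenizer (model choice is an illustrative assumption).
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# {"placeholder": "text_a"} marks where the original input goes,
# {"mask"} marks the position the PLM should predict.
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder": "text_a"} It was {"mask"}.',
)

example = InputExample(guid=0, text_a="The movie kept me on the edge of my seat.")
wrapped_example = template.wrap_one_example(example)  # see wrap_one_example below for the output format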

Prefix Template

The template of prefix-tuning from Prefix-Tuning: Optimizing Continuous Prompts for Generation.

class PrefixTuningTemplate(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, mapping_hook: Optional[torch.nn.modules.module.Module] = None, text: Optional[str] = None, num_token: Optional[int] = 5, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'}, prefix_dropout: Optional[float] = 0.0, mid_dim: Optional[int] = 512, using_encoder_past_key_values: Optional[bool] = True, using_decoder_past_key_values: Optional[bool] = True)[source]

This implementation supports T5 and other encoder-decoder models, as long as their blocks allow past_key_values to be injected into the model. It modifies Hugging Face’s T5 forward pass without touching the code base. However, it may fail when wrapped in a DataParallel model; please use it with a single GPU or with model-parallel training.

Parameters
  • model (PreTrainedModel) – The pre-trained model.

  • plm_config (PretrainedConfig) – The configuration of the current pre-trained model.

  • tokenizer (PreTrainedTokenizer) – The tokenizer of the current pre-trained model.

  • mapping_hook (nn.Module, optional) –

  • text (str, optional) – The template text. Defaults to None.

  • num_token (int, optional) – The number of trainable prefix tokens. Defaults to 5.

  • placeholder_mapping (dict) – A placeholder to represent the original input text. Defaults to {'<text_a>': 'text_a', '<text_b>': 'text_b'}

  • prefix_dropout (float, optional) – The dropout rate for the prefix sequence.

on_text_set()[source]

A hook called when the template text is set. The designer of the template should know explicitly what should be done when the template text is set.

generate_parameters() None[source]

Generate the parameters needed for the new tokens’ embeddings in prefix-tuning.

wrap_one_example(example) List[Dict][source]

Given an input example that contains the input text, which can be referenced by the values of self.template.placeholder_mapping, this function processes the example into a list of dicts. Each dict functions as a group and carries properties of the sample, such as whether it is shortenable, whether it is the masked position, whether it is a soft token, etc. Since the text will be tokenized in the subsequent processing procedure, these attributes are broadcast along the tokenized sentence.

Parameters

example (InputExample) – An InputExample object whose attributes can be filled into the template.

Returns

A list of dicts of the same length as self.text, e.g. [{"loss_ids": 0, "text": "It was"}, {"loss_ids": 1, "text": "<mask>"}]

Return type

List[Dict]

process_batch(batch: Union[Dict, openprompt.data_utils.utils.InputFeatures]) Union[Dict, openprompt.data_utils.utils.InputFeatures][source]

Convert input_ids to inputs_embeds. For normal tokens, use the embedding layer inside the PLM; for the new tokens, use an MLP or LSTM head.
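
A minimal instantiation sketch for an encoder-decoder PLM (the T5 checkpoint, template text, and hyperparameter values are illustrative assumptions):

from openprompt.plms import load_plm
from openprompt.prompts import PrefixTuningTemplate

# Load an encoder-decoder PLM whose blocks accept injected past_key_values.
plm, tokenizer, model_config, WrapperClass = load_plm("t5", "t5-base")

# num_token trainable prefix tokens are prepended; the textual part of the
# template stays fixed.
template = PrefixTuningTemplate(
    model=plm,
    tokenizer=tokenizer,
    text='{"placeholder": "text_a"} {"mask"}',
    num_token=5,
    prefix_dropout=0.0,
)

# The template's own (prefix) parameters can be handed to the optimizer
# separately from the frozen PLM.
prefix_params = [p for p in template.parameters() if p.requires_grad]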

Ptuning Template

The template of P-tuning from GPT Understands, Too.

class PtuningTemplate(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, text: Optional[List[str]] = None, prompt_encoder_type: str = 'lstm', placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})[source]
Parameters
  • model (PreTrainedModel) – The pre-trained language model for the current prompt-learning task.

  • tokenizer (PreTrainedTokenizer) – A tokenizer to specify the vocabulary and the tokenization strategy.

  • prompt_encoder_type (str) – The head above the embedding layer of the new tokens. Can be lstm or mlp.

  • text (Optional[List[str]], optional) – manual template format. Defaults to None.

  • placeholder_mapping (dict) – A placeholder to represent the original input text. Defaults to {'<text_a>': 'text_a', '<text_b>': 'text_b'}

on_text_set()[source]

When the template text is set, generate the parameters needed in the P-tuning input embedding phase.

generate_parameters() None[source]

Generate the parameters needed for the new tokens’ embeddings in P-tuning.

process_batch(batch: Union[Dict, openprompt.data_utils.utils.InputFeatures]) Union[Dict, openprompt.data_utils.utils.InputFeatures][source]

Convert input_ids to inputs_embeds. For normal tokens, use the embedding layer of the PLM; for new tokens, use a brand-new embedding layer with an MLP or LSTM head.
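
A minimal sketch with an LSTM prompt encoder (the checkpoint name and template text are illustrative; the string-based soft-token syntax is assumed, adjust it if your version expects a token list):

from openprompt.plms import load_plm
from openprompt.prompts import PtuningTemplate

plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# Each {"soft"} inserts a new trainable token whose embedding is produced by
# the LSTM (or MLP) head rather than taken from the PLM vocabulary.
template = PtuningTemplate(
    model=plm,
    tokenizer=tokenizer,
    text='{"placeholder": "text_a"} {"soft"} {"soft"} {"soft"} {"mask"}',
    prompt_encoder_type="lstm",
)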

PTR Template

The template of PTR from PTR: Prompt Tuning with Rules for Text Classification.

class PTRTemplate(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, text: Optional[str] = None, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})[source]
Parameters
  • model (PreTrainedModel) – The pre-trained language model for the current prompt-learning task.

  • tokenizer (PreTrainedTokenizer) – A tokenizer to specify the vocabulary and the tokenization strategy.

  • text (Optional[str], optional) – The manual template text. Defaults to None.

  • soft_token (str, optional) – The special token for soft tokens. Defaults to <soft>

  • placeholder_mapping (dict) – A placeholder to represent the original input text. Defaults to {'<text_a>': 'text_a', '<text_b>': 'text_b'}
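
A minimal instantiation sketch (the template text is an illustrative assumption, not the rule set from the paper):

from openprompt.plms import load_plm
from openprompt.prompts import PTRTemplate

plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# A PTR-style template mixing fixed text, soft tokens, and a mask position.
template = PTRTemplate(
    model=plm,
    tokenizer=tokenizer,
    text='{"placeholder": "text_a"} {"soft"} {"mask"} {"soft"} {"placeholder": "text_b"}',
)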

Mixed Template

Our newly introduced mixed template class, which lets you flexibly define your templates.

class MixedTemplate(model: transformers.modeling_utils.PreTrainedModel, tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, text: Optional[str] = None, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})[source]

The Mixed Template class is defined by a string of text. See more examples in the tutorial.

Parameters
  • model (PreTrainedModel) – The pre-trained language model for the current prompt-learning task.

  • tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.

  • text (Optional[str], optional) – The template text. Defaults to None.

get_default_soft_token_ids() List[int][source]

This function identifies which tokens are soft tokens.

Sometimes the tokens in the template are not from the vocabulary but are a sequence of soft tokens; in this case, you need to implement this function.

Raises

NotImplementedError – If needed, add soft_token_ids to the registered_inputflag_names attribute of the Template class and implement this method.

prepare()[source]

Get the soft token indices (soft_token_ids) for the template.

"soft_id" can be used to reference a previous soft token, which means those tokens share the same embedding. Note that "soft_id" indices should start from 1, not 0.

e.g. when self.text is '{"soft": None} {"soft": "the", "soft_id": 1} {"soft": None} {"soft": "it", "soft_id": 3} {"soft_id": 1} {"soft": "was"} {"mask"}', the output is [1, 2, 3, 4, 2, 5, 0]
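
A sketch mirroring the soft_id example above (the checkpoint name and template text are illustrative):

from openprompt.plms import load_plm
from openprompt.prompts import MixedTemplate

plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# The two tokens with "soft_id": 1 share one trainable embedding, initialized
# from the hard token "the"; each unnamed {"soft": None} token gets its own.
template = MixedTemplate(
    model=plm,
    tokenizer=tokenizer,
    text='{"soft": None} {"soft": "the", "soft_id": 1} {"soft": None} {"soft_id": 1} {"mask"} {"placeholder": "text_a"}',
)
print(template.soft_token_ids)  # expected to be [1, 2, 3, 2, 0, 0]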

on_text_set()[source]

A hook called when the template text is set:

  1. parse the text

  2. generate the parameters needed

process_batch(batch: Union[Dict, openprompt.data_utils.utils.InputFeatures]) Union[Dict, openprompt.data_utils.utils.InputFeatures][source]

Convert input_ids to inputs_embeds. For normal tokens, use the embedding layer of the PLM; for soft tokens, use a new embedding layer initialized with the embeddings of their corresponding hard tokens.