Reproducing pretrained language models: CPT-1 & Restructure_pretrain

(1) CPT pretrain

CPT's parameters are not randomly initialized.
The encoder is initialized from the parameters of RoBERTa.
roberta_zh/: Place the checkpoint of Chinese RoBERTa, as the CPT initialize the encoder from the checkpoint.
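
To make the initialization note above concrete, here is a minimal sketch in plain Python. It is not CPT's loading code: the parameter names, the "encoder." prefix, and the toy checkpoint are all hypothetical, and lists of floats stand in for tensors. It only shows the idea that encoder parameters are copied from a checkpoint while everything else keeps its random values.

```python
import random

def init_from_checkpoint(model_params, checkpoint, prefix="encoder."):
    """Copy checkpoint values into every model parameter under `prefix`.

    Returns the names of the parameters that were loaded; all other
    parameters keep their existing (randomly initialized) values.
    """
    loaded = []
    for name in model_params:
        if not name.startswith(prefix):
            continue
        ckpt_name = name[len(prefix):]  # assumption: checkpoint keys carry no prefix
        same_size = (ckpt_name in checkpoint
                     and len(checkpoint[ckpt_name]) == len(model_params[name]))
        if same_size:
            model_params[name] = list(checkpoint[ckpt_name])
            loaded.append(name)
    return loaded

random.seed(0)
# Hypothetical toy model: every parameter starts out random.
model = {
    "encoder.layer0.weight": [random.random() for _ in range(4)],
    "encoder.layer1.weight": [random.random() for _ in range(4)],
    "decoder.layer0.weight": [random.random() for _ in range(4)],
}
# Hypothetical stand-in for the Chinese RoBERTa checkpoint.
roberta_ckpt = {
    "layer0.weight": [0.1, 0.2, 0.3, 0.4],
    "layer1.weight": [0.5, 0.6, 0.7, 0.8],
}
decoder_before = list(model["decoder.layer0.weight"])
loaded_names = init_from_checkpoint(model, roberta_ckpt)
```

After the call, the two encoder entries hold the checkpoint values and the decoder entry is unchanged.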

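CPT pretraining is based on Megatron-LM (there is an official introduction on GitHub); it corresponds to the part under the pretrain directory in the code.
Megatron-LM trains a model that is split into pieces across several GPUs. Model parallelism can place different layers of the model on different GPUs, or it can split the model's tensor computations across different GPUs.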
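
The two kinds of model parallelism can be compared on a toy two-layer linear model. This is only an illustration in plain Python, not Megatron-LM code: the "GPUs" are ordinary variables, the weights are made up, and the row-wise split is one assumed way of dividing a tensor computation. Both splits should reproduce the single-device result.

```python
def matvec(weight, x):
    """Multiply a matrix, stored as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def split_rows(weight, num_devices):
    """Split the output rows of a weight matrix into contiguous shards.

    Assumes the number of rows is divisible by num_devices.
    """
    shard = len(weight) // num_devices
    return [weight[i * shard:(i + 1) * shard] for i in range(num_devices)]

x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # layer 1: 2 -> 4
w2 = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, -1.0]]      # layer 2: 4 -> 2

# Reference: the whole model on one device.
reference = matvec(w2, matvec(w1, x))

# Layer-wise split: "GPU 0" holds layer 1, "GPU 1" holds layer 2.
hidden_on_gpu0 = matvec(w1, x)
layerwise = matvec(w2, hidden_on_gpu0)  # activation handed from GPU 0 to GPU 1

# Tensor split: each "GPU" holds half of layer 1's output rows, and the
# partial outputs are concatenated before layer 2 runs.
shards = split_rows(w1, num_devices=2)
partials = [matvec(shard, x) for shard in shards]
hidden_gathered = [v for part in partials for v in part]
tensorwise = matvec(w2, hidden_gathered)
```

In the layer-wise case the devices exchange activations between layers; in the tensor case they exchange partial results inside a single layer.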

Megatron-LM is a package: a toolkit for training PLMs.

The whole CPT process follows the Megatron-LM process (data preprocessing, pretraining, finetuning, downstream task evaluation).
The CPT introduction states that the process follows Megatron-LM.
The whole pretraining process of the CPT model: https://github.com/fastnlp/CPT/blob/master/pretrain/README.md

(2) restructure_pretrain PLM process

Data format: JSON (not plain text). Data tool: DataLab, used to load the data.
https://github.com/ExpressAI/DataLab/tree/main/datasets
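
As a small illustration of working with JSON samples rather than raw text, the sketch below parses one JSON object per line using only the standard library. It does not use DataLab; the one-object-per-line layout and the field names "text" and "label" are assumptions made for the example.

```python
import json

# Hypothetical content; layout and field names are illustrative only.
raw = (
    '{"text": "an example sentence", "label": "positive"}\n'
    '{"text": "another sentence", "label": "negative"}\n'
)

def parse_json_lines(content):
    """Parse one JSON object per non-empty line into a list of dicts."""
    return [json.loads(line) for line in content.splitlines() if line.strip()]

samples = parse_json_lines(raw)
texts = [sample["text"] for sample in samples]
```

Each sample arrives as a dict with named fields, which is what makes it possible to build different prompts from the same record.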

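The released models differ somewhat in the task types they are suited to; the model structures are released according to the task types they can handle.
Model details: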

The model has roughly 11 billion parameters (comparing against an existing PLM list shows where it sits; this parameter count is reasonable. https://openbmb.github.io/BMList/)

ques1: What data format is used for the PLM?
ques2: Is the template used for every task the same?

answer:

signal structure:

general signal:

The input would be “Thank you <X> me to your party <Y> week.” and the prompted
target would be “<X> for inviting <Y> last <Z>”.
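
The general signal above can be reproduced with a short sketch. The function name, the whitespace tokenization, and the hand-picked span positions are assumptions for illustration; the unmasked sentence is the one implied by the input and target quoted above.

```python
SENTINELS = ["<X>", "<Y>", "<Z>"]

def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    Returns (input_text, target_text); the target ends with one extra
    sentinel that closes the sequence.
    """
    if len(spans) + 1 > len(SENTINELS):
        raise ValueError("not enough sentinel tokens for this many spans")
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # reveal the span in the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(SENTINELS[len(spans)])
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week.".split()
source, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
```

Here `source` and `target` match the two quoted strings: the spans "for inviting" and "last" move from the input to the target.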

task-related signal:
• multiple-choice format
• generation format
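
To show how the two formats differ, here is a sketch that renders the same sample both ways. The templates, function names, and the sample are hypothetical and are not the prompts used by the released models; the point is only that the multiple-choice format lists the options and targets a choice, while the generation format targets the answer text itself.

```python
def to_multiple_choice(question, options, answer_index):
    """Render a prompt listing lettered options; the target is the letter."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"({letters[i]}) {option}" for i, option in enumerate(options)]
    return "\n".join(lines), letters[answer_index]

def to_generation(question, options, answer_index):
    """Render the same sample as free-form generation; the target is the text."""
    return question, options[answer_index]

# Hypothetical sample used for both formats.
question = "What is the sentiment of: 'I loved this film'?"
options = ["positive", "negative"]
mc_prompt, mc_target = to_multiple_choice(question, options, 0)
gen_prompt, gen_target = to_generation(question, options, 0)
```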

Do the present well, and let the rest take its course.