Saturday, August 6, 2022

Train a sinusoidal one

 $ fairseq-hydra-train -m --config-dir fairseq/examples/roberta/config/pretraining --config-name sinusoidal task.data=/home/jason-chou/Downloads/$DATA_DIR
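Once a checkpoint comes out, a quick sanity check would be to confirm it really has no learned position-embedding weight. A minimal sketch, with the checkpoint path as a placeholder:

import torch

# Placeholder checkpoint path from the hydra run.

state = torch.load('checkpoints/checkpoint_last.pt', map_location='cpu')

# Expect no '...embed_positions.weight' key here: fairseq's sinusoidal position table is not a trainable parameter, unlike roberta.base's learned positions.

print([k for k in state['model'] if 'embed_positions' in k])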

Thursday, August 4, 2022

RoBERTa uses learned position embeddings!

 args.encoder_learned_pos = safe_getattr(args, "encoder_learned_pos", True)

So... I need to find a variant that uses sinusoidal position embeddings to test longer contexts.
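One way to get such a variant, I think, is to register a RoBERTa architecture that defaults encoder_learned_pos to False before base_architecture fills in the rest. A rough, untested sketch (the architecture name is mine):

from fairseq.models import register_model_architecture

from fairseq.models.roberta.model import base_architecture

from fairseq.utils import safe_getattr


@register_model_architecture("roberta", "roberta_base_sinusoidal")

def roberta_base_sinusoidal(args):

  # Default to sinusoidal positions so the encoder builds SinusoidalPositionalEmbedding instead of a learned table.

  args.encoder_learned_pos = safe_getattr(args, "encoder_learned_pos", False)

  base_architecture(args)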

Is "Transformer Language Models without Positional Encodings Still Learn Positional Information" the only preprint/paper mentioning MLM ALiBi so far?

Wednesday, August 3, 2022

PyTorch checkpoint state dict

Sadly, RobertaModel doesn't give you the full picture of the training arguments:

>>> from fairseq.models.roberta import RobertaModel

>>> roberta = RobertaModel.from_pretrained('./roberta.base/')

One has to load the checkpoint directly:

>>> import torch

>>> path = './roberta.base/model.pt'

>>> with open(path, 'rb') as f:

>>>   state = torch.load(f, map_location=torch.device("cpu"))

>>> state['args']

Namespace(no_progress_bar=False, log_interval=25, log_format='json', tbmf_wrapper=False, seed=1, cpu=False, fp16=True, memory_efficient_fp16=True, fp16_init_scale=4, fp16_scale_window=128, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=1.0, user_dir=None, criterion='masked_lm', tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='polynomial_decay', task='masked_lm', num_workers=2, skip_invalid_size_inputs_valid_test=True, max_tokens=999999, max_sentences=16, required_batch_size_multiple=1, dataset_impl='mmap', train_subset='train', valid_subset='valid', validate_interval=1, disable_validation=False, only_validate=False, max_sentences_valid=16, curriculum=0, distributed_world_size=512, distributed_rank=0, distributed_backend='nccl', distributed_port=19812, device_id=0, distributed_no_spawn=False, ddp_backend='c10d', bucket_cap_mb=200, fix_batches_to_gpus=False, find_unused_parameters=True, arch='roberta_base', max_epoch=0, max_update=500000, clip_norm=0.0, sentence_avg=False, update_freq=[1], lr=[0.0006], min_lr=-1, use_bmuf=False, global_sync_iter=10, restore_file='checkpoint_last.pt', reset_dataloader=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, optimizer_overrides='{}', save_interval=1, save_interval_updates=2000, keep_interval_updates=-1, keep_last_epochs=-1, no_save=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_save_optimizer_state=False, best_checkpoint_metric='loss', maximize_best_checkpoint_metric=False, adam_betas='(0.9, 0.98)', adam_eps=1e-06, weight_decay=0.01, force_anneal=None, warmup_updates=24000, end_learning_rate=0.0, power=1.0, total_num_update=500000, sample_break_mode='complete', tokens_per_sample=512, mask_prob=0.15, leave_unmasked_prob=0.1, random_token_prob=0.1, activation_fn='gelu', dropout=0.1, attention_dropout=0.1, encoder_embed_dim=768, encoder_layers=12, encoder_attention_heads=12, encoder_ffn_embed_dim=3072, pooler_activation_fn='tanh', max_positions=512, activation_dropout=0.0)
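Relatedly, the model weights live under state['model'], so whether the checkpoint carries a learned position-embedding table can be checked directly (a quick sketch; I haven't double-checked the exact key names):

>>> [k for k in state['model'] if 'embed_positions' in k]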

Revisit roberta-large embeddings

def topk_similar_tokens(roberta, index, k, normalize=False, beta=100.):

  # Print the k tokens whose embeddings are most similar to that of the token at `index`, as softmax probabilities with inverse temperature `beta`.

  embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

  if normalize:

    embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)

  prob = (beta * embed_tokens[index] @ embed_tokens.T).softmax(dim=-1)

  values, indices = prob.topk(k)

  # Print the result

  print("\nTop predictions:\n")

  for value, index in zip(values, indices):

    print(f"{roberta.decode(index.unsqueeze(0)) if index.item() != roberta.task.source_dictionary.pad() else '<pad>'}: {100 * value.item():.2f}%")

>>> topk_similar_tokens(roberta, roberta.task.mask_idx, 10, normalize=True, beta=10.)


Top predictions:


<mask>: 0.54%

: 0.02%

 the: 0.01%

 and: 0.01%

,: 0.01%

 to: 0.01%

.: 0.01%

 that: 0.01%

 in: 0.01%

 GG: 0.01%

>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10, normalize=True, beta=10.)


Top predictions:


: 2.59%

<mask>: 0.02%

: 0.01%

.: 0.01%

 the: 0.01%

,: 0.01%

 a: 0.01%

!: 0.01%

。: 0.01%

?: 0.01%

>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True, beta=10.)


Top predictions:


<pad>: 74.20%

: 0.02%

channelAvailability: 0.02%

PsyNetMessage: 0.01%

 guiIcon: 0.01%

NetMessage: 0.01%

: 0.01%

?????-?????-: 0.01%

 0.01%

EStreamFrame: 0.01%

>>> topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10, normalize=True, beta=10.)


Top predictions:


: 0.79%

.: 0.03%

<mask>: 0.03%

,: 0.02%

 (: 0.02%

 ": 0.02%

 and: 0.02%

 the: 0.02%

 The: 0.02%

-: 0.01%


Nothing shocking here. Other than for <pad>, though, beta has to get quite large (i.e., the softmax quite cold) before the aligned embedding stands out.
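To make the temperature point concrete, here is a small variation on the helper above that just returns the probability the softmax assigns to the query token itself (a sketch reusing the same setup; no outputs reproduced here):

def self_similarity_prob(roberta, index, beta, normalize=True):

  # Probability mass the beta-scaled similarity softmax puts on the query token itself.

  embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

  if normalize:

    embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)

  prob = (beta * embed_tokens[index] @ embed_tokens.T).softmax(dim=-1)

  return prob[index].item()


# e.g. [self_similarity_prob(roberta, roberta.task.mask_idx, beta) for beta in (10., 30., 100.)]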

Tuesday, August 2, 2022

Validate roberta.base with wikitext-103

$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 512

2022-08-02 14:16:30 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 14:16:30 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 14:43:30 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580


$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 1024


2022-08-02 15:04:04 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:04:04 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:31:16 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580


$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 256


2022-08-02 15:33:29 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:33:29 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

Traceback (most recent call last):

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 153, in <module>

    cli_main()

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 147, in cli_main

    distributed_utils.call_main(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/distributed/utils.py", line 369, in call_main

    main(cfg, **kwargs)

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq_cli/validate.py", line 93, in main

    itr = task.get_batch_iterator(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/tasks/fairseq_task.py", line 295, in get_batch_iterator

    batch_sampler = dataset.batch_by_size(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/base_wrapper_dataset.py", line 61, in batch_by_size

    return self.dataset.batch_by_size(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/fairseq_dataset.py", line 145, in batch_by_size

    return data_utils.batch_by_size(

  File "/Users/jason_chou/Documents/GitHub/fairseq/fairseq/data/data_utils.py", line 340, in batch_by_size

    return batch_by_size_fn(

  File "fairseq/data/data_utils_fast.pyx", line 108, in fairseq.data.data_utils_fast.batch_by_size_fn

    cpdef list batch_by_size_fn(

  File "fairseq/data/data_utils_fast.pyx", line 123, in fairseq.data.data_utils_fast.batch_by_size_fn

    return batch_by_size_vec(indices, num_tokens_vec, max_tokens,

  File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec

    assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (

AssertionError: Sentences lengths should not exceed max_tokens=256


The first two runs are probably equivalent: --max-tokens only caps the number of tokens per batch, not the context length. I need to figure out how to actually change the context length (see the sketch at the end of this entry). The results are exactly the same with --max-tokens 2048 and 4096 as well:

$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 2048


2022-08-02 15:48:54 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 15:48:55 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 16:16:02 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580  


$ python3 fairseq_cli/validate.py examples/language_model/data-bin/wikitext-103 --path ~/Downloads/roberta.base/model.pt --task masked_lm --max-tokens 4096


2022-08-02 16:18:01 | INFO | fairseq.data.data_utils | loaded 3,760 examples from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 16:18:01 | INFO | fairseq.tasks.masked_lm | loaded 580 blocks from: examples/language_model/data-bin/wikitext-103/valid

2022-08-02 16:45:08 | INFO | valid | valid on 'valid' subset | loss 2.055 | ppl 4.16 | wps 0 | wpb 246983 | bsz 580  
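If the block length is really governed by the task's tokens_per_sample (512 in the checkpoint args above) rather than by --max-tokens, then an override along the following lines might be the way to change it. Just a guess, untested; the only thing I know exists here is checkpoint_utils.load_model_ensemble_and_task:

from fairseq import checkpoint_utils

# Untested guess: override the stored task args so that validation blocks get re-cut at a different length.

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(

  ['./roberta.base/model.pt'],

  arg_overrides={'tokens_per_sample': 256},

)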

Monday, August 1, 2022

What tokens are most similar to <mask>, <s>, <pad>, and </s> in roberta-large?

>>> import torch

>>> roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')

>>> def topk_similar_tokens(roberta, index, k, normalize=False):

...   embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

...   if normalize:

...     embed_tokens = embed_tokens / embed_tokens.norm(dim=1, keepdim=True)

...   _, indices = (embed_tokens[index] @ embed_tokens.T).topk(k)

...   return indices

... 

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.mask_idx, 10))

' TM JC CSI Zeus Karma CG BG GG MM Harmony'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10))

'.?*.,. *.!.,,.*+.-.'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10))

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 75, in decode

    sentences = [

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 76, in <listcomp>

    self.bpe.decode(self.task.source_dictionary.string(s)) for s in sentences

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in decode

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in <listcomp>

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

ValueError: invalid literal for int() with base 10: '<pad>'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10)[1:])

'channelAvailability\x05EngineDebug<|endoftext|>PsyNetMessage 裏覚醒'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10))

' \u200b,, .... TM ..….. \u200e…… MM'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.mask_idx, 10, normalize=True))

'<mask> the and, to. that in GG'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.bos(), 10, normalize=True))

'<mask>. the, a!。?'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True))

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 75, in decode

    sentences = [

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/models/roberta/hub_interface.py", line 76, in <listcomp>

    self.bpe.decode(self.task.source_dictionary.string(s)) for s in sentences

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in decode

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

  File "/Users/jason_chou/.cache/torch/hub/pytorch_fairseq_main/fairseq/data/encoders/gpt2_bpe.py", line 41, in <listcomp>

    [int(tok) if tok not in {"<unk>", "<mask>"} else tok for tok in x.split()]

ValueError: invalid literal for int() with base 10: '<pad>'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.pad(), 10, normalize=True)[1:])

'channelAvailabilityPsyNetMessage guiIconNetMessage\x05?????-?????-\x1bEStreamFrame'

>>> roberta.decode(topk_similar_tokens(roberta, roberta.task.source_dictionary.eos(), 10, normalize=True))

'.<mask>, ( " and the The-'


Conclusion: in terms of their embeddings, they are most similar to meaningless tokens. (The unnormalized inner product is informative here since untie_weights_roberta=False, i.e., the embedding table is tied to the LM head's output projection.)
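As a quick check on that parenthetical, the tie can be verified directly (a sketch; 'model.encoder.lm_head.weight' is my guess at the LM head's parameter path):

>>> embed_tokens = roberta.get_parameter('model.encoder.sentence_encoder.embed_tokens.weight')

>>> lm_head_weight = roberta.get_parameter('model.encoder.lm_head.weight')

>>> embed_tokens.data_ptr() == lm_head_weight.data_ptr()  # expect True when tied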