Large Language Model
● Introduction to LM
● Large Language Models and applications
Language Modeling?
● What is the probability of “Tôi trình bày ChatGPT tại Trường ĐH Công Nghệ” (“I present ChatGPT at the University of Engineering and Technology”)?
● What is the probability of “Công Nghệ học Đại trình bày ChatGPT tại Tôi” (the same words in scrambled order)?
● What is P(… | “Tôi trình bày ChatGPT tại Trường ĐH Công nghệ, địa điểm”), i.e., the probability of the next word given the preceding context?
● A model that computes either of these, for W = w1, w2, …, wn:
○ P(W) or P(wn | w1, w2, …, wn-1)
is called a language model (a toy sketch follows below)
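A toy illustration of this definition, assuming nothing beyond the chain rule: a bigram model approximates P(wn | w1, …, wn-1) by P(wn | wn-1), estimated from counts. The corpus and sentence below are made up for illustration.

```python
from collections import Counter

# Minimal bigram language model sketch (illustration only):
# estimate P(W) = product over i of P(w_i | w_{i-1}) from raw counts.
corpus = "tôi trình bày ChatGPT tại trường đại học công nghệ".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_next(prev: str, word: str) -> float:
    # Maximum-likelihood estimate of P(word | prev); no smoothing.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = ["tôi", "trình", "bày", "ChatGPT"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= p_next(prev, word)
print(p)  # P(sentence) under the toy bigram model
```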
Large Language Model
Large Language Model (Hundreds of Billions of Tokens)
Large Language Models - yottaFLOPs of Compute
Source: https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Why LLMs?
● Double Descent
○ Beyond the classical overfitting regime, test error can decrease again as model size and data keep growing
Why LLMs?
● Generalization
○ A single model can now be used to solve many NLP tasks
Why LLMs? Emergence in few-shot prompting
Emergent Abilities
● An ability of an LM is emergent if it is not present in smaller models but is present in larger models
Emergent Capability - In-Context Learning
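A minimal illustration of in-context learning: input-output demonstrations are placed in the prompt, and the model completes the pattern with no weight updates. The prompt below is a sketch in the style of the GPT-3 paper's translation examples; the exact format is an assumption, not taken from these slides.

```python
# Few-shot in-context learning: demonstrations live in the prompt only;
# no gradient updates happen. Pairs and format are illustrative.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
# Sent to a large LM, the expected completion is "fromage".
```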
What is pre-training / fine-tuning?
● Pre-training: learn general language knowledge from large unlabeled text with a self-supervised objective
● Fine-tuning: continue training on a smaller labeled dataset for a downstream task (a sketch follows below)
Pretraining + Prompting Paradigm
Prompt Engineering (2020–now)
● Prompts involve instructions and context passed to a language model to achieve a desired task
● Prompt engineering is the practice of developing and optimizing prompts to efficiently use language models (LMs) for a variety of applications
Prompt Engineering Techniques
● Many advanced prompting techniques have been designed to improve performance on complex tasks
○ Few-shot prompts
○ Chain-of-thought (CoT) prompting (see the sketch after this list)
○ Self-Consistency
○ Knowledge Generation Prompting
○ ReAct
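A minimal sketch of a chain-of-thought prompt, using the well-known worked example from Wei et al. (2022); the Python wrapper around it is just for illustration.

```python
# Chain-of-thought prompting: one demonstration includes intermediate
# reasoning steps, and the model imitates the step-by-step format.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""
# Expected completion: reasoning steps ending in "The answer is 9."
```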
Temperature and Top-p Sampling in LLMs
● Temperature and top-p sampling are two essential parameters that can be tweaked to control the output of LLMs (a sampling sketch follows below)
● Temperature (0-2): determines the creativity and diversity of the generated text. A higher value (e.g., 1.5) leads to more diverse and creative text, while a lower value (e.g., 0.5) results in more focused and deterministic text.
● Top-p sampling (0-1): balances diversity and high-probability words by sampling only from the smallest set of most probable tokens whose cumulative probability mass reaches the threshold p.
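A minimal sketch of how these two parameters act on the model's output distribution, written from the standard definitions; the function and toy logits are my illustration, not any particular library's API.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 1.0, top_p: float = 0.9) -> int:
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]

    kept_probs = probs[kept] / probs[kept].sum()
    return int(np.random.choice(kept, p=kept_probs))

# Toy vocabulary of 5 tokens.
print(sample_token(np.array([2.0, 1.0, 0.5, 0.1, -1.0]), temperature=0.7, top_p=0.9))
```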
Three major forms of pre-training (LLMs)
● Encoder-only (e.g., BERT), decoder-only (e.g., GPT), and encoder-decoder / text-to-text (e.g., T5)
BERT: Bidirectional Encoder Representations from Transformers
Source: (Devlin et al., 2019): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Masked Language Modeling (MLM)
● Solution: Mask out k% of the input words, and then predict the masked words
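A quick way to see MLM in action, assuming the Hugging Face transformers package is installed; the model choice and sentence are illustrative.

```python
from transformers import pipeline

# Fill-mask: BERT predicts the token hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```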
Next Sentence Prediction (NSP)
● Given a pair [CLS] sentence A [SEP] sentence B [SEP], predict whether B actually follows A in the corpus (IsNext) or is a random sentence (NotNext)
BERT pre-training
RoBERTa
● BERT is still under-trained
● Removed the next sentence prediction pre-training objective: it adds more noise than benefit!
● Trained longer with 10x the data and bigger batch sizes
● Pre-trained on 1,024 V100 GPUs for one day in 2019
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Text-to-text models: the best of both worlds (Bard)?
● Encoder-only models (e.g., BERT) enjoy the benefits of bidirectionality, but they can't be used to generate text
● Decoder-only models (e.g., GPT-3, Llama 2) can do generation, but they are left-to-right LMs
● Text-to-text models combine the best of both worlds! (see the sketch below)
Source: (Raffel et al., 2020): Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
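A minimal sketch of the text-to-text interface, assuming the Hugging Face transformers and sentencepiece packages; the checkpoint and prompt are illustrative choices.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Text-to-text: every task, including translation, is cast as
# "text in, text out" through the same encoder-decoder model.
tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tok("translate English to German: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```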
How to use these pre-trained models?
From GPT to GPT-2 to GPT-3
Quiz
● Context size?
● The larger the context size, the more difficult it is to model?
GPT-3: Language Models are Few-Shot Learners
GPT-3’s in-context learning
[2020] GPT-3 to [2022] ChatGPT
What's new?
● Training on code
● Supervised instruction tuning
● RLHF: reinforcement learning from human feedback (a sketch of the reward-model loss follows below)
Source: Fu, 2022, “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources"
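A minimal sketch of the pairwise ranking loss used to train InstructGPT-style reward models, loss = -log sigmoid(r_chosen - r_rejected); the code is my illustration, not OpenAI's implementation.

```python
import numpy as np

# Reward-model ranking loss: push the score of the human-preferred
# answer above the score of the rejected one.
def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Toy scores from a hypothetical reward model:
print(reward_ranking_loss(2.0, 0.5))  # small loss: preference satisfied
print(reward_ranking_loss(0.5, 2.0))  # large loss: preference violated
```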
How was ChatGPT developed?
Evaluation of LLMs
The newest LLMs
ChatGPT application for reading comprehension (ChatPDF)
Risks of Large Language Models
Summary
● Introduction to LLMs
● Large Language Models (types)
Thank you
Email me
[email protected]