Large Language Model


UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Natural Language Processing - INT3406E 20

Large Language Model

Nguyen Van Vinh - UET


Outline

● Introduction to LM
● Large Language Models and applications

Language Modeling?

● What is the probability of “I present ChatGPT at the University of Engineering and Technology”?
● What is the probability of the scrambled word order “Technology study University present ChatGPT at I”?
● What comes next after “I present ChatGPT at the University of Engineering and Technology, location …”, that is, what is
P(… | I present ChatGPT at the University of Engineering and Technology, location)?
● A model that computes either of these is called a language model:
W = w1, w2, w3, w4, w5, …, wn
P(W) or P(wn | w1, w2, …, wn-1)
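To make P(W) concrete, here is a minimal sketch of a bigram language model. It approximates P(wn | w1, …, wn-1) by P(wn | wn-1) and multiplies the factors together (chain rule). The toy corpus, the whitespace tokenizer, the `<s>`/`</s>` boundary markers, and the add-alpha smoothing are all assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that acts as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (previous word, current word) pairs
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """P(W) as a product of smoothed bigram probabilities P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return p

# Hypothetical toy corpus (English stand-in for the slide's example sentence).
corpus = [
    "i present chatgpt at the university",
    "i present a talk at the university",
]
unigrams, bigrams = train_bigram(corpus)
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

fluent = sentence_prob("i present chatgpt at the university", unigrams, bigrams, len(vocab))
scrambled = sentence_prob("university the at chatgpt present i", unigrams, bigrams, len(vocab))
print(fluent > scrambled)
```

The natural word order gets a higher probability than the scrambled one because all of its bigrams were seen in training, while the scrambled sentence relies only on smoothing mass.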

Large Language Model

Large Language Model (Hundreds of Billions of Tokens)

Large Language Models - yottaFlops of Compute

Source: https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf

Why LLMs?

● Double Descent

Why LLMs?

● Scaling Law for Neural Language Models
○ Performance depends strongly on scale! We keep getting better performance as
we scale the model, data, and compute up!
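One common way to write such a scaling law is a power law in the parameter count N, L(N) = (N_c / N)^alpha. The sketch below only illustrates the shape of that curve. The functional form and the constants `N_C` and `ALPHA` are assumptions here, chosen to be of the order reported in the scaling-law literature; they are not values from these slides and should not be read as fitted results.

```python
# Assumed, illustrative constants for a power law L(N) = (N_C / N) ** ALPHA.
N_C = 8.8e13    # hypothetical scale constant (in parameters)
ALPHA = 0.076   # hypothetical exponent

def predicted_loss(n_params):
    """Loss predicted by the assumed power law for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

# Model sizes taken from the GPT-2 -> GPT-3 comparison later in the deck.
small = predicted_loss(1.5e9)
large = predicted_loss(175e9)
print(small > large)  # the larger model has the lower predicted loss
```

Because the exponent is small, each constant-factor increase in N buys a roughly constant multiplicative drop in loss, which is why the curve looks like a straight line on a log-log plot.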

Why LLMs?

● Generalization
○ We can now use one single model to solve many NLP tasks

Why LLMs? Emergence in few-shot prompting
Emergent Abilities
● An ability of an LM is emergent if it is not present in smaller models but is present in larger models

Emergent Capability - In-Context Learning

What is pre-training / fine-tuning?

● “Pre-train” a model on a large dataset for task X, then “fine-tune” it on a
dataset for task Y
● Key idea: X is somewhat related to Y, so a model that can do X will have
some good neural representations for Y as well (transfer learning)
● ImageNet pre-training is huge in computer vision: learning generic visual
features for recognizing objects

Can we find some task X that can be useful for a wide range of downstream tasks Y?

Pretraining + Prompting Paradigm

Prompt Engineering (2020 → now)
● Prompts involve instructions and context passed to a language model to
achieve a desired task

● Prompt engineering is the practice of developing and optimizing prompts to
efficiently use language models (LMs) for a variety of applications

Prompt Engineering Techniques
● Many advanced prompting techniques have been designed to
improve performance on complex tasks:
○ Few-shot prompts
○ Chain-of-thought (CoT) prompting
○ Self-Consistency
○ Generated Knowledge Prompting
○ ReAct
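Two of the techniques listed above can be sketched without calling any model: assembling a few-shot prompt, and the majority vote at the heart of self-consistency. The function names, the `Input:`/`Output:` template, and the sample answers below are hypothetical choices for illustration; real systems use many different prompt formats.

```python
from collections import Counter

def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

def self_consistency_vote(answers):
    """Return the most frequent final answer among several sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved this movie", "positive"), ("The food was terrible", "negative")],
    "The lecture was great",
)
print(prompt)

# Hypothetical final answers extracted from three sampled chain-of-thought completions.
sampled_answers = ["18", "18", "26"]
final_answer = self_consistency_vote(sampled_answers)
```

In self-consistency the model is sampled several times with a non-zero temperature, and only the final answers are compared; the reasoning text itself is discarded before voting.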

Temperature and Top-p Sampling in LLMs

● Temperature and Top-p sampling are two essential parameters that can be
tweaked to control the output of LLMs
● Temperature (0-2): This parameter controls the randomness of the generated text by
rescaling the model's logits before the softmax. A higher value (e.g., 1.5) leads to more
diverse and creative text, while a lower value (e.g., 0.5) results in more focused and
more nearly deterministic text.
● Top-p Sampling (0-1): This parameter balances diversity against high-probability
words by sampling only from the smallest set of most probable tokens whose
cumulative probability mass is greater than or equal to a threshold p.
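The two parameters can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular LLM API implements sampling: the function names and the four-token logit vector are hypothetical, and a temperature of exactly 0 (usually treated as greedy decoding) is not handled here.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most probable tokens with mass >= p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id after temperature scaling and top-p filtering."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature: more peaked
flat = softmax_with_temperature(logits, 1.5)   # higher temperature: flatter
nucleus = top_p_filter(softmax_with_temperature(logits), p=0.8)
token = sample_token(logits, temperature=0.7, top_p=0.8, rng=random.Random(0))
```

With these logits, lowering the temperature moves probability toward the top token, and p = 0.8 restricts sampling to the two most probable tokens.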

Three major forms of pre-training (LLMs)

BERT: Bidirectional Encoder Representations from Transformers

Source: (Devlin et al., 2019): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Masked Language Modeling (MLM)

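● Q: Why can’t we do language modeling with bidirectional models?
○ A: With context from both directions, each word can indirectly “see itself”, so predicting it becomes trivial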

● Solution: Mask out k% of the input words, and then predict the masked words
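The masking step can be sketched as follows. This is a simplified illustration: the default rate of 15% is the value used in the BERT paper (the slide leaves k unspecified), the `[MASK]` string and the function name are assumptions, and BERT's additional rule of sometimes keeping or randomly replacing the chosen token is omitted.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns the masked sequence and a dict {position: original token},
    which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))  # always mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        masked[i] = mask_token
    return masked, targets

tokens = "the model predicts the hidden words from both left and right context".split()
masked, targets = mask_tokens(tokens, mask_rate=0.25)
print(masked)
print(targets)
```

Because the loss is computed only at the masked positions, the model can attend to both left and right context without ever being shown the word it has to predict.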

Next Sentence Prediction (NSP)

BERT pre-training

RoBERTa
● BERT is still under-trained
● Removed the next sentence prediction (NSP) pre-training objective: it adds more
noise than benefit!
● Trained longer with 10x data & bigger batch sizes
● Pre-trained on 1,024 V100 GPUs for one day in 2019

(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach

Text-to-text models: the best of both worlds (Bard)?
● Encoder-only models (e.g., BERT) enjoy the benefits of bidirectionality but they can’t be
used to generate text
● Decoder-only models (e.g., GPT-3, Llama 2) can do generation but they are left-to-right
LMs
● Text-to-text models combine the best of both worlds!

T5 = Text-to-Text Transfer Transformer

(Raffel et al., 2020): Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

How to use these pre-trained models?

From GPT to GPT-2 to GPT-3

Quiz

● Context size?
● Is a larger context size more difficult to handle?

GPT-3: language models are few-shot learners

● GPT-2 → GPT-3: 1.5B → 175B (# of parameters), ~14B → 300B (# of tokens)
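The jump in scale can be put into rough numbers. The sketch below uses the parameter and token counts from the line above together with a common rule of thumb, assumed here and not stated in the slides, that training costs about 6 FLOPs per parameter per token. It is a back-of-the-envelope estimate, not a measured figure.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate (an assumption): ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# Figures from the slide: parameters and training tokens.
gpt2_params, gpt2_tokens = 1.5e9, 14e9
gpt3_params, gpt3_tokens = 175e9, 300e9

param_ratio = gpt3_params / gpt2_params            # how many times more parameters
gpt2_flops = training_flops(gpt2_params, gpt2_tokens)
gpt3_flops = training_flops(gpt3_params, gpt3_tokens)
compute_ratio = gpt3_flops / gpt2_flops            # how many times more training compute

print(f"parameters: x{param_ratio:.0f}")
print(f"GPT-3 training compute: {gpt3_flops:.2e} FLOPs")
print(f"compute: x{compute_ratio:.0f}")
```

Under this assumption GPT-3 comes out at about 3.15e23 FLOPs, roughly a third of a yottaFLOP (1e24), with about 117 times the parameters and about 2,500 times the training compute of GPT-2.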

GPT-3’s in-context learning

[2020] GPT-3 to [2022] ChatGPT

What’s new?
● Training on code
● Supervised instruction tuning
● RLHF = Reinforcement learning from human feedback

Source: Fu, 2022, “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language
Models to their Sources”

How was ChatGPT developed?

Evaluation of LLMs

Newest LLMs

● Claude 2.1 (Anthropic)
○ 200K context window
○ 2x decrease in hallucination rates
● GPT-4 Turbo (OpenAI)
○ 128K context window

Vietnamese LLMs
● PhoGPT (VinAI)
● FPT.AI
● VNG (Zalo)
● …

ChatGPT application for reading comprehension (ChatPDF)

● Fine-tune the ChatGPT model with training data from a specific domain
● Improve the LLM's answers with Retrieval-Augmented Generation (RAG)
● Use effective prompting to achieve the expected output
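The retrieval step of RAG can be sketched as follows: score document chunks against the question, keep the best ones, and place them in the prompt. This is a toy illustration under stated assumptions. The bag-of-words cosine similarity stands in for a real embedding model, the function names and prompt wording are hypothetical, and the final call to an LLM is left out.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector from lowercased whitespace tokens (toy stand-in for embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = bow(question)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

def build_rag_prompt(question, chunks, k=2):
    """Put the retrieved chunks in front of the question as context."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical chunks, as if split from a PDF of these slides.
chunks = [
    "BERT is an encoder-only model pre-trained with masked language modeling.",
    "GPT-3 has 175B parameters and is a decoder-only model.",
    "T5 casts every task as text-to-text.",
]
question = "How many parameters does GPT-3 have?"
best = retrieve(question, chunks, k=1)[0]
prompt = build_rag_prompt(question, chunks, k=1)
print(prompt)
```

The model then answers from the retrieved text rather than from its parameters alone, which is why RAG is often used for question answering over documents the model was never trained on.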

Risks of Large Language Models

● LLMs make mistakes
○ (falsehoods, hallucinations)
● LLMs can be misused
○ (misinformation, spam)
● LLMs can cause harms
○ (toxicity, biases, stereotypes)
● LLMs can be attacked
○ (adversarial examples, poisoning, prompt injection)
● LLMs are costly to train and deploy

Summary

● Introduction to LLM
● Large Language models (types)


Thank you