I've recently been looking into NLP-related models and found them quite interesting. I'm planning to try some hands-on text classification with a dataset from Kaggle, so I'm also jotting down some of my study notes here.
This article summarizes a few key concepts behind XLNet. Since I'm still an NLP beginner, I can only scratch the surface; if you're interested, feel free to dig deeper on your own.
You will know…
- GPT-2 vs. Bert
- Key concepts of XLNet
- Permutation Language Model (Attention Mask)
- Two-Stream Attention
GPT-2 vs. Bert
GPT-2 – Auto Regressive
The model reads the text in a single direction (reading forward is much like how a human reads an article), and along the way it learns to guess the content that comes next (or, when reading backward, the content that came before).
(The drawback is that there are only two directions in which it can read and learn, whereas the text we encounter in real life doesn't always follow that logic.)
Suppose we have the following text:
The discussion includes a critical evaluation of the documentary sources.
Reading and learning in the forward direction looks roughly like the following table:
| round | known words | model's guess | correct answer |
| --- | --- | --- | --- |
| 1 | The | xxx | discussion |
| 2 | The discussion | xox | includes |
| 3 | The discussion includes | ooo | a |
| 4 | The discussion includes a | … | critical |
| 5 | The discussion includes a critical | … | evaluation |
| 6 | The discussion includes a critical evaluation | … | of |
| 7 | The discussion includes a critical evaluation of | … | the |
| 8 | The discussion includes a critical evaluation of the | … | documentary |
| 9 | The discussion includes a critical evaluation of the documentary | … | sources |
This guessing process is exactly the training process: after training on many such examples, the trained model can help us predict the words likely to come next (or before).
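To make the table above concrete, here is a minimal Python sketch (my own illustration, not GPT-2's actual training code) that generates the same (known words, correct answer) pairs an auto-regressive model is trained on:

```python
# Auto-regressive training pairs: at each step the model only sees the prefix
# and has to guess the next word.
sentence = "The discussion includes a critical evaluation of the documentary sources"
tokens = sentence.split()

for i in range(1, len(tokens)):
    known = " ".join(tokens[:i])   # what the model is allowed to see
    answer = tokens[i]             # what it has to predict
    print(f"round {i}: '{known}' -> '{answer}'")
```

Reading backward simply reverses the token order before forming the pairs.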
Bert – Auto Encoding
Parts of the text are hidden with masks, and the model learns by guessing what the masked content is.
For example, given the text: The kids were all wearing animal masks.
During training, because of the masks, the model might only see: The <mask> were all <mask> animal masks.
The advantage is that the model sees context from both directions at once; the downsides are the extra handling the masks require and the fact that masked positions cannot reference one another.
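As a rough illustration, here is a minimal Python sketch of this kind of masking (simplified; the real BERT recipe also sometimes keeps the original token or swaps in a random one instead of always using <mask>):

```python
import random

# Randomly hide ~15% of the tokens; the model is trained to recover them.
tokens = "The kids were all wearing animal masks .".split()

masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:     # mask roughly 15% of positions
        masked.append("<mask>")
        targets[i] = tok           # the answers the model must guess
    else:
        masked.append(tok)

print(" ".join(masked))
print(targets)                     # masked positions and their original words
```

Each <mask> is predicted independently from the visible words, which is exactly the "masks cannot reference one another" limitation mentioned above.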
Key concepts of XLNet
Put simply, it's a hybrid that borrows the strengths of both GPT-2 and Bert!?
Permutation Language Model (Attention Mask)
When reading a text, XLNet permutes the words into different orders; each permutation is turned into an Attention Mask, which the model then learns from.
| original text | I have a cat called Yaya |
| --- | --- |
| permutation 1 | have a cat called Yaya I |
| permutation 2 | a cat called Yaya I have |
| permutation 3 | cat called Yaya I have a |
| … | … |
Each permutation produces a two-dimensional Attention Mask[i][j], where i indexes the words of the original text (the original order is kept) and Mask[i][j] records whether word i is allowed to see word j under that permutation: 0 means it cannot, 1 means it can.
Take permutation 2 from the table above as an example:
a cat called Yaya I have
- I can see a cat called Yaya
- have can see a cat called Yaya I
- a cannot see anything
- …
| Attention Mask | idx | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I | 0 | x | 0 | 1 | 1 | 1 | 1 |
| have | 1 | 1 | x | 1 | 1 | 1 | 1 |
| a | 2 | 0 | 0 | x | 0 | 0 | 0 |
| cat | 3 | 0 | 0 | 1 | x | 0 | 0 |
| called | 4 | 0 | 0 | 1 | 1 | x | 0 |
| Yaya | 5 | 0 | 0 | 1 | 1 | 1 | x |
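Here is a minimal Python sketch (my own illustration, not XLNet's implementation) that rebuilds the table above from permutation 2:

```python
# Build Attention Mask[i][j]: can word i (in the original order) see word j,
# given that a word may only see words that come before it in the permutation?
original = ["I", "have", "a", "cat", "called", "Yaya"]
permutation = ["a", "cat", "called", "Yaya", "I", "have"]   # "permutation 2"

rank = {word: r for r, word in enumerate(permutation)}      # position within the permutation

n = len(original)
mask = [[0] * n for _ in range(n)]
for i, wi in enumerate(original):
    for j, wj in enumerate(original):
        if i != j and rank[wj] < rank[wi]:   # wj appears earlier in the permutation
            mask[i][j] = 1

for word, row in zip(original, mask):
    print(f"{word:>6} {row}")
```

The diagonal is simply left as 0 here; whether a position is allowed to attend to itself is exactly where the two streams in the next section differ.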
Two-Stream Attention
| permuted order | a | cat | called | Yaya | I | have |
| --- | --- | --- | --- | --- | --- | --- |
| original index | 2 | 3 | 4 | 5 | 0 | 1 |
Content-Stream Attention
Every word it attends to, including itself, contributes both context information + position information
P (called) = (called+4, a+2, cat+3)
Query-Stream Attention
The word itself (called) contributes no context information; only its position information is visible
P (called) = (4, a+2, cat+3)
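To make the difference concrete, here is a minimal conceptual sketch (my own notation, not XLNet's actual two-stream implementation) of what each stream is allowed to use when representing "called" under permutation 2:

```python
# Content stream: may use its own content + position.
# Query stream: may use only its own position (no content), plus earlier words' content + position.
original = ["I", "have", "a", "cat", "called", "Yaya"]
permutation = ["a", "cat", "called", "Yaya", "I", "have"]
rank = {w: r for r, w in enumerate(permutation)}

def visible(word, stream):
    """Which (content, position) pairs the given word may attend to."""
    seen = [(w, original.index(w)) for w in original if rank[w] < rank[word]]
    if stream == "content":
        seen.append((word, original.index(word)))   # self: content + position
    else:
        seen.append((None, original.index(word)))   # self: position only
    return seen

print(visible("called", "content"))   # [('a', 2), ('cat', 3), ('called', 4)]
print(visible("called", "query"))     # [('a', 2), ('cat', 3), (None, 4)]
```

The query stream is the one used to predict the word at that position, which is why it must not see the word's own content.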

(Figure from the XLNet paper: the attention block at the lower left is the Query-Stream Attention.)