当前位置: 首页 > news >正文

Natural Language Processing

NLP before LLM

Context-free Grammar

A context-free grammar (CFG) contains a set of production rules, which are rules saying how each nonterminal can be replaced by a string of terminals and non-terminals. Derivations are a sequence of steps where non-terminals are replaced by the right-hand side of a production rule in the CFG.

Given \(G\) as a CFG. The language of \(G\), denoted as \(\mathcal{L}(G)\), is the set of strings derivable by \(G\) (from the start symbol). A language \(L\) is called a context-free language (CFL) if there is a CFG \(G\) such that \(L=\mathcal{L}(G)\).

Theorem. Every regular language (regular expressions, regex), is context-free, but not vice versa.

Example \(L_0\): Natural Language.

Clearly, CFG is (usually) not perfect: \(L_0\) could cover more than the language you really want, but it is already useful for parsing.

A CFG is in Chomsky normal form (CNF) if it is \(\varepsilon\)-free and if in addition Chomsky normal form each production is either of the form \(A\to B\;C\) or \(A\to a\). (\(A\; B\; C\) can be the same symbol)

Theorem. Any context-free grammar can be converted into a CNF.

Given a CFG, syntactic parsing refers to the problem of mapping from a sentence to its parse tree. We can use dynamic programming to parse a sentence (CKY parsing). However, there coult be ambiguity. To resolve ambiguity, we can use probabilistic CFG (PCFG), which associates a probability with each production rule. The probability of a parse tree is the product of the probabilities of all the production rules used in the tree.

Latent Semantic Analysis

The term-document (TD) matrix \(W_{td}\in\mathbb{R}^{m\times n}\), where \(m\) is the number of terms and \(n\) is the number of documents. Each entry \(W_{td}(ij)\) is the frequency of term \(i\) in document \(j\). i.e., rows are terms and columns are documents.

Since the co-occurrence statistics might be sparse, instead of directly using ܹ\(W_{td}\), we want to infer a "latent" vector representation (row vec) for the words/documents which satisfies:

\[W_{td}(ij)\approx \max(lv(w_i)lv(d_j)^\top,0). \]

To achieve this, we can use SVD. Besides, in order to make related words have related vectors, we need to use low-rank approximation. That is, we compute the SVD of \(W_{td}\) and keep only the top \(k\) singular values and corresponding singular vectors.

Another problem is that SVD would pay too much attention to the high-freq words. So we apply TF-IDF normalization. That is, we replace \(W_{td}(ij)\) by \(tf(i,j)\cdot idf(i)\), where

  • Term frequency (\(tf(i,j)\)):

\[\frac{\text{# of times word }i\text{ appears in doc }j}{\text{# of words in doc }j}. \]

  • Inverse document frequency (\(idf(i)\)), smoothed version:

\[\log \left(\frac{\text{# of docs} + 1}{\text{# of docs containing word }i + 1}\right)+1. \]

Hidden Markov Model

Gaussian Mixture Model

Consider a mixture of \(k\) Gaussian distributions. Each Gaussian has its own mean \(\mu_c\), variance \(\sigma_c\), and mixture weight \(\pi_c\). The probability density function of the mixture is

\[p(x)=\sum_{c=1}^k \pi_c \mathcal{N}(x;\mu_c,\sigma_c). \]

Equivalent "latent variable" form:

\[\begin{align*} p(z=c)&=\pi_c\\ p(x|z=c)&=\mathcal{N}(x;\mu_c,\sigma_c). \end{align*} \]

Given a dataset \(X\) that is i.i.d. drawn from GMM, we want to estimate the parameters \(\theta=\{\pi_c,\mu_c,\sigma_c\}_{c=1}^k\).

The Expectation-Maximization Algorithm

The EM algorithm is used to find maximum likelihood parameters of a statistical model. Formally, we want to optimize \(\theta\) to maximize the log-likelihood of the observed data:

\[\log p(X|\theta)=\log \sum_Z p(X,Z|\theta). \]

Consider any \(q\) distribution, applying Jensen's inequality gives

\[\begin{align*} \log \sum_Z p(X,Z|\theta)=&\ \log \sum_Z q(Z)\frac{p(X,Z|\theta)}{q(Z)}\\ \ge&\ \sum_Z q(Z)\log p(X,Z|\theta)+\text{H}(q), \end{align*} \]

where \(\text{H}(q)=-\sum_Z q(Z)\log q(Z)\) is the entropy of \(q\).

Denote the current parameter as \(\theta'\). Set \(q(Z)=p(Z|X,\theta')\) and let the above objective be \(Q(\theta|\theta')\). One can verify that

\[\log p(X|\theta)= Q(\theta|\theta')+\text{KL}(q(Z)||p(Z|X,\theta)). \]

When \(\theta'\) is close to \(\theta\), the KL divergence is small, and maximizing \(Q(\theta|\theta')\) is approximately maximizing \(\log p(X|\theta)\).

The EM algorithm iteratively performs the following two steps until convergence. In the \(k\)-th iteration:

E-step: Let \(\theta_k\) be the current parameter estimate. Compute $$q(Z)=p(Z|X,\theta_k)=\frac{p(Z,X|\theta_k)}{p(X|\theta_k)}.$$

For GMM, we have \(\theta=\{\pi_c,\mu_c,\sigma_c\}_{c=1}^k\), so when \(Z=c,X=x_i\), it's

\[r_{ic}=\frac{\pi_{c}\mathcal{N}(x_i;\mu_{c},\sigma_{c})}{\sum_{c'} \pi_{c'} \mathcal{N}(x_i;\mu_{c'},\sigma_{c'})}. \]

M-step: We fix \(q(Z)\) and optimize \(\theta\) to maximize \(Q(\theta|\theta_k)\). That is, minimize

\[\sum_i\sum_c r_{ic}\log \left(\pi_c\mathcal{N}(x_i;\mu_c,\sigma_c)\right). \]

Let \(m_c=\sum_i r_{ic}\) and \(m=\sum_c m_c\). Solving the optimization problem gives

\[\begin{align*} \pi_c=&\ \frac{m_c}{m},\\ \mu_c=&\ \frac{1}{m_c}\sum_i r_{ic} x_i,\\ \sigma_c=&\ \frac{1}{m_c}\sum_i r_{ic}(x_i-\mu_c)^2. \end{align*} \]

Hidden Markov Model

Given a sentence, we want to identify the grammatical category of each word. We can use HMM to model the problem. Let \(O\) be the observed words, and \(Q\) be the hidden states (grammatical categories). The model parameters include:

  • Initial state distribution: \(\pi_i=p(q_1=i)\);
  • State transition distribution: \(a_{ij}=p(q_{t+1}=j|q_t=i)\);
  • Emission distribution: \(b_i(o_t)=p(o_t|q_t=i)\).

If we're given the parameters \(A,B,\pi\), we can calculate the probability of an observation sequence \(O\) using the forward algorithm, and we can find the most likely hidden state sequence \(Q\) using the Viterbi algorithm. They both use dynamic programming.

Thus, for supervised learning, we just need to count & normalize; for unsupervised learning, we can use the EM algorithm with \(Q\) as hidden states.

N-gram

A Language Model assigns a probability of any sequence of words. That is,

\[\sum_{W\in V^*} p(W)=1. \]

A good LM should put high probability to more "likely" sentences.

  • Uni-gram LM: \(P(w_1w_2\cdots w_T)=P(w_1)P(w_2)\cdots P(w_T).\)
  • Bi-gram LM: \(P(w_1w_2\cdots w_T)=P(w_1)P(w_2|w_1)P(w_3|w_2)\cdots P(w_T|w_{T-1}).\)
  • Tri-gram LM: \(P(w_1w_2\cdots w_T)=P(w_1)P(w_2|w_1)P(w_3|w_1w_2)\cdots P(w_T|w_{T-2}w_{T-1}).\)
  • N-gram LM: \(P(w_1w_2\cdots w_T)=\prod_{t=n}^T P(w_t|w_{t-(n-1)}\cdots w_{t-1}).\)

There are two special tokens that are very useful in building a LM.

  • The end-of-sentence token: <eos>, tells where a sentence could end.
  • The out-of-vocabulary token: <unk>, replaces rare words (e.g., appearing only once in the training corpus).

In order to build a LM, we need to estimate the conditional probabilities \(P(w_t|w_{t-(n-1)}\cdots w_{t-1})\). The straightforward way is to directly count & normalize. However, this would lead to the zero-frequency problem (cannot create new sentences). Ways to solve it:

  • Add-k smoothing: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\frac{\text{count}(w_{t-(n-1)}\cdots w_{t-1}w_t)+k}{\text{count}(w_{t-(n-1)}\cdots w_{t-1})+k|V|}.$$
  • Linear interpolation: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\lambda_n P(w_t|w_{t-(n-1)}\cdots w_{t-1})+\lambda_{n-1} P(w_t|w_{t-(n-2)}\cdots w_{t-1})+\cdots+\lambda_1 P(w_t),$$
    where \(\sum_{i=1}^n \lambda_i=1\).
  • Backoff: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\begin{cases} \frac{\text{count}(w_{t-(n-1)}\cdots w_{t-1}w_t)}{\text{count}(w_{t-(n-1)}\cdots w_{t-1})}, & \text{if } \text{count}(w_{t-(n-1)}\cdots w_{t})>0 \ \alpha(w_{t-(n-2)}\cdots w_{t-1})P(w_t|w_{t-(n-2)}\cdots w_{t-1}), & \text{otherwise} \end{cases},$$
    where \(\alpha(w_{t-(n-2)}\cdots w_{t-1})\) is a normalization factor.

Given a test set \(W\), we define the LM's perplexity to be

\[PPL(W)=2^{-l},\text{ where }l=\frac{\log_2(P(W))}{\text{token_len}(W)}. \]

Perplexity cares more about diversity than quality.

Word2vec

In the LSA section, we obtained word vectors by decomposing the term-document matrix; In this section, we will talk about another approach based on prediction.

  • Skip-gram: learn representations that predict the context given a word.
  • CBOW (Continous Bag-of-Words): learn representations that predict a word given context.
http://www.gsyq.cn/news/11974.html

相关文章:

  • Python 在自动化与运维中的价值与实践
  • redis 哨兵模式主从数据同步失败
  • US$66.5 Yanhua ACDP FEM/BDC Bench Integrated Interface Board
  • sql练习笔记
  • 算法练习
  • 一次CPU飙升问题排查定位
  • ros2 control 2
  • 新学期每日总结(第4天)
  • VSCode 升级 C++支持版本
  • 在electron-vite使用ShadCN
  • 9-23
  • Ubuntu Uninstall App
  • day11 课程(学员管理系统案例)
  • US$128 OBD II Adapter Plus OBD Cable Works with CKM100 and DIGIMASTER III for Key Programming
  • jmeter函数
  • Windows 10 C盘占用释放 - tfel
  • CherryStudio+cpolar:让智能工作流突破组织边界 - 详解
  • 科学计算方法--矩阵分析记录
  • 分布式链路追踪-SkyWalking - 指南
  • Say 题选记(9.21 - 9.27)
  • 9.25总结
  • Day08-C:\Users\Lenovo\Desktop\note\code\JavaSE\Basic\src\com\David\array-ArrayDemo01~07
  • ansible注意的和错误代码分析
  • 用 Rust 和 Tesseract OCR 识别验证码
  • 基于寄存器地址amp;标准外设库的LED流水灯
  • Rust 和 Tesseract OCR 实现验证码识别
  • AI-Powered-ToDo-List
  • Python 在 Web 开发中的应用与趋势
  • LLM MOE的进化之路
  • 【pytorch】关于深度学习模型是怎么使数据从头流动到尾的