Natural Language Processing

发布时间：2026/7/29 13:12:58

NLP before LLM

Context-free Grammar

A context-free grammar (CFG) contains a set of production rules, which are rules saying how each nonterminal can be replaced by a string of terminals and non-terminals. Derivations are a sequence of steps where non-terminals are replaced by the right-hand side of a production rule in the CFG.

Given $G$ as a CFG. The language of $G$, denoted as $\mathcal{L}(G)$, is the set of strings derivable by $G$ (from the start symbol). A language $L$ is called a context-free language (CFL) if there is a CFG $G$ such that $L=\mathcal{L}(G)$.

Theorem. Every regular language (regular expressions, regex), is context-free, but not vice versa.

Example $L_0$: Natural Language.

Clearly, CFG is (usually) not perfect: $L_0$ could cover more than the language you really want, but it is already useful for parsing.

A CFG is in Chomsky normal form (CNF) if it is $\varepsilon$-free and if in addition Chomsky normal form each production is either of the form $A\to B\;C$ or $A\to a$. ($A\; B\; C$ can be the same symbol)

Theorem. Any context-free grammar can be converted into a CNF.

Given a CFG, syntactic parsing refers to the problem of mapping from a sentence to its parse tree. We can use dynamic programming to parse a sentence (CKY parsing). However, there coult be ambiguity. To resolve ambiguity, we can use probabilistic CFG (PCFG), which associates a probability with each production rule. The probability of a parse tree is the product of the probabilities of all the production rules used in the tree.

Latent Semantic Analysis

The term-document (TD) matrix $W_{td}\in\mathbb{R}^{m\times n}$, where $m$ is the number of terms and $n$ is the number of documents. Each entry $W_{td}(ij)$ is the frequency of term $i$ in document $j$. i.e., rows are terms and columns are documents.

Since the co-occurrence statistics might be sparse, instead of directly using ܹ$W_{td}$, we want to infer a "latent" vector representation (row vec) for the words/documents which satisfies:

\[W_{td}(ij)\approx \max(lv(w_i)lv(d_j)^\top,0). \]

To achieve this, we can use SVD. Besides, in order to make related words have related vectors, we need to use low-rank approximation. That is, we compute the SVD of $W_{td}$ and keep only the top $k$ singular values and corresponding singular vectors.

Another problem is that SVD would pay too much attention to the high-freq words. So we apply TF-IDF normalization. That is, we replace $W_{td}(ij)$ by $tf(i,j)\cdot idf(i)$, where

Term frequency ($tf(i,j)$):

\[\frac{\text{# of times word }i\text{ appears in doc }j}{\text{# of words in doc }j}. \]

Inverse document frequency ($idf(i)$), smoothed version:

\[\log \left(\frac{\text{# of docs} + 1}{\text{# of docs containing word }i + 1}\right)+1. \]

Hidden Markov Model

Gaussian Mixture Model

Consider a mixture of $k$ Gaussian distributions. Each Gaussian has its own mean $\mu_c$, variance $\sigma_c$, and mixture weight $\pi_c$. The probability density function of the mixture is

\[p(x)=\sum_{c=1}^k \pi_c \mathcal{N}(x;\mu_c,\sigma_c). \]

Equivalent "latent variable" form:

\[\begin{align*} p(z=c)&=\pi_c\\ p(x|z=c)&=\mathcal{N}(x;\mu_c,\sigma_c). \end{align*} \]

Given a dataset $X$ that is i.i.d. drawn from GMM, we want to estimate the parameters $\theta=\{\pi_c,\mu_c,\sigma_c\}_{c=1}^k$.

The Expectation-Maximization Algorithm

The EM algorithm is used to find maximum likelihood parameters of a statistical model. Formally, we want to optimize $\theta$ to maximize the log-likelihood of the observed data:

\[\log p(X|\theta)=\log \sum_Z p(X,Z|\theta). \]

Consider any $q$ distribution, applying Jensen's inequality gives

\[\begin{align*} \log \sum_Z p(X,Z|\theta)=&\ \log \sum_Z q(Z)\frac{p(X,Z|\theta)}{q(Z)}\\ \ge&\ \sum_Z q(Z)\log p(X,Z|\theta)+\text{H}(q), \end{align*} \]

where $\text{H}(q)=-\sum_Z q(Z)\log q(Z)$ is the entropy of $q$.

Denote the current parameter as $\theta'$. Set $q(Z)=p(Z|X,\theta')$ and let the above objective be $Q(\theta|\theta')$. One can verify that

\[\log p(X|\theta)= Q(\theta|\theta')+\text{KL}(q(Z)||p(Z|X,\theta)). \]

When $\theta'$ is close to $\theta$, the KL divergence is small, and maximizing $Q(\theta|\theta')$ is approximately maximizing $\log p(X|\theta)$.

The EM algorithm iteratively performs the following two steps until convergence. In the $k$-th iteration:

E-step: Let $\theta_k$ be the current parameter estimate. Compute $$q(Z)=p(Z|X,\theta_k)=\frac{p(Z,X|\theta_k)}{p(X|\theta_k)}.$$

For GMM, we have $\theta=\{\pi_c,\mu_c,\sigma_c\}_{c=1}^k$, so when $Z=c,X=x_i$, it's

\[r_{ic}=\frac{\pi_{c}\mathcal{N}(x_i;\mu_{c},\sigma_{c})}{\sum_{c'} \pi_{c'} \mathcal{N}(x_i;\mu_{c'},\sigma_{c'})}. \]

M-step: We fix $q(Z)$ and optimize $\theta$ to maximize $Q(\theta|\theta_k)$. That is, minimize

\[\sum_i\sum_c r_{ic}\log \left(\pi_c\mathcal{N}(x_i;\mu_c,\sigma_c)\right). \]

Let $m_c=\sum_i r_{ic}$ and $m=\sum_c m_c$. Solving the optimization problem gives

\[\begin{align*} \pi_c=&\ \frac{m_c}{m},\\ \mu_c=&\ \frac{1}{m_c}\sum_i r_{ic} x_i,\\ \sigma_c=&\ \frac{1}{m_c}\sum_i r_{ic}(x_i-\mu_c)^2. \end{align*} \]

Hidden Markov Model

Given a sentence, we want to identify the grammatical category of each word. We can use HMM to model the problem. Let $O$ be the observed words, and $Q$ be the hidden states (grammatical categories). The model parameters include:

Initial state distribution: $\pi_i=p(q_1=i)$;
State transition distribution: $a_{ij}=p(q_{t+1}=j|q_t=i)$;
Emission distribution: $b_i(o_t)=p(o_t|q_t=i)$.

If we're given the parameters $A,B,\pi$, we can calculate the probability of an observation sequence $O$ using the forward algorithm, and we can find the most likely hidden state sequence $Q$ using the Viterbi algorithm. They both use dynamic programming.

Thus, for supervised learning, we just need to count & normalize; for unsupervised learning, we can use the EM algorithm with $Q$ as hidden states.

N-gram

A Language Model assigns a probability of any sequence of words. That is,

\[\sum_{W\in V^*} p(W)=1. \]

A good LM should put high probability to more "likely" sentences.

Uni-gram LM: $P(w_1w_2\cdots w_T)=P(w_1)P(w_2)\cdots P(w_T).$
Bi-gram LM: $P(w_1w_2\cdots w_T)=P(w_1)P(w_2|w_1)P(w_3|w_2)\cdots P(w_T|w_{T-1}).$
Tri-gram LM: $P(w_1w_2\cdots w_T)=P(w_1)P(w_2|w_1)P(w_3|w_1w_2)\cdots P(w_T|w_{T-2}w_{T-1}).$
N-gram LM: $P(w_1w_2\cdots w_T)=\prod_{t=n}^T P(w_t|w_{t-(n-1)}\cdots w_{t-1}).$

There are two special tokens that are very useful in building a LM.

The end-of-sentence token: <eos>, tells where a sentence could end.
The out-of-vocabulary token: <unk>, replaces rare words (e.g., appearing only once in the training corpus).

In order to build a LM, we need to estimate the conditional probabilities $P(w_t|w_{t-(n-1)}\cdots w_{t-1})$. The straightforward way is to directly count & normalize. However, this would lead to the zero-frequency problem (cannot create new sentences). Ways to solve it:

Add-k smoothing: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\frac{\text{count}(w_{t-(n-1)}\cdots w_{t-1}w_t)+k}{\text{count}(w_{t-(n-1)}\cdots w_{t-1})+k|V|}.$$
Linear interpolation: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\lambda_n P(w_t|w_{t-(n-1)}\cdots w_{t-1})+\lambda_{n-1} P(w_t|w_{t-(n-2)}\cdots w_{t-1})+\cdots+\lambda_1 P(w_t),$$
where $\sum_{i=1}^n \lambda_i=1$.
Backoff: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\begin{cases} \frac{\text{count}(w_{t-(n-1)}\cdots w_{t-1}w_t)}{\text{count}(w_{t-(n-1)}\cdots w_{t-1})}, & \text{if } \text{count}(w_{t-(n-1)}\cdots w_{t})>0 \ \alpha(w_{t-(n-2)}\cdots w_{t-1})P(w_t|w_{t-(n-2)}\cdots w_{t-1}), & \text{otherwise} \end{cases},$$
where $\alpha(w_{t-(n-2)}\cdots w_{t-1})$ is a normalization factor.

Given a test set $W$, we define the LM's perplexity to be

\[PPL(W)=2^{-l},\text{ where }l=\frac{\log_2(P(W))}{\text{token_len}(W)}. \]

Perplexity cares more about diversity than quality.

Word2vec

In the LSA section, we obtained word vectors by decomposing the term-document matrix; In this section, we will talk about another approach based on prediction.

Skip-gram: learn representations that predict the context given a word.
CBOW (Continous Bag-of-Words): learn representations that predict a word given context.