
Sequence Modeling with CTC


Introduction

Consider speech recognition. We have a dataset of audio clips and
corresponding transcripts. Unfortunately, we don’t know how the characters in
the transcript align to the audio. This makes training a speech recognizer
harder than it might at first seem.

Without this alignment, the simple approaches aren’t available to us. We
could devise a rule like “one character corresponds to ten inputs”. But
people’s rates of speech vary, so this type of rule can always be broken.
Another alternative is to hand-align each character to its location in the
audio. From a modeling standpoint this works well — we’d know the ground truth
for each input time-step. However, for any reasonably sized dataset this is
prohibitively time consuming.

This problem doesn’t just turn up in speech recognition. We see it in many
other places. Handwriting recognition from images or sequences of pen strokes
is one example. Action labelling in videos is another.

Handwriting recognition: The input can be the $(x, y)$ coordinates of a pen
stroke or the pixels in an image.
Speech recognition: The input can be a spectrogram or some other sequence of
frequency-based features.

Connectionist Temporal Classification (CTC) is a way to get around not
knowing the alignment between the input and the output. As we’ll see, it’s
especially well suited to applications like speech and handwriting
recognition.


To be a bit more formal, let’s consider mapping input sequences
$X = [x_1, x_2, \ldots, x_T]$, such as audio, to corresponding output
sequences $Y = [y_1, y_2, \ldots, y_U]$, such as transcripts. We want to find
an accurate mapping from $X$’s to $Y$’s.
There are challenges which get in the way of us
using simpler supervised learning algorithms. In particular:

  • Both $X$ and $Y$ can vary in length.
  • The ratio of the lengths of $X$ and $Y$ can vary.
  • We don’t have an accurate alignment (correspondence of the elements) of
    $X$ and $Y$.

The CTC algorithm overcomes these challenges. For a given $X$
it gives us an output distribution over all possible $Y$’s. We
can use this distribution either to infer a likely output or to assess
the probability of a given output.

Not all ways of computing the loss function and performing inference are
tractable. We’ll require that CTC do both of these efficiently.

Loss Function: For a given input, we’d like to train our
model to maximize the probability it assigns to the right answer. To do this,
we’ll need to efficiently compute the conditional probability
$p(Y \mid X)$. The function $p(Y \mid X)$ should
also be differentiable, so we can use gradient descent.

Inference: Naturally, after we’ve trained the model, we
want to use it to infer a likely $Y$ given an $X$.
This means solving

$$Y^* = \underset{Y}{\operatorname{argmax}} \; p(Y \mid X).$$

Ideally $Y^*$ can be found efficiently. With CTC we’ll settle for an
approximate solution that isn’t too expensive to find.

The Algorithm

The CTC algorithm can assign a probability for any $Y$
given an $X$. The key to computing this probability is how CTC
thinks about alignments between inputs and outputs. We’ll start by looking at
these alignments and then show how to use them to compute the loss function and
perform inference.

Alignment

The CTC algorithm is alignment-free — it doesn’t require an
alignment between the input and the output. However, to get the probability of
an output given an input, CTC works by summing over the probability of all
possible alignments between the two. We need to understand what these
alignments are in order to understand how the loss function is ultimately
calculated.

To motivate the specific form of the CTC alignments, first consider a naive
approach. Let’s use an example. Assume the input has length six and
$Y =$ [c, a, t]. One way to align $X$ and $Y$
is to assign an output character to each input step and collapse repeats.


This approach has two problems.

  • Often, it doesn’t make sense to force every input step to align to
    some output. In speech recognition, for example, the input can have stretches
    of silence with no corresponding output.
  • We have no way to produce outputs with multiple characters in a row.
    Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will
    produce “helo” instead of “hello”.

To get around these problems, CTC introduces a new token to the set of
allowed outputs. This new token is sometimes called the blank token. We’ll
refer to it here as $\epsilon$. The
$\epsilon$ token doesn’t correspond to anything and is simply
removed from the output.

The alignments allowed by CTC are the same length as the input. We allow any
alignment which maps to $Y$ after merging repeats and removing
$\epsilon$ tokens:


If $Y$ has two of the same character in a row, then a valid
alignment must have an $\epsilon$ between them. With this rule
in place, we can differentiate between alignments which collapse to “hello” and
those which collapse to “helo”.

Let’s go back to the output [c, a, t] with an input of length six. Here are
a few more examples of valid and invalid alignments.
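
To make this concrete, here’s a minimal sketch (mine, not from the original
article) of the collapsing rule in Python, along with a check of a few
candidate alignments for $Y =$ [c, a, t] with an input of length six:

```python
def collapse(alignment, blank="ε"):
    """Merge repeated characters, then remove blanks."""
    output, previous = [], None
    for token in alignment:
        if token != previous and token != blank:
            output.append(token)
        previous = token
    return output

# Valid: both collapse to [c, a, t].
print(collapse(["ε", "c", "c", "ε", "a", "t"]))   # ['c', 'a', 't']
print(collapse(["c", "c", "a", "a", "t", "t"]))   # ['c', 'a', 't']

# Invalid for [c, a, t]: the ε between the t's keeps them from merging.
print(collapse(["c", "ε", "a", "t", "ε", "t"]))   # ['c', 'a', 't', 't']
```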


The CTC alignments have a few notable properties. First, the allowed
alignments between $X$ and $Y$ are monotonic.
If we advance to the next input, we can keep the corresponding output the
same or advance to the next one. A second property is that the alignment of
$X$ to $Y$ is many-to-one. One or more input
elements can align to a single output element but not vice-versa. This implies
a third property: the length of $Y$ cannot be greater than the
length of $X$.

Loss Function

The CTC alignments give us a natural way to go from probabilities at each
time-step to the probability of an output sequence.


To be precise, the CTC objective
for a single $(X, Y)$ pair is:

$$p(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^T p_t(a_t \mid X)$$

The CTC conditional probability
marginalizes over the set of valid alignments,
computing the probability for a single alignment step-by-step.

Models trained with CTC typically use a recurrent neural network (RNN) to
estimate the per time-step probabilities, $p_t(a_t \mid X)$.

If we aren’t careful, the CTC loss can be very expensive to compute. We
could try the straightforward approach and compute the score for each alignment,
summing them all up as we go. The problem is there can be a massive number of
alignments.

For a $Y$ of length $U$ without any repeat
characters and an $X$ of length $T$, the size
of the set is $\binom{T + U}{T - U}$.

For most problems this would be too slow.
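
As an illustration (not from the article), the naive approach would look
something like the sketch below, reusing the `collapse` helper from earlier.
Its cost is exponential in $T$, and the binomial count above becomes
astronomically large for even modest sequence lengths:

```python
import itertools
import math

def ctc_prob_brute_force(probs, Y, alphabet, blank="ε"):
    """Sum the probability of every length-T alignment that collapses to Y.

    probs[t][c] is a hypothetical per time-step probability p_t(c | X),
    e.g. a list of dicts keyed by character.
    """
    T = len(probs)
    total = 0.0
    for A in itertools.product(alphabet, repeat=T):   # |alphabet|^T candidates
        if collapse(A, blank) == list(Y):
            total += math.prod(probs[t][c] for t, c in enumerate(A))
    return total

# The number of valid alignments, per the formula above, for T = 100, U = 50:
print(math.comb(100 + 50, 100 - 50))   # on the order of 10^40
```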

Thankfully, we can compute the loss much faster with a dynamic programming
algorithm. The key insight is that if two alignments have reached the same
output at the same step, then we can merge them.

Summing over all alignments can be very expensive.
Dynamic programming merges alignments, so it’s much faster.

Since we can have an $\epsilon$ before or after any token in
$Y$, it’s easier to describe the algorithm
using a sequence which includes them. We’ll work with the sequence

$$Z = [\epsilon, y_1, \epsilon, y_2, \ldots, \epsilon, y_U, \epsilon],$$

which is simply $Y$ with an $\epsilon$ at the beginning, at the end, and
between every pair of characters.

Let’s let $\alpha$ be the score of the merged alignments at a
given node. More precisely, $\alpha_{s,t}$ is the CTC score of the
subsequence $Z_{1:s}$ after $t$ input steps. As we’ll see, the final CTC
score, $p(Y \mid X)$, is computed from the $\alpha$’s at the last time-step.
As long as we know the values of $\alpha$ at the previous time-step, we can
compute $\alpha_{s,t}$. There are two cases.

Case 1:

In this case, we can’t jump over $z_{s-1}$, the previous token in $Z$. The
first reason is that the previous token can be an element of $Y$, and we
can’t skip elements of $Y$. Since every element of $Y$ in $Z$ is followed by
an $\epsilon$, we can identify this when $z_s = \epsilon$. The second reason
is that we must have an $\epsilon$ between repeat characters in $Y$. We can
identify this when $z_s = z_{s-2}$.

To ensure we don’t skip $z_{s-1}$, we can either be at it during the previous
time-step or have already passed through it at an earlier time-step. As a
result there are two positions we can transition from:

$$\alpha_{s,t} = (\alpha_{s-1,t-1} + \alpha_{s,t-1}) \cdot p_t(z_s \mid X)$$

Case 2:

In the second case, we’re allowed to skip the previous token in
$Z$. We have this case whenever $z_{s-1}$ is an $\epsilon$ between unique
characters. As a result there are three positions we could have come from at
the previous step:

$$\alpha_{s,t} = (\alpha_{s-2,t-1} + \alpha_{s-1,t-1} + \alpha_{s,t-1}) \cdot p_t(z_s \mid X)$$

Below is an example of the computation performed by the dynamic programming
algorithm. Every valid alignment has a path in this graph.

In the diagram, the output is $Y =$ [a, b] and the horizontal axis
corresponds to the input $X$. Node $(s, t)$ represents
$\alpha_{s,t}$, the CTC score of the subsequence $Z_{1:s}$ after $t$ input steps.

There are two valid starting nodes and two valid final nodes since the
$\epsilon$ at the beginning and end of the sequence is
optional. The complete probability is the sum of the two final nodes.
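
Here’s a minimal sketch (my own, not from the article) of this forward
recursion in plain probability space; a real implementation would work in log
space, as discussed in the practitioner’s guide below:

```python
import numpy as np

def ctc_neg_log_likelihood(probs, labels, blank=0):
    """CTC forward (alpha) recursion.

    probs:  array of shape (T, V), probs[t, c] = p_t(c | X).
    labels: the output Y as a list of label indices (no blanks).
    Returns -log p(Y | X).
    """
    T = probs.shape[0]
    # Z = [eps, y_1, eps, y_2, ..., eps, y_U, eps]
    z = [blank]
    for y in labels:
        z += [y, blank]
    S = len(z)

    alpha = np.zeros((S, T))
    # Two valid starting nodes: the initial blank or the first label.
    alpha[0, 0] = probs[0, z[0]]
    if S > 1:
        alpha[1, 0] = probs[0, z[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[s, t - 1] + (alpha[s - 1, t - 1] if s > 0 else 0.0)
            # Case 2: we may also skip z[s-1] when it's a blank sitting
            # between two different labels.
            if s > 1 and z[s] != blank and z[s] != z[s - 2]:
                a += alpha[s - 2, t - 1]
            alpha[s, t] = a * probs[t, z[s]]

    # Two valid final nodes: the last label or the trailing blank.
    p = alpha[S - 1, T - 1] + (alpha[S - 2, T - 1] if S > 1 else 0.0)
    return -np.log(p)
```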

Now that we can efficiently compute the loss function, the next step is to
compute a gradient and train the model. The CTC loss function is differentiable
with respect to the per time-step output probabilities since it’s just sums and
products of them. Given this, we can analytically compute the gradient of the
loss function with respect to the (unnormalized) output probabilities and from
there run backpropagation as usual.

For a training set $\mathcal{D}$, the model’s parameters
are tuned to minimize the negative log-likelihood

$$\sum_{(X, Y) \in \mathcal{D}} -\log p(Y \mid X).$$

Inference

After we’ve trained the model, we’d like to use it to find a likely output
for a given input. More precisely, we need to solve:

$$Y^* = \underset{Y}{\operatorname{argmax}} \; p(Y \mid X)$$

One heuristic is to take the most likely output at each time-step. This
gives us the alignment with the highest probability:

$$A^* = \underset{A}{\operatorname{argmax}} \; \prod_{t=1}^{T} p_t(a_t \mid X)$$

We can then collapse repeats and remove $\epsilon$ tokens to
get $Y$.
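
A sketch of this heuristic (assuming the NumPy `probs` array from the
forward-pass sketch above and the `collapse` helper from earlier, with
integer index 0 as the blank):

```python
import numpy as np

def best_path_decode(probs, blank=0):
    """Greedy decoding: pick the argmax label at each time-step,
    then merge repeats and drop blanks."""
    best_alignment = probs.argmax(axis=1)          # shape (T,)
    return collapse(best_alignment.tolist(), blank)
```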

For many applications this heuristic works well, especially when most of the
probability mass is allotted to a single alignment. However, this approach can
sometimes miss easily found outputs with much higher probability. The problem
is, it doesn’t take into account the fact that a single output can have many
alignments.

Here’s an example. Assume the alignments [a, a, $\epsilon$]
and [a, a, a] individually have lower probability than [b, b, b]. But
the sum of their probabilities is actually greater than that of [b, b, b]. The
naive heuristic will incorrectly propose $Y =$ [b] as
the most likely hypothesis. It should have chosen $Y =$ [a].
To fix this, the algorithm needs to account for the fact that [a, a, a] and
[a, a, $\epsilon$] collapse to the same output.
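
With hypothetical numbers (chosen purely for illustration), the failure looks
like this:

```python
# Hypothetical per-alignment probabilities for a length-3 input.
alignment_probs = {
    ("b", "b", "b"): 0.40,   # collapses to [b]
    ("a", "a", "a"): 0.25,   # collapses to [a]
    ("a", "a", "ε"): 0.35,   # also collapses to [a]
}

# The single most probable alignment collapses to [b] ...
best_alignment = max(alignment_probs, key=alignment_probs.get)  # ('b', 'b', 'b')

# ... but summing over alignments shows [a] is the more probable output.
p_a = alignment_probs[("a", "a", "a")] + alignment_probs[("a", "a", "ε")]  # 0.60
p_b = alignment_probs[("b", "b", "b")]                                     # 0.40
```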

We can use a modified beam search to solve this. Given limited
computation, the modified beam search won’t necessarily find the
most likely $Y$. It does, at least, have
the nice property that we can trade off more computation
(a larger beam size) for an asymptotically better solution.

A regular beam search computes a new set of hypotheses at each input step.
The new set of hypotheses is generated from the previous set by extending each
hypothesis with all possible output characters and keeping only the top
candidates.

A standard beam search algorithm with an alphabet of
$\{\epsilon, a, b\}$ and a beam size of three.

We can modify the vanilla beam search to handle multiple alignments mapping to
the same output. In this case, instead of keeping a list of alignments in the
beam, we store the output prefixes after collapsing repeats and removing
$\epsilon$ characters. At each step of the search we accumulate
scores for a given prefix based on all the alignments which map to it.

The CTC beam search algorithm with an output alphabet
$\{\epsilon, a, b\}$ and a beam size of three.

A proposed extension can map to two output prefixes if the character is a
repeat. This is shown at $T=3$ in the figure above,
where ‘a’ is proposed as an extension to the prefix [a]. Both [a] and [a, a] are
valid outputs for this proposed extension.

When we extend [a] to produce [a, a], we only want to include the part of the
previous score for alignments which end in $\epsilon$. Remember, the
$\epsilon$ is required between repeat characters. Similarly,
when we don’t extend the prefix and produce [a], we should only include the part
of the previous score for alignments which don’t end in $\epsilon$.

Given this, we have to keep track of two probabilities for each prefix
in the beam: the probability of all alignments which end in
$\epsilon$ and the probability of all alignments which don’t
end in $\epsilon$. When we rank the hypotheses at
each step before pruning the beam, we’ll use their combined scores.

The implementation of this algorithm doesn’t require much code, but it is
dense and tricky to get right. Check out this
gist
for an example implementation in Python.
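
The sketch below (my own simplified version, not the gist above, and without
the language model discussed next) illustrates the core bookkeeping: each
prefix carries a blank-ending and a non-blank-ending probability, and
extensions are merged into the prefix they collapse to:

```python
import numpy as np
from collections import defaultdict

def prefix_beam_search(probs, beam_size=10, blank=0):
    """probs: (T, V) array of per time-step probabilities p_t(c | X).
    Returns the best collapsed output and its (approximate) probability.
    Each prefix tracks (p_b, p_nb): the probability of the alignments in
    the beam that end in a blank / don't end in a blank."""
    T, V = probs.shape
    beam = {(): (1.0, 0.0)}                        # empty prefix "ends in blank"

    for t in range(T):
        next_beam = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            for c in range(V):
                p = probs[t, c]
                if c == blank:
                    # A blank never changes the prefix.
                    n_b, n_nb = next_beam[prefix]
                    next_beam[prefix] = (n_b + (p_b + p_nb) * p, n_nb)
                elif prefix and c == prefix[-1]:
                    # Repeat character: only alignments ending in blank may
                    # extend the prefix; the rest collapse back onto it.
                    n_b, n_nb = next_beam[prefix + (c,)]
                    next_beam[prefix + (c,)] = (n_b, n_nb + p_b * p)
                    n_b, n_nb = next_beam[prefix]
                    next_beam[prefix] = (n_b, n_nb + p_nb * p)
                else:
                    # New character: any alignment can extend the prefix.
                    # (A language model score could be folded in here.)
                    n_b, n_nb = next_beam[prefix + (c,)]
                    next_beam[prefix + (c,)] = (n_b, n_nb + (p_b + p_nb) * p)
        # Rank by combined score and prune to the beam size.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: kv[1][0] + kv[1][1],
                           reverse=True)[:beam_size])

    best_prefix, (p_b, p_nb) = max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return list(best_prefix), p_b + p_nb
```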

In some problems, such as speech recognition, incorporating a language model
over the outputs significantly improves accuracy. We can include the language
model as a factor in the inference problem.

$$Y^* = \underset{Y}{\operatorname{argmax}} \; p(Y \mid X) \cdot p(Y)^\alpha \cdot L(Y)^\beta$$

The function $L(Y)$ computes the length of
$Y$ in terms of the language model tokens and acts as a word
insertion bonus. With a word-based language model, $L(Y)$
counts the number of words in $Y$. If we use a character-based
language model, then $L(Y)$ counts the number of characters
in $Y$. The language model scores are only included when a
prefix is extended by a character (or word) and not at every step of the
algorithm. This causes the search to favor shorter prefixes, as measured by
$L(Y)$, since they don’t include as many language model
updates. The word insertion bonus helps with this. The parameters
$\alpha$ and $\beta$ are usually set by
cross-validation.

The language model scores and word insertion term can be included in the
beam search. Whenever we propose to extend a prefix by a character, we can
include the language model score for the new character given the prefix so
far.

Properties of CTC

We’ve mentioned a few important properties of CTC so far. Here we’ll go
into more depth on what these properties are and what trade-offs they offer.

Conditional Independence

One of the most commonly cited shortcomings of CTC is the conditional
independence assumption it makes.

Graphical model for CTC.

The model assumes that every output is conditionally independent of
the other outputs given the input. This is a bad assumption for many
sequence to sequence problems.

Say we had an audio clip of someone saying “triple A”.
Another valid transcription could
be “AAA”. If the first letter of the predicted transcription is ‘A’, then
the next letter should be ‘A’ with high probability and ‘r’ with low
probability. The conditional independence assumption does not allow for this.

If we predict an ‘A’ as the first letter then the suffix ‘AA’ should get much
more probability than ‘riple A’. If we predict ‘t’ first, the opposite
should be true.

In fact, speech recognizers using CTC don’t learn a language model over the
output nearly as well as models which are conditionally dependent.
However, a separate language model can
be included and usually gives a good boost to accuracy.

The conditional independence assumption made by CTC isn’t always a bad
thing. Baking in strong beliefs over output interactions makes the model less
adaptable to new or altered domains. For example, we might want to use a speech
recognizer trained on phone conversations between friends to transcribe
customer support calls. The language in the two domains can be quite different
even if the acoustic model is similar. With a CTC acoustic model, we can easily
swap in a new language model as we change domains.

Alignment Properties

The CTC algorithm is alignment-free. The objective function
marginalizes over all alignments. While CTC does make strong assumptions about
the form of alignments between $X$ and $Y$, the
model is agnostic as to how probability is distributed amongst them. In some
problems CTC ends up allocating most of the probability to a single alignment.
However, this isn’t guaranteed.

We could force the model to choose a single
alignment by replacing the sum with a max in the objective function:

$$p(Y \mid X) = \max_{A \in \mathcal{A}_{X,Y}} \; \prod_{t=1}^T p_t(a_t \mid X).$$
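
For what it’s worth, in the dynamic program sketched earlier this would
amount to replacing the sums over predecessor nodes with maxes, a
hypothetical Viterbi-style variant of `ctc_neg_log_likelihood`:

```python
# Inside the loop over (s, t): take the best single path into each node
# instead of summing over all of them.
a = alpha[s, t - 1]
if s > 0:
    a = max(a, alpha[s - 1, t - 1])
if s > 1 and z[s] != blank and z[s] != z[s - 2]:
    a = max(a, alpha[s - 2, t - 1])
alpha[s, t] = a * probs[t, z[s]]
```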

As mentioned before, CTC only allows monotonic alignments. In
problems such as speech recognition this may be a valid assumption. For other
problems like machine translation where a future word in a target sentence
can align to an earlier part of the source sentence, this assumption is a
deal-breaker.

Another important property of CTC alignments is that they are
many-to-one. Multiple inputs can align to at most one output. In some
cases this may not be desirable. We might want to enforce a strict one-to-one
correspondence between elements of $X$ and
$Y$. Alternatively, we may want to allow multiple output
elements to align to a single input element. For example, the characters
“th” might align to a single input step of audio. A character-based CTC model
would not allow that.

The many-to-one property implies that the output can’t have more time-steps
than the input.

If $Y$ has $r$ consecutive
repeats, then the length of $Y$ must be less than
the length of $X$ by $2r - 1$.

This is usually not a problem for speech and handwriting recognition since the
input is much longer than the output. However, for other problems where
YY is often longer than XX, CTC just won’t
work.

CTC in Context

In this section we’ll discuss how CTC relates to other commonly used
algorithms for sequence modeling.

HMMs

At first glance, a Hidden Markov Model (HMM) seems quite different from
CTC. But the two algorithms are actually quite similar. Understanding the
relationship between them will help us understand what advantages CTC has over
HMM sequence models and give us insight into how CTC could be changed for
various use cases.

Let’s use the same notation as before:
$X$ is the input sequence and $Y$
is the output sequence, with lengths $T$ and
$U$ respectively. We’re interested in learning
$p(Y \mid X)$. One way to simplify the problem is to apply
Bayes’ Rule:

$$p(Y \mid X) \propto p(X \mid Y)\, p(Y).$$

The $p(Y)$ term can be any language model, so let’s focus on
$p(X \mid Y)$. Like before we’ll let
$\mathcal{A}$ be a set of allowed alignments between
$X$ and $Y$. Members of
$\mathcal{A}$ have length $T$.
Let’s otherwise leave $\mathcal{A}$ unspecified for now; we’ll
come back to it later. We can marginalize over alignments to get

$$p(X \mid Y) = \sum_{A \in \mathcal{A}} p(X, A \mid Y).$$

To simplify notation, let’s drop the conditioning on $Y$; it should be
understood to be present in every probability below. With two assumptions we
can write down the standard HMM:

$$p(X) = \sum_{A \in \mathcal{A}} \prod_{t=1}^T p(x_t \mid a_t) \cdot p(a_t \mid a_{t-1})$$

The first assumption is the usual Markov property: the state
$a_t$ is conditionally independent of all historic states given the previous
state $a_{t-1}$. The second is that the observation $x_t$ is conditionally
independent of everything else given the current state $a_t$.

The graphical model for an HMM.

Now we can take just a few steps to transform the HMM into CTC and see how
the two models relate. First, let’s assume that the transition probabilities
$p(a_t \mid a_{t-1})$ are uniform. This gives

$$p(X) \propto \sum_{A \in \mathcal{A}} \prod_{t=1}^T p(x_t \mid a_t).$$
The HMM can be used with discriminative models which estimate $p(a \mid x)$.
To do this, we apply Bayes’ rule and rewrite the model as

$$p(X) \propto \sum_{A \in \mathcal{A}} \prod_{t=1}^T \frac{p(a_t \mid x_t)\, p(x_t)}{p(a_t)}.$$

If we assume a uniform prior over the states $a$ and condition on all of
$X$ instead of a single element at a time, we arrive at

$$p(X) \propto \sum_{A \in \mathcal{A}} \prod_{t=1}^T p(a_t \mid X).$$

The above equation is essentially the CTC loss function, assuming the set
$\mathcal{A}$ is the same. In fact, the HMM framework does not specify what
$\mathcal{A}$ should consist of. This part of the model can be designed on a
per-problem basis. In many cases the model doesn’t condition on $Y$ and the
set $\mathcal{A}$ consists of all possible length-$T$ sequences from the
output alphabet. In this case, the HMM can be drawn as an ergodic state
transition diagram in which every state connects to every other state. The
figure below shows this model with the alphabet, or set of unique hidden states,
as $\{a, b, c\}$.

In our case the transitions allowed by the model are strongly related to
$Y$. We want the HMM to reflect this. One possible model could
be a simple linear state transition diagram. The figure below shows this with
the same alphabet as before and $Y =$ [a, b]. Another commonly
used model is the Bakis or left-to-right HMM. In this model any
transition which proceeds from the left to the right is allowed.

Ergodic HMM: Any node can be either a starting or
final state.
Linear HMM: Start on the left, end on the right.
CTC HMM: The first two nodes are the starting
states and the last two nodes are the final states.

In CTC we augment the alphabet with $\epsilon$ and the HMM allows a
subset of the left-to-right transitions. The CTC HMM has two start
states and two accepting states.

One possible source of confusion is that the HMM model differs for any unique
$Y$. This is in fact standard in applications such as speech recognition. The
state diagram changes based on the output $Y$. However, the functions which
estimate the observation and transition probabilities are shared.

Let’s discuss how CTC improves on the original HMM model. First, we can think
of the CTC state diagram as a special-case HMM which works well for many
problems of interest. Incorporating the blank as a hidden state in the HMM
allows us to use the alphabet of $Y$ as the other hidden states. This model
also gives a set of allowed alignments which may be a good prior for some
problems.

Perhaps most importantly, CTC is discriminative. It models $p(Y \mid X)$
directly, an idea that’s been important in the past with other
discriminative improvements to HMMs.
Discriminative training lets us apply powerful learning algorithms like the
RNN directly towards solving the problem we care about.

Encoder-Decoder Models

The encoder-decoder is perhaps the most commonly used framework for sequence
modeling with neural networks. These models have an encoder
and a decoder. The encoder maps the input sequence $X$ into a
hidden representation. The decoder consumes the hidden representation and
produces a distribution over the outputs. We can write this as

$$\begin{aligned} H &= \textsf{encode}(X) \\ p(Y \mid X) &= \textsf{decode}(H). \end{aligned}$$

The $\textsf{encode}(\cdot)$ and
$\textsf{decode}(\cdot)$ functions are typically RNNs. The
decoder can optionally be equipped with an attention mechanism. The hidden
state sequence $H$ has the same number of time-steps as the
input, $T$. Sometimes the encoder subsamples the input. If the
encoder subsamples the input by a factor $s$ then
$H$ will have $T/s$ time-steps.

We can interpret CTC in the encoder-decoder framework. This is helpful to
understand the developments in encoder-decoder models that are applicable to
CTC and to develop a common language for the properties of these
models.

Encoder: The encoder of a CTC model can be just about any
encoder we find in commonly used encoder-decoder models. For example, the
encoder could be a multi-layer bidirectional RNN or a convolutional network.
There is one constraint on the CTC encoder that doesn’t apply to the others:
the input cannot be subsampled so much that $T/s$
is less than the length of the output.

Decoder: We can view the decoder of a CTC model as a simple
linear transformation followed by a softmax normalization. This layer should
project all $T$ steps of the encoder output
$H$ into the dimensionality of the output alphabet.
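
In this view, a CTC model might look something like the following sketch
(mine, not the article’s; it assumes PyTorch and leaves the feature pipeline
out):

```python
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    """Minimal CTC model: a bidirectional RNN encoder and a linear
    'decoder' projecting each time-step onto the output alphabet."""
    def __init__(self, input_dim, hidden_dim, vocab_size):
        super().__init__()
        # vocab_size output labels plus one blank token.
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.Linear(2 * hidden_dim, vocab_size + 1)

    def forward(self, x):                     # x: (batch, T, input_dim)
        h, _ = self.encoder(x)                # h: (batch, T, 2 * hidden_dim)
        logits = self.decoder(h)              # (batch, T, vocab_size + 1)
        return torch.log_softmax(logits, dim=-1)
```

The log-probabilities it returns play the role of the per time-step
distributions $p_t(a_t \mid X)$ consumed by the loss and the decoders above.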

We mentioned earlier that CTC makes a conditional independence assumption over
the characters in the output sequence. This is one of the big advantages that
other encoder-decoder models have over CTC: they can model the
dependence over the outputs. However, in practice, CTC is still more commonly
used in tasks like speech recognition, as we can partially overcome the
conditional independence assumption by including an external language model.

Practitioner’s Guide

So far we’ve mostly developed a conceptual understanding of CTC. Here we’ll go
through a few implementation tips for practitioners.

Software: Even with a solid understanding of CTC, the
implementation is difficult. The algorithm has several edge cases and a fast
implementation should be written in a lower-level programming language.
Open-source software tools make it much easier to get started; a minimal
PyTorch usage sketch follows the list below:

  • Baidu Research has open-sourced
    warp-ctc. The
    package is written in C++ and CUDA. The CTC loss function runs on either
    the CPU or the GPU. Bindings are available for Torch, TensorFlow and
    PyTorch.
  • TensorFlow has built-in
    CTC loss
    and CTC beam search
    functions for the CPU.
  • Nvidia also provides a GPU implementation of CTC in
    cuDNN versions 7 and up.
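
As an example, PyTorch now also ships a native loss, torch.nn.CTCLoss (not
listed above). A minimal training-step sketch might look like the following;
`model`, `batch_x`, `targets`, and the length tensors are placeholders
assumed to be defined elsewhere, with index 0 reserved for the blank:

```python
import torch

ctc_loss = torch.nn.CTCLoss(blank=0)

log_probs = model(batch_x)                  # (batch, T, vocab + 1) log-probabilities
log_probs = log_probs.permute(1, 0, 2)      # CTCLoss expects (T, batch, vocab + 1)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```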

Numerical Stability: Computing the CTC loss naively is
numerically unstable. One method to avoid this is to normalize the
$\alpha$’s at each time-step. The original publication
has more detail on this, including the adjustments to the gradient.
In practice this works well enough
for medium-length sequences but can still underflow for long sequences.
A better solution is to compute the loss function in log-space with the
log-sum-exp trick.

When computing the sum of two probabilities in log-space, use the identity

$$\log(e^a + e^b) = \max\{a, b\} + \log\left(1 + e^{-|a-b|}\right)$$

Inference should also be done in log-space using the log-sum-exp trick.
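
A small helper implementing this identity might look like the following
sketch, using the convention that $-\infty$ represents a zero probability:

```python
import math

def log_add(a, b):
    """Stable log(e^a + e^b) for summing log-probabilities."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```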

Beam Search: There are a couple of good tips to know about
when implementing and using the CTC beam search.

The correctness of the beam search can be tested as follows.

  1. Run the beam search algorithm on an arbitrary input.
  2. Save the inferred output $\bar{Y}$ and the corresponding score $\bar{c}$.
  3. Compute the actual CTC score $c$ for $\bar{Y}$.
  4. Check that $\bar{c} \approx c$, with $\bar{c}$ no greater than $c$ (the
     beam only accounts for a subset of the alignments of $\bar{Y}$). As the
     beam size increases, the two scores should grow closer.

A common question when using a beam search decoder is the size of the beam
to use. There is a trade-off between accuracy and runtime. We can check if the
beam size is in a good range. To do this, first compute the CTC score for the
inferred output, $c_i$. Then compute the CTC score for the ground-truth
output, $c_g$. If the two outputs are not the same and $c_g > c_i$, the beam
search made a search error and a larger beam is likely to help. If instead
$c_g \le c_i$, the errors are due to the model rather than the search, and
increasing the beam size won’t help.


