Target of this paper: a new backbone architecture (Mamba) for foundation models.
Modern FMs are predominantly based on a single type of sequence model: the Transformer and its core attention layer.
The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data.
Drawbacks: an inability to model anything outside of a finite window, and quadratic scaling of computation with respect to the window length.
An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks, but often at the expense of the very properties that make it effective.
Recently, structured state space sequence models (SSMs) have emerged as a promising class of architectures for sequence modeling.
However, they have been less effective at modeling discrete and information-dense data such as text.
We propose a new class of selective state space models that improves on prior work on several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.
Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties that make them suitable as the backbone of general foundation models operating on sequences.
Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are broadly related to RNNs, CNNs, and classical state space models.
They are inspired by a particular continuous system (1) that maps a 1-dimensional function or sequence x(t) ∈ ℝ to y(t) ∈ ℝ through an implicit latent state h(t) ∈ ℝ^N: h'(t) = A h(t) + B x(t), y(t) = C h(t).
Concretely, S4 models are defined with four parameters (∆, A, B, C), which define a sequence-to-sequence transformation in two stages.
The first stage transforms the continuous parameters (∆, A, B) into discrete parameters (Ā, B̄) through fixed formulas Ā = f_A(∆, A) and B̄ = f_B(∆, A, B); the pair (f_A, f_B) is called a discretization rule.
Various rules can be used, such as the zero-order hold (ZOH) defined in equation (4): Ā = exp(∆A), B̄ = (∆A)⁻¹(exp(∆A) − I) · ∆B.
After the parameters have been transformed from (∆, A, B, C) to (Ā, B̄, C), the model can be computed in two equivalent ways: as a linear recurrence (2) or as a global convolution (3).
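As a concrete illustration (not the paper's implementation; the function name and the diagonal-A shapes are my own assumptions), ZOH discretization for a diagonal state matrix looks like:

```python
# Minimal sketch of zero-order hold (ZOH) discretization for a diagonal SSM
# (illustrative only; names and shapes are assumptions, not the paper's code).
import numpy as np

def discretize_zoh(delta, A, B):
    """delta: scalar step size; A, B: length-N vectors (A holds the diagonal of the state matrix).
    Returns A_bar = exp(delta*A) and B_bar = (delta*A)^-1 (exp(delta*A) - 1) * delta*B."""
    dA = delta * A
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar
```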
Linear recurrence (2): h_t = Ā h_{t−1} + B̄ x_t, y_t = C h_t
Global convolution (3): K̄ = (C B̄, C Ā B̄, …, C Ā^k B̄, …), y = x ∗ K̄
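Both views produce the same output; a quick NumPy check for a single-channel diagonal SSM (an illustrative sketch with made-up shapes, not the paper's code):

```python
# Check that the linear recurrence (2) and the global convolution (3) agree
# for a single-channel diagonal SSM (illustrative sketch, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32
A = -np.abs(rng.normal(size=N))          # stable diagonal A
B = rng.normal(size=N)
C = rng.normal(size=N)
delta = 0.1
x = rng.normal(size=L)

# ZOH discretization, equation (4)
dA = delta * A
A_bar = np.exp(dA)
B_bar = (A_bar - 1.0) / dA * (delta * B)

# (2) linear recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

# (3) global convolution with kernel K_bar_k = C A_bar^k B_bar
K_bar = np.array([C @ (A_bar**k * B_bar) for k in range(L)])
y_conv = np.array([sum(K_bar[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

assert np.allclose(y_rec, y_conv)
```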
LTI (linear time-invariant):
An important property of equations (1) to (3) is that the model's dynamics are constant through time.
All structured SSMs have been LTI (e.g. computed as convolutions) because of fundamental efficiency constraints.
However, a core insight of this work is that LTI models have fundamental limitations in modeling certain types of data, and our technical contributions involve removing the LTI constraint while overcoming the efficiency bottlenecks.
Argument: a fundamental problem of sequence modeling is compressing context into a smaller state.
Summary: The efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state.
One method of incorporating a selection mechanism into models is to let the parameters that affect interactions along the sequence be input-dependent.
Earlier works attempted to incorporate special cases of selection.
However, as previously mentioned, a core limitation in the usage of SSMs is their computational efficiency, which is why S4 and all of its derivatives used LTI (non-selective) models, most commonly in the form of global convolutions.
The selection mechanism is designed to overcome the limitations of LTI models.
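A rough sketch of what "input-dependent parameters" means in practice, loosely following the paper's description of selective ∆, B, C (the projection weights and shapes below are placeholder assumptions, not trained parameters):

```python
# Sketch of how selection makes (Delta, B, C) functions of the input x.
# Shapes: Bsz batch, L sequence length, D channels, N state size.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(0)
Bsz, L, D, N = 2, 16, 8, 4
x = rng.normal(size=(Bsz, L, D))

W_B = rng.normal(size=(D, N)) / np.sqrt(D)   # s_B(x) = Linear_N(x)
W_C = rng.normal(size=(D, N)) / np.sqrt(D)   # s_C(x) = Linear_N(x)
W_dt = rng.normal(size=(D, 1)) / np.sqrt(D)  # s_Delta(x) = Broadcast_D(Linear_1(x))
dt_bias = np.zeros(D)                        # learnable Delta bias (placeholder)

B_sel = x @ W_B                              # (Bsz, L, N): input-dependent B
C_sel = x @ W_C                              # (Bsz, L, N): input-dependent C
Delta = softplus((x @ W_dt) + dt_bias)       # (Bsz, L, D): tau_Delta = softplus

# Unlike the LTI case, these parameters now vary along the sequence axis L,
# so the model can no longer be computed as a single global convolution.
```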
The main idea is to leverage properties of modern accelerators (GPUs) to materialize the state h only in more efficient levels of the memory hierarchy.
Concretely, instead of preparing the scan input (Ā, B̄) of size (B, L, D, N) in GPU HBM, the SSM parameters (∆, A, B, C) are loaded directly from slow HBM into fast SRAM, the discretization and recurrence are performed in SRAM, and only the final outputs of size (B, L, D) are written back to HBM.
To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm.
Finally, we must also avoid saving the intermediate states, which are needed for backpropagation; the classic technique of recomputation is applied instead, so the intermediate states are not stored but recomputed in the backward pass when the inputs are loaded from HBM to SRAM.
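To see why a parallel scan applies at all, here is a small NumPy sketch: the per-step update h_t = a_t·h_{t−1} + b_t has an associative combine operator, so a prefix scan reproduces the sequential recurrence. This only illustrates the idea; the paper's kernel is a fused, work-efficient GPU implementation.

```python
# Sketch: the selective recurrence h_t = a_t * h_{t-1} + b_t can be computed with
# a parallel prefix scan because the pairwise combine below is associative,
# so elements can be grouped in any order (e.g. tree-structured on a GPU).
import numpy as np

def combine(left, right):
    # Compose affine maps: apply (a1, b1) first, then (a2, b2): h -> a2*(a1*h + b1) + b2
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def sequential_scan(a, b):
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def hillis_steele_scan(a, b):
    # Inclusive prefix scan using the associative combine (O(L log L) work;
    # the paper's kernel uses a work-efficient variant fused with the SSM step).
    acc_a, acc_b = a.copy(), b.copy()
    L, shift = len(a), 1
    while shift < L:
        new_a, new_b = acc_a.copy(), acc_b.copy()
        for t in range(shift, L):
            new_a[t], new_b[t] = combine((acc_a[t - shift], acc_b[t - shift]),
                                         (acc_a[t], acc_b[t]))
        acc_a, acc_b = new_a, new_b
        shift *= 2
    return acc_b  # with h_{-1} = 0, the accumulated additive term equals h_t

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, 64), rng.normal(size=64)
assert np.allclose(sequential_scan(a, b), hillis_steele_scan(a, b))
```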
Properties
When N = 1, A = −1, B = 1, s_∆ = Linear(x), and τ_∆ = softplus, the selective SSM recurrence reduces to the classic gated RNN update: g_t = σ(Linear(x_t)), h_t = (1 − g_t) h_{t−1} + g_t x_t.
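A quick numeric sanity check of that reduction, assuming the ZOH formulas above (not from the paper's code):

```python
# With A = -1, B = 1, and Delta = softplus(z), ZOH gives
# A_bar = exp(-softplus(z)) = 1 - sigmoid(z) and B_bar = 1 - exp(-softplus(z)) = sigmoid(z),
# so the step h_t = A_bar*h_{t-1} + B_bar*x_t is exactly the gated update above.
import numpy as np

z = np.linspace(-6, 6, 101)          # z = Linear(x_t), treated as a free variable here
delta = np.log1p(np.exp(z))          # softplus
sigmoid = 1.0 / (1.0 + np.exp(-z))

A_bar = np.exp(-delta)               # exp(Delta * A) with A = -1
B_bar = 1.0 - np.exp(-delta)         # (Delta*A)^-1 (exp(Delta*A) - 1) * Delta*B with A = -1, B = 1

assert np.allclose(A_bar, 1.0 - sigmoid)
assert np.allclose(B_bar, sigmoid)
```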
For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols (Goel et al., 2022).
The architecture is a UNet with alternating S4 and MLP blocks, which we consider replacing with Mamba.
efficacy: /ˈefəkəsi/
context window: the maximum sequence length that a transformer can process at a time
Complexity (per autoregressive step):
* Transformer = linear time
* RNN = constant time
This presumably refers to the cost of computing the next step: an RNN only updates from its current state, so each step takes constant time, whereas a Transformer attends over the entire context.
Lacking an explanation of the motivation?
Hyena /haɪˈiːnə/