Date: 2024-11-14 08:53

ECE 498/598 Fall 2024, Homeworks 3 and 4

Remarks:

1. HW3&4: You can reduce the context length to 32 if you are having trouble with the training time.

2. HW3&4: During test evaluation, note that positional encodings for unseen/long context are not trained. You are supposed to evaluate it as is. It is OK if it doesn’t work well.

3. HW3&4: Comments are an important component of the HW grade. You are expected to explain the experimental findings. If you don’t provide technically meaningful comments, you might receive a lower score even if your code and experiments are accurate.

4. The deadline for HW3 is November 11th at 11:59 PM, and the deadline for HW4 is November 18th at 11:59 PM. For each assignment, please submit both your code and a PDF report that includes your results (figures) for each question. You can generate the PDF report from a Jupyter Notebook (.ipynb file) by adding comments in markdown cells.

The objective of this assignment is to compare the transformer architecture with SSM-type architectures (specifically Mamba [1]) on the associative recall problem. We provide example code, recall.ipynb, which contains an example implementation using a 2-layer transformer. You will adapt this code to incorporate different positional encodings, use Mamba layers, or modify dataset generation.

Background: As you recall from class, associative recall (AR) assesses two abilities of the model: the ability to locate relevant information and the ability to retrieve the context around that information. The AR task can be understood via the following example: given the input prompt X = [a 1 b 2 c 3 b], we wish the model to locate where the last token b occurs earlier and output the associated value Y = 2. This is crucial for memory-related tasks or bigram retrieval (e.g. ‘Baggins’ should follow ‘Bilbo’).

To proceed, let us formally define the associative recall task we will study in the HW.

Definition 1 (Associative Recall Problem) Let Q be the set of target queries with cardinality |Q| = k. Consider a discrete input sequence X of the form X = [. . . q v . . . q] where the query q appears exactly twice in the sequence and the value v follows the first appearance of q. We say the model f solves AR(k) if f(X) = v for all sequences X with q ∈ Q.

The induction head is a special case of the definition above where the query q is fixed (i.e. Q is a singleton). The induction head is visualized in Figure 1. At the other extreme, we can ask the model to solve AR for all queries in the vocabulary.
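As a concrete reading of Definition 1, the target that the model is asked to produce can be computed by a simple lookup. The sketch below is only illustrative: the function name ar_target and its tau argument (the offset between the query and its value, 1 in the basic definition) are hypothetical and not part of recall.ipynb.

```python
def ar_target(tokens, tau=1):
    """Return the token that sits tau positions after the first
    occurrence of the final (query) token."""
    query = tokens[-1]
    first = tokens.index(query)  # earliest occurrence of the query
    if first + tau >= len(tokens) - 1:
        raise ValueError("no value token between the first and the final query")
    return tokens[first + tau]


# The prompt from the background paragraph: the last token b first occurs
# at position 2, so the associated value is the token right after it.
print(ar_target(["a", 1, "b", 2, "c", 3, "b"]))  # 2
```

A lookup like this can serve as a sanity check on generated labels before any model is trained.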

Problem Setting

Vocabulary: Let [K] = {1, . . . , K} be the token vocabulary. Obtain the embedding of the vocabulary by randomly generating a K × d matrix V with IID N(0, 1) entries, then normalizing its rows to unit length. Here d is the embedding dimension. The embedding of the i-th token is V[i]. Use numpy.random.seed(0) to ensure reproducibility.
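A minimal NumPy sketch of this step is shown here. The helper name make_embeddings is hypothetical, and because NumPy arrays are 0-indexed, the sketch assumes that token i of [K] = {1, . . . , K} maps to row i − 1.

```python
import numpy as np


def make_embeddings(K, d, seed=0):
    """Return a K x d matrix with IID N(0, 1) entries and unit-length rows."""
    np.random.seed(seed)  # seed 0, as requested, for reproducibility
    V = np.random.randn(K, d)
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # normalize each row
    return V


V = make_embeddings(K=16, d=8)
print(V.shape)  # (16, 8)
```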

Experimental variables: For the AR task, Q will simply be the first M elements of the vocabulary. During experiments, K, d, and M are under our control. Besides these, we will also vary two other variables:

• Context length: We will train these models up to context length L. However, we will evaluate with up to 3L. This tests the generalization of the model to unseen lengths.

• Delay: In the basic AR problem, the value v immediately follows q. Instead, we will introduce a delay variable τ, where v appears τ tokens after q. τ = 1 is the standard setting.

Models: The motivation behind this HW is to reproduce the results in the Mamba paper. However, we will also go beyond their evaluations and identify weaknesses of both transformer and Mamba architectures. Specifically, we will consider the following models in our evaluations:

Figure 1: We will work on the associative recall (AR) problem. The AR problem requires the model to retrieve the value associated with all queries, whereas the induction head requires the same for a specific query. Thus, the latter is an easier problem. The figure above is directly taken from the Mamba paper [1]. The yellow-shaded regions highlight the focus of this homework.

• Transformer: We will use the transformer architecture with 2 attention layers (no MLP). We will try the following positional encodings: (i) learned PE (provided code), (ii) Rotary PE (RoPE), (iii) NoPE (no positional encoding).

• Mamba: We will use the Mamba architecture with 2 layers.

• Hybrid Model: We will use an initial Mamba layer followed by an attention layer. No positional encoding is used.

Hybrid architectures are inspired by the Mamba paper as well as [2], which observes the benefit of starting the model with a Mamba layer. You should use public GitHub repos to find implementations (e.g. RoPE encoding or Mamba layer). As a suggestion, you can use this GitHub Repo for the Mamba model.
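For orientation only, the idea behind rotary PE can be sketched in a few lines of NumPy: each consecutive pair of feature dimensions is rotated by an angle proportional to the token position, so the inner product between a rotated query and a rotated key depends only on their relative offset. This is a sketch under assumptions, not the implementation from any particular repository: the function name rope, the pairing of adjacent dimensions, and base = 10000 are choices made here for illustration.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (shape (T, d), d even)
    by angles positions[t] * base**(-2i/d) for pair index i."""
    x = np.asarray(x, dtype=float)
    positions = np.asarray(positions, dtype=float)
    T, d = x.shape
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    freqs = base ** (-2.0 * np.arange(d // 2) / d)  # (d/2,)
    angles = positions[:, None] * freqs[None, :]    # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


# The attention score depends only on the relative offset between positions.
rng = np.random.default_rng(0)
q, k = rng.standard_normal((1, 8)), rng.standard_normal((1, 8))


def score(m, n):
    return np.sum(rope(q, [m]) * rope(k, [n]))


print(np.isclose(score(3, 1), score(10, 8)))  # True
```

In an attention layer the rotation is applied to queries and keys (not values) before the dot product. Since no position parameters are learned, there is nothing position-specific that goes untrained when evaluating at lengths beyond L.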

Generating training dataset: During training, you train with minibatch SGD (e.g. with batch size 64) until satisfactory convergence. You can generate the training sequences for AR as follows given (K, d, M, L, τ):

1. Training sequence length is equal to L.

2. Sample a query q ∈ Q and a value v ∈ [K] uniformly at random, independently. Recall that the size of Q is |Q| = M.

3. Place q at the end of the sequence and place another q at an index i chosen uniformly at random from 1 to L − τ − 1, so that the value at index i + τ does not overwrite the final q.

4. Place value token at the index i + τ.

5. Sample other tokens IID from [K] − {q}, i.e. other tokens are drawn uniformly at random but are not equal to q.

6. Set label token Y = v.
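The six steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the generator used in recall.ipynb: the function names are hypothetical, tokens are 0-indexed (so Q is {0, . . . , M − 1}), a numpy.random.Generator is used for sampling, and the first query position is drawn so that the value slot i + τ never coincides with the final query position. Following step 2 literally, v is drawn from the full vocabulary and may therefore equal q.

```python
import numpy as np


def sample_ar_sequence(K, M, L, tau, rng):
    """Return (tokens, label) for one AR sequence of length L.

    Tokens are 0-indexed integers in {0, ..., K-1}; Q = {0, ..., M-1}.
    """
    if L < tau + 2:
        raise ValueError("need L >= tau + 2")
    q = int(rng.integers(0, M))  # step 2: query from Q
    v = int(rng.integers(0, K))  # step 2: value from the full vocabulary
    others = np.array([t for t in range(K) if t != q])
    tokens = rng.choice(others, size=L)    # step 5: filler tokens != q
    i = int(rng.integers(0, L - tau - 1))  # step 3: first query position
    tokens[i] = q
    tokens[i + tau] = v                    # step 4: value after the delay
    tokens[L - 1] = q                      # step 3: query at the end
    return tokens, v                       # step 6: label Y = v


def sample_ar_batch(batch_size, K, M, L, tau, rng):
    """Stack batch_size independent sequences into X (B, L) and y (B,)."""
    pairs = [sample_ar_sequence(K, M, L, tau, rng) for _ in range(batch_size)]
    X = np.stack([tokens for tokens, _ in pairs])
    y = np.array([label for _, label in pairs])
    return X, y


rng = np.random.default_rng(0)
X, y = sample_ar_batch(64, K=16, M=1, L=32, tau=1, rng=rng)
print(X.shape, y.shape)  # (64, 32) (64,)
```

The integer tokens can then be mapped to embeddings by indexing the rows of V.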

Test evaluation: The test dataset is generated in the same way as above. However, we will evaluate on all sequence lengths from τ + 2 to 3L. Note that τ + 2 is the shortest possible sequence.
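One way to organize this evaluation is a loop over sequence lengths that records the accuracy at each length, starting from the shortest possible sequence τ + 2. The sketch below is model-agnostic and purely illustrative: accuracy_by_length, predict_fn and sample_fn are hypothetical names, and the toy sampler and oracle predictor exist only to show the interface. In the homework, predict_fn would wrap a trained model and sample_fn the test data generator.

```python
def accuracy_by_length(predict_fn, sample_fn, L, tau, n_per_length=100):
    """Return {length: accuracy} for every length from tau + 2 to 3L.

    predict_fn(tokens) -> predicted token
    sample_fn(length)  -> (tokens, label)
    """
    results = {}
    for length in range(tau + 2, 3 * L + 1):
        correct = 0
        for _ in range(n_per_length):
            tokens, label = sample_fn(length)
            correct += int(predict_fn(tokens) == label)
        results[length] = correct / n_per_length
    return results


# Toy stand-ins (tau = 1): query 0 at the start and the end, value 5 after it.
def toy_sample(length):
    tokens = [1] * length
    tokens[0], tokens[1], tokens[-1] = 0, 5, 0
    return tokens, 5


def oracle(tokens):
    # Perfect lookup: the token right after the first occurrence of the query.
    return tokens[tokens.index(tokens[-1]) + 1]


acc = accuracy_by_length(oracle, toy_sample, L=4, tau=1, n_per_length=2)
print(min(acc), max(acc))  # 3 12
```

Plotting the returned dictionary against length gives the accuracy curves, with lengths above L showing length generalization.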

Empirical Evidence from Mamba Paper: Table 2 of [1] demonstrates that Mamba can do a good job on the induction head problem, i.e. AR with a single query. Additionally, Mamba is the only model that exhibits length generalization, that is, even if you train it up to context length L, it can still solve AR for context lengths beyond L. On the other hand, since Mamba is inherently a recurrent model, it may not solve the AR problem in its full generality. This motivates the question: What are the tradeoffs between Mamba and transformer, and can hybrid models help improve performance over both?

Your assignments are as follows. For each problem, make sure to submit the associated code. The code for each problem can be placed in separate cells (clearly commented) of a single Jupyter/Python file.

Grading structure:

• Problem 1 will count as your HW3 grade. This only involves Induction Head experiments (i.e. M = 1).

• Problems 2 and 3 will count as your HW4 grade.

• You will make a single submission.

Problem 1 (50=25+15+10pts). Set K = 16, d = 8, L = 32 or L = 64.

• Train all models on the induction heads problem (M = 1, τ = 1). After training, evaluate the test performance and plot the accuracy of all models as a function of the context length (similar to Table 2 of [1]). In total, you will be plotting 5 curves (3 Transformers, 1 Mamba, 1 Hybrid). Comment on the findings and compare the performance of the models including length generalization ability.

• Repeat the experiment above with delay τ = 5. Comment on the impact of delay.

• Which models converge faster during training? Provide a plot of the convergence rate where the x-axis is the number of iterations and the y-axis is the AR accuracy over a test batch. Make sure to specify the batch size you are using (ideally use 32 or 64).

Problem 2 (30pts). Set K = 16, d = 8, L = 32 or L = 64. We will train Mamba, Transformer with RoPE, and Hybrid. Set τ = 1 (standard AR).

• Train Mamba models for M = 4, 8, 16. Note that M = 16 is the full AR (retrieve any query). Comment on the results.

• Train Transformer models for M = 4, 8, 16. Comment on the results and compare them against Mamba’s behavior.

• Train the Hybrid model for M = 4, 8, 16. Comment and compare.

Problem 3 (20=15+5pts). Set K = 16, d = 64, L = 32 or L = 64. We will only train Mamba models.

• Set τ = 1 (standard AR). Train Mamba models for M = 4, 8, 16. Compare against the corresponding results of Problem 2. How does the embedding dimension d impact the results?

• Train a Mamba model for M = 16 with τ = 10. Comment on any differences.

References

[1] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[2] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can Mamba learn how to learn? A comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024.
