NH-LoRA

Neuromorphic Horizontal LoRA : Future-Aware Structural Expansion of Low-Rank Adapters for Rehearsal-Free Class-Incremental Learning
Hartmann Kanisius && Syauqi Nabil Tasril 30 min read (maybe) · Apr 11, 2026

Preface

Tl;dr :

We propose a new LoRA architecture called NH-LoRA, a PEFT method for Continual Learning, specifically for image classification. We also report its performance on Continual Learning tasks in terms of final average accuracy and forgetting.

Hello! So... this project is a collaboration between me (Hartmann) and Syauqi. The new architecture was inspired by neuromorphic computing [4], DEN [6], and the CL-LoRA method [2]. The problem domain itself, Class-Incremental Learning, is relatively new and rarely discussed (at least at the time I wrote this). But we believe it could be very useful for existing models, especially when they need to be maintained over time and learn new knowledge without forgetting the old. That is what we want to address.


Oh, btw, this was made as our KCVanguard Final Project, part of the KCV Lab Recruitment. Maybe we'll make an Indonesian version of this (and I'll put it here). Because this is a new architecture, there will be a really, really large number of equations here. We'll try to explain them in a comprehensive, structured, and (hopefully) easy-to-understand way.


For those who are only curious about the architecture, you can skip to here


Problem Statement

Image Classification

Image Classification is simply the task of giving a label to an image. Every image classification model shares this common idea :

(Incremental) Image Classification

In image classification, a model is trained on a fixed set of classes, so it only learns to recognize classes within that domain.

hmmm.... What if we want to add a new class later ?

Well, we can fine-tune or retrain the model. Let's say I choose to fine-tune because I don't have time to retrain. The model learns the new concept well, but it may lose some of its ability to recognize the old ones at the same time. This is known as forgetting, a common problem in fine-tuning.

So, incremental learning can be viewed as a special task in the fine-tuning domain [1]. We add new classes over time while trying to preserve performance on previously learned classes. (Btw, we use the term incremental here, but it's similar to continual learning.)

Formally, we can think of the learning process as a sequence of tasks $T_1, T_2, \dots, T_t$, where each task $T_i$ introduces a new set of classes $C_i$.

After training on task t, the model should still be able to correctly classify images from all classes seen so far:

$$C_{\text{total}} = C_1 \cup C_2 \cup \dots \cup C_t$$

In short, this diagram should help the intuition :

To deal with incremental image classification, we use a Vision Transformer model. There are other models too, but we use this one because it's the same model that CL-LoRA uses, so the benchmark can be done fairly.

Vision Transformer (ViT)

A Vision Transformer (ViT) is a Transformer for images; it works by treating image patches as tokens.

Transformers? Thought that was for NLP, iirc? ... Yep, but the idea can actually be used for images as well. I think it's better to explain some concepts here to make sure we are on the same page ....

Transformer

Transformers are models that process a sequence by letting each element compare itself with the others, instead of reading everything in a fixed (left-to-right) order [8]. The Transformer is almost synonymous with attention (imo, since these two terms are often mentioned together).

... attention, like how much we put attention into something ?

Attention

I think it's better to explain this with a sentence :

The animal didn't cross the street because it was too tired

In Natural Language Processing, that sentence would be converted into tokens. These can be whole words, subwords, or character pieces, depending on the tokenizer. For this example, let's say the sentence is converted into these tokens :

The_ | animal_ | didn_ | '_ | t_ | cross_ | the_ | street_ | because_ | it_ | was_ | too_ | tire | d_

Based on the famous paper [8], each token gets projected into three vectors, Query (Q), Key (K), and Value (V), and the attention output is computed as :

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Now, we will use some math to find out how much it_ "pays attention" to animal_ compared to street_.


Some Math

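To do so, let's assume these hypothetical vector values that a trained model might output for these words :

$$\begin{aligned}
Q_{\text{it}} &= [0.6,\ 0.3,\ 0.4,\ 0.8] \\
K_{\text{animal}} &= [0.5,\ 0.4,\ 0.3,\ 0.7] \\
K_{\text{street}} &= [0.2,\ 0.1,\ 0.5,\ 0.3] \\
K_{\text{was}} &= [0.3,\ 0.1,\ 0.1,\ 0.5]
\end{aligned}$$

Assume $d_k = 4$, so the scaling factor is:

$$\sqrt{d_k} = \sqrt{4} = 2$$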

First, compute the raw scores from it_ :

Attending to animal_:

$$Q_{\text{it}} \cdot K_{\text{animal}} = (0.6)(0.5) + (0.3)(0.4) + (0.4)(0.3) + (0.8)(0.7) = 0.30 + 0.12 + 0.12 + 0.56 = 1.10$$

Attending to street_:

$$Q_{\text{it}} \cdot K_{\text{street}} = (0.6)(0.2) + (0.3)(0.1) + (0.4)(0.5) + (0.8)(0.3) = 0.12 + 0.03 + 0.20 + 0.24 = 0.59$$

Attending to was_:

$$Q_{\text{it}} \cdot K_{\text{was}} = (0.6)(0.3) + (0.3)(0.1) + (0.4)(0.1) + (0.8)(0.5) = 0.18 + 0.03 + 0.04 + 0.40 = 0.65$$

Then, we apply scaling :

$$\left[\frac{1.10}{2},\ \frac{0.59}{2},\ \frac{0.65}{2}\right] = [0.55,\ 0.295,\ 0.325]$$

Apply softmax :

$$\text{softmax}([0.55,\ 0.295,\ 0.325]) \approx [0.39,\ 0.30,\ 0.31]$$

So it_ puts roughly 39% of its attention weight on animal_, 31% on was_, and 30% on street_.

The output embedding of it_ becomes a weighted blend of the value vectors, with the strongest contribution coming from animal_.
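If you want to check the arithmetic yourself, here is a minimal sketch in plain Python. It only reuses the four hypothetical vectors from this example and restricts the softmax to the three keys shown (a real model would normalize over every token in the sentence). The helper names `dot` and `softmax` are ours.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query/key vectors from the worked example.
q_it = [0.6, 0.3, 0.4, 0.8]
keys = {
    "animal_": [0.5, 0.4, 0.3, 0.7],
    "street_": [0.2, 0.1, 0.5, 0.3],
    "was_": [0.3, 0.1, 0.1, 0.5],
}
d_k = len(q_it)

# Raw scores, scaled scores, then attention weights.
raw = {tok: dot(q_it, k) for tok, k in keys.items()}
scaled = [score / math.sqrt(d_k) for score in raw.values()]
weights = dict(zip(raw, softmax(scaled)))

for tok in keys:
    print(f"{tok:8s} raw={raw[tok]:.2f} weight={weights[tok]:.2f}")
```

Because the scaled scores are close to each other, the softmax output is fairly flat; animal_ still gets the largest share.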

At the end, it will be something like this :

 
(A reminder that those are dummy numbers for visualization purposes)

(Vision) Transformer

Alright, now we know how to "count attention" with words. How exactly is that applied to images? ...

Patch Embedding

Vision Transformer works by cutting the image into patches and treating each patch like a token in a sentence. For example, a 224×224 image with patch size 16×16 :

$$\text{Number of patches} = \frac{224}{16} \times \frac{224}{16} = 14 \times 14 = 196 \text{ patches}$$

Each patch is then flattened and mapped into a D-dimensional embedding using a linear projection (a vector like [w, x, y, z], as in the attention section). This is called patch embedding.
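To make the "cut into patches, then flatten" step concrete, here is a small sketch in plain Python. It works on a toy single-channel image stored as a list of rows; the function names and the RGB default of 3 channels are our assumptions, and the learned linear projection is left out.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

def flattened_patch_dim(patch_size, channels=3):
    """Length of one flattened patch before the linear projection."""
    return patch_size * patch_size * channels

def patchify(image, patch_size):
    """Split a square 2D image (list of rows) into flattened patches, row-major."""
    size = len(image)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches

n = num_patches(224, 16)          # the 224x224 / 16x16 case from the text
dim = flattened_patch_dim(16)     # pixels per patch, assuming RGB input

# Toy 4x4 image with pixel values 0..15, cut into 2x2 patches.
toy = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = patchify(toy, 2)
```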



Positional Encoding

We have the embeddings, so now we start computing attention scores? Not quite.

One problem with Transformers is that attention itself has no notion of input order. If you shuffled all the tokens randomly and fed them in... well, it still runs, but the model gets no information about where each token came from. It can't tell the difference between an image of an orange and a scrambled orange puzzle.

With text, this is easy to picture because words are ordered, like how we read them from left to right. Images don't have a natural "reading order" like words do. Patch number 7 isn't inherently "after" patch 6 in any meaningful sense. But a patch containing something like "orange and round" can mean very different things depending on whether it appears in the top-left corner, the center, or the bottom-right. So, to solve this, we add positional embeddings.

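The final embedding $z_0$ is derived from this calculation :

$$z_0 = [x_{\text{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{\text{pos}}$$

With :

$x_{\text{class}}$ the learnable class token, $x_p^i$ the $i$-th flattened patch, $E$ the patch projection matrix, $N$ the number of patches, and $E_{\text{pos}}$ the positional embeddings, following the ViT paper [7].

The reasoning here .. is basically to make the embeddings distinguishable by position.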


Some Math

Let's take an example. Imagine we are looking at a picture of a landscape. Patch 1 is the top-left corner (blue sky), and Patch 16 is the bottom-right corner (which happens to be a blue lake). Visually, they might look identical.

Let's say our embedding dimension is 4:


If we stopped here, the model would think these two patches are the exact same thing in the exact same context. Now, let's add the positional embeddings (Epos):


Ohh? Even though the visual pixels were identical, the final embeddings are different, based on the position.
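Here is the same idea as a tiny sketch in plain Python. All numbers are made up purely for illustration (they are not taken from any trained model): two patches share the same 4-dimensional patch embedding, and each gets its own positional embedding added on top.

```python
# Hypothetical 4-dimensional patch embeddings: sky and lake look identical.
patch_sky = [0.2, 0.8, 0.1, 0.5]    # patch 1, top-left
patch_lake = [0.2, 0.8, 0.1, 0.5]   # patch 16, bottom-right

# Hypothetical positional embeddings, one per position.
pos_1 = [0.1, 0.0, 0.3, -0.2]
pos_16 = [-0.3, 0.4, 0.0, 0.2]

def add_position(patch_embedding, position_embedding):
    """Element-wise sum of a patch embedding and its positional embedding."""
    return [p + e for p, e in zip(patch_embedding, position_embedding)]

z_sky = add_position(patch_sky, pos_1)
z_lake = add_position(patch_lake, pos_16)

print("same before position:", patch_sky == patch_lake)
print("same after position: ", z_sky == z_lake)
```

The first comparison is true and the second is false, which is exactly the effect we want from $E_{\text{pos}}$.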


Finally, this embedding becomes the input to the Transformer architecture, with its attention mechanism, MLP head, yadda yadda...

In short, may this diagram help the intuition :


NH-LoRA

Now, let's see the proposed architecture. Ehhh, wait, I think I still need to explain some terms before that.

Preface

PEFT (Parameter-Efficient Fine-Tuning)

PEFT is the "family of techniques" that fine-tune a model by updating only a small number of parameters. In many cases, these methods can achieve performance comparable to fine-tuning all model parameters. So almost the same result in less time, a win-win solution.

There are really a lot of them (well, it's a family of techniques). I think this figure should give a good overview of PEFT [3].

[Figure: overview of PEFT methods]

We won't explain all of them (that would be a burden), so let's just focus on LoRA.

LoRA

Okay, so LoRA stands for Low-Rank Adaptation, a technique that approximates the update to a weight matrix with a "low-rank decomposition" [11].

... what does that mean? Let's start with a diagram for better intuition :

As shown in diagram, we call the original pre-trained weights as W. The intuition was, during training (or fine-tuning, to be specific), we want to find a change to these weights, which we'll call ΔW. Instead of learning that ΔW matrix directly, we freeze the original weights W and approximate ΔW by multiplying two much smaller matrices together, A and B.


Some Math
Let's take an example. Suppose the "ideal" weight update ΔW from standard backprop looks like this 3×3 matrix:

$$\Delta W = \begin{bmatrix} 2 & 4 & 6 \\ 3 & 6 & 9 \\ 4 & 8 & 12 \end{bmatrix}$$

Normally, that's 9 separate parameters we have to update and train. But if we look closely, there's a pattern. Every row is essentially just a multiple of the sequence [1,2,3]. Because of this redundancy (which actually happens a lot in neural networks during adaptation), we can represent this exact same matrix by taking the outer product of a 3×1 column matrix (A) and a 1×3 row matrix (B) at a rank of r=1 :

$$A = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$$

So, if we multiply A×B, we get the exact same ΔW back, right ? By doing so, we only train the parameters inside A and B, which is 6 parameters instead of 9.

In short, we want to optimize A and B such that AB is as close as possible to the actual update ΔW. (The LoRA paper [11] names the factors the other way around and writes the update as BA, which is also the convention used in the architecture section; only the naming differs.)
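As a quick sanity check, the sketch below multiplies the two small factors from this example in plain Python and compares parameter counts. The `lora_param_count` helper and the 768×768, rank-8 layer are our illustrative assumptions, not a configuration taken from this work.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[2], [3], [4]]   # 3x1 column factor from the example
B = [[1, 2, 3]]       # 1x3 row factor from the example
delta_w = matmul(A, B)

def full_param_count(d_out, d_in):
    """Parameters in a dense update of shape d_out x d_in."""
    return d_out * d_in

def lora_param_count(d_out, d_in, rank):
    """Parameters in the two low-rank factors."""
    return rank * (d_out + d_in)

print(delta_w)
print(full_param_count(3, 3), "vs", lora_param_count(3, 3, 1))
# Illustrative larger layer (assumed sizes): savings grow with dimension.
print(full_param_count(768, 768), "vs", lora_param_count(768, 768, 8))
```

For the 3×3 toy case the saving is small (9 vs 6), but it grows quickly as the layer gets wider while the rank stays small.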


Architecture

Now, let's see the proposed architecture (this time fr).

Btw, some important terms that we'll use a lot here :

About slots: each slot serves as a compact representation of a learned subspace and is used to generate low-rank factors. Specifically, the slot is used to parameterize the LoRA matrices :

$$A(s) = f_A(s), \qquad B(s) = f_B(s)$$

where $s$ is the slot embedding, and $f_A(\cdot)$ and $f_B(\cdot)$ are learned mappings.

These factors are then used to construct the low-rank update :

$$\Delta W(s) = B(s)\,A(s)$$

This mechanism allows each slot to dynamically control the low-rank adaptation applied to the model.


Task-State Encoder

Tasks

Remember when we talked about tasks in incremental image classification? Now.. what is a task, actually?

A task is simply a group or subset of labels. Let's say we have the full set of classes :

$$C = \{c_1, c_2, \dots, c_K\}$$

Then, these classes are divided into several task groups:

$$C_1, C_2, \dots, C_T$$

with the condition:

$$C_i \cap C_j = \emptyset \quad \text{for } i \neq j$$

This means that each class belongs to only one task.

Warmup

Suppose task $t$ has the following dataset:

$$D_t = \{(x_i, y_i)\}_{i=1}^{n_t}, \qquad y_i \in C_t$$

Warm-up takes a small initial batch:

$$B_t = \{(x_i, y_i)\}_{i=1}^{m}, \qquad m \ll n_t$$

Let

$$B_t^{\text{warm}} = \{(x_n, y_n)\}_{n=1}^{N_w}$$

be the sample set from the warm-up phase.

With $f(\cdot)$ denoting the feature extractor from the frozen backbone and the currently active adapter, the task mean feature is defined as :

$$\mu_t = \frac{1}{N_w} \sum_{n=1}^{N_w} f(x_n).$$

The feature dispersion is summarized using diagonal covariance or per-dimension variance:

$$\sigma_t = \frac{1}{N_w} \sum_{n=1}^{N_w} \left(f(x_n) - \mu_t\right)^{\odot 2},$$

where $(\cdot)^{\odot 2}$ denotes element-wise squaring.


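To compute a lightweight gradient sketch over trainable parameters during warm-up, we use

$$g_t = \left[\ \|\nabla_{\Delta W_1} \mathcal{L}_{\text{warm}}\|_2,\ \|\nabla_{\Delta W_2} \mathcal{L}_{\text{warm}}\|_2,\ \dots,\ \|\nabla_{\Delta W_M} \mathcal{L}_{\text{warm}}\|_2\ \right],$$

where $\mathcal{L}_{\text{warm}}$ is the warm-up loss and $\{\Delta W_m\}_{m=1}^{M}$ denotes the set of monitored trainable adapter modules.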


To compare the current task with previous experience, a similarity signal is computed from the history bank. If

$$\mathcal{H}_{t-1} = \{h_1, \dots, h_{t-1}\}$$

is the summary of previous tasks, then the similarity score is

$$s_t = \begin{cases} 0, & t = 1, \\ \max_{i \in \{1, \dots, t-1\}} \cos\left(\psi_s(\mu_t), h_i\right), & t > 1. \end{cases}$$

Here, $\psi_s(\cdot)$ is a lightweight projection so that the current mean feature lies in the same space as the historical representations.


We also measure initial uncertainty through the average prediction entropy:

$$e_t = \frac{1}{N_w} \sum_{n=1}^{N_w} H\left(p_\theta(y \mid x_n)\right),$$

where $H(\cdot)$ denotes Shannon entropy.

All of these statistics are then combined into a raw task vector:

$$h_t = [\text{pool}(\mu_t);\ \text{pool}(\sigma_t);\ g_t;\ s_t;\ e_t].$$


So, to summarize :

Input image pixels and labels → compute the task mean feature ($\mu_t$), feature dispersion ($\sigma_t$), similarity ($s_t$), entropy ($e_t$), and gradient sketch ($g_t$) → encode all of them into the History Task vector $h_t$, which is the input to the Horizon Planner.
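To get a feel for these warm-up statistics, here is a rough sketch in plain Python on toy numbers. It assumes the projection $\psi_s$ is the identity and that the history bank stores vectors in the same space as the mean feature; both are our simplifications. The gradient sketch $g_t$ is skipped because it needs an autograd framework.

```python
import math

def mean_and_variance(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mu = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mu[d]) ** 2 for f in features) / n for d in range(dim)]
    return mu, var

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def similarity_score(mu, history):
    """0 for the first task, otherwise the max cosine similarity to past summaries."""
    if not history:
        return 0.0
    return max(cosine(mu, h) for h in history)

def mean_entropy(prob_rows):
    """Average Shannon entropy (in nats) over predicted distributions."""
    entropies = [-sum(p * math.log(p) for p in row if p > 0) for row in prob_rows]
    return sum(entropies) / len(entropies)

# Toy warm-up batch: two samples with 2-dimensional features.
features = [[1.0, 2.0], [3.0, 4.0]]
mu_t, sigma_t = mean_and_variance(features)
s_first = similarity_score(mu_t, [])
s_later = similarity_score(mu_t, [[1.0, 0.0], [2.0, 3.0]])
e_t = mean_entropy([[0.5, 0.5], [1.0, 0.0]])
```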


Horizon Planner

The Horizon Planner receives two histories: the Current History Task and the Previous History Task.

We know that the Current History Task is $h_t$ from the previous section. But how do we calculate the Previous History Task ?

Let's say the tasks before the current task are as follows :

$$\mathcal{H}_{t-1} = \{h_1, h_2, \dots, h_{t-1}\}.$$

To make the planner compare the current task not only with a single past task but with the full history, NH-LoRA uses a history aggregation module conditioned on $h_t$. For each new task, a query is built from the current state, while keys and values are built from all elements in the history bank:

$$q_t = W_q h_t, \qquad k_i = W_k h_i, \qquad v_i = W_v h_i,$$

where $W_q$, $W_k$, and $W_v$ are learned projections.


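The attention weight for each historical task is computed as

$$a_i^{(t)} = \frac{\exp\left(q_t^\top k_i / \sqrt{d_a}\right)}{\sum_{j=1}^{t-1} \exp\left(q_t^\top k_j / \sqrt{d_a}\right)},$$

and the aggregated history summary is defined as

$$\tilde{h}_t = \sum_{i=1}^{t-1} a_i^{(t)} v_i.$$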
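The aggregation is ordinary scaled dot-product attention with a single query, so it can be sketched in a few lines of plain Python. For simplicity this sketch takes the projected query, keys, and values as given (in other words it assumes $W_q$, $W_k$, $W_v$ have already been applied); the function name is ours.

```python
import math

def aggregate_history(query, keys, values):
    """Attention-weighted summary of past task vectors for one query."""
    d_a = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_a)
              for key in keys]
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    summary = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, summary

# Toy history with two past tasks; the query matches the first one.
q_t = [1.0, 0.0]
past_keys = [[1.0, 0.0], [0.0, 1.0]]
past_values = [[1.0, 0.0], [0.0, 1.0]]
attn, h_tilde = aggregate_history(q_t, past_keys, past_values)
```

The past task whose key is most aligned with the current query gets the largest weight, so its value dominates the summary.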



For layer $l$, the planner forms a combined representation from the current task state and the aggregated history. Here the current task state $h_t$ is denoted as $z_t$, and $\tilde{h}_t$ is the aggregated history :

$$u_l^t = [z_t;\ \tilde{h}_t;\ |z_t - \tilde{h}_t|;\ z_t \odot \tilde{h}_t;\ e_l],$$

where $e_l$ is the layer embedding that identifies each layer, and $\odot$ denotes element-wise multiplication.


Next, the planner maps this input into a latent representation:

$$r_l^t = \phi_l(u_l^t),$$

where $\phi_l(\cdot)$ is a small MLP specific to layer $l$.



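From the latent representation $r_l^t$, the Horizon Planner predicts four main signals:

$$\begin{aligned}
\nu_l^t &= \sigma\left(w_{\nu,l}^\top r_l^t + b_{\nu,l}\right), \\
\xi_l^t &= \sigma\left(w_{\xi,l}^\top r_l^t + b_{\xi,l}\right), \\
\rho_l^t &= r_{\min} + (r_{\max} - r_{\min})\,\sigma\left(w_{\rho,l}^\top r_l^t + b_{\rho,l}\right), \\
\kappa_l^t &= \sigma\left(w_{\kappa,l}^\top r_l^t + b_{\kappa,l}\right),
\end{aligned}$$

where $\sigma(\cdot)$ is the sigmoid function, while $w$ and $b$ denote the learned weights and biases for the corresponding output heads.

$\nu_l^t \in (0,1)$ is the novelty score
$\xi_l^t \in (0,1)$ is the conflict score
$\rho_l^t \in [r_{\min}, r_{\max}]$ is the rank budget
$\kappa_l^t \in (0,1)$ is the consolidation signal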
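The four heads are just linear layers followed by a sigmoid, with the rank head rescaled into the allowed rank range. A minimal sketch in plain Python, where the head names, the weights, and the range 1 to 16 are all hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def planner_heads(latent, heads, r_min=1, r_max=16):
    """Apply each (weights, bias) head to the latent vector.

    `heads` maps a head name to (weights, bias). The "rank_budget" head is
    rescaled from (0, 1) into [r_min, r_max]; the others stay in (0, 1).
    """
    out = {}
    for name, (weights, bias) in heads.items():
        pre_activation = sum(w * r for w, r in zip(weights, latent)) + bias
        out[name] = sigmoid(pre_activation)
    out["rank_budget"] = r_min + (r_max - r_min) * out["rank_budget"]
    return out

# Hypothetical latent vector and zero-initialized heads.
latent = [0.3, -0.2]
zero_heads = {name: ([0.0, 0.0], 0.0)
              for name in ("novelty", "conflict", "rank_budget", "consolidation")}
signals = planner_heads(latent, zero_heads)
```

With zero weights every sigmoid outputs 0.5, so the rank budget lands in the middle of its range.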

Decision

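Based on the values of novelty and conflict, the planner selects one of four actions :

$$A_l^t = \begin{cases}
\texttt{reuse\_shared\_and\_small\_update}, & \nu_l^t < \tau_\nu \ \wedge\ \xi_l^t < \tau_\xi, \\
\texttt{expand\_rank\_existing\_slot}, & \nu_l^t \geq \tau_\nu \ \wedge\ \xi_l^t < \tau_\xi, \\
\texttt{open\_new\_slot}, & \nu_l^t \geq \tau_\nu \ \wedge\ \xi_l^t \geq \tau_\xi, \\
\texttt{freeze\_old\_strong\_retention}, & \nu_l^t < \tau_\nu \ \wedge\ \xi_l^t \geq \tau_\xi.
\end{cases}$$

Here, $\nu_l^t$ represents the novelty score and $\xi_l^t$ represents the conflict score for layer $l$ at task $t$. The thresholds $\tau_\nu$ and $\tau_\xi$ determine whether the current task is considered sufficiently new or sufficiently conflicting with previous knowledge.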
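The decision rule is a simple two-threshold lookup, sketched below in plain Python. The default thresholds of 0.5 are placeholders, and treating "at or above the threshold" as novel or conflicting is our assumption about the boundary case.

```python
def select_action(novelty, conflict, tau_nu=0.5, tau_xi=0.5):
    """Map (novelty, conflict) scores to one of the four planner actions."""
    is_novel = novelty >= tau_nu
    is_conflicting = conflict >= tau_xi
    if is_novel and is_conflicting:
        return "open_new_slot"
    if is_novel:
        return "expand_rank_existing_slot"
    if is_conflicting:
        return "freeze_old_strong_retention"
    return "reuse_shared_and_small_update"

for nu, xi in [(0.2, 0.1), (0.8, 0.1), (0.8, 0.9), (0.2, 0.9)]:
    print(f"novelty={nu}, conflict={xi} -> {select_action(nu, xi)}")
```

Intuitively: a familiar, harmless task reuses what exists; a new but compatible task gets more rank; a new and conflicting task gets its own slot; a familiar but conflicting task triggers protection of old knowledge.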

Let's explain these terms :


Expand rank existing slot

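The planner chooses the old slot that is most compatible with the current task. Let $c_{l,s}$ denote the centroid or summary representation of slot $s$ in layer $l$. The target slot is defined as :

$$s_l^\star = \arg\max_{s \in S_l} \text{Aff}(z_t, c_{l,s}),$$


where $\text{Aff}(\cdot, \cdot)$ is cosine similarity.

The rank of the selected slot is then updated as

$$r_{l,s_l^\star}^{(t)} = \min\left(r_{l,s_l^\star}^{(t-1)} + \Delta r_l^t,\ r_{\max}\right),$$

with

$$\Delta r_l^t = \max\left(1,\ \hat{r}_l^t - r_{l,s_l^\star}^{(t-1)}\right).$$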


Open New Slot

To keep parameter growth under control, each layer $l$ has a maximum number of slots:

$$|S_l| \leq S_{\max}.$$

If the planner chooses open_new_slot but the slot budget is already full, that decision is redirected to expand_rank_existing_slot. In that case, the planner reuses the target slot selected for expand_rank_existing_slot instead of creating a new one.

If a new slot is allowed, it is initialized with :

$$r_{l,\text{new}}^{(t)} = \hat{r}_l^t.$$
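The rank update and the slot-budget fallback can be sketched in plain Python as follows. The function names and the example numbers are ours; `target_rank` plays the role of $\hat{r}_l^t$.

```python
def expand_rank(current_rank, target_rank, r_max):
    """Grow a slot's rank by at least 1, toward target_rank, capped at r_max."""
    delta = max(1, target_rank - current_rank)
    return min(current_rank + delta, r_max)

def resolve_open_new_slot(num_slots, s_max):
    """Redirect open_new_slot to rank expansion when the slot budget is full."""
    if num_slots < s_max:
        return "open_new_slot"
    return "expand_rank_existing_slot"

print(expand_rank(4, 8, 16))    # jumps straight to the target rank
print(expand_rank(8, 6, 16))    # target below current: still grows by 1
print(expand_rank(15, 32, 16))  # capped at r_max
print(resolve_open_new_slot(3, 4), resolve_open_new_slot(4, 4))
```

Note the `max(1, ...)`: once the planner decides to expand, the slot always gains at least one rank dimension, even if the predicted target is below the current rank.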


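Reuse Shared

This action reuses the shared core LoRA and updates it according to the rank budget, which is mostly low. So the final weight will be something like :

$$W_l^{\text{eff}}(x, t) = W_l^0 + \beta_l^t\,\Delta W_l^{\text{shared}}$$

The shared LoRA components at layer $l$ are defined as follows :

$$\Delta W_l^{\text{shared}} = B_l^{\text{shared}}\,\mathrm{diag}\left(m_l^{\text{shared}}\right)A_l^{\text{shared}}, \qquad A_l^{\text{shared}} \in \mathbb{R}^{r_{\max} \times d_{\text{in}}}, \quad B_l^{\text{shared}} \in \mathbb{R}^{d_{\text{out}} \times r_{\max}}, \quad m_l^{\text{shared}} \in \{0, 1\}^{r_{\max}}$$

Here, $m_l^{\text{shared}}$ is a binary mask over the $r_{\max}$ rank dimensions, so only the masked-in dimensions contribute to the update.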


Strong Retention

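This action also reuses the shared core LoRA, like Reuse Shared, but without any rank update on $\Delta W_l^{\text{shared}}$; in other words, $\Delta W_l^{\text{shared}}$ is frozen :

$$W_l^{\text{eff}}(x, t) = W_l^0 + \beta_l^t\,\Delta W_l^{\text{shared}}$$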



Instance Router

During the forward pass, the router does not activate all slots at the same time in the task t. Instead, it selects only a small number of the most relevant slots for each input instance. For a sample x at layer l, the router first builds an instance query :

$$q_l(x) = \phi_l\left(\text{Pool}(H_l(x))\right),$$

where $H_l(x)$ is the token-level representation at layer $l$, $\text{Pool}(\cdot)$ is a pooling operation over tokens, and $\phi_l(\cdot)$ is a learned lightweight projection.

Based on this query, the active slot set is defined as :

$$S_l^{\text{act}}(x) = \text{TopK}_{s \in S_l}\ \cos\left(q_l(x), k_{l,s}\right),$$

with

$$|S_l^{\text{act}}(x)| = K_{\text{route}}, \qquad K_{\text{route}} \leq |S_l|.$$

This means that only the top-$K_{\text{route}}$ slots with the highest similarity to the query are used for that input instance.

The routing coefficient is then computed only over the selected slots using a TopK-Softmax :

$$\alpha_{l,s}(x) = \text{TopKSoftmax}\left(\frac{\cos\left(q_l(x), k_{l,s}\right)}{\tau}\right), \qquad s \in S_l^{\text{act}}(x),$$

where $\tau$ is the routing temperature. These coefficients determine how much each active slot contributes to the final update.

At the end, the effective weight used by the block for task $t$, at sample $x$, is :

$$W_l^{\text{eff}}(x, t) = W_l^0 + \beta_l^t\,\Delta W_l^{\text{shared}} + \sum_{s \in S_l^{\text{act}}(x)} \alpha_{l,s}(x)\,\Delta W_{l,s}^{\text{slot}}.$$
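The routing step (pick the top-K slots by cosine similarity, then softmax only over those) can be sketched in plain Python like this. The slot keys, the query, and the temperature of 0.1 are made-up values for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def route(query, slot_keys, k_route, temperature=0.1):
    """Return {slot_index: coefficient} for the k_route most similar slots."""
    sims = [cosine(query, key) for key in slot_keys]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    active = ranked[:k_route]
    logits = [sims[i] / temperature for i in active]
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {i: e / total for i, e in zip(active, exps)}

# Three hypothetical slot keys; only the best two are activated.
slot_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coeffs = route([1.0, 0.0], slot_keys, k_route=2)
print(coeffs)
```

Slots outside the top-K get no coefficient at all, so their low-rank updates are never computed for that sample.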


Incremental Cosine Head

In continual learning, classes arrive step by step, so the classifier head cannot stay fixed. After task t, the model must predict all classes seen so far :

$$C_{1:t} = C_1 \cup C_2 \cup \dots \cup C_t.$$


This means the classifier head must grow incrementally as new classes appear. A usual linear classifier computes logits as

$$\ell_c = w_c^\top f + b_c,$$


where $f$ is the feature embedding, $w_c$ is the class weight, and $b_c$ is the bias term.


This formulation can be unstable because feature norms may shift across tasks, new classes may dominate old ones, and the bias term can strengthen the imbalance between recent and old classes. As a result, the classifier often becomes biased toward new tasks and suffers from forgetting at the head level.

To address this problem, NH-LoRA uses a cosine classifier, adapted from the unified classifier [5]. Instead of relying on feature magnitude, it compares the direction of the feature and class weight vectors:

$$\ell_c = s \cdot \frac{f^\top w_c}{\|f\|\,\|w_c\|} = s \cdot \cos(f, w_c),$$


where $s$ is a learnable or fixed scale factor. For task $t$, the classifier head contains all classes seen so far:

$$W_{\text{cls}}^{(t)} = \left[w_1, w_2, \dots, w_{|C_{1:t}|}\right].$$


When a new task arrives, the head is expanded by appending new class weights:

$$W_{\text{cls}}^{(t)} = \left[W_{\text{cls}}^{(t-1)};\ W_{\text{new}}^{(t)}\right].$$


For an input $x$, the probability of class $c$ is then computed with softmax over all seen classes:

$$p(y = c \mid x) = \frac{\exp(\ell_c)}{\sum_{j \in C_{1:t}} \exp(\ell_j)}.$$


In this way, the head grows incrementally while the scoring rule stays consistent across tasks.

This makes the decision depend on angular similarity instead of vector norm. This is also useful when the backbone representation is modified by adapters such as LoRA, because those updates may change feature scale across tasks.
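A toy version of such a growing cosine head, in plain Python, might look like the sketch below. The class name, the scale of 10, and the weight vectors are illustrative assumptions; a real head would learn the weights.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

class CosineHead:
    """Classifier head whose logits are scaled cosine similarities."""

    def __init__(self, scale=10.0):
        self.scale = scale
        self.class_weights = []          # one weight vector per seen class

    def expand(self, new_class_weights):
        """Append weight vectors for the classes of a new task."""
        self.class_weights.extend(list(w) for w in new_class_weights)

    def logits(self, feature):
        return [self.scale * cosine(feature, w) for w in self.class_weights]

    def probabilities(self, feature):
        logits = self.logits(feature)
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

head = CosineHead(scale=10.0)
head.expand([[1.0, 0.0], [0.0, 1.0]])     # task 1: two classes
small = head.logits([2.0, 0.0])
large = head.logits([20.0, 0.0])          # same direction, 10x the norm
head.expand([[1.0, 1.0]])                 # task 2: one more class
probs = head.probabilities([2.0, 0.0])
```

The two feature vectors differ only in norm, and the logits come out identical, which is the property the head relies on when adapters shift the feature scale.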


Consolidation and Homeostasis Unit


After a task is completed, NH-LoRA performs consolidation. For each slot, the model computes its utility $u_{l,s}$, stability $\psi_{l,s}$, and redundancy $r_{l,s}^{\text{red}}$ with respect to the other slots in the same block, as well as the shared core.


The decision to merge, prune, or keep/freeze the slot is based on usage, stability, redundancy, and the consolidation flag.


The usage of slot $s$ at layer $l$ is defined as the average routing coefficient over the task dataset:

$$u_{l,s}^{(t)} = \frac{1}{N_t} \sum_{x \in D_t} \alpha_{l,s}(x).$$


The stability of a slot is defined from the magnitude of its parameter change during task $t$:

$$\psi_{l,s}^{(t)} = \exp\left(-\frac{\left\|\Delta W_{l,s}^{\text{end}} - \Delta W_{l,s}^{\text{start}}\right\|_F}{\left\|\Delta W_{l,s}^{\text{start}}\right\|_F + \epsilon}\right).$$


The redundancy of a slot with respect to other slots in the same block is computed as

$$r_{l,s}^{\text{red}} = \max_{u \neq s} \cos\left(\text{vec}(\Delta W_{l,s}),\ \text{vec}(\Delta W_{l,u})\right).$$


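Based on these three quantities, the post-task decision is defined as:

$$\begin{aligned}
\text{merge}(l,s) &\iff u_{l,s}^{(t)} \geq \tau_u^{+} \ \wedge\ \psi_{l,s}^{(t)} \geq \tau_\psi \ \wedge\ c_l^t = 1, \\
\text{prune}(l,s) &\iff u_{l,s}^{(t)} \leq \tau_u^{-} \ \wedge\ r_{l,s}^{\text{red}} \geq \tau_r,
\end{aligned}$$

and if neither condition is satisfied, then the slot is kept or frozen:

$$\text{keep/freeze}(l,s) \iff \neg\,\text{merge}(l,s) \ \wedge\ \neg\,\text{prune}(l,s).$$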
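As a rough sketch of this post-task logic in plain Python: all threshold values below are hypothetical, and we read the rule as "merge slots that are heavily used, stable, and flagged for consolidation; prune slots that are barely used and redundant", with separate high and low usage thresholds. That reading of the thresholds is our assumption.

```python
import math

def stability(change_norm, start_norm, eps=1e-8):
    """exp(-relative change): 1.0 means the slot did not move during the task."""
    return math.exp(-change_norm / (start_norm + eps))

def consolidation_decision(usage, stab, redundancy, consolidate_flag,
                           tau_u_high=0.3, tau_u_low=0.05,
                           tau_psi=0.8, tau_r=0.9):
    """Return "merge", "prune", or "keep/freeze" for one slot."""
    if usage >= tau_u_high and stab >= tau_psi and consolidate_flag:
        return "merge"
    if usage <= tau_u_low and redundancy >= tau_r:
        return "prune"
    return "keep/freeze"

print(consolidation_decision(0.50, stability(0.1, 1.0), 0.10, True))
print(consolidation_decision(0.01, 0.50, 0.95, False))
print(consolidation_decision(0.20, 0.90, 0.10, True))
```

As long as the high usage threshold sits above the low one, a slot can never satisfy both the merge and the prune condition at once.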


Loss Function

The loss function consists of seven terms. The first four are adapted from the CL-LoRA training objective [2], and the last three are additional regularizers designed from related LoRA adaptation [9,10].

The training objective in NH-LoRA is defined as a weighted combination of classification loss, distillation loss, feature retention loss, subspace regularization, rank regularization, structural growth penalty, and routing regularization:

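$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\text{kd}} \mathcal{L}_{\text{kd}} + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}} + \lambda_{\text{orth}} \mathcal{L}_{\text{orth}} + \lambda_{\text{rank}} \mathcal{L}_{\text{rank}} + \lambda_{\text{grow}} \mathcal{L}_{\text{grow}} + \lambda_{\text{route}} \mathcal{L}_{\text{route}}$$

For the first task, only a subset of these losses is active:

$$\mathcal{L}^{(1)} = \mathcal{L}_{\text{cls}} + \lambda_{\text{rank}} \mathcal{L}_{\text{rank}} + \lambda_{\text{route}} \mathcal{L}_{\text{route}} + \lambda_{\text{orth}} \mathcal{L}_{\text{orth}}^{(1)}$$

while $\mathcal{L}_{\text{kd}}^{(1)} = 0$, $\mathcal{L}_{\text{feat}}^{(1)} = 0$, and $\mathcal{L}_{\text{grow}}^{(1)} = 0$.

As usual, we'll explain these terms :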


Classification Loss

The classification loss is defined as the cross-entropy between the target label $y$ and the model prediction $\hat{y}$:

$$\mathcal{L}_{\text{cls}} = \text{CE}(y, \hat{y}).$$

This loss trains the model to correctly predict the labels of the current task.


Logit Distillation Loss

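To preserve knowledge about previously learned classes, NH-LoRA uses a teacher model given by the snapshot after task $t-1$. The logit distillation loss is defined as:

$$\mathcal{L}_{\text{kd}} = \text{KL}\left(p_{\theta_{t-1}}(y_{\text{old}} \mid x; T)\ \big\|\ p_\theta(y_{\text{old}} \mid x; T)\right),$$

where $T$ here is the distillation temperature. This loss is applied to current-task samples using a Learning without Forgetting (LwF)-style retention scheme [12].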


Feature Retention Loss

In addition to preserving logits, the model is encouraged to keep intermediate representations stable at selected layers:

$$\mathcal{L}_{\text{feat}} = \sum_{l \in L_{\text{ret}}} \left\|h_l^{t}(x) - h_l^{t-1}(x)\right\|_2^2.$$

Here, $L_{\text{ret}}$ denotes the set of layers monitored for feature retention.


Orthogonality Loss Across Slots

To prevent new slots from learning subspaces that are too similar to existing ones, NH-LoRA applies an orthogonality regularizer:

$$\mathcal{L}_{\text{orth}} = \sum_{l} \sum_{s \neq u} \left\|A_{l,s} A_{l,u}^\top\right\|_F^2$$

This loss encourages diversity among slots within each layer.


Rank Regularization

To promote compact and efficient capacity usage, the active rank of each slot is regularized by:

$$\mathcal{L}_{\text{rank}} = \sum_{l,s} \left\|m_{l,s}\right\|_1.$$

This loss encourages the number of active low-rank dimensions to remain small.


Growth Penalty

To avoid uncontrolled structural expansion, opening a new slot is penalized by:

$$\mathcal{L}_{\text{grow}} = \sum_{l} \mathbb{1}\left[\text{new slot at } l\right]$$

As a result, new slots are created only when they are truly needed by the current task.


Router Balance

To prevent the router from repeatedly selecting the same slot, NH-LoRA uses a routing balance regularizer:

$$\mathcal{L}_{\text{route}} = \sum_{l} \text{KL}\left(\bar{\alpha}_l\ \big\|\ \text{Uniform}\right).$$

Here, $\bar{\alpha}_l$ denotes the average routing distribution at layer $l$.
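Two of the structural regularizers are easy to compute by hand, so here is a small sketch in plain Python for a single layer. It assumes each slot's $A$ matrix is stored as a list of rows and that the sum over $s \neq u$ counts ordered pairs (so each pair of slots is counted twice); both choices are ours.

```python
import math

def orth_penalty(slot_matrices):
    """Sum over ordered slot pairs (s != u) of ||A_s A_u^T||_F^2."""
    total = 0.0
    for s, a_s in enumerate(slot_matrices):
        for u, a_u in enumerate(slot_matrices):
            if s == u:
                continue
            for row_s in a_s:
                for row_u in a_u:
                    # Each entry of A_s A_u^T is a dot product of two rows.
                    total += sum(x * y for x, y in zip(row_s, row_u)) ** 2
    return total

def route_balance(mean_routing):
    """KL(mean routing distribution || uniform), in nats."""
    n = len(mean_routing)
    return sum(p * math.log(p * n) for p in mean_routing if p > 0)

orthogonal = orth_penalty([[[1.0, 0.0]], [[0.0, 1.0]]])   # disjoint subspaces
overlapping = orth_penalty([[[1.0, 0.0]], [[1.0, 0.0]]])  # identical subspaces
balanced = route_balance([0.5, 0.5])
collapsed = route_balance([1.0, 0.0])
```

Orthogonal slots and perfectly balanced routing both give a penalty of zero; overlapping slots or a router that always picks the same slot are penalized.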


So, what is the purpose of each of these losses ?


Training

Overview

In this class incremental learning setup, the dataset is first preprocessed using PILOT to prepare the samples before they are split into a sequence of tasks. The model then learns each task step by step using a pretrained model with NH-LoRA, which helps it adapt to new classes while reducing catastrophic forgetting on previously learned ones.

After training on each incremental task, the model is evaluated using Average Incremental Accuracy, Final Average Accuracy, and Forgetting. These metrics show not only how well the model learns new classes, but also how much performance it retains on earlier classes throughout the incremental process.


[Figure: training pipeline]

Parameter config

The configuration (batch size, learning rate, etc.) can be found here. It contains the specific configuration for each dataset used in this work.


Evaluation

The evaluation of our model consists of two metrics :


Final Average Accuracy

This metric represents the average accuracy at the very end of training, after the model has learned all tasks. In other words, it measures how well the final model performs across every task or class that has appeared so far. This is the main score used to summarize the overall final performance of the continual learning process.

It is calculated as follows:

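$$\text{Final Average Accuracy} = \frac{1}{T} \sum_{i=1}^{T} A_{T,i}$$

where $T$ is the total number of tasks and $A_{T,i}$ is the accuracy on task $i$ after the model has been trained on all $T$ tasks.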


Forgetting

Forgetting measures how much performance on earlier tasks drops after the model learns new tasks. In continual learning, this is an important metric because a model may achieve high accuracy on the latest task while gradually losing knowledge from previous tasks. A lower forgetting value means the model preserves past knowledge better.

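$$\text{Forgetting} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left(A_i^{\text{best}} - A_{T,i}\right)$$

where $A_i^{\text{best}}$ is the best accuracy reached on task $i$ during the incremental process and $A_{T,i}$ is the accuracy on task $i$ after the final task.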

A higher value means more forgetting, while a lower value means better retention of previously learned tasks.
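Both metrics can be computed from a lower-triangular accuracy table, as in the plain-Python sketch below. The table values are invented for illustration (they are not our results), and taking the "best" accuracy over all steps before the final one is our assumption about how the best score is chosen.

```python
def final_average_accuracy(acc):
    """acc[t][i] = accuracy on task i after training on task t (0-based, i <= t)."""
    last_row = acc[-1]
    return sum(last_row) / len(last_row)

def forgetting(acc):
    """Average drop from each old task's best earlier accuracy to its final accuracy."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for i in range(num_tasks - 1):
        best = max(acc[t][i] for t in range(i, num_tasks - 1))
        drops.append(best - acc[-1][i])
    return sum(drops) / (num_tasks - 1)

# Invented accuracy table for three tasks.
acc_table = [
    [0.90],
    [0.80, 0.85],
    [0.70, 0.75, 0.90],
]
faa = final_average_accuracy(acc_table)
fgt = forgetting(acc_table)
print(f"final average accuracy = {faa:.3f}, forgetting = {fgt:.3f}")
```

In this toy table the first task drops from 0.90 to 0.70 and the second from 0.85 to 0.75, so the average forgetting is 0.15 even though the newest task scores 0.90.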

We ran 8 experiments (detailed explanation here), and here are the best results for each dataset :

[Figure: best results per dataset]


Conclusion

Well, as we can see up there, NH-LoRA does help the model learn new classes without either fully retraining from scratch or modifying all backbone parameters. The proposed adapter design is applied across all layers to preserve plasticity, which differs from methods such as CL-LoRA that rely on more specialized adapters in selected layers [2]. In this sense, the architecture is designed to adapt its structure according to task novelty and conflict, instead of using the same fixed adaptation strategy for every task. Therefore, this should improve flexibility and allow capacity to be allocated more effectively, especially when the incoming task has a different level of similarity to previous ones.

However, we still need to adjust our expectations for the results. The model still struggles to maintain old knowledge while learning new tasks, so the forgetting issue is still present. It may benefit from the Horizon Planner mechanism, yet the extra conditions and regularization also make the optimization harder if they are not well balanced.

We also have to remember that the results are likely affected by training configuration differences, such as task split, number of epochs, learning rate, and other hyperparameters. This needs to be examined further, as we don't have more budget for another experiment :( , and every experiment takes time (CIFAR alone needs around 5-6 hours per fine-tuning run).

Future Works

For future work, NH-LoRA would benefit from more extensive experimentation, especially on parameter tuning and task-setting variations. Since continual learning performance is highly sensitive to hyperparameters, a more systematic search over learning rate, rank budget, regularization strength, and task split could give a clearer picture of the method’s true potential.

References

[1] D.-W. Zhou et al., “Class-Incremental Learning: A Survey.” 2023. [Online]. Available: https://arxiv.org/abs/2302.03648

[2] J. He, Z. Duan, and F. Zhu, “CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning.” 2025. [Online]. Available: https://arxiv.org/abs/2505.24816

[3] V. Lialin et al., “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning.” 2023. [Online]. Available: https://arxiv.org/abs/2303.15647

[4] B. Jung et al., “Neuromorphic Computing - An Overview.” 2025. [Online]. Available: https://arxiv.org/abs/2510.06721v2

[5] S. Hou et al., “Learning a Unified Classifier Incrementally via Rebalancing.” 2019. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2019/html/Hou_Learning_a_Unified_Classifier_Incrementally_via_Rebalancing_CVPR_2019_paper.html

[6] J. Yoon et al., “Lifelong Learning with Dynamically Expandable Networks.” 2018. [Online]. Available: https://openreview.net/forum?id=Bk-aoer_-

[7] A. Dosovitskiy et al., “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” 2020. [Online]. Available: https://arxiv.org/abs/2010.11929

[8] A. Vaswani et al., “Attention Is All You Need.” 2017. [Online]. Available: https://arxiv.org/abs/1706.03762

[9] Z. Hu et al., “Low Rank Regularization: A review.” 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608021001468

[10] A. Rokah et al., “Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization.” 2026. [Online]. Available: https://arxiv.org/abs/2601.15021v1

[11] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” 2021. [Online]. Available: https://arxiv.org/abs/2106.09685

[12] Z. Li and D. Hoiem, “Learning without Forgetting.” 2016. [Online]. Available: https://arxiv.org/abs/1606.09282