Yao Lu1 Hao Cheng Yujie Fang1 Zeyu Wang1 Jiaheng Wei2
Dongwei Xu1 Qi Xuan1 Xiaoniu Yang1 Zhaowei Zhu
1Zhejiang University of Technology   2HKUST-GZ
yaolu.zjut@gmail.com. Yao Lu and Hao Cheng contributed equally. Corresponding author: xuanqi@zjut.edu.cn.
Abstract
Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading method for fine-tuning pruned models, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the remaining last three layers, yields remarkably strong performance. Following this guide, we prune Llama-3.1-8B-It and obtain a model that outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B and Baichuan2-7B. We release the optimal model weights on Hugging Face (https://huggingface.co/YaoLuzjut/Llama-3.1-6.3B-It-Alpaca and https://huggingface.co/YaoLuzjut/Llama-3.1-6.3B-It-Dolly), and the code is available on GitHub (https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning).
1 Introduction
In recent years, large language models (LLMs) have achieved unprecedented success in many fields, such as text generation(Achiam etal., 2023; Touvron etal., 2023), semantic analysis(Deng etal., 2023; Zhang etal., 2023b) and machine translation(Zhang etal., 2023a; Wang etal., 2023). However, these achievements come with massive resource consumption, posing significant challenges for deployment on resource-constrained devices. To address these challenges, numerous techniques have been developed to create more efficient LLMs, including pruning(Ma etal., 2023a; Sun etal., 2023), knowledge distillation(Xu etal., 2024; Gu etal., 2024), quantization(Lin etal., 2024; Liu etal., 2023), low-rank factorization(Saha etal., 2023; Zhao etal., 2024a), and system-level inference acceleration(Shah etal., 2024; Lee etal., 2024).
Among these methods, pruning has emerged as a promising solution to mitigate the resource demands of LLMs. By selectively removing redundant patterns—such as parameters (Sun et al., 2023), attention heads (Ma et al., 2023a) and layers (Men et al., 2024)—pruning aims to slim down the model while maintaining its original performance as much as possible. Among different types of pruning, layer pruning (Kim et al., 2024; Siddiqui et al., 2024) has garnered particular interest because it directly reduces the model's depth, thereby decreasing both computational complexity and memory usage. Additionally, thanks to the regular structure of existing LLMs such as Llama (Dubey et al., 2024), whose transformer blocks share exactly the same input and output dimensions, layer pruning is straightforward to apply. Therefore, in this paper, we focus on layer pruning. Unlike existing studies (Men et al., 2024; Yang et al., 2024b; Chen et al., 2024; Zhong et al., 2024; Liu et al., 2024b) that aim to propose various sophisticated pruning methods, we take a step back and focus on the following questions:
- Q1.
Layer Selection: Are fancy metrics essential for identifying redundant layers to prune?
- Q2.
Fine-Tuning: Is the LoRA family the best choice for post-pruning fine-tuning?
- Q3.
Pruning Strategy: Will iterative pruning outperform one-shot pruning?
To answer the aforementioned questions, we spent thousands of GPU hours to benchmark layer pruning, conducting extensive experiments across 7 layer selection metrics, 4 state-of-the-art open-source LLMs, 6 fine-tuning methods, 5 pruning strategies on 10 common datasets. From these efforts, we have developed a practical list of key insights for LLM layer pruning in Figure1:
- 1).
Reverse-order pruning is simple yet effective, i.e., simply pruning the last several layers performs better than many complex pruning metrics (Kim et al., 2024; Men et al., 2024).
- 2).
LoRA performs worse than expected, i.e., LoRA, the most commonly used fine-tuning method in existing pruning approaches (Sun et al., 2023; Ma et al., 2023b; Kim et al., 2024; Men et al., 2024), is not the best choice for post-pruning performance recovery. In contrast, freezing the other layers and fine-tuning only the last few remaining layers and the lm_head, also known as partial-layer fine-tuning, can achieve higher accuracy while reducing the training time. This result is unique to layer pruning, since LoRA and partial-layer fine-tuning perform similarly when fine-tuning the full model, as shown in Table 3.
- 3).
Iterative pruning offers no benefit, i.e., considering both training costs and performance gains, iterative pruning, where layers are removed step by step, fails to beat one-shot pruning, where a single cut is made.
In addition to the above practices, we also conduct sensitivity analyses on the number of calibration samples, the choice of Supervised Fine-Tuning (SFT) datasets and various pruning rates for LLM layer pruning. We find that the number of calibration samples affects the performance of data-driven pruning methods, highlighting the importance of considering performance stability as a key criterion when evaluating the quality of pruning metrics. Similarly, we discover that fine-tuning with different SFT datasets significantly impacts the performance of pruned models. This suggests the need for further exploration of the most suitable datasets for fine-tuning. Finally, we apply our insights and practices to prune Llama-3.1-8B-Instruct(Dubey etal., 2024), obtaining Llama-3.1-6.3B-It-Alpaca and Llama-3.1-6.3B-It-Dolly, as shown in Figure1. These pruned models require significantly fewer training tokens but outperform several popular community LLMs of similar size, such as ChatGLM2-6B(GLM etal., 2024), Vicuna-7B-v1.5(Zheng etal., 2024), Qwen1.5-7B(Yang etal., 2024a) and Baichuan2-7B(Baichuan, 2023). We hope our work will help guide future efforts in LLM layer pruning and inform best practices for deploying LLMs in real-world applications. In a nutshell, we make the following contributions:
- •
Comprehensive Benchmarking: We conduct an extensive evaluation of layer selection metrics, fine-tuning methods, and pruning strategies, providing practical insights into effective pruning techniques based on thousands of GPU hours across multiple datasets.
- •
Novel Best Practices: We identify reverse-order as a simple and effective layer selection metric, find that partial-layer fine-tuning outperforms LoRA-based techniques, and demonstrate that one-shot pruning is as effective as iterative pruning while reducing training costs.
- •
Optimized Pruned LLMs: We release Llama-3.1-6.3B-It-Alpaca and Llama-3.1-6.3B-It-Dolly, which are obtained through direct pruning of Llama-3.1-8B-Instruct. Our pruned models require significantly fewer training tokens than training from scratch, while still comparing favorably to various popular community LLMs of similar size, such as ChatGLM2-6B (GLM et al., 2024), Vicuna-7B-v1.5 (Zheng et al., 2024), Qwen1.5-7B (Yang et al., 2024a) and Baichuan2-7B (Baichuan, 2023).
2 Related Work
LLM Layer Pruning. LLM layer pruning is a technique used to reduce the number of layers in LLMs, aiming to lower computational costs without significantly degrading performance. Specifically, it evaluates the contribution of each layer to the model’s overall performance, using criteria such as gradients, activation values, parameter weights, or the layer’s influence on the loss function. Layers that contribute the least are then pruned to reduce complexity. For example, LaCo(Yang etal., 2024b) achieves rapid model size reduction by folding subsequent layers into the previous layer, effectively preserving the model structure. Similarly, MKA(Liu etal., 2024b) uses manifold learning and the Normalized Pairwise Information Bottleneck measure(Tishby etal., 2000) to identify the most similar layers for merging. ShortGPT(Men etal., 2024) uses Block Influence (BI) to measure the importance of each layer in LLMs and remove layers with low BI scores. Kim etal. (2024) utilize Magnitude, Taylor and Perplexity (PPL) to evaluate the significance of each layer.
Differences from Traditional Layer Pruning. Unlike traditional Deep Neural Networks (DNNs) (Szegedy et al., 2014; Simonyan & Zisserman, 2015; He et al., 2015; Dosovitskiy et al., 2021; Liu et al., 2021), which are typically trained for a single, specific task, LLMs are designed to handle a wide range of tasks and are structured with billions of parameters. These differences in model scale and task complexity fundamentally alter the challenges associated with layer pruning. For example, in traditional DNN layer pruning (Chen & Zhao, 2018; Wang et al., 2019; Lu et al., 2022; Tang et al., 2023; Guenter & Sideris, 2024), assessing the importance of each layer is relatively straightforward, as it is tied to a single task. In contrast, the parameters of LLMs are optimized across diverse tasks, complicating the evaluation of layer importance. Furthermore, traditional DNN pruning commonly involves full-parameter fine-tuning after pruning, while LLMs often employ Parameter-Efficient Fine-Tuning (PEFT) techniques (Hu et al., 2021; Meng et al., 2024; Zhao et al., 2024b; Dettmers et al., 2024) such as Low-Rank Adaptation (LoRA) (Hu et al., 2021) to accommodate their massive parameter space. Consequently, traditional DNN pruning methods may not adequately address the unique challenges posed by LLMs, highlighting the need for specialized pruning strategies.
Exploration of LLM Pruning. Although recent research focuses on developing sophisticated pruning methods(Kim etal., 2024; Ma etal., 2023a; Men etal., 2024; Liu etal., 2024c; b; Yang etal., 2024b; Zhong etal., 2024), few studies(Jaiswal etal., 2023; Williams & Aletras, 2024; Muralidharan etal., 2024) take a step back and revisit existing LLM pruning techniques. For example, Jaiswal etal. (2023) re-evaluate the effectiveness of existing state-of-the-art pruning methods with PPL. Williams & Aletras (2024) systematically investigate how the calibration dataset impacts the effectiveness of model compression methods. Muralidharan etal. (2024) develop a set of practical practices for LLMs that combine layer, width, attention and MLP pruning with knowledge distillation-based retraining. However, these methods either do not consider layer pruning or lack a comprehensive comparison. In contrast, we systematically validate different layer selection metrics, fine-tuning techniques, and pruning strategies to provide a thorough evaluation.
3 Background and Notation
3.1 Problem Formulation for Layer Pruning
An LLM $\mathcal{M}$ consists of multiple Transformer layers $\mathcal{L} = \{l_1, l_2, \dots, l_n\}$, each containing a pair of multi-head attention (MHA) and feed-forward network (FFN) modules:

$$\mathcal{M} = l_n \circ l_{n-1} \circ \cdots \circ l_1, \qquad l_i(\cdot) = \mathrm{FFN}_i\big(\mathrm{MHA}_i(\cdot)\big). \tag{1}$$
Layer pruning aims to find a subset of layers $\mathcal{L}' \subseteq \mathcal{L}$ such that the pruned model $\mathcal{M}'$ maintains acceptable performance while reducing the model's complexity, which can be formalized as:

$$\min_{\mathcal{L}'} \; C(\mathcal{M}') \quad \text{subject to} \quad P(\mathcal{M}) - P(\mathcal{M}') \le \epsilon, \tag{2}$$

where $C(\mathcal{M}')$ denotes the complexity of the pruned model, which can be quantified in terms of the number of parameters, FLOPs, or inference time, etc., $\epsilon$ is a hyperparameter that defines the acceptable performance degradation, and $P(\cdot)$ represents the performance on given tasks. Numerous methods have proposed various metrics to identify and prune unimportant layers. Herein, we include 7 popular metrics:
Random Selection. For the random selection baseline, we randomly select several layers to prune.
Reverse-order. This metric (Men et al., 2024) posits that importance is inversely proportional to the sequence order. It assigns lower importance scores to the deeper layers and prunes them.
Magnitude. It was first introduced by Li et al. (2016) and subsequently adopted by Kim et al. (2024), and assumes that weights exhibiting smaller magnitudes are less informative. Following Kim et al. (2024), we compute $I^{(n)}_{\text{Magnitude}} = \sum_{k} \big\| W^{(n)}_{k} \big\|_{p}$, where $W^{(n)}_{k}$ denotes the weight matrix of operation $k$ within the $n$-th transformer layer. In this paper, we uniformly set $p \in \{1, 2\}$. As a result, we term these methods Magnitude-l1 and Magnitude-l2.
Taylor. For a given calibration dataset $D$, the significance of removing weight parameters is indicated by the change in training loss $\mathcal{L}$. Following Ma et al. (2023a); Kim et al. (2024), we omit the second-order derivatives in this assessment. Then we define the Taylor score of the $n$-th transformer layer as $I^{(n)}_{\text{Taylor}} = \sum_{k} \big| W^{(n)}_{k} \, \frac{\partial \mathcal{L}(D)}{\partial W^{(n)}_{k}} \big|$.
PPL. Following Kim et al. (2024), we remove a single layer and assess its impact on the perplexity of the pruned model using the calibration dataset $D$. We then prune those layers that lead to a smaller degradation of the PPL.
BI. Men et al. (2024) introduce a metric called Block Influence (BI) as an effective indicator of layer importance. Specifically, the BI score of the $i$-th layer can be calculated as follows:

$$\mathrm{BI}_i = 1 - \mathbb{E}_{X, t}\, \frac{X_{i,t}^{\top} X_{i+1,t}}{\left\| X_{i,t} \right\|_2 \left\| X_{i+1,t} \right\|_2}, \tag{3}$$

where $X_i$ denotes the input of the $i$-th layer and $X_{i,t}$ is the $t$-th row of $X_i$.
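To make these metrics concrete, the sketch below computes per-layer magnitude scores and removes either the lowest-scoring layers or, for reverse-order, the last decoder layers of a Hugging Face Llama-style model. It is a minimal illustration that assumes the decoder layers are exposed as model.model.layers (true for Llama-family checkpoints) and uses illustrative model identifiers and output paths; it is not the exact implementation used in our experiments.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM

def magnitude_scores(model, p: int = 1):
    """Magnitude-lp score per decoder layer: sum of ||W||_p over its weight matrices."""
    scores = []
    for layer in model.model.layers:
        s = sum(w.detach().float().norm(p).item() for w in layer.parameters() if w.dim() >= 2)
        scores.append(s)
    return scores

def prune_layers(model, drop_indices):
    """Remove the decoder layers whose indices appear in drop_indices."""
    keep = [l for i, l in enumerate(model.model.layers) if i not in set(drop_indices)]
    model.model.layers = nn.ModuleList(keep)
    model.config.num_hidden_layers = len(keep)
    return model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)

n_prune, n_layers = 8, len(model.model.layers)
# Reverse-order: the deepest layers are assumed least important, so drop the last n_prune.
reverse_order_drop = list(range(n_layers - n_prune, n_layers))
# Magnitude-l1: drop the n_prune layers with the smallest scores.
scores = magnitude_scores(model, p=1)
magnitude_drop = sorted(range(n_layers), key=lambda i: scores[i])[:n_prune]

pruned = prune_layers(model, reverse_order_drop)
pruned.save_pretrained("llama-3.1-6.3b-reverse-order")   # hypothetical output path
```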
3.2 Evaluation and Datasets
To assess the performance of the model, we follow the evaluation of Ma etal. (2023a) to perform zero-shot task classification on common sense reasoning datasets using the lm-evaluation-harness(Gao etal., 2023) package: MMLU(Hendrycks etal., 2021), CMMLU(Li etal., 2023), PIQA(Bisk etal., 2020), HellaSwag(Zellers etal., 2019), WinoGrande(Sakaguchi etal., 2021), ARC-easy(Clark etal., 2018), ARC-challenge(Clark etal., 2018) and OpenbookQA(Mihaylov etal., 2018). Additionally, we evaluate the model using perplexity on the WikiText2(Merity etal., 2016) and Penn Treebank (PTB)(Marcus etal., 1993) datasets. For the PPL metric, we follow (Ma etal., 2023a; Muralidharan etal., 2024) and use WikiText2 for calculation. Following (Ma etal., 2023a), we randomly select 10 samples from BookCorpus(Zhu etal., 2015) to compute Taylor and BI, truncating each sample to a sequence length of 128. Unless otherwise specified, we utilize the Alpaca-cleaned(Taori etal., 2023) with LoRA to recover the performance. Uniformly, we set the training epoch to 2 and batch size to 64. All experiments are conducted on 2 NVIDIA A100 GPUs with 40 GB of memory and 4 NVIDIA RTX A5000 GPUs with 24 GB of memory.
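For reference, WikiText2 perplexity can be computed with a simple chunked evaluation such as the sketch below; the sequence length and non-overlapping chunking are illustrative simplifications rather than the exact protocol of the lm-evaluation-harness package.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seq_len=2048, device="cuda"):
    """Perplexity over the WikiText2 test split, evaluated in non-overlapping chunks."""
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nlls, n_tokens = [], 0
    for start in range(0, ids.size(1) - 1, seq_len):
        chunk = ids[:, start:start + seq_len].to(device)
        if chunk.size(1) < 2:
            break
        loss = model(chunk, labels=chunk).loss        # HF shifts the labels internally
        nlls.append(loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")
print("WikiText2 PPL:", wikitext2_ppl(model, tok))
```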
4 An Empirical Exploration of LLM Layer Pruning
This paper aims to contribute to the community the best practices of layer pruning, such that practitioners can prune an LLM to an affordable size and desired performance with minimal exploration effort. Specifically, we organize our study around three aspects: First, we explore which metric is most effective for identifying unimportant layers, helping researchers make informed choices. Then, we investigate which fine-tuning method most effectively restores model performance after pruning. Finally, we delve deeper into various pruning strategies and answer whether iterative pruning outperforms one-shot pruning.
Model Metric Benchmarks Avg Acc PIQA HellaSwag OpenbookQA ARC-e ARC-c MMLU CMMLU WinoGrande Vicuna-7B-v1.5 Dense 0.77200.0098 0.56420.0049 0.33000.0210 0.75550.0088 0.43260.0145 0.48580.0040 0.35180.0044 0.69530.0129 0.5484 Reverse-order 0.71710.0105 0.50050.0050 0.26080.0198 0.62210.0099 0.38480.0142 0.47370.0041 0.34170.0044 0.62670.0136 0.4909 Random 0.52230.0117 0.26070.0044 0.13800.0154 0.26140.0090 0.21760.0121 0.22950.0035 0.25000.0040 0.46720.0140 0.2933 PPL 0.73610.0103 0.47340.0050 0.27600.0200 0.67050.0096 0.34560.0139 0.29430.0038 0.25690.0041 0.58960.0138 0.4553 Magnitude-l1 0.52990.0116 0.25860.0044 0.14400.0157 0.26090.0090 0.22530.0122 0.22970.0035 0.25140.0040 0.48930.0140 0.2986 Magnitude-l2 0.52560.0117 0.25780.0044 0.13400.0152 0.26220.0090 0.21080.0119 0.22950.0035 0.25150.0040 0.48380.0140 0.2944 BI 0.69100.0108 0.39870.0049 0.21000.0182 0.58290.0101 0.26540.0129 0.23890.0036 0.25130.0040 0.50360.0141 0.3927 Taylor 0.52500.0117 0.25810.0044 0.13600.0153 0.25840.0090 0.20480.0118 0.23180.0036 0.25260.0040 0.49720.0141 0.2955 Qwen1.5-7B Dense 0.78450.0096 0.57850.0049 0.31600.0208 0.71250.0093 0.40530.0143 0.59670.0039 0.72770.0039 0.65750.0133 0.5973 Reverse-order 0.69420.0107 0.44440.0050 0.22800.0188 0.51430.0103 0.33020.0137 0.51010.0041 0.71710.0040 0.59120.0138 0.5037 Random 0.54080.0116 0.26820.0044 0.12400.0148 0.26300.0090 0.20390.0118 0.23660.0076 0.24570.0040 0.48070.0140 0.2954 PPL 0.70890.0106 0.41950.0049 0.22400.0187 0.59600.0101 0.29440.0133 0.24570.0036 0.25520.0041 0.51850.0140 0.4078 Magnitude-l1 0.65780.0111 0.39890.0049 0.20400.0180 0.52440.0102 0.29010.0133 0.25740.0037 0.25410.0041 0.52490.0140 0.3890 Magnitude-l2 0.59030.0115 0.36570.0048 0.16400.0166 0.46300.0102 0.23810.0124 0.25020.0037 0.25130.0040 0.53120.0140 0.3567 BI 0.72200.0105 0.41900.0049 0.24400.0192 0.59720.0101 0.26710.0129 0.24560.0036 0.25360.0040 0.53830.0140 0.4190 Taylor 0.69700.0107 0.42840.0049 0.20600.0181 0.51600.0103 0.31400.0136 0.52310.0041 0.60790.0043 0.60460.0137 0.4871 Gemma2-2B-It Dense 0.78670.0096 0.53670.0050 0.35600.0214 0.80850.0081 0.51110.0146 0.56870.0039 0.44990.0045 0.69610.0129 0.5892 Reverse-order 0.70290.0107 0.45290.0050 0.26600.0198 0.63430.0099 0.37630.0142 0.52610.0040 0.41170.0045 0.65510.0134 0.5032 Random 0.73070.0104 0.44620.0050 0.28600.0202 0.68520.0095 0.34220.0139 0.34520.0040 0.28930.0042 0.58330.0139 0.4635 PPL 0.74540.0102 0.46110.0050 0.29400.0204 0.70080.0094 0.36090.0140 0.35030.0040 0.28380.0042 0.58250.0139 0.4724 Magnitude-l1 0.74810.0101 0.45300.0050 0.30400.0206 0.72390.0092 0.37290.0141 0.27030.0037 0.25140.0040 0.55960.0140 0.4604 Magnitude-l2 0.72250.0104 0.42450.0049 0.23800.0191 0.65610.0097 0.30380.0134 0.24130.0036 0.22580.0041 0.54930.0140 0.4202 BI 0.69210.0108 0.42720.0049 0.27000.0199 0.65110.0098 0.37030.0141 0.49680.0040 0.38510.0045 0.66610.0133 0.4948 Taylor 0.70020.0107 0.45410.0050 0.30200.0206 0.63590.0099 0.36950.0141 0.54310.0040 0.40480.0045 0.64880.0134 0.5073 Llama-3.1-8B-It Dense 0.80030.0093 0.59100.0049 0.33800.0212 0.81820.0079 0.51790.0146 0.67900.0038 0.55520.0045 0.73950.0123 0.6299 Reverse-order 0.70020.0107 0.40100.0049 0.29400.0204 0.61700.0100 0.39850.0143 0.63420.0039 0.54490.0045 0.62430.0136 0.5268 Random 0.56530.0116 0.28860.0045 0.14000.0155 0.31690.0095 0.18600.0114 0.22750.0035 0.25590.0041 0.50750.0141 0.3110 PPL 0.76280.0099 0.49310.0050 0.26400.0197 0.72900.0091 0.38050.0142 0.33670.0040 0.27240.0041 0.57930.0139 0.4772 Magnitude-l1 0.54080.0116 0.26340.0044 0.13600.0153 0.28450.0093 
0.20140.0117 0.25040.0037 0.25030.0040 0.48780.0140 0.3018 Magnitude-l2 0.54130.0116 0.26380.0044 0.13400.0152 0.28410.0093 0.20140.0117 0.24980.0036 0.25040.0040 0.48700.0140 0.3015 BI 0.71760.0105 0.41960.0049 0.20200.0180 0.61070.0100 0.28410.0132 0.24170.0036 0.24940.0040 0.53910.0140 0.4080 Taylor 0.71380.0105 0.49640.0050 0.27400.0200 0.68480.0095 0.41810.0144 0.28610.0038 0.25040.0040 0.71350.0127 0.4796
4.1 Are fancy metrics essential for identifying redundant layers to prune?
The first question is to find the most “redundant” layers to prune. As discussed in Section3.1, there are various metrics for layer selection, which can be as straightforward as reverse-order, or as complicated as BI. However, does a complicated metric always contribute to a better performance? Probably not. We find that a simple metric, i.e., reverse-order, is competitive among these metrics.
Specifically, we conduct comprehensive experiments on Vicuna-7B-v1.5 (Zheng et al., 2024), Qwen1.5-7B (Yang et al., 2024a), Gemma2-2B-Instruct (Team, 2024) and Llama-3.1-8B-Instruct (Dubey et al., 2024). We uniformly prune 8 layers (25% pruning ratio) for Vicuna-7B-v1.5, Qwen1.5-7B and Llama-3.1-8B-Instruct, and 6 layers for Gemma2-2B-Instruct. Experiments with a 50% pruning ratio (12 layers for Gemma2-2B-Instruct and 16 layers for the others) are provided in Table A. In the fine-tuning stage, we use LoRA with a rank of , a batch size of , and the AdamW optimizer. The learning rate is set to with warmup steps.
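A typical recovery fine-tuning run with the peft library looks like the sketch below. The LoRA rank, learning rate, warmup steps, prompt format, and the dataset identifier yahma/alpaca-cleaned shown here are illustrative assumptions rather than the exact settings of our experiments; only the 2 epochs and effective batch size of 64 follow the setup in Section 3.2.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("llama-3.1-6.3b-reverse-order",  # pruned checkpoint from above
                                             torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# Attach LoRA adapters to the attention projections (illustrative hyperparameters).
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

data = load_dataset("yahma/alpaca-cleaned", split="train")
def tokenize(batch):
    # Simple instruction-response concatenation; the real prompt template may differ.
    text = [f"{ins}\n{inp}\n{out}" for ins, inp, out in
            zip(batch["instruction"], batch["input"], batch["output"])]
    return tokenizer(text, truncation=True, max_length=512)
data = data.map(tokenize, batched=True, remove_columns=data.column_names)

args = TrainingArguments(
    output_dir="lora-recovery", num_train_epochs=2,
    per_device_train_batch_size=8, gradient_accumulation_steps=8,   # effective batch size 64
    learning_rate=1e-4, warmup_steps=100, bf16=True, logging_steps=20,
)
Trainer(model=model, args=args, train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
```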
Results. As shown in Table 1, we find that the reverse-order metric delivers stable and superior results across various models under the 25% pruning rate, making it a reliable choice for pruning. On average, it outperforms the second-best PPL metric across the four models. The result also holds for the 50% pruning rate, as shown in Table A. We hope our insights can help researchers make informed choices when selecting the most suitable pruning metrics for their specific models.
4.2 Is the LoRA family the best choice for post-pruning fine-tuning?
In previous studies(Kim etal., 2024; Men etal., 2024), LoRA is often used to restore the performance of pruned models. This raises a question: Is the LoRA family the best choice for post-pruning fine-tuning? To answer this question, we further use QLoRA(Dettmers etal., 2024) and partial-layer fine-tuning techniques to conduct experiments. We briefly introduce these methods as follows:
LoRA Fine-tuning. LoRA is one of the best-performing parameter-efficient fine-tuning paradigms, updating dense model layers via pluggable low-rank matrices (Mao et al., 2024). Specifically, for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$. At the beginning of training, $A$ is initialized with a random Gaussian initialization, while $B$ is initialized to zero. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Then the forward pass can be formalized as:

$$h = W_0 x + \Delta W x = W_0 x + B A x. \tag{4}$$
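Read as code, Eq. (4) wraps a frozen linear layer with a trainable low-rank path. The snippet below is a minimal PyTorch sketch for illustration; the alpha/r scaling factor follows the original LoRA paper and is an assumption here, and this is not the peft implementation used in our experiments.

```python
import math
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Computes h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W0 (and its bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0, so the update starts at zero
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))          # random Gaussian-style init for A
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Because the update is additive, the product BA can be merged back into W0 after training, which is how LoRA weights are folded into the model between iterative pruning steps in Section 4.3.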
QLoRA Fine-tuning. QLoRA builds on LoRA by incorporating quantization techniques to further reduce memory usage while maintaining, or even enhancing, performance.
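In practice, QLoRA mainly changes how the base model is loaded before the LoRA adapters are attached. A hedged sketch with the transformers/peft/bitsandbytes stack is shown below; the 4-bit flags, target modules, and checkpoint path are common defaults and illustrative assumptions, not necessarily the exact configuration used in our runs.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the pruned model with 4-bit NF4 quantization (illustrative settings).
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "llama-3.1-6.3b-reverse-order",           # hypothetical local path to the pruned model
    quantization_config=bnb_cfg,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for training
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
))
# Training then proceeds as in the LoRA example; the paged 8-bit optimizer mentioned in
# Section 4.2 can be selected via TrainingArguments(optim="paged_adamw_8bit", ...).
```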
Partial-layer Fine-tuning. Compared to LoRA and QLoRA, which inject trainable low-rank factorization matrices into each layer, partial-layer fine-tuning simply freezes the weights of some layers while updating only the specified layers to save computing resources and time (Shen et al., 2021; Ngesthi et al., 2021; Peng & Wang, 2020). Following the common practice of previous studies (Khan & Fang, 2023), we choose to fine-tune only the later layers that are closer to the output, while keeping the earlier layers, which capture more general features, frozen. Specifically, we use two types of fine-tuning strategies: one fine-tunes only the model head (lm_head only), and the other fine-tunes the lm_head plus the last layer (lm_head + last layer), the last two layers (lm_head + last two layers), or the last three layers (lm_head + last three layers).
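These configurations reduce to a simple freezing pattern. The sketch below is a minimal illustration, assuming a Llama-style model whose decoder layers live in model.model.layers; k = 0 corresponds to "lm_head only" and k = 3 to "lm_head + last three layers".

```python
def setup_partial_layer_finetuning(model, k: int = 3):
    """Freeze every parameter, then unfreeze lm_head and the last k decoder layers."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = [model.lm_head]
    if k > 0:
        trainable += list(model.model.layers)[-k:]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad_(True)
    # Note: for models with tied input/output embeddings (e.g. Gemma2-2B-It), unfreezing
    # lm_head also unfreezes the embedding table, which is why we skip this setting there.
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n_trainable / 1e6:.2f}M")
    return model

model = setup_partial_layer_finetuning(model, k=3)  # then train with the same Trainer setup as above
```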
Model Method Layer Benchmarks Avg Acc PIQA HellaSwag OpenbookQA ARC-e ARC-c MMLU CMMLU WinoGrande Vicuna-7B-v1.5 LoRA - 0.71710.0105 0.50050.0050 0.26080.0198 0.62210.0099 0.38480.0142 0.47370.0041 0.34170.0044 0.62670.0136 0.4909 QLoRA - 0.66490.0110 0.40570.0049 0.27000.0199 0.53450.0102 0.34390.0139 0.48090.0041 0.34730.0044 0.60140.0138 0.4561 Partial-layer lm_head only 0.70570.0106 0.48650.0050 0.28800.0203 0.63010.0099 0.40100.0143 0.48190.0041 0.35200.0044 0.61560.0137 0.4951 lm_head+last layer 0.71550.0105 0.50540.0050 0.29000.0203 0.65110.0098 0.41130.0144 0.48310.0041 0.35380.0044 0.62830.0136 0.5048 lm_head+last two layers 0.72140.0105 0.50600.0050 0.30200.0206 0.65320.0098 0.40020.0143 0.48580.0041 0.35300.0044 0.62670.0136 0.5060 lm_head+last three layers 0.72470.0104 0.51030.0050 0.29600.0204 0.65280.0098 0.39850.0143 0.48700.0040 0.35440.0044 0.62190.0136 0.5057 Qwen1.5-7B LoRA - 0.69420.0107 0.44440.0050 0.22800.0188 0.51430.0103 0.33020.0137 0.51010.0041 0.71710.0040 0.59120.0138 0.5037 QLoRA - 0.66970.0110 0.40280.0049 0.24000.0191 0.47600.0102 0.29690.0134 0.47970.0041 0.69140.0041 0.58250.0139 0.4799 Partial-layer lm_head only 0.71490.0105 0.47350.0050 0.24600.0193 0.54970.0102 0.35240.0140 0.54670.0040 0.72760.0039 0.59670.0138 0.5259 lm_head+last layer 0.72200.0105 0.48500.0050 0.24400.0192 0.56900.0102 0.35490.0140 0.57190.0040 0.72830.0039 0.62750.0136 0.5378 lm_head+last two layers 0.72140.0105 0.49150.0050 0.25400.0195 0.57830.0101 0.35840.0140 0.57340.0040 0.72750.0039 0.62980.0136 0.5418 lm_head+last three layers 0.72960.0104 0.49740.0050 0.25200.0194 0.58080.0101 0.36180.0140 0.57950.0040 0.72720.0040 0.62750.0136 0.5445 Llama-3.1-8B-It LoRA - 0.70020.0107 0.40100.0049 0.29400.0204 0.61700.0100 0.39850.0143 0.63420.0039 0.54490.0045 0.62430.0136 0.5268 QLoRA - 0.69800.0107 0.39750.0049 0.30000.0205 0.61830.0100 0.38400.0142 0.60320.0039 0.50900.0045 0.62670.0136 0.5171 Partial-layer lm_head only 0.73340.0103 0.48960.0050 0.28600.0202 0.70120.0094 0.44110.0145 0.61220.0040 0.54420.0045 0.67170.0132 0.5599 lm_head+last layer 0.73500.0103 0.51070.0050 0.29400.0204 0.71930.0092 0.45310.0145 0.66300.0038 0.55260.0045 0.65820.0133 0.5732 lm_head+last two layers 0.73610.0103 0.52040.0050 0.30800.0207 0.71510.0093 0.46330.0146 0.65880.0038 0.55430.0045 0.65670.0133 0.5766 lm_head+last three layers 0.73830.0103 0.53230.0050 0.30800.0207 0.72600.0092 0.46840.0146 0.65670.0038 0.55150.0045 0.66460.0133 0.5807
| Method | PIQA | HellaSwag | OpenbookQA | ARC-e | ARC-c | MMLU | CMMLU | WinoGrande | Avg Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dense | 0.8003±0.0093 | 0.5910±0.0049 | 0.3380±0.0212 | 0.8182±0.0079 | 0.5179±0.0146 | 0.6790±0.0038 | 0.5552±0.0045 | 0.7395±0.0123 | 0.6299 |
| lm_head+last three layers | 0.7998±0.0093 | 0.6057±0.0049 | 0.3520±0.0214 | 0.8186±0.0079 | 0.5316±0.0146 | 0.6784±0.0038 | 0.5522±0.0045 | 0.7316±0.0125 | 0.6337 |
| LoRA | 0.8047±0.0092 | 0.6007±0.0049 | 0.3500±0.0214 | 0.8287±0.0077 | 0.5316±0.0146 | 0.6764±0.0038 | 0.5530±0.0045 | 0.7380±0.0124 | 0.6354 |
| | LoRA | QLoRA | lm_head only | lm_head+last layer | lm_head+last two layers | lm_head+last three layers |
| --- | --- | --- | --- | --- | --- | --- |
| Trainable parameters | 15.73M | 15.73M | 525.34M | 743.45M | 961.56M | 1179.68M |
| GPU memory | 45.83G | 14.26G | 39.82G | 42.12G | 44.41G | 48.02G |
| Training time (2 epochs) | 10440.30s | 17249.01s | 6952.92s | 7296.76s | 7616.83s | 7931.36s |
In view of the superiority of the reverse-order metric in Section 4.1, we use it for pruning here. For the Vicuna-7B-v1.5, Qwen1.5-7B, and Llama-3.1-8B-Instruct models, we prune 8 layers. For the Gemma2-2B-Instruct model, we prune 6 layers. Subsequently, we utilize LoRA, QLoRA and partial-layer fine-tuning methods to restore performance. We provide more results of fine-tuning with the Taylor metric in Table B. In particular, because Gemma2-2B-Instruct employs weight tying (Press & Wolf, 2016) to share the weights between the embedding layer and the softmax layer (lm_head), we exclude partial-layer fine-tuning for Gemma2-2B-Instruct. For fine-tuning with LoRA and partial-layer methods, we utilize the AdamW optimizer, while for QLoRA, we opt for the paged_adamw_8bit optimizer. All other hyperparameter settings are the same as in Section 4.1.
Results. As shown in Table 2 and Table B, we find that fine-tuning with QLoRA slightly hurts the performance of pruned models compared to LoRA. Excitingly, partial-layer fine-tuning performs significantly better than LoRA, providing a viable new direction for fine-tuning models after pruning. In the ablation study, we compare the performance of LoRA with partial-layer fine-tuning for the full model in Table 3, which shows that partial-layer fine-tuning and LoRA perform similarly. This suggests that the conventional insights for full-model fine-tuning do not hold after pruning, i.e., the structural changes and parameter reduction of the model enable partial-layer fine-tuning to adapt more effectively to the new parameter distribution and fully leverage the potential benefits of pruning. When considering fine-tuning methods for LLMs, in addition to performance, the training cost is also a significant factor to take into account. Therefore, we compare the training cost of these fine-tuning methods, including training time, GPU memory and trainable parameters. Specifically, we conduct experiments on 2 idle NVIDIA A100 GPUs using the pruned Llama-3.1-8B-Instruct model (with 8 layers removed in reverse order). Table 4 shows the comparison among these fine-tuning methods. We find that compared to LoRA, partial-layer fine-tuning involves more trainable parameters but maintains comparable GPU usage and achieves faster training time. Additionally, partial-layer fine-tuning outperforms LoRA in effectiveness. In contrast, although QLoRA consumes less GPU memory, it has a much longer training time and yields poorer performance. In summary, we conclude that partial-layer fine-tuning is an effective approach to restoring the performance of pruned models when sufficient memory is available.
4.3 Will iterative pruning outperform one-shot pruning?
In this subsection, we provide insights into the optimal pruning strategy for LLMs. Although Muralidharan etal. (2024) have explored pruning strategies and concluded that iterative pruning offers no benefit, their study focuses on utilizing knowledge distillation(Hinton, 2015) for performance recovery. In contrast, this paper concentrates on layer pruning with LoRA and partial-layer fine-tuning, thereby broadening the scope of pruning strategies evaluated. We briefly introduce the one-shot pruning and iterative pruning:
One-shot Pruning. One-shot pruning scores the layers once and then prunes the model to the target pruning ratio in a single step.
Iterative Pruning. Iterative pruning repeats the score-prune-update cycle until the target pruning ratio is achieved.
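The two strategies, as we instantiate them, can be sketched as follows; score_layers, prune_layers and finetune stand in for the metric computation, layer removal, and recovery fine-tuning described earlier, so this is an outline of the procedure rather than the exact experiment code.

```python
def one_shot_prune(model, score_layers, n_prune, finetune):
    """Score once, remove n_prune layers in a single cut, then recover with fine-tuning."""
    scores = score_layers(model)                                  # one importance score per layer
    drop = sorted(range(len(scores)), key=scores.__getitem__)[:n_prune]
    return finetune(prune_layers(model, drop))

def iterative_prune(model, score_layers, n_prune, step, finetune):
    """Repeat the score-prune-fine-tune cycle until n_prune layers have been removed."""
    removed = 0
    while removed < n_prune:
        k = min(step, n_prune - removed)
        scores = score_layers(model)                              # re-score after every update
        drop = sorted(range(len(scores)), key=scores.__getitem__)[:k]
        model = finetune(prune_layers(model, drop))               # e.g. LoRA (merged back) or partial-layer
        removed += k
    return model
```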
Specifically, we select Llama-3.1-8B-Instruct and Gemma2-2B-Instruct as the base models. For one-shot pruning, we prune 8 layers from Llama-3.1-8B-Instruct and 6 layers from Gemma2-2B-Instruct in a single step, guided by the reverse-order and Taylor metrics. For iterative pruning with LoRA, we begin by scoring all layers using these metrics. Subsequently, we set the pruning step to 1 and 4 for Llama-3.1-8B-Instruct, and to 1 and 3 for Gemma2-2B-Instruct. After each pruning step, we fine-tune the model with LoRA and merge the LoRA weights back into the fine-tuned model. This score-prune-fine-tune-merge cycle is repeated until a total of 8 layers are pruned for Llama-3.1-8B-Instruct and 6 layers for Gemma2-2B-Instruct. For iterative pruning with partial-layer fine-tuning, we fine-tune the model using partial-layer fine-tuning (lm_head + last three layers) after each pruning step, and then repeat the score-prune-fine-tune cycle. To avoid the fine-tuned layers being pruned completely, we set the pruning step size to 1. All hyperparameter settings are the same as in Section 4.1. Experiments with iterative pruning of more layers are provided in Table C.
Results. By comparing the results of iterative and one-shot pruning in Table 5 and Table C, we find that, unlike traditional CNN pruning, which often yields significant performance improvements through iterative pruning (Tan & Motani, 2020; He & Xiao, 2023), the iterative approach for LLMs may not provide the same benefits and can even lead to performance degradation. We believe this is because too much training causes the model to suffer from catastrophic forgetting (Zhai et al., 2024; Liu et al., 2024a). Figure B visualizes the representational similarity of different pruning strategies. From this, we observe that different pruning strategies yield significantly different representations, highlighting the impact of each strategy on the model's learned features. Besides, iterative pruning requires more computational overhead than one-shot pruning, which is not cost-effective given the limited performance gains.
Fine-tuning Method Model Metric Iteration steps Benchmarks Avg Acc PIQA HellaSwag OpenbookQA ARC-e ARC-c MMLU CMMLU WinoGrande LoRA Llama-3.1-8B-It Reverse-order one-shot 0.7002+0.0107 0.4010+0.0049 0.2940+0.0204 0.6170+0.0100 0.3985+0.0143 0.6342+0.0039 0.54490.0045 0.62430.0136 0.5268 1:4:8 0.71760.0105 0.45380.0050 0.29200.0204 0.67050.0096 0.41210.0144 0.63740.0039 0.54390.0045 0.63690.0135 0.5455 1:1:8 0.71600.0105 0.44700.0050 0.28600.0202 0.66370.0097 0.40610.0144 0.64400.0039 0.54250.0045 0.64480.0135 0.5438 Taylor one-shot 0.71380.0105 0.49640.0050 0.27400.0200 0.68480.0095 0.41810.0144 0.28610.0038 0.25040.0040 0.71350.0127 0.4796 1:4:8 0.71490.0105 0.49910.0050 0.24800.0193 0.70710.0093 0.39510.0143 0.46760.0041 0.34800.0044 0.67090.0132 0.5063 1:1:8 0.69210.0108 0.47280.0050 0.21400.0184 0.66750.0097 0.38910.0142 0.45760.0041 0.35110.0044 0.65190.0134 0.4870 Gemma2-2B-It Reverse-order one-shot 0.70290.0107 0.45290.0050 0.26600.0198 0.63430.0099 0.37630.0142 0.52610.0040 0.41170.0045 0.65510.0134 0.5032 1:3:6 0.69530.0107 0.45230.0050 0.29000.0203 0.63970.0099 0.37290.0141 0.54180.0040 0.40130.0045 0.64960.0134 0.5054 1:1:6 0.70670.0106 0.44760.0050 0.26600.0198 0.63050.0099 0.37460.0141 0.51430.0040 0.40660.0045 0.65590.0134 0.5003 Taylor one-shot 0.70020.0107 0.45410.0050 0.30200.0206 0.63590.0099 0.36950.0141 0.54310.0040 0.40480.0045 0.64880.0134 0.5073 1:3:6 0.70570.0106 0.44730.0050 0.23800.0191 0.65530.0098 0.34900.0139 0.36970.0040 0.28840.0042 0.59270.0138 0.4558 1:1:6 0.72360.0104 0.45440.0050 0.28600.0202 0.65740.0097 0.34900.0139 0.47630.0041 0.38010.0045 0.63060.0136 0.4947 Partial-layer Llama-3.1-8B-It Reverse-order one-shot 0.73830.0103 0.53230.0050 0.30800.0207 0.72600.0092 0.46840.0146 0.65670.0038 0.55150.0045 0.66460.0133 0.5807 1:1:8 0.74320.0102 0.53570.0050 0.29800.0205 0.74960.0089 0.45900.0146 0.65390.0038 0.55580.0045 0.69220.0130 0.5859 Taylor one-shot 0.73450.0103 0.52900.0050 0.30200.0206 0.73990.0090 0.43600.0145 0.62770.0039 0.47630.0046 0.71510.0127 0.5701 1:1:8 0.63000.0113 0.35530.0048 0.17600.0170 0.51770.0103 0.27560.0131 0.26110.0037 0.25570.0041 0.53120.0140 0.3753
| Calibration samples | PPL on WikiText2 (BI) | PPL on WikiText2 (Taylor) | PPL on PTB (BI) | PPL on PTB (Taylor) | Avg Acc (BI) | Avg Acc (Taylor) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 51.06 | 65.43 | 90.97 | 94.35 | 0.40 | 0.36 |
| 5 | 43.54 | 65.43 | 79.34 | 94.35 | 0.43 | 0.36 |
| 10 | 53.53 | 65.43 | 101.64 | 94.35 | 0.41 | 0.36 |
| 30 | 50.03 | 55.42 | 88.02 | 77.63 | 0.42 | 0.55 |
| 50 | 59.73 | 55.42 | 103.19 | 77.63 | 0.41 | 0.55 |
5 Sensitivity Analysis
In this section, we conduct sensitivity analyses on the number of calibration samples, the choice of SFT dataset and various pruning rates for LLM layer pruning.
The effect of the number of calibration samples on LLM layer pruning. It is worth noting that some data-driven layer pruning methods, such as BI and Taylor, rely upon calibration samples to generate layer activations. Therefore, we explore the effect of the number of calibration samples on pruning. Specifically, we calculate the BI and Taylor metrics using 1, 5, 10, 30, and 50 calibration samples, prune 8 layers based on these metrics, fine-tune the pruned Llama-3.1-8B-Instruct models using LoRA, and evaluate their performance through the lm-evaluation-harness package. For ease of comparison, we report the average accuracy on 8 datasets in the main text. For more details, see Table D. Besides, we report the model perplexity on the WikiText2 and Penn Treebank test sets. As shown in Table 6, we observe that the number of calibration samples does affect the perplexity and zero-shot performance of the pruned models, which suggests that for data-driven pruning methods, performance stability should also be considered a key criterion when evaluating the quality of a pruning technique.
| Dataset | PIQA | HellaSwag | OpenbookQA | ARC-e | ARC-c | MMLU | CMMLU | WinoGrande | Avg Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dolly-15k | 0.7709±0.0098 | 0.5541±0.0050 | 0.3000±0.0205 | 0.7424±0.0090 | 0.4838±0.0146 | 0.6753±0.0038 | 0.5522±0.0045 | 0.7032±0.0128 | 0.5977 |
| Alpaca-cleaned | 0.7383±0.0103 | 0.5323±0.0050 | 0.3080±0.0207 | 0.7260±0.0092 | 0.4684±0.0146 | 0.6567±0.0038 | 0.5515±0.0045 | 0.6646±0.0133 | 0.5807 |
| MMLU | 0.6012±0.0114 | 0.2714±0.0044 | 0.1700±0.0168 | 0.3430±0.0097 | 0.2457±0.0126 | 0.5888±0.0040 | 0.5266±0.0045 | 0.5856±0.0138 | 0.4165 |
The effect of SFT datasets on LLM layer pruning. In the previous sections, we uniformly utilize Alpaca-cleaned(Taori etal., 2023) to fine-tune the pruned models. Herein, we aim to assess how fine-tuning a pruned model using different SFT datasets affects its performance. Specifically, we conduct experiments using the Reverse-order metric to remove 8 layers from the Llama-3.1-8B-Instruct and fine-tune the pruned model using lm_head + last three layers on MMLU (training set)(Hendrycks etal., 2021) and Dolly-15k(Conover etal., 2023). We set the maximum sequence length to 512 for MMLU and 1024 for Dolly-15k. From Table7, we observe that among these datasets, Dolly-15k achieves the best results, followed by Alpaca-cleaned. This demonstrates that fine-tuning with different SFT datasets has a significant impact on the performance of pruned models and suggests further exploration of the most suitable datasets for fine-tuning pruned models.
The effect of different pruning rates on LLM layer pruning. We investigate the impact of pruning the LLM at various pruning rates in Figure 2. Specifically, we conduct one-shot pruning on Llama-3.1-8B-Instruct using the reverse-order and Taylor metrics and evaluate their effects on the model's performance with LoRA. All hyperparameter settings remain consistent with those in Section 4.1. As shown in Figure 2, we observe that as the number of pruned layers increases, the performance of the model on all datasets tends to decrease and eventually converges. However, certain datasets, especially MMLU, CMMLU, and ARC-c, are highly sensitive to layer removal and degrade faster than others. Besides, after removing about 16 layers, the model breaks down, so we set the maximum pruning rate in this paper to 16 layers.
Baseline # Parameters (TTokens) Benchmarks Avg Acc PIQA HellaSwag OpenbookQA ARC-e ARC-c MMLU CMMLU WinoGrande Vicuna-7B-v1.5 6.74B (370M) 0.77200.0098 0.56420.0049 0.33000.0210 0.75550.0088 0.43260.0145 0.48580.0040 0.35180.0044 0.69530.0129 0.5484 ChatGLM2-6B 6.24B (1.4T) 0.54030.0116 0.25890.0044 0.14200.0156 0.25970.0090 0.20050.0117 0.24310.0036 0.25370.0040 0.52880.0140 0.3034 Baichuan2-7B 7.51B (2.6T) 0.76660.0099 0.53630.0050 0.30200.0206 0.74750.0089 0.42060.0144 0.50240.0040 0.52200.0045 0.68190.0131 0.5599 Qwen1.5-7B 7.72B (18T) 0.78450.0096 0.57850.0049 0.31600.0208 0.71250.0093 0.40530.0143 0.59670.0039 0.72770.0039 0.65750.0133 0.5973 LLaMA3-8B 8.03B (15T+) 0.79650.0094 0.60140.0049 0.34800.0213 0.80050.0082 0.49830.0146 0.62120.0038 0.47520.0045 0.73320.0124 0.6093 Gemma2-7B 8.54B (6T) 0.80250.0093 0.60390.0049 0.33000.0210 0.81100.0080 0.50090.0146 0.61430.0039 0.44300.0045 0.74350.0123 0.6061 Llama-3.1-8B-It 8.03B (15T+) 0.80030.0093 0.59100.0049 0.33800.0212 0.81820.0079 0.51790.0146 0.67900.0038 0.55520.0045 0.73950.0123 0.6299 ShortGPT (BI) 6.29B (12.74M) 0.71760.0105 0.41960.0049 0.20200.0180 0.61070.0100 0.28410.0132 0.24170.0036 0.24940.0040 0.53910.0140 0.4080 Shortened LLaMA (PPL) 6.29B (12.74M) 0.76280.0099 0.49310.0050 0.26400.0197 0.72900.0091 0.38050.0142 0.33670.0040 0.27240.0041 0.57930.0139 0.4772 Shortened LLaMA (Taylor) 6.29B (12.74M) 0.71380.0105 0.49640.0050 0.27400.0200 0.68480.0095 0.41810.0144 0.28610.0038 0.25040.0040 0.71350.0127 0.4796 Llama-3.1-6.3B-It-Alpaca 6.29B (12.74M) 0.73830.0103 0.53230.0050 0.30800.0207 0.72600.0092 0.46840.0146 0.65670.0038 0.55150.0045 0.66460.0133 0.5807 Llama-3.1-6.3B-It-Dolly 6.29B (14.96M) 0.77090.0098 0.55410.0050 0.30000.0205 0.74240.0090 0.48380.0146 0.67530.0038 0.55220.0045 0.70320.0128 0.5977
| Model | # Params | # MACs | Memory | Latency |
| --- | --- | --- | --- | --- |
| Llama-3.1-6.3B-It-Alpaca / Llama-3.1-6.3B-It-Dolly | 6.29B | 368.65G | 23984 MiB | 210.35 s |
6 Obtaining the Best Pruned Models
In Section4 and Section5, we have gained some valuable non-trivial practices and insights on LLM layer pruning through systematic experiments. Herein, we use these practices and insights to obtain the Llama-3.1-6.3B-It model and compare its performance against multiple baselines: (1) the original Llama-3.1-8B-It model, (2) a set of similarly sized community models and (3) a set of pruned models obtained by state-of-the-art LLM layer pruning methods (all prune 8 layers, fine-tune on Alpaca-cleaned).
Specifically, Llama-3.1-6.3B-It is obtained by pruning 8 layers of Llama-3.1-8B-It using the reverse-order metric. Note that, in contrast to these community models trained from scratch on trillions of tokens (except for Vicuna-7B-v1.5), Llama-3.1-6.3B-It is fine-tuned solely on Alpaca-cleaned (12.74M tokens) or Dolly-15k (14.96M tokens). For ease of distinction, we refer to the resulting models as “Llama-3.1-6.3B-It-Alpaca” and “Llama-3.1-6.3B-It-Dolly”, respectively. From Table 8, we find that both Llama-3.1-6.3B-It-Alpaca and Llama-3.1-6.3B-It-Dolly outperform ChatGLM2-6B (GLM et al., 2024), Vicuna-7B-v1.5 (Zheng et al., 2024) and Baichuan2-7B (Baichuan, 2023), and partially exceed LLaMA3-8B (AI@Meta, 2024) and Gemma2-7B (Team et al., 2024) (e.g., on MMLU), while using significantly fewer training tokens. Notably, Llama-3.1-6.3B-It-Dolly also outperforms Qwen1.5-7B (Yang et al., 2024a). Besides, we also compare our models to other pruned models obtained by various LLM layer pruning methods. Experimental results show that our models are nearly 19% better than ShortGPT (Men et al., 2024) and more than 10% better than Shortened LLaMA (Kim et al., 2024). Table 9 presents the statistics of Llama-3.1-6.3B-It, including parameters, MACs, memory requirements and latency. Following Ma et al. (2023a), the statistical evaluation is conducted in inference mode, where the model is fed a sentence consisting of 64 tokens. The latency is measured on the WikiText2 test set on a single NVIDIA A100 GPU. We also present the generation results of Llama-3.1-6.3B-It-Alpaca, Llama-3.1-6.3B-It-Dolly and Llama-3.1-8B-It in Table E.
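For reference, the parameter count and a rough forward-pass latency can be measured as follows; this sketch times a single 64-token input rather than the full WikiText2 test set, and the MACs counting tool is not specified in the text, so it is an illustrative approximation rather than the exact protocol behind Table 9.

```python
import time
import torch

@torch.no_grad()
def profile_model(model, tokenizer, n_tokens=64, n_runs=10, device="cuda"):
    """Count parameters and time the forward pass on a 64-token input."""
    model.eval().to(device)
    n_params = sum(p.numel() for p in model.parameters())
    ids = torch.randint(0, tokenizer.vocab_size, (1, n_tokens), device=device)
    model(ids)                                   # warm-up run
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(ids)
    torch.cuda.synchronize()
    latency = (time.time() - start) / n_runs
    return n_params, latency

params, latency = profile_model(model, tok)
print(f"{params / 1e9:.2f}B parameters, {latency * 1000:.1f} ms per forward pass")
```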
7 Conclusion
In this paper, we revisit LLM layer pruning, focusing on pruning metrics, fine-tuning methods and pruning strategies. From these efforts, we have developed a practical list of best practices for LLM layer pruning. We use these practices and insights to guide the pruning of Llama-3.1-8B-Instruct and obtain Llama-3.1-6.3B-It-Alpaca and Llama-3.1-6.3B-It-Dolly. Our pruned models require far fewer training tokens than training from scratch, yet still perform favorably against various popular community LLMs of similar size. We hope our work will help inform best practices for deploying LLMs in real-world applications.
Limitations and Future Work. In Section 5, we find that SFT datasets do affect the performance of pruned models. Therefore, we will explore which SFT datasets are more suitable for fine-tuning pruned models in future work. Additionally, in this paper, we focus primarily on layer pruning due to the straightforward nature of pruning layers in LLMs, where the input and output dimensions are identical. However, we plan to further investigate weight pruning (Sun et al., 2023; Frantar & Alistarh, 2023) and width pruning (Xia et al., 2023; Ma et al., 2023b) in future experiments.
8 Reproducibility Statement
The authors have made great efforts to ensure the reproducibility of the empirical results reported in this paper. Firstly, the experiment settings, evaluation metrics, and datasets are described in detail in Section 3.2. Secondly, the code to reproduce the results is available at https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning, and the optimal model weights can be found at https://huggingface.co/YaoLuzjut/Llama-3.1-6.3B-It-Alpaca and https://huggingface.co/YaoLuzjut/Llama-3.1-6.3B-It-Dolly.
9 Ethics statement
In this paper, we carefully consider ethical concerns related to our research and ensure that all methodologies and experimental designs adhere to ethical standards. Our study focuses on layer pruning to enhance the efficiency of LLMs and reduce computational resource requirements, thereby promoting sustainable AI development. Furthermore, all models and datasets used in our research are sourced from publicly available and accessible origins, ensuring no infringement on intellectual property or personal privacy.
References
- Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
- AI@Meta (2024)AI@Meta.Llama 3 model card.2024.URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- Baichuan (2023)Baichuan.Baichuan 2: Open large-scale language models.arXiv preprint arXiv:2309.10305, 2023.URL https://arxiv.org/abs/2309.10305.
- Bisk etal. (2020)Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, etal.Piqa: Reasoning about physical commonsense in natural language.In Proceedings of the AAAI conference on artificial intelligence, volume34, pp. 7432–7439, 2020.
- Chen & Zhao (2018)Shi Chen and QiZhao.Shallowing deep networks: Layer-wise pruning based on feature representations.IEEE transactions on pattern analysis and machine intelligence, 41(12):3048–3056, 2018.
- Chen etal. (2024)Xiaodong Chen, Yuxuan Hu, and Jing Zhang.Compressing large language models by streamlining the unimportant layer.arXiv preprint arXiv:2403.19135, 2024.
- Clark etal. (2018)Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018.
- Conover etal. (2023)Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin.Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023.URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
- Deng etal. (2023)Xiang Deng, Vasilisa Bashlovkina, Feng Han, Simon Baumgartner, and Michael Bendersky.Llms to the moon? reddit market sentiment analysis with large language models.In Companion Proceedings of the ACM Web Conference 2023, pp. 1014–1019, 2023.
- Dettmers etal. (2024)Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer.Qlora: Efficient finetuning of quantized llms.Advances in Neural Information Processing Systems, 36, 2024.
- Dosovitskiy etal. (2021)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.An image is worth 16x16 words: Transformers for image recognition at scale, 2021.URL https://arxiv.org/abs/2010.11929.
- Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, etal.The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024.
- Frantar & Alistarh (2023)Elias Frantar and Dan Alistarh.Sparsegpt: Massive language models can be accurately pruned in one-shot.In International Conference on Machine Learning, pp. 10323–10337. PMLR, 2023.
- Gao etal. (2023)Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain LeNoac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou.A framework for few-shot language model evaluation, 12 2023.URL https://zenodo.org/records/10256836.
- GLM etal. (2024)Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, DaYin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, WengLam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang.Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
- Gu etal. (2024)Yuxian Gu, LiDong, Furu Wei, and Minlie Huang.Minillm: Knowledge distillation of large language models.In The Twelfth International Conference on Learning Representations, 2024.
- Guenter & Sideris (2024)Valentin FrankIngmar Guenter and Athanasios Sideris.Concurrent training and layer pruning of deep neural networks.arXiv preprint arXiv:2406.04549, 2024.
- He etal. (2015)Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition, 2015.URL https://arxiv.org/abs/1512.03385.
- He & Xiao (2023)Yang He and Lingao Xiao.Structured pruning for deep convolutional neural networks: A survey.IEEE transactions on pattern analysis and machine intelligence, 2023.
- Hendrycks etal. (2021)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- Hinton (2015)Geoffrey Hinton.Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015.
- Hu etal. (2021)EdwardJ Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, LuWang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021.
- Jaiswal etal. (2023)Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, and Yinfei Yang.Compressing llms: The truth is rarely pure and never simple.arXiv preprint arXiv:2310.01382, 2023.
- Khan & Fang (2023)MuhammadOsama Khan and YiFang.Revisiting fine-tuning strategies for self-supervised medical imaging analysis.arXiv preprint arXiv:2307.10915, 2023.
- Kim etal. (2024)Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song.Shortened llama: A simple depth pruning for large language models.arXiv preprint arXiv:2402.02834, 2024.
- Lee etal. (2024)Jungi Lee, Wonbeom Lee, and Jaewoong Sim.Tender: Accelerating large language models via tensor decomposition and runtime requantization.arXiv preprint arXiv:2406.12930, 2024.
- Li etal. (2016)Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and HansPeter Graf.Pruning filters for efficient convnets.arXiv preprint arXiv:1608.08710, 2016.
- Li etal. (2023)Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin.Cmmlu: Measuring massive multitask language understanding in chinese.arXiv preprint arXiv:2306.09212, 2023.
- Lin etal. (2024)JiLin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han.Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of Machine Learning and Systems, 6:87–100, 2024.
- Liu etal. (2024a)Chengyuan Liu, Shihang Wang, Yangyang Kang, Lizhi Qing, Fubang Zhao, Changlong Sun, Kun Kuang, and Fei Wu.More than catastrophic forgetting: Integrating general capabilities for domain-specific llms.arXiv preprint arXiv:2405.17830, 2024a.
- Liu etal. (2024b)Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, XiChen, Cunhang Fan, etal.Pruning via merging: Compressing llms via manifold alignment based layer merging.arXiv preprint arXiv:2406.16330, 2024b.
- Liu etal. (2024c)Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, and Fangmin Chen.Foldgpt: Simple and effective large language model compression scheme.arXiv preprint arXiv:2407.00928, 2024c.
- Liu etal. (2021)ZeLiu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.Swin transformer: Hierarchical vision transformer using shifted windows, 2021.URL https://arxiv.org/abs/2103.14030.
- Liu etal. (2023)Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra.Llm-qat: Data-free quantization aware training for large language models.arXiv preprint arXiv:2305.17888, 2023.
- Lu etal. (2022)Yao Lu, Wen Yang, Yunzhe Zhang, Zuohui Chen, Jinyin Chen, QiXuan, Zhen Wang, and Xiaoniu Yang.Understanding the dynamics of dnns using graph modularity.In European Conference on Computer Vision, pp. 225–242. Springer, 2022.
- Ma etal. (2023a)Xinyin Ma, Gongfan Fang, and Xinchao Wang.Llm-pruner: On the structural pruning of large language models.Advances in neural information processing systems, 36:21702–21720, 2023a.
- Ma etal. (2023b)Xinyin Ma, Gongfan Fang, and Xinchao Wang.Llm-pruner: On the structural pruning of large language models.In Advances in Neural Information Processing Systems, 2023b.
- Mao etal. (2024)Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, YuMi, Zhonghao Hu, and Yunjun Gao.A survey on lora of large language models.arXiv preprint arXiv:2407.11046, 2024.
- Marcus etal. (1993)Mitch Marcus, Beatrice Santorini, and MaryAnn Marcinkiewicz.Building a large annotated corpus of english: The penn treebank.Computational linguistics, 19(2):313–330, 1993.
- Men et al. (2024) Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853, 2024.
- Meng et al. (2024) Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models, 2024. URL https://arxiv.org/abs/2404.02948.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Muralidharan et al. (2024) Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. arXiv preprint arXiv:2407.14679, 2024.
- Ngesthi et al. (2021) Stephany Octaviani Ngesthi, Iwan Setyawan, and Ivanna K. Timotius. The effect of partial fine tuning on AlexNet for skin lesions classification. In 2021 13th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 147–152. IEEE, 2021.
- Peng & Wang (2020) Peng Peng and Jiugen Wang. How to fine-tune deep neural networks in few-shot learning? arXiv preprint arXiv:2012.00204, 2020.
- Press & Wolf (2016) Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
- Saha et al. (2023) Rajarshi Saha, Varun Srivastava, and Mert Pilanci. Matrix compression via randomized low rank and low precision factorization. Advances in Neural Information Processing Systems, 36, 2023.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Shah et al. (2024) Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024.
- Shen et al. (2021) Zhiqiang Shen, Zechun Liu, Jie Qin, Marios Savvides, and Kwang-Ting Cheng. Partial is better than all: Revisiting fine-tuning strategy for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9594–9602, 2021.
- Siddiqui et al. (2024) Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, and Pavlo Molchanov. A deeper look at depth pruning of LLMs. arXiv preprint arXiv:2407.16286, 2024.
- Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015. URL https://arxiv.org/abs/1409.1556.
- Sun et al. (2023) Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
- Szegedy et al. (2014) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014. URL https://arxiv.org/abs/1409.4842.
- Tan & Motani (2020) Chong Min John Tan and Mehul Motani. DropNet: Reducing neural network complexity via iterative pruning. In International Conference on Machine Learning, pp. 9356–9366. PMLR, 2020.
- Tang et al. (2023) Hui Tang, Yao Lu, and Qi Xuan. SR-init: An interpretable layer pruning method. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.
- Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following Llama model, 2023.
- Team (2024) Gemma Team. Gemma. 2024. doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.
- Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- Tishby et al. (2000) Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Wang et al. (2023) Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210, 2023.
- Wang et al. (2019) Wenxiao Wang, Shuai Zhao, Minghao Chen, Jinming Hu, Deng Cai, and Haifeng Liu. DBP: Discrimination based block-level pruning for deep model acceleration. arXiv preprint arXiv:1912.10178, 2019.
- Williams & Aletras (2024) Miles Williams and Nikolaos Aletras. On the impact of calibration data in post-training quantization and pruning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10100–10118, 2024.
- Xia et al. (2023) Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
- Xu et al. (2024) Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024.
- Yang et al. (2024a) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a.
- Yang et al. (2024b) Yifei Yang, Zouying Cao, and Hai Zhao. LaCo: Large language model pruning via layer collapse. arXiv preprint arXiv:2402.11187, 2024b.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Zhai et al. (2024) Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language model fine-tuning. In Conference on Parsimony and Learning, pp. 202–227. PMLR, 2024.
- Zhang et al. (2023a) Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pp. 41092–41110. PMLR, 2023a.
- Zhang et al. (2023b) Boyu Zhang, Hongyang Yang, Tianyu Zhou, Muhammad Ali Babar, and Xiao-Yang Liu. Enhancing financial sentiment analysis via retrieval augmented large language models. In Proceedings of the Fourth ACM International Conference on AI in Finance, pp. 349–356, 2023b.
- Zhao et al. (2024a) Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024a.
- Zhao et al. (2024b) Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection, 2024b. URL https://arxiv.org/abs/2403.03507.
- Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
- Zhong et al. (2024) Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, and Liangzhi Li. BlockPruner: Fine-grained pruning for large language models. arXiv preprint arXiv:2406.10594, 2024.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27, 2015.
Appendix A Supplementary Material of Reassessing Layer Pruning in LLMs: New Insights and Methods
Table: Benchmark accuracy (± standard error) of the dense models and of layer-pruned variants obtained with different layer-selection metrics.

| Model | Metric | PIQA | HellaSwag | OpenbookQA | ARC-e | ARC-c | MMLU | CMMLU | WinoGrande | Avg Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna-7B-v1.5 | Dense | 0.7720±0.0098 | 0.5642±0.0049 | 0.3300±0.0210 | 0.7555±0.0088 | 0.4326±0.0145 | 0.4858±0.0040 | 0.3518±0.0044 | 0.6953±0.0129 | 0.5484 |
| Vicuna-7B-v1.5 | Reverse-order | 0.5642±0.0116 | 0.2919±0.0045 | 0.1700±0.0168 | 0.3258±0.0096 | 0.2645±0.0129 | 0.4372±0.0041 | 0.3069±0.0043 | 0.5872±0.0138 | 0.3685 |
| Vicuna-7B-v1.5 | Random | 0.5773±0.0115 | 0.3083±0.0046 | 0.1560±0.0162 | 0.3775±0.0099 | 0.2176±0.0121 | 0.2650±0.0037 | 0.2542±0.0041 | 0.5067±0.0141 | 0.3328 |
| Vicuna-7B-v1.5 | PPL | 0.6572±0.0111 | 0.3524±0.0048 | 0.1940±0.0177 | 0.4971±0.0103 | 0.2406±0.0125 | 0.2361±0.0036 | 0.2510±0.0040 | 0.5328±0.0140 | 0.3702 |
| Vicuna-7B-v1.5 | Magnitude-l1 | 0.5239±0.0117 | 0.2585±0.0044 | 0.1400±0.0155 | 0.2635±0.0090 | 0.2184±0.0121 | 0.2295±0.0035 | 0.2527±0.0040 | 0.4893±0.0140 | 0.2970 |
| Vicuna-7B-v1.5 | Magnitude-l2 | 0.5245±0.0117 | 0.2590±0.0044 | 0.1300±0.0151 | 0.2656±0.0091 | 0.2210±0.0121 | 0.2293±0.0035 | 0.2512±0.0040 | 0.4791±0.0140 | 0.2950 |
| Vicuna-7B-v1.5 | BI | 0.5250±0.0117 | 0.2598±0.0044 | 0.1440±0.0157 | 0.2740±0.0092 | 0.1928±0.0115 | 0.2296±0.0035 | 0.2476±0.0040 | 0.4988±0.0141 | 0.2965 |
| Vicuna-7B-v1.5 | Taylor | 0.5283±0.0116 | 0.2585±0.0044 | 0.1300±0.0151 | 0.2572±0.0090 | 0.2167±0.0120 | 0.2614±0.0037 | 0.2513±0.0040 | 0.4901±0.0140 | 0.2992 |
| Qwen1.5-7B | Dense | 0.7845±0.0096 | 0.5785±0.0049 | 0.3160±0.0208 | 0.7125±0.0093 | 0.4053±0.0143 | 0.5967±0.0039 | 0.7277±0.0039 | 0.6575±0.0133 | 0.5973 |
| Qwen1.5-7B | Reverse-order | 0.5783±0.0115 | 0.3100±0.0046 | 0.1640±0.0166 | 0.3047±0.0094 | 0.2363±0.0124 | 0.2507±0.0037 | 0.2564±0.0041 | 0.5391±0.0140 | 0.3299 |
| Qwen1.5-7B | Random | 0.6409±0.0112 | 0.3268±0.0047 | 0.1940±0.0177 | 0.4617±0.0102 | 0.2261±0.0122 | 0.2321±0.0036 | 0.2529±0.0040 | 0.5083±0.0141 | 0.3553 |
| Qwen1.5-7B | PPL | 0.6529±0.0111 | 0.3233±0.0047 | 0.1700±0.0168 | 0.4360±0.0102 | 0.2099±0.0119 | 0.2297±0.0035 | 0.2541±0.0041 | 0.5225±0.0140 | 0.3498 |
| Qwen1.5-7B | Magnitude-l1 | 0.5452±0.0116 | 0.2690±0.0044 | 0.1280±0.0150 | 0.2837±0.0092 | 0.1962±0.0116 | 0.2548±0.0037 | 0.2479±0.0040 | 0.4862±0.0140 | 0.3013 |
| Qwen1.5-7B | Magnitude-l2 | 0.5348±0.0116 | 0.2651±0.0044 | 0.1520±0.0161 | 0.2858±0.0093 | 0.1843±0.0113 | 0.2659±0.0037 | 0.2519±0.0040 | 0.5059±0.0141 | 0.3057 |
| Qwen1.5-7B | BI | 0.6001±0.0114 | 0.2905±0.0045 | 0.1880±0.0175 | 0.4099±0.0101 | 0.2090±0.0119 | 0.2420±0.0036 | 0.2472±0.0040 | 0.4901±0.0140 | 0.3346 |
| Qwen1.5-7B | Taylor | 0.5223±0.0117 | 0.2540±0.0043 | 0.1460±0.0158 | 0.2403±0.0088 | 0.2176±0.0121 | 0.2393±0.0036 | 0.2478±0.0040 | 0.4854±0.0140 | 0.2941 |
| Gemma2-2B-It | Dense | 0.7867±0.0096 | 0.5367±0.0050 | 0.3560±0.0214 | 0.8085±0.0081 | 0.5111±0.0146 | 0.5687±0.0039 | 0.4499±0.0045 | 0.6961±0.0129 | 0.5892 |
| Gemma2-2B-It | Reverse-order | 0.6050±0.0114 | 0.3049±0.0046 | 0.1900±0.0176 | 0.3817±0.0100 | 0.2491±0.0126 | 0.2327±0.0036 | 0.2527±0.0040 | 0.5580±0.0140 | 0.3468 |
| Gemma2-2B-It | Random | 0.6741±0.0109 | 0.3441±0.0047 | 0.2180±0.0185 | 0.5446±0.0102 | 0.2696±0.0130 | 0.2307±0.0036 | 0.2540±0.0041 | 0.5335±0.0140 | 0.3836 |
| Gemma2-2B-It | PPL | 0.6621±0.0110 | 0.3505±0.0048 | 0.2380±0.0191 | 0.5585±0.0102 | 0.2526±0.0127 | 0.2328±0.0036 | 0.2526±0.0040 | 0.5280±0.0140 | 0.3844 |
| Gemma2-2B-It | Magnitude-l1 | 0.6649±0.0110 | 0.3358±0.0047 | 0.1960±0.0178 | 0.5564±0.0102 | 0.2355±0.0124 | 0.2307±0.0035 | 0.2516±0.0040 | 0.5264±0.0140 | 0.3747 |
| Gemma2-2B-It | Magnitude-l2 | 0.6159±0.0113 | 0.2956±0.0046 | 0.1720±0.0169 | 0.4301±0.0102 | 0.2073±0.0118 | 0.2319±0.0036 | 0.2501±0.0040 | 0.5178±0.0140 | 0.3401 |
| Gemma2-2B-It | BI | 0.6376±0.0112 | 0.3310±0.0047 | 0.2140±0.0184 | 0.4891±0.0103 | 0.2406±0.0125 | 0.2397±0.0036 | 0.2532±0.0040 | 0.5667±0.0139 | 0.3715 |
| Gemma2-2B-It | Taylor | 0.6088±0.0114 | 0.3142±0.0046 | 0.1880±0.0175 | 0.4049±0.0101 | 0.2739±0.0130 | 0.2297±0.0035 | 0.2508±0.0040 | 0.5817±0.0139 | 0.3565 |
| Llama-3.1-8B-It | Dense | 0.8003±0.0093 | 0.5910±0.0049 | 0.3380±0.0212 | 0.8182±0.0079 | 0.5179±0.0146 | 0.6790±0.0038 | 0.5552±0.0045 | 0.7395±0.0123 | 0.6299 |
| Llama-3.1-8B-It | Reverse-order | 0.6376±0.0112 | 0.3163±0.0046 | 0.1960±0.0178 | 0.4019±0.0101 | 0.3106±0.0135 | 0.2502±0.0036 | 0.2482±0.0040 | 0.6101±0.0137 | 0.3714 |
| Llama-3.1-8B-It | Random | 0.5588±0.0116 | 0.2730±0.0044 | 0.1280±0.0150 | 0.2826±0.0093 | 0.1903±0.0115 | 0.2406±0.0036 | 0.2555±0.0041 | 0.5020±0.0141 | 0.3039 |
| Llama-3.1-8B-It | PPL | 0.6643±0.0110 | 0.3548±0.0048 | 0.1960±0.0178 | 0.4718±0.0102 | 0.2483±0.0126 | 0.2394±0.0036 | 0.2446±0.0040 | 0.5454±0.0140 | 0.3706 |
| Llama-3.1-8B-It | Magnitude-l1 | 0.5316±0.0116 | 0.2576±0.0044 | 0.1360±0.0153 | 0.2572±0.0090 | 0.1980±0.0116 | 0.2344±0.0036 | 0.2526±0.0040 | 0.4933±0.0141 | 0.2951 |
| Llama-3.1-8B-It | Magnitude-l2 | 0.5316±0.0116 | 0.2576±0.0044 | 0.1360±0.0153 | 0.2572±0.0090 | 0.1980±0.0116 | 0.2344±0.0036 | 0.2526±0.0040 | 0.4933±0.0141 | 0.2951 |
| Llama-3.1-8B-It | BI | 0.5773±0.0115 | 0.2878±0.0045 | 0.1520±0.0161 | 0.3674±0.0099 | 0.1706±0.0110 | 0.2342±0.0036 | 0.2466±0.0040 | 0.5036±0.0141 | 0.3174 |
| Llama-3.1-8B-It | Taylor | 0.6088±0.0114 | 0.3288±0.0047 | 0.1660±0.0167 | 0.4318±0.0102 | 0.2790±0.0131 | 0.2310±0.0036 | 0.2534±0.0041 | 0.6093±0.0137 | 0.3635 |
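For readers who want to try the reverse-order setting from the table above, the sketch below shows one way to drop the trailing decoder layers of a Hugging Face Llama-style checkpoint. It is a minimal illustration, not the released pruning pipeline: the checkpoint id, the choice of eight trailing layers (matching the eight removed layers listed for Llama-3.1-8B-Instruct later in this appendix), and the reliance on `model.model.layers` being an `nn.ModuleList` are all assumptions.

```python
# Minimal sketch (assumed checkpoint id and layer count; not the paper's exact
# pipeline): remove the last n_prune decoder layers of a Llama-style model,
# i.e. "reverse-order" layer pruning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumed HF checkpoint id
n_prune = 8                                       # number of trailing layers to drop

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

keep = model.config.num_hidden_layers - n_prune
model.model.layers = model.model.layers[:keep]    # nn.ModuleList supports slicing
model.config.num_hidden_layers = keep             # keep the config consistent for reloading

model.save_pretrained("llama-3.1-pruned")
tokenizer.save_pretrained("llama-3.1-pruned")
```

Because the trailing layers are removed, the remaining layer indices stay contiguous, so the truncated model can be saved and reloaded like any other checkpoint.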
Table: Benchmark accuracy (± standard error) of Llama-3.1-8B-It under LoRA, QLoRA, and partial-layer fine-tuning (lm_head plus trailing layers).

| Model | Method | Layer | PIQA | HellaSwag | OpenbookQA | ARC-e | ARC-c | MMLU | CMMLU | WinoGrande | Avg Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-It | LoRA | - | 0.7138±0.0105 | 0.4964±0.0050 | 0.2740±0.0200 | 0.6848±0.0095 | 0.4181±0.0144 | 0.2861±0.0038 | 0.2504±0.0040 | 0.7135±0.0127 | 0.4796 |
| Llama-3.1-8B-It | QLoRA | - | 0.6496±0.0111 | 0.3260±0.0047 | 0.1820±0.0173 | 0.4520±0.0102 | 0.2969±0.0134 | 0.3425±0.0040 | 0.2627±0.0041 | 0.5793±0.0139 | 0.3864 |
| Llama-3.1-8B-It | Partial-layer | lm_head only | 0.6752±0.0109 | 0.3685±0.0048 | 0.2100±0.0182 | 0.5349±0.0102 | 0.3276±0.0137 | 0.4315±0.0041 | 0.3373±0.0044 | 0.6795±0.0109 | 0.4456 |
| Llama-3.1-8B-It | Partial-layer | lm_head+last layer | 0.7029±0.0107 | 0.4676±0.0050 | 0.2140±0.0184 | 0.6393±0.0099 | 0.3763±0.0142 | 0.5682±0.0041 | 0.4483±0.0046 | 0.6748±0.0132 | 0.5114 |
| Llama-3.1-8B-It | Partial-layer | lm_head+last two layers | 0.7252±0.0104 | 0.5173±0.0050 | 0.2800±0.0201 | 0.7104±0.0093 | 0.4232±0.0144 | 0.6058±0.0040 | 0.4659±0.0046 | 0.7040±0.0128 | 0.5540 |
| Llama-3.1-8B-It | Partial-layer | lm_head+last three layers | 0.7345±0.0103 | 0.5290±0.0050 | 0.3020±0.0206 | 0.7399±0.0090 | 0.4360±0.0145 | 0.6277±0.0039 | 0.4763±0.0046 | 0.7151±0.0127 | 0.5701 |
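The partial-layer rows above train only the lm_head plus a few trailing decoder layers. The helper below is a rough sketch of how such a freezing scheme could be set up for a Llama-style model before handing it to any standard trainer; the function name and the default of three trailing layers are illustrative assumptions, not the paper's released code.

```python
# Rough sketch: freeze everything except lm_head and the last few decoder
# layers of a Llama-style causal LM (illustrative helper only).
def mark_partial_layer_trainable(model, num_last_layers=3):
    for p in model.parameters():
        p.requires_grad = False                    # freeze the whole network first
    for p in model.lm_head.parameters():
        p.requires_grad = True                     # always train the output head
    for layer in model.model.layers[-num_last_layers:]:
        for p in layer.parameters():
            p.requires_grad = True                 # unfreeze the trailing decoder layers
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable / 1e6:.1f}M")
    return model
```

The resulting model can then be passed to an ordinary supervised fine-tuning loop; unlike LoRA or QLoRA, no adapter modules are introduced.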
Table: Benchmark accuracy (± standard error) for one-shot versus iterative pruning schedules ("Iteration steps") under LoRA and partial-layer fine-tuning, using the Reverse-order and Taylor metrics.

| Fine-tuning Method | Model | Method | Iteration steps | PIQA | HellaSwag | OpenbookQA | ARC-e | ARC-c | MMLU | CMMLU | WinoGrande | Avg Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | Llama-3.1-8B-It | Reverse-order | one-shot | 0.6376±0.0112 | 0.3163±0.0046 | 0.1960±0.0178 | 0.4019±0.0101 | 0.3106±0.0135 | 0.2502±0.0036 | 0.2482±0.0040 | 0.6101±0.0137 | 0.3714 |
| LoRA | Llama-3.1-8B-It | Reverse-order | 1:8:16 | 0.6376±0.0112 | 0.3160±0.0046 | 0.1980±0.0178 | 0.3990±0.0100 | 0.3106±0.0135 | 0.2526±0.0037 | 0.2504±0.0040 | 0.6046±0.0137 | 0.3711 |
| LoRA | Llama-3.1-8B-It | Reverse-order | 1:1:16 | 0.6333±0.0112 | 0.3259±0.0047 | 0.2020±0.0180 | 0.4146±0.0101 | 0.2961±0.0133 | 0.2426±0.0036 | 0.2690±0.0041 | 0.5912±0.0138 | 0.3718 |
| LoRA | Llama-3.1-8B-It | Taylor | one-shot | 0.6088±0.0114 | 0.3288±0.0047 | 0.1660±0.0167 | 0.4318±0.0102 | 0.2790±0.0131 | 0.2310±0.0036 | 0.2534±0.0041 | 0.6093±0.0137 | 0.3635 |
| LoRA | Llama-3.1-8B-It | Taylor | 1:8:16 | 0.6230±0.0113 | 0.3516±0.0048 | 0.1480±0.0159 | 0.4604±0.0102 | 0.2355±0.0124 | 0.2541±0.0037 | 0.2546±0.0041 | 0.5312±0.0140 | 0.3573 |
| LoRA | Llama-3.1-8B-It | Taylor | 1:1:16 | 0.5430±0.0116 | 0.2692±0.0044 | 0.1580±0.0163 | 0.2921±0.0093 | 0.1937±0.0115 | 0.2334±0.0036 | 0.2481±0.0040 | 0.5091±0.0141 | 0.3058 |
| LoRA | Gemma2-2B-It | Reverse-order | one-shot | 0.6050±0.0114 | 0.3049±0.0046 | 0.1900±0.0176 | 0.3817±0.0100 | 0.2491±0.0126 | 0.2327±0.0036 | 0.2527±0.0040 | 0.5580±0.0140 | 0.3468 |
| LoRA | Gemma2-2B-It | Reverse-order | 1:6:12 | 0.6007±0.0114 | 0.3076±0.0046 | 0.1900±0.0176 | 0.3994±0.0101 | 0.2483±0.0126 | 0.2429±0.0036 | 0.2495±0.0040 | 0.5478±0.0140 | 0.3483 |
| LoRA | Gemma2-2B-It | Reverse-order | 1:1:12 | 0.6023±0.0114 | 0.3173±0.0046 | 0.1720±0.0169 | 0.3897±0.0100 | 0.2449±0.0126 | 0.2531±0.0037 | 0.2481±0.0040 | 0.5387±0.0140 | 0.3458 |
| LoRA | Gemma2-2B-It | Taylor | one-shot | 0.6088±0.0114 | 0.3142±0.0046 | 0.1880±0.0175 | 0.4049±0.0101 | 0.2739±0.0130 | 0.2297±0.0035 | 0.2508±0.0040 | 0.5817±0.0139 | 0.3565 |
| LoRA | Gemma2-2B-It | Taylor | 1:6:12 | 0.5909±0.0115 | 0.2806±0.0045 | 0.1380±0.0154 | 0.3834±0.0100 | 0.2150±0.0120 | 0.2295±0.0035 | 0.2523±0.0040 | 0.5059±0.0141 | 0.3245 |
| LoRA | Gemma2-2B-It | Taylor | 1:1:12 | 0.6502±0.0111 | 0.3456±0.0047 | 0.1860±0.0174 | 0.4790±0.0103 | 0.2483±0.0126 | 0.2314±0.0036 | 0.2578±0.0041 | 0.5525±0.0140 | 0.3689 |
| Partial-layer | Llama-3.1-8B-It | Reverse-order | one-shot | 0.6578±0.0111 | 0.4137±0.0049 | 0.2200±0.0185 | 0.5707±0.0102 | 0.3294±0.0137 | 0.3854±0.0040 | 0.3190±0.0043 | 0.6504±0.0134 | 0.4433 |
| Partial-layer | Llama-3.1-8B-It | Reverse-order | 1:1:16 | 0.6774±0.0109 | 0.4164±0.0049 | 0.2200±0.0185 | 0.5863±0.0101 | 0.3362±0.0138 | 0.4170±0.0041 | 0.3460±0.0044 | 0.6385±0.0135 | 0.4547 |
| Partial-layer | Llama-3.1-8B-It | Taylor | one-shot | 0.6649±0.0110 | 0.3985±0.0049 | 0.2100±0.0182 | 0.5581±0.0102 | 0.3251±0.0137 | 0.3054±0.0039 | 0.2876±0.0042 | 0.6212±0.0136 | 0.4214 |
| Partial-layer | Llama-3.1-8B-It | Taylor | 1:1:16 | 0.5876±0.0115 | 0.2813±0.0045 | 0.1300±0.0151 | 0.3986±0.0100 | 0.1980±0.0116 | 0.2508±0.0037 | 0.2502±0.0040 | 0.4957±0.0141 | 0.3240 |
Table: Effect of the number of calibration samples on the layers removed by BI and Taylor for Llama-3.1-8B-Instruct, with the resulting benchmark accuracy (± standard error).

| Model | Metric | Calibration Samples | Removed Layers | PIQA | HellaSwag | OpenbookQA | ARC-e | ARC-c | MMLU | CMMLU | WinoGrande | Avg Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | BI | 1 | 2,3,5,6,7,8,11,12 | 0.7029±0.0107 | 0.4167±0.0049 | 0.2060±0.0181 | 0.6136±0.0100 | 0.2739±0.0130 | 0.2362±0.0036 | 0.2512±0.0040 | 0.5225±0.0140 | 0.40 |
| Llama-3.1-8B-Instruct | BI | 5 | 3,4,5,8,9,10,13,19 | 0.7236±0.0104 | 0.4400±0.0050 | 0.2420±0.0192 | 0.6730±0.0096 | 0.3311±0.0138 | 0.2524±0.0037 | 0.2553±0.0041 | 0.5485±0.0140 | 0.43 |
| Llama-3.1-8B-Instruct | BI | 10 | 2,3,4,5,6,7,8,9 | 0.7176±0.0105 | 0.4196±0.0049 | 0.2020±0.0180 | 0.6107±0.0100 | 0.2841±0.0132 | 0.2417±0.0036 | 0.2494±0.0040 | 0.5391±0.0140 | 0.41 |
| Llama-3.1-8B-Instruct | BI | 30 | 2,3,4,10,11,12,13,14 | 0.7209±0.0105 | 0.4328±0.0049 | 0.2040±0.0180 | 0.6414±0.0098 | 0.3259±0.0137 | 0.2500±0.0036 | 0.2576±0.0041 | 0.5517±0.0140 | 0.42 |
| Llama-3.1-8B-Instruct | BI | 50 | 2,3,4,5,6,7,10,13 | 0.7100±0.0106 | 0.4091±0.0049 | 0.2180±0.0185 | 0.6221±0.0099 | 0.2875±0.0132 | 0.2492±0.0036 | 0.2529±0.0040 | 0.5462±0.0140 | 0.41 |
| Llama-3.1-8B-Instruct | Taylor | 1 | 27, 26, 25, 24, 28, 23, 29, 22 | 0.6088±0.0114 | 0.3288±0.0047 | 0.1660±0.0167 | 0.4318±0.0102 | 0.2790±0.0131 | 0.2310±0.0036 | 0.2534±0.0041 | 0.6093±0.0137 | 0.36 |
| Llama-3.1-8B-Instruct | Taylor | 5 | 24, 26, 25, 28, 27, 23, 29, 22 | 0.6088±0.0114 | 0.3288±0.0047 | 0.1660±0.0167 | 0.4318±0.0102 | 0.2790±0.0131 | 0.2310±0.0036 | 0.2534±0.0041 | 0.6093±0.0137 | 0.36 |
| Llama-3.1-8B-Instruct | Taylor | 10 | 24, 26, 25, 28, 27, 23, 29, 22 | 0.6088±0.0114 | 0.3288±0.0047 | 0.1660±0.0167 | 0.4318±0.0102 | 0.2790±0.0131 | 0.2310±0.0036 | 0.2534±0.0041 | 0.6093±0.0137 | 0.36 |
| Llama-3.1-8B-Instruct | Taylor | 30 | 24, 23, 25, 26, 22, 27, 28, 20 | 0.7280±0.0104 | 0.4985±0.0050 | 0.2460±0.0193 | 0.6961±0.0094 | 0.4130±0.0144 | 0.6611±0.0038 | 0.4915±0.0046 | 0.7032±0.0128 | 0.55 |
| Llama-3.1-8B-Instruct | Taylor | 50 | 24, 23, 25, 26, 22, 27, 28, 20 | 0.7280±0.0104 | 0.4985±0.0050 | 0.2460±0.0193 | 0.6961±0.0094 | 0.4130±0.0144 | 0.6611±0.0038 | 0.4915±0.0046 | 0.7032±0.0128 | 0.55 |
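The table above varies the number of calibration samples used to score layers with BI and Taylor. As a rough illustration of how a first-order Taylor-style layer importance can be accumulated over a calibration set, consider the sketch below. The exact criterion and implementation used in the paper may differ; the data-loader format, the per-layer score definition, and the assumption of a Llama-style `model.model.layers` attribute are all illustrative.

```python
# Rough sketch of a first-order Taylor layer-importance score: accumulate
# |weight * gradient| over calibration batches, summed per decoder layer.
import torch

def taylor_layer_importance(model, calib_loader, device="cuda"):
    """Return one importance score per decoder layer of a Llama-style model."""
    model.to(device).train()
    model.zero_grad()
    for batch in calib_loader:                      # batches of tokenized calibration text
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        out = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=input_ids)               # causal-LM loss on the batch
        out.loss.backward()                         # gradients accumulate across batches
    scores = []
    for layer in model.model.layers:
        score = sum((p.grad * p.detach()).abs().sum().item()
                    for p in layer.parameters() if p.grad is not None)
        scores.append(score)                        # lower scores suggest more prunable layers
    model.zero_grad()
    return scores
```

Under such a criterion, the layers with the smallest scores are the usual removal candidates, which is consistent with the "Removed Layers" column reporting mostly deep layers for Taylor.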
Table: Example sentences generated by Llama-3.1-6.3B-It-Alpaca, Llama-3.1-6.3B-It-Dolly, and Llama-3.1-8B-It (outputs reproduced verbatim).

Llama-3.1-6.3B-It-Alpaca:
- To be or not to be, that is the question that has been asked thousands of times throughout history. but have you ever wondered what it would be like to not exist? what would it feel like to not have consciousness or awareness of any kind? would it be like being asleep or a combed out hair? no, it would be something else entirely. it would be like not being here at all, like being a tree or a rock, a piece of furniture or a piece of trash. nothingness is not something we can really comprehend because we can’t experience it. we can’t even imagine it because we can’t imagine.
- It fills me with such pride and joy knowing that I have been able to help shaped the lives of countless students over the years. I am proud of the numerous accomplishments and achievements of my students, and I am proud of the role I have played in shaping them into the people they are today. As a teacher, I believe it is my duty to provide an optimal learning environment that fosters growth and development of my students. I believe in making learning fun and engaging, and I am dedicated to pushing the bounds of innovation in the classroom. I am dedicated to ensuring that my students leave my classroom with the knowledge and skills they need …
- What’s great about the holiday season? Christmas lights, Christmas shows, Christmas presents, Christmas holiday traditions. But what’s not great about the holiday season? crowds, stress, Santa Claus, Christmas holiday stress, Christmas holiday stressors. It’s important to remember to do things that help you relax during the holiday season, such as taking time for yourself, engaging in relaxation techniques, practicing mindfulness, engaging in physical activity, practicing gratitude, practicing self-care, engaging in activities that bring you joy, and spending time with loved ones. These are all important components of stressors prevention during the holiday season. Here are some tips to help you. …

Llama-3.1-6.3B-It-Dolly:
- To be or not to be, that is the question - Hamlet’s famous soliloquy - could easily apply to my current state of mind. I’ve been struggling to decision whether to go back to school or not. On one hand, I’m really interesting in learning more about the subject matter I’ve been passionate about for years, and I think pursuing a degree would be a great way to formalize my knowledge. On the Hamlet scale, this option would be ”to be”: me myself, fully engaged in learning and growing as a person. On the other hand, I’ve been married for almost 20 years …
- It fills me with such pride and joy to see my little human growing up. yesterday was a special day in our household because we celebrated one of the major milestone in life. my little one cut his first teeth. Yes you heard it right he is now a big boy and his first teeth are now visible. It has been a long journey since we started noticed his teething symptoms and we have been through lot of sleepless nights and teething tantrums but it all pay off now. I can see the relief in his little face when he finally gets to experience the world without the constant pain in his little bottom teeth. As a …
- What’s great about the holiday season is spending time with loved ones and making new traditions. But let’s not forget about the stress and pressure that comes with it. We’re here to help you beat stress during the holiday season with some tips from our #Ask a therapist series with licensed therapist, Dr. Dezympfer. ### tip #1: Prioritize Self-Cark. With all the commotion during the holiday season, it can be easy to forget to take care of yourself. Make sure to schedule time for self-Cark. whether that is taking a yoga class, taking a few minutes each day to meditate, …

Llama-3.1-8B-It:
- To be or not to be, that is the question. How does one decide whether a being is alive or not? The issue of what constitutes life has puzzled philosophers and scientists alike for centuries. A being is alive if it meets the criteria of life, which are: Metabolism: The ability to carry out basic functions such as respiration, photosynthesis, and nutrient uptake. Homeostasis: The ability to maintain a stable internal environment, despite changes in the external environment. Growth and Development: The ability to increase in size and complexity over time. Reproduction: The ability to produce offspring, either sexually or asex
- It fills me with such pride and joy to see my friends’ passions ignite into flame and to watch them soar to great heights! It was an absolutely honour to have been able to share this experience with you and to bear witness to your hard work and dedication! It was an absolute pleasure to get to know you and share our experiences, but I know that this is just the beginning of an amazing journey! I feel so grateful to have you in my life, and I am constantly in awe of your love, kindness, and generosity! It’s been such an incredible journey watching you discover and grow, and I feel so lucky to have…
- What’s great about the holiday season? In addition to being able to spend time with loved ones and get some much-needed R&R, many people enjoy the idea of giving back to others. Whether it’s volunteering, donating to charity, or participating in a Secret Santa gift exchange, the holiday season can be a time of kindness and generosity. But have you ever thought about how you might be able to combine your love of cooking and giving back this holiday season? If so, you might be interested in hosting a charity-themed potluck dinner or bake sale. Here are a few ideas to get you started: Host a potluck dinner to…
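Continuations like the samples above can be reproduced with a short generation script. The sketch below is a minimal example, assuming the Hugging Face repo id matches the released Llama-3.1-6.3B-It-Alpaca checkpoint and using generic sampling settings; the prompts and decoding hyperparameters here are illustrative, not the exact configuration used to produce the samples.

```python
# Minimal generation sketch (assumed repo id and sampling settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "YaoLuzjut/Llama-3.1-6.3B-It-Alpaca"       # assumed to match the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16).to("cuda")

prompt = "To be or not to be,"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs,
                        max_new_tokens=128,
                        do_sample=True,           # sampled decoding; greedy would also work
                        top_p=0.9,
                        temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```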