Energy-Efficient AI Inference at the Edge: Optimizing Semiconductor Hardware for Small Language Models
DOI: https://doi.org/10.63282/3050-9416.IJAIBDCMS-V7I1P132

Keywords: Edge artificial intelligence, Small language models, Energy-efficient inference, Semiconductor AI accelerators, Edge computing hardware, Model quantization, Neural network compression, Hardware–software co-design, Embedded AI systems, Low-power machine learning

Abstract
The rapid expansion of artificial intelligence applications across mobile devices, Internet of Things (IoT) platforms, and embedded systems has intensified the demand for efficient on-device inference. While large language models have demonstrated remarkable performance on natural language processing tasks, their computational and energy requirements make them impractical for deployment in resource-constrained edge environments. Small Language Models (SLMs) have therefore emerged as a promising alternative for enabling localized intelligence with a manageable computational footprint. Efficient inference for these models, however, still depends on the capabilities of the underlying semiconductor hardware and on the effectiveness of hardware-aware optimization strategies. This study examines the design considerations necessary for energy-efficient inference of small language models on edge computing platforms. The paper analyzes how semiconductor-level architectural features, such as neural processing units, specialized tensor accelerators, and optimized memory hierarchies, influence inference latency and energy consumption. In addition, it investigates model optimization techniques, including low-precision quantization, parameter pruning, and hardware-aware scheduling, that allow language models to operate efficiently on embedded processors and dedicated AI accelerators. A system-level framework is proposed that integrates semiconductor hardware capabilities with model compression techniques to improve inference efficiency without significantly degrading predictive performance.
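To make the low-precision quantization mentioned above concrete, the following minimal NumPy sketch shows one common scheme: symmetric per-tensor INT8 quantization, in which a single scale maps float weights onto the range [-127, 127]. This is an illustrative assumption, not the specific method used in the study; function names and parameters are hypothetical.

import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map float weights to [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # one scale shared by the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Illustrative use: quantize a random weight matrix and measure the error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())  # roughly bounded by scale / 2

Storing q instead of w cuts weight memory by 4x relative to FP32, which is the main source of the energy and latency savings the abstract attributes to low-precision inference on edge accelerators.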
The study further evaluates performance characteristics across representative edge hardware platforms using metrics such as energy consumption per inference, inference latency, and throughput. The findings indicate that coordinated optimization across model architecture and semiconductor hardware design can significantly reduce energy requirements while sustaining real-time processing capabilities. These results highlight the importance of hardware–software co-design in enabling scalable and sustainable deployment of language models in edge environments. The proposed framework provides practical guidance for the development of next-generation edge AI systems capable of supporting language-based applications with improved energy efficiency and operational autonomy.
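The three evaluation metrics named above are related by a simple identity: energy per inference equals average power draw multiplied by per-inference latency (E = P * t). The back-of-the-envelope sketch below illustrates the computation, assuming power and batch latency have been measured externally (for example, with an inline power monitor); all numbers are placeholders, not results from the study.

# Placeholder measurements, not values reported in the paper.
avg_power_w = 3.5          # measured average power during inference (watts)
batch_latency_s = 0.040    # measured wall-clock time per batch (seconds)
batch_size = 8             # inferences processed per batch

latency_per_inference_s = batch_latency_s / batch_size
throughput_ips = batch_size / batch_latency_s                    # inferences per second
energy_per_inference_j = avg_power_w * latency_per_inference_s   # joules, E = P * t

print(f"latency per inference: {latency_per_inference_s * 1e3:.2f} ms")   # 5.00 ms
print(f"throughput: {throughput_ips:.1f} inferences/s")                   # 200.0 inferences/s
print(f"energy per inference: {energy_per_inference_j * 1e3:.1f} mJ")     # 17.5 mJ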