Vision Transformers (ViT) for Small-Scale Image Classification with Token Reduction

Authors

  • Sajud Hamza Elinjulliparambil, Pace University

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V3I4P112

Keywords:

Vision Transformer (ViT), Token Reduction, Small-Scale Image Classification, Self-Attention, CNN-ViT Hybrid Models, Attention Pooling, Computational Efficiency, Data Efficiency, Transformer Architectures

Abstract

Convolutional Neural Networks (CNNs) are no longer the default choice for image classification: Vision Transformers (ViTs) have proven highly effective at large scale because self-attention lets them capture long-range dependencies. On small image datasets such as CIFAR-10, CIFAR-100, and SVHN, however, standard ViTs are typically data-inefficient, computationally expensive, and redundant in their token representations. These weaknesses stem from ViTs' need for large training corpora and from the quadratic complexity of self-attention in the number of tokens. To address these issues, early studies proposed token reduction techniques such as pruning, pooling, and hybrid CNN-ViT networks, which reduce the number of input-side tokens while preserving the essential visual information. These methods aim to reduce overfitting and improve computational efficiency and generalization in low-data regimes. This review provides a thorough analysis of advances in Vision Transformers for small-scale image classification. It surveys the development of ViTs, their limitations on small datasets, early token reduction methods, and their performance across a range of benchmark tasks. The discussion weighs the advantages and disadvantages of token reduction and outlines directions for future research. The article serves as a reference for researchers investigating efficient transformer-based models specialized for small-scale image classification tasks.
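The quadratic-complexity argument above can be made concrete with a minimal sketch (not from the paper itself): one self-attention layer costs on the order of N² · d operations for N tokens of dimension d, so halving the token count via average pooling, one of the token reduction strategies the review surveys, cuts that cost roughly fourfold. All settings below (patch size, token dimension, window size) are hypothetical ViT-style numbers chosen for illustration.

```python
# Illustrative sketch: token pooling shrinks the quadratic self-attention cost.
# Hypothetical settings: a 32x32 CIFAR image with 4x4 patches gives 64 tokens.

def attention_flops(num_tokens: int, dim: int) -> int:
    """Rough FLOP count for one self-attention layer: the QK^T score
    matrix and the attention-weighted value sum each take ~N^2 * d ops."""
    return 2 * num_tokens * num_tokens * dim

def pool_tokens(tokens, window: int = 2):
    """Average-pool a 1-D token sequence in non-overlapping windows,
    reducing the sequence length by a factor of `window`."""
    pooled = []
    for i in range(0, len(tokens) - window + 1, window):
        group = tokens[i:i + window]
        dim = len(group[0])
        pooled.append([sum(vec[j] for vec in group) / window
                       for j in range(dim)])
    return pooled

tokens = [[float(i)] * 8 for i in range(64)]   # 64 tokens, dim 8
reduced = pool_tokens(tokens, window=2)        # 32 tokens after pooling

print(len(tokens), len(reduced))               # 64 32
print(attention_flops(len(tokens), 8))         # 65536
print(attention_flops(len(reduced), 8))        # 16384 -> ~4x cheaper
```

Real token reduction methods (learned pruning, attention pooling, CNN stems) are content-aware rather than uniform, but the compute arithmetic is the same: attention cost falls with the square of the token count.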

Published

2022-12-30

Section

Articles

How to Cite

Elinjulliparambil SH. Vision Transformers (ViT) for Small-Scale Image Classification with Token Reduction. IJAIBDCMS [Internet]. 2022 Dec. 30 [cited 2026 Apr. 29];3(4):115-22. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/344