Human-AI Collaboration in Software Teams: Evaluating Productivity, Quality, and Knowledge Transfer with Agentic and LLM-Based Tools

Authors

  • Ravikanth Konda, Software Application Engineer

DOI:

https://doi.org/10.63282/3050-9416.ICAIDSCT26-128

Keywords:

Human-AI Collaboration, Software Engineering Productivity, Code Quality, Knowledge Transfer, Large Language Models (LLMs), Agentic AI, Software Teams, Pair Programming, Software Review, SWE-Bench, Developer Experience, Socio-Technical Systems, Empirical Software Engineering, Governance Frameworks

Abstract

The ascendancy of large language models (LLMs) and the nascent age of agentic AI systems are changing how software teams work, create, and maintain code. These tools have evolved from static code-completion helpers into mission-critical components of the software development lifecycle that affect productivity, code quality, and how knowledge is created and transferred among teams. Despite widespread deployment, however, their fine-grained effects on socio-technical systems remain understudied, especially at the team level, where human-to-human interaction meets AI augmentation. This paper offers a systematic review of human-AI collaboration in software teams, focusing on productivity, quality, and knowledge transfer, and introduces a three-dimensional framework for viewing this phenomenon. Unlike previous work, which mainly concentrates on individual developer productivity or task-level efficiency, this work focuses on team-oriented processes and the organizational rules and governance structures that determine whether AI adoption delivers sustained value or latent risk. The framework is informed by evidence from randomized controlled trials, field deployments, and benchmark evaluations. Controlled studies (e.g., GitHub Copilot RCTs) report time savings of up to 55–56% on bounded coding tasks, while enterprise case studies report concurrent improvements in developer satisfaction and reductions in cognitive load. Nevertheless, repository-level benchmarks such as SWE-bench indicate that LLMs and agents still fall short on complex multi-file changes and subtle bug fixes, underscoring the need for human-in-the-loop governance and robust validation pipelines. By the same token, research on AI-augmented code review and human-AI pair programming reveals efficiency gains alongside non-trivial trade-offs: reviews are faster and junior engineers may be onboarded more easily, but critical-thinking skills can atrophy and dialogic inspection often weakens, inviting knowledge dilution and undetected design mistakes.

Methodologically, the paper is based on a quasi-experimental mixed-methods study design reproducible in enterprise software development environments. The design combines sprint-wise cohort comparisons (baseline, LLM-assisted, and agentic), backlog-aligned task batteries, and multi-modal instrumentation of developer effort. Productivity is measured not by cycle time alone but also by rework ratio, backlog velocity, and context-switching cost. Quality is assessed through pre-merge signals (test pass rates, static-analysis findings, review iterations) and post-release outcomes (defect density, incident frequency, and rollback rates). Knowledge transfer is modeled by analyzing pull-request rationales, design decision logs, pair-programming dialogues, and file-ownership distributions as indicators of organizational memory. Data triangulation strengthens validity, with quantitative telemetry complemented by qualitative interviews about trust, usability, and team dynamics.

Applying the framework to representative cases yields several insights. First, AI copilots deliver consistent productivity gains on scaffolding, test generation, and documentation tasks, but not on architectural work or tasks whose boundaries are poorly defined. Second, quality improves when AI output is gated by disciplined engineering practices such as tests, linters, and structured reviews, but degrades when excessive trust leads to uncritical acceptance of plausible-but-wrong code. Third, knowledge transfer changes in character rather than disappearing: surface-level learning accelerates, but deeper dialogic learning weakens unless interactions are supported by protocols that enforce rationale capture, architecture reviews, and peer-to-peer explanation. Crucially, gains in comfort and reductions in cognitive effort are tempered by survey data showing declining trust in accuracy, illustrating a paradox of enthusiastic skepticism among developers.

The discussion translates these results into an adoption playbook for industry and provides maturity models for integrating LLM and agentic tools. Key recommendations include progressive autonomy (from suggestion-only, to constrained edits, to supervised multi-file changes), policy guardrails (security scanning, coverage thresholds, auditability), and socio-technical rituals (review-by-explanation, decision records, human validation of AI-generated rationale). By framing human-AI collaboration as augmentation rather than replacement within disciplined workflows, the paper positions LLM and agentic tooling as a lever for improving productivity and quality while sustaining knowledge transfer and minimizing the risk of organizational knowledge loss.
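
To give a concrete sense of the telemetry reduction described above, the following minimal sketch (hypothetical signal names and values, not the paper's actual pipeline) shows how sprint-level data from each cohort might be reduced to the indicators used in the framework, namely rework ratio, backlog velocity, and defect density.

# Illustrative sketch only: reduces hypothetical sprint telemetry to the
# productivity and quality indicators discussed in the abstract.
from dataclasses import dataclass

@dataclass
class SprintTelemetry:
    cohort: str                  # "baseline", "llm_assisted", or "agentic"
    merged_prs: int
    reworked_prs: int            # PRs reopened or substantially revised after review
    backlog_items_done: int
    post_release_defects: int
    kloc_changed: float          # thousands of lines changed in the sprint

def sprint_indicators(t: SprintTelemetry) -> dict:
    """Compute rework ratio, backlog velocity, and defect density for one sprint."""
    return {
        "cohort": t.cohort,
        "rework_ratio": t.reworked_prs / max(t.merged_prs, 1),
        "backlog_velocity": t.backlog_items_done,
        "defect_density": t.post_release_defects / max(t.kloc_changed, 0.001),
    }

if __name__ == "__main__":
    baseline = SprintTelemetry("baseline", merged_prs=40, reworked_prs=8,
                               backlog_items_done=22, post_release_defects=5,
                               kloc_changed=6.0)
    assisted = SprintTelemetry("llm_assisted", merged_prs=55, reworked_prs=9,
                               backlog_items_done=30, post_release_defects=6,
                               kloc_changed=8.5)
    for sprint in (baseline, assisted):
        print(sprint_indicators(sprint))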
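
The progressive-autonomy and guardrail recommendations can likewise be pictured as a simple policy gate. The sketch below uses hypothetical autonomy levels and thresholds purely for illustration of the idea; actual policies would be tuned to an organization's risk posture.

# Illustrative policy gate for progressively autonomous AI-authored changes
# (hypothetical levels and thresholds; not a prescribed implementation).
from enum import IntEnum

class AutonomyLevel(IntEnum):
    SUGGEST_ONLY = 0           # AI proposes, human writes the change
    CONSTRAINED_EDIT = 1       # AI edits a single file under review
    SUPERVISED_MULTIFILE = 2   # AI performs multi-file changes, human approves

def change_allowed(level: AutonomyLevel, files_touched: int,
                   test_coverage: float, security_scan_passed: bool) -> bool:
    """Gate an AI-authored change according to guardrail policy."""
    if not security_scan_passed:
        return False
    if level == AutonomyLevel.SUGGEST_ONLY:
        return files_touched == 0                       # suggestions only, no direct edits
    if level == AutonomyLevel.CONSTRAINED_EDIT:
        return files_touched <= 1 and test_coverage >= 0.80
    return files_touched <= 10 and test_coverage >= 0.90  # supervised multi-file

if __name__ == "__main__":
    ok = change_allowed(AutonomyLevel.CONSTRAINED_EDIT,
                        files_touched=1, test_coverage=0.85,
                        security_scan_passed=True)
    print(ok)  # True: single-file edit with adequate coverage and clean scan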

References

1. S. Peng, et al., “The Impact of AI on Developer Productivity: Evidence from a Controlled Experiment with GitHub Copilot,” arXiv preprint arXiv:2302.06527, Microsoft Research, Feb. 2023.

2. GitHub, “Quantifying GitHub Copilot’s Impact in the Enterprise: Developer Productivity and Satisfaction,” GitHub White Paper, May 2024.

3. GitHub, “Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and Happiness,” GitHub, Sept. 2022.

4. SWE-bench Team, “SWE-bench: A Benchmark for Repository-Level Software Engineering Tasks,” GitHub/Leaderboard Resources, 2023–2025.

5. W. Liang, Y. Zhang, and D. Lo, “An Exploratory Evaluation of Large Language Models Using Empirical Software Engineering Tasks,” in Proc. Internetware Conf., ACM, July 2024.

6. C. E. Jimenez, et al., “Can Language Models Resolve Real-World GitHub Issues? Evaluating on SWE-bench,” arXiv preprint arXiv:2310.06770, Oct. 2023.

7. Y. Almeida, T. Pimentel, and A. Serebrenik, “AICodeReview: Advancing Code Quality with AI-Enhanced Code Review,” Journal of Systems and Software, vol. 209, 112922, Feb. 2024.

8. Stack Overflow, “2025 Developer Survey: AI Adoption and Trust in Software Engineering,” Stack Overflow Insights, July 2025.

9. A. Welter, M. Gerlach, and T. Fritz, “From Developer Pairs to AI Copilots: A Comparative Study on Knowledge Transfer in Pair Programming,” arXiv preprint arXiv:2506.01234, June 2025.

10. G. Fan, L. He, and X. Li, “Impact of AI-Assisted Pair Programming on Student Motivation, Anxiety, and Performance,” International Journal of STEM Education, vol. 12, no. 1, pp. 1–18, Jan. 2025.

11. S. Rasnayaka, H. Wang, and P. Liang, “An Empirical Study on Usage and Perceptions of Large Language Models in Software Engineering Projects,” Empirical Software Engineering, vol. 29, no. 3, pp. 1–28, Mar. 2024.

12. McKinsey & Company, “Unleashing Developer Productivity with Generative AI,” McKinsey Report, June 2023.

13. Reuters, “Over 40% of Agentic AI Projects Will Be Scrapped by 2027, Gartner Says,” Reuters Technology Report, June 25, 2025.

14. Microsoft Research, The New Future of Work Report 2023: The Role of Generative AI in Knowledge Work, Microsoft Research, Dec. 2023.

15. Microsoft Research, Generative AI in Real-World Workplaces: Empirical Insights from Enterprise Deployments, Microsoft Research, July 2024.


Published

2026-02-17

How to Cite

1. Konda R. Human-AI Collaboration in Software Teams: Evaluating Productivity, Quality, and Knowledge Transfer with Agentic and LLM-Based Tools. IJAIBDCMS [Internet]. 2026 Feb. 17 [cited 2026 Feb. 17];:250-7. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/418