Monday, August 4, 2025, 1:00 PM - 5:00 PM, Toronto, ON, Canada. Held in conjunction with KDD'25 at the Toronto Convention Centre.
The rapid advancement of generative and agentic AI models has ushered in transformative applications across diverse domains, ranging from creative content generation to autonomous decision-making systems. However, as these models become more capable, their widespread deployment raises pressing concerns about evaluation, reliability, and ethical implications. Current evaluation methods are insufficient for capturing the complexities and risks of generative outputs and autonomous AI behaviors.
To bridge this gap, robust evaluation frameworks are needed to assess these models holistically, ensuring they are not only performant but also aligned with societal values and safety expectations. Current benchmarks primarily focus on standard performance metrics, overlooking critical aspects such as trustworthiness, interpretability, and real-world usability. Without rigorous evaluation methodologies, generative and agentic AI systems may inadvertently perpetuate biases, propagate misinformation, or act unpredictably in high-stakes scenarios.
This workshop aims to foster interdisciplinary collaboration by bringing together researchers, industry practitioners, and policymakers to develop advanced evaluation techniques for generative and agentic AI. Our discussions will focus on new methodologies for assessing reasoning capabilities, ethical robustness, and cross-modal generation, along with scalable and user-centric evaluation frameworks. By addressing these challenges, we seek to pave the way for more reliable and responsible AI systems that can be safely integrated into society.
Contact: kdd2025-ws-agentic-genai-eval@amazon.com
This workshop will focus on the unique challenges and opportunities presented by the intersection of evaluation and trustworthiness in the context of generative and agentic AI. Generative AI models, such as large language models (LLMs) and diffusion models, have shown remarkable abilities in generating human-quality text, images, and other forms of content. Agentic AI refers to AI systems that can act autonomously and purposefully, raising critical concerns about safety, control, and alignment with human values. Evaluating these advanced AI models requires going beyond traditional metrics and benchmarks. We need new methods and frameworks to assess their performance, identify potential biases, and ensure they are used responsibly. This workshop will delve into these issues in depth.
Keynote Speakers: Haixun Wang, Hua Wei
Title: Trustworthy Document AI: From "Good Enough" to Mission-Critical
Abstract: Large Language Models and multimodal generative systems can now read dense contracts, physician notes, and photographed invoices with impressive fluency. But in courtrooms, clinics, and underwriting desks, “pretty good” still courts unacceptable risk. This talk frames Document AI as an end-to-end product challenge: we will explore how reliable performance emerges not from bigger models alone, but from principled evaluation and structured design. We’ll show how QA-driven metrics, especially reinforcement learning with automated reward signals, enable scalable self-improvement and replace vague benchmarks with real-world accountability. Agentic systems, which decompose schema-driven tasks into orchestrated micro-agents, offer a more transparent and cost-effective path than monolithic LLMs. Across modalities and use cases, from legal extraction to voice agents, we argue for Document AI as infrastructure: modular, measurable, and mission-ready.
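The abstract's premise of QA-driven metrics serving as automated reward signals can be made concrete with a small sketch. What follows is a minimal illustration, not the speaker's actual system: the function name, field names, and sample data are all hypothetical, and a real pipeline might score free-text answers with fuzzy matching or an LLM judge rather than exact match.

```python
# Minimal sketch: a QA-driven reward for document extraction.
# Each QA probe pairs a schema field with its gold answer; the reward is
# the fraction of probes the extracted record answers correctly.
# All names and data below are hypothetical illustrations.

def exact_match_reward(predicted: dict, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of QA probes whose gold answer matches the extracted field."""
    if not qa_pairs:
        return 0.0
    hits = sum(
        1
        for field, gold in qa_pairs
        if str(predicted.get(field, "")).strip().lower() == gold.strip().lower()
    )
    return hits / len(qa_pairs)

# Hypothetical usage: the score could serve as the reward in an RL loop
# that fine-tunes the extractor toward verifiable answers.
predicted = {"invoice_total": "1,284.00", "due_date": "2025-09-01"}
qa_pairs = [("invoice_total", "1,284.00"), ("due_date", "2025-09-15")]
print(exact_match_reward(predicted, qa_pairs))  # -> 0.5
```

Exact match is the simplest possible automated signal; its appeal is that it is cheap to compute at scale and tied directly to downstream accountability, which is what lets it stand in for vaguer benchmark scores.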
Bio:
Title: Rethinking Evaluation for Agentic AI: Lessons from Transportation and Education
Abstract: As large language models increasingly serve as autonomous agents capable of reasoning, planning, and acting within complex environments, evaluating their real-world behavior has become a pressing challenge. In this talk, I will share empirical insights from three systems we developed in the domains of transportation and education. These deployments reveal practical limitations of existing evaluation methodologies and highlight the need for more context-aware, user-centered, and uncertainty-informed evaluation strategies. I will distill the key lessons learned from these case studies and propose directions for building trustworthy and socially integrated agentic AI systems.
Bio: Hua Wei is an Assistant Professor in the School of Computing and Augmented Intelligence (SCAI) at Arizona State University (ASU). His research interests lie in data mining and machine learning, with a particular emphasis on spatio-temporal data mining and reinforcement learning. His work has received multiple Best Paper Awards, including honors from ECML-PKDD and ICCPS. His research has been supported by various funding agencies, including the National Science Foundation (NSF), the Department of Energy (DoE), and the Department of Transportation (DoT). Dr. Wei is a recipient of the Amazon Research Award and the NSF CAREER Award in 2025.
August 4, 2025 (1:00-5:00 PM), Toronto Convention Centre, Room 602; posters at the 700 foyer.
Introduction by organizers.
Keynote: Haixun Wang
1:50 PM - 2:00 PM, The Optimization Paradox in Clinical AI Multi-Agent Systems, Suhana Bedi (10 minutes)
2:00 PM - 2:10 PM, Cybernaut: Towards Reliable Web Automation, Ankur Tomar (10 minutes)
2:10 PM - 2:20 PM, Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection, Nikita Kozodoi (10 minutes)
2:20 PM - 2:30 PM, Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios, Zhuang Qiang Bok (10 minutes)
Poster presentations and networking opportunity
Keynote: Hua Wei
4:15 PM - 4:25 PM, Evaluating the LLM-simulated Impacts of Big Five Personality Traits and AI Capabilities on Social Negotiations, Myke C. Cohen (10 minutes)
4:25 PM - 4:35 PM, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, Jessup Byun (10 minutes)
4:35 PM - 4:45 PM, FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts, Hagyeong Shin (10 minutes)
4:45 PM - 4:55 PM, Evaluating Large Language Models for Semi-Structured Data Manipulation Tasks: A Survey Platform Case Study, Vinicius Monteiro de Lira (10 minutes)
Closing remarks by organizers.
Below are the accepted papers for the workshop.
Authors: Hassen Dhrif
Authors: Sulbha Jain
Authors: Chung-Yu Wang, Alireza DaghighFarsoodeh, Hung Viet Pham
Authors: Jack Butler, Nikita Kozodoi, Zainab Afolabi, Brian Tyacke, Gaiar Baimuratov
Authors: Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi
Authors: Yikai Zhao
Authors: Tung Sum Thomas Kwok, Zeyong Zhang, Chi-Hua Wang, Guang Cheng
Authors: Ramesh Chadalavada, Sumanta Kashyapi, Ramkumar Subramanian, Matt Witten
Authors: Zhuang Qiang Bok, Watson Wei Khong Chua
Authors: Suhana Bedi, Iddah Mlauzi, Daniel Shin, Sanmi Koyejo, Nigam H. Shah
Authors: Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah
Authors: Melody Xuan, Zhenyu Zhang, Çağatay Demiralp, Xinyang Shen, Ozalp Ozer, Niti Kalra
Authors: Ankur Tomar, Hengyue Liang, Indranil Bhattacharya, Natalia Larios, Francesco Carbone
Authors: Hagyeong Shin, Binoy Robin Dalal, Iwona Bialynicka-Birula, Navjot Matharu, Ryan Muir, Xingwei Yang, Samuel W. K. Wong
Authors: Gabriel Isaac Afriat, Ryan Lucas, Xiang Meng, Yufang Hou, Yada Zhu, Rahul Mazumder
Authors: Vinicius Monteiro de Lira, Antonio Maiorino, Peng Jiang, Rouzbeh Torabian Esfahani
Authors: Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Małgorzata Łazuka, Elliott Ash
Authors: Sarthak Chaturvedi, Anurag Acharya, Rounak Meyur, Koby Hayashi, Sai Munikoti, Sameera Horawalavithana
Authors: Raj Krishnan Vijayaraj
Authors: Rabimba Karanjai, Yang Lu, Ranjith Chodavarapu, Lei Xu, Weidong Shi
Authors: Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boatang, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan
Authors: Daeja Oxendine, Swetasudha Panda, Naveen Jafer Nizar, Qinlan Shen, Sumana Srivatsa, Krishnaram Kenthapadi
Authors: Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng