Agentic & GenAI Evaluation KDD2025:

KDD workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models

Monday, August 4, 1:00 PM - 5:00 PM. Toronto, ON, Canada. Held in conjunction with KDD'25, Toronto Convention Centre.


Welcome to Agentic & GenAI Evaluation KDD 2025!

The rapid advancement of generative and agentic AI models has ushered in transformative applications across diverse domains, ranging from creative content generation to autonomous decision-making systems. However, as these models become more capable, their widespread deployment raises pressing concerns about evaluation, reliability, and ethical implications. Current evaluation methods are insufficient for capturing the complexities and risks of generative outputs and autonomous AI behaviors.

To bridge this gap, robust evaluation frameworks are needed to assess these models holistically, ensuring they are not only performant but also aligned with societal values and safety expectations. Current benchmarks primarily focus on standard performance metrics, overlooking critical aspects such as trustworthiness, interpretability, and real-world usability. Without rigorous evaluation methodologies, generative and agentic AI systems may inadvertently perpetuate biases, propagate misinformation, or act unpredictably in high-stakes scenarios.

This workshop aims to foster interdisciplinary collaboration by bringing together researchers, industry practitioners, and policymakers to develop advanced evaluation techniques for generative and agentic AI. Our discussions will focus on new methodologies for assessing reasoning capabilities, ethical robustness, and cross-modal generation, along with scalable and user-centric evaluation frameworks. By addressing these challenges, we seek to pave the way for more reliable and responsible AI systems that can be safely integrated into society.

Contact: kdd2025-ws-agentic-genai-eval@amazon.com

Call for Contributions

  • Link to the submission website: https://openreview.net/group?id=KDD.org/2025/Workshop/Agentic-GenAI-Eval
  • This workshop will focus on the unique challenges and opportunities presented by the intersection of evaluation and trustworthiness in the context of generative and agentic AI. Generative AI models, such as large language models (LLMs) and diffusion models, have shown remarkable abilities in generating human-quality text, images, and other forms of content. Agentic AI refers to AI systems that can act autonomously and purposefully, raising critical concerns about safety, control, and alignment with human values. Evaluating these advanced AI models requires going beyond traditional metrics and benchmarks. We need new methods and frameworks to assess their performance, identify potential biases, and ensure they are used responsibly. This workshop will delve into these issues, with a particular focus on the following topics:

    • Agentic AI Evaluation: Assessing autonomous AI behavior, decision-making, goal alignment, adaptability, security, privacy, tool use, memory, and self-verification in dynamic and open-ended environments.
    • Trustworthiness in Generative AI Models: Truthfulness and reliability, safety and security, bias and fairness, ethical and privacy considerations, misuse resistance, explainability, and robustness.
    • User-Centric Assessment: Evaluating AI from a user experience perspective, including trust calibration, mental models, and usability.
    • Multi-Perspective Evaluation: Emphasizing logical reasoning, knowledge depth, problem-solving abilities, contextual understanding, and user alignment.
    • Evaluating Reasoning Models: Measuring AI's ability to conduct step-by-step reasoning, causal inference, and complex problem-solving across various domains.
    • Efficient Evaluation Methods: Scalable, automated, and cost-effective approaches for assessing generative AI performance with minimal manual oversight.
    • Synthetic Data Generation Evaluation: Assessing the quality, representativeness, and bias implications of synthetic data used for AI training and evaluation.
    • Evaluating Misinformation and Manipulative Content: Techniques for detecting, measuring, and mitigating misinformation propagation by generative models.
    • Cross-Modal Evaluation: Assessing AI's capabilities across text, image, audio, and multimodal generation.
    • Holistic Evaluation Frameworks: Developing standardized datasets, metrics, and methodologies for comprehensive AI assessment.

    Keynote Speakers

    Haixun Wang, Hua Wei

    Haixun Wang

    Title: Trustworthy Document AI: From "Good Enough" to Mission-Critical

    Abstract: Large Language Models and multimodal generative systems can now read dense contracts, physician notes, and photographed invoices with impressive fluency. But in courtrooms, clinics, and underwriting desks, “pretty good” still courts unacceptable risk. This talk frames Document AI as an end-to-end product challenge: we will explore how reliable performance emerges not from bigger models alone, but from principled evaluation and structured design. We’ll show how QA-driven metrics, especially reinforcement learning with automated reward signals, enable scalable self-improvement and replace vague benchmarks with real-world accountability. Agentic systems, which decompose schema-driven tasks into orchestrated micro-agents, offer a more transparent and cost-effective path than monolithic LLMs. Across modalities and use cases, from legal extraction to voice agents, we argue for Document AI as infrastructure: modular, measurable, and mission-ready.

    Bio:

    Hua Wei

    Title: Rethinking Evaluation for Agentic AI: Lessons from Transportation and Education

    Abstract: As large language models increasingly serve as autonomous agents capable of reasoning, planning, and acting within complex environments, evaluating their real-world behavior has become a pressing challenge. In this talk, I will share empirical insights from three systems we developed in the domains of transportation and education. These deployments reveal practical limitations of existing evaluation methodologies and highlight the need for more context-aware, user-centered, and uncertainty-informed evaluation strategies. I will share the key lessons we learned from these case studies and propose directions for building trustworthy and socially integrated agentic AI systems.

    Bio: Hua Wei is an Assistant Professor in the School of Computing and Augmented Intelligence (SCAI) at Arizona State University (ASU). His research interests lie in data mining and machine learning, with a particular emphasis on spatio-temporal data mining and reinforcement learning. His work has received multiple Best Paper Awards, including honors from ECML-PKDD and ICCPS. His research has been supported by various funding agencies, including the National Science Foundation (NSF), the Department of Energy (DoE), and the Department of Transportation (DoT). Dr. Wei is a recipient of the Amazon Research Award and the NSF CAREER Award in 2025.


    SCHEDULE

    August 04, 2025 (1:00-5:00 PM), Toronto Convention Centre, Room [602]; posters at the 700 foyer

      Opening
      1:00 - 1:05 PM

    Introduction by organizers.

      Keynote Talk 1: Trustworthy Document AI: From "Good Enough" to Mission-Critical
      1:05 - 1:50 PM

    Haixun Wang

      Paper Session 1:
      1:50 - 2:35 PM

    1:50 PM - 2:00 PM, The Optimization Paradox in Clinical AI Multi-Agent Systems, Suhana Bedi (10 minutes)

    2:00 PM - 2:10 PM, Cybernaut: Towards Reliable Web Automation, Ankur Tomar (10 minutes)

    2:10 PM - 2:20 PM, Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection, Nikita Kozodoi (10 minutes)

    2:20 PM - 2:30 PM, Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios, Zhuang Qiang Bok (10 minutes)

      Posters & Networking - Location: 700 foyer (poster boards labeled "Evaluation and Trustworthiness of Agentic and Generative AI")
      2:35 - 3:00 PM

    Poster presentations and networking opportunity

      Coffee Break
      3:00 - 3:30 PM
      Keynote Talk 2: Rethinking Evaluation for Agentic AI: Lessons from Transportation and Education
      3:30 - 4:15 PM

    Hua Wei

      Paper Session 2:
      4:15 - 5:00 PM

    4:15 - 4:25 PM, Evaluating the LLM-simulated Impacts of Big Five Personality Traits and AI Capabilities on Social Negotiations, Myke C. Cohen (10 minutes)

    4:25 - 4:35 PM, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, Jessup Byun (10 minutes)

    4:35 - 4:45 PM, FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts, Hagyeong Shin (10 minutes)

    4:45 - 4:55 PM, Evaluating Large Language Models for Semi-Structured Data Manipulation Tasks: A Survey Platform Case Study, Vinicius Monteiro de Lira (10 minutes)

      Closing
      5:00 - 5:05 PM

    Closing remarks by organizers.


    Accepted Papers

    Below are the accepted papers for the workshop.

      Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems

    Authors: Hassen Dhrif

      JokeEval: Are the Jokes Funny? Review of Computational Evaluation Techniques to improve Joke Generation

    Authors: Sulbha Jain

      TITAN: Task-oriented Prompt Improvement with Script Generation

    Authors: Chung-Yu Wang, Alireza DaghighFarsoodeh, Hung Viet Pham

      Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection

    Authors: Jack Butler, Nikita Kozodoi, Zainab Afolabi, Brian Tyacke, Gaiar Baimuratov

      Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation

    Authors: Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi

      A Tri-Agent Framework for Evaluating and Aligning Question Clarification Capabilities of Large Language Models

    Authors: Yikai Zhao

      Towards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering

    Authors: Tung Sum Thomas Kwok, Zeyong Zhang, Chi-Hua Wang, Guang Cheng

      AutoGrader: Hybrid Evaluation of GenAI Product Images via Contrastive Embeddings and LLM-Generated Grading Notes

    Authors: Ramesh Chadalavada, Sumanta Kashyapi, Ramkumar Subramanian, Matt Witten

      Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios

    Authors: Zhuang Qiang Bok, Watson Wei Khong Chua

      The Optimization Paradox in Clinical AI Multi-Agent Systems

    Authors: Suhana Bedi, Iddah Mlauzi, Daniel Shin, Sanmi Koyejo, Nigam H. Shah

      How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions

    Authors: Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah

      Text-to-SQL Evaluation at Scale: A Semi-Automatic Approach for Benchmark Data Generation

    Authors: Melody Xuan, Zhenyu Zhang, Çağatay Demiralp, Xinyang Shen, Ozalp Ozer, Niti Kalra

      Cybernaut: Towards Reliable Web Automation

    Authors: Ankur Tomar, Hengyue Liang, Indranil Bhattacharya, Natalia Larios, Francesco Carbone

      FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts

    Authors: Hagyeong Shin, Binoy Robin Dalal, Iwona Bialynicka-Birula, Navjot Matharu, Ryan Muir, Xingwei Yang, Samuel W. K. Wong

      Seeing What's Wrong: A Trajectory-Guided Approach to Caption Error Detection

    Authors: Gabriel Isaac Afriat, Ryan Lucas, Xiang Meng, Yufang Hou, Yada Zhu, Rahul Mazumder

      Evaluating Large Language Models for Semi-Structured Data Manipulation Tasks: A Survey Platform Case Study

    Authors: Vinicius Monteiro de Lira, Antonio Maiorino, Peng Jiang, Rouzbeh Torabian Esfahani

      DialogueForge: LLM Simulation of Human-Chatbot Dialogue

    Authors: Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Małgorzata Łazuka, Elliott Ash

      Evaluating the Robustness of Dense Retrievers in Interdisciplinary Domains

    Authors: Sarthak Chaturvedi, Anurag Acharya, Rounak Meyur, Koby Hayashi, Sai Munikoti, Sameera Horawalavithana

      Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs

    Authors: Raj Krishnan Vijayaraj

      Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models

    Authors: Rabimba Karanjai, Yang Lu, Ranjith Chodavarapu, Lei Xu, Weidong Shi

      Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs

    Authors: Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boatang, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan

      What If The Patient Were Different? A Framework To Audit Biases and Toxicity in LLM Clinical Note Generation

    Authors: Daeja Oxendine, Swetasudha Panda, Naveen Jafer Nizar, Qinlan Shen, Sumana Srivatsa, Krishnaram Kenthapadi

      Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation

    Authors: Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng

      Evaluating the LLM-simulated Impacts of Big Five Personality Traits and AI Capabilities on Social Negotiations

    Authors: Myke C. Cohen, Hsien-Te Kao, Daniel Nguyen, Zhe Su, Maarten Sap

      Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

    Authors: Xinwei Wu, Hongyu Lin, Haojie Li, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang

    Submission Guidelines

    • Please ensure your paper submission is anonymous.
    • The accepted papers will be posted on the workshop website but will not be included in the KDD proceedings.
    • Paper submissions are limited to 9 pages, excluding references. Submissions must be in PDF format and use the ACM Conference Proceedings template.
    • Additional supplemental material focused on reproducibility can be provided. Proofs, pseudo-code, and code may also be included in the supplement, which has no explicit page limit. The supplement format could be either single column or double column. The paper should be self-contained, since reviewers are not required to read the supplement.
    • The Word template guideline can be found here: [link]
    • The Latex/overleaf template guideline can be found here: [link]
    • A paper should be submitted in PDF format through OpenReview at the following link: https://openreview.net/group?id=KDD.org/2025/Workshop/Agentic-GenAI-Eval