Agentic & GenAI Evaluation KDD2025:

KDD workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models

Monday, August 4, 1:00 PM - 5:00 PM. Toronto, ON, Canada. Held in conjunction with KDD'25, Toronto Convention Centre.


Welcome to Agentic & GenAI Evaluation KDD 2025!

The rapid advancement of generative and agentic AI models has ushered in transformative applications across diverse domains, ranging from creative content generation to autonomous decision-making systems. However, as these models become more capable, their widespread deployment raises pressing concerns about evaluation, reliability, and ethical implications. Current evaluation methods are insufficient for capturing the complexities and risks of generative outputs and autonomous AI behaviors.

To bridge this gap, robust evaluation frameworks are needed to assess these models holistically, ensuring they are not only performant but also aligned with societal values and safety expectations. Current benchmarks primarily focus on standard performance metrics, overlooking critical aspects such as trustworthiness, interpretability, and real-world usability. Without rigorous evaluation methodologies, generative and agentic AI systems may inadvertently perpetuate biases, propagate misinformation, or act unpredictably in high-stakes scenarios.

This workshop aims to foster interdisciplinary collaboration by bringing together researchers, industry practitioners, and policymakers to develop advanced evaluation techniques for generative and agentic AI. Our discussions will focus on new methodologies for assessing reasoning capabilities, ethical robustness, and cross-modal generation, along with scalable and user-centric evaluation frameworks. By addressing these challenges, we seek to pave the way for more reliable and responsible AI systems that can be safely integrated into society.

Contact: kdd2025-ws-agentic-genai-eval@amazon.com

Call for Contributions

  • Link to the submission website: https://openreview.net/group?id=KDD.org/2025/Workshop/Agentic-GenAI-Eval
  • This workshop will focus on the unique challenges and opportunities presented by the intersection of evaluation and trustworthiness in the context of generative and agentic AI. Generative AI models, such as large language models (LLMs) and diffusion models, have shown remarkable abilities in generating human-quality text, images, and other forms of content. Agentic AI refers to AI systems that can act autonomously and purposefully, raising critical concerns about safety, control, and alignment with human values. Evaluating these advanced AI models requires going beyond traditional metrics and benchmarks. We need new methods and frameworks to assess their performance, identify potential biases, and ensure they are used responsibly. This workshop will delve into these issues, with a particular focus on the following topics:

    • Agentic AI Evaluation: Assessing autonomous AI behavior, decision-making, goal alignment, adaptability, security, privacy, tool use, memory, and self-verification in dynamic and open-ended environments.
    • Trustworthiness in Generative AI Models: Truthfulness and reliability, safety and security, bias and fairness, ethical and privacy considerations, misuse resistance, explainability, and robustness.
    • User-Centric Assessment: Evaluating AI from a user experience perspective, including trust calibration, mental models, and usability.
    • Multi-Perspective Evaluation: Emphasizing logical reasoning, knowledge depth, problem-solving abilities, contextual understanding, and user alignment.
    • Evaluating Reasoning Models: Measuring AI's ability to conduct step-by-step reasoning, causal inference, and complex problem-solving across various domains.
    • Efficient Evaluation Methods: Scalable, automated, and cost-effective approaches for assessing generative AI performance with minimal manual oversight.
    • Synthetic Data Generation Evaluation: Assessing the quality, representativeness, and bias implications of synthetic data used for AI training and evaluation.
    • Evaluating Misinformation and Manipulative Content: Techniques for detecting, measuring, and mitigating misinformation propagation by generative models.
    • Cross-Modal Evaluation: Assessing AI's capabilities across text, image, audio, and multimodal generation.
    • Holistic Evaluation Frameworks: Developing standardized datasets, metrics, and methodologies for comprehensive AI assessment.

    Keynote Speakers

    Haixun Wang, Hua Wei

    Haixun Wang

    Title: Trustworthy Document AI: From "Good Enough" to Mission-Critical

    Abstract: Large Language Models and multimodal generative systems can now read dense contracts, physician notes, and photographed invoices with impressive fluency. But in courtrooms, clinics, and underwriting desks, “pretty good” still courts unacceptable risk. This talk frames Document AI as an end-to-end product challenge: we will explore how reliable performance emerges not from bigger models alone, but from principled evaluation and structured design. We’ll show how QA-driven metrics, especially reinforcement learning with automated reward signals, enable scalable self-improvement and replace vague benchmarks with real-world accountability. Agentic systems, which decompose schema-driven tasks into orchestrated micro-agents, offer a more transparent and cost-effective path than monolithic LLMs. Across modalities and use cases, from legal extraction to voice agents, we argue for Document AI as infrastructure: modular, measurable, and mission-ready.

    Bio:

    Hua Wei

    Title: Rethinking Evaluation for Agentic AI: Lessons from Transportation and Education

    Abstract: As large language models increasingly serve as autonomous agents capable of reasoning, planning, and acting within complex environments, evaluating their real-world behavior has become a pressing challenge. In this talk, I will share empirical insights from three systems we developed in the domains of transportation and education. These deployments reveal practical limitations of existing evaluation methodologies and highlight the need for more context-aware, user-centered, and uncertainty-informed evaluation strategies. I will share the key lessons we learned from these case studies and propose directions for building trustworthy and socially integrated agentic AI systems.

    Bio: Hua Wei is an Assistant Professor in the School of Computing and Augmented Intelligence (SCAI) at Arizona State University (ASU). His research interests lie in data mining and machine learning, with a particular emphasis on spatio-temporal data mining and reinforcement learning. His work has received multiple Best Paper Awards, including honors from ECML-PKDD and ICCPS. His research has been supported by various funding agencies, including the National Science Foundation (NSF), the Department of Energy (DoE), and the Department of Transportation (DoT). Dr. Wei is a recipient of the Amazon Research Award and the NSF CAREER Award in 2025.


    SCHEDULE

    August 04, 2025 (1:00-5:00 PM), Toronto Convention Centre, Room [602]; posters at the 700 foyer

      Opening
      1:00 - 1:05 PM

    Introduction by organizers.

      Keynote Talk 1: Trustworthy Document AI: From "Good Enough" to Mission-Critical
      1:05 - 1:50 PM

    Haixun Wang

      Paper Session 1:
      1:50 - 2:35 PM

    1:50 PM - 2:00 PM, The Optimization Paradox in Clinical AI Multi-Agent Systems, Suhana Bedi (10 minutes)

    2:00 PM - 2:10 PM, Cybernaut: Towards Reliable Web Automation, Ankur Tomar (10 minutes)

    2:10 PM - 2:20 PM, Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection, Nikita Kozodoi (10 minutes)

    2:20 PM - 2:30 PM, Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios, Zhuang Qiang Bok (10 minutes)

      Posters & Networking - Location: 700 foyer (poster boards labeled "Evaluation and Trustworthiness of Agentic and Generative AI")
      2:35 - 3:00 PM

    Poster presentations and networking opportunity

      Coffee Break
      3:00 - 3:30 PM
      Keynote Talk 2: Rethinking Evaluation for Agentic AI: Lessons from Transportation and Education
      3:30 - 4:15 PM

    Hua Wei

      Paper Session 2:
      4:15 - 5:00 PM

    4:15 - 4:25 PM, Evaluating the LLM-simulated Impacts of Big Five Personality Traits and AI Capabilities on Social Negotiations, Myke C. Cohen (10 minutes)

    4:25 - 4:35 PM, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, Jessup Byun (10 minutes)

    4:35 - 4:45 PM, FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts, Hagyeong Shin (10 minutes)

    4:45 - 4:55 PM, Evaluating Large Language Models for Semi-Structured Data Manipulation Tasks: A Survey Platform Case Study, Vinicius Monteiro de Lira (10 minutes)

      Closing
      5:00 - 5:05 PM

    Closing remarks by organizers.


    Accepted Papers

    Below are the accepted papers for the workshop.

      Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems

    Authors: Hassen Dhrif

      JokeEval: Are the Jokes Funny? Review of Computational Evaluation Techniques to improve Joke Generation

    Authors: Sulbha Jain

      TITAN: Task-oriented Prompt Improvement with Script Generation

    Authors: Chung-Yu Wang, Alireza DaghighFarsoodeh, Hung Viet Pham

      Finding the Sweet Spot: Trading Quality, Cost, and Speed During Inference-Time LLM Reflection

    Authors: Jack Butler, Nikita Kozodoi, Zainab Afolabi, Brian Tyacke, Gaiar Baimuratov

      Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation

    Authors: Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi

      A Tri-Agent Framework for Evaluating and Aligning Question Clarification Capabilities of Large Language Models

    Authors: Yikai Zhao

      Towards High Supervised Learning Utility Training Data Generation: Data Pruning and Column Reordering

    Authors: Tung Sum Thomas Kwok, Zeyong Zhang, Chi-Hua Wang, Guang Cheng

      AutoGrader: Hybrid Evaluation of GenAI Product Images via Contrastive Embeddings and LLM-Generated Grading Notes

    Authors: Ramesh Chadalavada, Sumanta Kashyapi, Ramkumar Subramanian, Matt Witten

      Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios

    Authors: Zhuang Qiang Bok, Watson Wei Khong Chua

      The Optimization Paradox in Clinical AI Multi-Agent Systems

    Authors: Suhana Bedi, Iddah Mlauzi, Daniel Shin, Sanmi Koyejo, Nigam H. Shah

      How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions

    Authors: Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah

      Text-to-SQL Evaluation at Scale: A Semi-Automatic Approach for Benchmark Data Generation

    Authors: Melody Xuan, Zhenyu Zhang, Çağatay Demiralp, Xinyang Shen, Ozalp Ozer, Niti Kalra

      Cybernaut: Towards Reliable Web Automation

    Authors: Ankur Tomar, Hengyue Liang, Indranil Bhattacharya, Natalia Larios, Francesco Carbone

      FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts

    Authors: Hagyeong Shin, Binoy Robin Dalal, Iwona Bialynicka-Birula, Navjot Matharu, Ryan Muir, Xingwei Yang, Samuel W. K. Wong

      Seeing What's Wrong: A Trajectory-Guided Approach to Caption Error Detection

    Authors: Gabriel Isaac Afriat, Ryan Lucas, Xiang Meng, Yufang Hou, Yada Zhu, Rahul Mazumder

      Evaluating Large Language Models for Semi-Structured Data Manipulation Tasks: A Survey Platform Case Study

    Authors: Vinicius Monteiro de Lira, Antonio Maiorino, Peng Jiang, Rouzbeh Torabian Esfahani

      DialogueForge: LLM Simulation of Human-Chatbot Dialogue

    Authors: Ruizhe Zhu, Hao Zhu, Yaxuan Li, Syang Zhou, Shijing Cai, Małgorzata Łazuka, Elliott Ash

      Evaluating the Robustness of Dense Retrievers in Interdisciplinary Domains

    Authors: Sarthak Chaturvedi, Anurag Acharya, Rounak Meyur, Koby Hayashi, Sai Munikoti, Sameera Horawalavithana

      Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs

    Authors: Raj Krishnan Vijayaraj

      Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models

    Authors: Rabimba Karanjai, Yang Lu, Ranjith Chodavarapu, Lei Xu, Weidong Shi

      Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs

    Authors: Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boatang, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan

      What If The Patient Were Different? A Framework To Audit Biases and Toxicity in LLM Clinical Note Generation

    Authors: Daeja Oxendine, Swetasudha Panda, Naveen Jafer Nizar, Qinlan Shen, Sumana Srivatsa, Krishnaram Kenthapadi

      Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation

    Authors: Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng

      Evaluating the LLM-simulated Impacts of Big Five Personality Traits and AI Capabilities on Social Negotiations

    Authors: Myke C. Cohen, Hsien-Te Kao, Daniel Nguyen, Zhe Su, Maarten Sap

      Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

    Authors: Xinwei Wu, Hongyu Lin, Haojie Li, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang

    Submission Guidelines

    • Please ensure your paper submission is anonymous.
    • The accepted papers will be posted on the workshop website but will not be included in the KDD proceedings.
    • Paper submissions are limited to 9 pages, excluding references. Submissions must be in PDF format and use the ACM Conference Proceedings template.
    • Additional supplemental material focused on reproducibility can be provided. Proofs, pseudo-code, and code may also be included in the supplement, which has no explicit page limit. The supplement format could be either single column or double column. The paper should be self-contained, since reviewers are not required to read the supplement.
    • The Word template guideline can be found here: [link]
    • The Latex/overleaf template guideline can be found here: [link]
    • A paper should be submitted in PDF format through OpenReview at the following link: https://openreview.net/group?id=KDD.org/2025/Workshop/Agentic-GenAI-Eval