sigmoid.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A social space for people researching, working with, or just interested in AI!

Server stats: 595 active users

#evaluation

1 post · 1 participant · 0 posts today

We are delighted about the successful launch of "Wissen, was wirkt"! 🎉 The Impact Unit's multiplier programme kicks off long-term advisory services at research institutions. The first round took place from April to June.

In an interview, Julia Panzer describes how the participants are putting what they have learned into practice and what comes next:
wissenschaft-im-dialog.de/blog

@bmftr_bund
#Wisskomm #Evaluation #Wirkung

Really wish there was more talk about nonprofit and GLAM evaluation around the fediverse. Pretty much all the related hashtags are either me, specifically about evaluating generative AI, or a few in French or German.

I was hoping to learn/follow more about the VSA virtual conference that's happening, too, but alas. Fragmentation and tech debt strike again, I guess, along with the tenacity of the status quo. Still agitating for change tho!

Successful #evaluation for ESS: From May 26 to 28, 2025, a group of international scientists visited Karlsruhe to evaluate, among other things, the Topic Engineering Secure Systems (ESS). The guests came from ETH Zurich, the University of Wisconsin-Madison, and the University of Leuven, among others.

ESS is one of three (sub)topics in the Program Engineering #Digital Futures (EDF) in the @helmholtz Research Field “Information.” We at SECUSO are involved in ESS as part of the Human and Societal Factors (HSF) research group. HSF presented the group's work in four demonstrators from the areas of #security #awareness, user #authentication, legal design patterns, and securing democracies.

Further information can be found in the special issue on Topic Engineering Secure Systems: kastel-labs.de/wp-content/uplo

🚀 Technical practitioners & grads — join to build an LLM evaluation hub!
Infra Goals:
🔧 Share evaluation outputs & params
📊 Query results across experiments

Perfect for 🧰 hands-on folks ready to build tools the whole community can use

Join the EvalEval Coalition here 👇
forms.gle/6fEmrqJkxidyKv9BA

Google Docs: [EvalEval Infra] Better Infrastructure for LM Evals

Welcome to the EvalEval Working Group Infrastructure! Please help us get set up by filling out this form - we are excited to get to know you! This is an interest form to contribute/collaborate on a research project building standardized infrastructure for AI evaluation.

Status quo: The AI evaluation ecosystem currently lacks standardized methods for storing, sharing, and comparing evaluation results across different models and benchmarks. This fragmentation leads to unnecessary duplication of compute-intensive evaluations, challenges in reproducing results, and barriers to comprehensive cross-model analysis.

What's the project? We plan to address these challenges by developing a comprehensive standardized format for capturing the complete evaluation lifecycle. This format will provide a clear and extensible structure for documenting evaluation inputs (hyperparameters, prompts, datasets), outputs, metrics, and metadata. This standardization enables efficient storage, retrieval, sharing, and comparison of evaluation results across the AI research community.

Building on this foundation, we will create a centralized repository with both raw data access and API interfaces that allow researchers to contribute evaluation runs and access cached results. The project will integrate with popular evaluation frameworks (LM-eval, HELM, Unitxt) and provide SDKs to simplify adoption. Additionally, we will populate the repository with evaluation results from leading AI models across diverse benchmarks, creating a valuable resource that reduces computational redundancy and facilitates deeper comparative analysis.

Tasks? As a collaborator, you would be expected to work towards merging/integrating popular evaluation frameworks (LM-eval, HELM, Unitxt).
Group 1 - Extend to Any Task: Design universal metadata schemas that work for ANY NLP task, extending beyond current frameworks like lm-eval/DOVE to support specialized domains (e.g., machine translation).
Group 2 - Save the Relevant: Develop efficient query/download systems for accessing only relevant data subsets from massive repositories (DOVE: 2TB, HELM: extensive metadata).

The result will be open infrastructure for the AI research community, plus an academic publication.

When? We're looking for researchers who can join ASAP and work with us for at least 5 to 7 months. We are hoping to find researchers who would take this on as an active project (8+ hours/week) in this period.
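To make the idea of a standardized evaluation record concrete, here is a minimal Python sketch of what such a record and a cross-experiment query could look like. All class, field, and benchmark names below (EvalInputs, EvalRecord, "toy-qa", etc.) are hypothetical illustrations under my own assumptions, not the working group's actual schema, which is still being designed.

```python
# Hypothetical sketch only: not the EvalEval schema, just an illustration of
# separating reproducibility inputs from metric outputs and metadata.
from dataclasses import dataclass, field, asdict
from typing import Any
import json


@dataclass
class EvalInputs:
    """Everything needed to reproduce a run: model, dataset, prompt, hyperparameters."""
    model_id: str
    benchmark: str
    prompt_template: str
    hyperparameters: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalRecord:
    """One complete evaluation run: inputs, metric outputs, and free-form metadata."""
    inputs: EvalInputs
    metrics: dict[str, float]
    metadata: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # A shared, explicit serialization is what makes runs cacheable and shareable.
        return json.dumps(asdict(self), sort_keys=True)


def query(records: list[EvalRecord], benchmark: str, metric: str) -> dict[str, float]:
    """Compare one metric across models for a given benchmark, using cached runs only."""
    return {
        r.inputs.model_id: r.metrics[metric]
        for r in records
        if r.inputs.benchmark == benchmark and metric in r.metrics
    }


# Example: two cached runs on the same (made-up) benchmark, compared without re-running them.
runs = [
    EvalRecord(
        inputs=EvalInputs("model-a", "toy-qa", "Q: {question}\nA:", {"temperature": 0.0}),
        metrics={"accuracy": 0.81},
        metadata={"framework": "lm-eval"},
    ),
    EvalRecord(
        inputs=EvalInputs("model-b", "toy-qa", "Q: {question}\nA:", {"temperature": 0.0}),
        metrics={"accuracy": 0.76},
        metadata={"framework": "helm"},
    ),
]

print(query(runs, benchmark="toy-qa", metric="accuracy"))  # {'model-a': 0.81, 'model-b': 0.76}
```

The point of the sketch is the split between reproducibility inputs and metric outputs: once both are captured in a shared format, cached runs can be queried and compared across experiments instead of being recomputed.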

What are sensible strategic measures for evaluating #wisskomm activities, and what insights can be gained from them?

We will discuss these and other questions about evaluation in our online talk on June 17 from 12:00. Joining us: Julia Wandt, @nawik and @wissenschaftimdialog - we're looking forward to it! 🤗

Members are welcome to register here (after logging in): idw-online.de/de/idwnews?detai

Top marks for teaching staff: the Faculty of Human Sciences at the University of Potsdam awards its teaching prize to Dr. phil. Lars Rothkegel and five other researchers. Further information on the TOP 10 best-evaluated courses at the Faculty of Human Sciences and on the faculty celebration: uni-potsdam.de/de/nachrichten/

I am so tired of seeing evaluators distracted and compromised by generative AI and LLMs.

It's taking up space even in the VSA justice and anti-racism space because people are asking how to use it ethically or in service of equity.

That's Not How Any Of This Works dot gif

I guess I could show up to the meeting, but given that a) the main source of discussion points is the garbage AEA journal on the topic and b) I've been ignored, it's probably not worth my time.