Things & Thinks-Issue LXVI

Santosh Shevade

Published Jun 30, 2025

📚Research Digest

Insights from Stanford’s Med-HELM Evaluation

What it is about

Med-HELM is a benchmarking study from Stanford CRFM that evaluates how well large language models (LLMs) perform on real-world medical tasks. It assesses 27 models, ranging from general-purpose LLMs like GPT-4 and Claude to specialized biomedical models like Meditron and PMC-LLaMA, across five key medical use cases: medical knowledge QA, patient-facing QA, clinical decision support, research summarization, and reasoning. The evaluation uses a holistic framework, considering not only accuracy but also fairness, calibration, and robustness. Overall, large models did well on complex reasoning tasks (e.g., performing medical calculations and detecting race bias in clinical text), while medium models performed competitively on medical prediction tasks with lower computational demands (e.g., predicting readmission risk).

What it means

Med-HELM seems to offer a moment of clarity in the rush to apply LLMs in healthcare: it reminds us that even the most advanced models remain fallible in high-stakes environments. The path forward may involve combining the broad reasoning abilities of foundation models with the precision of domain-specific ones. Reflecting on these findings also underscores the need for more than just benchmark wins; real-world use will require careful alignment, trust calibration, and regulatory rigor. As benchmarks like Med-HELM become standard, they could shape not just model development, but how we think about AI accountability in medicine.

How Anthropic Experiments with Agent-Oriented Research

What it is about

This post describes how Anthropic developed a multi-agent research system using multiple instances of Claude, their family of AI models, to collaborate on complex tasks in a semi-autonomous, coordinated way. The system includes a manager-agent architecture, where a manager Claude assigns subtasks to specialized worker Claudes, enabling efficient decomposition and execution of open-ended research problems like literature reviews or code analysis. The system emphasizes modularity, reusable tools (like web search and code execution), and agent-to-agent communication, all within a framework designed to stay interpretable and steerable.
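To make the manager-worker idea concrete, here is a minimal Python sketch of the pattern. It is an illustration under stated assumptions, not Anthropic's implementation: the worker functions are plain stubs standing in for model-backed agents, and every name in it (`Subtask`, `plan`, `run_manager`, the `WORKERS` table) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Subtask:
    role: str    # which specialist worker should handle this piece
    prompt: str  # what that worker is asked to do


def web_search_worker(prompt: str) -> str:
    # Stub standing in for a worker agent that has a web search tool.
    return f"[search] notes on: {prompt}"


def code_analysis_worker(prompt: str) -> str:
    # Stub standing in for a worker agent that has a code execution tool.
    return f"[code] findings on: {prompt}"


# Hypothetical registry mapping a role name to the worker that handles it.
WORKERS = {"search": web_search_worker, "code": code_analysis_worker}


def plan(question: str) -> list[Subtask]:
    # Assumption: a fixed two-step plan. A real manager would ask a model
    # to decompose the question into subtasks.
    return [
        Subtask("search", f"background literature for '{question}'"),
        Subtask("code", f"relevant code for '{question}'"),
    ]


def run_manager(question: str) -> str:
    subtasks = plan(question)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        # Workers run concurrently; pool.map returns results in subtask order.
        results = list(pool.map(lambda t: WORKERS[t.role](t.prompt), subtasks))
    # The manager combines the workers' outputs into one report.
    return "\n".join(results)


if __name__ == "__main__":
    print(run_manager("LLM benchmarks in medicine"))
```

The point of the sketch is the division of labor: the manager owns the plan and the final synthesis, while each worker sees only its own subtask and tool.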

What it means

Reading this research note was an interesting look into what collaborative AI could become: less of a single omniscient model and more of a distributed team of specialists, each contributing pieces of insight under a coordinated plan. It reflects a shift in thinking from monolithic AI to structured, multi-agent orchestration, where interpretability and division of labor matter. Looking ahead, this could influence how we build AI for scientific discovery, engineering design, and knowledge synthesis. Yet it also raises questions: How do we ensure accountability when multiple agents act semi-independently? Can complex agent systems be reliably monitored at scale? Anthropic's transparent release marks a cautious but important step toward making multi-agent AI both powerful and trustworthy.

🖇Digital Healthcare News

#GenAI and #BigTech in #Healthcare

The American Society of Clinical Oncology (ASCO) and Google Cloud collaborated to launch an AI-based ASCO Guidelines Assistant, developed using Google Cloud’s Vertex AI platform and advanced Gemini models.

The Coalition for Health AI (CHAI) has certified its first partnership for AI model validation, with BeeKeeperAI, Mount Sinai and Morehouse, for chronic heart failure (CHF). AI model developers who have built algorithms for CHF will be able to test the performance of their models on data sets curated by Mount Sinai and Morehouse.

EHR provider Epic rolled out a new initiative, called Launchpad, to help organizations, working alongside Epic experts, quickly operationalize gen-AI-assisted workflows.

Universal Health Services and Hippocratic AI released an AI agent to help clinicians make follow-up phone calls to patients post-discharge.

Cigna launched a number of new digital tools meant to improve customer experience with its health benefits portal, including a virtual GenAI assistant.

Regulatory Brief

USFDA launched Elsa, a generative Artificial Intelligence (AI) tool designed to help its employees, from scientific reviewers to investigators, work more efficiently.

USFDA approved the first AI tool for breast cancer risk prediction, from startup Clairity.

Pharma/Device Brief

Pangaea Data, a company focused on detecting hard-to-diagnose diseases in patients, is partnering with Alexion, a subsidiary of AstraZeneca focused on treating rare diseases, to co-develop, clinically validate and ultimately seek regulatory approval for an AI-enabled offering to detect hypophosphatasia in adults.

Regeneron backed away from buying the DNA-testing company 23andMe after a nonprofit controlled by co-founder Wojcicki made a higher bid. Twenty-seven states and the District of Columbia have filed a lawsuit seeking to block the sale of personal genetic data by 23andMe without customer consent.

Funding, Deals, Mergers & acquisitions

Abridge, a startup that uses AI to automate doctors’ note-taking, raised $300M.

Healthcare software company Commure raised $200M.

Tennr raised $101M to build out AI that automates the patient referral process.

Prepared, a startup offering AI-powered solutions for emergency response, raised $80M.

Nabla, which develops an AI copilot for doctors and other medical staff, raised $70M.

Ellipsis Health, developing artificial-intelligence-powered voice agents to support patients with complex physical, behavioral and social needs, raised $45M

Mandolin, a platform that uses AI automation to enhance access to specialty drugs, raised $40M

Certify, the provider data intelligence company, raised $40M.

Sword Health raised $40M and launched an AI-based mental health solution.

Arine, a start-up focused on AI-driven medication intelligence, raised $30M

Outcomes4Me, a developer of a direct-to-patient, AI-driven platform, raised $21M.

Hims & Hers Health will acquire European telehealth platform Zava in its push to expand globally.

Consumer Digital Health & Other News

Novo Nordisk ended its collaboration with Hims & Hers due to concerns about the telehealth company's sales and promotion of cheaper knock-offs of the weight loss drug Wegovy and will collaborate with WeightWatchers to sell Wegovy.

Artificial intelligence startup OpenEvidence inked a multi-year content agreement with the JAMA Network to use content from 13 medical journals to inform answers on its platform.

Amazon India launched Amazon Diagnostics, an at-home healthcare service that allows customers to book lab tests, schedule appointments and receive digital reports directly via the Amazon app.

📙Longread of the Month

This article by Paul Hlivko in Harvard Business Review is a good read about what enterprise adoption of AI/GenAI will look like.

🦜Tweet of the Month

This made me laugh 😊

📊Chart of the Month

This plot, from the Claude Research cited above, shows the most common ways people are using Anthropic's Research feature

Liked what you read? Subscribe & Share! I would love to hear your feedback and thoughts. You can also connect with me via Twitter and LinkedIn!

Gourab Ray

5mo

Definitely worth reading!! Thank you for sharing Santosh.

Paul Hlivko

5mo

Thanks for the mention Santosh. Your highlight of Med-HELM reinforces another point I think is relevant on this topic. Some forecasters of AI adoption are often using benchmarks ('intelligence tests') as a way to estimate disruption, when they should be more focused on what % of economically viable tasks can be profitably changed with AI. That % requires more than passing tests like a student.
