Things & Thinks-Issue LXVI

📚Research Digest

Insights from Stanford’s Med-HELM Evaluation

What it is about

Med-HELM is a benchmarking study from Stanford CRFM that evaluates how well large language models (LLMs) perform on real-world medical tasks. It assesses 27 models, ranging from general-purpose LLMs like GPT-4 and Claude to specialized biomedical models like Meditron and PMC-LLaMA, across five key medical use cases: medical knowledge QA, patient-facing QA, clinical decision support, research summarization, and reasoning. The evaluation uses a holistic framework, considering not only accuracy but also fairness, calibration, and robustness. Overall, large models did well on complex reasoning tasks (e.g., performing medical calculations and detecting race bias in clinical text), while medium models performed competitively on medical prediction tasks with lower computational demands (e.g., predicting readmission risk).

What it means

Med-HELM seems to offer a moment of clarity in the rush to apply LLMs in healthcare: it reminds us that even the most advanced models remain fallible in high-stakes environments. The path forward may involve combining the broad reasoning abilities of foundation models with the precision of domain-specific ones. These findings also underscore the need for more than just benchmark wins; real-world use will require careful alignment, trust calibration, and regulatory rigor. As benchmarks like Med-HELM become standard, they could shape not just model development, but how we think about AI accountability in medicine.


How Anthropic Experiments with Agent-Oriented Research

What it is about

This post describes how Anthropic developed a multi-agent research system using multiple instances of Claude, their family of AI models, to collaborate on complex tasks in a semi-autonomous, coordinated way. The system includes a manager-agent architecture, where a manager Claude assigns subtasks to specialized worker Claudes, enabling efficient decomposition and execution of open-ended research problems like literature reviews or code analysis. The system emphasizes modularity, reusable tools (like web search and code execution), and agent-to-agent communication, all within a framework designed to stay interpretable and steerable.
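To make the manager-worker idea concrete, here is a minimal sketch in Python. It is not Anthropic's implementation: the workers are plain functions standing in for specialized Claude instances, the decomposition is hard-coded rather than produced by a model, and every name in it (`Subtask`, `plan`, `run_manager`, the two workers) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    kind: str    # which worker should handle this piece
    prompt: str  # what the worker is asked to do


# Hypothetical workers: simple functions standing in for specialized model instances.
def search_worker(prompt: str) -> str:
    return f"[search] notes on: {prompt}"


def code_worker(prompt: str) -> str:
    return f"[code] analysis of: {prompt}"


WORKERS: Dict[str, Callable[[str], str]] = {
    "search": search_worker,
    "code": code_worker,
}


def plan(task: str) -> List[Subtask]:
    """Manager step 1: break the task into subtasks (hard-coded here for illustration)."""
    return [
        Subtask("search", f"prior work on {task}"),
        Subtask("code", f"reference implementations of {task}"),
    ]


def run_manager(task: str) -> str:
    """Manager step 2: route each subtask to its worker and combine the results."""
    results = [WORKERS[sub.kind](sub.prompt) for sub in plan(task)]
    return "\n".join(results)


if __name__ == "__main__":
    print(run_manager("sepsis prediction"))
```

In a real system the planning and the workers would each be model calls, and the manager would likely review and iterate on worker output instead of simply concatenating it.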

What it means

Reading this research note was an interesting look into what collaborative AI could become: less of a single omniscient model and more of a distributed team of specialists, each contributing pieces of insight under a coordinated plan. It reflects a shift in thinking from monolithic AI to structured, multi-agent orchestration, where interpretability and division of labor matter. Looking ahead, this could influence how we build AI for scientific discovery, engineering design, and knowledge synthesis. Yet it also raises questions: How do we ensure accountability when multiple agents act semi-independently? Can complex agent systems be reliably monitored at scale? Anthropic's transparent release marks a cautious but important step toward making multi-agent AI both powerful and trustworthy.


🖇Digital Healthcare News

#GenAI and #BigTech in #Healthcare

The American Society of Clinical Oncology (ASCO) and Google Cloud collaborated to launch an AI-based ASCO Guidelines Assistant, developed using Google Cloud’s Vertex AI platform and advanced Gemini models.

The Coalition for Health AI (CHAI) has certified its first partnership for AI model validation, with BeeKeeperAI, Mount Sinai and Morehouse, focused on chronic heart failure (CHF). AI model developers who have built algorithms for CHF will be able to test the performance of their models on data sets curated by Mount Sinai and Morehouse.

EHR provider Epic rolled out a new initiative, called Launchpad, to help organizations, working alongside Epic experts, quickly operationalize gen-AI-assisted workflows.

Universal Health Services and Hippocratic AI released an AI agent to help clinicians make follow-up phone calls to patients post-discharge.

Cigna launched a number of new digital tools meant to improve customer experience with its health benefits portal, including a virtual GenAI assistant.

Regulatory Brief

USFDA launched Elsa, a generative Artificial Intelligence (AI) tool designed to help its employees, from scientific reviewers to investigators, work more efficiently.

USFDA approved the first AI tool for breast cancer risk prediction, from startup Clairity.

Pharma/Device Brief

Pangaea Data, a company focused on detecting hard-to-diagnose diseases in patients, is partnering with Alexion, a subsidiary of AstraZeneca focused on treating rare diseases, to co-develop, clinically validate and ultimately seek regulatory approval for an AI-enabled offering to detect hypophosphatasia in adults.

Regeneron backed away from buying DNA-testing company 23andMe after a nonprofit controlled by co-founder Wojcicki made a higher bid. Twenty-seven states and the District of Columbia have filed a lawsuit seeking to block the sale of personal genetic data by 23andMe without customer consent.

Funding, Deals, Mergers & acquisitions

Abridge, a startup that uses AI to automate doctors’ note-taking, raised $300M.

Healthcare software company Commure raised $200M.

Tennr raised $101M to build out AI that automates the patient referral process.

Prepared, a startup offering AI-powered solutions for emergency response, raised $80M.

Nabla, which develops an AI copilot for doctors and other medical staff, raised $70M.

Ellipsis Health, developing artificial-intelligence-powered voice agents to support patients with complex physical, behavioral and social needs, raised $45M

Mandolin, a platform that uses AI automation to enhance access to specialty drugs, raised $40M

Certify, the provider data intelligence company, raised $40M.

Sword Health raised $40M and launched an AI-based mental health solution.

Arine, a start-up focused on AI-driven medication intelligence, raised $30M

Outcomes4Me, a developer of a direct-to-patient, AI-driven platform, raised $21M.

Hims & Hers Health will acquire European telehealth platform Zava in its push to expand globally.

Consumer Digital Health & Other News

Novo Nordisk ended its collaboration with Hims & Hers due to concerns about the telehealth company's sales and promotion of cheaper knock-offs of the weight loss drug Wegovy and will collaborate with WeightWatchers to sell Wegovy.

Artificial intelligence startup OpenEvidence inked a multi-year content agreement with the JAMA Network to use content from 13 medical journals to inform answers on its platform.

Amazon India launched Amazon Diagnostics, an at-home healthcare service that allows customers to book lab tests, schedule appointments and receive digital reports directly via the Amazon app.


📙Longread of the Month

This article, by Paul Hlivko in Harvard Business Review, is a good read about what enterprise adoption of AI/GenAI will look like.

🦜Tweet of the Month

This made me laugh 😊

📊Chart of the Month

This plot, from the Claude Research post cited above, shows the most common ways people are using Anthropic's Research feature.

Liked what you read? Subscribe & Share! I would love to hear your feedback and thoughts. You can also connect with me via Twitter and LinkedIn!

Definitely worth reading!! Thank you for sharing Santosh.

Thanks for the mention Santosh. Your highlight of Med-HELM reinforces another point I think is relevant on this topic. Some forecasters of AI adoption are often using benchmarks ('intelligence tests') as a way to estimate disruption, when they should be more focused on what % of economically viable tasks can be profitably changed with AI. That % requires more than passing tests like a student.
