Things & Thinks: Issue LXVI
📚Research Digest
Insights from Stanford’s Med-HELM Evaluation
What it is about
Med-HELM is a benchmarking study from Stanford CRFM that evaluates how well large language models (LLMs) perform on real-world medical tasks. It assesses 27 models, ranging from general-purpose LLMs like GPT-4 and Claude to specialized biomedical models like Meditron and PMC-LLaMA, across five key medical use cases: medical knowledge QA, patient-facing QA, clinical decision support, research summarization, and reasoning. The evaluation uses a holistic framework, considering not only accuracy but also fairness, calibration, and robustness. Overall, large models did well on complex reasoning tasks (e.g., performing medical calculations and detecting race bias in clinical text), while medium-sized models performed competitively on medical prediction tasks with lower computational demands (e.g., predicting readmission risk).
What it means
Med-HELM seems to offer a moment of clarity in the rush to apply LLMs in healthcare: it reminds us that even the most advanced models remain fallible in high-stakes environments. The path forward may involve combining the broad reasoning abilities of foundation models with the precision of domain-specific ones. These findings also underscore the need for more than benchmark wins; real-world use will require careful alignment, trust calibration, and regulatory rigor. As benchmarks like Med-HELM become standard, they could shape not just model development, but also how we think about AI accountability in medicine.
How Anthropic Experiments with Agent-Oriented Research
What it is about
This post describes how Anthropic developed a multi-agent research system using multiple instances of Claude, their family of AI models, to collaborate on complex tasks in a semi-autonomous, coordinated way. The system includes a manager-agent architecture, where a manager Claude assigns subtasks to specialized worker Claudes, enabling efficient decomposition and execution of open-ended research problems like literature reviews or code analysis. The system emphasizes modularity, reusable tools (like web search and code execution), and agent-to-agent communication, all within a framework designed to stay interpretable and steerable.
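To make the manager-worker pattern concrete, here is a minimal sketch in Python. It is not Anthropic's implementation: the worker functions, the `plan` step, and the choice to run subtasks in parallel are all assumptions made for illustration, and the workers are plain stubs rather than real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    tool: str   # which worker should handle this piece
    query: str  # what the worker is asked to do


# Hypothetical workers: stand-ins for specialized model instances with tools.
def search_worker(query: str) -> str:
    return f"[search] notes on {query}"


def code_worker(query: str) -> str:
    return f"[code] analysis of {query}"


WORKERS: Dict[str, Callable[[str], str]] = {
    "search": search_worker,
    "code": code_worker,
}


def plan(task: str) -> List[Subtask]:
    """Manager step 1: split the task into subtasks (hard-coded here)."""
    return [
        Subtask("search", f"prior work on {task}"),
        Subtask("code", f"reference implementations of {task}"),
    ]


def run_manager(task: str) -> str:
    """Manager steps 2 and 3: dispatch subtasks in parallel, then merge."""
    subtasks = plan(task)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        # pool.map returns results in the same order as the subtasks.
        results = list(pool.map(lambda s: WORKERS[s.tool](s.query), subtasks))
    return "\n".join([f"Report on {task}:"] + results)


if __name__ == "__main__":
    print(run_manager("retrieval-augmented generation"))
```

In a real system the `plan` step and each worker would be model calls, and the final merge would be a synthesis step rather than a simple join. The sketch only shows the division of labor.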
What it means
Reading this research note was an interesting look into what collaborative AI could become: less of a single omniscient model and more of a distributed team of specialists, each contributing pieces of insight under a coordinated plan. It reflects a shift in thinking from monolithic AI to structured, multi-agent orchestration, where interpretability and division of labor matter. Looking ahead, this could influence how we build AI for scientific discovery, engineering design, and knowledge synthesis. Yet it also raises questions: How do we ensure accountability when multiple agents act semi-independently? Can complex agent systems be reliably monitored at scale? Anthropic's transparent release marks a cautious but important step toward making multi-agent AI both powerful and trustworthy.
🖇Digital Healthcare News
#GenAI and #BigTech in #Healthcare
The American Society of Clinical Oncology (ASCO) and Google Cloud collaborated to launch an AI-based ASCO Guidelines Assistant, developed using Google Cloud’s Vertex AI platform and advanced Gemini models.
The Coalition for Health AI (CHAI) has certified its first partnership for AI model validation, bringing together BeeKeeperAI, Mount Sinai and Morehouse to focus on chronic heart failure (CHF). AI model developers who have built algorithms for CHF will be able to test the performance of their models on data sets curated by Mount Sinai and Morehouse.
EHR provider Epic rolled out a new initiative, called Launchpad, to help organizations, working alongside Epic experts, quickly operationalize gen-AI-assisted workflows.
Universal Health Services and Hippocratic AI released an AI agent to help clinicians make follow-up phone calls to patients post-discharge.
Cigna launched a number of new digital tools meant to improve customer experience with its health benefits portal, including a virtual GenAI assistant.
Regulatory Brief
USFDA launched Elsa, a generative artificial intelligence (AI) tool designed to help its employees, from scientific reviewers to investigators, work more efficiently.
USFDA approved the first AI tool for breast cancer risk prediction, from startup Clairity.
Pharma/Device Brief
Pangaea Data, a company focused on detecting hard-to-diagnose diseases in patients, is partnering with Alexion, a subsidiary of AstraZeneca focused on treating rare diseases, to co-develop, clinically validate and ultimately seek regulatory approval for an AI-enabled offering to detect hypophosphatasia in adults.
Regeneron backed away from buying the DNA-testing company 23andMe after a nonprofit controlled by co-founder Wojcicki made a higher bid. Twenty-seven states and the District of Columbia have filed a lawsuit seeking to block the sale of personal genetic data by 23andMe without customer consent.
Funding, Deals, Mergers & acquisitions
Abridge, a startup that uses AI to automate doctors’ note-taking, raised $300M.
Healthcare software company Commure raised $200M.
Tennr raised $101M to build out AI that automates the patient referral process.
Prepared, a startup offering AI-powered solutions for emergency response, raised $80M.
Nabla, which develops an AI copilot for doctors and other medical staff, raised $70M.
Ellipsis Health, developing artificial-intelligence-powered voice agents to support patients with complex physical, behavioral and social needs, raised $45M
Mandolin, a platform that uses AI automation to enhance access to specialty drugs, raised $40M
Certify, the provider data intelligence company, raised $40M.
Sword Health raised $40M and launched an AI-based mental health solution.
Arine, a start-up focused on AI-driven medication intelligence, raised $30M
Outcomes4Me, a developer of a direct-to-patient, AI-driven platform, raised $21M.
Hims & Hers Health will acquire European telehealth platform Zava in its push to expand globally.
Consumer Digital Health & Other News
Novo Nordisk ended its collaboration with Hims & Hers due to concerns about the telehealth company's sales and promotion of cheaper knock-offs of the weight loss drug Wegovy and will collaborate with WeightWatchers to sell Wegovy.
Artificial intelligence startup OpenEvidence inked a multi-year content agreement with the JAMA Network to use content from 13 medical journals to inform answers on its platform.
Amazon India launched Amazon Diagnostics, an at-home healthcare service that allows customers to book lab tests, schedule appointments and receive digital reports directly via the Amazon app.
📙Longread of the Month
This article by Paul Hlivko in Harvard Business Review is a good read about what enterprise adoption of AI/GenAI will look like.
🦜Tweet of the Month
This made me laugh 😊
📊Chart of the Month
This plot, from the Anthropic research post cited above, shows the most common ways people are using Anthropic's Research feature.
Definitely worth reading!! Thank you for sharing Santosh.
Thanks for the mention Santosh. Your highlight of Med-HELM reinforces another point I think is relevant on this topic. Some forecasters of AI adoption are often using benchmarks ('intelligence tests') as a way to estimate disruption, when they should be more focused on what % of economically viable tasks can be profitably changed with AI. That % requires more than passing tests like a student.