Closing a Critical Screening Gap

Hepatocellular carcinoma (HCC), the most common form of liver cancer, is frequently diagnosed at advanced stages when treatment options are limited and survival rates are poor. Current clinical guidelines focus screening efforts on patients with known cirrhosis or chronic liver disease — but a new study published in Cancer Discovery reveals a critical flaw in this approach: 69 percent of HCC cases in a large population study occurred in patients who had never received a prior diagnosis of liver disease.

That single finding — that the majority of liver cancer patients had no flagged risk status before their diagnosis — suggests that current screening protocols miss most of the population at risk. A machine learning model developed by researchers at RWTH Aachen University, led by Dr. Carolin Schneider, offers a potential path to changing that. Using only data that already exists in routine clinical records, the model achieved an area under the receiver operating characteristic curve (AUROC) of 0.88 — substantially outperforming every existing clinical scoring tool used for HCC risk assessment.

How the Model Works

The researchers trained a random forest model — an ensemble approach that builds hundreds of decision trees and aggregates their predictions — on electronic health record data and routine blood test results from over 500,000 participants in the UK Biobank. The training dataset included 538 confirmed HCC cases, allowing the model to learn which combinations of clinical features predict cancer development over time.
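The study's actual pipeline and UK Biobank data are not public here, but the general approach can be sketched with scikit-learn. The snippet below is an illustrative stand-in: synthetic tabular data mimics a cohort with a rare outcome (roughly 1 percent positives, echoing 538 cases among hundreds of thousands of participants), a random forest is fit with class weighting to handle the imbalance, and discrimination is scored with AUROC. All sizes and parameters are assumptions for illustration, not the authors' settings.

```python
# Illustrative sketch only: synthetic features stand in for demographics
# and blood-test panels; this is not the study's actual pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulate a rare-outcome cohort: ~1% positives, 30 numeric features.
X, y = make_classification(
    n_samples=20_000, n_features=30, n_informative=10,
    weights=[0.99], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# class_weight="balanced" counteracts the heavy class imbalance,
# so the forest does not simply predict "no cancer" for everyone.
clf = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=0
)
clf.fit(X_tr, y_tr)

# Rank patients by predicted probability; AUROC measures how well
# that ranking separates future cases from non-cases.
risk = clf.predict_proba(X_te)[:, 1]
print(f"AUROC: {roc_auc_score(y_te, risk):.3f}")
```

The key point the sketch captures is that the model outputs a continuous risk score per patient, which is what makes threshold-based screening decisions possible downstream.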

The inputs are deliberately practical. The model uses patient demographics, standard blood chemistry panels (liver enzymes, complete blood count, metabolic markers), and structured EHR data — the kind of information that primary care physicians already collect at routine check-ups. No specialized imaging, no genetic sequencing, no biomarker panels requiring dedicated laboratory infrastructure.

A simplified version of the model, using just 15 clinical features, still outperformed every existing risk scoring tool. This is significant for real-world deployment: a 15-feature model is fast, transparent, and easy to integrate into existing clinical decision support systems without requiring workflow changes.
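One common way to build such a reduced model, sketched below under the same synthetic-data assumption as above, is to rank features by the full forest's importance scores, keep the top 15, and retrain. The study's actual 15 clinical features are not reproduced here; the code only illustrates the pruning idea.

```python
# Sketch of feature pruning: keep the 15 features the full model
# leaned on most, then retrain. Synthetic data; not the study's features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000, n_features=40, n_informative=12,
    weights=[0.99], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=1
)

full = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

# Select the 15 most important features from the trained full model.
top15 = np.argsort(full.feature_importances_)[::-1][:15]
small = RandomForestClassifier(n_estimators=300, random_state=1)
small.fit(X_tr[:, top15], y_tr)

for name, model, cols in [("full", full, slice(None)), ("top-15", small, top15)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1])
    print(f"{name}: AUROC {auc:.3f}")
```

A small, fixed feature set also makes the model easier to audit: clinicians can see exactly which lab values and demographics drive a patient's score.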

Surprising Finding: Most Patients Had No Prior Diagnosis

The 69 percent figure — HCC cases with no prior liver disease diagnosis — is the most provocative result in the study. It directly challenges the rationale for limiting HCC surveillance to high-risk groups identified by existing disease categories. If the majority of liver cancers develop in patients who would not currently qualify for enhanced screening, then even a perfect screening protocol applied only to guideline-defined high-risk patients would miss more than two-thirds of cases.

The machine learning model's ability to identify elevated HCC risk in this broader population — using only routine clinical data — suggests it could serve as a first-pass triage tool in primary care settings. Patients flagged as high risk could then be referred for imaging or blood-based cancer screening tests, enabling earlier detection at stages when curative treatment is more feasible.
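The triage step described above amounts to choosing a cutoff on the risk score and referring everyone above it. A minimal sketch, assuming a fixed referral budget of the top 2 percent of scores (the budget is an illustrative assumption, not a figure from the study):

```python
# Sketch: turn continuous risk scores into a referral list by flagging
# the top 2% highest-risk patients for follow-up imaging or blood tests.
import numpy as np

rng = np.random.default_rng(0)
risk_scores = rng.random(10_000)   # stand-in for predict_proba output
patient_ids = np.arange(10_000)

budget = 0.02                      # assumed fraction referred for work-up
cutoff = np.quantile(risk_scores, 1 - budget)
flagged = patient_ids[risk_scores >= cutoff]

print(f"cutoff={cutoff:.3f}, flagged {flagged.size} of {patient_ids.size}")
```

In practice the cutoff would be tuned against health-system capacity and the cost of missed cases, not fixed at an arbitrary percentile.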

Validation Across Diverse Populations

A model trained primarily on UK Biobank data — which skews toward white, older British participants — might not generalize to other populations. The researchers addressed this concern through validation on the All of Us registry, a US National Institutes of Health dataset with over 400,000 participants drawn from diverse ethnic and socioeconomic backgrounds.

The model's performance held up across demographic groups in the All of Us validation cohort, suggesting that the clinical features driving HCC risk prediction are consistent enough across populations to support broad deployment. This is an important result for a tool intended for use across the diverse patient populations of health systems in the US, Europe, and beyond.

The researchers also tested whether adding genomic data or metabolomic biomarker panels improved prediction. Notably, these expensive additional data types provided minimal performance lift over the baseline clinical model. The implication is that the most useful HCC risk signal is already embedded in the routine data that health systems collect, and extracting it requires better analytics rather than more data collection.

Path to Clinical Deployment

The study is retrospective: it analyzed historical records rather than following patients forward in time. Prospective validation — following a population forward and measuring whether model-flagged patients actually develop HCC at higher rates — is the next required step before clinical adoption.

The researchers note several additional limitations: the UK Biobank population underrepresents patients with hepatitis B and C virus infections, which are major HCC risk factors globally. Future model iterations should incorporate viral hepatitis data and validate performance in high-prevalence hepatitis regions.

Despite these caveats, the study's core contribution is substantial. A tool that a primary care physician can run on existing patient data, with no additional tests required, and that identifies patients at elevated liver cancer risk with 0.88 AUROC performance, represents a meaningful advance over the clinical status quo. If validated prospectively and integrated into EHR workflows, it could become one of the most impactful AI screening tools to reach clinical practice.

This article is based on reporting by Medical Xpress.