FIH Paper Reading Checklist for Clinicians

臨床醫師 FIH 論文判讀清單

English

A phase I oncology paper is a compressed record of hundreds of decisions made under uncertainty. The waterfall plot on the last slide of a tumor conference presentation shows you the destination — but nothing about the journey, the wrong turns, or the roads not taken. A structured checklist changes the default question from “what did they find?” to “should I trust how they found it?”

Study design. The first thing to establish is what kind of phase I trial or cohort this is (dose escalation, dose expansion, basket, or biomarker-enriched), because each answers a different question and carries a different risk of overinterpretation. The escalation design shapes how quickly patients move through dose levels and how much weight early observations carry: rule-based designs (3+3, accelerated titration), model-assisted designs such as BOIN (Bayesian optimal interval), and model-based designs such as the CRM (continual reassessment method) treat the same DLT counts differently. Look for sentinel dosing (did they observe the first patients before opening the full cohort?) and step-up dosing (for T-cell engagers and bispecifics, was there a priming dose?). Critically, check whether the DLT window is long enough to capture the expected toxicity profile: an ADC with interstitial lung disease risk needs a longer window than a small molecule whose main toxicity is early-onset hematological toxicity.
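To make concrete how the escalation design changes the weight given to the same early observations, here is a minimal sketch comparing the classic 3+3 rule with BOIN decision boundaries. The 30% target DLT rate, the default interval multipliers (0.6 and 1.4 times the target), and all function names are illustrative assumptions rather than values from any particular trial. A real BOIN trial also applies an overdose-elimination rule, which is omitted here.

```python
import math


def three_plus_three(n_treated, n_dlt):
    """Classic 3+3 rule at one dose level (defined only at 3 or 6 patients)."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"
        if n_dlt == 1:
            return "expand to 6"
        return "de-escalate"
    if n_treated == 6:
        return "escalate" if n_dlt <= 1 else "de-escalate"
    raise ValueError("3+3 decisions are defined only at 3 or 6 patients")


def boin_boundaries(target, phi1=None, phi2=None):
    """BOIN escalation and de-escalation boundaries for a target DLT rate.

    phi1 / phi2 are the DLT rates regarded as clearly too low / too high;
    the commonly used defaults are 0.6 * target and 1.4 * target.
    """
    phi1 = 0.6 * target if phi1 is None else phi1
    phi2 = 1.4 * target if phi2 is None else phi2
    lam_e = math.log((1 - phi1) / (1 - target)) / math.log(
        target * (1 - phi1) / (phi1 * (1 - target))
    )
    lam_d = math.log((1 - target) / (1 - phi2)) / math.log(
        phi2 * (1 - target) / (target * (1 - phi2))
    )
    return lam_e, lam_d


def boin_decision(n_treated, n_dlt, target=0.30):
    """Compare the observed DLT rate at the current dose with the boundaries."""
    lam_e, lam_d = boin_boundaries(target)
    rate = n_dlt / n_treated
    if rate <= lam_e:
        return "escalate"
    if rate >= lam_d:
        return "de-escalate"
    return "stay"


if __name__ == "__main__":
    # Hypothetical DLT counts, chosen only to show where the designs diverge.
    for n, dlt in [(3, 0), (3, 1), (6, 1), (6, 2)]:
        print(n, dlt, three_plus_three(n, dlt), "|", boin_decision(n, dlt))
```

With 2 DLTs in 6 patients, the 3+3 rule declares the dose too toxic, while this BOIN sketch at a 30% target keeps enrolling at the same dose. That is one reason the same DLT table can support different conclusions depending on the design.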

Patient selection. Advanced solid tumor all-comer designs and molecularly selected designs are asking different questions. Know which you are reading. Check ECOG performance status limits, organ function requirements, prior therapy restrictions, brain metastasis allowances, and autoimmune disease exclusions. Ask whether the excluded populations are precisely those most likely to experience severe toxicity — and whether this means the reported safety data will underestimate real-world risk. Biomarker-selected enrollment (genetic alteration, protein expression) systematically produces higher response rates in phase I; if this is not separated from the overall ORR, the paper’s activity signal is inflated.

Dose rationale. This is the intellectual foundation of the paper. The starting dose should cite its derivation: NOAEL (no observed adverse effect level), HNSTD (highest non-severely toxic dose), MABEL (minimal anticipated biological effect level, the usual basis for immune-activating agents), PAD (pharmacologically active dose), receptor occupancy targets, or integrated PK/PD modeling. The safety factor applied should be stated. The RP2D (recommended phase 2 dose) should be supported by more than just the absence of DLTs; look for exposure-response or biomarker evidence. A paper that says “the RP2D was selected based on the totality of clinical evidence” without specifying what that evidence was has not met this standard.
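The arithmetic behind a toxicology-based starting dose is simple enough to check by hand during a journal club. The sketch below assumes the body-surface-area conversion factors (Km) tabulated in the FDA 2005 starting-dose guidance, a default safety factor of 10 for a NOAEL-based dose, and the ICH S9 convention of one-sixth of the HNSTD in a non-rodent species. Function names and example doses are hypothetical, and MABEL or model-based derivations are not covered.

```python
# Km factors (kg/m^2) used to convert mg/kg doses across species on a
# body-surface-area basis; values as tabulated in the FDA 2005 guidance.
KM = {"mouse": 3, "rat": 6, "monkey": 12, "dog": 20, "human": 37}


def human_equivalent_dose(animal_dose_mg_per_kg, species):
    """Convert an animal mg/kg dose to a human-equivalent mg/kg dose."""
    return animal_dose_mg_per_kg * KM[species] / KM["human"]


def starting_dose_from_noael(noael_mg_per_kg, species, safety_factor=10):
    """NOAEL-based starting dose: HED divided by a safety factor."""
    return human_equivalent_dose(noael_mg_per_kg, species) / safety_factor


def starting_dose_from_hnstd(hnstd_mg_per_kg, species):
    """Oncology convention (assumed from ICH S9): 1/6 of the non-rodent HNSTD."""
    return human_equivalent_dose(hnstd_mg_per_kg, species) / 6


if __name__ == "__main__":
    # Hypothetical toxicology results, for illustration only.
    print(round(starting_dose_from_noael(50, "rat"), 3))  # mg/kg
    print(round(starting_dose_from_hnstd(6, "dog"), 3))   # mg/kg
```

For these made-up inputs, a rat NOAEL of 50 mg/kg gives a starting dose of about 0.81 mg/kg, and a dog HNSTD of 6 mg/kg gives about 0.54 mg/kg. If a paper's starting dose differs widely from this kind of back-of-envelope figure, the methods section should explain why.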

Safety. DLT, MTD, and RP2D should be clearly defined and clearly distinct. Beyond grade 3+ adverse events, look separately at dose reductions, dose interruptions, and treatment discontinuations — these are the tolerability signals that grade 3+ counts systematically miss. A drug with no grade 3+ events but a 40% dose reduction rate is telling you something important. Chronic and cumulative toxicities — neuropathy that develops over months, fatigue that accumulates across cycles, skin toxicity that never fully resolves — are particularly likely to be underreported in trials with a short DLT window and a limited follow-up duration.
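As a rough illustration of why dose modifications deserve their own tally, the following sketch summarizes a cohort on four separate tolerability measures. The record fields and the ten-patient cohort are invented for the example; real papers rarely publish patient-level data in this form, so the same counts usually have to be pulled from the safety tables or the supplement.

```python
from dataclasses import dataclass


@dataclass
class PatientCourse:
    """Hypothetical per-patient tolerability record."""
    max_ae_grade: int
    dose_reduced: bool = False
    dose_interrupted: bool = False
    discontinued_for_toxicity: bool = False


def tolerability_summary(patients):
    """Return the fraction of patients meeting each tolerability criterion."""
    n = len(patients)
    if n == 0:
        raise ValueError("no patients to summarize")

    def rate(predicate):
        return sum(1 for p in patients if predicate(p)) / n

    return {
        "grade3_plus": rate(lambda p: p.max_ae_grade >= 3),
        "dose_reduction": rate(lambda p: p.dose_reduced),
        "dose_interruption": rate(lambda p: p.dose_interrupted),
        "discontinuation": rate(lambda p: p.discontinued_for_toxicity),
    }


if __name__ == "__main__":
    # Invented cohort: nobody reaches grade 3, yet 4 of 10 need a lower dose.
    cohort = (
        [PatientCourse(max_ae_grade=2, dose_reduced=True) for _ in range(4)]
        + [PatientCourse(max_ae_grade=1) for _ in range(6)]
    )
    for name, value in tolerability_summary(cohort).items():
        print(f"{name}: {value:.0%}")
```

A grade 3+ table for this cohort would be empty, while the dose-reduction row shows that four in ten patients could not stay on the assigned dose.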

Activity signal. ORR, disease control rate, duration of response, and PFS from a phase I trial are hypothesis-generating, not confirmatory. Response by dose level is more informative than pooled overall response: responses confined to the highest dose levels point toward a dose-response relationship, while responses spread across the dose range leave the dose question open. Biomarker subgroups can be informative but are almost always underpowered, so read them with the same caution. Expansion cohorts tend to report a higher ORR than escalation cohorts (real-world Japanese single-center data show an expansion cohort ORR of 23.7% vs 11.0% in escalation cohorts), because expansion patients are treated at or near the selected dose and are often better matched molecularly.
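One way to see how little a per-dose-level ORR can carry is to put a confidence interval around it. The sketch below uses the Wilson score interval (one reasonable choice among several; many papers report exact Clopper-Pearson intervals instead) with a fixed z of 1.96 for roughly 95% coverage. The dose labels and response counts are hypothetical.

```python
import math


def wilson_interval(responders, n, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = responders / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def orr_by_dose_level(counts):
    """counts maps dose label -> (responders, evaluable patients)."""
    result = {}
    for dose, (responders, n) in counts.items():
        low, high = wilson_interval(responders, n)
        result[dose] = (responders / n, low, high)
    return result


if __name__ == "__main__":
    # Hypothetical escalation data: responses only at the two highest doses.
    counts = {"10 mg": (0, 3), "20 mg": (0, 3), "40 mg": (1, 6), "80 mg": (3, 6)}
    for dose, (orr, low, high) in orr_by_dose_level(counts).items():
        print(f"{dose}: ORR {orr:.0%} (95% CI {low:.0%} to {high:.0%})")
```

Even zero responses in three patients is compatible with a true response rate above 50% under this interval, which is why a dose-level comparison built on escalation cohorts alone should be read as a hint rather than a finding.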

Translational endpoints. PK (Cmax, AUC, t½, accumulation ratio) tells you whether drug exposures are in the range where biology was expected to happen. PD (target engagement, pathway inhibition, ctDNA kinetics) tells you whether biology is actually happening. Resistance profiling (on-target vs off-target) begins to tell you why some patients stop responding. A paper that shows impressive ORR without PK/PD data leaves the reader unable to assess whether the responders had adequate exposure or whether non-responders were simply underexposed.
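The PK parameters named above are linked by a few short formulas, sketched here for readers who want to sanity-check a PK table. The sketch assumes linear, one-compartment kinetics: AUC by the linear trapezoidal rule, terminal half-life from a log-linear fit to the last few concentrations, and a predicted accumulation ratio of 1 / (1 - exp(-k * tau)) for dosing interval tau. Function names and the sample concentrations are illustrative only.

```python
import math


def auc_trapezoid(times_h, conc):
    """AUC over the sampled interval by the linear trapezoidal rule."""
    pairs = list(zip(times_h, conc))
    return sum(
        (t1 - t0) * (c0 + c1) / 2
        for (t0, c0), (t1, c1) in zip(pairs, pairs[1:])
    )


def terminal_half_life(times_h, conc, n_points=3):
    """Half-life from a least-squares line through ln(conc) of the last points."""
    t = times_h[-n_points:]
    y = [math.log(c) for c in conc[-n_points:]]
    t_mean = sum(t) / len(t)
    y_mean = sum(y) / len(y)
    slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sum(
        (ti - t_mean) ** 2 for ti in t
    )
    return math.log(2) / -slope


def predicted_accumulation_ratio(half_life_h, tau_h):
    """Steady-state accumulation expected for repeat dosing every tau_h hours."""
    k = math.log(2) / half_life_h
    return 1 / (1 - math.exp(-k * tau_h))


if __name__ == "__main__":
    # Hypothetical single-dose profile (hours, ng/mL).
    times = [0, 1, 2, 4, 8, 12, 24]
    conc = [0.0, 80.0, 100.0, 82.0, 55.0, 37.0, 11.0]
    t_half = terminal_half_life(times, conc)
    print("AUC(0-24h):", round(auc_trapezoid(times, conc), 1))
    print("t1/2 (h):", round(t_half, 1))
    print("Predicted accumulation, once daily:",
          round(predicted_accumulation_ratio(t_half, 24), 2))
```

If the accumulation ratio reported in a paper is much larger than the one predicted from the single-dose half-life, that can hint at nonlinear or time-dependent kinetics, which matters when judging whether late toxicities track exposure.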

The 30-minute journal club format. Open (minutes 0–5) with the biology question: what risk drove this FIH design (payload toxicity, immune activation, target expression, or off-target effect)? Use minutes 5–10 to check starting dose rationale. Use minutes 10–18 to trace the PK/PD and biomarker evidence chain. Use minutes 18–24 to build a benefit-risk table comparing candidate RP2D options. Close (minutes 24–30) with the patient question: what does this regimen actually cost the patient in terms of hospitalizations, clinic visits, toxicities, dose reductions, and quality of life?

Ten questions every clinician should be able to ask a sponsor or PI. (1) Why was this starting dose chosen? (2) Why this escalation rate? (3) Is the DLT window long enough to capture the expected toxicity? (4) How was RP2D selected when MTD was not reached? (5) Is there exposure-safety or exposure-efficacy data? (6) Does the PD biomarker actually prove target engagement? (7) Is the expansion cohort molecularly selected? (8) Were late or cumulative toxicities reported? (9) Were patient-reported outcomes collected? (10) What can and cannot be concluded from this phase I data?

中文

一篇一期腫瘤臨床試驗的論文,是幾百個在不確定性下做出的決策的壓縮紀錄。腫瘤科會議最後一張投影片上的瀑布圖,告訴你的是終點——卻什麼都沒有說明旅程、錯誤轉彎,或未走的路。一份結構化清單,能把預設問題從「他們發現了什麼?」改成「我應該相信他們找到這個結果的方式嗎?」

研究設計。 首先要確定這是哪種一期試驗——劑量遞增、劑量擴增、籃子試驗或生物標記富集隊列——因為每一種回答的問題不同,過度解讀的風險也不同。升量設計(3+3、加速滴定、BOIN、CRM 或貝氏模型)決定病人在劑量層之間移動的速度,以及早期觀察結果所承受的比重。注意是否有 sentinel dosing(在開放完整隊列前,先觀察第一批病人?),以及是否有 step-up dosing(對 T-cell engager 和雙特異性抗體,是否有前導劑量?)。關鍵是:DLT 觀察窗口是否足夠長,能捕捉預期的毒性輪廓?有肺毒性風險的 ADC 需要比有即時血液毒性的小分子更長的觀察窗口。

病人選擇。 晚期實體腫瘤不限制族群的設計,和分子篩選設計,問的是不同的問題——需要知道你在讀哪一種。檢查 ECOG 體能狀態限制、器官功能要求、前治療限制、腦轉移是否允許,以及自體免疫疾病排除條件。追問:被排除的族群,是否恰好是最可能發生嚴重毒性的人?若是,論文報告的安全性資料是否低估了真實世界的風險?依生物標記篩選入組(基因變異、蛋白表現)會系統性地提高一期試驗的反應率;若這部分沒有從整體 ORR 中分離出來,論文的療效訊號就被高估了。

劑量理由。 這是智識基礎。起始劑量應引述其推導依據:NOAEL、HNSTD、MABEL(免疫活化藥物的標準)、PAD、受體佔有率目標,或整合性 PK/PD 建模。所使用的安全係數應被明確陳述。RP2D 應有比「沒有 DLT 發生」更多的支撐——尋找暴露-反應或生物標記的證據。一篇說「RP2D 是依據整體臨床證據選出」卻未說明那些證據是什麼的論文,還未達到標準。

安全性。 DLT、MTD 和 RP2D 應有清楚的定義,且彼此明確區分。除了第 3 級以上不良事件,要分別看劑量減少、劑量中斷和停藥——這些才是第 3 級以上計數系統性忽略的耐受性訊號。一個沒有第 3 級以上事件但劑量減少率 40% 的藥,正在告訴你重要的事。慢性與累積毒性——幾個月後才出現的神經病變、跨療程累積的疲倦、從未完全消退的皮膚毒性——在 DLT 觀察窗口短、追蹤時間有限的試驗中,特別容易被漏報。

療效訊號。 一期試驗的 ORR、疾病控制率、反應持續時間和 PFS,是假說生成資料,不是確證性資料。依劑量層分層的反應比混合整體反應更有資訊量——若所有反應都來自最高劑量層,這是一種不同的訊號,有別於反應分佈於整個劑量範圍。生物標記次族群可以具有參考性,但幾乎永遠統計力不足;把它們視為假說生成。擴增隊列會系統性地膨脹 ORR,相較於遞增隊列(真實世界日本單中心資料顯示:擴增隊列 ORR 23.7% vs 遞增隊列 11.0%),因為擴增隊列在接近最佳劑量的條件下,篩選分子匹配更好的病人。

轉譯 endpoint。 PK(Cmax、AUC、t½、蓄積比)告訴你藥物暴露是否在預期生物效應發生的範圍內。PD(靶點接合、通路抑制、ctDNA 動態)告訴你生物效應是否真的發生。抗藥性分析(靶點上 vs 靶點外)開始告訴你某些病人為何停止反應。一篇呈現令人印象深刻 ORR 卻沒有 PK/PD 資料的論文,讓讀者無法評估反應者是否有足夠暴露,或非反應者是否只是暴露不足。

30 分鐘 journal club 格式。 以生物學問題開場(驅動這個 FIH 的主要風險是什麼——payload 毒性、免疫活化、靶點表現,還是 off-target 效應?)。第 5–10 分鐘查核起始劑量理由。第 10–18 分鐘追蹤 PK/PD 與生物標記的證據鏈。第 18–24 分鐘建立比較候選 RP2D 的效益-風險表。最後以病人問題收尾:這個 regimen 在住院、回診、毒性、劑量減少和生活品質方面,對病人真正的代價是什麼?

每位臨床醫師都應能問 sponsor 或 PI 的十個問題: (1) 為什麼選這個起始劑量?(2) 為什麼這樣升量?(3) DLT 觀察窗口是否足夠捕捉預期毒性?(4) MTD 未達到時,RP2D 如何決定?(5) 是否有暴露-安全性或暴露-療效資料?(6) PD 生物標記是否真的證明靶點接合?(7) 擴增隊列是否有分子篩選?(8) 是否報告了晚發或累積毒性?(9) 是否收集了病人自述結果?(10) 這個一期資料能支持什麼,不能支持什麼?

Key Concepts | 核心概念

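| Reading dimension | Core question | Red flag |
|---|---|---|
| Study design | Which kind of cohort is this, and what question is it asking? | DLT window does not match the drug's mechanism |
| Patient selection | Do the excluded populations cause safety to be underestimated? | Biomarker-selected ORR not reported separately |
| Dose rationale | NOAEL, HNSTD, MABEL, PAD, or a model: which was used? | RP2D = MTD with no exposure-response (E-R) support |
| Safety | Are dose reduction and interruption rates reported? | Only grade 3+ reported; chronic low-grade toxicity ignored |
| Activity signal | Are responses broken down by dose level? | Expansion cohort ORR not separated from escalation |
| Translational endpoints | Do PK/PD data support that the drug is hitting its target? | ORR reported without target engagement data |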