Dose Escalation Design Comparison: 3+3, BOIN, CRM, and Beyond

劑量升階設計比較:3+3、BOIN、CRM 與現代設計

English

Every first-in-human (FIH) oncology trial must answer a deceptively simple question: given how many patients at the current dose have experienced a dose-limiting toxicity (DLT), what dose should the next cohort receive? The way a trial answers this question is its dose escalation design. Over the past two decades, oncology has moved from simple counting rules to increasingly sophisticated statistical frameworks, but the underlying clinical problem has never changed: protect patients while efficiently learning the shape of the dose-response curve.

The traditional 3+3 design treats dose escalation as a simple flowchart. Three patients receive a dose. If none has a DLT, escalate; if one has a DLT, treat three more at the same dose and escalate only if no more than one of the six has a DLT; if two or more have a DLT, the maximum tolerated dose (MTD) has been exceeded. The appeal is obvious: any clinician can follow the rules, no statistician needs to be in the room, and the logic is transparent enough to explain to an ethics committee in five minutes. The limitation is equally obvious: each decision uses only the three to six patients treated at the current dose level and ignores everything learned at previous dose levels. A 2024 review of FDA-approved FIH trials in solid tumors and hematologic malignancies confirmed that 3+3 still accounts for the majority of escalation designs used in practice, which means the first thing to teach is not a better method but a clearer understanding of what the old one gets wrong.
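The flowchart described above can be written as a small lookup function. This is a minimal sketch of one common 3+3 variant, assuming that an expanded cohort escalates only when at most one of six patients has a DLT; the function name and the action labels are hypothetical, and published protocols differ in how they declare the MTD.

```python
def three_plus_three(n_treated: int, n_dlt: int) -> str:
    """Return the 3+3 action for the current dose level.

    n_treated: patients evaluated at this dose (3, or 6 after expansion)
    n_dlt:     how many of those patients had a DLT
    """
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"
        if n_dlt == 1:
            return "expand"  # treat three more patients at the same dose
        return "mtd_exceeded"
    if n_treated == 6:
        # Assumed rule: after expansion, escalate only if at most 1 of 6 had a DLT.
        return "escalate" if n_dlt <= 1 else "mtd_exceeded"
    raise ValueError("3+3 decisions are made after 3 or 6 patients")
```

Note that the function never receives data from any other dose level, which is exactly the limitation described above.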

The Bayesian Optimal Interval design (BOIN) was developed to preserve the operational simplicity of 3+3 while grounding decisions in a more principled statistical framework. Before a trial begins, the team specifies a target DLT rate, commonly around 25% to 30%, and derives two boundaries for the observed DLT rate at the current dose: a lower boundary at or below which the dose is likely too conservative, and an upper boundary at or above which it is likely too toxic. During the trial, BOIN simply looks up the observed DLT count in a pre-computed decision table and returns one of three actions: escalate, stay, or de-escalate. To a clinician watching a cohort review, this looks almost identical to 3+3. The difference is invisible at the bedside but shows up in the operating characteristics: simulation studies consistently show that BOIN selects the correct MTD more often than 3+3 and exposes fewer patients to dangerously high or therapeutically insufficient doses.
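To make the two boundaries concrete, the sketch below computes them with the standard BOIN formulas, assuming the commonly used defaults in which the "too conservative" and "too toxic" reference rates are 0.6 and 1.4 times the target. The function names are hypothetical, and the dose-elimination safety rule that full BOIN implementations add is left out.

```python
import math

def boin_boundaries(target, low_ratio=0.6, high_ratio=1.4):
    """Return (escalation boundary, de-escalation boundary) for a target DLT rate."""
    phi1 = low_ratio * target    # highest rate regarded as clearly sub-therapeutic
    phi2 = high_ratio * target   # lowest rate regarded as clearly too toxic
    lambda_e = math.log((1 - phi1) / (1 - target)) / math.log(
        target * (1 - phi1) / (phi1 * (1 - target)))
    lambda_d = math.log((1 - target) / (1 - phi2)) / math.log(
        phi2 * (1 - target) / (target * (1 - phi2)))
    return lambda_e, lambda_d

def boin_decision(n_dlt, n_treated, target=0.30):
    """Compare the observed DLT rate at the current dose with the two boundaries."""
    lambda_e, lambda_d = boin_boundaries(target)
    rate = n_dlt / n_treated
    if rate <= lambda_e:
        return "escalate"
    if rate >= lambda_d:
        return "de-escalate"
    return "stay"
```

For a 30% target the boundaries come out at roughly 0.24 and 0.36, so 1 DLT in 3 patients means "stay" while 1 DLT in 6 means "escalate". Because the boundaries depend only on the target, the whole decision table can be printed before the first patient is enrolled.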

The Continual Reassessment Method (CRM) goes further. Rather than applying fixed rules, CRM fits a parametric dose-toxicity curve to all data accumulated so far, then recommends the dose whose predicted DLT probability is closest to the target. The recommendation for the next cohort therefore incorporates information from every earlier cohort, not just the most recent one. The gain in statistical efficiency is real. The clinical cost is that CRM requires a biostatistician to update the model between cohorts, and clinical teams must trust a recommendation that looks like a black box. A 2026 comparison of model-assisted phase I/II designs showed that no single method dominates across all scenarios: the best design depends on which kind of error matters most to the trial team, whether that is selecting an overly toxic dose, missing a lower effective dose, or simply taking too long to reach a decision.
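The "black box" is smaller than it sounds. The sketch below shows one widely used form of CRM, the one-parameter power model, in which the DLT probability at dose i is skeleton[i] raised to exp(a) and the single parameter a has a normal prior. The skeleton values, the prior standard deviation of 1.34, the grid integration, and the function name are all assumptions made for illustration; a real trial would also add safety restrictions such as no dose skipping.

```python
import math

def crm_recommend(skeleton, doses_given, dlt_flags, target=0.25,
                  prior_sd=1.34, grid_points=2001, grid_limit=5.0):
    """Power-model CRM with the posterior approximated on a uniform grid.

    skeleton:    prior guesses of the DLT probability at each dose
    doses_given: dose index received by each patient so far
    dlt_flags:   1 if that patient had a DLT, else 0
    Returns (recommended dose index, posterior mean DLT probabilities).
    """
    step = 2 * grid_limit / (grid_points - 1)
    grid, weights = [], []
    for k in range(grid_points):
        a = -grid_limit + k * step
        log_post = -0.5 * (a / prior_sd) ** 2          # normal prior on a
        for dose, dlt in zip(doses_given, dlt_flags):  # binomial likelihood
            p = skeleton[dose] ** math.exp(a)
            log_post += math.log(p if dlt else 1.0 - p)
        grid.append(a)
        weights.append(math.exp(log_post))
    total = sum(weights)
    post_mean = [
        sum(w * skeleton[i] ** math.exp(a) for a, w in zip(grid, weights)) / total
        for i in range(len(skeleton))
    ]
    # Recommend the dose whose estimated DLT probability is closest to the target.
    best = min(range(len(skeleton)), key=lambda i: abs(post_mean[i] - target))
    return best, post_mean

skeleton = [0.05, 0.12, 0.25, 0.40, 0.55]
# Three patients at the third dose, none with a DLT:
next_dose, estimates = crm_recommend(skeleton, [2, 2, 2], [0, 0, 0])
```

Every patient in `doses_given` contributes to the estimate at every dose, which is what "uses all accumulated data" means in practice.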

More recent methods, including the Calibration-Free Odds (CFO) framework and its extensions, offer a middle ground. CFO does not require specifying an explicit dose-toxicity curve but still uses information from adjacent dose levels and from the cumulative data. Its most valuable contribution to clinical education may be practical: the CFO suite of tools, published with a companion R package and Shiny web application, can generate pre-trial simulations showing how each design would perform under different true toxicity scenarios. This gives clinical teams something they rarely have: a concrete, visual answer to the question “what happens if our assumptions are wrong?” The two-dimensional CFO (2dCFO) extends the same logic to drug combination trials, where dose decisions must navigate a grid rather than a ladder.
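The idea behind a pre-trial simulation can be shown with a toy example. The sketch below is not the CFO package or its Shiny application; it simulates a simplified 3+3 trial many times under an assumed "true" toxicity scenario and reports how often each dose ends up being declared the MTD. The function names, the scenario, and the simplification that the MTD is simply the dose below the first one found too toxic are all illustrative assumptions.

```python
import random

def simulate_3plus3(true_dlt_probs, rng):
    """Run one simplified 3+3 trial.

    Returns the index of the declared MTD, or -1 if even the lowest
    dose was found too toxic.
    """
    level = 0
    while True:
        p = true_dlt_probs[level]
        dlts = sum(rng.random() < p for _ in range(3))
        if dlts == 1:  # expand the cohort to six patients
            dlts += sum(rng.random() < p for _ in range(3))
        if dlts >= 2:
            return level - 1          # MTD is the next lower dose
        if level == len(true_dlt_probs) - 1:
            return level              # highest dose was tolerated
        level += 1

def selection_frequencies(true_dlt_probs, n_trials=5000, seed=1):
    """Estimate how often each dose is selected under one toxicity scenario."""
    rng = random.Random(seed)
    counts = {dose: 0 for dose in range(-1, len(true_dlt_probs))}
    for _ in range(n_trials):
        counts[simulate_3plus3(true_dlt_probs, rng)] += 1
    return {dose: n / n_trials for dose, n in counts.items()}

# Hypothetical scenario in which the third dose sits at a 25% DLT rate.
frequencies = selection_frequencies([0.05, 0.12, 0.25, 0.40, 0.55])
```

Rerunning `selection_frequencies` with a steeper or flatter scenario is the toy version of asking what happens if the assumptions are wrong; swapping in another design's decision rule allows a side-by-side comparison.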

For modern oncology drugs, particularly immune activators, T-cell engagers, antibody-drug conjugates (ADCs), and radiopharmaceuticals, an additional complication disrupts any simple escalation framework: toxicity may arrive late. If a DLT can develop in cycle 2 or 3 rather than cycle 1, the entire premise of “wait for the current cohort to finish its DLT window before escalating” becomes fragile when accrual is fast. The time-to-event extensions of both BOIN (TITE-BOIN) and CFO (TITE-CFO, fractional CFO) address this by weighting pending patients by their follow-up time rather than excluding them. This is not a minor technical footnote: it is the difference between a trial that rationally incorporates incomplete data and one that either slows to a crawl or quietly escalates before safety information has matured. The separate page on tite-boin-late-toxicity covers this in detail.
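A rough illustration of what "weighting by follow-up time" means: the sketch below uses the simple linear weight familiar from the time-to-event CRM literature, in which a patient without a DLT counts as the fraction of the DLT window completed and a patient with a DLT counts fully. It is not the TITE-BOIN or TITE-CFO algorithm, each of which has its own formulation; the function names and the 28-day window are assumptions.

```python
def followup_weight(days_followed, dlt_window_days, had_dlt):
    """Weight of one patient: 1.0 for a DLT, otherwise the fraction of the window completed."""
    if had_dlt:
        return 1.0
    return min(days_followed / dlt_window_days, 1.0)

def effective_counts(patients, dlt_window_days):
    """patients: list of (days_followed, had_dlt) tuples at one dose level.

    Returns (number of DLTs, effective number of patients).
    """
    n_dlt = sum(1 for _, had_dlt in patients if had_dlt)
    effective_n = sum(
        followup_weight(days, dlt_window_days, had_dlt)
        for days, had_dlt in patients
    )
    return n_dlt, effective_n

# Two patients finished a 28-day window, two are still pending, one had a DLT on day 10.
cohort = [(28, False), (28, False), (14, False), (7, False), (10, True)]
n_dlt, effective_n = effective_counts(cohort, dlt_window_days=28)
```

Here five enrolled patients count as 3.75 effective patients with one DLT. Excluding the two pending patients would discard their partial information, while counting them as fully DLT-free would overstate safety.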

The teaching framework that emerges from all of this is: stop asking “which design is best?” and start asking “what does this design do with information, and what errors is it most likely to commit?” A checklist that any clinician can apply to any phase I paper has six items:

(1) What is the target DLT rate, and why was it chosen?
(2) What are the rules for escalation, de-escalation, dose skipping, and stopping?
(3) How does the design handle pending or late-onset DLT data?
(4) Does the design incorporate any early efficacy, pharmacokinetic, or pharmacodynamic information alongside toxicity?
(5) Are operating characteristics reported, specifically the probability of selecting the wrong dose and the number of patients exposed to unsafe levels?
(6) What is the rationale for the final recommended phase 2 dose (RP2D), and does it rest on more than just the highest tolerated dose?

If a phase I paper cannot answer all six questions, the dose recommendation it produces deserves skepticism regardless of how sophisticated the statistical method sounds. Dose escalation is not climbing a ladder; it is building a documented clinical decision path that must stand up to regulatory scrutiny and, more importantly, to the question every patient deserves an answer to: why this dose?

中文

每一個 first-in-human(FIH,首次人體試驗)腫瘤試驗都必須回答一個看似簡單、實則關鍵的問題:根據目前這個劑量層級的 dose-limiting toxicity(DLT,劑量限制毒性)觀察結果,下一個 cohort 應該接受哪個劑量?試驗回答這個問題的方式,就是它的劑量升階設計。過去二十年,腫瘤臨床試驗從簡單的計數規則走向越來越複雜的統計框架,但底層的臨床問題從未改變:如何在保護受試者的同時,有效率地找到劑量-反應曲線所在之處。

傳統 3+3 設計把劑量升階當成一張簡單的流程圖。三位病人接受某個劑量:若零人出現 DLT,升階;若一人出現 DLT,再收三人;若兩人或以上出現 DLT,代表已超過 maximum tolerated dose(MTD,最大耐受劑量)。它的吸引力很明顯:任何臨床醫師都能遵循規則,不需要統計師在場,邏輯透明到可以五分鐘內向倫理委員會說清楚。但限制同樣清楚:每個決策只用當前 cohort 的資料,完全忽略之前劑量層級所學到的一切。2024 年回顧 FDA 核准的 solid tumor 與 hematologic malignancy FIH 試驗,確認 3+3 仍是實務中最常見的升階設計。這意味著教學的第一優先不是介紹更好的方法,而是讓臨床醫師更清楚舊方法哪裡出了問題。

Bayesian Optimal Interval design(BOIN,貝氏最佳區間設計)的設計初衷,是在保留 3+3 操作簡單性的同時,把決策建立在更有原則的統計框架上。試驗開始前,研究團隊先指定一個目標 DLT 率——通常約 25-30%——然後推導出兩個邊界:一個下限(若觀察到的 DLT 率低於此,代表目前劑量可能過保守)和一個上限(若高於此,代表過毒)。試驗進行中,BOIN 只需查預先計算好的決策表,輸入觀察到的 DLT 數,得到:升階、維持、或降階。對一位在現場觀察 cohort 審查的臨床醫師來說,BOIN 看起來幾乎和 3+3 一樣。差異在床旁看不見,但在 operating characteristics(操作特性)上至關重要:模擬研究一致顯示,BOIN 比 3+3 更常選到正確的 MTD,也讓較少病人暴露在危險高劑量或治療不足的低劑量。

Continual Reassessment Method(CRM,連續再評估法)更進一步。CRM 不是套用固定規則,而是對截至目前所有累積資料擬合一條參數化的劑量-毒性曲線,然後建議下一個 cohort 接受預測 DLT 機率最接近目標的劑量。這意味著每個建議都納入了所有先前 cohort 的資訊,而不只是最近一個。統計效率的提升是真實的。臨床代價是:CRM 需要統計師在每個 cohort 之間執行模型更新,臨床團隊必須信任一個「黑盒子」建議。2026 年一篇比較多種模型輔助第一/二期設計的 review 顯示,沒有任何單一方法在所有情境下都最優——最佳設計取決於哪種錯誤對試驗團隊最重要:選到過毒劑量、遺漏較低有效劑量、或是花太長時間才做出決策。

更近期的 Calibration-Free Odds(CFO,免校準勝算)框架及其延伸方法提供了折衷方案。CFO 不需要指定明確的劑量-毒性曲線,但仍能利用相鄰和累積劑量層的資訊。它對臨床教育最有價值的貢獻或許是實用性:CFO 工具套件連同 R package 和 Shiny 網頁應用一起發表,可以在試驗開始前生成模擬,顯示每種設計在不同真實毒性情境下會如何表現。這讓臨床團隊擁有一樣他們很少有的東西:「如果我們的假設錯了,會發生什麼事」的具體視覺化答案。二維 CFO(2dCFO)把同樣的邏輯延伸到藥物組合試驗,此時劑量決策需要在一個格狀矩陣而非線性梯子上導航。

對於現代腫瘤藥物,特別是免疫活化藥物、T-cell engager(T 細胞接合藥物)、抗體藥物複合體(ADC)與放射性藥物,一個額外的複雜因素打亂了所有簡單的升階框架:毒性可能晚發。若 DLT 可能在第二或第三週期而非第一週期出現,那麼「等待目前 cohort 完成 DLT 觀察窗再升階」的整個前提,在收案速度快時就會變得脆弱。BOIN(TITE-BOIN)和 CFO(TITE-CFO、fractional CFO)的時間事件延伸版本,透過對仍在追蹤中的病人按其追蹤時間加權來解決這個問題,而不是把他們排除在外。這不是次要的技術細節:它是「理性地納入不完整資料的試驗」和「要嘛進度停滯、要嘛在安全資訊尚未成熟前就悄悄升階的試驗」之間的分界線。相關細節見另一頁 tite-boin-late-toxicity。

從以上所有內容浮現的教學框架是:停止問「哪個設計最好?」,開始問「這個設計如何運用資訊,以及它最可能犯哪種錯誤?」任何臨床醫師都能用在任何第一期論文上的六格核查單:(1) 目標 DLT 率是多少,為何如此設定?(2) 升階、降階、跳劑量和停止的規則是什麼?(3) 設計如何處理 pending 或晚發的 DLT 資料?(4) 設計是否在毒性之外也納入了早期療效、藥物動力學或藥效學資訊?(5) 是否報告了 operating characteristics——特別是選錯劑量的機率,以及多少病人暴露在不安全的劑量?(6) 最終 recommended phase 2 dose(RP2D,建議第二期劑量)的依據是什麼,它是否不只靠著「最高耐受劑量」支撐?若一篇第一期論文無法回答這六個問題,它所產生的劑量建議就值得存疑,無論統計方法的名字聽起來多複雜。劑量升階不是爬梯子——它是建立一條有文件記錄的臨床決策路徑,必須經得起法規審查,更重要的是,必須能回答每位病人都值得得到答案的問題:為什麼是這個劑量?

Key Concepts | 核心概念

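Term 術語 | Definition | 定義
3+3 design(3+3 設計) | Rule-based escalation: patients are enrolled in groups of 3, and the DLT count determines whether to escalate or de-escalate | 規則導向升階:每次以 3 人為一組,依 DLT 數決定升降
BOIN | Model-assisted design: the target DLT rate and decision boundaries are defined in advance, and decisions are read from a table | 模型輔助設計:事前定義目標 DLT 率與邊界,查表決策
CRM | Model-based design: the dose-toxicity curve is updated repeatedly as data accumulate | 模型導向設計:累積資料反覆更新劑量-毒性曲線
CFO | Calibration-free odds framework: no curve shape needs to be specified; uses information from adjacent doses | 免校準勝算框架:不需指定曲線形狀,利用相鄰劑量資訊
MTD | Maximum Tolerated Dose | 最大耐受劑量
RP2D | Recommended Phase 2 Dose | 建議第二期劑量
DLT | Dose-Limiting Toxicity | 劑量限制毒性
Operating characteristics | The statistical performance of a design (correct selection rate, overdose exposure rate, etc.) | 設計的統計表現特性(正確選擇率、過毒暴露率等)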