配置参考
========

本节详细说明 HABIT 的所有配置文件参数和选项。

概述
----

HABIT 使用 YAML 格式的配置文件来控制所有功能。每个功能模块都有对应的配置文件，用户可以通过修改配置文件来调整功能。

**配置文件类型：**

- **预处理配置**: 控制图像预处理流程
- **生境分析配置**: 控制生境分割和特征提取
- **特征提取配置**: 控制生境特征提取
- **机器学习配置**: 控制机器学习建模
- **数据配置**: 指定数据路径和结构

**配置文件特点：**

- **易于理解**: 使用 YAML 格式，易于阅读和编辑
- **灵活配置**: 支持多种参数组合
- **版本控制**: 可以纳入版本控制，便于追踪变更
- **可重复性**: 相同的配置文件产生相同的结果

通用配置参数
------------

**data_dir**: 数据目录路径

- **类型**: 字符串
- **必需**: 是
- **说明**: 可以是文件夹或 YAML 配置文件
- **示例**: `./files_preprocessing.yaml`

**out_dir**: 输出目录路径

- **类型**: 字符串
- **必需**: 是
- **说明**: 输出文件将保存在此目录
- **示例**: `./preprocessed`

**processes**: 并行进程数

- **类型**: 整数
- **必需**: 否
- **默认值**: 2
- **说明**: 用于并行处理的进程数
- **示例**: `4`

**random_state**: 随机种子

- **类型**: 整数
- **必需**: 否
- **默认值**: None
- **说明**: 用于可重复性的随机种子
- **示例**: `42`

**debug**: 调试模式

- **类型**: 布尔值
- **必需**: 否
- **默认值**: false
- **说明**: 启用详细日志的调试模式
- **示例**: `true`

预处理配置参数
------------

**配置文件示例：**

.. code-block:: yaml

   data_dir: ./files_preprocessing.yaml
   out_dir: ./preprocessed

   Preprocessing:
     dcm2nii:
       images: [delay2, delay3, delay5]
       dcm2niix_path: ./dcm2niix.exe
       compress: true
       anonymize: true

     n4_correction:
       images: [delay2, delay3, delay5]
       num_fitting_levels: 4

     resample:
       images: [delay2, delay3, delay5]
       target_spacing: [1.0, 1.0, 1.0]

     registration:
       images: [delay2, delay3, delay5]
       fixed_image: delay2
       moving_images: [delay3, delay5]
       type_of_transform: SyNRA
       use_mask: false

     zscore_normalization:
       images: [delay2, delay3, delay5]
       only_inmask: false
       mask_key: mask

     adaptive_histogram_equalization:
       images: [delay2, delay3, delay5]
       alpha: 0.3
       beta: 0.3
       radius: 5

   save_options:
     save_intermediate: true
     intermediate_steps: [dcm2nii, n4_correction, resample]

   processes: 2
   random_state: 42

**Preprocessing**: 预处理设置

**dcm2niix**: DICOM 转换设置

- ``images``: 要转换的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay2, delay3, delay5]``

- ``dcm2niix_path``: dcm2niix 可执行文件路径

  - **类型**: 字符串
  - **必需**: 是
  - **示例**: ``./dcm2niix.exe``

- ``compress``: 是否压缩输出文件

  - **类型**: 布尔值
  - **必需**: 否
  - **默认值**: ``true``
  - **示例**: ``true``

- ``anonymize``: 是否匿名化

  - **类型**: 布尔值
  - **必需**: 否
  - **默认值**: ``true``
  - **示例**: ``true``

**n4_correction**: N4 偏置场校正设置

- ``images``: 要校正的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay2, delay3, delay5]``

- ``num_fitting_levels``: 拟合级别数

  - **类型**: 整数
  - **必需**: 否
  - **默认值**: 4
  - **范围**: 2-4
  - **示例**: ``4``

**resample**: 重采样设置

- ``images``: 要重采样的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay2, delay3, delay5]``

- ``target_spacing``: 目标间距

  - **类型**: 列表
  - **必需**: 是
  - **格式**: [x, y, z]（单位：mm）
  - **示例**: ``[1.0, 1.0, 1.0]``

**registration**: 配准设置

- ``images``: 所有涉及的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay2, delay3, delay5]``

- ``fixed_image``: 固定图像

  - **类型**: 字符串
  - **必需**: 是
  - **说明**: 参考图像
  - **示例**: ``delay2``

- ``moving_images``: 要配准的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay3, delay5]``

- ``type_of_transform``: 变换类型

  - **类型**: 字符串
  - **必需**: 否
  - **默认值**: ``SyNRA``
  - **可选值**: ``SyNRA``, ``SyN``, ``Affine``
  - **示例**: ``SyNRA``

- ``use_mask``: 是否使用掩码引导配准

  - **类型**: 布尔值
  - **必需**: 否
  - **默认值**: ``false``
  - **示例**: ``false``

- ``mask_key``: 掩码键名

  - **类型**: 字符串
  - **必需**: 否（当 ``use_mask`` 为 ``true`` 时必需）
  - **示例**: ``mask``

**zscore_normalization**: Z-Score 标准化设置

- ``images``: 要标准化的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay2, delay3, delay5]``

- ``only_inmask``: 是否仅在掩码内计算统计量

  - **类型**: 布尔值
  - **必需**: 否
  - **默认值**: ``false``
  - **示例**: ``false``

- ``mask_key``: 掩码键名

  - **类型**: 字符串
  - **必需**: 否（当 ``only_inmask`` 为 ``true`` 时必需）
  - **示例**: ``mask``

**adaptive_histogram_equalization**: 自适应直方图均衡化设置

- ``images``: 要均衡化的图像列表

  - **类型**: 列表
  - **必需**: 是
  - **示例**: ``[delay2, delay3, delay5]``

- ``alpha``: 全局对比度增强因子

  - **类型**: 浮点数
  - **必需**: 否
  - **默认值**: 0.3
  - **范围**: [0, 1]
  - **示例**: ``0.3``

- ``beta``: 局部对比度增强因子

  - **类型**: 浮点数
  - **必需**: 否
  - **默认值**: 0.3
  - **范围**: [0, 1]
  - **示例**: ``0.3``

- ``radius``: 局部窗口半径

  - **类型**: 整数
  - **必需**: 否
  - **默认值**: 5
  - **单位**: 像素
  - **示例**: ``5``

**save_options**: 保存选项

- ``save_intermediate``: 是否保存中间结果

  - **类型**: 布尔值
  - **必需**: 否
  - **默认值**: ``false``
  - **示例**: ``true``

- ``intermediate_steps``: 要保存的中间步骤列表

  - **类型**: 列表
  - **必需**: 否
  - **默认值**: 空列表（表示保存所有步骤）
  - **示例**: ``[dcm2nii, n4_correction, resample]``

生境分析配置参数
------------

**配置文件示例：**

.. code-block:: yaml

   run_mode: train
   pipeline_path: ./results/habitat_pipeline.pkl
   data_dir: ./file_habitat.yaml
   out_dir: ./results/habitat/train

   FeatureConstruction:
     voxel_level:
       method: concat(raw(delay2), raw(delay3), raw(delay5))
       params: {}

     supervoxel_level:
       supervoxel_file_keyword: '*_supervoxel.nrrd'
       method: mean_voxel_features()
       params: {}

     preprocessing_for_subject_level:
       methods:
         - method: winsorize
           winsor_limits: [0.05, 0.05]
           global_normalize: true
         - method: minmax
           global_normalize: true

     preprocessing_for_group_level:
       methods:
         - method: binning
           n_bins: 10
           bin_strategy: uniform
           global_normalize: false

   HabitatsSegmention:
     clustering_mode: two_step

     supervoxel:
       algorithm: kmeans
       n_clusters: 50
       random_state: 42
       max_iter: 300
       n_init: 10

     habitat:
       algorithm: kmeans
       max_clusters: 10
       habitat_cluster_selection_method:
         - inertia
         - silhouette
       fixed_n_clusters: null
       random_state: 42
       max_iter: 300
       n_init: 10

   processes: 2
   plot_curves: true
   save_results_csv: true
   random_state: 42
   debug: false

**run_mode**: 运行模式

- **类型**: 字符串
- **必需**: 否
- **默认值**: ``train``
- **可选值**: ``train``, ``predict``
- **说明**: ``train`` 表示训练新模型，``predict`` 表示使用预训练模型进行预测。
- **示例**: ``train``

**pipeline_path**: Pipeline 文件路径

- **类型**: 字符串
- **必需**: 否 (``predict`` 模式必需)
- **说明**: 指定训练好的 Pipeline 文件路径。
- **示例**: ``./results/habitat_pipeline.pkl``

**FeatureConstruction**: 特征提取设置

**voxel_level**: 体素级特征提取

- ``method``: 特征提取方法表达式

  - **类型**: 字符串
  - **必需**: 是
  - **说明**: 支持函数式语法组合多个特征提取器。
  - **可用方法及参数**:

    **raw(image_name)**:

      - **说明**: 提取原始图像体素值（最基础的特征）
      - **参数**: 无
      - **示例**: ``raw(delay2)``

    **concat(...)**:

      - **说明**: 拼接多个特征向量
      - **参数**: 接受多个特征提取表达式
      - **示例**: ``concat(raw(delay2), raw(delay3), raw(delay5))``

    **kinetic(...)**:

      - **说明**: 提取动力学特征（wash-in/wash-out 斜率等）
      - **参数**:

        - ``timestamps`` (str, 必需): 时间戳文件路径
        - 接受多个 ``raw(image_name)`` 表达式

      - **示例**: ``kinetic(raw(LAP), raw(PVP), raw(delay_3min), timestamps=...)``
      - **提取的特征**:

        - ``wash_in_slope``: 洗入斜率
        - ``wash_out_slope_lap_pvp``: LAP 到 PVP 的洗出斜率
        - ``wash_out_slope_pvp_dp``: PVP 到延迟期的洗出斜率

    **local_entropy(...)**:

      - **说明**: 计算局部熵（衡量局部纹理复杂度）
      - **参数**:

        - ``kernel_size`` (int, 默认: ``3``): 局部邻域大小
        - ``bins`` (int, 默认: ``32``): 直方图分箱数

      - **示例**: ``local_entropy(raw(delay2), kernel_size=5, bins=32)``

    **voxel_radiomics(...)**:

      - **说明**: 提取体素级影像组学特征
      - **参数**:

        - ``params_file`` (str, 必需): PyRadiomics 参数文件路径
        - ``kernelRadius`` (int, 默认: ``1``): 局部邻域半径（1=3×3×3, 2=5×5×5）

      - **示例**: ``voxel_radiomics(raw(delay2), params_file='./parameter.yaml', kernelRadius=1)``

  - **完整示例**:

    .. code-block:: yaml

       # 简单拼接原始图像
       voxel_level:
         method: concat(raw(delay2), raw(delay3), raw(delay5))
         params: {}
       
       # 提取动力学特征
       voxel_level:
         method: kinetic(raw(LAP), raw(PVP), raw(delay_3min))
         params:
           timestamps: ./timestamps.txt
       
       # 组合局部熵和原始值
       voxel_level:
         method: concat(raw(delay2), local_entropy(raw(delay2)))
         params:
           kernel_size: 5
           bins: 32

- ``params``: 全局参数

  - **类型**: 字典
  - **必需**: 否
  - **默认值**: ``{}``
  - **说明**: 传递给所有特征提取器的公共参数。
  - **常用参数**:

    - ``timestamps`` (str): 时间戳文件路径（用于 kinetic 方法）
    - ``kernel_size`` (int): 局部邻域大小（用于 local_entropy）
    - ``bins`` (int): 直方图分箱数（用于 local_entropy）
    - ``params_file`` (str): PyRadiomics 参数文件（用于 voxel_radiomics）
    - ``kernelRadius`` (int): 体素级组学邻域半径（用于 voxel_radiomics）

**supervoxel_level**: 超像素级特征提取 (可选)

- ``supervoxel_file_keyword``: 超像素文件匹配模式

  - **类型**: 字符串
  - **必需**: 是
  - **默认值**: ``"*_supervoxel.nrrd"``
  - **说明**: 用于匹配已有的超像素分割文件（由 two_step 模式生成）。
  - **示例**: ``"*_supervoxel.nrrd"``

- ``method``: 特征聚合/提取方法

  - **类型**: 字符串
  - **必需**: 是
  - **默认值**: ``"mean_voxel_features()"``
  - **说明**: 定义如何从体素特征聚合到超像素，或直接从超像素提取特征。
  - **可用方法及参数**:

    **mean_voxel_features()**:

      - **说明**: 计算每个超像素内体素特征的平均值（最常用）
      - **参数**: 无
      - **用途**: 将体素级特征（如 ``voxel_level`` 提取的特征）聚合到超像素级
      - **示例**: ``mean_voxel_features()``

    **supervoxel_radiomics(params_file=...)**:

      - **说明**: 直接从原始图像的超像素块提取影像组学特征
      - **参数**:

        - ``params_file`` (str, 必需): PyRadiomics 参数文件路径

      - **用途**: 不依赖 ``voxel_level`` 特征，直接从超像素区域提取纹理、形状等组学特征
      - **示例**: ``supervoxel_radiomics(params_file='./parameter.yaml')``

  - **方法对比**:

    - ``mean_voxel_features()``: 依赖 ``voxel_level`` 特征，速度快，适合大多数场景
    - ``supervoxel_radiomics()``: 独立提取，特征更丰富但计算量大

  - **完整示例**:

    .. code-block:: yaml

       # 场景1：聚合体素特征（推荐）
       supervoxel_level:
         supervoxel_file_keyword: '*_supervoxel.nrrd'
         method: mean_voxel_features()
         params: {}
       
       # 场景2：直接提取影像组学特征
       supervoxel_level:
         supervoxel_file_keyword: '*_supervoxel.nrrd'
         method: supervoxel_radiomics()
         params:
           params_file: ./parameter_supervoxel.yaml

- ``params``: 参数

  - **类型**: 字典
  - **必需**: 否
  - **默认值**: ``{}``
  - **说明**: 传递给特征提取器的参数（如 ``params_file``）。

**preprocessing_for_subject_level**: 个体级别预处理 (可选)

- ``methods``: 预处理方法列表

  - **类型**: 列表
  - **必需**: 否
  - **默认值**: ``[]``
  - **说明**: 在个体水平对特征进行预处理，消除个体内异常值和尺度差异。
  - **注意**: ``two_step`` 模式下，个体级别不允许使用会删列的方法（``variance_filter``、``correlation_filter``），否则会导致跨受试者拼接后出现大量缺失列。
  - **支持方法及参数**:

    **winsorize (缩尾处理)**:

      - ``winsor_limits`` (list, 默认: ``[0.05, 0.05]``): 下限和上限的截断比例
      - ``global_normalize`` (bool, 默认: ``false``): 是否全局归一化（跨所有特征）

    **minmax (最小-最大归一化)**:

      - ``global_normalize`` (bool, 默认: ``false``): 是否全局归一化

    **zscore (Z-Score 标准化)**:

      - ``global_normalize`` (bool, 默认: ``false``): 是否全局标准化

    **robust (鲁棒标准化)**:

      - ``global_normalize`` (bool, 默认: ``false``): 是否全局归一化
      - 使用分位距（IQR）进行缩放，对异常值鲁棒

    **log (对数变换)**:

      - ``global_normalize`` (bool, 默认: ``false``): 是否全局变换
      - 自动处理负值（平移后再取对数）

    **variance_filter (低方差筛选)**:

      - ``variance_threshold`` (float, 默认: ``0.0``): 保留方差大于该阈值的特征
      - 说明: 该方法会删除特征列

    **correlation_filter (高相关筛选)**:

      - ``corr_threshold`` (float, 默认: ``0.95``): 相关系数绝对值大于该阈值时删除冗余特征
      - ``corr_method`` (str, 默认: ``spearman``): 相关系数方法，可选 ``pearson``/``spearman``/``kendall``
      - 说明: 该方法会删除特征列

  - **示例**:

    .. code-block:: yaml

       # 去除异常值后归一化
       - method: winsorize
         winsor_limits: [0.05, 0.05]
         global_normalize: true
       - method: minmax
         global_normalize: true
       
       # Z-Score 标准化
       - method: zscore
         global_normalize: false

**preprocessing_for_group_level**: 群体级别预处理 (可选)

- ``methods``: 预处理方法列表

  - **类型**: 列表
  - **必需**: 否
  - **默认值**: ``[]``
  - **说明**: 在群体水平对特征进行预处理，通常用于离散化以提高聚类的稳定性。
  - **支持方法及参数**:

    **binning (特征离散化/分箱)**:

      - ``n_bins`` (int, 默认: ``10``): 分箱数量
      - ``bin_strategy`` (str, 默认: ``uniform``): 分箱策略，可选:

        - ``uniform``: 均匀分箱（等宽）
        - ``quantile``: 分位数分箱（等频）
        - ``kmeans``: K-means 聚类分箱

      - ``global_normalize`` (bool, 默认: ``false``): 是否全局分箱（跨所有特征）

    **winsorize (缩尾处理)**:

      - ``winsor_limits`` (list, 默认: ``[0.05, 0.05]``): 下限和上限的截断比例
      - ``global_normalize`` (bool, 默认: ``false``): 是否全局归一化

    **minmax / zscore / robust / log**:

      - 同 ``preprocessing_for_subject_level``，但作用于群体汇总后的数据

    **variance_filter / correlation_filter (推荐放在群体级执行)**:

      - 用于无监督场景下的特征删列，降低噪声与冗余
      - ``variance_filter`` 参数: ``variance_threshold``
      - ``correlation_filter`` 参数: ``corr_threshold``、``corr_method``
      - 建议: 在训练阶段确定保留列，预测阶段复用同一列集合

  - **示例**:

    .. code-block:: yaml

       # 均匀分箱（推荐用于生境分析）
       - method: binning
         n_bins: 10
         bin_strategy: uniform
         global_normalize: false
       
       # 分位数分箱（等频分箱）
       - method: binning
         n_bins: 20
         bin_strategy: quantile
         global_normalize: false

**HabitatsSegmention**: 生境分割设置

- ``clustering_mode``: 聚类策略

  - **类型**: 字符串
  - **必需**: 否
  - **默认值**: ``two_step``
  - **可选值**:

    - ``one_step``: 直接对体素进行聚类。
    - ``two_step``: 先生成超像素，再对超像素进行聚类生成生境。
    - ``direct_pooling``: 直接汇总所有受试者的体素进行聚类（计算量大）。

  - **示例**: ``two_step``

**supervoxel**: 超像素聚类设置 (仅用于 ``two_step`` 模式)

- ``algorithm``: 聚类算法

  - **类型**: 字符串
  - **默认值**: ``kmeans``
  - **可选值**:

    - ``kmeans``: K-means 聚类（速度快，适合大多数场景）
    - ``gmm``: 高斯混合模型（考虑数据分布，更灵活但速度较慢）

  - **示例**: ``kmeans``

- ``n_clusters``: 超像素数量

  - **类型**: 整数
  - **必需**: 是
  - **说明**: 每个受试者生成的超像素个数。推荐范围: 30-100。
  - **示例**: ``50``

- ``random_state``: 随机种子

  - **类型**: 整数
  - **默认值**: ``42``
  - **说明**: 用于结果可重复性

- ``max_iter``: 最大迭代次数

  - **类型**: 整数
  - **默认值**: ``300``
  - **说明**: 聚类算法的最大迭代次数

- ``n_init``: 初始化次数

  - **类型**: 整数
  - **默认值**: ``10``
  - **说明**: 使用不同初始化运行算法的次数，选择最佳结果

- ``covariance_type``: 协方差类型（仅用于 ``gmm``）

  - **类型**: 字符串
  - **默认值**: ``full``
  - **可选值**: ``full``, ``tied``, ``diag``, ``spherical``
  - **说明**: 

    - ``full``: 每个组件有独立的完整协方差矩阵
    - ``tied``: 所有组件共享相同的协方差矩阵
    - ``diag``: 对角协方差矩阵（假设特征独立）
    - ``spherical``: 球形协方差（各向同性）

- **完整示例**:

  .. code-block:: yaml

     # K-means 聚类（推荐）
     supervoxel:
       algorithm: kmeans
       n_clusters: 50
       random_state: 42
       max_iter: 300
       n_init: 10
     
     # GMM 聚类
     supervoxel:
       algorithm: gmm
       n_clusters: 50
       covariance_type: full
       random_state: 42
       max_iter: 100
       n_init: 5

**one_step_settings**: One-Step 模式设置 (仅用于 ``one_step`` 模式)

- ``min_clusters``: 最小聚类数

  - **类型**: 整数
  - **默认值**: ``2``
  - **说明**: 自动选择时的下限

- ``max_clusters``: 最大聚类数

  - **类型**: 整数
  - **默认值**: ``10``
  - **说明**: 自动选择时的上限

- ``fixed_n_clusters``: 固定聚类数

  - **类型**: 整数或 null
  - **默认值**: ``null``
  - **说明**: 若设置，则跳过自动选择，直接使用该值。

- ``selection_method``: 自动选择指标

  - **类型**: 字符串
  - **默认值**: ``silhouette``
  - **可选值及说明**:

    - ``silhouette``: 轮廓系数（-1 到 1，越接近 1 表示聚类越紧密）
    - ``calinski_harabasz``: Calinski-Harabasz 指数（越大表示聚类越好）
    - ``davies_bouldin``: Davies-Bouldin 指数（越小表示聚类越好）
    - ``inertia``: 簇内平方和（越小表示聚类越紧密，内部用 Kneedle 选拐点）
    - ``kneedle``: Kneedle 方法（对 inertia 曲线归一化后选最大偏离点）

  - **推荐**: ``silhouette``（综合性能最佳）

- ``plot_validation_curves``: 是否绘制验证曲线

  - **类型**: 布尔值
  - **默认值**: ``true``
  - **说明**: 生成不同聚类数下的指标曲线图，帮助理解自动选择结果

**habitat**: 生境聚类设置

- ``algorithm``: 聚类算法

  - **类型**: 字符串
  - **默认值**: ``kmeans``
  - **可选值**:

    - ``kmeans``: K-means 聚类
    - ``gmm``: 高斯混合模型

- ``max_clusters``: 最大生境数

  - **类型**: 整数
  - **必需**: 是
  - **说明**: 自动选择生境数时的上限。推荐范围: 5-10。
  - **示例**: ``10``

- ``min_clusters``: 最小生境数

  - **类型**: 整数
  - **默认值**: ``2``
  - **说明**: 自动选择生境数时的下限。

- ``habitat_cluster_selection_method``: 自动选择指标

  - **类型**: 列表或字符串
  - **默认值**: ``[inertia]``
  - **可选值及说明**:

    - ``inertia``: 簇内平方和（越小越好，适用于 kmeans，内部用 Kneedle 选拐点）
    - ``kneedle``: Kneedle 方法（对 inertia 曲线归一化后选最大偏离点）
    - ``silhouette``: 轮廓系数（-1 到 1，越接近 1 越好）
    - ``calinski_harabasz``: Calinski-Harabasz 指数（越大越好）
    - ``davies_bouldin``: Davies-Bouldin 指数（越小越好）
    - ``aic``: 赤池信息准则（越小越好，仅用于 gmm）
    - ``bic``: 贝叶斯信息准则（越小越好，仅用于 gmm）

  - **说明**: 可指定多个指标，系统会综合评估选择最佳生境数。
  - **示例**: ``[inertia, silhouette]``

- ``fixed_n_clusters``: 固定生境数

  - **类型**: 整数或 null
  - **默认值**: ``null``
  - **说明**: 若设置为具体数值，则跳过自动选择，直接使用该生境数。

- ``random_state``: 随机种子

  - **类型**: 整数
  - **默认值**: ``42``

- ``max_iter``: 最大迭代次数

  - **类型**: 整数
  - **默认值**: ``300`` (kmeans) 或 ``100`` (gmm)

- ``n_init``: 初始化次数

  - **类型**: 整数
  - **默认值**: ``10`` (kmeans) 或 ``1`` (gmm)

- **完整示例**:

  .. code-block:: yaml

     # 自动选择生境数（推荐）
     habitat:
       algorithm: kmeans
       max_clusters: 10
       min_clusters: 2
       habitat_cluster_selection_method:
         - inertia
         - silhouette
       fixed_n_clusters: null
       random_state: 42
     
     # 固定生境数
     habitat:
       algorithm: kmeans
       fixed_n_clusters: 5
       random_state: 42

**postprocess_supervoxel / postprocess_habitat**: 连通域后处理设置

- **类型**: 字典
- **必需**: 否
- **默认值**: ``enabled: false``
- **说明**:

  - ``postprocess_supervoxel`` 作用于超体素标签图（主要 two_step 阶段）。
  - ``postprocess_habitat`` 作用于最终生境标签图（one_step/two_step/direct_pooling）。
  - 当前实现采用 SimpleITK 快路径：先按标签移除小连通域，再按最近种子标签回填。
  - 该流程旨在减少碎片并保持 ROI 内体素不丢失。

- **子参数**:

  - ``enabled`` (bool, 默认: ``false``): 是否启用后处理
  - ``min_component_size`` (int, 默认: ``30``): 最小连通域体素数阈值
  - ``connectivity`` (int, 默认: ``1``): 邻域连通性；当前快路径中 ``1`` 为面邻接优先，``2``/``3`` 均表现为全连接行为
  - ``debug_postprocess`` (bool, 默认: ``false``): 是否输出后处理详细日志
  - ``reassign_method`` (str, 默认: ``neighbor_vote``): 兼容字段，当前快路径已忽略
  - ``max_iterations`` (int, 默认: ``3``): 兼容字段，当前快路径已忽略

- **示例**:

  .. code-block:: yaml

     HabitatsSegmention:
       postprocess_supervoxel:
         enabled: false
         min_component_size: 30
         connectivity: 1
         debug_postprocess: false
         reassign_method: neighbor_vote  # deprecated/ignored
         max_iterations: 3               # deprecated/ignored

       postprocess_habitat:
         enabled: true
         min_component_size: 30
         connectivity: 1
         debug_postprocess: false
         reassign_method: neighbor_vote  # deprecated/ignored
         max_iterations: 3               # deprecated/ignored

**plot_curves**: 是否生成和保存图表

- **类型**: 布尔值
- **默认值**: ``true``

**save_results_csv**: 是否将结果保存为 CSV 文件

- **类型**: 布尔值
- **默认值**: ``true``

特征提取配置参数
------------

**配置文件示例：**

.. code-block:: yaml

   params_file_of_non_habitat: ./parameter.yaml
   params_file_of_habitat: ./parameter_habitat.yaml

   raw_img_folder: ./preprocessed/processed_images
   habitats_map_folder: ./results/habitat
   out_dir: ./results/features

   n_processes:3
   habitat_pattern: '*_habitats.nrrd'

   feature_types:
     - traditional
     - non_radiomics
     - whole_habitat
     - each_habitat
     - msi
     - ith_score

   n_habitats:

   debug: false

**params_file_of_non_habitat**: 从原始图像提取特征的参数文件

- **类型**: 字符串
- **必需**: 是
- **说明**: 使用 pyradiomics 提取传统影像组学特征的参数文件
- **示例**: ``./parameter.yaml``

**params_file_of_habitat**: 从生境图提取特征的参数文件

- **类型**: 字符串
- **必需**: 是
- **说明**: 使用 pyradiomics 从生境图中提取特征的参数文件
- **示例**: ``./parameter_habitat.yaml``

**raw_img_folder**: 原始图像根目录

- **类型**: 字符串
- **必需**: 是
- **说明**: 包含预处理后的图像
- **示例**: ``./preprocessed/processed_images``

**habitats_map_folder**: 生境图根目录

- **类型**: 字符串
- **必需**: 是
- **说明**: 包含生成的生境图
- **示例**: ``./results/habitat``

**out_dir**: 输出目录

- **类型**: 字符串
- **必需**: 是
- **说明**: 特征文件将保存在此目录
- **示例**: ``./results/features``

**n_processes**: 并行进程数

- **类型**: 整数
- **必需**: 否
- **默认值**: 2
- **说明**: 用于并行处理的进程数
- **示例**: ``3``

**habitat_pattern**: 生境文件匹配模式

- **类型**: 字符串
- **必需**: 否
- **默认值**: ``'*_habitats.nrrd'``
- **说明**: 用于匹配生境图文件，支持通配符（`*`）
- **示例**: ``*_habitats.nrrd``

**feature_types**: 特征类型列表

- **类型**: 列表
- **必需**: 否
- **默认值**: ``[traditional]``
- **可选值**: ``traditional``, ``non_radiomics``, ``whole_habitat``, ``each_habitat``, ``msi``, ``ith_score``
- **示例**: ``[traditional, non_radiomics, whole_habitat]``

**n_habitats**: 生境数量

- **类型**: 整数或 null
- **必需**: 否
- **默认值**: ``null``（表示自动检测）
- **说明**: 可以手动指定生境数量
- **示例**: ``null``

机器学习配置参数
------------

**配置文件示例：**

.. code-block:: yaml

   input:
     - path: ./results/features/combined_features.csv
       name: training_data
       subject_id_col: Subject
       label_col: label
   output: ./results/ml/train
   random_state: 42
   
   split_method: stratified
   test_size: 0.3
   
   normalization:
     method: z_score
     params: {}
   
   feature_selection_methods:
     - method: variance
       params:
         threshold: 0.0
     - method: correlation
       params:
         threshold: 0.9
   
   models:
     RandomForest:
       params:
         n_estimators: 100
         random_state: 42
     LogisticRegression:
       params:
         max_iter: 1000
   
   is_visualize: true
   is_save_model: true
   
   visualization:
     enabled: true
     plot_types: [roc, dca, calibration, pr, confusion, shap]
     dpi: 600
     format: pdf

**mode（CLI 参数）**: 运行模式

- **位置**: 命令行参数 `habit model --mode <train|predict>`
- **说明**: 
  - `train` 使用 `MLConfig`（训练配置结构）
  - `predict` 使用 `PredictionConfig`（预测配置结构）

**input**: 输入数据配置

- **类型**: 列表
- **必需**: 是
- **说明**: 包含一个或多个输入文件的配置字典。
- **子参数**:

  - ``path``: 特征文件路径 (CSV/Excel)。
  - ``name``: 数据集名称。
  - ``subject_id_col``: 受试者 ID 列名。
  - ``label_col``: 标签列名。

**output**: 输出目录

- **类型**: 字符串
- **必需**: 是
- **说明**: 结果、模型和图表保存的路径。

**split_method**: 数据划分方法

- **类型**: 字符串
- **默认值**: ``stratified``
- **可选值**: ``random``, ``stratified``, ``custom``

**test_size**: 测试集比例

- **类型**: 浮点数
- **默认值**: ``0.3``
- **范围**: (0, 1)

**normalization**: 特征归一化设置

- ``method``: 归一化方法

  - **类型**: 字符串
  - **默认值**: ``z_score``
  - **可选值**:

    - ``z_score``: Z-Score 标准化 (StandardScaler)
    - ``min_max``: 最小-最大归一化 (MinMaxScaler)
    - ``robust``: 鲁棒缩放 (RobustScaler)
    - ``max_abs``: 最大绝对值缩放 (MaxAbsScaler)
    - ``normalizer``: L1/L2 归一化 (Normalizer)
    - ``quantile``: 分位数转换 (QuantileTransformer)
    - ``power``: 幂变换 (PowerTransformer)

- ``params``: 方法特定参数

  - **类型**: 字典
  - **默认值**: ``{}``
  - **说明**: 根据选择的归一化方法传递不同的参数
  - **各方法支持的参数**:

    **z_score (StandardScaler)**:

      - ``with_mean`` (bool, 默认: ``true``): 是否在缩放前中心化数据
      - ``with_std`` (bool, 默认: ``true``): 是否缩放到单位方差

    **min_max (MinMaxScaler)**:

      - ``feature_range`` (list, 默认: ``[0, 1]``): 目标范围，如 ``[0, 1]`` 或 ``[-1, 1]``

    **robust (RobustScaler)**:

      - ``with_centering`` (bool, 默认: ``true``): 是否在缩放前中心化数据
      - ``with_scaling`` (bool, 默认: ``true``): 是否缩放到分位距
      - ``quantile_range`` (list, 默认: ``[25.0, 75.0]``): 用于计算缩放的分位数范围（IQR）

    **max_abs (MaxAbsScaler)**:

      - 无特殊参数（使用默认值即可）

    **quantile (QuantileTransformer)**:

      - ``n_quantiles`` (int, 默认: ``1000``): 分位数数量
      - ``output_distribution`` (str, 默认: ``uniform``): 输出分布，可选 ``uniform`` 或 ``normal``
      - ``subsample`` (int, 默认: ``10000``): 用于估计分位数的最大样本数

    **power (PowerTransformer)**:

      - ``method`` (str, 默认: ``yeo-johnson``): 变换方法，可选 ``yeo-johnson`` 或 ``box-cox``

  - **示例**:

    .. code-block:: yaml

       # Z-Score 标准化
       normalization:
         method: z_score
         params: {}
       
       # 最小-最大归一化到 [-1, 1]
       normalization:
         method: min_max
         params:
           feature_range: [-1, 1]
       
       # 鲁棒缩放（对异常值鲁棒）
       normalization:
         method: robust
         params:
           quantile_range: [25.0, 75.0]

**sampling**: 训练集重采样设置

- **类型**: 字典
- **必需**: 否
- **默认值**: ``enabled: false``
- **说明**: 仅对训练数据进行重采样；验证/测试数据不会被重采样。

- ``enabled``: 是否启用重采样

  - **类型**: 布尔值
  - **默认值**: ``false``

- ``method``: 重采样方法

  - **类型**: 字符串
  - **默认值**: ``random_over``
  - **可选值**:

    - ``random_over``: 随机过采样少数类
    - ``random_under``: 随机欠采样多数类
    - ``smote``: SMOTE 过采样（需要安装 ``imbalanced-learn``）

- ``ratio``: 重采样比例

  - **类型**: 浮点数
  - **默认值**: ``1.0``
  - **范围**: ``> 0``
  - **补充说明**:

    - 当 ``method: random_over`` 时，少数类目标数量约为 ``majority_count * ratio``。
    - 当 ``method: random_under`` 时，建议 ``ratio <= 1.0``（若大于 1.0 会报错）。
    - 当 ``method: smote`` 时，作为 ``SMOTE(sampling_strategy=ratio)`` 传入。

- ``random_state``: 随机种子

  - **类型**: 整数
  - **默认值**: ``42``
  - **说明**: 控制采样与打乱顺序的可复现性。

- **执行时机与调用链**:

  - ``run_pipeline()`` 中训练模型时，会调用 ``_train_with_optional_sampling(...)``。
  - 在该函数内部调用 ``_resample_training_data(...)``，完成重采样后再 ``fit(...)``。
  - Holdout 与 K-Fold 两个工作流都走这条链路。

- **如何确认重采样已执行**:

  - 在输出目录日志（如 ``processing.log``）中搜索以下关键字：

    - ``Sampling enabled: method=...``
    - ``Sampling completed: after_counts=...``
    - ``random_over skipped`` 或 ``random_under skipped``

- **示例**:

  .. code-block:: yaml

     sampling:
       enabled: true
       method: random_over
       ratio: 1.0
       random_state: 42

**feature_selection_methods**: 特征选择方法列表

- **类型**: 列表
- **说明**: 按顺序执行的特征选择步骤。每个方法都有特定的参数。
- **可选方法及其参数**:

  **variance (方差阈值)**:

    - ``threshold`` (float, 默认: ``0.0``): 方差阈值，低于此值的特征被移除
    - ``top_k`` (int, 可选): 选择方差最大的前 k 个特征（若指定则覆盖 threshold）
    - ``top_percent`` (float, 可选): 选择方差最大的前 x% 特征（0-100）
    - ``plot_variances`` (bool, 默认: ``true``): 是否绘制方差分布图

  **correlation (相关性过滤)**:

    - ``threshold`` (float, 默认: ``0.8``): 相关系数阈值，高于此值的特征对会移除其一
    - ``method`` (str, 默认: ``spearman``): 相关系数计算方法，可选 ``pearson``, ``spearman``, ``kendall``
    - ``visualize`` (bool, 默认: ``false``): 是否生成相关性热图

  **anova (方差分析)**:

    - ``p_threshold`` (float, 默认: ``0.05``): p 值阈值
    - ``n_features_to_select`` (int, 可选): 选择前 n 个特征（若指定则覆盖 p_threshold）
    - ``plot_importance`` (bool, 默认: ``true``): 是否绘制特征重要性图

  **chi2 (卡方检验)**:

    - ``p_threshold`` (float, 默认: ``0.05``): p 值阈值
    - ``n_features_to_select`` (int, 可选): 选择前 n 个特征
    - ``plot_importance`` (bool, 默认: ``true``): 是否绘制特征重要性图
    - **注意**: 仅适用于非负特征

  **lasso (Lasso 正则化)**:

    - ``cv`` (int, 默认: ``10``): 交叉验证折数
    - ``n_alphas`` (int, 默认: ``100``): alpha 参数的数量
    - ``alphas`` (list, 可选): 自定义 alpha 参数列表
    - ``random_state`` (int, 默认: ``42``): 随机种子
    - ``visualize`` (bool, 默认: ``false``): 是否生成系数路径图

  **rfecv (递归特征消除 + 交叉验证)**:

    - ``estimator`` (str, 默认: ``RandomForestClassifier``): 使用的估计器，可选:

      - 分类器: ``LogisticRegression``, ``RandomForestClassifier``, ``SVC``, ``GradientBoostingClassifier``, ``XGBClassifier``
      - 回归器: ``LinearRegression``, ``RandomForestRegressor``, ``SVR``, ``GradientBoostingRegressor``, ``XGBRegressor``

    - ``step`` (int, 默认: ``1``): 每次迭代移除的特征数
    - ``cv`` (int, 默认: ``5``): 交叉验证折数
    - ``scoring`` (str, 默认: ``roc_auc``): 评分指标
    - ``min_features_to_select`` (int, 默认: ``1``): 最少保留的特征数
    - ``n_jobs`` (int, 默认: ``-1``): 并行作业数（``-1`` 表示使用所有 CPU）
    - ``random_state`` (int, 可选): 随机种子

- **示例**:

  .. code-block:: yaml

     # 方差阈值筛选
     feature_selection_methods:
       - method: variance
         params:
           threshold: 0.0
           plot_variances: true
     
     # 相关性过滤 + ANOVA
     feature_selection_methods:
       - method: correlation
         params:
           threshold: 0.9
           method: spearman
       - method: anova
         params:
           p_threshold: 0.05

**models**: 模型训练设置

定义要训练的一个或多个模型。

- **支持的模型类型及常用参数**:

  **LogisticRegression (逻辑回归)**:

    - ``max_iter`` (int, 默认: ``100``): 最大迭代次数
    - ``C`` (float, 默认: ``1.0``): 正则化强度的倒数
    - ``penalty`` (str, 默认: ``l2``): 正则化类型，可选 ``l1``, ``l2``, ``elasticnet``
    - ``solver`` (str, 默认: ``lbfgs``): 优化算法
    - ``random_state`` (int): 随机种子

  **RandomForest (随机森林)**:

    - ``n_estimators`` (int, 默认: ``100``): 决策树数量
    - ``max_depth`` (int, 可选): 树的最大深度
    - ``min_samples_split`` (int, 默认: ``2``): 分裂节点所需的最小样本数
    - ``min_samples_leaf`` (int, 默认: ``1``): 叶子节点的最小样本数
    - ``max_features`` (str/int, 默认: ``sqrt``): 分裂时考虑的最大特征数
    - ``random_state`` (int): 随机种子

  **XGBoost (极端梯度提升)**:

    - ``n_estimators`` (int, 默认: ``100``): 提升轮数
    - ``max_depth`` (int, 默认: ``3``): 树的最大深度
    - ``learning_rate`` (float, 默认: ``0.1``): 学习率
    - ``subsample`` (float, 默认: ``1.0``): 样本采样比例
    - ``colsample_bytree`` (float, 默认: ``1.0``): 特征采样比例
    - ``random_state`` (int): 随机种子

  **SVM (支持向量机)**:

    - ``C`` (float, 默认: ``1.0``): 正则化参数
    - ``kernel`` (str, 默认: ``rbf``): 核函数，可选 ``linear``, ``poly``, ``rbf``, ``sigmoid``
    - ``gamma`` (str/float, 默认: ``scale``): 核系数
    - ``probability`` (bool, 默认: ``false``): 是否启用概率估计
    - ``random_state`` (int): 随机种子

  **KNN (K 近邻)**:

    - ``n_neighbors`` (int, 默认: ``5``): 邻居数量
    - ``weights`` (str, 默认: ``uniform``): 权重函数，可选 ``uniform``, ``distance``
    - ``metric`` (str, 默认: ``minkowski``): 距离度量

  **AutoGluon (自动机器学习)**:

    - ``time_limit`` (int): 训练时间限制（秒）
    - ``presets`` (str, 默认: ``medium_quality``): 预设质量，可选 ``best_quality``, ``high_quality``, ``medium_quality``

- **示例**:

  .. code-block:: yaml

     # 训练多个模型
     models:
       LogisticRegression:
         params:
           max_iter: 1000
           C: 1.0
           random_state: 42
       
       RandomForest:
         params:
           n_estimators: 200
           max_depth: 10
           random_state: 42
       
       XGBoost:
         params:
           n_estimators: 100
           max_depth: 5
           learning_rate: 0.1
           random_state: 42

**is_visualize**: 是否启用可视化

- **类型**: 布尔值
- **默认值**: ``true``

**visualization**: 可视化详细设置

- ``plot_types``: 要生成的图表类型。

  - **可选值**: ``roc``, ``dca``, ``calibration``, ``pr``, ``confusion``, ``shap``

- ``dpi``: 分辨率 (默认 600)。
- ``format``: 文件格式 (如 ``pdf``, ``png``)。

数据配置参数
------------

**配置文件示例：**

.. code-block:: yaml

   # 控制是否自动读取目录中的第一个文件
   auto_select_first_file: true

   images:
     subject1:
       T1: /path/to/subject1/T1/T1.nii.gz
       T2: /path/to/subject1/T2/T2.nii.gz
     subject2:
       T1: /path/to/subject2/T1/T1.nii.gz
       T2: /path/to/subject2/T2/T2.nii.gz

   masks:
     subject1:
       T1: /path/to/subject1/T1/mask_T1.nii.gz
     subject2:
       T1: /path/to/subject2/T1/mask_T1.nii.gz

**auto_select_first_file**: 是否自动读取目录中的第一个文件

- **类型**: 布尔值
- **默认值**: ``true``
- **说明**: 

  - ``true``: 自动读取目录中的第一个文件（适用于已转换的 nii 文件等场景）。
  - ``false``: 保持目录路径不变（适用于 dcm2nii 等需要整个文件夹的任务）。

**images**: 图像数据路径

- **类型**: 字典
- **必需**: 是
- **说明**: 嵌套字典，第一层是受试者 ID，第二层是图像类型（Key）。

**masks**: 掩码数据路径

- **类型**: 字典
- **必需**: 否
- **说明**: 结构同 ``images``。通常用于指定 ROI。

配置文件验证
------------

HABIT 提供了配置文件验证机制，确保参数的正确性。

**验证规则：**

1. **必需参数检查**: 检查所有必需参数是否提供
2. **类型检查**: 检查参数类型是否正确
3. **范围检查**: 检查参数值是否在有效范围内
4. **依赖检查**: 检查参数依赖是否满足

**验证示例：**

.. code-block:: python

   from habit.core.common.config_loader import load_config

   # 加载配置并验证
   config = load_config('./config.yaml')

   # 如果配置有误，会抛出异常
   # ValueError: Missing required parameter: data_dir

常见问题
--------

**Q1: 如何创建配置文件？**

A: 可以通过以下方式创建：
1. 复制示例配置文件并修改
2. 参考本文档创建新的配置文件
3. 使用配置文件生成工具（如果有）

**Q2: 如何调试配置文件？**

A: 可以使用以下方法：
1. 使用 `debug` 模式启用详细日志
2. 检查配置文件语法
3. 逐步添加参数，定位问题
4. 查看错误信息