特征重要性分析

1. 特征重要性基础

1.1 特征重要性的意义

为什么重要?

在量化投资中,特征重要性分析有三大价值:

  1. 模型解释性:理解模型如何做决策
  2. 特征筛选:识别有效因子,剔除冗余因子
  3. 风险控制:避免模型依赖无效特征

量化场景的特殊性

量化数据的特点:

  • 高维特征:数百到数千个因子
  • 低信噪比:大量噪声特征
  • 强相关性:因子间存在多重共线性
  • 非平稳性:因子重要性随时间变化

1.2 LightGBM的特征重要性类型

LightGBM提供两种特征重要性:

  1. Split重要性(分裂重要性)

    • 基于特征作为分裂节点的次数
    • 衡量特征在分裂决策中的使用频率
  2. Gain重要性(增益重要性)

    • 基于分裂带来的信息增益
    • 衡量特征对模型性能的贡献度

代码示例

import lightgbm as lgb
import numpy as np
import matplotlib.pyplot as plt
 
# 训练模型
model = lgb.train(params, train_data, num_boost_round=1000,
                  valid_sets=[train_data, val_data],
                  callbacks=[lgb.early_stopping(stopping_rounds=50)])
 
# 获取特征重要性
split_importance = model.feature_importance(importance_type='split')
gain_importance = model.feature_importance(importance_type='gain')
 
# 打印
print("Split重要性(前10):")
for i, (idx, imp) in enumerate(sorted(enumerate(split_importance),
                                      key=lambda x: x[1], reverse=True)[:10]):
    print(f"  {idx+1}. 特征{idx}: {imp}")
 
print("\nGain重要性(前10):")
for i, (idx, imp) in enumerate(sorted(enumerate(gain_importance),
                                      key=lambda x: x[1], reverse=True)[:10]):
    print(f"  {idx+1}. 特征{idx}: {imp:.2f}")

2. 特征重要性可视化

2.1 条形图可视化

def plot_feature_importance(model, feature_names, importance_type='gain', top_n=20):
    """
    绘制特征重要性条形图
 
    参数:
        model: LightGBM模型
        feature_names: 特征名称列表
        importance_type: 'split' 或 'gain'
        top_n: 显示前n个特征
    """
    # 获取重要性
    importance = model.feature_importance(importance_type=importance_type)
 
    # 排序
    indices = np.argsort(importance)[::-1][:top_n]
    importance = importance[indices]
    feature_names = np.array(feature_names)[indices]
 
    # 绘制
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(importance)), importance[::-1])
    plt.yticks(range(len(importance)), feature_names[::-1])
    plt.xlabel(f'Feature Importance ({importance_type})')
    plt.title(f'Top {top_n} Feature Importance')
    plt.tight_layout()
    plt.show()
 
# 使用示例
feature_names = [f'factor_{i}' for i in range(X.shape[1])]
plot_feature_importance(model, feature_names, importance_type='gain', top_n=20)
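
如果不需要自定义样式,也可以直接使用 LightGBM 自带的绘图函数,效果与上面的自定义函数类似(参数名以所用版本的文档为准):

# LightGBM内置的重要性绘图,max_num_features限制显示的特征数量
lgb.plot_importance(model, importance_type='gain', max_num_features=20, figsize=(12, 8))
plt.show()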

2.2 对数重要性图

def plot_feature_importance_log(model, feature_names, importance_type='gain'):
    """
    绘制对数特征重要性图
 
    适用于特征重要性差异巨大的情况
    """
    importance = model.feature_importance(importance_type=importance_type)
    importance = np.maximum(importance, 1e-6)  # 避免log(0)
 
    indices = np.argsort(importance)[::-1]
    importance_log = np.log(importance[indices])
 
    plt.figure(figsize=(12, 8))
    plt.plot(range(len(importance_log)), importance_log, marker='o')
    plt.xlabel('Feature Rank')
    plt.ylabel('Log Feature Importance')
    plt.title(f'Feature Importance Distribution (Log Scale) - {importance_type}')
    plt.grid(True)
    plt.show()
 
plot_feature_importance_log(model, feature_names, importance_type='gain')

3. 基于Permutation的特征重要性

3.1 Permutation Importance原理

核心思想

Permutation Importance通过打乱特征的值来评估其重要性:

  1. 计算基准模型性能
  2. 打乱某个特征的值
  3. 重新计算模型性能
  4. 性能下降越大,特征越重要

优势

  • 不依赖于模型类型
  • 考虑特征间的交互作用
  • 以预测性能的实际变化衡量重要性,比基于分裂统计的重要性更贴近特征的真实贡献

代码实现

from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import pearsonr, spearmanr
 
def permutation_importance(model, X, y, metric='ic', n_repeats=5, random_state=42):
    """
    计算Permutation Importance
 
    参数:
        model: 训练好的模型
        X: 特征矩阵
        y: 目标变量
        metric: 评估指标 ('rmse', 'r2', 'ic', 'rank_ic')
        n_repeats: 重复次数
        random_state: 随机种子
 
    返回:
        importances: 特征重要性数组,shape=[n_features, n_repeats]
    """
    np.random.seed(random_state)
    n_features = X.shape[1]
    importances = np.zeros((n_features, n_repeats))
 
    # 计算基准性能
    y_pred = model.predict(X)
    if metric == 'rmse':
        baseline_score = np.sqrt(mean_squared_error(y, y_pred))
    elif metric == 'r2':
        baseline_score = r2_score(y, y_pred)
    elif metric == 'ic':
        baseline_score = pearsonr(y_pred, y)[0]
    elif metric == 'rank_ic':
        baseline_score = spearmanr(y_pred, y)[0]
    else:
        raise ValueError(f"Unknown metric: {metric}")
 
    print(f"Baseline {metric}: {baseline_score:.4f}")
 
    # 对每个特征进行Permutation
    for feature_idx in range(n_features):
        print(f"Processing feature {feature_idx + 1}/{n_features}")
 
        for repeat in range(n_repeats):
            # 打乱特征
            X_permuted = X.copy()
            np.random.shuffle(X_permuted[:, feature_idx])
 
            # 重新预测
            y_pred_permuted = model.predict(X_permuted)
 
            # 计算性能
            if metric == 'rmse':
                score = np.sqrt(mean_squared_error(y, y_pred_permuted))
                importance = score - baseline_score  # RMSE增加,重要
            elif metric == 'r2':
                score = r2_score(y, y_pred_permuted)
                importance = baseline_score - score  # R2减少,重要
            elif metric == 'ic':
                score = pearsonr(y_pred_permuted, y)[0]
                importance = baseline_score - score  # IC减少,重要
            elif metric == 'rank_ic':
                score = spearmanr(y_pred_permuted, y)[0]
                importance = baseline_score - score  # Rank IC减少,重要
 
            importances[feature_idx, repeat] = importance
 
    return importances
 
# 使用示例
importances = permutation_importance(model, X_val, y_val, metric='ic', n_repeats=5)
 
# 计算平均重要性
mean_importance = importances.mean(axis=1)
std_importance = importances.std(axis=1)
 
# 排序
sorted_indices = np.argsort(mean_importance)[::-1]
 
print("\nPermutation Importance (IC):")
for i, idx in enumerate(sorted_indices[:10]):
    print(f"  {i+1}. 特征{idx}: {mean_importance[idx]:.4f} ± {std_importance[idx]:.4f}")

3.2 可视化Permutation Importance

def plot_permutation_importance(importances, feature_names, top_n=20):
    """
    绘制Permutation Importance
 
    参数:
        importances: 特征重要性数组,shape=[n_features, n_repeats]
        feature_names: 特征名称
        top_n: 显示前n个特征
    """
    # 计算平均重要性
    mean_imp = importances.mean(axis=1)
    std_imp = importances.std(axis=1)
 
    # 排序
    indices = np.argsort(mean_imp)[::-1][:top_n]
    mean_imp = mean_imp[indices]
    std_imp = std_imp[indices]
    feature_names = np.array(feature_names)[indices]
 
    # 绘制
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(mean_imp)), mean_imp[::-1], xerr=std_imp[::-1],
             color='steelblue', alpha=0.7, capsize=3)
    plt.yticks(range(len(mean_imp)), feature_names[::-1])
    plt.xlabel('Permutation Importance')
    plt.title(f'Top {top_n} Permutation Feature Importance')
    plt.tight_layout()
    plt.show()
 
plot_permutation_importance(importances, feature_names, top_n=20)

4. SHAP值分析

4.1 SHAP原理

SHAP(SHapley Additive exPlanations)

SHAP值基于博弈论中的Shapley值,提供一致的局部解释。

核心思想

每个特征对预测的贡献满足加性分解:

  ŷ_i = Base Value + Σ_j φ_{i,j}

其中:

  • ŷ_i 是第i个样本的预测值
  • Base Value 是所有样本的均值预测(即 explainer.expected_value)
  • φ_{i,j} 是第j个特征对第i个样本预测的贡献(SHAP值)

代码实现

import shap
 
# 计算SHAP值
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
 
# SHAP Summary Plot
shap.summary_plot(shap_values, X, feature_names=feature_names, plot_type='bar')
 
# SHAP Summary Plot (详细)
shap.summary_plot(shap_values, X, feature_names=feature_names)
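
可以用几行代码验证上面的加性分解(回归模型的 expected_value 通常是一个标量;若为列表则取对应元素):

# 验证: Base Value + 各特征SHAP值之和 ≈ 模型预测值
pred = model.predict(X)
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print("最大重构误差:", np.abs(pred - reconstructed).max())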

4.2 特征重要性排序

def shap_feature_importance(shap_values, feature_names, top_n=20):
    """
    基于SHAP值的特征重要性
 
    参数:
        shap_values: SHAP值数组
        feature_names: 特征名称
        top_n: 显示前n个特征
 
    返回:
        重要性排序结果
    """
    # 计算每个特征的平均绝对SHAP值
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
 
    # 排序
    indices = np.argsort(mean_abs_shap)[::-1][:top_n]
    importance = mean_abs_shap[indices]
    names = np.array(feature_names)[indices]
 
    # 打印
    print("SHAP Feature Importance:")
    for i, (name, imp) in enumerate(zip(names, importance)):
        print(f"  {i+1}. {name}: {imp:.4f}")
 
    return indices, importance, names
 
# 使用示例
indices, importance, names = shap_feature_importance(shap_values, feature_names, top_n=20)

4.3 个体解释

def plot_shap_force(explainer, X, sample_idx, feature_names):
    """
    绘制单个样本的SHAP Force Plot
 
    参数:
        explainer: SHAP解释器
        X: 特征矩阵
        sample_idx: 样本索引
        feature_names: 特征名称
    """
    # 计算该样本的SHAP值
    shap_values = explainer.shap_values(X[[sample_idx]])

    # 回归模型的expected_value通常是标量,多输出/多分类时为列表,这里做兼容处理
    expected_value = explainer.expected_value
    if isinstance(expected_value, (list, np.ndarray)):
        expected_value = expected_value[0]

    # 绘制Force Plot
    shap.force_plot(expected_value,
                    shap_values[0],
                    X[sample_idx],
                    feature_names=feature_names)
 
# 使用示例
plot_shap_force(explainer, X_val, sample_idx=0, feature_names=feature_names)

5. 特征相关性分析

5.1 为什么特征相关性重要?

高相关的特征:

  • 提供冗余信息
  • 可能导致重要性“分散”
  • 增加模型复杂度但不增加价值

建议:

  • 同类特征保留最重要的 1-2 个
  • 相关性 > 0.9 的特征考虑合并或剔除

5.2 相关性计算

def analyze_feature_correlation(X, features, threshold=0.7):
    """
    分析特征相关性
 
    参数:
        X: 特征 DataFrame
        features: 特征列表
        threshold: 高相关阈值
 
    返回:
        high_corr_pairs: 高相关特征对
    """
    # 计算相关矩阵
    corr_matrix = X[features].corr()
 
    # 找高相关对
    high_corr_pairs = []
    for i, feat1 in enumerate(features):
        for feat2 in features[i+1:]:
            corr = corr_matrix.loc[feat1, feat2]
            if abs(corr) > threshold:
                high_corr_pairs.append((feat1, feat2, corr))
 
    # 排序
    high_corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
 
    # 打印
    print(f"\n📊 高相关特征对 (|corr| > {threshold}):")
    print("-" * 50)
    for feat1, feat2, corr in high_corr_pairs:
        print(f"  {feat1:12s}{feat2:12s} : {corr:.3f}")
 
    return high_corr_pairs
 
# 使用示例
high_corr_pairs = analyze_feature_correlation(X_train, list(FEATURES.keys()), threshold=0.7)

示例输出:

高相关特征对 (|corr| > 0.7):
--------------------------------------------------
RET_5D       ↔ BIAS_10     :  0.931
RET_10D      ↔ BIAS_20     :  0.924
BIAS_10      ↔ BIAS_20     :  0.873
BIAS_5       ↔ BIAS_10     :  0.860
RET_10D      ↔ BIAS_10     :  0.834
RET_20D      ↔ BIAS_20     :  0.829
RET_5D       ↔ BIAS_5      :  0.818
RET_1D       ↔ BODY        :  0.814
RET_5D       ↔ BIAS_20     :  0.769
VOL_10       ↔ VOL_20      :  0.745

5.3 相关性热力图可视化

def plot_correlation_heatmap(X, features, top_n=30):
    """
    绘制特征相关性热力图
 
    参数:
        X: 特征 DataFrame
        features: 特征列表
        top_n: 显示前n个特征
    """
    # 计算相关性矩阵
    corr_matrix = X[features[:top_n]].corr()
 
    # 绘制热力图
    import seaborn as sns
    plt.figure(figsize=(12, 10))
 
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    sns.heatmap(corr_matrix,
                mask=mask,
                cmap='coolwarm',
                center=0,
                square=True,
                linewidths=1,
                cbar_kws={"shrink": 0.8},
                annot=False,
                fmt='.2f',
                xticklabels=1,
                yticklabels=1)
 
    plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
    plt.xlabel('Features', fontsize=12)
    plt.ylabel('Features', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
 
# 使用示例
plot_correlation_heatmap(X_train, list(FEATURES.keys()), top_n=30)

5.4 基于相关性的特征筛选

def filter_features_by_correlation(X, features, threshold=0.9, importance=None):
    """
    基于相关性筛选特征
 
    策略:
    1. 找到高相关特征对
    2. 如果两个特征相关性 > threshold
    3. 保留重要性更高的特征
 
    参数:
        X: 特征 DataFrame
        features: 特征列表
        threshold: 相关性阈值
        importance: 特征重要性字典 {feature: importance}
 
    返回:
        selected_features: 筛选后的特征列表
    """
    # 计算相关矩阵
    corr_matrix = X[features].corr()
 
    # 如果没有提供重要性,使用默认值
    if importance is None:
        importance = {feat: 1.0 for feat in features}
 
    # 标记要删除的特征
    to_remove = set()
 
    for i, feat1 in enumerate(features):
        for feat2 in features[i+1:]:
            # 已标记删除的特征不再参与比较,避免连锁误删
            if feat1 in to_remove or feat2 in to_remove:
                continue

            corr = abs(corr_matrix.loc[feat1, feat2])

            if corr > threshold:
                # 保留重要性更高的特征
                if importance[feat1] >= importance[feat2]:
                    to_remove.add(feat2)
                    print(f"移除 {feat2} (与 {feat1} 相关性={corr:.3f})")
                else:
                    to_remove.add(feat1)
                    print(f"移除 {feat1} (与 {feat2} 相关性={corr:.3f})")
 
    # 筛选特征
    selected_features = [f for f in features if f not in to_remove]
 
    print(f"\n原始特征数: {len(features)}")
    print(f"筛选后特征数: {len(selected_features)}")
    print(f"删除特征数: {len(to_remove)}")
 
    return selected_features
 
# 使用示例
importance_dict = dict(zip(feature_names, gain_importance))
selected_features = filter_features_by_correlation(
    X_train, 
    list(FEATURES.keys()),
    threshold=0.9,
    importance=importance_dict
)

6. 重训练验证

6.1 验证特征选择的效果

def validate_feature_selection(X_train, y_train, X_valid, y_valid, X_test, y_test,
                               full_features, selected_features, params):
    """
    验证特征选择的效果
 
    参数:
        X_train, y_train: 训练数据
        X_valid, y_valid: 验证数据
        X_test, y_test: 测试数据
        full_features: 所有特征
        selected_features: 选中的特征
        params: 模型参数
 
    返回:
        comparison: 对比结果
    """
    import lightgbm as lgb
    from scipy.stats import pearsonr
 
    # 训练全特征模型
    print("\n训练全特征模型...")
    train_data_full = lgb.Dataset(X_train[full_features], label=y_train)
    val_data_full = lgb.Dataset(X_valid[full_features], label=y_valid, reference=train_data_full)
 
    model_full = lgb.train(
        params,
        train_data_full,
        num_boost_round=1000,
        valid_sets=[train_data_full, val_data_full],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50, verbose=False),
            lgb.log_evaluation(period=100)
        ]
    )
 
    # 训练精选特征模型
    print("\n训练精选特征模型...")
    train_data_selected = lgb.Dataset(X_train[selected_features], label=y_train)
    val_data_selected = lgb.Dataset(X_valid[selected_features], label=y_valid, reference=train_data_selected)
 
    model_selected = lgb.train(
        params,
        train_data_selected,
        num_boost_round=1000,
        valid_sets=[train_data_selected, val_data_selected],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50, verbose=False),
            lgb.log_evaluation(period=100)
        ]
    )
 
    # 预测
    pred_full = model_full.predict(X_test[full_features])
    pred_selected = model_selected.predict(X_test[selected_features])
 
    # 计算IC
    ic_full = pearsonr(pred_full, y_test)[0]
    ic_selected = pearsonr(pred_selected, y_test)[0]
 
    # 打印对比
    print("\n🔬 特征选择验证:")
    print("-" * 50)
    print(f"{'模型':<15s} {'特征数':<10s} {'测试集 IC':<10s}")
    print("-" * 50)
    print(f"{'全部特征':<15s} {len(full_features):<10d} {ic_full:<10.4f}")
    print(f"{'精选特征':<15s} {len(selected_features):<10d} {ic_selected:<10.4f}")
    print("-" * 50)
 
    # 评估
    ic_ratio = ic_selected / ic_full if ic_full != 0 else 0
    print(f"\nIC保持率: {ic_ratio:.2%}")
 
    if ic_selected >= ic_full * 0.95:
        print("✅ 精选特征表现接近全特征,可以简化模型!")
    elif ic_selected >= ic_full * 0.90:
        print("⚠️ 精选特征性能略有下降,但简化模型可能值得")
    else:
        print("❌ 精选特征性能下降较多,需要调整")
 
    return {
        'full_features_ic': ic_full,
        'selected_features_ic': ic_selected,
        'n_full': len(full_features),
        'n_selected': len(selected_features),
        'ic_ratio': ic_ratio,
        'model_full': model_full,
        'model_selected': model_selected
    }
 
# 使用示例
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'verbose': -1,
}
 
comparison = validate_feature_selection(
    X_train, y_train, X_valid, y_valid, X_test, y_test,
    full_features=list(FEATURES.keys()),
    selected_features=selected_features,
    params=params
)

6.2 特征选择效果可视化

def plot_feature_selection_comparison(comparison):
    """
    可视化特征选择对比结果
 
    参数:
        comparison: validate_feature_selection 的返回结果
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
 
    # 特征数量对比
    models = ['全特征', '精选特征']
    n_features = [comparison['n_full'], comparison['n_selected']]
    ics = [comparison['full_features_ic'], comparison['selected_features_ic']]
 
    # 特征数 vs IC
    axes[0].scatter(n_features, ics, s=200, alpha=0.6)
    axes[0].plot(n_features, ics, 'r-', linewidth=2, alpha=0.5)
 
    # 添加标注
    for i, (n, ic, model) in enumerate(zip(n_features, ics, models)):
        axes[0].annotate(f'{model}\n{n}特征\nIC={ic:.4f}',
                        (n, ic),
                        textcoords="offset points",
                        xytext=(0, 10),
                        ha='center')
 
    axes[0].set_xlabel('特征数量')
    axes[0].set_ylabel('测试集 IC')
    axes[0].set_title('特征数量 vs 性能')
    axes[0].grid(True, alpha=0.3)
 
    # IC 柱状图
    bars = axes[1].bar(models, ics, color=['steelblue', 'coral'], alpha=0.7, edgecolor='black')
    axes[1].set_ylabel('测试集 IC')
    axes[1].set_title('模型性能对比')
    axes[1].grid(True, axis='y', alpha=0.3)
 
    # 添加数值标签
    for bar, ic in zip(bars, ics):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height,
                     f'{ic:.4f}',
                     ha='center', va='bottom', fontweight='bold')
 
    # 添加IC保持率
    axes[1].axhline(y=comparison['full_features_ic'] * 0.95,
                   color='green', linestyle='--', alpha=0.5,
                   label='95% 阈值')
    axes[1].legend()
 
    plt.tight_layout()
    plt.show()
 
# 使用示例
plot_feature_selection_comparison(comparison)

7. 时序特征重要性

7.1 滚动窗口特征重要性

核心思想

在不同时间窗口内计算特征重要性,分析重要性的稳定性。

代码实现

def rolling_feature_importance(X, y, model_class, params,
                               window_size=252, step_size=21):
    """
    滚动窗口特征重要性
 
    参数:
        X: 特征矩阵,shape=[n_samples, n_features]
        y: 目标变量
        model_class: 模型类
        params: 模型参数
        window_size: 窗口大小
        step_size: 步长
 
    返回:
        importance_history: 特征重要性历史
    """
    n_samples = len(X)
    importance_history = []
 
    start_idx = window_size
    while start_idx + step_size <= n_samples:
        print(f"Processing window: {start_idx - window_size} - {start_idx}")
 
        # 划分数据
        X_window = X[start_idx - window_size:start_idx]
        y_window = y[start_idx - window_size:start_idx]
 
        # 训练模型
        model = model_class(**params)
        model.fit(X_window, y_window)
 
        # 计算特征重要性(sklearn接口的模型需通过booster_获取gain重要性)
        importance = model.booster_.feature_importance(importance_type='gain')
        importance_history.append(importance)
 
        # 滚动窗口
        start_idx += step_size
 
    return np.array(importance_history)
 
# 使用示例
from lightgbm import LGBMRegressor

importance_history = rolling_feature_importance(
    X, y, LGBMRegressor, params,
    window_size=252,  # 1年
    step_size=21      # 1个月
)
 
print(f"重要性历史: {importance_history.shape}")

7.2 特征重要性稳定性分析

def analyze_importance_stability(importance_history, feature_names, top_n=10):
    """
    分析特征重要性的稳定性
 
    参数:
        importance_history: 特征重要性历史,shape=[n_windows, n_features]
        feature_names: 特征名称
        top_n: 分析前n个特征
    """
    # 计算统计量
    mean_importance = importance_history.mean(axis=0)
    std_importance = importance_history.std(axis=0)
    cv_importance = std_importance / (mean_importance + 1e-6)  # 变异系数
 
    # 排序
    sorted_indices = np.argsort(mean_importance)[::-1][:top_n]
 
    # 打印
    print(f"{'特征':<20} {'均值':>10} {'标准差':>10} {'变异系数':>10}")
    print("-" * 60)
    for i, idx in enumerate(sorted_indices):
        print(f"{feature_names[idx]:<20} "
              f"{mean_importance[idx]:>10.4f} "
              f"{std_importance[idx]:>10.4f} "
              f"{cv_importance[idx]:>10.4f}")
 
    # 可视化
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
 
    # 均值vs标准差
    axes[0].scatter(mean_importance, std_importance, alpha=0.6)
    for idx in sorted_indices:
        axes[0].annotate(feature_names[idx],
                        (mean_importance[idx], std_importance[idx]))
    axes[0].set_xlabel('Mean Importance')
    axes[0].set_ylabel('Std Importance')
    axes[0].set_title('Importance Stability')
    axes[0].grid(True)
 
    # 时间序列
    for idx in sorted_indices:
        axes[1].plot(importance_history[:, idx], label=feature_names[idx])
    axes[1].set_xlabel('Time Window')
    axes[1].set_ylabel('Importance')
    axes[1].set_title(f'Top {top_n} Features Importance Over Time')
    axes[1].legend()
    axes[1].grid(True)
 
    plt.tight_layout()
    plt.show()
 
# 使用示例
analyze_importance_stability(importance_history, feature_names, top_n=10)

8. 特征选择策略

8.1 基于重要性的特征选择

def select_features_by_importance(model, X, feature_names,
                                  importance_type='gain', threshold=0.01):
    """
    基于特征重要性选择特征
 
    参数:
        model: LightGBM模型
        X: 特征矩阵
        feature_names: 特征名称
        importance_type: 'split' 或 'gain'
        threshold: 重要性阈值(比例)
 
    返回:
        X_selected: 选择后的特征矩阵
        selected_features: 选择的特征名称
        selected_indices: 选择的特征索引
    """
    # 获取特征重要性
    importance = model.feature_importance(importance_type=importance_type)
 
    # 归一化
    importance_normalized = importance / importance.sum()
 
    # 选择重要性超过阈值的特征
    selected_indices = np.where(importance_normalized >= threshold)[0]
 
    # 提取数据
    X_selected = X[:, selected_indices]
    selected_features = np.array(feature_names)[selected_indices]
 
    print(f"选择 {len(selected_indices)}/{len(feature_names)} 个特征")
    print(f"累计重要性: {importance_normalized[selected_indices].sum():.2%}")
 
    return X_selected, selected_features, selected_indices
 
# 使用示例
X_selected, selected_features, selected_indices = select_features_by_importance(
    model, X, feature_names, importance_type='gain', threshold=0.01
)

8.2 递归特征消除(RFE)

from sklearn.feature_selection import RFE
 
def recursive_feature_elimination(X, y, estimator, n_features_to_select=None, step=1):
    """
    递归特征消除
 
    参数:
        X: 特征矩阵
        y: 目标变量
        estimator: 评估器
        n_features_to_select: 目标特征数
        step: 每次消除的特征数
 
    返回:
        X_selected: 选择后的特征矩阵
        selected_indices: 选择的特征索引
        rfe: RFE对象
    """
    # 创建RFE
    rfe = RFE(estimator=estimator,
              n_features_to_select=n_features_to_select,
              step=step,
              importance_getter='auto')
 
    # 拟合
    rfe.fit(X, y)
 
    # 提取结果
    selected_indices = np.where(rfe.support_)[0]
    X_selected = X[:, selected_indices]
 
    print(f"选择 {len(selected_indices)}/{X.shape[1]} 个特征")
 
    return X_selected, selected_indices, rfe
 
# 使用示例
estimator = LGBMRegressor(**params)
X_selected, selected_indices, rfe = recursive_feature_elimination(
    X, y, estimator, n_features_to_select=50, step=5
)
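
如果不想手工指定 n_features_to_select,可以考虑用 sklearn 的 RFECV 由交叉验证自动确定特征数。以下仅为示意;时序数据建议搭配 TimeSeriesSplit,避免随机划分带来的未来信息泄露:

from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

# 用时序交叉验证自动确定保留的特征数
rfecv = RFECV(estimator=LGBMRegressor(**params),
              step=5,
              cv=TimeSeriesSplit(n_splits=5),
              scoring='neg_root_mean_squared_error')
rfecv.fit(X, y)

print(f"自动选择的特征数: {rfecv.n_features_}")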

8.3 稳定性特征选择

def stable_feature_selection(importance_history, feature_names,
                             stability_threshold=0.7, top_n=None):
    """
    稳定性特征选择
 
    参数:
        importance_history: 特征重要性历史
        feature_names: 特征名称
        stability_threshold: 稳定性阈值
        top_n: 选择前n个稳定特征
 
    返回:
        selected_features: 选择的特征名称
        stability_scores: 稳定性得分
    """
    # 计算每个特征的排名
    rankings = []
    for importance in importance_history:
        ranking = np.argsort(importance)[::-1]
        rankings.append(ranking)
 
    rankings = np.array(rankings)
 
    # 计算稳定性得分(基于排名的方差)
    stability_scores = []
    for feature_idx in range(len(feature_names)):
        feature_rankings = np.where(rankings == feature_idx)[1]
        stability_score = 1 / (1 + np.var(feature_rankings))
        stability_scores.append(stability_score)
 
    stability_scores = np.array(stability_scores)
 
    # 排序
    sorted_indices = np.argsort(stability_scores)[::-1]
 
    # 选择稳定特征
    if top_n is None:
        selected_indices = sorted_indices[stability_scores[sorted_indices] >= stability_threshold]
    else:
        selected_indices = sorted_indices[:top_n]
 
    selected_features = np.array(feature_names)[selected_indices]
 
    print(f"选择 {len(selected_indices)} 个稳定特征")
 
    return selected_features, stability_scores
 
# 使用示例
selected_features, stability_scores = stable_feature_selection(
    importance_history, feature_names,
    stability_threshold=0.7, top_n=20
)

9. 特征重要性分析的最佳实践

9.1 综合分析流程

class FeatureImportanceAnalyzer:
    """
    特征重要性分析器
 
    功能:
    1. 计算多种特征重要性
    2. 可视化分析结果
    3. 时序稳定性分析
    4. 特征选择建议
    """
 
    def __init__(self, model, X, y, feature_names):
        self.model = model
        self.X = X
        self.y = y
        self.feature_names = feature_names
 
        self.split_importance = None
        self.gain_importance = None
        self.permutation_importance = None
        self.shap_values = None
 
    def calculate_lightgbm_importance(self):
        """计算LightGBM内置重要性"""
        self.split_importance = self.model.feature_importance(importance_type='split')
        self.gain_importance = self.model.feature_importance(importance_type='gain')
 
    def calculate_permutation_importance(self, metric='ic', n_repeats=5):
        """计算Permutation Importance"""
        self.permutation_importance = permutation_importance(
            self.model, self.X, self.y,
            metric=metric, n_repeats=n_repeats
        )
 
    def calculate_shap_values(self):
        """计算SHAP值"""
        explainer = shap.TreeExplainer(self.model)
        self.shap_values = explainer.shap_values(self.X)
 
    def analyze_all(self):
        """执行所有分析"""
        print("计算LightGBM重要性...")
        self.calculate_lightgbm_importance()
 
        print("计算Permutation Importance...")
        self.calculate_permutation_importance()
 
        print("计算SHAP值...")
        self.calculate_shap_values()
 
    def plot_summary(self, top_n=20):
        """绘制汇总图"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))

        def _plot_barh(ax, importance, title):
            # 按重要性降序取前top_n,水平条形图中最重要的特征画在最上方
            indices = np.argsort(importance)[::-1][:top_n]
            ax.barh(range(len(indices)), importance[indices][::-1])
            ax.set_yticks(range(len(indices)))
            ax.set_yticklabels(np.array(self.feature_names)[indices][::-1])
            ax.set_title(title)

        # Split重要性
        _plot_barh(axes[0, 0], self.split_importance, 'Split Importance')

        # Gain重要性
        _plot_barh(axes[0, 1], self.gain_importance, 'Gain Importance')

        # Permutation Importance
        if self.permutation_importance is not None:
            _plot_barh(axes[1, 0], self.permutation_importance.mean(axis=1),
                       'Permutation Importance')

        # SHAP重要性
        if self.shap_values is not None:
            _plot_barh(axes[1, 1], np.abs(self.shap_values).mean(axis=0),
                       'SHAP Importance')
 
        plt.tight_layout()
        plt.show()
 
# 使用示例
analyzer = FeatureImportanceAnalyzer(model, X_val, y_val, feature_names)
analyzer.analyze_all()
analyzer.plot_summary(top_n=15)

9.2 检查清单

特征重要性分析检查清单

  • 计算至少两种不同的特征重要性,并比较其排名是否一致(见本清单后的示意代码)
  • 可视化特征重要性分布
  • 检查特征重要性的稳定性
  • 分析特征间的相关性
  • 验证特征选择的合理性
  • 记录分析过程和结论
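
关于“计算至少两种不同的特征重要性”,一个简单的交叉检查方法是计算不同度量之间的排名相关性:若相关性很低,说明结论对度量方式敏感,需要进一步排查。以下为示意代码,假设 gain_importance 与 Permutation 的 mean_importance 已按前文计算好:

from scipy.stats import spearmanr

# 比较Gain重要性与Permutation重要性的排名一致性
rank_corr = spearmanr(gain_importance, mean_importance)[0]
print(f"Gain vs Permutation 排名相关性: {rank_corr:.3f}")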

时序特征重要性分析检查清单

  • 使用滚动窗口分析重要性变化
  • 识别稳定和不稳定特征
  • 分析不同市场状态下的重要性(见本节末尾的示意代码)
  • 检查重要性与市场周期的关系
  • 制定特征更新策略
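
关于“分析不同市场状态下的重要性”,一个简单的示意做法是按波动率把验证样本分成高/低波动两组,分别计算 Permutation Importance 再对比。以下代码仅为思路示意,其中 market_vol 为假设的、与 X_val 对齐的市场波动率序列:

# 按波动率中位数划分高/低波动两种市场状态
high_vol_mask = market_vol > np.median(market_vol)

imp_high = permutation_importance(model, X_val[high_vol_mask], y_val[high_vol_mask],
                                  metric='ic', n_repeats=3)
imp_low = permutation_importance(model, X_val[~high_vol_mask], y_val[~high_vol_mask],
                                 metric='ic', n_repeats=3)

# 对比两种状态下的平均重要性(按高波动状态排序)
mean_high = imp_high.mean(axis=1)
mean_low = imp_low.mean(axis=1)
for idx in np.argsort(mean_high)[::-1][:10]:
    print(f"{feature_names[idx]:<15s} 高波动: {mean_high[idx]:.4f}  低波动: {mean_low[idx]:.4f}")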

10. 总结

特征重要性分析是量化模型开发中的关键环节:

  1. 基础重要性:LightGBM内置的Split和Gain重要性
  2. Permutation Importance:更真实的重要性评估方法
  3. SHAP分析:提供个体和全局解释
  4. 时序分析:分析重要性的稳定性
  5. 特征选择:基于重要性的特征筛选策略

正确的特征重要性分析能够帮助我们:

  • 理解模型决策逻辑
  • 识别有效因子
  • 提升模型性能
  • 控制模型风险