特征重要性分析

1. 特征重要性基础

1.1 特征重要性的意义

为什么重要?

在量化投资中,特征重要性分析有三大价值:

  1. 模型解释性:理解模型如何做决策
  2. 特征筛选:识别有效因子,剔除冗余因子
  3. 风险控制:避免模型依赖无效特征

量化场景的特殊性

量化数据的特点:

  • 高维特征:数百到数千个因子
  • 低信噪比:大量噪声特征
  • 强相关性:因子间存在多重共线性
  • 非平稳性:因子重要性随时间变化

1.2 LightGBM的特征重要性类型

LightGBM提供两种特征重要性:

  1. Split重要性(分裂重要性)

    • 基于特征作为分裂节点的次数
    • 衡量特征在分裂决策中的使用频率
  2. Gain重要性(增益重要性)

    • 基于分裂带来的信息增益
    • 衡量特征对模型性能的贡献度

代码示例

import lightgbm as lgb
import numpy as np
import matplotlib.pyplot as plt
 
# 训练模型
model = lgb.train(params, train_data, num_boost_round=1000,
                  valid_sets=[train_data, val_data],
                  callbacks=[lgb.early_stopping(stopping_rounds=50)])
 
# 获取特征重要性
split_importance = model.feature_importance(importance_type='split')
gain_importance = model.feature_importance(importance_type='gain')
 
# 打印
print("Split重要性(前10):")
for i, (idx, imp) in enumerate(sorted(enumerate(split_importance),
                                      key=lambda x: x[1], reverse=True)[:10]):
    print(f"  {idx+1}. 特征{idx}: {imp}")
 
print("\nGain重要性(前10):")
for i, (idx, imp) in enumerate(sorted(enumerate(gain_importance),
                                      key=lambda x: x[1], reverse=True)[:10]):
    print(f"  {idx+1}. 特征{idx}: {imp:.2f}")

2. 特征重要性可视化

2.1 条形图可视化

def plot_feature_importance(model, feature_names, importance_type='gain', top_n=20):
    """
    绘制特征重要性条形图
 
    参数:
        model: LightGBM模型
        feature_names: 特征名称列表
        importance_type: 'split' 或 'gain'
        top_n: 显示前n个特征
    """
    # 获取重要性
    importance = model.feature_importance(importance_type=importance_type)
 
    # 排序
    indices = np.argsort(importance)[::-1][:top_n]
    importance = importance[indices]
    feature_names = np.array(feature_names)[indices]
 
    # 绘制
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(importance)), importance[::-1])
    plt.yticks(range(len(importance)), feature_names[::-1])
    plt.xlabel(f'Feature Importance ({importance_type})')
    plt.title(f'Top {top_n} Feature Importance')
    plt.tight_layout()
    plt.show()
 
# 使用示例
feature_names = [f'factor_{i}' for i in range(X.shape[1])]
plot_feature_importance(model, feature_names, importance_type='gain', top_n=20)
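
如果不需要自定义样式,也可以直接使用 LightGBM 自带的绘图函数,效果与上面的自定义函数类似(参数名以所用版本的文档为准):

# LightGBM内置的重要性绘图,max_num_features限制显示的特征数量
lgb.plot_importance(model, importance_type='gain', max_num_features=20, figsize=(12, 8))
plt.show()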

2.2 对数重要性图

def plot_feature_importance_log(model, feature_names, importance_type='gain'):
    """
    绘制对数特征重要性图
 
    适用于特征重要性差异巨大的情况
    """
    importance = model.feature_importance(importance_type=importance_type)
    importance = np.maximum(importance, 1e-6)  # 避免log(0)
 
    indices = np.argsort(importance)[::-1]
    importance_log = np.log(importance[indices])
 
    plt.figure(figsize=(12, 8))
    plt.plot(range(len(importance_log)), importance_log, marker='o')
    plt.xlabel('Feature Rank')
    plt.ylabel('Log Feature Importance')
    plt.title(f'Feature Importance Distribution (Log Scale) - {importance_type}')
    plt.grid(True)
    plt.show()
 
plot_feature_importance_log(model, feature_names, importance_type='gain')

3. 基于Permutation的特征重要性

3.1 Permutation Importance原理

核心思想

Permutation Importance通过打乱特征的值来评估其重要性:

  1. 计算基准模型性能
  2. 打乱某个特征的值
  3. 重新计算模型性能
  4. 性能下降越大,特征越重要

优势

  • 不依赖于模型类型
  • 考虑特征间的交互作用
  • 以预测性能的实际变化衡量重要性,比基于分裂统计的重要性更贴近特征的真实贡献

代码实现

from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import pearsonr, spearmanr
 
def permutation_importance(model, X, y, metric='ic', n_repeats=5, random_state=42):
    """
    计算Permutation Importance
 
    参数:
        model: 训练好的模型
        X: 特征矩阵
        y: 目标变量
        metric: 评估指标 ('rmse', 'r2', 'ic', 'rank_ic')
        n_repeats: 重复次数
        random_state: 随机种子
 
    返回:
        importances: 特征重要性数组,shape=[n_features, n_repeats]
    """
    np.random.seed(random_state)
    n_features = X.shape[1]
    importances = np.zeros((n_features, n_repeats))
 
    # 计算基准性能
    y_pred = model.predict(X)
    if metric == 'rmse':
        baseline_score = np.sqrt(mean_squared_error(y, y_pred))
    elif metric == 'r2':
        baseline_score = r2_score(y, y_pred)
    elif metric == 'ic':
        baseline_score = pearsonr(y_pred, y)[0]
    elif metric == 'rank_ic':
        baseline_score = spearmanr(y_pred, y)[0]
    else:
        raise ValueError(f"Unknown metric: {metric}")
 
    print(f"Baseline {metric}: {baseline_score:.4f}")
 
    # 对每个特征进行Permutation
    for feature_idx in range(n_features):
        print(f"Processing feature {feature_idx + 1}/{n_features}")
 
        for repeat in range(n_repeats):
            # 打乱特征
            X_permuted = X.copy()
            np.random.shuffle(X_permuted[:, feature_idx])
 
            # 重新预测
            y_pred_permuted = model.predict(X_permuted)
 
            # 计算性能
            if metric == 'rmse':
                score = np.sqrt(mean_squared_error(y, y_pred_permuted))
                importance = score - baseline_score  # RMSE增加,重要
            elif metric == 'r2':
                score = r2_score(y, y_pred_permuted)
                importance = baseline_score - score  # R2减少,重要
            elif metric == 'ic':
                score = pearsonr(y_pred_permuted, y)[0]
                importance = baseline_score - score  # IC减少,重要
            elif metric == 'rank_ic':
                score = spearmanr(y_pred_permuted, y)[0]
                importance = baseline_score - score  # Rank IC减少,重要
 
            importances[feature_idx, repeat] = importance
 
    return importances
 
# 使用示例
importances = permutation_importance(model, X_val, y_val, metric='ic', n_repeats=5)
 
# 计算平均重要性
mean_importance = importances.mean(axis=1)
std_importance = importances.std(axis=1)
 
# 排序
sorted_indices = np.argsort(mean_importance)[::-1]
 
print("\nPermutation Importance (IC):")
for i, idx in enumerate(sorted_indices[:10]):
    print(f"  {i+1}. 特征{idx}: {mean_importance[idx]:.4f} ± {std_importance[idx]:.4f}")

3.2 可视化Permutation Importance

def plot_permutation_importance(importances, feature_names, top_n=20):
    """
    绘制Permutation Importance
 
    参数:
        importances: 特征重要性数组,shape=[n_features, n_repeats]
        feature_names: 特征名称
        top_n: 显示前n个特征
    """
    # 计算平均重要性
    mean_imp = importances.mean(axis=1)
    std_imp = importances.std(axis=1)
 
    # 排序
    indices = np.argsort(mean_imp)[::-1][:top_n]
    mean_imp = mean_imp[indices]
    std_imp = std_imp[indices]
    feature_names = np.array(feature_names)[indices]
 
    # 绘制
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(mean_imp)), mean_imp[::-1], xerr=std_imp[::-1],
             color='steelblue', alpha=0.7, capsize=3)
    plt.yticks(range(len(mean_imp)), feature_names[::-1])
    plt.xlabel('Permutation Importance')
    plt.title(f'Top {top_n} Permutation Feature Importance')
    plt.tight_layout()
    plt.show()
 
plot_permutation_importance(importances, feature_names, top_n=20)

4. SHAP值分析

4.1 SHAP原理

SHAP(SHapley Additive exPlanations)

SHAP值基于博弈论中的Shapley值,提供一致的局部解释。

核心思想

每个特征对预测的贡献满足加性分解:

  ŷ_i = Base Value + Σ_j φ_{i,j}

其中:

  • ŷ_i 是第i个样本的预测值
  • Base Value 是所有样本的均值预测(即 explainer.expected_value)
  • φ_{i,j} 是第j个特征对第i个样本预测的贡献(SHAP值)

代码实现

import shap
 
# 计算SHAP值
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
 
# SHAP Summary Plot
shap.summary_plot(shap_values, X, feature_names=feature_names, plot_type='bar')
 
# SHAP Summary Plot (详细)
shap.summary_plot(shap_values, X, feature_names=feature_names)
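
可以用几行代码验证上面的加性分解(回归模型的 expected_value 通常是一个标量;若为列表则取对应元素):

# 验证: Base Value + 各特征SHAP值之和 ≈ 模型预测值
pred = model.predict(X)
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print("最大重构误差:", np.abs(pred - reconstructed).max())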

4.2 特征重要性排序

def shap_feature_importance(shap_values, feature_names, top_n=20):
    """
    基于SHAP值的特征重要性
 
    参数:
        shap_values: SHAP值数组
        feature_names: 特征名称
        top_n: 显示前n个特征
 
    返回:
        重要性排序结果
    """
    # 计算每个特征的平均绝对SHAP值
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
 
    # 排序
    indices = np.argsort(mean_abs_shap)[::-1][:top_n]
    importance = mean_abs_shap[indices]
    names = np.array(feature_names)[indices]
 
    # 打印
    print("SHAP Feature Importance:")
    for i, (name, imp) in enumerate(zip(names, importance)):
        print(f"  {i+1}. {name}: {imp:.4f}")
 
    return indices, importance, names
 
# 使用示例
indices, importance, names = shap_feature_importance(shap_values, feature_names, top_n=20)

4.3 个体解释

def plot_shap_force(explainer, X, sample_idx, feature_names):
    """
    绘制单个样本的SHAP Force Plot
 
    参数:
        explainer: SHAP解释器
        X: 特征矩阵
        sample_idx: 样本索引
        feature_names: 特征名称
    """
    # 计算该样本的SHAP值
    shap_values = explainer.shap_values(X[[sample_idx]])

    # 回归模型的expected_value通常是标量,多输出/多分类时为列表,这里做兼容处理
    expected_value = explainer.expected_value
    if isinstance(expected_value, (list, np.ndarray)):
        expected_value = expected_value[0]

    # 绘制Force Plot
    shap.force_plot(expected_value,
                    shap_values[0],
                    X[sample_idx],
                    feature_names=feature_names)
 
# 使用示例
plot_shap_force(explainer, X_val, sample_idx=0, feature_names=feature_names)

5. 特征相关性分析

5.1 为什么特征相关性重要?

高相关的特征:

  • 提供冗余信息
  • 可能导致重要性“分散”
  • 增加模型复杂度但不增加价值

建议:

  • 同类特征保留最重要的 1-2 个
  • 相关性 > 0.9 的特征考虑合并或剔除

5.2 相关性计算

def analyze_feature_correlation(X, features, threshold=0.7):
    """
    分析特征相关性
 
    参数:
        X: 特征 DataFrame
        features: 特征列表
        threshold: 高相关阈值
 
    返回:
        high_corr_pairs: 高相关特征对
    """
    # 计算相关矩阵
    corr_matrix = X[features].corr()
 
    # 找高相关对
    high_corr_pairs = []
    for i, feat1 in enumerate(features):
        for feat2 in features[i+1:]:
            corr = corr_matrix.loc[feat1, feat2]
            if abs(corr) > threshold:
                high_corr_pairs.append((feat1, feat2, corr))
 
    # 排序
    high_corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
 
    # 打印
    print(f"\n📊 高相关特征对 (|corr| > {threshold}):")
    print("-" * 50)
    for feat1, feat2, corr in high_corr_pairs:
        print(f"  {feat1:12s}{feat2:12s} : {corr:.3f}")
 
    return high_corr_pairs
 
# 使用示例
high_corr_pairs = analyze_feature_correlation(X_train, list(FEATURES.keys()), threshold=0.7)

示例输出:

高相关特征对 (|corr| > 0.7):
--------------------------------------------------
RET_5D       ↔ BIAS_10     :  0.931
RET_10D      ↔ BIAS_20     :  0.924
BIAS_10      ↔ BIAS_20     :  0.873
BIAS_5       ↔ BIAS_10     :  0.860
RET_10D      ↔ BIAS_10     :  0.834
RET_20D      ↔ BIAS_20     :  0.829
RET_5D       ↔ BIAS_5      :  0.818
RET_1D       ↔ BODY        :  0.814
RET_5D       ↔ BIAS_20     :  0.769
VOL_10       ↔ VOL_20      :  0.745

5.3 相关性热力图可视化

def plot_correlation_heatmap(X, features, top_n=30):
    """
    绘制特征相关性热力图
 
    参数:
        X: 特征 DataFrame
        features: 特征列表
        top_n: 显示前n个特征
    """
    # 计算相关性矩阵
    corr_matrix = X[features[:top_n]].corr()
 
    # 绘制热力图
    import seaborn as sns
    plt.figure(figsize=(12, 10))
 
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    sns.heatmap(corr_matrix,
                mask=mask,
                cmap='coolwarm',
                center=0,
                square=True,
                linewidths=1,
                cbar_kws={"shrink": 0.8},
                annot=False,
                fmt='.2f',
                xticklabels=1,
                yticklabels=1)
 
    plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
    plt.xlabel('Features', fontsize=12)
    plt.ylabel('Features', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
 
# 使用示例
plot_correlation_heatmap(X_train, list(FEATURES.keys()), top_n=30)

5.4 基于相关性的特征筛选

def filter_features_by_correlation(X, features, threshold=0.9, importance=None):
    """
    基于相关性筛选特征
 
    策略:
    1. 找到高相关特征对
    2. 如果两个特征相关性 > threshold
    3. 保留重要性更高的特征
 
    参数:
        X: 特征 DataFrame
        features: 特征列表
        threshold: 相关性阈值
        importance: 特征重要性字典 {feature: importance}
 
    返回:
        selected_features: 筛选后的特征列表
    """
    # 计算相关矩阵
    corr_matrix = X[features].corr()
 
    # 如果没有提供重要性,使用默认值
    if importance is None:
        importance = {feat: 1.0 for feat in features}
 
    # 标记要删除的特征
    to_remove = set()
 
    for i, feat1 in enumerate(features):
        for feat2 in features[i+1:]:
            # 已标记删除的特征不再参与比较,避免连锁误删
            if feat1 in to_remove or feat2 in to_remove:
                continue

            corr = abs(corr_matrix.loc[feat1, feat2])

            if corr > threshold:
                # 保留重要性更高的特征
                if importance[feat1] >= importance[feat2]:
                    to_remove.add(feat2)
                    print(f"移除 {feat2} (与 {feat1} 相关性={corr:.3f})")
                else:
                    to_remove.add(feat1)
                    print(f"移除 {feat1} (与 {feat2} 相关性={corr:.3f})")
 
    # 筛选特征
    selected_features = [f for f in features if f not in to_remove]
 
    print(f"\n原始特征数: {len(features)}")
    print(f"筛选后特征数: {len(selected_features)}")
    print(f"删除特征数: {len(to_remove)}")
 
    return selected_features
 
# 使用示例
importance_dict = dict(zip(feature_names, gain_importance))
selected_features = filter_features_by_correlation(
    X_train, 
    list(FEATURES.keys()),
    threshold=0.9,
    importance=importance_dict
)

6. 重训练验证

6.1 验证特征选择的效果

def validate_feature_selection(X_train, y_train, X_valid, y_valid, X_test, y_test,
                               full_features, selected_features, params):
    """
    验证特征选择的效果
 
    参数:
        X_train, y_train: 训练数据
        X_valid, y_valid: 验证数据
        X_test, y_test: 测试数据
        full_features: 所有特征
        selected_features: 选中的特征
        params: 模型参数
 
    返回:
        comparison: 对比结果
    """
    import lightgbm as lgb
    from scipy.stats import pearsonr
 
    # 训练全特征模型
    print("\n训练全特征模型...")
    train_data_full = lgb.Dataset(X_train[full_features], label=y_train)
    val_data_full = lgb.Dataset(X_valid[full_features], label=y_valid, reference=train_data_full)
 
    model_full = lgb.train(
        params,
        train_data_full,
        num_boost_round=1000,
        valid_sets=[train_data_full, val_data_full],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50, verbose=False),
            lgb.log_evaluation(period=100)
        ]
    )
 
    # 训练精选特征模型
    print("\n训练精选特征模型...")
    train_data_selected = lgb.Dataset(X_train[selected_features], label=y_train)
    val_data_selected = lgb.Dataset(X_valid[selected_features], label=y_valid, reference=train_data_selected)
 
    model_selected = lgb.train(
        params,
        train_data_selected,
        num_boost_round=1000,
        valid_sets=[train_data_selected, val_data_selected],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50, verbose=False),
            lgb.log_evaluation(period=100)
        ]
    )
 
    # 预测
    pred_full = model_full.predict(X_test[full_features])
    pred_selected = model_selected.predict(X_test[selected_features])
 
    # 计算IC
    ic_full = pearsonr(pred_full, y_test)[0]
    ic_selected = pearsonr(pred_selected, y_test)[0]
 
    # 打印对比
    print("\n🔬 特征选择验证:")
    print("-" * 50)
    print(f"{'模型':<15s} {'特征数':<10s} {'测试集 IC':<10s}")
    print("-" * 50)
    print(f"{'全部特征':<15s} {len(full_features):<10d} {ic_full:<10.4f}")
    print(f"{'精选特征':<15s} {len(selected_features):<10d} {ic_selected:<10.4f}")
    print("-" * 50)
 
    # 评估
    ic_ratio = ic_selected / ic_full if ic_full != 0 else 0
    print(f"\nIC保持率: {ic_ratio:.2%}")
 
    if ic_selected >= ic_full * 0.95:
        print("✅ 精选特征表现接近全特征,可以简化模型!")
    elif ic_selected >= ic_full * 0.90:
        print("⚠️ 精选特征性能略有下降,但简化模型可能值得")
    else:
        print("❌ 精选特征性能下降较多,需要调整")
 
    return {
        'full_features_ic': ic_full,
        'selected_features_ic': ic_selected,
        'n_full': len(full_features),
        'n_selected': len(selected_features),
        'ic_ratio': ic_ratio,
        'model_full': model_full,
        'model_selected': model_selected
    }
 
# 使用示例
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'verbose': -1,
}
 
comparison = validate_feature_selection(
    X_train, y_train, X_valid, y_valid, X_test, y_test,
    full_features=list(FEATURES.keys()),
    selected_features=selected_features,
    params=params
)

6.2 特征选择效果可视化

def plot_feature_selection_comparison(comparison):
    """
    可视化特征选择对比结果
 
    参数:
        comparison: validate_feature_selection 的返回结果
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
 
    # 特征数量对比
    models = ['全特征', '精选特征']
    n_features = [comparison['n_full'], comparison['n_selected']]
    ics = [comparison['full_features_ic'], comparison['selected_features_ic']]
 
    # 特征数 vs IC
    axes[0].scatter(n_features, ics, s=200, alpha=0.6)
    axes[0].plot(n_features, ics, 'r-', linewidth=2, alpha=0.5)
 
    # 添加标注
    for i, (n, ic, model) in enumerate(zip(n_features, ics, models)):
        axes[0].annotate(f'{model}\n{n}特征\nIC={ic:.4f}',
                        (n, ic),
                        textcoords="offset points",
                        xytext=(0, 10),
                        ha='center')
 
    axes[0].set_xlabel('特征数量')
    axes[0].set_ylabel('测试集 IC')
    axes[0].set_title('特征数量 vs 性能')
    axes[0].grid(True, alpha=0.3)
 
    # IC 柱状图
    bars = axes[1].bar(models, ics, color=['steelblue', 'coral'], alpha=0.7, edgecolor='black')
    axes[1].set_ylabel('测试集 IC')
    axes[1].set_title('模型性能对比')
    axes[1].grid(True, axis='y', alpha=0.3)
 
    # 添加数值标签
    for bar, ic in zip(bars, ics):
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height,
                     f'{ic:.4f}',
                     ha='center', va='bottom', fontweight='bold')
 
    # 添加IC保持率
    axes[1].axhline(y=comparison['full_features_ic'] * 0.95,
                   color='green', linestyle='--', alpha=0.5,
                   label='95% 阈值')
    axes[1].legend()
 
    plt.tight_layout()
    plt.show()
 
# 使用示例
plot_feature_selection_comparison(comparison)

7. 时序特征重要性

7.1 滚动窗口特征重要性

核心思想

在不同时间窗口内计算特征重要性,分析重要性的稳定性。

代码实现

def rolling_feature_importance(X, y, model_class, params,
                               window_size=252, step_size=21):
    """
    滚动窗口特征重要性
 
    参数:
        X: 特征矩阵,shape=[n_samples, n_features]
        y: 目标变量
        model_class: 模型类
        params: 模型参数
        window_size: 窗口大小
        step_size: 步长
 
    返回:
        importance_history: 特征重要性历史
    """
    n_samples = len(X)
    importance_history = []
 
    start_idx = window_size
    while start_idx + step_size <= n_samples:
        print(f"Processing window: {start_idx - window_size} - {start_idx}")
 
        # 划分数据
        X_window = X[start_idx - window_size:start_idx]
        y_window = y[start_idx - window_size:start_idx]
 
        # 训练模型
        model = model_class(**params)
        model.fit(X_window, y_window)
 
        # 计算特征重要性(sklearn接口的模型需通过booster_获取gain重要性)
        importance = model.booster_.feature_importance(importance_type='gain')
        importance_history.append(importance)
 
        # 滚动窗口
        start_idx += step_size
 
    return np.array(importance_history)
 
# 使用示例
from lightgbm import LGBMRegressor

importance_history = rolling_feature_importance(
    X, y, LGBMRegressor, params,
    window_size=252,  # 1年
    step_size=21      # 1个月
)
 
print(f"重要性历史: {importance_history.shape}")

7.2 特征重要性稳定性分析

def analyze_importance_stability(importance_history, feature_names, top_n=10):
    """
    分析特征重要性的稳定性
 
    参数:
        importance_history: 特征重要性历史,shape=[n_windows, n_features]
        feature_names: 特征名称
        top_n: 分析前n个特征
    """
    # 计算统计量
    mean_importance = importance_history.mean(axis=0)
    std_importance = importance_history.std(axis=0)
    cv_importance = std_importance / (mean_importance + 1e-6)  # 变异系数
 
    # 排序
    sorted_indices = np.argsort(mean_importance)[::-1][:top_n]
 
    # 打印
    print(f"{'特征':<20} {'均值':>10} {'标准差':>10} {'变异系数':>10}")
    print("-" * 60)
    for i, idx in enumerate(sorted_indices):
        print(f"{feature_names[idx]:<20} "
              f"{mean_importance[idx]:>10.4f} "
              f"{std_importance[idx]:>10.4f} "
              f"{cv_importance[idx]:>10.4f}")
 
    # 可视化
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
 
    # 均值vs标准差
    axes[0].scatter(mean_importance, std_importance, alpha=0.6)
    for idx in sorted_indices:
        axes[0].annotate(feature_names[idx],
                        (mean_importance[idx], std_importance[idx]))
    axes[0].set_xlabel('Mean Importance')
    axes[0].set_ylabel('Std Importance')
    axes[0].set_title('Importance Stability')
    axes[0].grid(True)
 
    # 时间序列
    for idx in sorted_indices:
        axes[1].plot(importance_history[:, idx], label=feature_names[idx])
    axes[1].set_xlabel('Time Window')
    axes[1].set_ylabel('Importance')
    axes[1].set_title(f'Top {top_n} Features Importance Over Time')
    axes[1].legend()
    axes[1].grid(True)
 
    plt.tight_layout()
    plt.show()
 
# 使用示例
analyze_importance_stability(importance_history, feature_names, top_n=10)

8. 特征选择策略

8.1 基于重要性的特征选择

def select_features_by_importance(model, X, feature_names,
                                  importance_type='gain', threshold=0.01):
    """
    基于特征重要性选择特征
 
    参数:
        model: LightGBM模型
        X: 特征矩阵
        feature_names: 特征名称
        importance_type: 'split' 或 'gain'
        threshold: 重要性阈值(比例)
 
    返回:
        X_selected: 选择后的特征矩阵
        selected_features: 选择的特征名称
        selected_indices: 选择的特征索引
    """
    # 获取特征重要性
    importance = model.feature_importance(importance_type=importance_type)
 
    # 归一化
    importance_normalized = importance / importance.sum()
 
    # 选择重要性超过阈值的特征
    selected_indices = np.where(importance_normalized >= threshold)[0]
 
    # 提取数据
    X_selected = X[:, selected_indices]
    selected_features = np.array(feature_names)[selected_indices]
 
    print(f"选择 {len(selected_indices)}/{len(feature_names)} 个特征")
    print(f"累计重要性: {importance_normalized[selected_indices].sum():.2%}")
 
    return X_selected, selected_features, selected_indices
 
# 使用示例
X_selected, selected_features, selected_indices = select_features_by_importance(
    model, X, feature_names, importance_type='gain', threshold=0.01
)

8.2 递归特征消除(RFE)

from sklearn.feature_selection import RFE
 
def recursive_feature_elimination(X, y, estimator, n_features_to_select=None, step=1):
    """
    递归特征消除
 
    参数:
        X: 特征矩阵
        y: 目标变量
        estimator: 评估器
        n_features_to_select: 目标特征数
        step: 每次消除的特征数
 
    返回:
        X_selected: 选择后的特征矩阵
        selected_indices: 选择的特征索引
        rfe: RFE对象
    """
    # 创建RFE
    rfe = RFE(estimator=estimator,
              n_features_to_select=n_features_to_select,
              step=step,
              importance_getter='auto')
 
    # 拟合
    rfe.fit(X, y)
 
    # 提取结果
    selected_indices = np.where(rfe.support_)[0]
    X_selected = X[:, selected_indices]
 
    print(f"选择 {len(selected_indices)}/{X.shape[1]} 个特征")
 
    return X_selected, selected_indices, rfe
 
# 使用示例
estimator = LGBMRegressor(**params)
X_selected, selected_indices, rfe = recursive_feature_elimination(
    X, y, estimator, n_features_to_select=50, step=5
)
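
如果不想手工指定 n_features_to_select,可以考虑用 sklearn 的 RFECV 由交叉验证自动确定特征数。以下仅为示意;时序数据建议搭配 TimeSeriesSplit,避免随机划分带来的未来信息泄露:

from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

# 用时序交叉验证自动确定保留的特征数
rfecv = RFECV(estimator=LGBMRegressor(**params),
              step=5,
              cv=TimeSeriesSplit(n_splits=5),
              scoring='neg_root_mean_squared_error')
rfecv.fit(X, y)

print(f"自动选择的特征数: {rfecv.n_features_}")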

8.3 稳定性特征选择

def stable_feature_selection(importance_history, feature_names,
                             stability_threshold=0.7, top_n=None):
    """
    稳定性特征选择
 
    参数:
        importance_history: 特征重要性历史
        feature_names: 特征名称
        stability_threshold: 稳定性阈值
        top_n: 选择前n个稳定特征
 
    返回:
        selected_features: 选择的特征名称
        stability_scores: 稳定性得分
    """
    # 计算每个特征的排名
    rankings = []
    for importance in importance_history:
        ranking = np.argsort(importance)[::-1]
        rankings.append(ranking)
 
    rankings = np.array(rankings)
 
    # 计算稳定性得分(基于排名的方差)
    stability_scores = []
    for feature_idx in range(len(feature_names)):
        feature_rankings = np.where(rankings == feature_idx)[1]
        stability_score = 1 / (1 + np.var(feature_rankings))
        stability_scores.append(stability_score)
 
    stability_scores = np.array(stability_scores)
 
    # 排序
    sorted_indices = np.argsort(stability_scores)[::-1]
 
    # 选择稳定特征
    if top_n is None:
        selected_indices = sorted_indices[stability_scores[sorted_indices] >= stability_threshold]
    else:
        selected_indices = sorted_indices[:top_n]
 
    selected_features = np.array(feature_names)[selected_indices]
 
    print(f"选择 {len(selected_indices)} 个稳定特征")
 
    return selected_features, stability_scores
 
# 使用示例
selected_features, stability_scores = stable_feature_selection(
    importance_history, feature_names,
    stability_threshold=0.7, top_n=20
)

9. 特征重要性分析的最佳实践

9.1 综合分析流程

class FeatureImportanceAnalyzer:
    """
    特征重要性分析器
 
    功能:
    1. 计算多种特征重要性
    2. 可视化分析结果
    3. 时序稳定性分析
    4. 特征选择建议
    """
 
    def __init__(self, model, X, y, feature_names):
        self.model = model
        self.X = X
        self.y = y
        self.feature_names = feature_names
 
        self.split_importance = None
        self.gain_importance = None
        self.permutation_importance = None
        self.shap_values = None
 
    def calculate_lightgbm_importance(self):
        """计算LightGBM内置重要性"""
        self.split_importance = self.model.feature_importance(importance_type='split')
        self.gain_importance = self.model.feature_importance(importance_type='gain')
 
    def calculate_permutation_importance(self, metric='ic', n_repeats=5):
        """计算Permutation Importance"""
        self.permutation_importance = permutation_importance(
            self.model, self.X, self.y,
            metric=metric, n_repeats=n_repeats
        )
 
    def calculate_shap_values(self):
        """计算SHAP值"""
        explainer = shap.TreeExplainer(self.model)
        self.shap_values = explainer.shap_values(self.X)
 
    def analyze_all(self):
        """执行所有分析"""
        print("计算LightGBM重要性...")
        self.calculate_lightgbm_importance()
 
        print("计算Permutation Importance...")
        self.calculate_permutation_importance()
 
        print("计算SHAP值...")
        self.calculate_shap_values()
 
    def plot_summary(self, top_n=20):
        """绘制汇总图"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))

        def _plot_barh(ax, importance, title):
            # 按重要性降序取前top_n,水平条形图中最重要的特征画在最上方
            indices = np.argsort(importance)[::-1][:top_n]
            ax.barh(range(len(indices)), importance[indices][::-1])
            ax.set_yticks(range(len(indices)))
            ax.set_yticklabels(np.array(self.feature_names)[indices][::-1])
            ax.set_title(title)

        # Split重要性
        _plot_barh(axes[0, 0], self.split_importance, 'Split Importance')

        # Gain重要性
        _plot_barh(axes[0, 1], self.gain_importance, 'Gain Importance')

        # Permutation Importance
        if self.permutation_importance is not None:
            _plot_barh(axes[1, 0], self.permutation_importance.mean(axis=1),
                       'Permutation Importance')

        # SHAP重要性
        if self.shap_values is not None:
            _plot_barh(axes[1, 1], np.abs(self.shap_values).mean(axis=0),
                       'SHAP Importance')
 
        plt.tight_layout()
        plt.show()
 
# 使用示例
analyzer = FeatureImportanceAnalyzer(model, X_val, y_val, feature_names)
analyzer.analyze_all()
analyzer.plot_summary(top_n=15)

9.2 检查清单

特征重要性分析检查清单

  • 计算至少两种不同的特征重要性,并比较其排名是否一致(见本清单后的示意代码)
  • 可视化特征重要性分布
  • 检查特征重要性的稳定性
  • 分析特征间的相关性
  • 验证特征选择的合理性
  • 记录分析过程和结论
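
关于“计算至少两种不同的特征重要性”,一个简单的交叉检查方法是计算不同度量之间的排名相关性:若相关性很低,说明结论对度量方式敏感,需要进一步排查。以下为示意代码,假设 gain_importance 与 Permutation 的 mean_importance 已按前文计算好:

from scipy.stats import spearmanr

# 比较Gain重要性与Permutation重要性的排名一致性
rank_corr = spearmanr(gain_importance, mean_importance)[0]
print(f"Gain vs Permutation 排名相关性: {rank_corr:.3f}")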

时序特征重要性分析检查清单

  • 使用滚动窗口分析重要性变化
  • 识别稳定和不稳定特征
  • 分析不同市场状态下的重要性(见本节末尾的示意代码)
  • 检查重要性与市场周期的关系
  • 制定特征更新策略
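
关于“分析不同市场状态下的重要性”,一个简单的示意做法是按波动率把验证样本分成高/低波动两组,分别计算 Permutation Importance 再对比。以下代码仅为思路示意,其中 market_vol 为假设的、与 X_val 对齐的市场波动率序列:

# 按波动率中位数划分高/低波动两种市场状态
high_vol_mask = market_vol > np.median(market_vol)

imp_high = permutation_importance(model, X_val[high_vol_mask], y_val[high_vol_mask],
                                  metric='ic', n_repeats=3)
imp_low = permutation_importance(model, X_val[~high_vol_mask], y_val[~high_vol_mask],
                                 metric='ic', n_repeats=3)

# 对比两种状态下的平均重要性(按高波动状态排序)
mean_high = imp_high.mean(axis=1)
mean_low = imp_low.mean(axis=1)
for idx in np.argsort(mean_high)[::-1][:10]:
    print(f"{feature_names[idx]:<15s} 高波动: {mean_high[idx]:.4f}  低波动: {mean_low[idx]:.4f}")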

10. 总结

特征重要性分析是量化模型开发中的关键环节:

  1. 基础重要性:LightGBM内置的Split和Gain重要性
  2. Permutation Importance:更真实的重要性评估方法
  3. SHAP分析:提供个体和全局解释
  4. 时序分析:分析重要性的稳定性
  5. 特征选择:基于重要性的特征筛选策略

正确的特征重要性分析能够帮助我们:

  • 理解模型决策逻辑
  • 识别有效因子
  • 提升模型性能
  • 控制模型风险