特征重要性分析
1. 特征重要性基础
1.1 特征重要性的意义
为什么重要?
在量化投资中,特征重要性分析有三大价值:
- 模型解释性:理解模型如何做决策
- 特征筛选:识别有效因子,剔除冗余因子
- 风险控制:避免模型依赖无效特征
量化场景的特殊性
量化数据的特点:
- 高维特征:数百到数千个因子
- 低信噪比:大量噪声特征
- 强相关性:因子间存在多重共线性
- 非平稳性:因子重要性随时间变化
1.2 LightGBM的特征重要性类型
LightGBM提供两种特征重要性:
- Split重要性(分裂重要性)
  - 基于特征作为分裂节点的次数
  - 衡量特征在分裂决策中的使用频率
- Gain重要性(增益重要性)
  - 基于分裂带来的信息增益
  - 衡量特征对模型性能的贡献度
代码示例
import lightgbm as lgb
import numpy as np
import matplotlib.pyplot as plt
# 训练模型
model = lgb.train(params, train_data, num_boost_round=1000,
valid_sets=[train_data, val_data],
callbacks=[lgb.early_stopping(stopping_rounds=50)])
# 获取特征重要性
split_importance = model.feature_importance(importance_type='split')
gain_importance = model.feature_importance(importance_type='gain')
# 打印
print("Split重要性(前10):")
for i, (idx, imp) in enumerate(sorted(enumerate(split_importance),
key=lambda x: x[1], reverse=True)[:10]):
print(f" {idx+1}. 特征{idx}: {imp}")
print("\nGain重要性(前10):")
for i, (idx, imp) in enumerate(sorted(enumerate(gain_importance),
key=lambda x: x[1], reverse=True)[:10]):
print(f" {idx+1}. 特征{idx}: {imp:.2f}")2. 特征重要性可视化
2.1 条形图可视化
def plot_feature_importance(model, feature_names, importance_type='gain', top_n=20):
"""
绘制特征重要性条形图
参数:
model: LightGBM模型
feature_names: 特征名称列表
importance_type: 'split' 或 'gain'
top_n: 显示前n个特征
"""
# 获取重要性
importance = model.feature_importance(importance_type=importance_type)
# 排序
indices = np.argsort(importance)[::-1][:top_n]
importance = importance[indices]
feature_names = np.array(feature_names)[indices]
# 绘制
plt.figure(figsize=(12, 8))
plt.barh(range(len(importance)), importance[::-1])
plt.yticks(range(len(importance)), feature_names[::-1])
plt.xlabel(f'Feature Importance ({importance_type})')
plt.title(f'Top {top_n} Feature Importance')
plt.tight_layout()
plt.show()
# 使用示例
feature_names = [f'factor_{i}' for i in range(X.shape[1])]
plot_feature_importance(model, feature_names, importance_type='gain', top_n=20)
2.2 对数重要性图
def plot_feature_importance_log(model, feature_names, importance_type='gain'):
"""
绘制对数特征重要性图
适用于特征重要性差异巨大的情况
"""
importance = model.feature_importance(importance_type=importance_type)
importance = np.maximum(importance, 1e-6) # 避免log(0)
indices = np.argsort(importance)[::-1]
importance_log = np.log(importance[indices])
plt.figure(figsize=(12, 8))
plt.plot(range(len(importance_log)), importance_log, marker='o')
plt.xlabel('Feature Rank')
plt.ylabel('Log Feature Importance')
plt.title(f'Feature Importance Distribution (Log Scale) - {importance_type}')
plt.grid(True)
plt.show()
plot_feature_importance_log(model, feature_names, importance_type='gain')
3. 基于Permutation的特征重要性
3.1 Permutation Importance原理
核心思想
Permutation Importance通过打乱特征的值来评估其重要性:
- 计算基准模型性能
- 打乱某个特征的值
- 重新计算模型性能
- 性能下降越大,特征越重要
优势
- 不依赖于模型类型
- 考虑特征间的交互作用
- 更真实反映特征重要性
代码实现
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import pearsonr, spearmanr
def permutation_importance(model, X, y, metric='ic', n_repeats=5, random_state=42):
"""
计算Permutation Importance
参数:
model: 训练好的模型
X: 特征矩阵
y: 目标变量
metric: 评估指标 ('rmse', 'r2', 'ic', 'rank_ic')
n_repeats: 重复次数
random_state: 随机种子
返回:
importances: 特征重要性数组,shape=[n_features, n_repeats]
"""
np.random.seed(random_state)
n_features = X.shape[1]
importances = np.zeros((n_features, n_repeats))
# 计算基准性能
y_pred = model.predict(X)
if metric == 'rmse':
baseline_score = np.sqrt(mean_squared_error(y, y_pred))
elif metric == 'r2':
from sklearn.metrics import r2_score
baseline_score = r2_score(y, y_pred)
elif metric == 'ic':
baseline_score = pearsonr(y_pred, y)[0]
elif metric == 'rank_ic':
from scipy.stats import spearmanr
baseline_score = spearmanr(y_pred, y)[0]
else:
raise ValueError(f"Unknown metric: {metric}")
print(f"Baseline {metric}: {baseline_score:.4f}")
# 对每个特征进行Permutation
for feature_idx in range(n_features):
print(f"Processing feature {feature_idx + 1}/{n_features}")
for repeat in range(n_repeats):
# 打乱特征
X_permuted = X.copy()
np.random.shuffle(X_permuted[:, feature_idx])
# 重新预测
y_pred_permuted = model.predict(X_permuted)
# 计算性能
if metric == 'rmse':
score = np.sqrt(mean_squared_error(y, y_pred_permuted))
importance = score - baseline_score # RMSE增加,重要
elif metric == 'r2':
score = r2_score(y, y_pred_permuted)
importance = baseline_score - score # R2减少,重要
elif metric == 'ic':
score = pearsonr(y_pred_permuted, y)[0]
importance = baseline_score - score # IC减少,重要
elif metric == 'rank_ic':
score = spearmanr(y_pred_permuted, y)[0]
importance = baseline_score - score # Rank IC减少,重要
importances[feature_idx, repeat] = importance
return importances
# 使用示例
importances = permutation_importance(model, X_val, y_val, metric='ic', n_repeats=5)
# 计算平均重要性
mean_importance = importances.mean(axis=1)
std_importance = importances.std(axis=1)
# 排序
sorted_indices = np.argsort(mean_importance)[::-1]
print("\nPermutation Importance (IC):")
for i, idx in enumerate(sorted_indices[:10]):
print(f" {i+1}. 特征{idx}: {mean_importance[idx]:.4f} ± {std_importance[idx]:.4f}")3.2 可视化Permutation Importance
def plot_permutation_importance(importances, feature_names, top_n=20):
"""
绘制Permutation Importance
参数:
importances: 特征重要性数组,shape=[n_features, n_repeats]
feature_names: 特征名称
top_n: 显示前n个特征
"""
# 计算平均重要性
mean_imp = importances.mean(axis=1)
std_imp = importances.std(axis=1)
# 排序
indices = np.argsort(mean_imp)[::-1][:top_n]
mean_imp = mean_imp[indices]
std_imp = std_imp[indices]
feature_names = np.array(feature_names)[indices]
# 绘制
plt.figure(figsize=(12, 8))
plt.barh(range(len(mean_imp)), mean_imp[::-1], xerr=std_imp[::-1],
color='steelblue', alpha=0.7, capsize=3)
plt.yticks(range(len(mean_imp)), feature_names[::-1])
plt.xlabel('Permutation Importance')
plt.title(f'Top {top_n} Permutation Feature Importance')
plt.tight_layout()
plt.show()
plot_permutation_importance(importances, feature_names, top_n=20)
4. SHAP值分析
4.1 SHAP原理
SHAP(SHapley Additive exPlanations)
SHAP值基于博弈论中的Shapley值,提供一致的局部解释。
核心思想
每个特征对预测的贡献满足可加性:
$$\hat{y}_i = \text{Base Value} + \sum_{j} \phi_{i,j}$$
其中:
- $\hat{y}_i$ 是第i个样本的预测值
- Base Value 是所有样本的均值预测(即 explainer.expected_value)
- $\phi_{i,j}$ 是第j个特征对第i个样本预测的贡献(SHAP值)
代码实现
import shap
# 计算SHAP值
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
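# 可加性检验(可选,示意):Base Value 加上各特征SHAP值之和应约等于模型预测
# 回归任务下 explainer.expected_value 通常为标量;若为数组请取对应元素
# np.allclose(explainer.expected_value + shap_values.sum(axis=1), model.predict(X), atol=1e-4)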
# SHAP Summary Plot
shap.summary_plot(shap_values, X, feature_names=feature_names, plot_type='bar')
# SHAP Summary Plot (详细)
shap.summary_plot(shap_values, X, feature_names=feature_names)
4.2 特征重要性排序
def shap_feature_importance(shap_values, feature_names, top_n=20):
"""
基于SHAP值的特征重要性
参数:
shap_values: SHAP值数组
feature_names: 特征名称
top_n: 显示前n个特征
返回:
重要性排序结果
"""
# 计算每个特征的平均绝对SHAP值
mean_abs_shap = np.abs(shap_values).mean(axis=0)
# 排序
indices = np.argsort(mean_abs_shap)[::-1][:top_n]
importance = mean_abs_shap[indices]
names = np.array(feature_names)[indices]
# 打印
print("SHAP Feature Importance:")
for i, (name, imp) in enumerate(zip(names, importance)):
print(f" {i+1}. {name}: {imp:.4f}")
return indices, importance, names
# 使用示例
indices, importance, names = shap_feature_importance(shap_values, feature_names, top_n=20)
4.3 个体解释
def plot_shap_force(explainer, X, sample_idx, feature_names):
"""
绘制单个样本的SHAP Force Plot
参数:
explainer: SHAP解释器
X: 特征矩阵
sample_idx: 样本索引
feature_names: 特征名称
"""
# 计算SHAP值
shap_values = explainer.shap_values(X[[sample_idx]])
    # 绘制Force Plot(回归任务下 expected_value 通常为标量,若为数组则取首个元素)
    expected_value = explainer.expected_value
    if isinstance(expected_value, (list, np.ndarray)):
        expected_value = expected_value[0]
    shap.force_plot(expected_value,
                    shap_values[0],
                    X[sample_idx],
                    feature_names=feature_names)
# 使用示例
plot_shap_force(explainer, X_val, sample_idx=0, feature_names=feature_names)
5. 特征相关性分析
5.1 为什么特征相关性重要?
高相关的特征:
- 提供冗余信息
- 可能导致重要性"分散"
- 增加模型复杂度但不增加价值
建议:
- 同类特征保留最重要的 1-2 个
- 相关性 > 0.9 的特征考虑合并或剔除
5.2 相关性计算
def analyze_feature_correlation(X, features, threshold=0.7):
"""
分析特征相关性
参数:
X: 特征 DataFrame
features: 特征列表
threshold: 高相关阈值
返回:
high_corr_pairs: 高相关特征对
"""
# 计算相关矩阵
corr_matrix = X[features].corr()
# 找高相关对
high_corr_pairs = []
for i, feat1 in enumerate(features):
for feat2 in features[i+1:]:
corr = corr_matrix.loc[feat1, feat2]
if abs(corr) > threshold:
high_corr_pairs.append((feat1, feat2, corr))
# 排序
high_corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
# 打印
print(f"\n📊 高相关特征对 (|corr| > {threshold}):")
print("-" * 50)
for feat1, feat2, corr in high_corr_pairs:
print(f" {feat1:12s} ↔ {feat2:12s} : {corr:.3f}")
return high_corr_pairs
# 使用示例
high_corr_pairs = analyze_feature_correlation(X_train, list(FEATURES.keys()), threshold=0.7)
示例输出:
高相关特征对 (|corr| > 0.7):
--------------------------------------------------
RET_5D ↔ BIAS_10 : 0.931
RET_10D ↔ BIAS_20 : 0.924
BIAS_10 ↔ BIAS_20 : 0.873
BIAS_5 ↔ BIAS_10 : 0.860
RET_10D ↔ BIAS_10 : 0.834
RET_20D ↔ BIAS_20 : 0.829
RET_5D ↔ BIAS_5 : 0.818
RET_1D ↔ BODY : 0.814
RET_5D ↔ BIAS_20 : 0.769
VOL_10 ↔ VOL_20 : 0.745
5.3 相关性热力图可视化
def plot_correlation_heatmap(X, features, top_n=30):
"""
绘制特征相关性热力图
参数:
X: 特征 DataFrame
features: 特征列表
top_n: 显示前n个特征
"""
# 计算相关性矩阵
corr_matrix = X[features[:top_n]].corr()
# 绘制热力图
import seaborn as sns
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix,
mask=mask,
cmap='coolwarm',
center=0,
square=True,
linewidths=1,
cbar_kws={"shrink": 0.8},
annot=False,
fmt='.2f',
xticklabels=1,
yticklabels=1)
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Features', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
# 使用示例
plot_correlation_heatmap(X_train, list(FEATURES.keys()), top_n=30)
5.4 基于相关性的特征筛选
def filter_features_by_correlation(X, features, threshold=0.9, importance=None):
"""
基于相关性筛选特征
策略:
1. 找到高相关特征对
2. 如果两个特征相关性 > threshold
3. 保留重要性更高的特征
参数:
X: 特征 DataFrame
features: 特征列表
threshold: 相关性阈值
importance: 特征重要性字典 {feature: importance}
返回:
selected_features: 筛选后的特征列表
"""
# 计算相关矩阵
corr_matrix = X[features].corr()
# 如果没有提供重要性,使用默认值
if importance is None:
importance = {feat: 1.0 for feat in features}
# 标记要删除的特征
to_remove = set()
for i, feat1 in enumerate(features):
for feat2 in features[i+1:]:
corr = abs(corr_matrix.loc[feat1, feat2])
if corr > threshold:
# 保留重要性更高的特征
if importance[feat1] >= importance[feat2]:
to_remove.add(feat2)
print(f"移除 {feat2} (与 {feat1} 相关性={corr:.3f})")
else:
to_remove.add(feat1)
print(f"移除 {feat1} (与 {feat2} 相关性={corr:.3f})")
# 筛选特征
selected_features = [f for f in features if f not in to_remove]
print(f"\n原始特征数: {len(features)}")
print(f"筛选后特征数: {len(selected_features)}")
print(f"删除特征数: {len(to_remove)}")
return selected_features
# 使用示例
importance_dict = dict(zip(feature_names, gain_importance))
selected_features = filter_features_by_correlation(
X_train,
list(FEATURES.keys()),
threshold=0.9,
importance=importance_dict
)
6. 重训练验证
6.1 验证特征选择的效果
def validate_feature_selection(X_train, y_train, X_valid, y_valid, X_test, y_test,
full_features, selected_features, params):
"""
验证特征选择的效果
参数:
X_train, y_train: 训练数据
X_valid, y_valid: 验证数据
X_test, y_test: 测试数据
full_features: 所有特征
selected_features: 选中的特征
params: 模型参数
返回:
comparison: 对比结果
"""
import lightgbm as lgb
from scipy.stats import pearsonr
# 训练全特征模型
print("\n训练全特征模型...")
train_data_full = lgb.Dataset(X_train[full_features], label=y_train)
val_data_full = lgb.Dataset(X_valid[full_features], label=y_valid, reference=train_data_full)
model_full = lgb.train(
params,
train_data_full,
num_boost_round=1000,
valid_sets=[train_data_full, val_data_full],
callbacks=[
lgb.early_stopping(stopping_rounds=50, verbose=False),
lgb.log_evaluation(period=100)
]
)
# 训练精选特征模型
print("\n训练精选特征模型...")
train_data_selected = lgb.Dataset(X_train[selected_features], label=y_train)
val_data_selected = lgb.Dataset(X_valid[selected_features], label=y_valid, reference=train_data_selected)
model_selected = lgb.train(
params,
train_data_selected,
num_boost_round=1000,
valid_sets=[train_data_selected, val_data_selected],
callbacks=[
lgb.early_stopping(stopping_rounds=50, verbose=False),
lgb.log_evaluation(period=100)
]
)
# 预测
pred_full = model_full.predict(X_test[full_features])
pred_selected = model_selected.predict(X_test[selected_features])
# 计算IC
ic_full = pearsonr(pred_full, y_test)[0]
ic_selected = pearsonr(pred_selected, y_test)[0]
# 打印对比
print("\n🔬 特征选择验证:")
print("-" * 50)
print(f"{'模型':<15s} {'特征数':<10s} {'测试集 IC':<10s}")
print("-" * 50)
print(f"{'全部特征':<15s} {len(full_features):<10d} {ic_full:<10.4f}")
print(f"{'精选特征':<15s} {len(selected_features):<10d} {ic_selected:<10.4f}")
print("-" * 50)
# 评估
ic_ratio = ic_selected / ic_full if ic_full != 0 else 0
print(f"\nIC保持率: {ic_ratio:.2%}")
if ic_selected >= ic_full * 0.95:
print("✅ 精选特征表现接近全特征,可以简化模型!")
elif ic_selected >= ic_full * 0.90:
print("⚠️ 精选特征性能略有下降,但简化模型可能值得")
else:
print("❌ 精选特征性能下降较多,需要调整")
return {
'full_features_ic': ic_full,
'selected_features_ic': ic_selected,
'n_full': len(full_features),
'n_selected': len(selected_features),
'ic_ratio': ic_ratio,
'model_full': model_full,
'model_selected': model_selected
}
# 使用示例
params = {
'objective': 'regression',
'metric': 'rmse',
'num_leaves': 31,
'learning_rate': 0.05,
'verbose': -1,
}
comparison = validate_feature_selection(
X_train, y_train, X_valid, y_valid, X_test, y_test,
full_features=list(FEATURES.keys()),
selected_features=selected_features,
params=params
)
6.2 特征选择效果可视化
def plot_feature_selection_comparison(comparison):
"""
可视化特征选择对比结果
参数:
comparison: validate_feature_selection 的返回结果
"""
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# 特征数量对比
models = ['全特征', '精选特征']
n_features = [comparison['n_full'], comparison['n_selected']]
ics = [comparison['full_features_ic'], comparison['selected_features_ic']]
# 特征数 vs IC
axes[0].scatter(n_features, ics, s=200, alpha=0.6)
axes[0].plot(n_features, ics, 'r-', linewidth=2, alpha=0.5)
# 添加标注
for i, (n, ic, model) in enumerate(zip(n_features, ics, models)):
axes[0].annotate(f'{model}\n{n}特征\nIC={ic:.4f}',
(n, ic),
textcoords="offset points",
xytext=(0, 10),
ha='center')
axes[0].set_xlabel('特征数量')
axes[0].set_ylabel('测试集 IC')
axes[0].set_title('特征数量 vs 性能')
axes[0].grid(True, alpha=0.3)
# IC 柱状图
bars = axes[1].bar(models, ics, color=['steelblue', 'coral'], alpha=0.7, edgecolor='black')
axes[1].set_ylabel('测试集 IC')
axes[1].set_title('模型性能对比')
axes[1].grid(True, axis='y', alpha=0.3)
# 添加数值标签
for bar, ic in zip(bars, ics):
height = bar.get_height()
axes[1].text(bar.get_x() + bar.get_width()/2., height,
f'{ic:.4f}',
ha='center', va='bottom', fontweight='bold')
# 添加IC保持率
axes[1].axhline(y=comparison['full_features_ic'] * 0.95,
color='green', linestyle='--', alpha=0.5,
label='95% 阈值')
axes[1].legend()
plt.tight_layout()
plt.show()
# 使用示例
plot_feature_selection_comparison(comparison)
7. 时序特征重要性
7.1 滚动窗口特征重要性
核心思想
在不同时间窗口内计算特征重要性,分析重要性的稳定性。
代码实现
def rolling_feature_importance(X, y, model_class, params,
window_size=252, step_size=21):
"""
滚动窗口特征重要性
参数:
X: 特征矩阵,shape=[n_samples, n_features]
y: 目标变量
model_class: 模型类
params: 模型参数
window_size: 窗口大小
step_size: 步长
返回:
importance_history: 特征重要性历史
"""
n_samples = len(X)
importance_history = []
start_idx = window_size
while start_idx + step_size <= n_samples:
print(f"Processing window: {start_idx - window_size} - {start_idx}")
# 划分数据
X_window = X[start_idx - window_size:start_idx]
y_window = y[start_idx - window_size:start_idx]
# 训练模型
model = model_class(**params)
model.fit(X_window, y_window)
        # 计算特征重要性(sklearn 接口需通过 booster_ 获取 gain 重要性)
        importance = model.booster_.feature_importance(importance_type='gain')
importance_history.append(importance)
# 滚动窗口
start_idx += step_size
return np.array(importance_history)
# 使用示例
from lightgbm import LGBMRegressor

importance_history = rolling_feature_importance(
X, y, LGBMRegressor, params,
window_size=252, # 1年
step_size=21 # 1个月
)
print(f"重要性历史: {importance_history.shape}")5.2 特征重要性稳定性分析
def analyze_importance_stability(importance_history, feature_names, top_n=10):
"""
分析特征重要性的稳定性
参数:
importance_history: 特征重要性历史,shape=[n_windows, n_features]
feature_names: 特征名称
top_n: 分析前n个特征
"""
# 计算统计量
mean_importance = importance_history.mean(axis=0)
std_importance = importance_history.std(axis=0)
cv_importance = std_importance / (mean_importance + 1e-6) # 变异系数
# 排序
sorted_indices = np.argsort(mean_importance)[::-1][:top_n]
# 打印
print(f"{'特征':<20} {'均值':>10} {'标准差':>10} {'变异系数':>10}")
print("-" * 60)
for i, idx in enumerate(sorted_indices):
print(f"{feature_names[idx]:<20} "
f"{mean_importance[idx]:>10.4f} "
f"{std_importance[idx]:>10.4f} "
f"{cv_importance[idx]:>10.4f}")
# 可视化
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# 均值vs标准差
axes[0].scatter(mean_importance, std_importance, alpha=0.6)
for idx in sorted_indices:
axes[0].annotate(feature_names[idx],
(mean_importance[idx], std_importance[idx]))
axes[0].set_xlabel('Mean Importance')
axes[0].set_ylabel('Std Importance')
axes[0].set_title('Importance Stability')
axes[0].grid(True)
# 时间序列
for idx in sorted_indices:
axes[1].plot(importance_history[:, idx], label=feature_names[idx])
axes[1].set_xlabel('Time Window')
axes[1].set_ylabel('Importance')
axes[1].set_title(f'Top {top_n} Features Importance Over Time')
axes[1].legend()
axes[1].grid(True)
plt.tight_layout()
plt.show()
# 使用示例
analyze_importance_stability(importance_history, feature_names, top_n=10)
8. 特征选择策略
8.1 基于重要性的特征选择
def select_features_by_importance(model, X, feature_names,
importance_type='gain', threshold=0.01):
"""
基于特征重要性选择特征
参数:
model: LightGBM模型
X: 特征矩阵
feature_names: 特征名称
importance_type: 'split' 或 'gain'
threshold: 重要性阈值(比例)
返回:
X_selected: 选择后的特征矩阵
selected_features: 选择的特征名称
selected_indices: 选择的特征索引
"""
# 获取特征重要性
importance = model.feature_importance(importance_type=importance_type)
# 归一化
importance_normalized = importance / importance.sum()
# 选择重要性超过阈值的特征
selected_indices = np.where(importance_normalized >= threshold)[0]
# 提取数据
X_selected = X[:, selected_indices]
selected_features = np.array(feature_names)[selected_indices]
print(f"选择 {len(selected_indices)}/{len(feature_names)} 个特征")
print(f"累计重要性: {importance_normalized[selected_indices].sum():.2%}")
return X_selected, selected_features, selected_indices
# 使用示例
X_selected, selected_features, selected_indices = select_features_by_importance(
model, X, feature_names, importance_type='gain', threshold=0.01
)
8.2 递归特征消除(RFE)
from sklearn.feature_selection import RFE
def recursive_feature_elimination(X, y, estimator, n_features_to_select=None,
step=1, cv=5):
"""
递归特征消除
参数:
X: 特征矩阵
y: 目标变量
estimator: 评估器
n_features_to_select: 目标特征数
step: 每次消除的特征数
cv: 交叉验证折数
返回:
X_selected: 选择后的特征矩阵
selected_indices: 选择的特征索引
rfe: RFE对象
"""
# 创建RFE
rfe = RFE(estimator=estimator,
n_features_to_select=n_features_to_select,
step=step,
importance_getter='auto')
# 拟合
rfe.fit(X, y)
# 提取结果
selected_indices = np.where(rfe.support_)[0]
X_selected = X[:, selected_indices]
print(f"选择 {len(selected_indices)}/{X.shape[1]} 个特征")
return X_selected, selected_indices, rfe
# 使用示例
estimator = LGBMRegressor(**params)
X_selected, selected_indices, rfe = recursive_feature_elimination(
X, y, estimator, n_features_to_select=50, step=5
)
8.3 稳定性特征选择
def stable_feature_selection(importance_history, feature_names,
stability_threshold=0.7, top_n=None):
"""
稳定性特征选择
参数:
importance_history: 特征重要性历史
feature_names: 特征名称
stability_threshold: 稳定性阈值
top_n: 选择前n个稳定特征
返回:
selected_features: 选择的特征名称
stability_scores: 稳定性得分
"""
# 计算每个特征的排名
rankings = []
for importance in importance_history:
ranking = np.argsort(importance)[::-1]
rankings.append(ranking)
rankings = np.array(rankings)
# 计算稳定性得分(基于排名的方差)
stability_scores = []
for feature_idx in range(len(feature_names)):
feature_rankings = np.where(rankings == feature_idx)[1]
stability_score = 1 / (1 + np.var(feature_rankings))
stability_scores.append(stability_score)
stability_scores = np.array(stability_scores)
# 排序
sorted_indices = np.argsort(stability_scores)[::-1]
# 选择稳定特征
if top_n is None:
selected_indices = sorted_indices[stability_scores[sorted_indices] >= stability_threshold]
else:
selected_indices = sorted_indices[:top_n]
selected_features = np.array(feature_names)[selected_indices]
print(f"选择 {len(selected_indices)} 个稳定特征")
return selected_features, stability_scores
# 使用示例
selected_features, stability_scores = stable_feature_selection(
importance_history, feature_names,
stability_threshold=0.7, top_n=20
)
9. 特征重要性分析的最佳实践
9.1 综合分析流程
class FeatureImportanceAnalyzer:
"""
特征重要性分析器
功能:
1. 计算多种特征重要性
2. 可视化分析结果
3. 时序稳定性分析
4. 特征选择建议
"""
def __init__(self, model, X, y, feature_names):
self.model = model
self.X = X
self.y = y
self.feature_names = feature_names
self.split_importance = None
self.gain_importance = None
self.permutation_importance = None
self.shap_values = None
def calculate_lightgbm_importance(self):
"""计算LightGBM内置重要性"""
self.split_importance = self.model.feature_importance(importance_type='split')
self.gain_importance = self.model.feature_importance(importance_type='gain')
def calculate_permutation_importance(self, metric='ic', n_repeats=5):
"""计算Permutation Importance"""
self.permutation_importance = permutation_importance(
self.model, self.X, self.y,
metric=metric, n_repeats=n_repeats
)
def calculate_shap_values(self):
"""计算SHAP值"""
explainer = shap.TreeExplainer(self.model)
self.shap_values = explainer.shap_values(self.X)
def analyze_all(self):
"""执行所有分析"""
print("计算LightGBM重要性...")
self.calculate_lightgbm_importance()
print("计算Permutation Importance...")
self.calculate_permutation_importance()
print("计算SHAP值...")
self.calculate_shap_values()
def plot_summary(self, top_n=20):
"""绘制汇总图"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Split重要性
indices = np.argsort(self.split_importance)[::-1][:top_n]
axes[0, 0].barh(range(len(indices)), self.split_importance[indices][::-1])
axes[0, 0].set_yticks(range(len(indices)), np.array(self.feature_names)[indices][::-1])
axes[0, 0].set_title('Split Importance')
# Gain重要性
indices = np.argsort(self.gain_importance)[::-1][:top_n]
axes[0, 1].barh(range(len(indices)), self.gain_importance[indices][::-1])
axes[0, 1].set_yticks(range(len(indices)), np.array(self.feature_names)[indices][::-1])
axes[0, 1].set_title('Gain Importance')
# Permutation Importance
if self.permutation_importance is not None:
mean_imp = self.permutation_importance.mean(axis=1)
indices = np.argsort(mean_imp)[::-1][:top_n]
axes[1, 0].barh(range(len(indices)), mean_imp[indices][::-1])
axes[1, 0].set_yticks(range(len(indices)), np.array(self.feature_names)[indices][::-1])
axes[1, 0].set_title('Permutation Importance')
# SHAP重要性
if self.shap_values is not None:
mean_shap = np.abs(self.shap_values).mean(axis=0)
indices = np.argsort(mean_shap)[::-1][:top_n]
axes[1, 1].barh(range(len(indices)), mean_shap[indices][::-1])
axes[1, 1].set_yticks(range(len(indices)), np.array(self.feature_names)[indices][::-1])
axes[1, 1].set_title('SHAP Importance')
plt.tight_layout()
plt.show()
# 使用示例
analyzer = FeatureImportanceAnalyzer(model, X_val, y_val, feature_names)
analyzer.analyze_all()
analyzer.plot_summary(top_n=15)
9.2 检查清单
特征重要性分析检查清单
- 计算至少两种不同的特征重要性(排序一致性检查示例见下方代码)
- 可视化特征重要性分布
- 检查特征重要性的稳定性
- 分析特征间的相关性
- 验证特征选择的合理性
- 记录分析过程和结论
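针对第1条,下面给出一个检查两种重要性排序一致性的简单示例(示意写法,假设前文已计算出 gain_importance 与 Permutation 重要性数组 importances):
from scipy.stats import spearmanr
import numpy as np

# 取每个特征的平均Permutation重要性
perm_mean = importances.mean(axis=1)

# 两种重要性排序的一致性:Spearman秩相关越高,说明结论越稳健
rank_consistency = spearmanr(gain_importance, perm_mean)[0]
print(f"Gain 与 Permutation 重要性的秩相关: {rank_consistency:.3f}")

# 同时进入两种方法前20名的特征,可视为高置信度因子
top_gain = set(np.argsort(gain_importance)[::-1][:20])
top_perm = set(np.argsort(perm_mean)[::-1][:20])
print(f"两种方法前20名的重叠特征数: {len(top_gain & top_perm)}")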
时序特征重要性分析检查清单
- 使用滚动窗口分析重要性变化
- 识别稳定和不稳定特征
- 分析不同市场状态下的重要性(分组对比思路见下方示例)
- 检查重要性与市场周期的关系
- 制定特征更新策略
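针对"分析不同市场状态下的重要性"一条,下面给出一个按市场状态对滚动重要性分组对比的示意实现(假设已有 rolling_feature_importance 得到的 importance_history,以及与各窗口一一对应的市场状态标签 regime_labels,两者均为示意性假设):
import numpy as np
import pandas as pd

def importance_by_regime(importance_history, regime_labels, feature_names, top_n=10):
    """
    按市场状态分组统计滚动特征重要性(示意实现)
    参数:
    importance_history: 滚动重要性,shape=[n_windows, n_features]
    regime_labels: 长度为 n_windows 的市场状态标签(如 'high_vol'/'low_vol')
    """
    df = pd.DataFrame(importance_history, columns=feature_names)
    df['regime'] = regime_labels
    # 各状态下的平均重要性
    regime_mean = df.groupby('regime').mean()
    # 按整体平均重要性挑出前 top_n 个特征进行对比
    top_features = regime_mean.mean(axis=0).sort_values(ascending=False).index[:top_n]
    print(regime_mean[top_features].T)
    return regime_mean

# 使用示例(regime_labels 需自行构造,例如按各窗口末期的波动率分位数划分)
# regime_mean = importance_by_regime(importance_history, regime_labels, feature_names)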
10. 总结
特征重要性分析是量化模型开发中的关键环节:
- 基础重要性:LightGBM内置的Split和Gain重要性
- Permutation Importance:更真实的重要性评估方法
- SHAP分析:提供个体和全局解释
- 时序分析:分析重要性的稳定性
- 特征选择:基于重要性的特征筛选策略
正确的特征重要性分析能够帮助我们:
- 理解模型决策逻辑
- 识别有效因子
- 提升模型性能
- 控制模型风险