10 NumPy One-Liners to Simplify Feature Engineering

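When building machine learning models, most developers focus on model architecture and hyperparameter tuning. The real competitive advantage, however, comes from building feature representations that help the model capture the underlying patterns in the data.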

1. Robust Scaling with Median Absolute Deviation

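import numpy as np
data = np.array([1, 200, 3, 10, 4, 50, 6, 9, 3, 100])
# Center on the median and divide by the median absolute deviation (MAD); both are robust to outliers
scaled = (data - np.median(data)) / np.median(np.abs(data - np.median(data)))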
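
If more than half of the values are identical, the MAD is zero and the one-liner divides by zero. The sketch below wraps the same computation in a hypothetical helper, mad_scale, that falls back to centering only in that case. The helper name and the eps threshold are assumptions made for illustration.

```python
import numpy as np

def mad_scale(values, eps=1e-12):
    """Scale by median and MAD; only center the data when the MAD is (near) zero."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad < eps:  # assumed threshold; a zero MAD would cause division by zero
        return values - median
    return (values - median) / mad

data = np.array([1, 200, 3, 10, 4, 50, 6, 9, 3, 100])
scaled = mad_scale(data)

# More than half the values are equal, so the MAD is 0 and only centering is applied
constant = mad_scale(np.array([5, 5, 5, 7]))
```

Returning the centered values is only one possible fallback; raising an error or substituting another spread estimate may suit some pipelines better.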

2. Binning Continuous Variables with Quantiles

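ages = np.array([18, 25, 35, 22, 45, 67, 23, 29, 34, 56, 41, 38, 52, 28, 33])
# Quartile edges give four bins labeled 0 to 3 (np.digitize already starts at 0 for values below the first edge)
binned = np.digitize(ages, np.percentile(ages, [25, 50, 75]))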
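
One way to check how the quartile edges split the data is to count the members of each bin. The following is a minimal sketch; the names edges, bin_ids, and bin_counts are illustrative. Note that np.digitize on its own returns indices 0 through 3 when given three edges.

```python
import numpy as np

ages = np.array([18, 25, 35, 22, 45, 67, 23, 29, 34, 56, 41, 38, 52, 28, 33])

# Three quartile edges split the data into four bins
edges = np.percentile(ages, [25, 50, 75])

# Index 0 means "below the first edge", index 3 means "at or above the last edge"
bin_ids = np.digitize(ages, edges)

# Number of samples per bin; quantile bins should be roughly balanced
bin_counts = np.bincount(bin_ids, minlength=len(edges) + 1)
```

With ties or very small samples the bins will not be perfectly equal in size, so a quick count like this helps spot nearly empty bins.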

3. Polynomial Features Without Explicit Loops

X = np.array([[20, 65], [25, 70], [30, 45], [22, 80]])
# All degree-2 terms for two columns: x0*x0, x0*x1, x1*x1
poly_features = np.column_stack([X[:, [i, j]].prod(axis=1) for i in range(X.shape[1]) for j in range(i, X.shape[1])])
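
The same degree-2 terms can be generated with itertools.combinations_with_replacement, which also makes it easy to keep track of which column is which. This is a sketch only; the names pairs, degree2, names, and poly are assumptions introduced here.

```python
import numpy as np
from itertools import combinations_with_replacement

X = np.array([[20, 65], [25, 70], [30, 45], [22, 80]])

# Index pairs (i, j) with i <= j, i.e. squares and cross terms
pairs = list(combinations_with_replacement(range(X.shape[1]), 2))

# One column per pair: x_i * x_j
degree2 = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

# Human-readable labels for the generated columns
names = [f"x{i}*x{j}" for i, j in pairs]

# Original features followed by their degree-2 terms
poly = np.hstack([X, degree2])
```

Keeping the labels next to the matrix helps when inspecting feature importances later.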

4. Lag Features for Time Series

sales = np.array([100, 98, 120, 130, 74, 145, 110, 140, 65, 105, 135])
# Columns are lags 1, 2, 3; drop the first 3 rows, where np.roll wraps values around from the end
lags = np.column_stack([np.roll(sales, shift) for shift in [1, 2, 3]])[3:]
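
An alternative that avoids the wrap-around of np.roll is numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later). The sketch below builds the same lag matrix together with the aligned targets; n_lags, windows, lag_features, and targets are illustrative names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

sales = np.array([100, 98, 120, 130, 74, 145, 110, 140, 65, 105, 135])
n_lags = 3

# Each row holds n_lags past values followed by the current value
windows = sliding_window_view(sales, n_lags + 1)

# Reverse the past values so the columns read lag 1, lag 2, lag 3
lag_features = windows[:, :n_lags][:, ::-1]

# The value each row of lags would be used to predict
targets = windows[:, n_lags]

# Same result as the np.roll version after dropping the wrapped rows
roll_lags = np.column_stack([np.roll(sales, s) for s in (1, 2, 3)])[n_lags:]
```

Because the windows are views, no values from the end of the series can leak into the first rows.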

5. One-Hot Encoding Without pandas

categories = np.array([0, 1, 2, 1, 0, 2, 3, 1])
# Compare each label against every class index via broadcasting; assumes integer labels in 0..max
one_hot = (categories[:, None] == np.arange(categories.max() + 1)).astype(int)
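
The one-liner assumes integer labels that start at 0. For string labels or non-contiguous integers, one possible approach is to map labels to codes with np.unique first and then index an identity matrix. The labels array below is a made-up example, not data from the article.

```python
import numpy as np

# Hypothetical string labels, used only for illustration
labels = np.array(["red", "green", "blue", "green", "red"])

# classes is sorted; codes gives each label's position in classes
classes, codes = np.unique(labels, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(len(classes), dtype=int)[codes]
```

The column order follows the sorted classes array, so it should be stored alongside the encoded matrix.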

6. Distance Features from Coordinates

locations = np.array([[40.7128, -74.0060], [34.0522, -118.2437], [41.8781, -87.6298], [29.7604, -95.3698]])
reference = np.array([39.7392, -104.9903])
# Straight-line (Euclidean) distance in coordinate units (degrees here), not kilometers
distances = np.sqrt(((locations - reference) ** 2).sum(axis=1))
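
If the coordinates are latitude and longitude in degrees, as the values suggest, the Euclidean result is in degrees rather than a physical distance. When kilometers are needed, the haversine formula is a common choice. The sketch below assumes columns ordered as [latitude, longitude] and a spherical Earth with a mean radius of 6371 km; haversine_km is a hypothetical helper name.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(points_deg, ref_deg):
    """Great-circle distance in km from each [lat, lon] row to a reference point."""
    points = np.radians(np.asarray(points_deg, dtype=float))
    ref = np.radians(np.asarray(ref_deg, dtype=float))
    dlat = points[:, 0] - ref[0]
    dlon = points[:, 1] - ref[1]
    a = np.sin(dlat / 2) ** 2 + np.cos(points[:, 0]) * np.cos(ref[0]) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

locations = np.array([[40.7128, -74.0060], [34.0522, -118.2437], [41.8781, -87.6298], [29.7604, -95.3698]])
reference = np.array([39.7392, -104.9903])
distances_km = haversine_km(locations, reference)
```

For points that are close together the Euclidean shortcut may be adequate, but over long distances the two measures can rank locations differently.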

7. Interaction Features Between Variables

features = np.array([[10, 8, 7], [15, 9, 6], [12, 7, 8], [20, 10, 9]])
# Pairwise products of distinct columns: (0, 1), (0, 2), (1, 2)
interactions = np.array([features[:, i] * features[:, j] for i in range(features.shape[1]) for j in range(i+1, features.shape[1])]).T
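
The pairwise products can also be computed without a Python-level comprehension by taking the column index pairs from np.triu_indices. This is a minimal sketch; the names i, j, and interactions_vec are illustrative.

```python
import numpy as np

features = np.array([[10, 8, 7], [15, 9, 6], [12, 7, 8], [20, 10, 9]])

# Index pairs above the diagonal: (0, 1), (0, 2), (1, 2)
i, j = np.triu_indices(features.shape[1], k=1)

# Fancy indexing selects both operands for every pair at once
interactions_vec = features[:, i] * features[:, j]

# The comprehension-based version, for comparison
interactions_loop = np.array([features[:, a] * features[:, b]
                              for a in range(features.shape[1])
                              for b in range(a + 1, features.shape[1])]).T
```

The number of pairs grows quadratically with the number of columns, so this is best applied to a small, preselected set of features.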

8. Rolling Window Statistics

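signal = np.array([10, 27, 12, 18, 11, 19, 20, 26, 12, 19, 25, 31, 28])
window_size = 4
# A uniform kernel gives the moving average; mode='valid' keeps only full windows (len(signal) - window_size + 1 values)
rolling_mean = np.convolve(signal, np.ones(window_size)/window_size, mode='valid')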
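
Convolution covers the mean, but other statistics such as the standard deviation or maximum need the windows themselves. One option, assuming NumPy 1.20 or later, is sliding_window_view; the names windows, rolling_std, and rolling_max below are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

signal = np.array([10, 27, 12, 18, 11, 19, 20, 26, 12, 19, 25, 31, 28])
window_size = 4

# One row per full window, without copying the underlying data
windows = sliding_window_view(signal, window_size)

# Any reduction along axis=1 becomes a rolling statistic
rolling_mean = windows.mean(axis=1)
rolling_std = windows.std(axis=1)
rolling_max = windows.max(axis=1)

# The convolution-based mean, for comparison
conv_mean = np.convolve(signal, np.ones(window_size) / window_size, mode="valid")
```

Each output value corresponds to a window ending at that position plus window_size - 1, so the results need to be aligned with the original series before being used as features.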

9. Outlier Indicator Features

amounts = np.array([25, 30, 28, 32, 500, 29, 31, 27, 33, 26])
# Flag values below the 5th or above the 95th percentile
outlier_flags = ((amounts < np.percentile(amounts, 5)) | (amounts > np.percentile(amounts, 95))).astype(int)
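
A percentile cutoff flags the most extreme values in both tails by construction, even when they are not unusual; on this data it marks 25 as well as 500. A common alternative is the interquartile range (IQR) rule with a 1.5 multiplier, sketched below. The multiplier and the variable names are assumptions, not part of the original snippet.

```python
import numpy as np

amounts = np.array([25, 30, 28, 32, 500, 29, 31, 27, 33, 26])

# Percentile rule from the one-liner, for comparison
pct_flags = ((amounts < np.percentile(amounts, 5)) |
             (amounts > np.percentile(amounts, 95))).astype(int)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flags = ((amounts < lower) | (amounts > upper)).astype(int)
```

Here the IQR rule flags only the value 500, while the percentile rule flags two values. Which behavior is preferable depends on whether a fixed share of flagged points is wanted.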

10. Frequency Encoding for Categorical Variables

categories = np.array(['Electronics', 'Books', 'Electronics', 'Clothing', 'Books', 'Electronics', 'Home', 'Books'])
unique_cats, counts = np.unique(categories, return_counts=True)
# Replace each category with the number of times it occurs in the data
freq_encoded = np.array([counts[np.where(unique_cats == cat)[0][0]] for cat in categories])
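
The per-element lookup can be avoided by asking np.unique for the inverse indices as well, which map each original element to its position in the unique array. The sketch below should give the same encoding; rel_freq is an extra, illustrative variant that uses proportions instead of raw counts.

```python
import numpy as np

categories = np.array(['Electronics', 'Books', 'Electronics', 'Clothing',
                       'Books', 'Electronics', 'Home', 'Books'])

# inverse[k] is the index of categories[k] within unique_cats
unique_cats, inverse, counts = np.unique(categories, return_inverse=True, return_counts=True)

# Index the counts with the inverse mapping: one vectorized lookup
freq_encoded = counts[inverse]

# Optional: relative frequencies in [0, 1]
rel_freq = freq_encoded / len(categories)
```

In a train/test setting the counts would normally be computed on the training data only and then applied to new data.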

Feature Engineering Best Practices

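Memory efficiency: When working with large datasets, consider the memory impact of feature engineering.
Feature selection: More features are not always better; use correlation analysis to select the most valuable ones.
Validation: Verify on a holdout set that the engineered features actually improve model performance.
Domain knowledge: The best features usually come from a deep understanding of the business domain.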
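
As a rough illustration of correlation-based feature selection, the sketch below ranks the columns of a feature matrix by the absolute value of their Pearson correlation with the target. The data is synthetic and generated only for this example, and all names are assumptions. Correlation captures linear relationships only, so it is a first-pass filter rather than a complete selection method.

```python
import numpy as np

# Synthetic data for illustration only: two informative columns and one noise column
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=n),   # strongly related to the target
    rng.normal(size=n),                  # unrelated noise
    -signal + 0.5 * rng.normal(size=n),  # negatively related to the target
])
y = 2.0 * signal + 0.1 * rng.normal(size=n)

# Pearson correlation of each column with the target
corr = np.array([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])

# Column indices ordered from most to least correlated (by absolute value)
ranking = np.argsort(-np.abs(corr))
```

Any features kept this way should still be checked on a holdout set, as the validation point above recommends.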

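These NumPy one-liners are practical solutions to common feature engineering challenges and can help developers build more efficient, maintainable feature engineering pipelines.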