Triage is Key! Python to the Rescue!

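When you need to quickly analyze a lot of data, there is one critical step to perform: triage. In forensic investigations, this step matters because it lets investigators quickly identify, prioritize, and isolate the most relevant or high-value evidence from large volumes of data, so that limited time and resources are focused on the artifacts most likely to reveal key facts about an incident. Sometimes, a quick script is enough to speed up this task.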

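Today, I'm working on a case with a directory containing more than 20,000 mixed files. Many of them are ZIP archives (mainly Office documents), which themselves contain many files. The idea is to scan all of these files, including the contents of the ZIP archives, for a set of keywords. I wrote a quick Python script that scans every file against an embedded YARA rule and, when a match is found, copies the original file to a destination directory.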

Here is the script:

#
# Quick Python triage script
# Copies files matching a YARA rule to another directory
#
import yara
import os
import shutil
import zipfile

# YARA rule
yara_rule = """
rule case_xxxxxx_search_1
{
    strings:
        $s1 = "string1" nocase wide ascii
        $s2 = "string2" nocase wide ascii
        $s3 = "string3" nocase wide ascii
        $s4 = "string4" nocase wide ascii
        $s5 = "string5" nocase wide ascii
    condition:
        any of ($s*)
}
"""

source_dir = "Triage"
dest_dir = "MatchedFiles"
os.makedirs(dest_dir, exist_ok=True)
rules = yara.compile(source=yara_rule)

def is_zip_file(filepath):
    """
    Check for the ZIP archive magic bytes.
    """
    try:
        with open(filepath, "rb") as f:
            sig = f.read(4)
            return sig in (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")
    except Exception:
        return False

def safe_extract_path(member_name):
    """
    Return a safe relative path inside the destination folder (neutralizes ".." in paths).
    Not called by the rest of this script.
    """
    return os.path.normpath(member_name).replace("..", "_")

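def scan_file(filepath, file_bytes=None, inside_zip=False, zip_name=None, member_name=None):
    """
    Scan a file with YARA and copy the original file to dest_dir on a match.
    For a ZIP member, the enclosing archive is what gets copied.
    """
    try:
        if file_bytes is not None:
            # In-memory data (a member read from a ZIP archive)
            matches = rules.match(data=file_bytes)
        else:
            matches = rules.match(filepath)

        if matches:
            if inside_zip:
                print(f"[MATCH] {member_name} (inside {zip_name})")
                # Copy the enclosing archive, not the individual member
                rel_path = os.path.relpath(zip_name, source_dir)
                filepath = os.path.join(source_dir, rel_path)
                dest_path = os.path.join(dest_dir, rel_path)
            else:
                print(f"[MATCH] {filepath}")
                rel_path = os.path.relpath(filepath, source_dir)
                dest_path = os.path.join(dest_dir, rel_path)

            # Save a copy, preserving the directory layout and file metadata
            os.makedirs(os.path.dirname(dest_path), exist_ok=True)
            shutil.copy2(filepath, dest_path)
    except Exception as e:
        # Report the error and move on to the next file
        print(f"[ERROR] {filepath}: {e}")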

# Main
for root, dirs, files in os.walk(source_dir):
    for name in files:
        filepath = os.path.join(root, name)
        if is_zip_file(filepath):
            try:
                with zipfile.ZipFile(filepath, 'r') as z:
                    for member in z.namelist():
                        if member.endswith("/"):  # Skip directories
                            continue
                        try:
                            file_data = z.read(member)
                            scan_file(member, file_bytes=file_data, inside_zip=True, zip_name=filepath, member_name=member)
                        except Exception:
                            # Unreadable member (e.g. password-protected): skip it
                            pass
            except zipfile.BadZipFile:
                # ZIP signature present but the archive is corrupt: skip it
                pass
        else:
            scan_file(filepath)
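
To see what the `nocase wide ascii` modifiers in the rule ask YARA to look for, the matching can be approximated with the standard library alone. This is only a sketch for illustration, not how YARA works internally: `keyword_pattern` and `contains_keyword` are hypothetical helper names, and the sketch assumes plain ASCII keywords like the placeholders in the rule.

```python
import re


def keyword_pattern(keyword: str) -> "re.Pattern[bytes]":
    """Build a bytes regex that approximates `nocase wide ascii` for one keyword.

    Assumption: the keyword is plain ASCII (encode() raises otherwise).
    """
    # "ascii": the keyword as single-byte characters.
    ascii_form = keyword.encode("ascii")
    # "wide": each character followed by a zero byte (UTF-16LE for ASCII text).
    wide_form = keyword.encode("utf-16-le")
    alternatives = b"|".join(re.escape(form) for form in (wide_form, ascii_form))
    # "nocase": case-insensitive matching of the ASCII letters.
    return re.compile(alternatives, re.IGNORECASE)


def contains_keyword(data: bytes, keyword: str) -> bool:
    """Return True if the keyword occurs in data in either encoding."""
    return keyword_pattern(keyword).search(data) is not None
```

For example, `contains_keyword(open(path, "rb").read(), "string1")` would flag a file that stores the keyword either as single-byte text or as two-byte wide text.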

Now you can enjoy some coffee while the script works:

[MATCH] docProps/app.xml (inside Triage\xxxxxxx.xlsx)
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxx.xlsx)
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxxxxxxxxxxxxxxxxx.xlsx)
[MATCH] ppt/slides/slide3.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] ppt/slides/slide12.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] ppt/slides/slide14.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] ppt/slides/slide15.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxxxxx.xlsx)
[MATCH] Triage\xxxxxxxxxxxxxxxxxxxxxxx.pdf
[MATCH] Triage\xxxxxxxxxxxxxxxxxxx.xls
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxxxxxxxxxxxxx.xlsx)
[MATCH] Triage\xxxxxxxxxxxxxxxxxxxxxxxxxx.xls
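
In this output, the same archive is reported once per matching member, and the script copies it again each time. If one hit per archive is enough for triage, a possible variation is to stop at the first matching member. The sketch below is an assumption about how that could look, not part of the original script: `first_matching_member` and its `is_match` callback are hypothetical names, with the callback standing in for the YARA check.

```python
import zipfile
from typing import Callable, Optional


def first_matching_member(zip_source, is_match: Callable[[bytes], bool]) -> Optional[str]:
    """Return the name of the first member whose bytes satisfy is_match, or None.

    zip_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # Skip directory entries
            try:
                data = archive.read(info)
            except Exception:
                # Unreadable member (encrypted, corrupt, unsupported): skip it
                continue
            if is_match(data):
                return info.filename  # Stop at the first hit
    return None
```

With the script's compiled rules, the callback could be `lambda data: bool(rules.match(data=data))`, and the archive would be copied once when the function returns a name.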

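As you can see, a few lines of Python can speed up the triage phase of an investigation. Note that the script was written to process my current set of files and is not ready for broader use (for example, handling password-protected archives or other archive types).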
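
As a possible first step toward handling password-protected archives, encrypted members can at least be listed instead of being skipped silently. ZIP marks an encrypted entry with bit 0 of its general-purpose flags, which `zipfile` exposes as `ZipInfo.flag_bits`. A minimal sketch, with `encrypted_members` as a hypothetical helper name:

```python
import zipfile

ENCRYPTED_FLAG = 0x1  # Bit 0 of the ZIP general-purpose flags: entry is encrypted


def encrypted_members(zip_source):
    """Return the names of the members flagged as encrypted.

    zip_source may be a path or a binary file-like object.
    Only the central directory is read; no member is decompressed.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if info.flag_bits & ENCRYPTED_FLAG
        ]
```

Archives for which this returns a non-empty list can then be set aside for manual review rather than being treated as clean.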

Xavier Mertens (@xme)
Xameco
Senior ISC Handler - Freelance Cyber Security Consultant

Keywords: DFIR Forensics Investigation Python Script Triage

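Python for Digital Forensics: Quickly Triaging Key Evidence with YARA Rules

This article shows how to combine a Python script with YARA rules to quickly triage large numbers of files, in particular matching keywords in Office documents inside ZIP archives, to make evidence triage in digital forensic investigations more efficient.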
