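How to Transform JSON Data to Match Any Schema

When you move data between APIs or prepare JSON for import, a schema mismatch can break your workflow. Knowing how to clean and normalize JSON data keeps those transfers smooth and error-free.

This tutorial shows how to clean messy JSON against a predefined schema and export the result to a new file. The example file is a dataset of 200 synthetic customer records.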

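What we'll cover

Prerequisites
Adding and inspecting the JSON file
Defining the target schema
Cleaning JSON data with pure Python
Cleaning JSON data with Pandas
Validating the cleaned JSON
Pandas vs. pure Python for data cleaning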

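Prerequisites

To follow along, you should have a basic understanding of:

Python dictionaries, lists, and loops
JSON data structures (keys, values, and nesting)
Reading and writing JSON files with Python's json module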

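Adding and inspecting the JSON file

Before writing any code, make sure the .json file you want to clean is in your project directory. That way the script can load it by file name alone.

You can inspect the data structure either by opening the file locally or by loading it in a script with Python's built-in json module (this assumes the file is named "old_customers.json"):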

import json

with open('old_customers.json') as file:
    crm_data = json.load(file)

print(type(crm_data))
print(crm_data)

This shows whether the JSON file is structured as a dictionary or a list, and prints the entire contents of the file to the terminal.
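Printing the whole file can flood the terminal once a dataset holds hundreds of records. The sketch below shows one more compact way to inspect it. The helper name `summarize` and the sample record are hypothetical, and the code assumes the records sit under a top-level "customers" key, as the later steps of this tutorial do.

```python
import json

def summarize(data, sample_size=1):
    """Return a short description of a parsed JSON document."""
    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["keys"] = list(data.keys())
        # Assumption: the records live under the "customers" key.
        records = data.get("customers", [])
    elif isinstance(data, list):
        records = data
    else:
        records = []
    summary["record_count"] = len(records)
    summary["sample"] = records[:sample_size]
    return summary

# Hypothetical stand-in for the parsed contents of old_customers.json.
crm_data = {
    "customers": [
        {
            "customer_id": 101,
            "name": "Ada Park",
            "email": "ada@example.com",
            "phone": "555-0100",
            "address": "1 Main St",
            "membership_level": "gold",
        }
    ]
}

print(json.dumps(summarize(crm_data), indent=2))
```

In a real script you would pass the object returned by json.load() instead of the inline sample.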

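Defining the target schema

A JSON schema is essentially a blueprint that describes:

The required fields
The field names
The data type of each field
Standardized formats (for example, lowercase email addresses and trimmed whitespace)

Old schema versus target schema: the goal is to remove the "customer_id" and "address" fields from every entry and rename the remaining fields:

"name" → "full_name"
"email" → "email_address"
"phone" → "mobile"
"membership_level" → "tier"

Each output record should contain four fields instead of six, all renamed to fit the project's requirements.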
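One way to keep this mapping in a single place is to store it as a dictionary and derive each new record from it. This is only a sketch: the names `FIELD_MAP` and `remap` and the sample record are hypothetical and do not appear in the tutorial's scripts.

```python
# Old field name -> target field name, taken from the mapping above.
FIELD_MAP = {
    "name": "full_name",
    "email": "email_address",
    "phone": "mobile",
    "membership_level": "tier",
}

def remap(record, field_map=FIELD_MAP):
    """Keep only the mapped fields and rename them; other fields are dropped."""
    return {new: record[old] for old, new in field_map.items()}

# Hypothetical record in the old six-field shape.
old_record = {
    "customer_id": 101,
    "name": "Ada Park",
    "email": "ada@example.com",
    "phone": "555-0100",
    "address": "1 Main St",
    "membership_level": "gold",
}

new_record = remap(old_record)
print(new_record)
```

Because "customer_id" and "address" are absent from the mapping, they drop out without any explicit delete step.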

Cleaning JSON data with pure Python

Step 1: Import the json and time modules

import json
import time

Step 2: Load the file with json.load()

start_time = time.time()
with open('old_customers.json') as file:
    crm_data = json.load(file)

Step 3: Write a cleaning function that loops over the records

def clean_data(records):
    transformed_records = []
    for customer in records["customers"]:
        transformed_records.append({
            "full_name": customer["name"],
            "email_address": customer["email"],
            "mobile": customer["phone"],
            "tier": customer["membership_level"],
        })
    return {"customers": transformed_records}

new_data = clean_data(crm_data)
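The function above raises a KeyError as soon as one record lacks a field, and it copies values exactly as they are. The sketch below is a more defensive variant that also applies the normalizations listed in the schema section (trimmed whitespace, lowercase email). The names `clean_value`, `clean_record`, and `clean_data_safe`, the choice of None for missing fields, and the sample data are all assumptions made for illustration.

```python
def clean_value(value):
    """Trim surrounding whitespace from strings; leave other values alone."""
    return value.strip() if isinstance(value, str) else value

def clean_record(customer):
    """Build one record in the target schema, tolerating missing fields."""
    email = clean_value(customer.get("email"))
    return {
        "full_name": clean_value(customer.get("name")),
        # Lowercase the email only when there is one.
        "email_address": email.lower() if isinstance(email, str) else email,
        "mobile": clean_value(customer.get("phone")),
        "tier": clean_value(customer.get("membership_level")),
    }

def clean_data_safe(records):
    """Transform every entry under "customers" without raising on gaps."""
    return {"customers": [clean_record(c) for c in records.get("customers", [])]}

# Hypothetical messy input: stray whitespace, mixed case, a sparse record.
sample = {
    "customers": [
        {"name": "  Ada Park ", "email": "Ada@Example.COM ",
         "phone": "555-0100", "membership_level": "gold"},
        {"name": "Lin Wu"},
    ]
}

result = clean_data_safe(sample)
print(result)
```

Missing fields come through as None here, so the validation step later in this tutorial would flag those records as not containing strings. That may be exactly what you want, since it surfaces incomplete entries instead of hiding them.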

Step 4: Save the result to a .json file

output_file = "transformed_data.json"
with open(output_file, "w") as f:
    json.dump(new_data, f, indent=4)

Step 5: Time the data processing

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Transformed data saved to {output_file}")
print(f"Processing data took {elapsed_time:.2f} seconds")
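The timing above spans everything between the two time.time() calls, including reading and writing the file. If you want to time only the transformation, a small helper built on time.perf_counter() (a clock intended for measuring short intervals) is one option. The helper name `timed`, the function `rename_names`, and the generated records are hypothetical.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def rename_names(data):
    """Toy transformation: rename "name" to "full_name" for every record."""
    return [{"full_name": c["name"]} for c in data["customers"]]

# Hypothetical dataset of 200 generated records.
records = {"customers": [{"name": f"Customer {i}"} for i in range(200)]}

result, elapsed = timed(rename_names, records)
print(f"Transformed {len(result)} records in {elapsed:.6f} seconds")
```

On a dataset this small the elapsed time will be tiny and will vary from run to run, so treat a single measurement as a rough indication only.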

Cleaning JSON data with Pandas

Install Pandas

pip install pandas

Step 1: Import the required libraries

import json
import time
import pandas as pd

Step 2: Load the file and extract the customer entries

start_time = time.time()
with open('old_customers.json', 'r') as f:
    crm_data = json.load(f)

# Extract the list of customer entries
clients = crm_data.get("customers", [])

Step 3: Load the records into a DataFrame

df = pd.DataFrame(clients)

Step 4: Write a field transformation function

def transform_fields(row):
    return {
        "full_name": row.get("name", "Unknown"),
        "email_address": row.get("email", "N/A"),
        "mobile": row.get("phone", "N/A"),
        "tier": row.get("membership_level", "N/A")
    }

Step 5: Apply the schema transformation

# Option 1: use apply() to transform each row
transformed_df = df.apply(transform_fields, axis=1)
transformed_data = transformed_df.tolist()

# Option 2 (alternative to option 1; use one or the other): a list comprehension
transformed_data = [transform_fields(row) for row in df.to_dict(orient="records")]
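Both options above call a Python function once per row. For a pure rename-and-drop transformation, Pandas can also work on whole columns at once. The sketch below uses column selection plus DataFrame.rename(). The `FIELD_MAP` name and the sample records are hypothetical, and unlike transform_fields() this version supplies no default values.

```python
import pandas as pd

# Old column name -> target column name.
FIELD_MAP = {
    "name": "full_name",
    "email": "email_address",
    "phone": "mobile",
    "membership_level": "tier",
}

# Hypothetical records in the old six-field shape.
clients = [
    {"customer_id": 101, "name": "Ada Park", "email": "ada@example.com",
     "phone": "555-0100", "address": "1 Main St", "membership_level": "gold"},
    {"customer_id": 102, "name": "Lin Wu", "email": "lin@example.com",
     "phone": "555-0101", "address": "2 Oak Ave", "membership_level": "silver"},
]
df = pd.DataFrame(clients)

# Selecting only the mapped columns drops customer_id and address;
# rename() then applies the new names.
renamed = df[list(FIELD_MAP)].rename(columns=FIELD_MAP)
transformed_data = renamed.to_dict(orient="records")
print(transformed_data)
```

The column selection raises a KeyError if any mapped column is absent from the DataFrame, so this variant suits data where every source field is known to exist. Note also that when a column exists but one record lacks the field, Pandas stores NaN in that cell, and row.get() in transform_fields() returns that NaN rather than the default.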

Step 6: Save the output

output_data = {"customers": transformed_data}
output_file = "applypandas_customer.json"
with open(output_file, "w") as f:
    json.dump(output_data, f, indent=4)

Step 7: Track the run time

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Transformed data saved to {output_file}")
print(f"Processing data took {elapsed_time:.2f} seconds")

Validating the cleaned JSON

Step 1: Install jsonschema

pip install jsonschema

Step 2: Define the schema

schema = {
    "type": "object",
    "properties": {
        "customers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "full_name": {"type": "string"},
                    "email_address": {"type": "string"},
                    "mobile": {"type": "string"},
                    "tier": {"type": "string"}
                },
                "required": ["full_name", "email_address", "mobile", "tier"]
            }
        }
    },
    "required": ["customers"]
}

Step 3: Load the cleaned JSON file

import json

with open("transformed_data.json") as f:
    data = json.load(f)

Step 4: Validate the data

from jsonschema import validate, ValidationError

try:
    validate(instance=data, schema=schema)
    print("JSON is valid.")
except ValidationError as e:
    print("JSON is invalid:", e.message)
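validate() raises a single ValidationError, so each run reports one problem. When a file may contain many bad records, it can help to list every problem in one pass. The following is a rough, dependency-free sketch that mirrors the schema above by hand. It is not a replacement for jsonschema, and the names `REQUIRED_FIELDS` and `find_problems` and the sample data are hypothetical.

```python
REQUIRED_FIELDS = ("full_name", "email_address", "mobile", "tier")

def find_problems(data):
    """Return a list of problem descriptions; an empty list means the check passed."""
    if not isinstance(data, dict) or not isinstance(data.get("customers"), list):
        return ["top level must be an object with a 'customers' array"]
    problems = []
    for index, record in enumerate(data["customers"]):
        if not isinstance(record, dict):
            problems.append(f"customers[{index}]: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"customers[{index}]: missing '{field}'")
            elif not isinstance(record[field], str):
                problems.append(f"customers[{index}].{field}: expected a string")
    return problems

# Hypothetical data: one valid record, one with a missing and a mistyped field.
data = {
    "customers": [
        {"full_name": "Ada Park", "email_address": "ada@example.com",
         "mobile": "555-0100", "tier": "gold"},
        {"full_name": "Lin Wu", "email_address": "lin@example.com",
         "mobile": 5550101},
    ]
}

for problem in find_problems(data):
    print(problem)
```

This check covers only the rules in the schema shown above (required keys and string types); any rule you add to the schema would need a matching line here.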

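Pandas vs. pure Python for data cleaning

Cleaning and restructuring JSON with pure Python is the more direct approach. It is fast and well suited to small datasets or simple transformations. As the data grows and becomes more complex, though, you may need more advanced cleaning methods than Python alone provides. In that case Pandas becomes the better choice: it handles large, complex datasets efficiently and offers built-in functions for handling missing data and removing duplicates.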
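As a brief illustration of those built-in functions, the sketch below removes an exact duplicate with drop_duplicates() and fills a missing phone number with fillna(). The sample records and the "N/A" placeholder are assumptions chosen to match the defaults used earlier in this tutorial.

```python
import pandas as pd

# Hypothetical records: the second is an exact duplicate, the third has no phone.
clients = [
    {"name": "Ada Park", "email": "ada@example.com",
     "phone": "555-0100", "membership_level": "gold"},
    {"name": "Ada Park", "email": "ada@example.com",
     "phone": "555-0100", "membership_level": "gold"},
    {"name": "Lin Wu", "email": "lin@example.com",
     "membership_level": "silver"},
]
df = pd.DataFrame(clients)

# Drop rows that repeat an earlier row in every column, then renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

# Replace the missing phone value with a placeholder.
filled = deduped.fillna({"phone": "N/A"})
print(filled.to_dict(orient="records"))
```

Doing the same in pure Python would mean tracking previously seen records and checking each field by hand, which is where the two approaches start to diverge as cleaning rules accumulate.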
