【SQL Weekly Practice】 Given acid-free paper and color-shifting ink, how many US dollars could you counterfeit?

Hi everyone, I'm "蔣點數分". I have worked in data analysis for many years, and starting today I will be regularly sharing what I learn about the field.

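This is my 2nd article, and also the 2nd installment of the 【SQL Weekly Practice】 series. The series picks or invents SQL problems of some difficulty, with at least one post a week. Topics I tentatively plan to cover later include: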

Planned topics

1. Using Streamlit to build a Hive metadata viewer and a SQL editor, plus a data analysis Agent that runs in a Docker sandbox

2. Time series anomaly detection and attribution algorithms for metric changes

3. Retention rate fitting, forecasting, and modeling

4. A/B testing and complex experiment design

5. Automated machine learning and automated feature engineering

6. Causal inference

7. ……

Follow along and let's learn together. (Students preparing for data analyst roles are welcome to follow and become "tool people" with me.)

Problem #2

Source: an original problem, with a scenario taken from the Hong Kong film 《無雙》 (Project Gutenberg)

I. Problem description

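《無雙》 is a very good film about counterfeiting US banknotes. It was released years ago, but its "acid-free paper" and "color-shifting ink" lines still circulate online as memes. In one classic scene, "Painter" (Chow Yun-fat) scolds "李問" (Aaron Kwok) for ordering 500 tons of acid-free paper and tells him to stay alive long enough to print it all (the ending, of course, reveals that Aaron Kwok's character was the real "Painter"). That scene inspired this SQL problem: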

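Suppose the counterfeiting ring supplies you each day with random quantities of color-shifting ink, acid-free paper, and security thread (unused material carries over to later days), and that the intaglio press and all other materials and tools are already in place.

Compute how many counterfeit notes can be made each day and, based on that day's situation, output which material is scarcest going into the next day. The input table has these columns: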

column	type	description
date	string	date
acid_free_paper_supply	int	acid-free paper supplied (in g)
optically_variable_ink_supply	int	color-shifting ink supplied (in mg)
security_thread_supply	int	security threads supplied (count)

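Assume one counterfeit note needs 1 g of acid-free paper, 0.005 g (5 mg) of color-shifting ink, and 1 security thread, with no material lost during printing.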

II. Approach

If you want to try the problem yourself, think about your answer first.

……

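Here is my approach. The problem is designed so that materials go into each note in fixed proportions, so whichever material is relatively scarcest caps the number of notes that can be made. We can therefore compute how many notes each of the three materials could support on its own and take the minimum with least, much like the "shortest stave of the barrel" principle. The problem also says that material left over on one day can be used later, so there is no need to treat each day in isolation: just compute cumulative totals from the first day through the current day, which brings in the data analyst's old friend, the window function.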
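As a minimal sketch of this idea outside SQL, the pandas snippet below runs the same "cumulative total, then minimum" logic on a three-day supply table. The numbers and the short column names (`paper_g`, `ink_mg`, `thread`) are made up for illustration and are not the article's dataset.

```python
import pandas as pd

# Hypothetical three-day supply, made up for illustration only.
supply = pd.DataFrame(
    {
        "paper_g": [100, 50, 10],
        "ink_mg": [200, 900, 100],
        "thread": [60, 60, 60],
    }
)

# Notes each material could support on its own, cumulatively
# (1 g of paper, 5 mg of ink, and 1 thread per note).
by_paper = supply["paper_g"].cumsum()
by_ink = supply["ink_mg"].cumsum() // 5
by_thread = supply["thread"].cumsum()

# The scarcest material caps cumulative output, like SQL's least().
cumulative = pd.concat([by_paper, by_ink, by_thread], axis=1).min(axis=1)

# Daily output is the day-over-day increase of that cap.
daily = cumulative.diff().fillna(cumulative).astype(int)

print(cumulative.tolist())  # [40, 120, 160]
print(daily.tolist())       # [40, 80, 40]
```

On day 1 ink is the bottleneck (40 notes), on day 2 thread is (120 in total), and on day 3 paper is (160 in total), even though no single day's supply was looked at in isolation.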

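Next, I generate a mock dataset with NumPy and SciPy:

III. Generating mock data

Readers who only care about the SQL can skip to Section IV (I mostly use Hive at work, so I use Hive syntax).

The simulation code is as follows:

1. Define the constants the simulation needs and compute how much material the target number of notes requires: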

import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly; scipy.stats.uniform is used below

# Random seed
RANDOM_SEED = 2025
# First day of counterfeiting
START_DATE = "2025-05-01"
# Number of days
NUM_DAY = 10
# Target number of counterfeit notes (a count of notes, not a dollar amount)
NUM_TOTAL_COUNTERFEIT_CURRENCY = 1_000_000

# Acid-free paper per note; to keep things simple only weight is considered (in g)
ACID_FREE_PAPER_EACH_COUNTERFEIT_CURRENCY = 1
# Acid-free paper needed for all notes (1.05 adds a 5% margin; same for the other materials)
ACID_FREE_PAPER_ALL_NEED = (
    ACID_FREE_PAPER_EACH_COUNTERFEIT_CURRENCY
    * NUM_TOTAL_COUNTERFEIT_CURRENCY
    * 1.05
)

# Color-shifting ink per note, by weight (in mg)
OPTICALLY_VARIABLE_INK_EACH_COUNTERFEIT_CURRENCY = 5
# Color-shifting ink needed for all notes
OPTICALLY_VARIABLE_INK_ALL_NEED = (
    OPTICALLY_VARIABLE_INK_EACH_COUNTERFEIT_CURRENCY
    * NUM_TOTAL_COUNTERFEIT_CURRENCY
    * 1.05
)

# Security threads per note (count)
SECURITY_THREAD_EACH_COUNTERFEIT_CURRENCY = 1
# Security threads needed for all notes
SECURITY_THREAD_ALL_NEED = (
    SECURITY_THREAD_EACH_COUNTERFEIT_CURRENCY
    * NUM_TOTAL_COUNTERFEIT_CURRENCY
    * 1.05
)

2. The materials are supplied each day according to random weights, which need to be normalized:

# Range of the raw weights used to generate random data (normalized below)
WEIGHT_RANGE = (0.2, 2)

# Random daily supply weights for acid-free paper
acid_free_paper_supply_weight = scipy.stats.uniform.rvs(
    loc=WEIGHT_RANGE[0],
    scale=WEIGHT_RANGE[1] - WEIGHT_RANGE[0],
    size=NUM_DAY,
    random_state=RANDOM_SEED - 1,
)

# Random daily supply weights for color-shifting ink
optically_variable_ink_supply_weight = scipy.stats.uniform.rvs(
    loc=WEIGHT_RANGE[0],
    scale=WEIGHT_RANGE[1] - WEIGHT_RANGE[0],
    size=NUM_DAY,
    random_state=RANDOM_SEED,
)

# Random daily supply weights for security thread
security_thread_supply_weight = scipy.stats.uniform.rvs(
    loc=WEIGHT_RANGE[0],
    scale=WEIGHT_RANGE[1] - WEIGHT_RANGE[0],
    size=NUM_DAY,
    random_state=RANDOM_SEED + 1,
)

# Normalize the weights so that each material's daily shares sum to 1
acid_free_paper_supply_weight /= acid_free_paper_supply_weight.sum()
optically_variable_ink_supply_weight /= optically_variable_ink_supply_weight.sum()
security_thread_supply_weight /= security_thread_supply_weight.sum()

3. Convert the generated data into a pd.DataFrame and write it out as a csv file:

df = pd.DataFrame(
    {
        "acid_free_paper_supply": ACID_FREE_PAPER_ALL_NEED
        * acid_free_paper_supply_weight,
        "optically_variable_ink_supply": OPTICALLY_VARIABLE_INK_ALL_NEED
        * optically_variable_ink_supply_weight,
        "security_thread_supply": SECURITY_THREAD_ALL_NEED
        * security_thread_supply_weight
    }
)

# Round to the nearest integer and cast to int
df = df.round().astype(int)
df["date"] = pd.date_range(start=START_DATE, periods=NUM_DAY, freq="D")

# Show the data in Jupyter (display is available there without an import)
display(df)

out_csv_path = "./dwd_conterfeit_material_daily_supply_records.csv"
columns = [
    "date",
    "acid_free_paper_supply",
    "optically_variable_ink_supply",
    "security_thread_supply"
]
# Export a csv for Hive to load; utf-8-sig is there to handle Chinese text,
# although this table's data contains none
df[columns].to_csv(out_csv_path, header=False, index=False, encoding="utf-8-sig")
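One side effect of `round().astype(int)` is that each day is rounded independently, so a column's total can land a unit or so away from the intended total (in the query results further down, the security thread column ends at 1050001 rather than 1050000). That is harmless for this exercise. If exact totals mattered, one possible alternative is a largest-remainder allocation, sketched below; `allocate_integer_supply` is a hypothetical helper and not part of the original simulation.

```python
import numpy as np


def allocate_integer_supply(total: int, weights: np.ndarray) -> np.ndarray:
    """Split `total` into integers proportional to `weights` that sum exactly to `total`."""
    shares = weights / weights.sum() * total
    allocation = np.floor(shares).astype(int)
    shortfall = int(total - allocation.sum())
    # Hand the leftover units to the days with the largest fractional parts
    # (allocation - shares is most negative where the fraction is largest).
    largest_remainders = np.argsort(allocation - shares)[:shortfall]
    allocation[largest_remainders] += 1
    return allocation


# Raw weights drawn the same way in spirit as the article's (uniform on [0.2, 2)),
# but with NumPy's generator instead of scipy.stats, purely for brevity.
rng = np.random.default_rng(2025)
weights = rng.uniform(0.2, 2.0, size=10)

daily_supply = allocate_integer_supply(1_050_000, weights)
print(daily_supply.sum())  # 1050000
```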

4. Create a new Hive table and load the data into it:

import os

from pyhive import hive

# Connection parameters
host_ip = "127.0.0.1"
port = 10000
username = "蔣點數分"  # defined here but not passed to hive.Connection below


with hive.Connection(host=host_ip, port=port) as conn:
    cursor = conn.cursor()

    # Drop the table if it already exists so the script can be rerun
    drop_table_sql = """
    drop table if exists data_exercise.dwd_conterfeit_material_daily_supply_records
    """
    print(drop_table_sql)
    cursor.execute(drop_table_sql)

    create_table_sql = """
    create table data_exercise.dwd_conterfeit_material_daily_supply_records (
        `date` string comment "date",
        acid_free_paper_supply int comment "acid-free paper supplied (in g)",
        optically_variable_ink_supply int comment "color-shifting ink supplied (in mg)",
        security_thread_supply int comment "security threads supplied"
    )
    comment "daily raw material supply from the counterfeiting ring | article ID: 2c3d2561"
    row format delimited fields terminated by ","
    stored as textfile
    """

    print(create_table_sql)
    cursor.execute(create_table_sql)

    # Load the csv written in the previous step, replacing any existing rows
    load_data_sql = f"""
    load data local inpath "{os.path.abspath(out_csv_path)}"
    overwrite into table data_exercise.dwd_conterfeit_material_daily_supply_records
    """

    print(load_data_sql)
    cursor.execute(load_data_sql)

    cursor.close()

I use the PyHive package to work with Hive from Python. Hadoop and Hive are deployed on my personal computer with authentication turned off; in companies, Kerberos is commonly used to authenticate against big data clusters.
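For readers on a Kerberos-secured cluster, the sketch below shows the keyword arguments one would typically pass to `hive.Connection` instead. The host name is hypothetical, the service name `hive` is an assumption about how HiveServer2's principal is configured, and this has not been run against the setup in this article.

```python
def kerberos_connection_kwargs(host: str, port: int = 10000) -> dict:
    """Keyword arguments for pyhive.hive.Connection on a Kerberos-secured cluster."""
    return {
        "host": host,
        "port": port,
        "auth": "KERBEROS",
        # Assumption: the HiveServer2 service principal is hive/<host>@<REALM>.
        "kerberos_service_name": "hive",
    }


kwargs = kerberos_connection_kwargs("hive.example.internal")  # hypothetical host
print(kwargs)

# With a valid Kerberos ticket (for example after running kinit) and the SASL
# dependencies installed, the connection would then be opened with:
#   from pyhive import hive
#   conn = hive.Connection(**kwargs)
```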

IV. SQL solution

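The approach was explained in Section II; the code is below, with details in the comments. Whichever material's cumulative_conterfeit_only... column equals cumulative_conterfeit_all_restriction can be regarded as the scarcest material going into the next day (it is the one capping output). Tip: when a window has order by but no explicit frame, Hive's default frame is range between unbounded preceding and current row; because each date appears only once here, that gives the same result as rows between unbounded preceding and current row, but writing the frame out explicitly is clearer. Each material is tested separately and the results are merged with concat_ws (note that other SQL dialects do not necessarily have this Hive function).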
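In Hive, a window with order by and no frame clause defaults to range between unbounded preceding and current row. The difference from a rows frame only shows up when several rows share the same order key. The pandas sketch below imitates both frames on a made-up table in which one date appears twice; it illustrates the semantics only and does not run Hive.

```python
import pandas as pd

# Hypothetical supply log in which 2025-05-02 appears twice.
df = pd.DataFrame(
    {
        "date": ["2025-05-01", "2025-05-02", "2025-05-02", "2025-05-03"],
        "supply": [10, 20, 30, 40],
    }
)

# rows between unbounded preceding and current row: a plain running total.
df["rows_frame"] = df["supply"].cumsum()

# range between unbounded preceding and current row: rows with the same
# order key are peers, so each one gets the total through its last peer.
df["range_frame"] = df.groupby("date")["rows_frame"].transform("last")

print(df["rows_frame"].tolist())   # [10, 30, 60, 100]
print(df["range_frame"].tolist())  # [10, 60, 60, 100]
```

With one row per date, as in this article's table, the two columns would be identical.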

The daily output, action_daily_production, is cumulative_conterfeit_all_restriction minus the previous row's value, which the window function lag provides.
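The same subtraction can be sketched in pandas, where `shift(1, fill_value=0)` plays the role of `lag(col, 1, 0)`. The cumulative values below are copied from the cumulative_conterfeit_all_restriction column of the query results shown later in this article.

```python
import pandas as pd

# cumulative_conterfeit_all_restriction, copied from the query results.
cumulative = pd.Series(
    [36604, 133386, 303165, 334383, 426535, 497465, 646287, 754988, 910409, 1050000]
)

# Previous row's value, with 0 for the first row, like lag(col, 1, 0).
previous = cumulative.shift(1, fill_value=0)
daily_production = cumulative - previous

print(daily_production.tolist())
# [36604, 96782, 169779, 31218, 92152, 70930, 148822, 108701, 155421, 139591]
```

The output matches the action_daily_production column, and the daily values add back up to the final cumulative total of 1050000.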

with calc_single_material_restrict_production as (
    -- How many counterfeit notes each material alone would allow
    select
      `date`
    , acid_free_paper_supply
    , optically_variable_ink_supply
    , security_thread_supply
    -- Cumulative notes if only acid-free paper mattered, ignoring the other materials
    -- and any daily production cap; the columns below follow the same pattern
    -- Some materials are used at a ratio of 1, so the division by 1 is omitted
    , sum(acid_free_paper_supply) over(order by `date` asc) as cumulative_conterfeit_only_acid_free_paper
    -- Note the floor: 5 mg of ink per note, rounded down
    , floor(sum(optically_variable_ink_supply) over(order by `date` asc) / 5) as cumulative_conterfeit_only_optically_variable_ink
    , sum(security_thread_supply) over(order by `date` asc) as cumulative_conterfeit_only_security_thread
    from data_exercise.dwd_conterfeit_material_daily_supply_records
)

, calc_all_restriction_prodection as (
    select
      `date`
    , acid_free_paper_supply
    , optically_variable_ink_supply
    , security_thread_supply
    , cumulative_conterfeit_only_acid_free_paper
    , cumulative_conterfeit_only_optically_variable_ink
    , cumulative_conterfeit_only_security_thread
    -- least() picks the smallest of the three, i.e. the binding constraint
    , least(
        cumulative_conterfeit_only_acid_free_paper,
        cumulative_conterfeit_only_optically_variable_ink,
        cumulative_conterfeit_only_security_thread
      ) as cumulative_conterfeit_all_restriction
    from calc_single_material_restrict_production

)

select
  `date`
, cumulative_conterfeit_only_acid_free_paper
, cumulative_conterfeit_only_optically_variable_ink
, cumulative_conterfeit_only_security_thread
, cumulative_conterfeit_all_restriction
-- Subtract the previous row's value to get each day's production
, cumulative_conterfeit_all_restriction - lag(cumulative_conterfeit_all_restriction, 1, 0) over(order by `date` asc) as action_daily_production
-- Labels: 無酸紙 = acid-free paper, 變色油墨 = color-shifting ink, 安全線 = security thread
, if( cumulative_conterfeit_all_restriction >= 1000000, null, -- target already reached, so no scarcest material is reported
    concat_ws(',',
        if(cumulative_conterfeit_only_acid_free_paper=cumulative_conterfeit_all_restriction, '無酸紙', null),
        if(cumulative_conterfeit_only_optically_variable_ink=cumulative_conterfeit_all_restriction, '變色油墨', null),
        if(cumulative_conterfeit_only_security_thread=cumulative_conterfeit_all_restriction, '安全線', null)
    )
) as `最缺的材料`
from calc_all_restriction_prodection

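The query returns the following, with columns in the order of the select list (in the last column, 變色油墨 is color-shifting ink, 安全線 is security thread, and 無酸紙 is acid-free paper):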

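date	cumulative_conterfeit_only_acid_free_paper	cumulative_conterfeit_only_optically_variable_ink	cumulative_conterfeit_only_security_thread	cumulative_conterfeit_all_restriction	action_daily_production	最缺的材料
2025-05-01	139293	36604	51579	36604	36604	變色油墨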
2025-05-02	300720	184888	133386	133386	96782	安全線
2025-05-03	360345	339815	303165	303165	169779	安全線
2025-05-04	391211	422448	334383	334383	31218	安全線
2025-05-05	454196	496570	426535	426535	92152	安全線
2025-05-06	497465	551300	598018	497465	70930	無酸紙
2025-05-07	664497	665371	646287	646287	148822	安全線
2025-05-08	821998	754988	805933	754988	108701	變色油墨
2025-05-09	938544	914610	910409	910409	155421	安全線
2025-05-10	1050000	1050000	1050001	1050000	139591	null
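As a cross-check, the "scarcest material" column can be recomputed in pandas from the three per-material cumulative columns above. This is only a sketch of the same rule; it assumes, as the query does, that `concat_ws` skips null arguments so that only the tied materials are joined. `TARGET` and `scarcest` are names introduced here for illustration.

```python
import pandas as pd

# The three per-material cumulative columns, copied from the results above.
# Keys reuse the query's labels: 無酸紙 = acid-free paper,
# 變色油墨 = color-shifting ink, 安全線 = security thread.
df = pd.DataFrame(
    {
        "無酸紙": [139293, 300720, 360345, 391211, 454196,
                497465, 664497, 821998, 938544, 1050000],
        "變色油墨": [36604, 184888, 339815, 422448, 496570,
                 551300, 665371, 754988, 914610, 1050000],
        "安全線": [51579, 133386, 303165, 334383, 426535,
                598018, 646287, 805933, 910409, 1050001],
    }
)

TARGET = 1_000_000  # same threshold as the query's if(... >= 1000000, null, ...)
overall = df.min(axis=1)  # equivalent of least() across the three columns


def scarcest(i: int):
    """Return the comma-joined materials tied at the minimum, or None once the target is met."""
    if overall[i] >= TARGET:
        return None
    return ",".join(name for name in df.columns if df.loc[i, name] == overall[i])


labels = [scarcest(i) for i in df.index]
print(labels)
```

The labels come out the same as the last column of the results, with None on the final day where the query returns null.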

The chart of these results was made in Python with the pyvchart library, the Python package for vchart, which ByteDance open-sourced; pyecharts would work as well. The pd.melt function converts a "wide" DataFrame into a "long" one. The relevant code is:

with hive.Connection(host=host_ip, port=port) as conn:
    select_data_sql = ''' the SQL answer given above '''
    df_outcome = pd.read_sql_query(select_data_sql, conn)

from pyvchart import render_chart

spec = {
    "type": 'area',
    "data": [
        {
            # Long-format data for the three per-material lines
            "id": 'lineData',
            "values": pd.melt(df_outcome[[
                'date', 'cumulative_conterfeit_only_acid_free_paper',
                'cumulative_conterfeit_only_optically_variable_ink',
                'cumulative_conterfeit_only_security_thread'
            ]], id_vars=['date']).to_dict(orient='records')
        },
        {
            # Long-format data for the overall (binding) cumulative output
            "id": 'areaData',
            "values": pd.melt(df_outcome[['date', 'cumulative_conterfeit_all_restriction',
                '最缺的材料']], id_vars=['date']).to_dict(orient='records')
        },
    ],
    "series": [
        {
            "type": 'line',
            "dataId": 'lineData',
            "xField": 'date',
            "yField": 'value',
            "seriesField": 'variable',
        },
        {
            "type": 'area',
            "dataId": 'areaData',
            "xField": 'date',
            "yField": 'value',
            "seriesField": 'variable',
        },
    ],
}

display(render_chart(spec))
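To make the wide-to-long step concrete, here is a small standalone sketch of `pd.melt` followed by `to_dict(orient="records")`, which is the shape passed as `values` in the spec above. The two-row table and its short column names `paper` and `ink` are shortened stand-ins, with numbers taken from the first two result rows.

```python
import pandas as pd

# A small "wide" table: one row per date, one column per series.
wide = pd.DataFrame(
    {
        "date": ["2025-05-01", "2025-05-02"],
        "paper": [139293, 300720],
        "ink": [36604, 184888],
    }
)

# "Long" form: one row per (date, series) pair, with the series name in
# `variable` and its number in `value`.
long = pd.melt(wide, id_vars=["date"])
records = long.to_dict(orient="records")

print(list(long.columns))  # ['date', 'variable', 'value']
print(len(records))        # 4
print(records[0])
```

Each record then carries the x field (`date`), the y field (`value`), and the series name (`variable`) that the chart spec refers to.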

————

I am currently looking for a data role (mainly data analysis or data science). If you have a suitable opening, please get in touch; I can start immediately and am open to any city. You can send me a private message or *******************.
