This notebook is an introduction to the data in the WattBot 2025 competition.
I participated in this competition a few months back and was ranked 5th on the private leaderboard and 10th on the public leaderboard, out of 182 entrants.
Leaderboard Ranking
Looking at the data
There are 3 files shared in the competition:
- metadata.csv
- test_Q.csv
- train_QA.csv

I created a `data` folder and downloaded all three files into it.
metadata().head()
| | id | type | title | year | citation | url |
|---|---|---|---|---|---|---|
| 0 | amazon2023 | report | 2023 Amazon Sustainability Report | 2023 | Amazon Staff. (2023). Amazon Sustainability Re... | https://sustainability.aboutamazon.com/2023-am... |
| 1 | chen2024 | paper | Efficient Heterogeneous Large Language Model D... | 2024 | Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingx... | https://arxiv.org/pdf/2405.01814 |
| 2 | chung2025 | paper | The ML.ENERGY Benchmark: Toward Automated Infe... | 2025 | Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan... | https://arxiv.org/pdf/2505.06371 |
| 3 | cottier2024 | paper | The Rising Costs of Training Frontier AI Models | 2024 | Ben Cottier, Robi Rahman, Loredana Fattorini, ... | https://arxiv.org/pdf/2405.21015 |
| 4 | dodge2022 | paper | Measuring the Carbon Intensity of AI in Cloud ... | 2022 | Jesse Dodge, Taylor Prewitt, Remi Tachet Des C... | https://arxiv.org/pdf/2206.05229 |
len(metadata())
32
There are 32 reference documents in the metadata, one per cited report or paper.
train().head()
| | id | question | answer | answer_value | answer_unit | ref_id | ref_url | supporting_materials | explanation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | q003 | What is the name of the benchmark suite presen... | The ML.ENERGY Benchmark | ML.ENERGY Benchmark | is_blank | ['chung2025'] | ['https://arxiv.org/pdf/2505.06371'] | We present the ML.ENERGY Benchmark, a benchmar... | Quote |
| 1 | q009 | What were the net CO2e emissions from training... | 4.3 tCO2e | 4.3 | tCO2e | ['patterson2021'] | ['https://arxiv.org/pdf/2104.10350'] | "Training GShard-600B used 24 MWh and produced... | Quote |
| 2 | q054 | What is the model size in gigabytes (GB) for t... | 64.7 GB | 64.7 | GB | ['chen2024'] | ['https://arxiv.org/pdf/2405.01814'] | Table 3: Large language models used for evalua... | Table 3 |
| 3 | q062 | What was the total electricity consumption of ... | Unable to answer with confidence based on the ... | is_blank | MWh | is_blank | is_blank | is_blank | is_blank |
| 4 | q075 | True or False: Hyperscale data centers in 2020... | TRUE | 1 | is_blank | ['wu2021b','patterson2021'] | ['https://arxiv.org/abs/2108.06738','https://a... | Wu 2021, body text near Fig. 1: "…between trad... | The >40% statement is explicit in Wu. Patterso... |
get_train_data()
namespace(id='q166',
question='Which of the following five large NLP DNNs has the highest energy consumption: Meena, T5, GPT-3, GShard-600B, or Switch Transformer?',
answer='GPT-3',
answer_value='GPT-3',
answer_unit='is_blank',
ref_id="['patterson2021']",
ref_url="['https://arxiv.org/pdf/2104.10350']",
supporting_materials='Figure 3',
explanation='Figure')
get_train_data(0)
namespace(id='q003',
question='What is the name of the benchmark suite presented in a recent paper for measuring inference energy consumption?',
answer='The ML.ENERGY Benchmark',
answer_value='ML.ENERGY Benchmark',
answer_unit='is_blank',
ref_id="['chung2025']",
ref_url="['https://arxiv.org/pdf/2505.06371']",
supporting_materials='We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments...',
explanation='Quote')
get_value(train().iloc[0])
namespace(id='q003',
question='What is the name of the benchmark suite presented in a recent paper for measuring inference energy consumption?',
answer='The ML.ENERGY Benchmark',
answer_value='ML.ENERGY Benchmark',
answer_unit='is_blank',
ref_id="['chung2025']",
ref_url="['https://arxiv.org/pdf/2505.06371']",
supporting_materials='We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments...',
explanation='Quote')
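From the calls above, `get_train_data()` returns a random training example, `get_train_data(i)` returns the i-th one, and `get_value(row)` wraps a DataFrame row so its fields read as attributes. A sketch of how these helpers might look (the names match the notebook, but the implementations and the `data/train_QA.csv` path are my assumptions):

```python
# Row-access helpers: wrap a pandas row in a SimpleNamespace so fields
# read as attributes (ex.question, ex.answer, ...) instead of ex["question"].
import random
from types import SimpleNamespace

import pandas as pd

def train() -> pd.DataFrame:
    """Load the training questions (assumed path)."""
    return pd.read_csv("data/train_QA.csv")

def get_value(row: pd.Series) -> SimpleNamespace:
    """Convert one DataFrame row into an attribute-style namespace."""
    return SimpleNamespace(**row.to_dict())

def get_train_data(idx=None) -> SimpleNamespace:
    """Return the idx-th training example, or a random one when idx is None."""
    df = train()
    if idx is None:
        idx = random.randrange(len(df))
    return get_value(df.iloc[idx])
```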