Retriever

load_dotenv()
True

Load Data

Loading the files (metadata, train, test) and viewing them.

md = eda.metadata()
md.head()
id type title year citation url
0 amazon2023 report 2023 Amazon Sustainability Report 2023 Amazon Staff. (2023). Amazon Sustainability Re... https://sustainability.aboutamazon.com/2023-am...
1 chen2024 paper Efficient Heterogeneous Large Language Model D... 2024 Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingx... https://arxiv.org/pdf/2405.01814
2 chung2025 paper The ML.ENERGY Benchmark: Toward Automated Infe... 2025 Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan... https://arxiv.org/pdf/2505.06371
3 cottier2024 paper The Rising Costs of Training Frontier AI Models 2024 Ben Cottier, Robi Rahman, Loredana Fattorini, ... https://arxiv.org/pdf/2405.21015
4 dodge2022 paper Measuring the Carbon Intensity of AI in Cloud ... 2022 Jesse Dodge, Taylor Prewitt, Remi Tachet Des C... https://arxiv.org/pdf/2206.05229
qa = eda.train()
qa.head()
id question answer answer_value answer_unit ref_id ref_url supporting_materials explanation
0 q003 What is the name of the benchmark suite presen... The ML.ENERGY Benchmark ML.ENERGY Benchmark is_blank ['chung2025'] ['https://arxiv.org/pdf/2505.06371'] We present the ML.ENERGY Benchmark, a benchmar... Quote
1 q009 What were the net CO2e emissions from training... 4.3 tCO2e 4.3 tCO2e ['patterson2021'] ['https://arxiv.org/pdf/2104.10350'] "Training GShard-600B used 24 MWh and produced... Quote
2 q054 What is the model size in gigabytes (GB) for t... 64.7 GB 64.7 GB ['chen2024'] ['https://arxiv.org/pdf/2405.01814'] Table 3: Large language models used for evalua... Table 3
3 q062 What was the total electricity consumption of ... Unable to answer with confidence based on the ... is_blank MWh is_blank is_blank is_blank is_blank
4 q075 True or False: Hyperscale data centers in 2020... TRUE 1 is_blank ['wu2021b','patterson2021'] ['https://arxiv.org/abs/2108.06738','https://a... Wu 2021, body text near Fig. 1: "…between trad... The >40% statement is explicit in Wu. Patterso...
tst = eda.test()
tst.head()
id question answer answer_value answer_unit ref_id ref_url supporting_materials explanation
0 q001 What was the average increase in U.S. data cen... NaN NaN percent NaN NaN NaN NaN
1 q002 In 2023, what was the estimated amount of cars... NaN NaN cars NaN NaN NaN NaN
2 q004 How many data centers did AWS begin using recy... NaN NaN data centers NaN NaN NaN NaN
3 q005 Since NVIDIA doesn't release the embodied carb... NaN NaN kg/GPU NaN NaN NaN NaN
4 q006 By what factor was the estimated amortized tra... NaN NaN ratio NaN NaN NaN NaN

We have to fill in the answer, answer_value, answer_unit, ref_id, ref_url, supporting_materials, and explanation columns here.

The competition expects the following values:

  • answer: A clear natural-language response (e.g., 1438 lbs, Water consumption, TRUE). If no answer is possible, use “Unable to answer with confidence based on the provided documents.”

  • answer_value: The normalized numeric or categorical value (e.g., 1438, Water consumption, 1)

    • If no answer is possible, use is_blank
    • Ranges should be encoded as [low,high]
    • Do not include symbols like <, >, ~ here. Those can be left in the natural-language answer column.
  • answer_unit: Unit of measurement (e.g., lbs, kWh, gCO2, projects, is_blank).

  • ref_id: One or more document IDs from metadata.csv that support the answer.

  • ref_url: One or more URL(s) of the cited document(s).

  • supporting_materials: Verbatim justification from the cited document (quote, table reference, figure reference, etc.).

  • explanation: Short reasoning describing why the cited material supports the answer.
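A minimal sanity check over the answer_value rules above can catch formatting slips before submission. This is a hypothetical helper, not part of the competition tooling:

```python
import re

def check_answer_value(value: str) -> bool:
    """Hypothetical check of the answer_value conventions listed above."""
    if value == "is_blank":               # unanswerable rows use the sentinel
        return True
    if re.search(r"[<>~]", value):        # symbols belong in `answer`, not here
        return False
    if value.startswith("["):             # ranges must be encoded as [low,high]
        return bool(re.fullmatch(r"\[[^,\]]+,[^,\]]+\]", value))
    return True
```

Applying it to a few values: `check_answer_value("1438")` and `check_answer_value("[10,20]")` pass, while `check_answer_value("~1438")` fails.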

Read PDF

I have already downloaded all the PDFs; see notebook 00_eda.

We will extract the content from the PDFs here using answerdotai’s contextkit library, which uses pypdf under the hood.

pypdf does a decent job of extracting text from PDFs, but it does not preserve layout, table structure, or reading order.

get_metadata('chen2024')
{'id': 'chen2024',
 'type': 'paper',
 'title': 'Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
 'year': 2024,
 'citation': 'Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
 'url': 'https://arxiv.org/pdf/2405.01814'}
doc_id = 'chen2024'
fc.test_eq(get_metadata(doc_id)['id'], doc_id)
doc = read_doc('chen2024')
doc.content[:100], doc.id
('Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan C',
 'chen2024')
doc_id='chen2024'
doc = read_doc(doc_id)
fc.test_ne(len(doc.content), 0)
fc.test_eq(doc.id, doc_id)

Read Markdown

len(read_markdown('chen2024').content), len(read_doc('chen2024').content)
(75025, 69175)

Total content size

fc.L(eda.metadata()['id'].to_list()).map(lambda x: len(read_doc(x).content)).sum()
2673613
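As a rough sanity check on that number, using the common heuristic of about 4 characters per token (an assumption, not a measurement):

```python
total_chars = 2_673_613           # total content size from the cell above
approx_tokens = total_chars // 4  # rough heuristic: ~4 characters per token
print(approx_tokens)              # on the order of 668k tokens
```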

I don’t think any open-source model can handle that many characters in its context window as of November 2025.

A RAG-based system is a good fit here: we chunk the content, retrieve the relevant chunks, and generate an answer from those chunks.
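The retrieve-then-generate loop can be sketched in a few lines. This is a toy illustration with made-up chunks and a naive word-overlap score, not the notebook’s actual search (which comes later):

```python
def top_k_chunks(query, chunks, k=2):
    # Score each chunk by how many query words it shares, keep the best k.
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:k]

chunks = [
    "Training GShard-600B used 24 MWh of electricity.",
    "The attention operator is memory-intensive during decoding.",
    "Hyperscale data centers grew rapidly between 2015 and 2020.",
]
ctx = "\n\n".join(top_k_chunks("How much electricity did GShard training use?", chunks, k=1))
prompt = f"Answer using only this context:\n{ctx}\n\nQuestion: ..."
```

The retrieved chunks become the grounding context pasted into the prompt; everything below builds better versions of each step.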

Document Chunks

content, metadata = get_content_metadata(read_doc, 'chen2024')
content[:100], metadata
('Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan C',
 {'id': 'chen2024',
  'type': 'paper',
  'title': 'Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
  'year': 2024,
  'citation': 'Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
  'url': 'https://arxiv.org/pdf/2405.01814'})
doc = read_doc('chen2024')
len(doc.content)
69175
chunks = chunk_doc('chen2024')
len(chunks), chunks[0]['text'][-200:]
(50,
 ' performance and cost efficiency. Our com-\nprehensive analysis and experiments confirm the viability\nof splitting the attention computation over multiple devices.\nAlso, the communication bandwidth req')
doc_id = 'chen2024'
chunks = chunk_doc(doc_id)
fc.test_ne(len(chunks), 0)
fc.test_eq(chunks[0].id, doc_id)
chunks[0]['text'][:200]
'Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan Chen1 Wencong Xiao2 Yutong Lin1 Mingxing Zhang1 Yingdi Shan1 Jinlei Jiang1\nKang Chen1 Yongwei Wu1\n1Ts'
chunks[0]
namespace(text='Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan Chen1 Wencong Xiao2 Yutong Lin1 Mingxing Zhang1 Yingdi Shan1 Jinlei Jiang1\nKang Chen1 Yongwei Wu1\n1Tsinghua University\n2ByteDance\nAbstract\nTransformer-based large language models (LLMs) exhibit\nimpressive performance in generative tasks but also intro-\nduce significant challenges in real-world serving due to in-\nefficient use of the expensive, computation-optimized accel-\nerators. Although disaggregated serving architectures have\nbeen proposed to split different phases of LLM inference, the\nefficiency of decoding phase is still low. This is caused by\nthe varying resource demands of different operators in the\ntransformer-based LLMs. Specifically, the attention operator\nis memory-intensive, exhibiting a memory access pattern that\nclashes with the strengths of modern accelerators, especially\nfor long context requests.\nTo enhance the efficiency of LLM decoding, we introduce\nmodel-attention disaggregation. This approach leverages a\ncollection of cheap, memory-optimized devices for the atten-\ntion operator while still utilizing high-end accelerators for\nother parts of the model. This heterogeneous setup ensures\nthat each component is tailored to its specific workload, max-\nimizing overall performance and cost efficiency. Our com-\nprehensive analysis and experiments confirm the viability\nof splitting the attention computation over multiple devices.\nAlso, the communication bandwidth req',
          chunk_id=0,
          id='chen2024',
          type='paper',
          title='Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
          year=2024,
          citation='Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
          url='https://arxiv.org/pdf/2405.01814')
chunks[-1]
namespace(text='i Chen, Christopher De-\nwan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi-\nhaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel\nSimig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang,\nand Luke Zettlemoyer. Opt: Open pre-trained trans-\nformer language models, 2022.\n[59] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu,\nYibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Dist-\nServe: Disaggregating prefill and decoding for goodput-\noptimized large language model serving. In 18th\nUSENIX Symposium on Operating Systems Design and\nImplementation (OSDI 24), pages 193–210, 2024.\n16',
          chunk_id=49,
          id='chen2024',
          type='paper',
          title='Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
          year=2024,
          citation='Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
          url='https://arxiv.org/pdf/2405.01814')

Markdown Chunks

chunk_size = 375
chunk_overlap = 125
md_splitter = MarkdownTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
md_splitter
<langchain_text_splitters.markdown.MarkdownTextSplitter>
doc_id = 'chen2024'
md_content = read_markdown(doc_id).content
chunks = md_splitter.split_text(md_content)
chunks[0][-1200:]
'# Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation\n\nShaoyuan Chen<sup>1</sup> Wencong Xiao<sup>2</sup> Yutong Lin<sup>1</sup> Mingxing Zhang<sup>1</sup> Yingdi Shan<sup>1</sup> Jinlei Jiang<sup>1</sup>  \nKang Chen<sup>1</sup> Yongwei Wu<sup>1</sup>\n\n<sup>1</sup>Tsinghua University\n\n<sup>2</sup>ByteDance\n\n## Abstract\n\nTransformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.'
chunks[1][:1200]
'Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.\n\nTo enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over mult'
chunks = md_splitter.chunk_markdown(doc_id)
chunks[0].text[-900:]
'p> Yutong Lin<sup>1</sup> Mingxing Zhang<sup>1</sup> Yingdi Shan<sup>1</sup> Jinlei Jiang<sup>1</sup>  \nKang Chen<sup>1</sup> Yongwei Wu<sup>1</sup>\n\n<sup>1</sup>Tsinghua University\n\n<sup>2</sup>ByteDance\n\n## Abstract\n\nTransformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.'
chunks[1].text[:900]
'Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.\n\nTo enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end'
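The overlap explains why chunks[0] ends with the same abstract sentences that chunks[1] begins with. A character-level sketch of the same sliding-window idea (the real splitter counts tiktoken tokens and respects markdown boundaries):

```python
def sliding_chunks(text, chunk_size=10, overlap=4):
    # Each chunk starts (chunk_size - overlap) characters after the previous
    # one, so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

parts = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=4)
```

Here the tail of each chunk repeats as the head of the next, which keeps a sentence that straddles a boundary retrievable from either side.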

Human-readable Chunk

This will help later when creating context for the prompt.

print(Nugget(chunks[42]))
### Chunk 0
            Text: ![](_page_8_Figure_0.jpeg)

(a) Request-level partition.

(b) Head-level partition.

Figure 9: Work partition methods of the attention operator.

store the KV caches and compute the attention operators. As depicted in Figure 9, the attention operators can be parallelized among memory devices in various ways. One method is to distribute different requests across different devices; an alternative strategy is to partition and distribute the attention heads, which can also be computed independently, to different devices. The head-level partitioning approach ensures a balanced workload distribution, whereas the request-level partitioning may result in load imbalance due to the differences in sequence lengths and therefore the KV cache sizes among requests. However, head-level partitioning has limited flexibility, as it requires the number of memory devices to be divisible by the number of attention heads. We opt for head-level partitioning in Lamina, which offers optimal load balancing.

## 6 Evaluation

**Testbed.** We deploy Lamina on a real heterogeneous cluster with two kinds of GPU nodes. Each node consists of either eight H100 or H20 GPUs, and each GPU is paired with a dedicated ConnectX-7 NIC via PCIe switch. The GPU nodes are interconnected with 400 Gbps RoCE network. We use H100 as compute-optimized GPUs and H20 as memory-optimized GPUs for Lamina.
            Chunk Id: 42
            Doc ID: chen2024
            Type: paper
            Title: Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation
            Year: 2024
            Citation: Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814
            URL: https://arxiv.org/pdf/2405.01814
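A minimal version of what Nugget does, flattening a chunk’s text and citation metadata into a prompt-ready block (field names taken from the printout above; the exact formatting here is illustrative):

```python
def render_chunk(c, n=0):
    # Flatten a chunk's text plus its citation metadata into one string
    # that can be dropped straight into the prompt context.
    return (f"### Chunk {n}\n"
            f"Text: {c['text']}\n"
            f"Doc ID: {c['id']}\n"
            f"Title: {c['title']}\n"
            f"URL: {c['url']}")

block = render_chunk({'text': 'Example passage.', 'id': 'chen2024',
                      'title': 'Model-Attention Disaggregation',
                      'url': 'https://arxiv.org/pdf/2405.01814'})
```

Keeping the doc ID and URL next to the text makes it easy for the generator to fill ref_id and ref_url from the same context it quotes.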

Chunk all the docs

all_chunks = chunk_all(chunk_doc)
len(all_chunks)
1927
all_chunks[0]
namespace(text='Amazon \nSustainability \nReport\n2023 Contents\nOverview\n3 Introduction\n4 A Letter from Our Chief \nSustainability Officer\xa0\n5 How We Work\n6 Goals Summary\n7 2023 Year in Review \xa0\nEnvironment\n9 Carbon\n24 Carbon-Free Energy\n29 Packaging \n34 Waste and Circularity\n40 Water\nValue Chain\n45 Human Rights \n50 Responsible Supply Chain\n58 Sustainable Products and \nMaterials \n64 Supplier Diversity \n67 Community Impact\nPeople\n75 Employee Experience\n81 Health and Safety\n86 Inclusive Experiences\nAppendix\n94  Sustainability Reporting Topic \nAssessment\n95  Endnotes\n96 Assurance Statements \n97 Disclaimer and Forward-Looking \nStatements \nOn the cover  \nThe Baldy Mesa Solar and Storage Project (developed \nand operated by AES), located in Adelanto, California. Employees inside one of our newest office buildings in Bellevue, \nWashington.\nIntroduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Letter\nAbout This Report\nThis is our sixth annual report detailing progress against \nour goals\xa0  and environmental, social, and governance \ntopics. All financial figures are reported in U.S. dollars ($), \nunless otherwise stated. The data within this report reflects \nprogress from January 1 through December 31, 2023, unless \notherwise indicated. This report includes information about \nmany business units and subsidiaries including AWS, Devices, \nFresh, Whole Foods Market, Amazon Private Brands, Twitch, \nMGM Studios, and Ring.\nOur 2023 Sustainability Report is structured into three \nmain categories: Environment',
          chunk_id=0,
          id='amazon2023',
          type='report',
          title='2023 Amazon Sustainability Report',
          year=2023,
          citation='Amazon Staff. (2023). Amazon Sustainability Report. https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf',
          url='https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf')
all_chunks[-1]
namespace(text='\nWeidinger, L., Mellor, J., et al.: Ethical and social risks of harm from language models.\narXiv preprint arXiv:2112.04359 (2021)\n25',
          chunk_id=1926,
          id='zschache2025',
          type='paper',
          title='Comparing energy consumption and accuracy in text classification inference',
          year=2025,
          citation='Johannes Zschache, & Tilman Hartwig (2025). Comparing energy consumption and accuracy in text classification inference arXiv. https://arxiv.org/pdf/2508.14170',
          url='https://arxiv.org/pdf/2508.14170 ')

Neighbour Chunks

Chunks(all_chunks).get_chunk(1850)
namespace(text='ne-tune the full BlackMamba model (i.e.,\noriginal weight matrices), whereas employed QLoRA [15]\nfor parameter-efficient fine-tuning (PEFT) on Mixtral due to\nGPU memory capacity budget. For QLoRA, we target the\nMoE layers, including the routers, and set the rank of the\nLoRA modules to 16. We enable FlashAttention2 [17] during\nMixtral fine-tuning for enhanced efficiency. Moreover, we use\ngradient checkpointing [18] to save memory usage.\nDatasets. Our fine-tuning process is implemented in Py-\nTorch using the LLaMA-Factory framework [19], with a\nlearning rate of 5e-5 and 10 epochs. Both models were fine-\ntuned on two datasets focused on different tasks: common-\nsense 15k (CS) and Math 14k (MATH), which address com-\nmonsense reasoning and arithmetic reasoning respectively\n(provided by LLM-adapters [20]). The details of datasets\nare used in Table II. For evaluation, we tested the models\non GSM8K [21] for arithmetic reasoning and HE [22] for\ncommonsense reasoning. Each dataset consists of thousands\nof queries. We define a query as the concatenation of a\nprompt and its ground-truth answer, which is feed to LLMs\nfor fine-tuning.\nProfiling experiments. We evaluate the fine-tuning pro-\ncess from both software and hardware perspectives. The\nsoftware evaluation includes an end-to-end assessment of\nthe fine-tuning process and measures the performance of\nthe two models on various tasks post-fine-tuning. Using\nPyTorch, we provide essential algorithm-level information\nsuch as test accuracy, t',
          chunk_id=1850,
          id='xia2024',
          type='paper',
          title='Understanding the Performance and Estimating the Cost of LLM Fine-Tuning',
          year=2024,
          citation='Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Hao, Nishil Talati. (2024). Understanding the Performance and Estimating the Cost of LLM Fine-Tuning. arXiv. https://arxiv.org/pdf/2408.04693',
          url='https://arxiv.org/pdf/2408.04693')
left_chunk, right_chunk = Chunks(all_chunks).get_neighbours(1850)
fc.L(left_chunk, right_chunk).attrgot('chunk_id')
(#2) [1849,1851]
some_dup_chunks = fc.L(Chunks(all_chunks).get_chunk(cid) for cid in [1850, 1851, 1852, 1853, 1850, 1851, 1852])
Chunks.unique(some_dup_chunks).attrgot('chunk_id')
(#4) [1850,1851,1852,1853]
ans = Chunks(all_chunks).include_neighbours(some_dup_chunks)
ans.attrgot('chunk_id')
(#6) [1849,1850,1851,1852,1853,1854]
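include_neighbours above amounts to: expand each hit by one chunk on either side, clamp to valid ids, then deduplicate and sort. A sketch over bare chunk ids (the real method works on the namespace objects):

```python
def include_neighbours(ids, lo=0, hi=1926):
    # Add the chunk before and after each hit, keep ids in range, dedupe.
    expanded = {j for i in ids for j in (i - 1, i, i + 1) if lo <= j <= hi}
    return sorted(expanded)

ids = include_neighbours([1850, 1851, 1852, 1853])
```

This reproduces the [1849 … 1854] result above: four unique hits grow to six chunks once their flanking neighbours are pulled in.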

Hybrid: Rerank

Here we will rerank the outputs from semantic search and lexical search using a reranker model.

There are other ways to combine the outputs of the two searches, such as Reciprocal Rank Fusion (RRF) and linear combination, which you can try later.
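For the curious, RRF needs no tuned weights: each ranked list contributes 1/(k + rank) per item, so chunks ranked highly by both searches float to the top. A small sketch with illustrative chunk ids:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: sum 1/(k + rank) across all ranked lists.
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = [530, 6, 12]   # chunk ids from semantic search (illustrative)
lexical  = [530, 99, 6]   # chunk ids from lexical search (illustrative)
fused = rrf([semantic, lexical])
```

The constant k=60 comes from the original RRF paper and mostly dampens the influence of any single list’s top ranks.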

combined_res = combine_chunks(semantic_res, lexical_res)
len(combined_res)
2
combined_res.attrgot('chunk_id')
(#2) [530,6]
ranker = utils.Reranker()
query[:100]
'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'
combined_res[-1].text[:100]
'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'
ranker.rerank_chunks(combined_res[-1].text, combined_res)[0].text[:100]
'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'
hs = HybridSearch(ls, ss)
chunks_res = hs.search(combined_res[-1].text)
chunks_res[0].text[:100]
'23 Amazon Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We Work'
hs = HybridSearch(ls, ss, neighbour_chunks=True)
hs.search(combined_res[-1].text, n=1).attrgot('chunk_id')
(#3) [10,11,12]