Retriever

load_dotenv()
True

Load Data

Loading the files (metadata, train, test) and viewing them.

md = eda.metadata()
md.head()
id type title year citation url
0 amazon2023 report 2023 Amazon Sustainability Report 2023 Amazon Staff. (2023). Amazon Sustainability Re... https://sustainability.aboutamazon.com/2023-am...
1 chen2024 paper Efficient Heterogeneous Large Language Model D... 2024 Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingx... https://arxiv.org/pdf/2405.01814
2 chung2025 paper The ML.ENERGY Benchmark: Toward Automated Infe... 2025 Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan... https://arxiv.org/pdf/2505.06371
3 cottier2024 paper The Rising Costs of Training Frontier AI Models 2024 Ben Cottier, Robi Rahman, Loredana Fattorini, ... https://arxiv.org/pdf/2405.21015
4 dodge2022 paper Measuring the Carbon Intensity of AI in Cloud ... 2022 Jesse Dodge, Taylor Prewitt, Remi Tachet Des C... https://arxiv.org/pdf/2206.05229
qa = eda.train()
qa.head()
id question answer answer_value answer_unit ref_id ref_url supporting_materials explanation
0 q003 What is the name of the benchmark suite presen... The ML.ENERGY Benchmark ML.ENERGY Benchmark is_blank ['chung2025'] ['https://arxiv.org/pdf/2505.06371'] We present the ML.ENERGY Benchmark, a benchmar... Quote
1 q009 What were the net CO2e emissions from training... 4.3 tCO2e 4.3 tCO2e ['patterson2021'] ['https://arxiv.org/pdf/2104.10350'] "Training GShard-600B used 24 MWh and produced... Quote
2 q054 What is the model size in gigabytes (GB) for t... 64.7 GB 64.7 GB ['chen2024'] ['https://arxiv.org/pdf/2405.01814'] Table 3: Large language models used for evalua... Table 3
3 q062 What was the total electricity consumption of ... Unable to answer with confidence based on the ... is_blank MWh is_blank is_blank is_blank is_blank
4 q075 True or False: Hyperscale data centers in 2020... TRUE 1 is_blank ['wu2021b','patterson2021'] ['https://arxiv.org/abs/2108.06738','https://a... Wu 2021, body text near Fig. 1: "…between trad... The >40% statement is explicit in Wu. Patterso...
tst = eda.test()
tst.head()
id question answer answer_value answer_unit ref_id ref_url supporting_materials explanation
0 q001 What was the average increase in U.S. data cen... NaN NaN percent NaN NaN NaN NaN
1 q002 In 2023, what was the estimated amount of cars... NaN NaN cars NaN NaN NaN NaN
2 q004 How many data centers did AWS begin using recy... NaN NaN data centers NaN NaN NaN NaN
3 q005 Since NVIDIA doesn't release the embodied carb... NaN NaN kg/GPU NaN NaN NaN NaN
4 q006 By what factor was the estimated amortized tra... NaN NaN ratio NaN NaN NaN NaN

We have to fill in the answer, answer_value, answer_unit, ref_id, ref_url, supporting_materials, and explanation columns here.

The competition expects the following values:

  • answer: A clear natural-language response (e.g., 1438 lbs, Water consumption, TRUE). If no answer is possible, use “Unable to answer with confidence based on the provided documents.”

  • answer_value: The normalized numeric or categorical value (e.g., 1438, Water consumption, 1)

    • If no answer is possible, use is_blank
    • Ranges should be encoded as [low,high]
    • Do not include symbols like <, >, ~ here. Those can be left in the natural-language answer column.
  • answer_unit: Unit of measurement (e.g., lbs, kWh, gCO2, projects, is_blank).

  • ref_id: One or more document IDs from metadata.csv that support the answer.

  • ref_url: One or more URL(s) of the cited document(s).

  • supporting_materials: Verbatim justification from the cited document (quote, table reference, figure reference, etc.).

  • explanation: Short reasoning describing why the cited material supports the answer.
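A minimal sanity check over the answer_value rules above can catch formatting slips before submission. This is a hypothetical helper, not part of the competition tooling:

```python
import re

def check_answer_value(value: str) -> bool:
    """Hypothetical check of the answer_value conventions listed above."""
    if value == "is_blank":               # unanswerable rows use the sentinel
        return True
    if re.search(r"[<>~]", value):        # symbols belong in `answer`, not here
        return False
    if value.startswith("["):             # ranges must be encoded as [low,high]
        return bool(re.fullmatch(r"\[[^,\]]+,[^,\]]+\]", value))
    return True
```

Applying it to a few values: `check_answer_value("1438")` and `check_answer_value("[10,20]")` pass, while `check_answer_value("~1438")` fails.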

Read PDF

I have already downloaded all the PDFs; see notebook 00_eda.

We will extract the content from the PDFs here using answerdotai’s contextkit library, which uses pypdf under the hood.

pypdf does a decent job of extracting text from PDFs, but it does not preserve layout, table structure, or reading order.

get_metadata('chen2024')
{'id': 'chen2024',
 'type': 'paper',
 'title': 'Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
 'year': 2024,
 'citation': 'Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
 'url': 'https://arxiv.org/pdf/2405.01814'}
doc_id = 'chen2024'
fc.test_eq(get_metadata(doc_id)['id'], doc_id)
doc = read_doc('chen2024')
doc.content[:100], doc.id
('Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan C',
 'chen2024')
doc_id='chen2024'
doc = read_doc(doc_id)
fc.test_ne(len(doc.content), 0)
fc.test_eq(doc.id, doc_id)

Read Markdown

len(read_markdown('chen2024').content), len(read_doc('chen2024').content)
(75025, 69175)

Total content size

fc.L(eda.metadata()['id'].to_list()).map(lambda x: len(read_doc(x).content)).sum()
2673613
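As a rough sanity check on that number, using the common heuristic of about 4 characters per token (an assumption, not a measurement):

```python
total_chars = 2_673_613           # total content size from the cell above
approx_tokens = total_chars // 4  # rough heuristic: ~4 characters per token
print(approx_tokens)              # on the order of 668k tokens
```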

I don’t think any open-source model can handle that many characters in its context window as of November 2025.

A RAG-based system is a good fit here: we chunk the content, retrieve the relevant chunks, and generate an answer from those chunks.
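The retrieve-then-generate loop can be sketched in a few lines. This is a toy illustration with made-up chunks and a naive word-overlap score, not the notebook’s actual search (which comes later):

```python
def top_k_chunks(query, chunks, k=2):
    # Score each chunk by how many query words it shares, keep the best k.
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:k]

chunks = [
    "Training GShard-600B used 24 MWh of electricity.",
    "The attention operator is memory-intensive during decoding.",
    "Hyperscale data centers grew rapidly between 2015 and 2020.",
]
ctx = "\n\n".join(top_k_chunks("How much electricity did GShard training use?", chunks, k=1))
prompt = f"Answer using only this context:\n{ctx}\n\nQuestion: ..."
```

The retrieved chunks become the grounding context pasted into the prompt; everything below builds better versions of each step.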

Document Chunks

content, metadata = get_content_metadata(read_doc, 'chen2024')
content[:100], metadata
('Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan C',
 {'id': 'chen2024',
  'type': 'paper',
  'title': 'Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
  'year': 2024,
  'citation': 'Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
  'url': 'https://arxiv.org/pdf/2405.01814'})
doc = read_doc('chen2024')
len(doc.content)
69175
chunks = chunk_doc('chen2024')
len(chunks), chunks[0]['text'][-200:]
(50,
 ' performance and cost efficiency. Our com-\nprehensive analysis and experiments confirm the viability\nof splitting the attention computation over multiple devices.\nAlso, the communication bandwidth req')
doc_id = 'chen2024'
chunks = chunk_doc(doc_id)
fc.test_ne(len(chunks), 0)
fc.test_eq(chunks[0].id, doc_id)
chunks[0]['text'][:200]
'Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan Chen1 Wencong Xiao2 Yutong Lin1 Mingxing Zhang1 Yingdi Shan1 Jinlei Jiang1\nKang Chen1 Yongwei Wu1\n1Ts'
chunks[0]
namespace(text='Efficient Heterogeneous Large Language Model Decoding\nwith Model-Attention Disaggregation\nShaoyuan Chen1 Wencong Xiao2 Yutong Lin1 Mingxing Zhang1 Yingdi Shan1 Jinlei Jiang1\nKang Chen1 Yongwei Wu1\n1Tsinghua University\n2ByteDance\nAbstract\nTransformer-based large language models (LLMs) exhibit\nimpressive performance in generative tasks but also intro-\nduce significant challenges in real-world serving due to in-\nefficient use of the expensive, computation-optimized accel-\nerators. Although disaggregated serving architectures have\nbeen proposed to split different phases of LLM inference, the\nefficiency of decoding phase is still low. This is caused by\nthe varying resource demands of different operators in the\ntransformer-based LLMs. Specifically, the attention operator\nis memory-intensive, exhibiting a memory access pattern that\nclashes with the strengths of modern accelerators, especially\nfor long context requests.\nTo enhance the efficiency of LLM decoding, we introduce\nmodel-attention disaggregation. This approach leverages a\ncollection of cheap, memory-optimized devices for the atten-\ntion operator while still utilizing high-end accelerators for\nother parts of the model. This heterogeneous setup ensures\nthat each component is tailored to its specific workload, max-\nimizing overall performance and cost efficiency. Our com-\nprehensive analysis and experiments confirm the viability\nof splitting the attention computation over multiple devices.\nAlso, the communication bandwidth req',
          chunk_id=0,
          id='chen2024',
          type='paper',
          title='Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
          year=2024,
          citation='Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
          url='https://arxiv.org/pdf/2405.01814')
chunks[-1]
namespace(text='i Chen, Christopher De-\nwan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi-\nhaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel\nSimig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang,\nand Luke Zettlemoyer. Opt: Open pre-trained trans-\nformer language models, 2022.\n[59] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu,\nYibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Dist-\nServe: Disaggregating prefill and decoding for goodput-\noptimized large language model serving. In 18th\nUSENIX Symposium on Operating Systems Design and\nImplementation (OSDI 24), pages 193–210, 2024.\n16',
          chunk_id=49,
          id='chen2024',
          type='paper',
          title='Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation',
          year=2024,
          citation='Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814',
          url='https://arxiv.org/pdf/2405.01814')

Markdown Chunks

chunk_size = 375
chunk_overlap = 125
md_splitter = MarkdownTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
md_splitter
<langchain_text_splitters.markdown.MarkdownTextSplitter>
doc_id = 'chen2024'
md_content = read_markdown(doc_id).content
chunks = md_splitter.split_text(md_content)
chunks[0][-1200:]
'# Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation\n\nShaoyuan Chen<sup>1</sup> Wencong Xiao<sup>2</sup> Yutong Lin<sup>1</sup> Mingxing Zhang<sup>1</sup> Yingdi Shan<sup>1</sup> Jinlei Jiang<sup>1</sup>  \nKang Chen<sup>1</sup> Yongwei Wu<sup>1</sup>\n\n<sup>1</sup>Tsinghua University\n\n<sup>2</sup>ByteDance\n\n## Abstract\n\nTransformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.'
chunks[1][:1200]
'Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.\n\nTo enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over mult'
chunks = md_splitter.chunk_markdown(doc_id)
chunks[0].text[-900:]
'p> Yutong Lin<sup>1</sup> Mingxing Zhang<sup>1</sup> Yingdi Shan<sup>1</sup> Jinlei Jiang<sup>1</sup>  \nKang Chen<sup>1</sup> Yongwei Wu<sup>1</sup>\n\n<sup>1</sup>Tsinghua University\n\n<sup>2</sup>ByteDance\n\n## Abstract\n\nTransformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.'
chunks[1].text[:900]
'Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests.\n\nTo enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end'
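The overlap explains why chunks[0] ends with the same abstract sentences that chunks[1] begins with. A character-level sketch of the same sliding-window idea (the real splitter counts tiktoken tokens and respects markdown boundaries):

```python
def sliding_chunks(text, chunk_size=10, overlap=4):
    # Each chunk starts (chunk_size - overlap) characters after the previous
    # one, so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

parts = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=4)
```

Here the tail of each chunk repeats as the head of the next, which keeps a sentence that straddles a boundary retrievable from either side.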

Human-readable Chunk

This will help later when creating context for the prompt.

print(Nugget(chunks[42]))
### Chunk 0
            Text: ![](_page_8_Figure_0.jpeg)

(a) Request-level partition.

(b) Head-level partition.

Figure 9: Work partition methods of the attention operator.

store the KV caches and compute the attention operators. As depicted in Figure 9, the attention operators can be parallelized among memory devices in various ways. One method is to distribute different requests across different devices; an alternative strategy is to partition and distribute the attention heads, which can also be computed independently, to different devices. The head-level partitioning approach ensures a balanced workload distribution, whereas the request-level partitioning may result in load imbalance due to the differences in sequence lengths and therefore the KV cache sizes among requests. However, head-level partitioning has limited flexibility, as it requires the number of memory devices to be divisible by the number of attention heads. We opt for head-level partitioning in Lamina, which offers optimal load balancing.

## 6 Evaluation

**Testbed.** We deploy Lamina on a real heterogeneous cluster with two kinds of GPU nodes. Each node consists of either eight H100 or H20 GPUs, and each GPU is paired with a dedicated ConnectX-7 NIC via PCIe switch. The GPU nodes are interconnected with 400 Gbps RoCE network. We use H100 as compute-optimized GPUs and H20 as memory-optimized GPUs for Lamina.
            Chunk Id: 42
            Doc ID: chen2024
            Type: paper
            Title: Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation
            Year: 2024
            Citation: Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu. (2024). Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation. arXiv. https://arxiv.org/pdf/2405.01814
            URL: https://arxiv.org/pdf/2405.01814
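A minimal version of what Nugget does, flattening a chunk’s text and citation metadata into a prompt-ready block (field names taken from the printout above; the exact formatting here is illustrative):

```python
def render_chunk(c, n=0):
    # Flatten a chunk's text plus its citation metadata into one string
    # that can be dropped straight into the prompt context.
    return (f"### Chunk {n}\n"
            f"Text: {c['text']}\n"
            f"Doc ID: {c['id']}\n"
            f"Title: {c['title']}\n"
            f"URL: {c['url']}")

block = render_chunk({'text': 'Example passage.', 'id': 'chen2024',
                      'title': 'Model-Attention Disaggregation',
                      'url': 'https://arxiv.org/pdf/2405.01814'})
```

Keeping the doc ID and URL next to the text makes it easy for the generator to fill ref_id and ref_url from the same context it quotes.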

Chunk all the docs

all_chunks = chunk_all(chunk_doc)
len(all_chunks)
1927
all_chunks[0]
namespace(text='Amazon \nSustainability \nReport\n2023 Contents\nOverview\n3 Introduction\n4 A Letter from Our Chief \nSustainability Officer\xa0\n5 How We Work\n6 Goals Summary\n7 2023 Year in Review \xa0\nEnvironment\n9 Carbon\n24 Carbon-Free Energy\n29 Packaging \n34 Waste and Circularity\n40 Water\nValue Chain\n45 Human Rights \n50 Responsible Supply Chain\n58 Sustainable Products and \nMaterials \n64 Supplier Diversity \n67 Community Impact\nPeople\n75 Employee Experience\n81 Health and Safety\n86 Inclusive Experiences\nAppendix\n94  Sustainability Reporting Topic \nAssessment\n95  Endnotes\n96 Assurance Statements \n97 Disclaimer and Forward-Looking \nStatements \nOn the cover  \nThe Baldy Mesa Solar and Storage Project (developed \nand operated by AES), located in Adelanto, California. Employees inside one of our newest office buildings in Bellevue, \nWashington.\nIntroduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Letter\nAbout This Report\nThis is our sixth annual report detailing progress against \nour goals\xa0  and environmental, social, and governance \ntopics. All financial figures are reported in U.S. dollars ($), \nunless otherwise stated. The data within this report reflects \nprogress from January 1 through December 31, 2023, unless \notherwise indicated. This report includes information about \nmany business units and subsidiaries including AWS, Devices, \nFresh, Whole Foods Market, Amazon Private Brands, Twitch, \nMGM Studios, and Ring.\nOur 2023 Sustainability Report is structured into three \nmain categories: Environment',
          chunk_id=0,
          id='amazon2023',
          type='report',
          title='2023 Amazon Sustainability Report',
          year=2023,
          citation='Amazon Staff. (2023). Amazon Sustainability Report. https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf',
          url='https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf')
all_chunks[-1]
namespace(text='\nWeidinger, L., Mellor, J., et al.: Ethical and social risks of harm from language models.\narXiv preprint arXiv:2112.04359 (2021)\n25',
          chunk_id=1926,
          id='zschache2025',
          type='paper',
          title='Comparing energy consumption and accuracy in text classification inference',
          year=2025,
          citation='Johannes Zschache, & Tilman Hartwig (2025). Comparing energy consumption and accuracy in text classification inference arXiv. https://arxiv.org/pdf/2508.14170',
          url='https://arxiv.org/pdf/2508.14170 ')

Neighbour Chunks

Chunks(all_chunks).get_chunk(1850)
namespace(text='ne-tune the full BlackMamba model (i.e.,\noriginal weight matrices), whereas employed QLoRA [15]\nfor parameter-efficient fine-tuning (PEFT) on Mixtral due to\nGPU memory capacity budget. For QLoRA, we target the\nMoE layers, including the routers, and set the rank of the\nLoRA modules to 16. We enable FlashAttention2 [17] during\nMixtral fine-tuning for enhanced efficiency. Moreover, we use\ngradient checkpointing [18] to save memory usage.\nDatasets. Our fine-tuning process is implemented in Py-\nTorch using the LLaMA-Factory framework [19], with a\nlearning rate of 5e-5 and 10 epochs. Both models were fine-\ntuned on two datasets focused on different tasks: common-\nsense 15k (CS) and Math 14k (MATH), which address com-\nmonsense reasoning and arithmetic reasoning respectively\n(provided by LLM-adapters [20]). The details of datasets\nare used in Table II. For evaluation, we tested the models\non GSM8K [21] for arithmetic reasoning and HE [22] for\ncommonsense reasoning. Each dataset consists of thousands\nof queries. We define a query as the concatenation of a\nprompt and its ground-truth answer, which is feed to LLMs\nfor fine-tuning.\nProfiling experiments. We evaluate the fine-tuning pro-\ncess from both software and hardware perspectives. The\nsoftware evaluation includes an end-to-end assessment of\nthe fine-tuning process and measures the performance of\nthe two models on various tasks post-fine-tuning. Using\nPyTorch, we provide essential algorithm-level information\nsuch as test accuracy, t',
          chunk_id=1850,
          id='xia2024',
          type='paper',
          title='Understanding the Performance and Estimating the Cost of LLM Fine-Tuning',
          year=2024,
          citation='Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Hao, Nishil Talati. (2024). Understanding the Performance and Estimating the Cost of LLM Fine-Tuning. arXiv. https://arxiv.org/pdf/2408.04693',
          url='https://arxiv.org/pdf/2408.04693')
left_chunk, right_chunk = Chunks(all_chunks).get_neighbours(1850)
fc.L(left_chunk, right_chunk).attrgot('chunk_id')
(#2) [1849,1851]
some_dup_chunks = fc.L(Chunks(all_chunks).get_chunk(cid) for cid in [1850, 1851, 1852, 1853, 1850, 1851, 1852])
Chunks.unique(some_dup_chunks).attrgot('chunk_id')
(#4) [1850,1851,1852,1853]
ans = Chunks(all_chunks).include_neighbours(some_dup_chunks)
ans.attrgot('chunk_id')
(#6) [1849,1850,1851,1852,1853,1854]
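include_neighbours above amounts to: expand each hit by one chunk on either side, clamp to valid ids, then deduplicate and sort. A sketch over bare chunk ids (the real method works on the namespace objects):

```python
def include_neighbours(ids, lo=0, hi=1926):
    # Add the chunk before and after each hit, keep ids in range, dedupe.
    expanded = {j for i in ids for j in (i - 1, i, i + 1) if lo <= j <= hi}
    return sorted(expanded)

ids = include_neighbours([1850, 1851, 1852, 1853])
```

This reproduces the [1849 … 1854] result above: four unique hits grow to six chunks once their flanking neighbours are pulled in.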

Hybrid: Rerank

Here we will rerank the outputs from semantic search and lexical search using a reranker model.

There are other ways to combine the outputs of the two searches, such as Reciprocal Rank Fusion (RRF) and linear combination, which you can try later.
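For the curious, RRF needs no tuned weights: each ranked list contributes 1/(k + rank) per item, so chunks ranked highly by both searches float to the top. A small sketch with illustrative chunk ids:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: sum 1/(k + rank) across all ranked lists.
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = [530, 6, 12]   # chunk ids from semantic search (illustrative)
lexical  = [530, 99, 6]   # chunk ids from lexical search (illustrative)
fused = rrf([semantic, lexical])
```

The constant k=60 comes from the original RRF paper and mostly dampens the influence of any single list’s top ranks.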

combined_res = combine_chunks(semantic_res, lexical_res)
len(combined_res)
2
combined_res.attrgot('chunk_id')
(#2) [530,6]
ranker = utils.Reranker()
query[:100]
'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'
combined_res[-1].text[:100]
'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'
ranker.rerank_chunks(combined_res[-1].text, combined_res)[0].text[:100]
'on Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We WorkCSO Let'
hs = HybridSearch(ls, ss)
chunks_res = hs.search(combined_res[-1].text)
chunks_res[0].text[:100]
'23 Amazon Sustainability Report Value Chain Introduction 2023 Year in ReviewGoals SummaryHow We Work'
hs = HybridSearch(ls, ss, neighbour_chunks=True)
hs.search(combined_res[-1].text, n=1).attrgot('chunk_id')
(#3) [10,11,12]