[namespace(qid='q280',
q='Given the total pre-training GPU hours and the number of GPUs used, estimate the total wall-clock time in days required to pre-train the JetMoE-8B model.',
predicted={'answer': 'Unanswerable from the provided context',
'answer_value': 'is_blank',
'answer_unit': 'is_blank',
'ref_id': ['is_blank'],
'ref_url': ['is_blank'],
'supporting_materials': 'is_blank',
'explanation': 'is_blank'},
expected={'id': 'q280',
 'question': 'Given the total pre-training GPU hours and the...',
 'answer': '~13 days',
 'answer_value': '13',
 'answer_unit': 'days',
 'ref_id': ['shen2024'],
 'ref_url': ['https://arxiv.org/pdf/2404.07413'],
 'supporting_materials': '…30,000 H100 GPU hours… We conduct training o...',
 'explanation': 'Math: wall_clock_hours ≈ 30,000 GPUh ÷ 96 GPUs...'},  # pandas Series, Name: 34, dtype: object
score=0.0,
answer_score=0.0,
ref_score=0.0,
na_score=0.0,
context='Doc ID: strubell2019\nURL: https://arxiv.org/pdf/1906.02243\nTitle: Energy and Policy Considerations for Deep Learning in NLP\nText: day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface [2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface. [3]\n2 nvidia-smi: https://bit.ly/30sGEbi\n3 RAPL power meter: https://bit.ly/2LObQhV\nConsumer | Renew. | Gas | Coal | Nuc.\nChina | 22% | 3% | 65% | 4%\nGermany | 40% | 7% | 38% | 13%\nUnited States | 17% | 35% | 27% | 19%\nAmazon-AWS | 17% | 24% | 30% | 26%\nGoogle | 56% | 14% | 15% | 10%\nMicrosoft | 32% | 23% | 31% | 10%\nTable 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States, China and Germany (Burger, 2019).\nWe estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let pc be the average power draw (in watts) from all CPU sockets during training, let pr be the average power draw from all DRAM (main memory) sockets, let pg be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), whi\n\nDoc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: 2, June 21–24, 2022, Seoul, Republic of Korea\nModel | GPU | Hours | kWh\nBERT finetune | 4·V100 | 6 | 3.1\nBERT pretrain | 8·V100 | 36 | 37.3\n6B Transf. | 256·A100 | 192 | 13,812.4\nDense 121 | 1·P40 | 0.3 | 0.02\nDense 169 | 1·P40 | 0.3 | 0.03\nDense 201 | 1·P40 | 0.4 | 0.04\nViT Tiny | 1·V100 | 19 | 1.7\nViT Small | 1·V100 | 19 | 2.2\nViT Base | 1·V100 | 21 | 4.7\nViT Large | 4·V100 | 90 | 93.3\nViT Huge | 4·V100 | 216 | 237.6\nTable 2. For the 11 models in our analysis: the type of GPU, the number of GPUs of that type, the number of hours, and the energy used in kWh. For example, our BERT language modeling (BERT LM) experiment used 8 V100 GPUs for 36 hours and used a total of 37.3 kWh. We note our training run of the 6 billion parameter transformer only trained for approximately 13% of the time it would take to train to completion; we estimate a full training run would consume approximately 103,593 kWh.\n4.1 NLP\nBERT Training. We monitored the energy consumption while training a BERT-small model [8] for approximately 36 hours on 8 NVIDIA V100 GPUs. That training run consumed over 37 kWh of electricity.\nBERT Finetuning. We tracked the energy consumption while finetuning the BERT-small model on a standard natural language inference task [48, MNLI] for approximately 6 hours on 4 NVIDIA V100 GPUs. Our finetuning run consumed around 3.2 kWh of electricity, i.e., less than one tenth that due to BERT-small pre-training.\n6 Billion Parameter Transformer. We tracked the energy consumption of training a large language model comprising over 6.1 billion parameters dur\n\nDoc ID: strubell2019\nURL: https://arxiv.org/pdf/1906.02243\nTitle: Energy and Policy Considerations for Deep Learning in NLP\nText: n, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.\n4.2 Cost of development: Case study\nTo quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al.
, 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP.\nModel training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs. [8]\nThe sum GPU time required for the project totaled 9998 days (27 years). This averages to\n8 We approximate cloud compute cost using P100 pricing.\nModels | Hours | Cloud compute | Electricity\n1 | 120 | $52–$175 | $5\n24 | 2880 | $1238–$4205 | $118\n4789 | 239,942 | $103k–$350k | $9870\nTable 4: Estimated cost (USD) in terms of cloud compute and electricity for training: (1) a single model, (2) a single tune and (3) all models trained during R&D.\nabout 60 GPUs running constantly th\n\nDoc ID: luccioni2023\nURL: https://arxiv.org/pdf/2302.08476\nTitle: Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning\nText: e to ML model are rising.\nIn Section 4, we have discussed that the main sources of variance in the amount of emissions associated to training machine learning models is due to the carbon intensity of the primary energy source and the training time, with the power consumption of the hardware having a smaller influence. In terms of training time, the models in our sample range from just about 15 minutes (total GPU/TPU time) up to more than 400,000 hours, with a median of 72 hours, pointing again to large variance in our sample. 
While the maximum of 400,000 GPU hours (equivalent to about 170 days with 100 GPUs) in our sample seems very large, note that the total training time of GPT-3 was estimated to be over 3.5 million hours (14.8 days with 10,000 GPUs) [38]. Obviously, such long training times result in large amounts of carbon emissions, even with lower carbon intensity energy sources. By way of illustration, the model with the longest training time in our sample would have reduced its carbon emissions by about 30 times had it used the grid with the lowest carbon intensity in our sample, but it would have still resulted in over 1 ton of CO2eq. Also, generally speaking, we can see that the models at the higher end of the emissions spectrum tend to be Transformer-based models with more layers (as well as using techniques such as Neural Architecture Search to find optimal combinations of parameters), whereas simpler and shallower models such as convolutional neural networks te\n\nDoc ID: cottier2024\nURL: https://arxiv.org/pdf/2405.21015\nTitle: The Rising Costs of Training Frontier AI Models\nText: of chips × (1 − exp(−Training time × r ln 10))\nwhere training time is in years. However, we could estimate chip-hours more often and more reliably than the training time or the number of chips separately. This is because chip-hours can also be estimated from training compute in FLOP divided by the FLOP/s achieved during training. We used a linear approximation to take advantage of these chip-hour estimates:\nAmortized training cost = Start value per chip × Training chip-hours / ((365 × 24) hours/year) × r ln 10\nThis approximation is valid if (Training time) × r ln 10 is small, and this is the case for the training times in our data and our choice of r = 0.14. In an extreme case, a training time of 1 year results in 1 × 0.14 ln(10) ≈ 32% depreciation compared to 1 − exp(−1 × 0.14 ln(10)) ≈ 28% depreciation. 
This is not a large difference relative to other sources\nof uncertainty.\nDue to NVIDIA covering defects and component failures under warranty, we concluded that hardware failures are not a\nsignificant source of depreciation relative to hardware progress. As one data point, an average of 1 to 2 failures per\nweek occurred when training the BLOOM model on a cluster of 384 NVIDIA A100 GPUs [25]. Even if these were all\ncatastrophic failures, the expected hardware lifetime would be 3.7 years. We expect that NVIDIA replaces or repairs\ndefective GPUs on a faster timescale, which makes the cost of failure small compared to hardware price depreciation.\nA.4 Energy cost estimation\nTo model th\n\nDoc ID: luccioni2023\nURL: https://arxiv.org/pdf/2302.08476\nTitle: Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning\nText: [38] and it remains a fair approximation of the\nactual energy consumption of many hardware models. We provide more information about TDP and the hardware used\nfor training the models in our sample in Section A.2 of the Appendix.\nTraining Time. Training time was computed as the total number of hardware hours, which is different from the\n"wall time" of ML model training, since most models were trained on multiple units at once. 
For instance, if training a model used 16 GPUs for 24 hours, this equals a training time of 384 GPU hours; a model using 8 GPUs for 48 hours will therefore have an equivalent training time.\n1 For instance, methane is 28 times more potent than CO2 based on its 100-year global warming potential, so energy generation emitting 1 gram of methane per kWh will emit 28 grams of CO2eq per kWh.\nManuscript pending review Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning\n4 DATA ANALYSIS\nIn the sections below, we present several aspects regarding the carbon footprint of training ML models, examining the main sources of energy used for training (§ 4.1), the order of magnitude of CO2 emissions produced (§ 4.2), the evolution of these emissions over time (§ 4.3) and the relationship between carbon emissions and model performance (§ 4.4).\n4.1 What are the main sources of energy used for training ML models?\nThe primary energy source used for powering an electricity grid is the single biggest influence on the carbon intensity of that \n\nDoc ID: cottier2024\nURL: https://arxiv.org/pdf/2405.21015\nTitle: The Rising Costs of Training Frontier AI Models\nText: of about $5,000.\nA.3 Amortization model\nAs explained in section 2.2, we estimated the value of the training hardware at the beginning of training as:\nStart value per chip = Acquisition cost per chip / exp((Training start date − Hardware availability date) · r ln 10)\nwhere r is a depreciation rate in orders of magnitude per year, and the difference in dates is in years. The hardware availability date depended on the type of hardware. If the hardware was a Google TPU, we used the hardware announcement date. For GPUs, we used a 90-day buffer between the GPU first going on the market and the GPU actually being shipped to the buyer. 
Our results are robust to variations in this buffer time—see Appendix B.4.\nFor the training start date, there were a few known cases—for example, GPT-4 finished training in August 2022 [12]. Otherwise, we subtracted the training time from the publication date, and then subtracted a further 60 days to account for time spent evaluating the model and writing the paper. Again, our results are robust to variations in this buffer. If the training time was unknown, we used the median of known values in our dataset, which was approximately 33 days.\nThe precise way to amortize the training cost through exponential depreciation is:\nAmortized training cost = Start value per chip × Number of chips × Depreciation during training = Start value per chip × Number of chips × (1 − exp(−Training time × r ln 10))\nwhere training time is in years. However, we cou\n\nDoc ID: wu2021a\nURL: https://arxiv.org/pdf/2111.00364\nTitle: Sustainable AI: Environmental Implications, Challenges and Opportunities\nText: large collection of diverse ML ideas are explored simultaneously at-scale. Thus, during this phase, we observe unique system resource requirements from the large pool of training experiments. Within Facebook’s ML research cluster, 50% (p50) of ML training experiments take up to 1.5 GPU days while 99% (p99) of the experiments complete within 24 GPU days. There are a number of large-scale, trillion parameter models which require over 500 GPU days.\nOnce a ML solution is determined as promising, it moves into Training where the ML solution is evaluated using extensive production data — data that is more recent, is larger in quantity, and contains richer features. The process often requires additional hyper-parameter tuning. Depending on the ML task requirement, the models can be trained/re-trained at different frequencies. 
For example, models supporting Facebook’s Search service were trained at an hourly cadence whereas the Language Translation models were trained weekly [24]. A p50 production model training workflow takes 2.96 GPU days while a training workflow at p99 can take up to 125 GPU days.\nFinally, for Inference, the best-performing model is deployed, producing trillions of daily predictions to serve billions of users worldwide. The total compute cycles for inference predictions are expected to exceed the corresponding training cycles for the deployed model.\nB. Machine Learning System Life Cycle\nLife Cycle Analysis (LCA) is a common methodology to assess the carbon emis\n\nDoc ID: patterson2021\nURL: https://arxiv.org/pdf/2104.10350\nTitle: Carbon Emissions and Large Neural Network Training\nText: takes ~14.8 days for 10,000 GPUs at 24.6 TeraFLOPS/sec to compute 3.14E+23 FLOPS. For the CO2e calculation, it doesn’t actually matter whether it takes 2 weeks on 10,000 GPUs or 20 weeks on 1,000 GPUs, but we need one number for Table 4, so we used NVIDIA’s suggestion of 10,000 GPUs.\n● Total Computation (Table 1, row 13; Table 4, row 16): We calculate from measured performance, number of chips, and days to train (except for GPT-3, as OpenAI published the total FLOPS).\n● % of Google 2019 Energy Consumption (Table 4, row 17): For all models (even those not actually run in Google datacenters or not run in 2019), we calculate the percentage of Google’s total energy consumption of 12.2 Terawatt-hours in 2019 [Goo20].\n● Ratio of round trips (Table 4, row 22). To give perspective on how the CO2e cost of training a model compares to other activities, we show the CO2e of passenger jets. Google Flights calculated the average CO2 emission for all the direct flights between San Francisco (SFO) and New York (JFK) in its database as 90.2t, so the average round trip is 180.4t. 
(This is for the\n\nDoc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: the energy consumption of training a large language model comprising over 6.1 billion parameters during 8 days on 256 NVIDIA A100s. The total energy amounted to a staggering 13.8 MWh. This model was not trained to completion, but only until 13%; a full training run would take 60 days. Thus, we estimate the total energy consumption to train this model to completion would be approximately (60/8) × 13.8 = 103.5 MWh, or 103,500 kWh — almost 2800 times more than training the BERT-small model!\n4.2 Computer Vision\nDenseNets. We trained three sizes of DenseNets [19] on MNIST [25]. The jobs lasted between 20 and 25 minutes and consumed between 20 and 38 Wh (or 0.02 to 0.04 kWh) of electricity, which is negligible compared to the other models.\nVision Transformers. We evaluated the energy consumption during the training of five sizes of Vision Transformers [9] on ImageNet [7]. For the smallest ViT experiment (ViT tiny), training lasted around 19 hours on a single V100 and consumed approximately 1.7 kWh. For the largest one (ViT huge), training lasted more than 9 days on 4 V100s and consumed approximately 237 kWh. The full list of models can be found in Table 2.\n5 EMISSIONS BY REGION AND TIME OF DAY\nUsing the methodology presented above, we provide some of the first measurements of the differences of actual datacenters from a major cloud provider. Importantly, what we have is a time series of marginal emissions: for example, if a job were to run from 1 pm to 5 pm in the US West region wit'),
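Record q280's expected answer reduces to a one-step unit conversion: 30,000 H100 GPU hours spread across 96 GPUs, converted to days. A minimal sketch of that check, using only the figures from the expected record above (the helper name is mine, not from any cited paper):

```python
# Wall-clock estimate for q280: total GPU hours divided by GPU count,
# then hours converted to days. Assumes all GPUs run concurrently for
# the whole job, which is the idealization the expected answer uses.
def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Ideal wall-clock training time in days."""
    return gpu_hours / num_gpus / 24

days = wall_clock_days(30_000, 96)  # shen2024: ~30,000 H100 GPU hours, 96 GPUs
print(round(days, 1))  # 13.0 — matching the expected "~13 days"
```

Note the predicted answer scored 0.0 because the model abstained, even though the two inputs needed for this division were available.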
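The cottier2024 excerpts in the retrieved context compare exact exponential depreciation, 1 − exp(−T · r ln 10), with its linear approximation T · r ln 10. A short sketch reproducing the paper's own extreme case (T = 1 year, r = 0.14 orders of magnitude per year); the function names are illustrative, not from the paper:

```python
import math

# Hardware depreciation during training (per cottier2024's quoted formulas),
# with t_years the training time in years and r the depreciation rate in
# orders of magnitude per year.
def depreciation_exact(t_years: float, r: float = 0.14) -> float:
    return 1 - math.exp(-t_years * r * math.log(10))

def depreciation_linear(t_years: float, r: float = 0.14) -> float:
    return t_years * r * math.log(10)

# The paper's extreme case: a full year of training.
print(f"{depreciation_linear(1.0):.0%}")  # 32%
print(f"{depreciation_exact(1.0):.0%}")   # 28%
```

The two agree within a few percentage points, which is the paper's justification for using the linear form with chip-hour estimates.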
namespace(qid='q170',
q='How many days of CO₂ emissions from an average American life are equivalent to training BERT base?',
predicted={'answer': '6 years, 1 month',
'answer_value': '6.083',
'answer_unit': 'years',
'ref_id': ['morrison2025'],
'ref_url': ['https://arxiv.org/pdf/2503.05804'],
'supporting_materials': 'Table 2: We list the estimated power usage, carbon emissions, and water consumption from training our dense transformers... Llama 2 7B 81 31 6 yrs, 1 mo - -',
'explanation': 'Table'},
expected={'id': 'q170',
 'question': 'How many days of CO₂ emissions from an average...',
 'answer': '14.4',
 'answer_value': '14.4',
 'answer_unit': 'days',
 'ref_id': ['strubell2019'],
 'ref_url': ['https://arxiv.org/pdf/1906.02243'],
 'supporting_materials': 'Table 1 and Table 3',
 'explanation': '1438 lbs / 99.8 lbs/day = 14.4'},  # pandas Series, Name: 18, dtype: object
score=0.1,
answer_score=0.0,
ref_score=0.0,
na_score=1.0,
context='Doc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: average US home energy use is estimated to emit 8.30 metric tons (using the sum of emissions from generating electricity, natural gas, liquid petroleum, and fuel oil), and one rail car of coal is estimated to emit 181.29 metric tons.\nFig. 2. Emissions for our 11 experiments described in §4. For each model we show a vertical blue bar, where the top of the bar is the max, the bottom is the min, and the black line represents the average emissions (across regions and time of year). First and fourth quartiles are represented by the light blue at the top and bottom of each vertical blue bar. The largest training runs (e.g., the 6 billion parameter LM) release a significant amount of emissions, no matter the region (and recall the 6 billion parameter LM is only trained for 13% of a full run, so a full run would emit about an order of magnitude more emissions than reported here). The smallest experiments emit very little. Presented on a log scale, with references on the right indicating equivalent sources of emissions per the United States Environmental Protection Agency [46].\nThe largest experiment in our set is the 6 billion parameter transformer, and that model is only partially trained (as described in §4, it is only trained for about 13% of the time needed to converge). Even partially trained, experiments of this size can emit more CO2 than all emissions from the average\n\nDoc ID: luccioni2024\nURL: https://arxiv.org/pdf/2311.16863\nTitle: Power Hungry Processing: Watts Driving the Cost of AI Deployment?\nText: s, architecture and carbon emissions of their products, we can make a comparison based on the experiments carried out in the present study. 
For instance, the average emissions of a BERT-based model fine-tuned for extractive question answering (bert-large-uncased-whole-word-masking-finetuned-squad), a task akin to extractive web search, is 0.70g CO2eq per 1,000 queries, which is less than a third of that of the multi-purpose models (2.36g for Flan-T5 base and 2.34g for BLOOMz-560M). The difference is much more drastic if comparing BERT-based models for tasks such as text classification with the larger multi-purpose models: for instance bert-base-multilingual-uncased-sentiment emits just 0.32g of CO2eq per 1,000 queries, compared to 2.66g for Flan-T5-XL and 4.67g for BLOOMz-7B. For comparison, the first PaLM model, released in 2022, has 540 billion parameters [7], whereas GPT-3 has 175 billion parameters [5]. While we see the benefit of deploying generative zero-shot models given their ability to carry out multiple tasks, we do not see convincing evidence for the necessity of their deployment in contexts where tasks are well-defined, for instance web search and navigation, given these models’ energy requirements.\nFinally, the intent of our study is to set the stage for better understanding of the energy requirements and carbon emissions of the final, often overlooked, step in the ML model life cycle: model deployment. The comparison between training, finetuning and inference energ\n\nDoc ID: luccioni2023\nURL: https://arxiv.org/pdf/2302.08476\nTitle: Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning\nText: omparing the carbon emissions of two or more models and approaches. The first paper to do so was written by Strubell et al., which estimated that the emissions of training and fine-tuning a large Transformer model with Neural Architecture Search (NAS) produced 284,019 kg (626,155 lbs) of CO2, similar to the lifetime emissions of five US cars [48]. 
This perspective has since been explored further via analyses of the carbon\nfootprint of different neural network architectures [31, 37, 38] and the relative efficiency of different methods [35, 56].\nThese empirical studies are very recent (post-2019), remain relatively sparse and biased towards certain research\nareas (i.e. Natural Language Processing), and there are many aspects of the emissions of model training that remain\nunexplored. In sum, there is a need for a more broad and multi-faceted analysis in order to better understand the scale\nand variation of carbon emissions in our community.\nTools and approaches for measuring carbon emissions. Developing standardized approaches for estimating the carbon\nemissions of model training has also been the focus of much work [5, 20, 26, 27, 30, 45, 51]. As a result, there are several\ntools that exist for this purpose, such as Code Carbon and the Experiment Impact Tracker, which can be used during the\nmodel training process, or the ML CO2 Calculator, which can be used after training, all of which provide an estimate\nof the amount of carbon emitted. However, a recent study on different ca\n\nDoc ID: luccioni2023\nURL: https://arxiv.org/pdf/2302.08476\nTitle: Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning\nText: e to ML model are rising.\nIn Section 4, we have discussed that the main sources of variance in the amount of emissions associated to training\nmachine learning models is due to the carbon intensity of the primary energy source and the training time, with the\npower consumption of the hardware having a smaller influence. In terms of training time, the models in our sample\nrange from just about 15 minutes (total GPU/TPU time) up to more than 400,000 hours, with a median of 72 hours,\npointing again to large variance in our sample. 
While the maximum of 400,000 GPU hours (equivalent to about 170 days with 100 GPUs) in our sample seems very large, note that the total training time of GPT-3 was estimated to be over 3.5 million hours (14.8 days with 10,000 GPUs) [38]. Obviously, such long training times result in large amounts of carbon emissions, even with lower carbon intensity energy sources. By way of illustration, the model with the longest training time in our sample would have reduced its carbon emissions by about 30 times had it used the grid with the lowest carbon intensity in our sample, but it would have still resulted in over 1 ton of CO2eq. Also, generally speaking, we can see that the models at the higher end of the emissions spectrum tend to be Transformer-based models with more layers (as well as using techniques such as Neural Architecture Search to find optimal combinations of parameters), whereas simpler and shallower models such as convolutional neural networks te\n\nDoc ID: luccioni2025c\nURL: https://arxiv.org/pdf/2506.15572\nTitle: Misinformation by Omission: The Need for More Environmental Transparency in AI\nText: In the case of the latter, they estimated that the NAS approach, assuming United States average electricity GHG emissions intensity and typical AI hardware running in an average-efficiency datacenter, could yield 626,155 pounds (284 metric tons) CO2-equivalent GHG emissions (CO2e), or about five times the emissions of a car during its lifetime, including fuel.\nThe research article was written for a specialized audience of AI and NLP researchers, who would have the background knowledge to understand the appropriate scoping for the estimate. However, an author’s tweet publicizing the paper and featuring a table containing the “five cars” estimate was widely shared on social media, leading to the publication being picked up by numerous media outlets (including MIT Technology Review and Forbes). 
The “five cars” number has since been\nmisinterpreted as a proxy for the carbon footprint of training AI models at large, which is misleading given the diversity of\narchitectures, training approaches and electricity sources used for powering AI model training; the original article reports AI\ntraining workloads emitting as little as 26 pounds (11.8 kg) CO2e (assuming U.S. average energy carbon emissions intensity),\nand AI model training more broadly often requires even less energy and corresponding emissions.\nFurther, the NAS training workload represents a large-scale procedure that is meant to be and is in practice performed much\nless frequently than the average AI model training w\n\nDoc ID: wu2021a\nURL: https://arxiv.org/pdf/2111.00364\nTitle: Sustainable AI: Environmental Implications, Challenges and Opportunities\nText: carbon footprint of large-scale ML tasks (Figure 4). Taking into\naccount carbon-free energy, such as solar, the operational energy consumption\ncan be significantly reduced, leaving the manufacturing carbon cost as the\ndominating source of AI’s carbon footprint.\nBoth Training and Inference can contribute significantly to the\noverall carbon footprint of machine learning tasks at Facebook.\nThe exact breakdown between the two phases varies across\nML use cases.\nThe overall operational carbon footprint is categorized into\noffline training, online training, and inference. Offline training\nencompasses both experimentation and training models with\nhistorical data. Online training is particularly relevant to\nrecommendation models where parameters are continuously\nupdated based on recent data. The inference footprint represents\nthe emission from serving production traffic. The online training\nand inference emissions are considered over the period of\noffline training. 
For recommendation use cases, we find the\ncarbon footprint is split evenly between training and inference.\nOn the other hand, the carbon footprint of LM is dominated\nby the inference phase, using much higher inference resources\n(65%) as compared to training (35%).\nBoth operational and embodied carbon emissions can con-\ntribute significantly to the overall footprint of ML tasks .\nOperational Carbon Footprint: Across the life cycle of\nthe Facebook models shown in Figure 4, the average carbon\nfootprint is 1.8 × higher than that of th\n\nDoc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: er line) at different times throughout the year.\nWhat do emissions look like across the 11 experiments described in §4? In Figure 2 we show results for all 11\nexperiments, which cover two BERT experiments (finetuning and language modeling), partial training of a 6.1 billion\nparameter Transformer, 3 sizes of DenseNets, and five sizes of Vision Transformers. Each experiment is represented by\na vertical blue bar showing the range of emissions that would be emitted for that experiment across different regions.\nThe top of the blue bar is the emissions from running that experiment in the region with the most emissions, the bottom\nis the emissions from running that experiment in the region with the least emissions, the black line represents the\naverage, and the light blue regions are the top and bottom quartiles.\nIn Figure 2 we also include estimates of equivalent sources of emissions per the United States Environmental\nProtection Agency [46]. 
One phone charge is estimated to emit 8.22 × 10^−6 metric tons (using US national weighted average CO2 marginal emission rate for delivered electricity), one mile driven is estimated to emit 3.98 × 10^−4 metric tons (using average US passenger vehicle, which gets 22.5 miles per gallon of gasoline), one gallon of gasoline consumed is estimated to emit 8.887 × 10^−3 metric tons, one barrel of crude oil consumed is estimated to emit 0.43 metric tons, one average US home energy use is estimated to emit 8.30 metric tons (using the sum of emissions from g\n\nDoc ID: morrison2025\nURL: https://arxiv.org/pdf/2503.05804\nTitle: Holistically Evaluating the Environmental Impact of Creating Language Models\nText: com/Page.aspx?id=1012\n13 https://anysilicon.com/die-per-wafer-formula-free-calculators/\n14 https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/\nPublished as a conference paper at ICLR 2025\nTable 2: We list the estimated power usage, carbon emissions, and water consumption from training our dense transformers, ranging from 20 million to 13 billion parameters, trained on 1.7 to 5.6 trillion tokens, and a mixture-of-experts model with 1 billion active and 7 billion total parameters, trained to 5 trillion tokens. We find that the environmental impact is quite high, even for our relatively small models. Training our series of models emitted equivalent carbon to over 65 years of electricity use by the average household in the U.S., and consumed equivalent water to the average person in the U.S. for about 17 years.\n* One of the original OLMo 7B models was trained on LUMI, which runs entirely on hydroelectric power. See Groeneveld et al. (2024) for more information.\n† denotes unreleased models that were trained for various internal experi
========== TRUNCATED ==========
) kWh per GPU hour, b) CO2 grams per GPU hour, and c) CO2 grams per kWh. Here we compare against [34] and [33] which report information about training especially large models. Their estimates also include additional sources of CO2, like PUE (Power Usage Effectiveness) of their datacenters, so we expect their kWh per GPU hour and CO2 per GPU hour to be higher than our estimates (which only count the GPU electricity consumption). Across our experiments, we find kWh per GPU hour to range from 0.07 to 0.28\n\nDoc ID: strubell2019\nURL: https://arxiv.org/pdf/1906.02243\nTitle: Energy and Policy Considerations for Deep Learning in NLP\nText: t-independent pretrained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).\nBERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).\nGPT-2. This model is the latest edition of OpenAI’s GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. 
(2019) show high\nzero-shot performance on question answering and\nlanguage modeling benchmarks. The large model\ndescribed in\nRadford et al. (2019) has 1542M pa-\nrameters and is reported to require 1 week (168\nhours) of training on 32 TPUv3 chips.\n6\n3 Related work\nThere is some precedent for work characterizin\n\nDoc ID: luccioni2023\nURL: https://arxiv.org/pdf/2302.08476\nTitle: Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning\nText: [38] and it remains a fair approximation of the\nactual energy consumption of many hardware models. We provide more information about TDP and the hardware used\nfor training the models in our sample in Section A.2 of the Appendix.\nTraining Time. Training time was computed as the total number of hardware hours, which is different from the\n"wall time" of ML model training, since most models were trained on multiple units at once. For instance, if training a\n1For instance, methane is 28 times more potent than CO 2 based on its 100-year global warming potential, so energy generation emitting 1 gram of\nmethane per kWh will emit 28 grams of CO2eq per kWh.\nManuscript pending review Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning 5\nmodel used 16 GPUs for 24 hours, this equals a training time of 384 GPU hours ; a model using 8 GPUs for 48 hours will\ntherefore have an equivalent training time.\n4 DATA ANALYSIS\nIn the sections below, we present several aspects regarding the carbon footprint of training ML models, examining the\nmain sources of energy used for training (§ 4.1), the order of magnitude of CO2 emissions produced (§ 4.2), the evolution\nof these emissions over time (§ 4.3) and the relationship between carbon emissions and model performance (§ 4.4) 2.\n4.1 What are the main sources of energy used for training ML models?\nThe primary energy source used for powering an electricity grid is the single biggest influence on the carbon intensity\nof that 
\n\nDoc ID: rubei2025\nURL: https://arxiv.org/pdf/2501.05899\nTitle: Prompt engineering and its implications on the energy consumption of Large Language Models\nText: e training when\nthe carbon emission is below a certain threshold. The results\nshows that the proposed approach succeed in reducing the\ncarbon emission even though the region may impact the ob-\ntained results. Liu and Yin [37] investigate how to reduce and\nmeasure the consumption of pre-trained models by combining\nfine-tuning and efficient tokenizers. In particular, BERT, Distil-\nBERT, and T5 models are compared using SQuAD benchmark\n[38] in terms of accuracy and carbon emissions. The experi-\nmental results reveal that both the T5 and BERT models emit-\nted considerably more CO2 compared to DistilBERT and the\nT4 GPU contributes in reducing the overall carbon emissions.\nSamsi et al. [13] compare the inference performance in terms\nof watts of different Llama models, i.e., evaluating smaller\nmodels (7B, 13B) against the largest available version (65B) at\nthe time of writing. In addition, the authors consider different\nGPUs, i.e., V100 and A100. The study reveals that 8 V100\nGPUs each with 32 GB of RAM or 4 A100 GPUs each with\n80GB of memory are required for any meaningful inferences\nwith the 65B LLaMA model, thus making small models a\nsuitable choice for energy-efficient applications. Cursaro et al. [39] conduct a controlled experiment in which code generated\nby CodeLlama is compared with the human one considering\ndifferent languages, i.e., C++, Java, and Python, tested on a\ndedicated platform. 
The results show that explicitly asking to generate energy-efficient code results in an

Doc ID: luccioni2023
URL: https://arxiv.org/pdf/2302.08476
Title: Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning
Text: e to ML model are rising. In Section 4, we have discussed that the main sources of variance in the amount of emissions associated with training machine learning models are the carbon intensity of the primary energy source and the training time, with the power consumption of the hardware having a smaller influence. In terms of training time, the models in our sample range from just about 15 minutes (total GPU/TPU time) up to more than 400,000 hours, with a median of 72 hours, pointing again to large variance in our sample. While the maximum of 400,000 GPU hours (equivalent to about 170 days with 100 GPUs) in our sample seems very large, note that the total training time of GPT-3 was estimated to be over 3.5 million hours (14.8 days with 10,000 GPUs) [38]. Obviously, such long training times result in large amounts of carbon emissions, even with lower carbon intensity energy sources. By way of illustration, the model with the longest training time in our sample would have reduced its carbon emissions by about 30 times had it used the grid with the lowest carbon intensity in our sample, but it would have still resulted in over 1 ton of CO2eq. 
Also, generally speaking,\nwe can see that the models at the higher end of the emissions spectrum tend to be Transformer-based model with more\nlayers (as well as using techniques such as Neural Architecture Search to find optimal combinations of parameters),\nwhereas simpler and shallower models such as convolutional neural networks te\n\nDoc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: the energy consumption of training a large language model comprising\nover 6.1 billion parameters during 8 days on 256 NVIDIA A100s. The total energy amounted to a staggering 13.8 MWh.\nThis model was not trained to completion, but only until 13%; a full training run would take 60 days. Thus, we estimate\nthe total energy consumption to train this model to completion would be approximately (60/8)∗13.8 = 103.5 MWh, or\n103,500 kWh — almost 2800 times more than training the BERT-small model!\n4.2 Computer Vision\nDenseNets. We trained three sizes of DenseNets [19] on MNIST [25]. The jobs lasted between 20 and 25 minutes and\nconsumed between 20 and 38Wh (or 0.02 to 0.04 kWh) of electricity, which is negligible compared to the other models.\nVision Transformers. We evaluated the energy consumption during the training of five sizes of Vision Transformers [9]\non ImageNet [7]. For the smallest ViT experiment (ViT tiny), training lasted around 19 hours on a single V100 and\nconsumed approximately 1.7 kWh. For the largest one (ViT huge), training lasted more than 9 days on a 4 V100s and\nconsumed approximately 237 kWh. The full list of models can be found in Table 2.\n5 EMISSIONS BY REGION AND TIME OF DAY\nUsing the methodology presented above, we provide some of the first measurements of the differences of actual\ndatacenters from a major cloud provider. 
Importantly, what we have is a time series of marginal emissions: for example,\nif a job were to run from 1 pm to 5 pm in the US West region wit\n\nDoc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: to\nsingle-instance emissions calculations. We leave it open for future research to address how to appropriately allocate\nCO2 emissions from such data center-wide processes to individual reserved cloud instances.\n4 ELECTRICITY CONSUMPTION FOR AI WORKLOADS\nAs outlined in §3.1, calculating software carbon intensity begins with recording the electricity consumption, which\ncan then be mapped to emissions based on the emissions of the grid being used. In this section, we present data on\nelectricity consumption for experiments training 11 different models, covering natural language processing (NLP) and\ncomputer vision applications, ranging from less than an hour on a single GPU up to more than 8 days on 256 GPUs. We\noutline both the experiments themselves and their electricity consumption, and in the following section we use the\nelectricity consumption and carbon intensity tool described in the previous section to calculate their software carbon\nintensity.\n2We note that our conclusions drawn from experiments and analyses on time-shifting and location-shifting are still applicable with tools that measure\nmore electricity than just the GPU.\n3https://www.google.com/about/datacenters/efficiency/\n4One of the largest single source of CO2 emissions, contributing to 7%-8% of global emissions, is the production of cement [20].\n6 Measuring the Carbon Intensity of AI in Cloud Instances FAccT ’22, June 21–24, 2022, Seoul, Republic of Korea\nModel BERT BERT 6B Dense Dense Dense ViT ViT ViT ViT V\n\nDoc ID: morrison2025\nURL: https://arxiv.org/pdf/2503.05804\nTitle: Holistically Evaluating the Environmental Impact of Creating Language Models\nText: e carbon intensity; Llama 3 used a region-specific 
carbon intensity. All 3\nassumed 100% GPU power draw throughout training.\n3 Published as a conference paper at ICLR 2025\nwhere the cost of a scientific resultR (e.g. a claim that a particular training setup reachesX accuracy\non benchmark Y ) is proportional to the product of the cost of processing a single example E, the\nsize of the training dataset D, and the number of hyperparameter experiments H. In previous work,\nE · D, the cost of training on the training dataset, is what is most commonly reported, and H, the\ntotal number of experiments, is most often excluded.\nIn our analysis, we calculate the total power consumption during model training, development, and\ninference, and use this to estimate the total carbon emissions and water consumption during each\nstage. We follow previous work (Luccioni et al., 2023; Dubey et al., 2024; Gemma Team et al.,\n2024) to calculate CO2 emissions (CO2e) from power consumption:\nCO2e = P · PUE · CI (2)\nwhere the total carbon emissions is equal to the power usage P, multiplied by the power usage\neffectiveness (PUE)6 of the data center, multiplied by the carbon intensity CI of the local power\ngrid. We run all experiments in our two GPU clusters, Jupiter and Augusta, which are located in\nTexas and Iowa, respectively (see OLMo et al. (2025) for more information). Our 13B model was\ntrained on Augusta, and all other experiments analyzed in this paper were trained on Jupiter.\nOur data center provider\n\nDoc ID: morrison2025\nURL: https://arxiv.org/pdf/2503.05804\nTitle: Holistically Evaluating the Environmental Impact of Creating Language Models\nText: coefficients of 1.29 L / kWh and 1.2\nrespectively, and carbon intensity of 0.332 kg CO2e / kWh. Note the difference in units for energy consumption\nand carbon emissions, namely MWh → kWh, tons → grams CO2eq, and kL → L. The measurements reported\nin this table account for the GPU processes associated with active inference, but not CPU or RAM associated\nwith e.g. 
server overhead. Thus, these numbers can be considered as lower bounds on usage in similar settings. Also of note is the relatively small variability in carbon emissions and water consumption across different model sizes in cases where batches are not saturated, despite faster inference in smaller models when fully saturated; greater peak efficiency does not guarantee efficient deployment if inference is not optimized. We do not report "break-even" points for Qwen 2.5 because its training costs are not public.

Model          | Request freq. (req/s) | GPU power usage (kWh) | Carbon emissions (g CO2eq) | Water consump. (L) | Seconds per 100 req. | # Inf. for CO2 equiv. w/ training
Llama 3.2 1B   | ∞ | 0.003 | 1.0   | 0.004 | 1.38   | 258 bil.
               | 8 | 0.036 | 12.0  | 0.054 | 12.64  | 21.5 bil.
               | 1 | 0.160 | 53.1  | 0.238 | 100.58 | 4.83 bil.
Qwen 2.5 7B    | ∞ | 0.009 | 3.0   | 0.013 | 1.79   | —
               | 8 | 0.053 | 17.6  | 0.079 | 12.77  | —
               | 1 | 0.308 | 102.3 | 0.459 | 100.58 | —
Llama 3.1 8B   | ∞ | 0.011 | 3.7   | 0.016 | 2.13   | 276 bil.
               | 8 | 0.051 | 16.9  | 0.076 | 12.79  | 59.5 bil.
               | 1 | 0.333 | 110.6 | 0.496 | 100.64 | 9.12 bil.
Llama 2 13B    | ∞ | 0.034 | 11.3  | 0.051 | 6.53   | 13.3 bil.
               | 8 | 0.060 | 19.9  | 0.089 | 13.09  | 7.52 bil.
               | 1 | 0.401 | 133.1 | 0.597 | 100.73 | 1.13 bil.
OLMo 1 1B (3T) | ∞ | 0.004'),
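The q280 record above scores an "Unanswerable" prediction against an expected answer of ~13 days, derived from the shen2024 figures quoted in the record (≈30,000 H100 GPU hours, 96 GPUs). A minimal sketch of that expected arithmetic, for checking the record (the figures are from the record, the conversion is an illustration):

```python
# Wall-clock estimate behind q280's expected answer (~13 days).
total_gpu_hours = 30_000   # H100 GPU hours reported for JetMoE-8B pre-training (shen2024)
num_gpus = 96              # GPUs used concurrently (shen2024)

wall_clock_hours = total_gpu_hours / num_gpus  # 312.5 hours
wall_clock_days = wall_clock_hours / 24        # ~13 days
print(f"{wall_clock_days:.1f} days")           # -> 13.0 days
```

Note the distinction drawn in the luccioni2023 context: GPU hours are hardware hours (GPUs × hours), so dividing by the GPU count recovers wall-clock time.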
namespace(qid='q272',
q='How many U.S. household-years of electricity consumption is training a 6.1B-parameter language model equivalent to?',
predicted={'answer': 'The training of a 6.1B-parameter language model to completion is estimated to consume 103.5 MWh of energy, which is equivalent to approximately 20.7 years of electricity use by an average U.S. household.',
'answer_value': 20.7,
'answer_unit': 'years',
'ref_id': ['dodge2022'],
'ref_url': ['https://arxiv.org/pdf/2206.05229'],
'supporting_materials': 'Thus, we estimate the total energy consumption to train this model to completion would be approximately (60/8)*13.8 = 103.5 MWh',
'explanation': 'Math: (60/8)*13.8 = 103.5 MWh. The equivalent U.S. household-years is calculated by comparing to the average U.S. household electricity consumption of 10.6 MWh/year, as referenced in the context. 103.5 MWh / 5 MWh/year = 20.7 years.'},
expected=id q272
question How many U.S. household-years of electricity c...
answer 1.3
answer_value 1.3
answer_unit household-years
ref_id ['dodge2022','strubell2019']
ref_url ['https://arxiv.org/pdf/2206.05229','https://a...
supporting_materials Dodge 2022: "The 6.1B parameter model consumed...
explanation 13.8 MWh ÷ 10.7 MWh/yr ≈ 1.3 household-years.
Name: 32, dtype: object,
score=0.175,
answer_score=0.0,
ref_score=0.5,
na_score=1.0,
context='Doc ID: morrison2025\nURL: https://arxiv.org/pdf/2503.05804\nTitle: Holistically Evaluating the Environmental Impact of Creating Language Models\nText: Published as a conference paper at ICLR 2025\nHOLISTICALLY EVALUATING THE ENVIRONMENTAL\nIMPACT OF CREATING LANGUAGE MODELS\nJacob Morrison1 Clara Na2 Jared Fernandez2\nTim Dettmers1,2 Emma Strubell1,2 Jesse Dodge1\n1Allen Institute for AI 2Carnegie Mellon University\njacobm@allenai.org\nABSTRACT\nAs the performance of artificial intelligence systems has dramatically increased,\nso too has the environmental impact of creating these systems. While many model\ndevelopers release estimates of the power consumption and carbon emissions from\nthe final training runs for their latest models, there is comparatively little trans-\nparency into the impact of model development, hardware manufacturing, and total\nwater usage throughout. In this work, we estimate the real-world environmental\nimpact of developing a series of language models, ranging from 20 million to 13\nbillion active parameters, trained on up to 5.6 trillion tokens each. When account-\ning for hardware manufacturing, model development, and our final training runs,\nwe find that our series of models released 493 metric tons of carbon emissions,\nequivalent to powering about 98 homes in the United States for one year, and\nconsumed 2.769 million liters of water , equivalent to about 24.5 years of water\nusage by a person in the United States, even though our data center is extremely\nwater-efficient. 
We measure and report the environmental impact of our model development; to the best of our knowledge we are the first to do so for LLMs, and

Doc ID: morrison2025
URL: https://arxiv.org/pdf/2503.05804
Title: Holistically Evaluating the Environmental Impact of Creating Language Models
Text: com/Page.aspx?id=1012
13https://anysilicon.com/die-per-wafer-formula-free-calculators/
14https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
Table 2: We list the estimated power usage, carbon emissions, and water consumption from training our dense transformers, ranging from 20 million to 13 billion parameters, trained on 1.7 to 5.6 trillion tokens, and a mixture-of-experts model with 1 billion active and 7 billion total parameters, trained to 5 trillion tokens. We find that the environmental impact is quite high, even for our relatively small models. Training our series of models emitted equivalent carbon to over 65 years of electricity use by the average household in the U.S., and consumed equivalent water to the average person in the U.S. for about 17 years.
* One of the original OLMo 7B models was trained on LUMI, which runs entirely on hydroelectric power. See Groeneveld et al. (2024) for more information.
† denotes unreleased models that were trained for various internal experiments.

Model         | Power usage (MWh) | Carbon emissions (tCO2eq) | Equiv. energy usage (1 home, U.S.) | Water consumption (kL) | Equiv. water usage (1 person, U.S.)
Gemma 2B & 9B | –   | 131 | 25 yrs, 11 mo | –   | –
Llama 2 7B    | 81  | 31  | 6 yrs, 1 mo   | –   | –
Llama 2 13B   | 162 | 62  | 12 yrs, 2 mo  | –   | –
Llama 3.1 8B  | –   | 420 | 83 years      | –   | –
Llama 3.2 1B  | –   | 107 | 14 years      | –   | –
OLMo 20M†     | 0.8 | 0.3 | 3 weeks       | 1   | 3 days
OLMo 60M†     | 1.2 | 0.4 | 1 month       | 1.6 | 5 days
OLMo 150M†    | 2.4 | 1   | 2 mo, 1 wk    | 3.6 | 12

Doc ID: morrison2025
URL: https://arxiv.org/pdf/2503.05804
Title: Holistically Evaluating the Environmental Impact of Creating Language Models
Text: pact of our model development; to the best of our knowledge we are the first to do so for LLMs, and we find that model development, the impact of which is generally not disclosed by most model developers, amounted to ∼50% of that of training. By looking at detailed time series data for power consumption, we also find that power usage throughout training is not consistent, fluctuating between ∼15% and ∼85% of our hardware’s maximum power draw, with negative implications for grid-scale planning as demand continues to grow. We close with a discussion on the continued difficulty of estimating the environmental impact of AI systems, and key takeaways for model developers and the public at large.
1 INTRODUCTION
In recent years, the field of artificial intelligence has progressed at an unprecedented pace, driven in large part by the development and deployment of large language and multimodal models. However, the development of these models comes with significant environmental costs (Schwartz et al., 2020; Strubell et al., 2020; Wu et al., 2022). Training these models requires massive computational resources, which, in turn, require large amounts of energy. Powering training both emits carbon (by burning fossil fuels) and consumes water (by evaporating or polluting it in power plants, data centers, and hardware manufacturing processes; Li et al. (2023)). 
There is a growing demand for\nenergy to power AI workloads, with projections estimating that datacenters may consume upwards\no\n\nDoc ID: morrison2025\nURL: https://arxiv.org/pdf/2503.05804\nTitle: Holistically Evaluating the Environmental Impact of Creating Language Models\nText: where we\nrank each model by both its total water consumption\nand its CO2 emissions. Our small models (<1B param-\neters) were trained on 1.7 trillion tokens, OLMo 1B was\ntrained on 3 trillion, OLMo 2 7B was trained on 4 tril-\nlion, OLMoE was trained on 5 trillion, and OLMo 2\n13B was trained on 5.6 trillion. We see that the total\nenvironmental impact for larger training runs is quite\nhigh, and increases quickly with model and dataset size.\nIn this paper, we estimate the energy use and\nenvironmental impacts caused by training the\nOLMo series of transformer language models\n(Groeneveld et al., 2024; OLMo et al., 2025),\nranging in size from 20 million to 13 billion\nactive parameters, trained on 1.7 to 5.6 trillion\ntokens. To do this, we calculate Scope 2 CO 2\nemissions in accordance with the Greenhouse\nGas Protocol’s definitions,3 and Scope 1 and 2\nwater consumption following Li et al. (2023);\nin addition, we calculate “upstream” embod-\nied carbon and water consumption, and provide\n“downstream” estimates from use of our mod-\nels (which are part, but not all, of Scope 3).\nImportantly, we calculate (i) electricity con-\nsumption, (ii) carbon emissions, and (iii) wa-\nter consumption at three points in the machine\nlearning pipeline: early model development\n(e.g., hyperparameter tuning and experiments\nbefore the final training run), training of the\nmain model, and inference. 
To the best of our knowledge, we are the first to report this information for model development of large langu

Doc ID: strubell2019
URL: https://arxiv.org/pdf/1906.02243
Title: Energy and Policy Considerations for Deep Learning in NLP
Text: tional resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited to the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family’s home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1,
1 Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers. To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. 
We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon e

Doc ID: dodge2022
URL: https://arxiv.org/pdf/2206.05229
Title: Measuring the Carbon Intensity of AI in Cloud Instances
Text: the energy consumption of training a large language model comprising over 6.1 billion parameters during 8 days on 256 NVIDIA A100s. The total energy amounted to a staggering 13.8 MWh. This model was not trained to completion, but only until 13%; a full training run would take 60 days. Thus, we estimate the total energy consumption to train this model to completion would be approximately (60/8) ∗ 13.8 = 103.5 MWh, or 103,500 kWh — almost 2800 times more than training the BERT-small model!
4.2 Computer Vision
DenseNets. We trained three sizes of DenseNets [19] on MNIST [25]. The jobs lasted between 20 and 25 minutes and consumed between 20 and 38 Wh (or 0.02 to 0.04 kWh) of electricity, which is negligible compared to the other models.
Vision Transformers. We evaluated the energy consumption during the training of five sizes of Vision Transformers [9] on ImageNet [7]. For the smallest ViT experiment (ViT tiny), training lasted around 19 hours on a single V100 and consumed approximately 1.7 kWh. For the largest one (ViT huge), training lasted more than 9 days on 4 V100s and consumed approximately 237 kWh. The full list of models can be found in Table 2.
5 EMISSIONS BY REGION AND TIME OF DAY
Using the methodology presented above, we provide some of the first measurements of the differences of actual datacenters from a major cloud provider. 
Importantly, what we have is a time series of marginal emissions: for example,\nif a job were to run from 1 pm to 5 pm in the US West region wit\n\nDoc ID: morrison2025\nURL: https://arxiv.org/pdf/2503.05804\nTitle: Holistically Evaluating the Environmental Impact of Creating Language Models\nText: our\nknowledge, we are the first to report this in-\nformation for model development of large lan-\nguage models, and we find the environmental\nimpact of developing even our relatively small\nmodels (only up to 13B parameters) is equivalent to burning 2.1 gasoline tanker trucks of fuel, or\nthe amount of water consumed by one average person in the United States in about 7.5 years. We\nencourage the reader to consider larger models released by other organizations to have equivalently\nlarger environmental impacts.\nOur methodology draws upon best practices from recent publications, aiming to provide the most\nthorough reporting yet of the environmental impact of LLMs. For example, unlike previous works\nthat assume GPUs operate at 100% of their theoretical maximum power draw (Dubey et al., 2024)\nand report only the cost to train a small set of released models, we measure power consumption\nat sub-second intervals throughout training. We focus our efforts on a wide range of model sizes,\noptimized for widespread deployment (Dubey et al., 2024; Mehta et al., 2024; Gemma Team et al.,\n2024), and estimate what the environmental impact would be if our models were deployed in a va-\nriety of different scenarios. We find that in some scenarios, our models would need to run inference\non a few billion instances to match the electricity consumed, carbon emitted, and water consumed\nof the entire training process, a figure that can be reached by production systems in weeks to months\nbased on current u\n\nDoc ID: jegham2025\nURL: https://arxiv.org/pdf/2505.09598\nTitle: How Hungry is AI? 
Benchmarking Energy, Water and Carbon Footprint of LLM Inference\nText: Li, Adam Michaleas, Michael Jones,\nWilliam Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. From words to\nwatts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High\nPerformance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2023.\n[21] The Green Grid. PUE™: A Comprehensive Examination of the Metric. February 2012. White\nPaper 49.\n[22] International Organization for Standardization (ISO) and International Electrotechnical Com-\nmission (IEC). Information technology – Data centres – Key performance indicators – Part\n2: Power usage effectiveness (PUE), April 2016. URL https://www.iso.org/standard/\n63211.html.\n[23] U.S. Environmental Protection Agency (EPA). Emissions & Generation Resource Integrated\nDatabase (eGRID). https://www.epa.gov/egrid, 2025.\n[24] International Energy Agency (IEA). Emissions Factors. 2025.\n[25] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for\nmodern deep learning research. In Proceedings of the AAAI conference on artificial intelligence,\nvolume 34, pages 13693–13696, 2020.\n[26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-\nthée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open\nand efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.\n[27] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,\nNikolay Bashlykov, Soumya Batra, Prajjwal B\n\nDoc ID: dodge2022\nURL: https://arxiv.org/pdf/2206.05229\nTitle: Measuring the Carbon Intensity of AI in Cloud Instances\nText: 2, June 21–24, 2022, Seoul, Republic of Korea\nModel BERT BERT 6B Dense Dense Dense ViT ViT ViT ViT ViT\nfinetune pretrain Transf. 
(Table 2, reconstructed from the flattened extraction; the model-name header row precedes this chunk.)

Model         | GPU      | Hours | kWh
BERT finetune | 4·V100   | 6     | 3.1
BERT pretrain | 8·V100   | 36    | 37.3
6B Transf.    | 256·A100 | 192   | 13,812.4
DenseNet 121  | 1·P40    | 0.3   | 0.02
DenseNet 169  | 1·P40    | 0.3   | 0.03
DenseNet 201  | 1·P40    | 0.4   | 0.04
ViT Tiny      | 1·V100   | 19    | 1.7
ViT Small     | 1·V100   | 19    | 2.2
ViT Base      | 1·V100   | 21    | 4.7
ViT Large     | 4·V100   | 90    | 93.3
ViT Huge      | 4·V100   | 216   | 237.6

Table 2. For the 11 models in our analysis: the type of GPU, the number of GPUs of that type, the number of hours, and the energy used in kWh. For example, our BERT language modeling (BERT LM) experiment used 8 V100 GPUs for 36 hours and used a total of 37.3 kWh. We note our training run of the 6 billion parameter transformer only trained for approximately 13% of the time it would take to train to completion; we estimate a full training run would consume approximately 103,593 kWh.
4.1 NLP
BERT Training. We monitored the energy consumption while training a BERT-small model [8] for approximately 36 hours on 8 NVIDIA V100 GPUs. That training run consumed over 37 kWh of electricity.
BERT Finetuning. We tracked the energy consumption while finetuning the BERT-small model on a standard natural language inference task [48, MNLI] for approximately 6 hours on 4 NVIDIA V100 GPUs. Our finetuning run consumed around 3.2 kWh of electricity, i.e., less than one tenth that due to BERT-small pre-training.
6 Billion Parameter Transformer. We tracked the energy consumption of training a large language model comprising over 6.1 billion parameters dur

Doc ID: zschache2025
URL: https://arxiv.org/pdf/2508.14170
Title: Comparing energy consumption and accuracy in text classification inference
Text: Comparing energy consumption and accuracy in text classification inference
Johannes Zschache and Tilman Hartwig
Application Lab for AI and Big Data, German Environment Agency, Alte Messe 6, Leipzig, 04103, Saxony, Germany.
*Corresponding author(s). 
E-mail(s): tilman.hartwig@uba.de;\nContributing authors: johannes.zschache@uba.de;\nAbstract\nThe increasing deployment of large language models (LLMs) in natural language\nprocessing (NLP) tasks raises concerns about energy efficiency and sustainabil-\nity. While prior research has largely focused on energy consumption during\nmodel training, the inference phase has received comparatively less attention.\nThis study systematically evaluates the trade-offs between model accuracy and\nenergy consumption in text classification inference across various model archi-\ntectures and hardware configurations. Our empirical analysis shows that the\nbest-performing model in terms of accuracy can also be energy-efficient, while\nlarger LLMs tend to consume significantly more energy with lower classifica-\ntion accuracy. We observe substantial variability in inference energy consumption\n(<mWh to >kWh), influenced by model type, model size, and hardware spec-\nifications. Additionally, we find a strong correlation between inference energy\nconsumption and model runtime, indicating that execution time can serve as\na practical proxy for energy usage in settings where direct measurement is not\nfeasible. These findings have implications for sustainable AI develop')]
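The gap between the predicted 20.7 and the expected 1.3 in the q272 record above is arithmetic, not retrieval: the expected answer divides the measured 13.8 MWh (the 13% partial run from dodge2022) by ~10.7 MWh/yr of average U.S. household electricity use, while the prediction scaled up to the 103.5 MWh full-run estimate and then divided by 5 MWh/yr despite citing 10.6. A minimal sketch of both figures (the 10.7 MWh/yr divisor is taken from the expected explanation, not from the quoted contexts):

```python
# Household-years arithmetic for the q272 record.
measured_mwh = 13.8      # energy of the partial (13%) training run (dodge2022)
household_mwh_yr = 10.7  # avg. U.S. household electricity use, per the expected explanation

expected_years = measured_mwh / household_mwh_yr  # ~1.3 household-years (expected answer)
full_run_mwh = (60 / 8) * measured_mwh            # 103.5 MWh, the figure the prediction scaled to
print(f"{expected_years:.1f}", f"{full_run_mwh:.1f}")  # -> 1.3 103.5
```

Even under the prediction's full-run framing, 103.5 MWh ÷ 10.7 MWh/yr gives about 9.7 household-years, not 20.7, which is consistent with answer_score=0.0.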
========== MIDDLE OF OUTPUT TRUNCATED ==========