Curating LLM Tuning Data from the FineWeb Dataset for High-fidelity Domain Adaptation

We created a post-training dataset from the FineWeb dataset for high-fidelity domain adaptation of an open-weight LLM (Google Flan). Parameter-efficient fine-tuning through prompt tuning produced a marked improvement in perplexity scores and demonstrated that the tuned model can generalize from information in the tuning dataset.
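For context, the sketch below shows what prompt tuning with the Hugging Face peft library can look like. The checkpoint name, initialization prompt, and number of virtual tokens are illustrative assumptions, not the configuration used in this work.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

# Assumed checkpoint for illustration; the actual Flan variant is not specified here.
model_name = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prompt tuning trains only a small set of "virtual token" embeddings,
# leaving the base model weights frozen (parameter-efficient fine-tuning).
peft_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Answer questions using the curated domain corpus:",  # illustrative
    num_virtual_tokens=20,  # illustrative value
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # shows how few parameters are actually trained
```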

The work was selected for an oral presentation at AGU24. Slides are attached.

AGU-LLM-talk

Benchmarking power system optimization: CPU vs GPU

Power systems are getting increasingly complicated with renewable energy integration, distributed generation, storage, increasing demand, and more. Naturally, optimizing power systems is becoming more computationally intensive. While the gaming and AI industries have adopted GPUs to speed up intensive computation, adoption seems limited in power system planning. As an applied AI/ML researcher for energy and climate, I wanted to explore the extent to which GPU-based operations could speed up power system optimization.

I benchmarked the optimization of power systems of different sizes using CPU- and GPU-based approaches. I modeled power systems of various node counts using PyPSA (a popular Pythonic framework) and a bespoke, minimal GPU-based setup. PyPSA was chosen for its simplicity of implementation and growing adoption.

The PowerSystem class, which models the fundamental components of an electrical grid, includes:

  • Buses (nodes) representing connection points
  • Generators with specified capacities and costs
  • Loads (power demand) at various points
  • Transmission lines with physical parameters (reactance, resistance, capacity)

A simple representation of the power system with a linear-chain network topology is chosen for experimental purposes only. The Appendix details the configuration, with a 100-node system chosen as an example. A minimal sketch of such a PowerSystem class follows.
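Below is an illustrative sketch of what such a container might look like. The field and function names are assumptions for this post, not the exact class used in the benchmark; the default values mirror the Appendix configuration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PowerSystem:
    """Minimal container for a linear-chain power system (illustrative only)."""
    n_buses: int
    gen_capacity: np.ndarray    # MW, one generator per bus
    gen_cost: np.ndarray        # $/MWh, one cost per generator
    load: np.ndarray            # MW demand per bus
    line_capacity: np.ndarray   # MW limit for each of the n_buses - 1 lines
    line_reactance: np.ndarray  # per unit
    line_resistance: np.ndarray # per unit

def chain_system(n: int) -> PowerSystem:
    """Build an n-bus chain resembling the Appendix configuration."""
    i = np.arange(n)
    return PowerSystem(
        n_buses=n,
        gen_capacity=np.full(n, 1000.0),
        gen_cost=np.linspace(50.0, 150.0, n),
        load=500.0 + 100.0 * np.sin(2 * np.pi * i / n),
        line_capacity=np.full(n - 1, 1000.0),
        line_reactance=np.full(n - 1, 0.1),
        line_resistance=np.full(n - 1, 0.01),
    )
```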

A power system optimization problem typically involves finding the most cost-effective way to meet electrical demand across a network while respecting physical constraints like transmission line capacities and generator limitations. This is a linear programming (LP) problem in which we minimize generation costs subject to power flow and capacity constraints.
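To make the formulation concrete, here is an illustrative LP for a tiny 4-bus chain using SciPy's linprog. This is not the benchmark code (that used PyPSA and a CuPy/OSQP setup); it only restates the same problem: for a lossless chain, the flow on line k is the cumulative generation minus demand up to bus k, and we minimize total generation cost subject to line and generator limits.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny 4-bus chain, purely illustrative (not the benchmark configuration).
n = 4
cost = np.array([50.0, 60.0, 70.0, 80.0])        # $/MWh per generator
demand = np.array([500.0, 400.0, 450.0, 550.0])  # MW per bus
g_max = np.full(n, 1000.0)                       # generator limits, MW
line_cap = np.full(n - 1, 1000.0)                # line limits, MW

# For a chain, flow on line k = sum over buses 0..k of (generation - demand).
A = np.tril(np.ones((n - 1, n)))                 # cumulative-sum matrix
A_ub = np.vstack([A, -A])                        # |flow_k| <= line_cap
b_ub = np.concatenate([line_cap + A @ demand, line_cap - A @ demand])
A_eq = np.ones((1, n))                           # total generation = total demand
b_eq = np.array([demand.sum()])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, g) for g in g_max], method="highs")
print(res.status, res.fun)  # status 0 = optimal; res.fun is total generation cost
```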

For the GPU-based optimization, I used the CuPy library. The main way a GPU speeds up computation is by dividing the work into batches and running them across a large number of parallel threads. I chose a batch size of 100 and used an OSQP solver configured for GPU execution.
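The batching idea can be illustrated with CuPy as below. This is a hypothetical helper, not the benchmark's actual solver code (which relied on OSQP); it simply evaluates a constraint matrix-vector product on the GPU in row batches of 100.

```python
import cupy as cp

def batched_matvec_gpu(A, x, batch_size=100):
    """Illustrative only: compute A @ x on the GPU in row batches,
    mirroring the batch size of 100 used in the benchmark."""
    A_gpu = cp.asarray(A)          # copy host arrays to GPU memory
    x_gpu = cp.asarray(x)
    out = cp.empty(A_gpu.shape[0], dtype=A_gpu.dtype)
    for start in range(0, A_gpu.shape[0], batch_size):
        stop = min(start + batch_size, A_gpu.shape[0])
        out[start:stop] = A_gpu[start:stop] @ x_gpu
    return cp.asnumpy(out)         # copy the result back to the host
```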

The study tested both implementations across eight different system sizes:

  • Small systems: 10, 100 nodes
  • Medium systems: 500, 1,000, 2,000 nodes
  • Large systems: 5,000, 10,000, 20,000 nodes

For each size, the benchmark measured:

  • Solution status and correctness
  • Execution time for both CPU and GPU implementations
  • Objective function values (total generation cost)

Summary of results: I logged the time taken and the optimized objective values for both the CPU and GPU runs. The optimized objective values are comparable for most sizes (except the smallest).

| Size (nodes) | CPU Time (s) | GPU Time (s) | Speedup | Objective Value Difference (%) |
|---|---|---|---|---|
| 10 | 0.659 | 0.187 | 3.532 | 20.16% |
| 100 | 0.553 | 0.227 | 2.437 | 2.08% |
| 500 | 0.693 | 0.884 | 0.784 | 0.41% |
| 1,000 | 1.203 | 1.772 | 0.679 | 0.21% |
| 2,000 | 2.652 | 4.192 | 0.633 | 0.10% |
| 5,000 | 27.465 | 13.599 | 2.020 | 0.04% |
| 10,000 | 179.226 | 39.463 | 4.542 | 0.02% |
| 20,000 | 1,177.919 | 126.769 | 9.292 | 0.01% |

While I found that, for the system sizes chosen, the GPU generally speeds up computation, the magnitude of the speedup varies with system size. The speedup is relatively high for smaller systems, drops below 1 for medium sizes, and increases dramatically again for larger sizes.

The results reveal several important insights about GPU acceleration for power system optimization:

Performance Crossover Point

There appears to be a “crossover point” around 2,000 nodes where GPU acceleration becomes clearly advantageous. This suggests that:

  • For smaller systems, the overhead of GPU memory transfers may offset potential gains
  • Larger systems better utilize GPU parallelism, leading to substantial speedups

Scalability Characteristics

The GPU implementation shows superior scalability:

  • CPU time grows roughly quadratically with system size
  • GPU time grows more linearly, especially for larger systems
  • The speedup factor increases with system size, suggesting even better performance for very large systems

Implications

The results show remarkable speedups from GPU-based optimization, favoring its use for large systems that require fast optimization. However, GPUs consume more electricity and water, and these environmental factors must be taken into account during implementation. Ultimately, it’s about the trade-off between time saved, GPU costs, and potential environmental costs.

Use the following GitHub gist for replication: https://gist.github.com/kshitizkhanal7/4bed7ac04f9f89f64c99a5d297a611b7

Appendix: Reference system with 100 nodes

The system represents a large-scale power transmission network with several key characteristics (an illustrative PyPSA sketch of this configuration follows the list):

  1. System Structure
    • 100 buses (B0 through B99)
    • 100 generators (G0 through G99)
    • 100 loads (L0 through L99)
    • 99 transmission lines connecting adjacent buses
  2. Generator Characteristics. Each generator has:
    • A maximum capacity of 1000 MW
    • A cost that varies linearly across the system:
      • G0 starts at 50 $/MWh
      • G99 ends at 150 $/MWh
      • Each generator’s cost increases by approximately 1 $/MWh
    This cost gradient creates an interesting optimization problem where cheaper generators are preferred, but transmission constraints may force the use of more expensive ones.
  3. Load Pattern. Each load follows a sinusoidal pattern:
    • Base load of 500 MW
    • Variation of ±100 MW based on position
    • The formula P = 500 + 100*sin(2πi/100) creates a wave pattern across the system
    This pattern mimics real-world load variations while maintaining mathematical tractability.
  4. Transmission Lines. Each line connecting adjacent buses has:
    • Capacity of 1000 MW
    • Reactance (X) of 0.1 per unit
    • Resistance (R) of 0.01 per unit
    These parameters create realistic power flow constraints.
  5. Optimization Problem Size. The complete system creates a substantial optimization problem with:
    • 100 decision variables (generator outputs)
    • 99 line flow constraints
    • 100 power balance constraints
    • 100 generator capacity constraints
    Total: ~400 constraints and ~100 variables
  6. Performance Results. For this 100-node system, the benchmark showed:
    • CPU time: 0.553 seconds
    • GPU time: 0.227 seconds
    • Speedup factor: 2.437x
    • CPU objective: 4.717929e+06
    • GPU objective: 4.815897e+06
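
For reference, here is an illustrative PyPSA sketch that assembles a chain system with the parameters above. It is not the exact benchmark script (see the linked gist for that), and it uses the newer n.optimize() entry point, which may be n.lopf() in older PyPSA versions.

```python
import numpy as np
import pypsa

n = pypsa.Network()
n_buses = 100

for i in range(n_buses):
    n.add("Bus", f"B{i}")
    n.add("Generator", f"G{i}", bus=f"B{i}",
          p_nom=1000,                                   # 1000 MW capacity
          marginal_cost=50 + 100 * i / (n_buses - 1))   # 50 -> 150 $/MWh
    n.add("Load", f"L{i}", bus=f"B{i}",
          p_set=500 + 100 * np.sin(2 * np.pi * i / n_buses))  # sinusoidal demand

for i in range(n_buses - 1):
    n.add("Line", f"Line{i}", bus0=f"B{i}", bus1=f"B{i + 1}",
          x=0.1, r=0.01, s_nom=1000)                    # reactance, resistance, capacity

n.optimize()        # linear optimal power flow (n.lopf() in older PyPSA versions)
print(n.objective)  # total generation cost
```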

Making the spectrum of ‘openness’ in AI more visible

A (very) recent history of openness in AI

Google released demos of Gemini last week with much fanfare, but no way to even test it except with a supposed integration with Bard.

Mistral AI tweeted a magnet link to one of its models. No fanfare. No press. Anyone with decent LLM skills could download, use, and even fine-tune the model. For open-source enthusiasts, it was a much better release than Gemini. This kind of access to the pretrained parameters of the neural network is called open weights. It lets users run the model for inference and fine-tune it.

Open weights are better than just a demo or access to a product like ChatGPT or an API, no doubt. The example of Mistral is a case in point: what seems to be open source might not be open source, or not fully open source. A post from The Register discusses in detail how Meta’s Llama 2 isn’t exactly open source despite the claims.

Other models are more open. BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) provides fully accessible source code and uses responsibly sourced training data, with support for diverse languages and cultures.

My main argument is that whenever an AI model is released for public consumption, where the model falls on the spectrum of openness should be clearly expressed and understood, without putting the burden of digging that information from the tome of license agreements on the user. AI, as a community of practice, should engage more in making that happen.

Spectrum of openness in AI

To make the idea of a spectrum of openness easier to understand, let’s take the example of openness in software. Openness, or a digital artifact being “open”, is often thought of as binary: something is either open or closed. A straightforward example is that Linux is open while Windows is not; OpenStreetMap is open while Google Maps is not.

Openness is not exactly binary; it’s a spectrum. It’s easier to understand with the example of open-source software, as the history of the free/open/libre software movements paves the way for discussions of openness in other artifacts such as data, research, science, etc. Software can be open source but still vary in the level of “freedom” it provides its users.

Here’s what a spectrum of freedom in open source software might look like:

  • Freedom to modify source code and redistribute
  • Freedom to modify source code, but not to redistribute
  • Freedom to modify source code of core components, but additional features are proprietary
  • Freedom to view source code, but not to modify

This is only for software that’s considered open source. Some freemium products are free to use, but their source code is not available, and they are sometimes mistaken for open source. This kind of freedom is only one dimension along which we can discuss the openness of software. There are other dimensions to consider, for example: community engagement and governance, language support, documentation, interoperability, commercial engagement, and more.

Extrapolating the same concepts to openness in AI, even for an open weights model, the following (at the very least) are most likely closed:

  • Training dataset (with all potential biases and ethical issues, including legal compliance and copyright issues)
  • Ethical guidelines and safety measures behind the creation of the model
  • Training code, methodology, hyperparameters, optimization techniques, post-training
  • Complete model architecture
  • Documentation
  • Objective evaluation following the norms of open, reproducible science
  • Organizational collaboration, governance
  • Finance, GPU, labor, and other resources necessary

Why is openness to all this information important?

Mainly because we should be able to trust AI before using it, just as we need to trust any product before we use it. Some examples of what trustworthy AI might look like:

  • Model architecture can be studied to make further developments. For example, the publication of the “Attention Is All You Need” paper with details on the attention mechanisms enabled much of the recent developments in Large Language Models.
  • An AI auditor can look at the training datasets and methodology to identify potential legal and ethical issues.
  • A startup developing an LLM-based app for their customers can understand potential security issues with the app and address those to save their customers from harm.
  • A lot of social bias and potential harm to underprivileged communities can be scrutinized so they can be avoided or remarkably mitigated.

However, as with all discussions of openness, the benefits of a level of privacy must be acknowledged. Information that might affect the privacy or security of stakeholders, including trademark and copyright issues, should remain private. Ultimately, it’s about finding the right trade-off to maximize social utility.

What next?

Now that we understand the value of openness and its visibility in AI, here are some actions the community can take.

We should develop a framework to define openness in AI.

The framework should cover all the information about a model that its users need to be aware of. Some efforts have already been made: Sunil Ramlochan distinguishes between open source, open weights, and restricted weights, and suggests a simple framework for openness in AI. We can consolidate similar efforts to develop a comprehensive framework for openness in AI.

We should encourage the practice of discussing openness of AI models/products, not just using them.

AI, as a community of practice, has enabled discussions on fine-tuning models and building products on top of them, pushing the limits of diffusing AI to the masses. In addition to this, we should also discuss openness. Openness is not only an idealistic concept for academic discussions, but also a property of models that can enable or hinder innovation and usefulness.

AI creators/companies should make openness information more accessible during release.

Instead of burying limitations in license agreements, creators/companies can state in accessible language where their models lie on the spectrum of openness. This would help users understand the possibilities and limitations more easily, and reduce friction for creators in enforcing compliance with the terms.

We should develop a community-supported index to track and discuss openness of AI models/products.

Leaderboards have been very helpful in facilitating discussions of the performance of recently released models. Since openness is more qualitative than benchmark performance, an index can be designed that represents the openness of models in various dimensions in quantitative or definitive qualitative terms. Open data has a rich history of using indices to assess the current state of openness and pinpoint areas for improvement. Open Knowledge Foundation’s Open Data Index and Web Foundation’s Open Data Barometer can serve as good references for the AI models’ openness index. It could be hosted on a platform with good community support, for instance, HuggingFace. [I was involved in the Open Data Index and Open Data Barometer as a country reviewer for Nepal.] Stanford University has recently launched the Foundation Model Transparency Index, which rated the openness of 10 large foundation models. The project can provide lessons for a more active, community-managed project in which the openness of models can be assessed and compared with others soon after release.

We should increase community engagement in developing licenses for AI models.

Similar to how Creative Commons has made licensing content (text, images, etc.) easier, we need a variety of licenses that suit AI models, developed with substantial community engagement. A notable initiative is the OpenRAIL project, which has made a great start but still feels niche. The conversation about licensing needs to be more mainstream, and for that we need greater community engagement. As someone involved with open data, open source software, and OpenStreetMap communities for over a decade, I have seen that vibrant community support is required to make open projects more widely accessible.

Summing up

Open access to AI research, openly available neural network architectures, open weights, and, in general, support for open source in various forms, even from large tech companies, have gotten us this far in making powerful AI more accessible. Openness in provenance information and sources, and the freedom this enables, will help make the future of AI more trustworthy.

Embedding Shiny App in WordPress

I mostly code in R and Python for my data science/machine learning projects and use WordPress for my portfolio blog. To communicate my experiments as interactive visualizations, I can publish them either as Shiny apps or as Quarto websites.

I wanted to test whether I could embed a Shiny app in WordPress. This would let me write the data analysis and interactive visualization code in R and publish it to my WordPress-based personal website.

The solution was to embed a Shiny app as an “iframe” in a WordPress blog.

An iframe (short for inline frame) is an HTML element that allows us to embed another HTML document within the current document. It provides a way to include external content from another source or website into your web page. The content within the iframe is displayed as a separate independent window within the parent document.

I published the example Shiny app at https://kshitizkhanal7.shinyapps.io/basic_shiny/. Then I used the following HTML code in the WordPress post to embed the app here.

<iframe src="https://kshitizkhanal7.shinyapps.io/basic_shiny/" width="150%" height="650"></iframe>

Let’s break it down:

  • <iframe>: This is the opening tag of the iframe element.
  • src="https://kshitizkhanal7.shinyapps.io/basic_shiny/": The src attribute specifies the URL of the external web page you want to display within the iframe. In this case, it is set to "https://kshitizkhanal7.shinyapps.io/basic_shiny/".
  • width="150%": The width attribute determines the width of the iframe. In this example, it is set to "150%", indicating that the iframe will be 150% of the width of its container. This allows the iframe to expand beyond the normal width of the container if needed.
  • height="650": The height attribute specifies the height of the iframe in pixels. In this case, it is set to "650" pixels.
  • </iframe>: This is the closing tag of the iframe element.

The resulting embedded app follows.

I plan to use this and explore other tools to create scrolly data stories in WordPress. Follow this space for more.

I am on Twitter @kshitizkhanal7.