# Alpine: condo FairShare and resource access 

The purpose of this document is to provide a detailed overview of 
Alpine's FairShare and resource access policies. The intended audience is 
Alpine's institutional-level condo contributors.


## Goals and philosophy 

The Alpine supercomputer, hosted at CU Boulder and administered by Research 
Computing at CU Boulder (CURC), supports institutional condo purchases, the 
first two of which came online in the fall of 2022. This document outlines 
how jobs affiliated with condo contributors will be assigned priority and 
billing weight on the system. It does not aim to address FairShare scoring 
for individual users and allocations, as these involve more detail and 
complexity than this document intends to cover. 

The goals of this policy are to maximize overall system utilization, 
assign appropriate priority and billing weights proportional to 
contribution, and maintain a straightforward and accountable 
configuration. This document describes the initial policy, but changes to 
any or all of it are possible based on contributor feedback, user issues, 
CURC’s own data-gathering and monitoring, and other factors. 

## Node, partition, and FairShare policies 

### Overview 

All nodes, whether contributed by CU Boulder or by another institution, 
will be added to a shared partition to which all Alpine users will have 
access. This does not preclude adding nodes to one or more additional 
partitions with different settings and attributes. At minimum, CURC will 
create one dedicated partition and QOS per contributor. 

By configuring all nodes into a common pool by default, CURC intends to 
better maximize system utilization and ensure that users are not locked 
out of Alpine due to (for example) power or networking issues that only 
affect their institution’s nodes. Users will always have the option to 
request specific node attributes for their jobs – for example, presence of 
GPUs, InfiniBand, or high clock speeds – and those requests will remain 
part of all scheduling calculations. 

Through a combination of QOSs, trackable resources, partitions, and other 
Slurm settings, users will receive some preference for running jobs on 
their own nodes. However, preemption of jobs will not be supported. 

As one of several determinants of job priority, CURC will assign a 
FairShare score in the Slurm database for each condo buy-in on Alpine 
based on the institution’s total contribution to the system. This will 
account for total cores, clock speeds, cores on nodes with high memory 
(>=1TB), and GPUs. All resources will be billed back accordingly at the 
time of job submission. 

### GPU calculations 

To compute the GPU term in the FairShare equation, CURC will multiply 
total contributed GPUs by an "acceleration factor". This factor is derived 
from the MATLAB GPU benchmark series, which CURC has used to measure 
computational performance on a variety of tasks on a representative GPU 
against the equivalent on a CPU. 
As of February 2025, the acceleration factor of an NVIDIA A100, NVIDIA L40, 
or AMD MI100 GPU is 108.6, meaning that our benchmarks indicate a GPU 
provides a 108.6x speedup (on average) over a CPU. Scaling this to a typical 
Alpine node, the number of SUs allocated for a 64-core node with 3 GPUs 
would be a factor of 5-6 higher than for a 64-core CPU-only node. 
This figure and the choice of benchmarking software are subject to change 
in the future based on new information. 
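
As a rough check of the node-level comparison above, the following Python sketch applies the 108.6 acceleration factor to a 64-core node with 3 GPUs. It assumes each CPU core carries a weight of 1.0 (as in the billing-weight examples in this document) and ignores memory. The function name is illustrative only and is not part of any CURC tooling.

```python
GPU_ACCELERATION_FACTOR = 108.6  # value stated in this document (February 2025)
CPU_CORE_WEIGHT = 1.0            # assumption: one unit per CPU core


def node_weight(cores: int, gpus: int) -> float:
    """Return the combined CPU + GPU weight of one node (memory ignored)."""
    return cores * CPU_CORE_WEIGHT + gpus * GPU_ACCELERATION_FACTOR


gpu_node = node_weight(cores=64, gpus=3)
cpu_node = node_weight(cores=64, gpus=0)
ratio = gpu_node / cpu_node
print(f"GPU node: {gpu_node:.1f}, CPU-only node: {cpu_node:.1f}, ratio: {ratio:.2f}")
```

Under these assumptions the GPU node weighs roughly 6.1 times as much as the CPU-only node, or about 5.1 times more, which is broadly in line with the factor of 5-6 quoted above.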

### Complete equation 

The exact score will be derived from the following equation: 

```
FairShare = floor ( (standard node CPU core hours contributed 
  * average clock speed / minimum clock speed on Alpine) 
  + (GPUs contributed * GPU acceleration factor) # changed in this version 
  + (high-memory node CPU core hours contributed 
  * average clock speed / minimum clock speed on Alpine 
  * mem-per-core_high-mem / mem-per-core_standard=256GB)) 
```

To distribute the FairShare score, each institutional contributor will 
receive an account (containing all its affiliated users) that will hold 
the full score. In turn, each of those accounts will contain a "general" 
subaccount that will have 20% of the FairShare pool and a "projects" 
subaccount with 80%. The general/projects pattern is already in place on 
Alpine. Contributors wishing for a different accounting arrangement from 
"general/projects" should reach out to CURC with specific requests. This 
crediting and billing system is intended to balance contributions and 
utilization. 

### Example and comment on FairShare and billing 

Suppose the lowest rated clock speed available on any Alpine node is 
2.2GHz. Acme University contributes the following order: 

- 16x 64-core 3.2GHz CPU nodes with 256 GB RAM 
- 2x GPU nodes with 3 A100 GPUs each, same CPU 
- 4x 1TB high-memory nodes, same CPU 

For this order, Acme would receive the following score: 

```
Acme University FairShare = floor ( 
  16 * (64+2) * 3200MHz / 2200MHz           # 1536 
  + 2 * 3 GPUs * 108.6/GPU                  #  651.6 
  + 4 * 64 * (1028 GB RAM / 256 GB RAM))    # 1028 
  = 3,215 
```
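
The arithmetic in the example can be checked with a short Python sketch. Each term below copies the corresponding line of the worked example exactly as written, including the `(64 + 2)` core count and the 1028 GB figure. The variable names are illustrative assumptions, not CURC's configuration.

```python
import math

MIN_CLOCK_MHZ = 2200             # lowest rated clock speed assumed in the example
NODE_CLOCK_MHZ = 3200            # clock speed of Acme's CPUs
GPU_ACCELERATION_FACTOR = 108.6  # value stated in this document

# Each term mirrors one line of the worked example as written.
cpu_term = 16 * (64 + 2) * NODE_CLOCK_MHZ / MIN_CLOCK_MHZ
gpu_term = 2 * 3 * GPU_ACCELERATION_FACTOR
high_mem_term = 4 * 64 * (1028 / 256)

# The final score is the floor of the sum of the three terms.
fairshare = math.floor(cpu_term + gpu_term + high_mem_term)
print(cpu_term, gpu_term, high_mem_term, fairshare)
```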

Note that prior to the change described in this policy, the GPU term was 2 
x 3 x 6912 = 41,472. 

Scores will be revised when new nodes are contributed. 

GPU, CPU, and memory (along with potentially other resources in the 
future) will be billed to users in proportion to their weight in the 
FairShare calculation. For example, a partition might be configured with 
any of the following resource billing weights: 

```
TRESBillingWeights="CPU=1.0,GRES/gpu:a100=108.6" 
TRESBillingWeights="CPU=1.0,GRES/gpu:mi100=108.6" 
TRESBillingWeights="CPU=1.0,GRES/gpu:l40=108.6"
TRESBillingWeights="CPU=1.0,Mem=4.0" 
```
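
To show how such weights could translate into a charge, the sketch below parses one of the weight strings above and bills a hypothetical job. It assumes the weighted resources are simply summed, and it leaves memory out because the unit attached to a memory weight depends on the cluster configuration. Both helper functions and the job request are illustrative, not part of Slurm or CURC tooling.

```python
def parse_billing_weights(spec: str) -> dict[str, float]:
    """Parse a string such as 'CPU=1.0,GRES/gpu:a100=108.6' into a dict."""
    weights = {}
    for item in spec.split(","):
        name, value = item.split("=")
        weights[name.strip().lower()] = float(value)
    return weights


def billing_units(weights: dict[str, float], requested: dict[str, float]) -> float:
    """Sum each requested resource times its weight; unweighted resources bill 0."""
    return sum(
        weights.get(name.lower(), 0.0) * amount
        for name, amount in requested.items()
    )


weights = parse_billing_weights("CPU=1.0,GRES/gpu:a100=108.6")
job = {"CPU": 16, "GRES/gpu:a100": 1}  # hypothetical job request
print(billing_units(weights, job))     # 16 CPU units plus one A100 at 108.6
```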

Additionally, CURC reserves the authority to modify this policy or 
specific jobs if users abuse it – if, for example, a contributor purchases 
many GPUs and uses their share of "credit" from those GPUs to exploit 
another contributor’s CPU-only contributions. 

## Reporting 

CURC will provide regular reports to institutional Alpine partners to 
demonstrate they are receiving appropriate value for their contributions. 
The full content of these reports is beyond the scope of this document, 
but it will include measurements of resource utilization, statistics 
useful for quality assurance, and the number and size of jobs run. 

CURC will generate automated reports monthly and share them through 
appropriate channels, either written or in person. CURC will also produce 
more detailed reports examining the overall value of the system to condo 
contributors once per quarter. Reports may be addressed to the institution 
or, if applicable, to the condo buyer within the institution (for example, 
a PI or department). 

Note that some Slurm usage querying tools are also available for general 
system users, including `sinfo`, `sshare`, `sreport`, and `squeue`. 
Public-facing reports and dashboards are available online through CURC’s 
instance of XDMoD. 

## Key details 

Please note the following about the way Slurm calculates priority: 

- A job’s priority is based on multiple factors in addition to the 
FairShare score, including (but not limited to) job age, resources 
requested, job size, and QOS. 
- The FairShare component of the final priority calculation is determined 
by normalizing the institution’s FairShare against the highest FairShare 
on the system. 
- FairShare scores for sub-accounts (e.g., acme-projects and acme-general) 
are normalized against each other. These sub-accounts will be assigned 
scores at an appropriate 80/20 ratio, which will not need to be changed to 
accommodate expansions or new contributions. 
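
To illustrate why the 80/20 split does not need to change when a contributor expands, the sketch below normalizes hypothetical sub-account shares against each other. The account names follow the acme-projects and acme-general example above. The raw share values and the function are assumptions for illustration, not CURC's actual configuration.

```python
def normalize_siblings(raw_shares: dict[str, float]) -> dict[str, float]:
    """Divide each sibling's raw share by the total across all siblings."""
    total = sum(raw_shares.values())
    return {name: share / total for name, share in raw_shares.items()}


# Hypothetical raw shares at an 80/20 ratio.
normalized = normalize_siblings({"acme-projects": 80, "acme-general": 20})
print(normalized)

# Scaling both values by the same factor leaves the normalized split unchanged.
scaled = normalize_siblings({"acme-projects": 800, "acme-general": 200})
print(scaled)
```

Because only the ratio between siblings matters after normalization, the parent account's total score can grow without the sub-account values being touched.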

In most but not all cases, changes to FairShare scores, QOSs, and other 
Slurm settings are straightforward to implement. Systems administrators 
can apply these changes quickly once CURC and the contributor(s) agree 
upon them. 

For additional details about how Slurm implements this score, please 
consult the official Slurm documentation (in particular, docs pertaining 
to the "Multifactor Priority Plugin") or follow up with CURC with specific 
questions. This policy is now in effect and is subject to change at any 
time. 


Alpine is jointly funded by the University of Colorado Boulder, the 
University of Colorado Anschutz, Colorado State University, and the 
National Science Foundation (award 2201538).

