WestGrid Legacy Systems Migration


Parent page: Migration from Legacy Regional Systems to Compute Canada National Systems 2016-18

Updates

v1.6 Sep. 12, 2017: 2018 migration and defunding plans added.
v1.5 Aug. 22, 2017: Some new systems are now available and in production. The text has been updated to reflect actuals.
v1.4 June 30, 2017: RAC 2017 allocations have been implemented on the continuing WG legacy systems.
v1.3 May 31, 2017: Schedule updates and various details for systems to be defunded in 2017.
v1.2 Mar. 20, 2017: Link to software availability documentation.
v1.1 Mar. 9, 2017: Nestor has been extended to June 1, 2017.
v1.0 Mar. 8, 2017: UofC dates revised to Aug. 31. General clean-up of the docs.
v0.92 Mar. 7, 2017: Revised dates. University of Alberta details updated. User access to Jasper and Hungabee extended to September 30, 2017.
v0.91 Mar. 3, 2017: UVic Nestor/Hermes details updated and confirmed (to June 1, 2017).

WestGrid RAC Implementation

The RAC 2017 allocations were implemented on WestGrid continuing systems and the new Cedar system on June 30, 2017. Users should now see their jobs running with their RAC 2017 priorities.

Defunded systems were not allocated, and will remain with their RAC 2016 priorities until the defunding dates.

Parallel, Grex, Orcinus and Bugaboo were included in RAC 2017, but will be defunded Mar.31/2018 and will not be in RAC 2018.

WestGrid Legacy Systems Migration Process 2016-2018

WestGrid will be defunding the following legacy systems. See also Where to Migrate? on the WG home pages for further details.

System | Site | Defunding Date | Recommended Migration Date(*) | Notes
Nestor and Hermes | University of Victoria | June 1, 2017 (DONE) | n/a | Due to delays with the new systems, UVic shared storage remained available until July 31, 2017. See below for details.
Breezy and Lattice | University of Calgary | August 31, 2017 (DONE) | n/a | Parallel and the shared storage system will continue until 2018.
Jasper and Hungabee | University of Alberta | September 30, 2017 | Now | UofA will be reformatting the storage system, so all users must migrate their data (see below).
Parallel | University of Calgary | March 31, 2018 | January 1, 2018 |
Orcinus | University of British Columbia | March 31, 2018 | January 1, 2018 |
Bugaboo | Simon Fraser University | March 31, 2018 | January 1, 2018 |
Grex | University of Manitoba | March 31, 2018 | January 1, 2018 |

(*) - We recommend that you start the migration process about 3 months before the actual defunding date in order to leave plenty of time for data migration, job migration and new system familiarization.

There are many details to consider: defunding dates, storage/data accessibility, recommendations for migration, availability of new Compute Canada resources, etc. The rest of this document reviews WestGrid's plans and recommendations.

"Defunded" means that the system is no longer funded, operated or maintained as part of Compute Canada's national cyberinfrastructure platform. For defunded systems, the host institution assumes full control of the system after the defunding date, including managing the storage file systems and backups. Users should contact the local site support of a defunded system for further information about the host institution's data retention and deletion policies.

Users on the above systems will need to migrate their data and jobs to new or continuing legacy systems before the defunding date. This article makes various recommendations and suggestions for that migration.

We recommend users move their data well in advance of the relevant defunding date to avoid network bottlenecks with file transfers. Any remaining data MAY BE DELETED after the Data Deletion Dates. Please note that due to privacy constraints WestGrid will not retain copies of user data. Users should ensure they take the appropriate steps to comply with any data management requirements set by their institution.

Please email support@westgrid.ca to request help with moving data or any other concerns with this migration policy.

Please visit the Migration Process page on WestGrid's website for further details about migrating off WestGrid’s legacy systems.

Data Retention and Deletion Policy for Legacy Systems

User data on defunded systems will be deleted. Users are responsible for migrating their data to alternate storage. WestGrid will give as much advance notice of data deletion dates as possible. WestGrid will keep users of defunded systems informed about timelines for migration, and will provide support for the migration process.

For defunded systems, the host institution assumes full control of the system, including managing the storage systems and backups. Users should contact Local Site Support of defunded systems for further information about the host institution’s data retention and deletion policies.

IMPORTANT: Data on defunded systems will be deleted after the published deletion dates. WestGrid has arranged with the host institutions to keep data until the published deletion date, but can make no guarantees about data retention after that date. WestGrid will not retain any long term or back-up copies of user data and as noted above users must arrange for migration of their data. Users should also ensure they take the appropriate steps to comply with any data management requirements their institution or project may require.

Where to migrate to?

Both the 2017/2018 and 2018/2019 resource allocation years are transition years, with a mix of legacy and new systems available for use. Please refer to the specific recommendations below.

Storage Resources

The National Data Cyberinfrastructure (NDC) provides a backed-up, reliable storage system (the /project space) that is mounted on the new Cedar and Graham compute systems. This is therefore a very attractive option as users can migrate once and then access their data as mounted filesystems on the new compute systems.

A few special notes:

  • If you have a RAC 2017 award:
    • you may have been allocated to Cedar or Graham, in which case you should choose the PROJECT storage attached to that system;
    • you may have been allocated to a continuing legacy resource; in this case you might elect to keep unused but important data in the PROJECT space (at either SFU or Waterloo), but you will want to copy active data to your RAC-awarded resource.
  • If you do not have a RAC 2017 award, then you may use Rapid Access Service (RAS) storage amounts at either SFU or Waterloo at your discretion. You may also use the default amounts on continuing legacy systems, especially if you are already running jobs on such systems.

Compute Resources

Both Cedar and Graham are now in production and running with RAC 2017 resource allocations and priorities.

Users on defunded systems have a few options:

  1. If you have a RAC 2017 award then you should move your jobs and storage to the awarded system(s).
    • This may include Graham, Cedar or continuing legacy systems.
  2. If you have long-term, infrequently accessed data then the data should be migrated to backed up PROJECT space on Graham or Cedar.
  3. If you do not have a RAC 2017 award then you should use the Compute Canada Rapid Access Service and move your jobs/storage to one of the new systems.
    • The Rapid Access Service reserves a certain amount of resources on both new and legacy systems for opportunistic use by users who do not have a resource allocation.

WestGrid Legacy Systems Available for 2017-18

Details on WestGrid legacy systems which will be available for 2017-18, including the storage quotas and intended purpose of the systems can be found HERE.

Users with RAC 2017 Awards

If you have a RAC 2017 allocation (award letters will be sent out mid-March) then it is advantageous to move to your 2017 allocated system as soon as possible.

Your 2016 RAC priority will continue on the legacy systems listed until your 2017 RAC award (if applicable) is implemented.

Limited Support for Continuing Legacy Systems

Generally these continuing, to-be-defunded legacy systems are not under comprehensive vendor support programs, due to the prohibitive cost for such old systems. Only critical components (interconnects and shared storage) are maintained. No new (or replacement) compute nodes will be added, which will likely result in a reduced number of available compute nodes and cores over time as components fail. We also expect the reliability of these resources to decline, with downtime increasing due to the growing need for maintenance and repairs. There may be significant outages.

Software Available on New Systems

Software lists have been compiled and installation scripts developed so that software could be (mostly) automatically installed on Cedar and Graham as soon as they became available. The software list is continuously updated: current list of prepared open-source software.
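As a quick check before (or after) migrating, you can ask the module system on Cedar or Graham whether a package you rely on is already installed. The commands below are a minimal sketch assuming the Lmod-based module system on the new national systems; the package name is only an example.

  # On a Cedar or Graham login node, list all versions of a package
  # known to the module system (the package name here is just an example):
  module spider gromacs

  # Load it once you know it is available:
  module load gromacs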

Some commercial software will be licensed nationally. This is still under discussion within Compute Canada. Please check back for updates.

Code and Job Migration

Please keep in mind that you may need to re-compile or re-install your software and any required packages on any legacy or new system you migrate to. Unfortunately the new systems are not identical to any of the legacy systems, so this step will be more involved for users migrating to the new systems, and it can be time-consuming. Our support staff have considerable expertise in such tasks, so please feel free to contact support@westgrid.ca for help.
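As a rough illustration of what re-installation can involve, the sketch below rebuilds a small MPI code on one of the new systems. The module names and the source/executable names are placeholders only; check the module system on the target machine for the actual compilers and versions.

  # Load a compiler and MPI stack on the target system
  # (module names are illustrative, not prescriptive).
  module load gcc openmpi

  # Rebuild the application from source against the new libraries.
  mpicc -O2 -o my_app my_app.c

  # For larger projects a full clean rebuild is usually safest, e.g.:
  #   make clean && make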

See Code and job migration from legacy systems for various details that should be kept in mind.

How to migrate data?

User Responsibility for Data Migration

We would like to emphasize that each user is responsible for copying their own data off the to-be-defunded systems. WestGrid has no way to track individual data transfers, so we cannot send out individual reminders, and the defunding dates are hard deadlines. You must copy your data to a new, reliable site well before the defunding date. WestGrid recommends starting at least 3 months before the defunding date.
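As one concrete example, a modest data set can be copied from a legacy login node with rsync over SSH. The user name, group directory and paths below are purely illustrative, and for large transfers Globus (described further down) is the recommended tool.

  # Run on the legacy system's login node; copy a directory to /project on Cedar.
  # (User name, group directory and paths are placeholders.)
  rsync -avP ~/results/ someuser@cedar.computecanada.ca:/project/def-mygroup/someuser/results/

  # Sanity-check the copy by comparing total sizes afterwards.
  du -sh ~/results
  ssh someuser@cedar.computecanada.ca "du -sh /project/def-mygroup/someuser/results"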

We strongly recommend users create backup copies in the new National Data Cyberinfrastructure now that it is available.

General File Management Best Practices and Suggestions

  1. Users must move their own data to a new system BEFORE the defunding dates.
  2. Delete any unnecessary data and files.
  3. Refrain from keeping multiple copies of your data on multiple systems.
  4. Move any remaining data that is not currently being used to a long-term storage site.
  5. Scratch storage is not backed up and has no long-term availability guarantees; inactive data in scratch storage areas is subject to being purged. We have noticed that some users appear to be storing important data on scratch, so please check your scratch storage and move anything important to permanent storage (see the sketch below for a quick way to take stock).
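The commands below are a minimal sketch of how to take stock before migrating: they show how much data each top-level directory holds and list scratch files that have not been modified in months. The scratch path is an example; substitute your system's actual scratch location.

  # How much data sits in each top-level directory of your home space?
  du -sh ~/* | sort -h

  # Which scratch files have not been modified in roughly the last six months?
  # (The path is an example; adjust it for your system.)
  find /global/scratch/$USER -type f -mtime +180 | head -n 50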

Recommendations for Large Numbers of Files

We have a few users with very large numbers of files (many thousands). Transferring that many individual files with the usual file-transfer utilities is notoriously inefficient, so we strongly recommend tarring up your files before transfer. See HERE for an (external) tutorial on how to compress and extract files using the tar command on Linux, or the much more detailed discussion and examples in the Compute Canada Wiki at archiving_and_compressing_files.
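For example, a directory containing many small files can be bundled into a single compressed archive before transfer and unpacked at the destination; the directory and archive names below are placeholders.

  # On the legacy system: bundle and compress a directory of many small files.
  tar -czvf myproject.tar.gz myproject/

  # Transfer the single archive (with Globus, rsync or scp), then on the new system:
  tar -xzvf myproject.tar.gz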

Globus for Data Transfer

Globus is the ideal tool as it significantly improves transfer performance and reduces the time spent managing transfers. Users have reported 10x or even 100x improvements over other transfer methods such as SCP.

Globus can be used to move data between any two WestGrid resources. All WestGrid resources are already configured as Globus endpoints.

CLICK HERE for more info on using Globus File Transfer. Refer to the Best practices for data migration page for detailed instructions on moving your data to a new system.
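For users who prefer the command line, the Globus CLI can drive the same transfers as the web interface. The sketch below is illustrative only: the endpoint IDs and paths are placeholders, and you should look up the real endpoints with the search command or in the Globus web interface.

  # Authenticate and find the endpoints (IDs below are placeholders).
  globus login
  globus endpoint search "computecanada"

  SRC="11111111-2222-3333-4444-555555555555"   # legacy system endpoint (placeholder)
  DST="66666666-7777-8888-9999-000000000000"   # Cedar or Graham endpoint (placeholder)

  # Recursively transfer a directory; Globus manages the transfer and retries failed files.
  globus transfer --recursive --label "legacy migration" \
      "$SRC:/home/someuser/results" "$DST:/project/def-mygroup/someuser/results"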

Orcinus (University of British Columbia)

Continuation to Mar.31/2018

The University of British Columbia agreed to continue operating Orcinus for the Compute Canada user community until March 31, 2018. This provides a lengthy overlap period with the new systems and gives users ample time to migrate their data and codes. WestGrid very much appreciates this offer, which incurs significant power and maintenance costs for the institution.

Please contact your local UBC IT services for future plans.

Storage

Orcinus has a small attached disk system providing /home and /globalscratch. Users should migrate /home to either Cedar or Graham. The Orcinus /globalscratch is not backed up and was intended to provide high-performance scratch space. However, we have noticed that some users keep what looks like important or permanent data on /globalscratch. Please migrate any such data to Cedar or Graham.

Migration

Orcinus is a general purpose system used for both serial and parallel computation. Any of the new general purpose systems would be suitable for typical Orcinus users.

Bugaboo (Simon Fraser University)

Continuation to Mar.31/2018

Simon Fraser University agreed to continue operating Bugaboo for the Compute Canada user community until March 31, 2018. This provides a lengthy overlap period with the new systems and gives users ample time to migrate their data and codes. WestGrid very much appreciates this offer, which incurs significant power and maintenance costs for the institution.

Please ask your local SFU IT services for future plans.

Storage

Bugaboo has large /home and /global/scratch filesystems. Users should migrate /home to either Cedar or Graham. The /global/scratch filesystem is not backed up and was intended to provide high-performance scratch space. However, we have noticed that some users keep what looks like important or permanent data on /global/scratch. Please migrate any such data to Cedar or Graham.

Migration

Bugaboo is a general purpose system used for both serial and parallel computation. In particular, Bugaboo has a large storage system and a few large-memory nodes. All of the new general purpose systems have much larger storage systems and also provide a mix of large-memory nodes, so any of them would be suitable for typical Bugaboo users. Since Bugaboo and Cedar are in the same datacentre, local data transfers will be faster, and many users may already have working relationships with the local support staff.

Grex (University of Manitoba)

Continuation to Mar.31/2018

The University of Manitoba agreed to continue operating Grex for the Compute Canada user community until March 31, 2018. This provides a lengthy overlap period with the new systems and gives users ample time to migrate their data and codes. WestGrid very much appreciates this offer, which incurs significant power and maintenance costs for the institution.

Please ask your local UManitoba IT services for future plans.

Storage

Grex has small local /home and /global/scratch filesystems. Users should migrate /home to either Cedar or Graham. The /global/scratch filesystem is not backed up and was intended to provide high-performance scratch space. However, we have noticed that some users keep what looks like important or permanent data on /global/scratch. Please migrate any such data to Cedar or Graham.

Migration

Grex is a general purpose system aimed primarily at parallel computation (it has a non-blocking InfiniBand interconnect) and also provides a number of large-memory nodes. The new general purpose systems are at least as performant, although they are aimed more at smaller jobs (less than 1,000 cores); this is amply sufficient for Grex users.
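For reference, a minimal Slurm submission script for a modest multi-node MPI job on Cedar or Graham might look like the sketch below; the account name, resource requests, module names and executable are placeholders to adapt to your own work.

  #!/bin/bash
  #SBATCH --account=def-mygroup      # placeholder allocation account
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=32       # core counts are illustrative
  #SBATCH --mem-per-cpu=2G
  #SBATCH --time=0-12:00             # 12 hours

  module load gcc openmpi            # module names are illustrative
  srun ./my_parallel_app             # srun starts the MPI ranks under Slurm

Submit the script with sbatch and check its status with squeue -u $USER.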

The new large parallel system Niagara will be available for the RAC 2018 year and would be an excellent alternative for those users with large parallel ambitions.

Parallel (University of Calgary)

Continuation to Mar.31/2018

The University of Calgary agreed to continue operating Parallel for the Compute Canada user community until March 31, 2018. This provides a lengthy overlap period with the new systems and gives users ample time to migrate their data and codes. WestGrid very much appreciates this offer, which incurs significant power and maintenance costs for the institution.

Please ask your local UCalgary IT services for future plans.

Storage

Parallel has small local /home and /global/scratch filesystems. Users should migrate /home to either Cedar or Graham. The /global/scratch filesystem is not backed up and was intended to provide high-performance scratch space. However, we have noticed that some users keep what looks like important or permanent data on /global/scratch. Please migrate any such data to Cedar or Graham.

Migration

Parallel is aimed at parallel computation and also provides GPU nodes. The new general purpose systems are at least as performant, although they are aimed more at smaller jobs (less than 1,000 cores); this is amply sufficient for most Parallel users. Note that jobs larger than 1,000 cores will still run, but they will communicate across interconnect islands, so some additional latency should be expected.

The new general purpose systems also provide a large number of modern GPU nodes (NVIDIA P100s). Parallel GPU users can use any of the general purpose systems.
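For Parallel GPU users, requesting a single GPU on the new systems amounts to adding a gres line to the Slurm script. The sketch below is illustrative only; the account, CPU/memory numbers and module name are placeholders.

  #!/bin/bash
  #SBATCH --account=def-mygroup      # placeholder allocation account
  #SBATCH --gres=gpu:1               # request one GPU on the node
  #SBATCH --cpus-per-task=6          # CPU and memory figures are illustrative
  #SBATCH --mem=32G
  #SBATCH --time=0-03:00

  module load cuda                   # module name is illustrative
  ./my_gpu_app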

The new large parallel system Niagara will be available for the RAC 2018 year and would be an excellent alternative for users with large parallel workloads. However, we do not expect Niagara to be available until April 2018, which is after Parallel is defunded, so users may unfortunately have to undertake a two-step migration:

  1. Migrate code and data from Parallel to /project on either Graham or Cedar before March 31, 2018.
  2. Migrate code and data from /project to Niagara when it becomes available. We expect users to migrate over the summer of 2018.

Jasper and Hungabee (University of Alberta)

Continuation to Sep.30/2017

The University of Alberta has agreed to continue operating Jasper and Hungabee for the Compute Canada user community until Sept 30, 2017. This provides a lengthy overlap period with the new systems and essentially gives users the summer to migrate their data and codes. WestGrid very much appreciates this offer which incurs significant costs to the institution for power and maintenance.

These machines were not allocated through the RAC 2017 process, so the RAC 2016 priorities will remain in use until defunding. After the defunding date, the UofA is responsible for its own resource allocation.

Continuation after Sep.30

UofA will be continuing to operate Hungabee and Jasper as local resources for institutional users. These are defined as those users who are members of the UofA, or who are on teams with a UofA Principal Investigator. If you are unsure of your relationship with the UofA please contact local support at research.support@ualberta.ca.

Important: Storage will be reformatted during the handover of the systems to the UofA, so all data will be deleted. All users (including UofA users) must therefore move important data to alternate resources.

Please contact research.support@ualberta.ca if you have questions.

Storage

Jasper and Hungabee share the same storage system which provides a single /home for both systems. Note that this storage system is not backed up. See the Jasper Quickstart Guide for details.

Hungabee also has a small direct-attached high-performance scratch area. It is not backed up. See the Hungabee Quickstart Guide for details.

WestGrid recommends that you copy your important data to whichever system you are moving your compute to (see below for recommendations).

The /project storage on both Graham and Cedar provides very large, backed-up storage as part of the National Data Cyberinfrastructure. Generally you should migrate your codes, scripts, etc. to /home, and important datasets to /project on the new systems.

Any remaining data MAY BE DELETED after September 30, 2017. Please contact support@westgrid.ca if you are not able to move your data before that date.

Compute

As noted above it is best in general to migrate to your allocated (RAC2017) resources; if you do not have a RAC2017, use the Rapid Access Service.

Jasper
Jasper is a relatively standard cluster and users can use Cedar and Graham. Various other WestGrid clusters will also continue to Mar.31/2018 and could be used. See WestGrid New Users Quickstart Guide - Choosing a System.
Hungabee
Hungabee is a special purpose, large shared-memory system which is not being replaced, so its users may need special consideration. Cedar and Graham include some large-memory nodes into which most users will be able to fit their jobs, but there may be some users with very large memory requirements (> 3 TB). Other regions in Compute Canada may be able to offer large-memory nodes immediately; for example, the Centre for Advanced Computing (CAC) in Ontario has some 2 TB nodes available.
If you have very large shared memory requirements then please contact support@westgrid.ca.
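As a sketch, a large shared-memory job on the new systems is requested simply by asking Slurm for the memory (and threads) it needs; the scheduler then places it on a node large enough to satisfy the request. All numbers and names below are illustrative.

  #!/bin/bash
  #SBATCH --account=def-mygroup      # placeholder allocation account
  #SBATCH --ntasks=1                 # one process, many threads (shared memory)
  #SBATCH --cpus-per-task=32         # thread count is illustrative
  #SBATCH --mem=1500G                # only the large-memory nodes can satisfy this
  #SBATCH --time=1-00:00

  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  ./my_threaded_app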

Limited Warranty from January 1, 2017

The University of Alberta would like to emphasize that as of January 1, 2017, hardware support for these systems has been limited to critical components only. The compute nodes on Hungabee are not under warranty (however, the storage will remain under vendor support).

Nestor and Hermes (University of Victoria) - COMPLETE

July 30 update: Defunding has been completed

Nestor and Hermes have been defunded (July 30, 2017) and are no longer operated by WestGrid. For current services to UVic users please contact UVic's support services.

Storage

Through special arrangements with the University of Victoria, both Nestor and Hermes storage will be kept in operation until the new NDC is fully available. The new systems are now available, so we expect Nestor/Hermes storage to be available until July 30, 2017. WestGrid very much appreciates this offer which incurs significant costs to the institution for power and maintenance.

  • The Nestor/Hermes login nodes will remain in service so that users can log in and access their data.

We recommend that you move your data (and compute) to your preferred/allocated resource as soon as such resources are available. The NDC is particularly suitable for long-term storage or for users moving to Cedar or Graham. And as usual we recommend that you get started as early as possible.

  • As soon as the NDC becomes available users should copy their data over.
  • If possible we also recommend that users copy their data to their own resources.
  • If you are also using other RAS (default) resources then you can copy data to those resources.
  • If your data requirements are large and you do not have alternate resources, please contact support@westgrid.ca.

Compute

As of the defunding date (June 1, 2017), users are no longer able to submit jobs to Nestor or Hermes. RAC 2017 holders should migrate to their allocated system(s), and other users can use the Rapid Access Service (previously known as a "Default" allocation) for opportunistic use on any suitable system.

Hermes
Hermes is aimed at single-node, shared memory jobs. The new systems Cedar and Graham are suitable options. WestGrid legacy systems like Bugaboo or Orcinus would work well. It is worth noting that Hermes is now virtual! It is currently running on unused cloud resources.
Nestor
Nestor is aimed at multi-node distributed jobs. The new systems have considerably higher performance and are suitable options.
If you have a RAC 2017 award on the new systems, Cedar and Graham, those systems are now in production and you can begin using them. The large parallel system Niagara is expected for the RAC 2018 year and would also be suitable.
If you have a RAC 2017 award on legacy systems, you can migrate to that legacy system immediately, however, your new allocation priority will only take effect when the RAC 2017 allocations are implemented (late June).
For non-RAC holders, suitable WestGrid legacy systems include Parallel, Grex and Orcinus.

Future Use

The University of Victoria is keeping Nestor running beyond June 1, 2017 for UVic users, but at its own cost and discretion. Users are encouraged to contact sysadmin@uvic.ca directly for further information.

Breezy and Lattice (University of Calgary) - COMPLETE

Completed Aug.31/2017

Breezy and Lattice have been defunded and are no longer operated by WestGrid.

Storage

The storage system is shared between the two systems and will be accessible from Breezy and Lattice until August 31, 2017. WestGrid thanks the University of Calgary for continuing its operation of these machines! As Parallel is scheduled for defunding next year (Mar.31/2018), its storage system will remain in operation. Parallel users may access the shared storage by logging in to Parallel.

The Globus Data Transfer Node (DTN) will be continued and available to Breezy and Lattice users until August 31, 2017. Please refer to the WestGrid Globus documentation for details.

Compute

Breezy
Breezy will continue in normal operation until August 31, 2017.
Breezy has not been allocated for RAC 2017. The RAC 2016 priorities will continue to be used.
RAC 2017 holders should migrate to their allocated system(s), and other users can use Compute Canada's Rapid Access Service (previously known as the "Default" allocation) on any suitable system.
Breezy is a special purpose system with large memory nodes. Cedar and Graham include some large-memory nodes which most users will be able to fit their jobs into. The Centre for Advanced Computing (CAC) in Ontario has some 2 TB nodes available.
If you have very large shared memory requirements then please contact support@westgrid.ca.
Lattice
Lattice will continue in normal operation until August 31, 2017.
Lattice has not been allocated for RAC 2017. The RAC 2016 priorities will continue to be used.
RAC 2017 holders should migrate to their allocated system(s), and other users can use Compute Canada's Rapid Access Service (previously known as the "Default" allocation) on any suitable system.
Lattice is aimed at multi-node distributed jobs. Cedar and Graham would be suitable replacements, and later on in the year the large parallel system Niagara should be available. For non-RAC holders suitable WestGrid legacy systems include Parallel, Grex and Orcinus.

Future Use

The University of Calgary may choose to keep Breezy and Lattice running beyond August 31, 2017, but at its own cost and discretion. This may result in usage limited to a certain group of users (e.g. only those from the University of Calgary). Users are encouraged to contact support@hpc.ucalgary.ca directly for further information.

Support & Other Links

Please email support@westgrid.ca for general help.

See also WestGrid Institutional Support Contacts for help with systems that will be continued by their institutions.

Check out the following links for more tools, tips and support related to migration: