As an essential component of the AI computing engine of Alibaba Cloud Platform for AI (PAI), Lingjun resources are designed for large-scale and high-density computing. Lingjun resources provide heterogeneous computing power tailored for high-performance AI training and computing. You can use Lingjun resources in Data Science Workshop (DSW), Deep Learning Containers (DLC), and Elastic Algorithm Service (EAS) to facilitate AI development, training, and service deployment. This topic describes how to create a resource group and purchase Lingjun resources.
Overview
Lingjun resource
Lingjun resources are the new-generation intelligent computing resources developed by Alibaba Cloud that provide the following features:
High-speed Remote Direct Memory Access (RDMA) network architecture
High-performance communication library
High-performance acceleration software
Technical solution for GPU virtualization
Lingjun resources can meet your requirements for high-performance computing.
Lingjun resource group
PAI provides fully managed Lingjun resources that you can purchase and use in resource groups in the PAI console. If you purchase Lingjun hardware resources, you can add the resources to the PAI console as semi-managed resources and use them to run training jobs.
Limits
Supported regions
Lingjun resources are available only in the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions.
Supported users
Only users on the whitelist can use Lingjun resources. To use Lingjun resources to run training jobs, submit a ticket to request to be added to the whitelist.
Supported job types
Lingjun resources support only the following training job types: TensorFlow, PyTorch, ElasticBatch, MPIJob, Slurm, and Ray.
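The region and job type limits above can be expressed as a simple check to run before you submit a job. The following is a minimal sketch, not part of any PAI SDK. The region IDs are an assumption mapped from the display names listed in this topic, and `check_limits` is a hypothetical helper name.

```python
# Region IDs are an assumption: this topic lists display names only.
SUPPORTED_REGIONS = {
    "cn-wulanchabu": "China (Ulanqab)",
    "ap-southeast-1": "Singapore",
    "cn-shenzhen": "China (Shenzhen)",
    "cn-beijing": "China (Beijing)",
    "cn-shanghai": "China (Shanghai)",
    "cn-hangzhou": "China (Hangzhou)",
}

# Job types exactly as listed in this topic.
SUPPORTED_JOB_TYPES = {"TensorFlow", "PyTorch", "ElasticBatch", "MPIJob", "Slurm", "Ray"}


def check_limits(region_id, job_type):
    """Return a list of problems; an empty list means both limits are met."""
    problems = []
    if region_id not in SUPPORTED_REGIONS:
        problems.append(f"Region {region_id!r} does not offer Lingjun resources.")
    if job_type not in SUPPORTED_JOB_TYPES:
        problems.append(f"Job type {job_type!r} is not supported on Lingjun resources.")
    return problems
```

For example, `check_limits("cn-hangzhou", "PyTorch")` returns an empty list, while an unlisted region or job type produces one message each. The whitelist requirement cannot be checked this way and still requires a ticket.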
Account and permission requirements
Alibaba Cloud account: You can use an Alibaba Cloud account to perform all operations without additional authorization.
RAM user: Ask the owner of the Alibaba Cloud account to grant the RAM user the permissions to manage the resource pool, or to attach the AliyunPAIFullAccess policy to the RAM user. For more information, see the "Permissions to manage the resource pool" section in the Custom policies for RAM users topic.
Important: The AliyunPAIFullAccess policy provides permissions to manage all resources and features of PAI. Exercise caution when you grant these permissions.
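If you script the authorization instead of using the RAM console, the request can be sketched as follows. This is an illustration under assumptions: it assumes the RAM `AttachPolicyToUser` operation with its `PolicyType`, `PolicyName`, and `UserName` parameters, which this topic does not mention. The helper `build_attach_policy_request` and the user name `pai-trainer` are hypothetical. The sketch only builds the parameters and does not send a request.

```python
def build_attach_policy_request(user_name, policy_name="AliyunPAIFullAccess",
                                policy_type="System"):
    """Build parameters for a RAM AttachPolicyToUser call (nothing is sent here).

    policy_type is "System" for policies maintained by Alibaba Cloud and
    "Custom" for policies that you define yourself.
    """
    if not user_name:
        raise ValueError("user_name must not be empty")
    if policy_type not in ("System", "Custom"):
        raise ValueError("policy_type must be 'System' or 'Custom'")
    return {
        "Action": "AttachPolicyToUser",
        "PolicyType": policy_type,
        "PolicyName": policy_name,
        "UserName": user_name,
    }


# Hypothetical RAM user name, for illustration only.
request = build_attach_policy_request("pai-trainer")
```

Because AliyunPAIFullAccess covers all of PAI, a custom policy limited to resource pool management is the narrower choice when the RAM user only needs to manage Lingjun resource groups.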
Dependencies
Lingjun resources depend on the following Alibaba Cloud services. To create, purchase, and use Lingjun resources, familiarize yourself with and activate these Alibaba Cloud services and prepare resources based on your business requirements.
VPC (required)
When you allocate Lingjun resources, you must associate the resources with a virtual private cloud (VPC) in the same region and configure a vSwitch and a security group. This ensures the network connectivity between the Lingjun resources and other Alibaba Cloud services.
Internet NAT gateway and EIP (optional)
Your Lingjun resources may need to access the Internet. For example, they may need to pull custom images from the Internet. In this case, you must configure an Internet NAT gateway with SNAT enabled and associate an elastic IP address (EIP) with the Internet NAT gateway.
For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
OSS, NAS, and CPFS (optional)
To submit DLC training jobs to Lingjun resources, you must first create datasets. Lingjun resources support only Object Storage Service (OSS), File Storage NAS (NAS), and Cloud Parallel File Storage (CPFS) datasets. For more information, see the Prepare a dataset section of the "General process" topic.
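The dependency rules in this section can be summarized as a preflight checklist. The following is a minimal sketch under assumptions: `LingjunSetup` and `preflight` are hypothetical names used to describe a planned setup, not PAI API objects, and the checks only restate the requirements above.

```python
from dataclasses import dataclass, field

# Dataset types that this topic lists as supported.
SUPPORTED_DATASET_TYPES = {"OSS", "NAS", "CPFS"}


@dataclass
class LingjunSetup:
    """Hypothetical description of a planned setup; not a PAI API object."""
    resource_region: str
    vpc_region: str
    vswitch_id: str = ""
    security_group_id: str = ""
    needs_internet: bool = False
    nat_snat_enabled: bool = False
    eip_associated: bool = False
    dataset_types: list = field(default_factory=list)


def preflight(setup):
    """Return the unmet requirements as a list of messages."""
    issues = []
    # VPC (required): same region, plus a vSwitch and a security group.
    if setup.vpc_region != setup.resource_region:
        issues.append("The VPC must be in the same region as the Lingjun resources.")
    if not setup.vswitch_id:
        issues.append("A vSwitch is required.")
    if not setup.security_group_id:
        issues.append("A security group is required.")
    # Internet NAT gateway and EIP: needed only for Internet access.
    if setup.needs_internet:
        if not setup.nat_snat_enabled:
            issues.append("Internet access needs an Internet NAT gateway with SNAT enabled.")
        if not setup.eip_associated:
            issues.append("Internet access needs an EIP associated with the NAT gateway.")
    # Datasets: only OSS, NAS, and CPFS are supported.
    for dataset_type in setup.dataset_types:
        if dataset_type not in SUPPORTED_DATASET_TYPES:
            issues.append(f"Dataset type {dataset_type!r} is not supported.")
    return issues
```

A setup that pulls custom images from the Internet but has no SNAT-enabled NAT gateway or EIP would yield two messages, which mirrors the optional dependency described above.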
Procedure
Create a Lingjun resource group
Go to the Resource Pool page in the PAI console.
On the Intelligent Computing Lingjun resources tab, click Create Resource Group.
In the Create Resource Group dialog box, configure the following parameters and click OK.
Type: Select Dedicated Resource Group.
Resource Group Name: Enter a resource group name based on the naming rule.
Purchase Lingjun resources
To purchase Lingjun resources for a dedicated resource group, perform the following steps. For more information about the specifications and billing of Lingjun resources, see Billing of Lingjun resources (Serverless Edition).
On the Intelligent Computing Lingjun resources tab, click Create Order in the Actions column.
On the buy page, the system automatically fills in Region and Resource Group ID. Configure the remaining parameters, such as Node Specification, Amount, and Duration, and then click Buy Now.
Possible issues and their solutions:
The order does not contain information for the current resource group.
Cause: You switched to another region, so the resource group ID in the order cannot be matched to a resource group in the current region.
Solution: Switch to the region where the resource group resides.
The specified instance type is out of stock in zone.
Cause: The selected node specification type is out of stock in the region.
Solution: Select another node specification type.
The current kind of instance is temporarily not supported. Please choose another kind of ecs to purchase.
Cause: The selected node specification type is not supported in the region.
Solution: Select another node specification type.
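The issues above can be kept as a small lookup table, for example in an internal runbook script. The following is a minimal sketch. `KNOWN_ISSUES` and `suggest_fix` are hypothetical names, and matching on the start of the message is an assumption about how the buy page reports errors.

```python
# (message prefix, documented solution) pairs taken from the list above.
KNOWN_ISSUES = [
    ("The order does not contain information for the current resource group",
     "Switch to the region where the resource group resides."),
    ("The specified instance type is out of stock in zone",
     "Select another node specification type."),
    ("The current kind of instance is temporarily not supported",
     "Select another node specification type."),
]


def suggest_fix(error_message):
    """Return the documented solution for a known error, or None if unknown."""
    normalized = error_message.strip().lower()
    for prefix, solution in KNOWN_ISSUES:
        if normalized.startswith(prefix.lower()):
            return solution
    return None
```

Messages that are not in the list return `None`, which signals that the error is not covered by this topic.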
After you complete the payment, the purchased Lingjun resources are displayed on the Orders tab of the resource group details page.
The system automatically splits the order by the node instances that you purchased, so that you can manage each node separately, for example to renew or unsubscribe from it.
References
After you create a resource group and purchase computing resources, you can perform the following operations:
On the resource group details page, view the basic information about the resource group and manage the purchased resources. For more information, see the Manage resources section of the "Overview" topic.
Allocate the purchased resources to specific training jobs by configuring resource quotas. For more information, see Lingjun resource quotas.