aws-samples/sample-dlc-mcp-server
AWS Deep Learning Containers MCP Server
A comprehensive Model Context Protocol (MCP) server for AWS Deep Learning Containers (DLC) that provides end-to-end support for machine learning workflows. This server offers six core service modules to help you build, deploy, upgrade, and optimize your DLC-based ML infrastructure.
Quick Start Guide
Installation Steps
1. Prerequisites:
- Create an AWS instance profile with the following policies, and attach it when creating the EC2 instance in the next step (a CLI sketch follows this list).
- AmazonECS_FullAccess Policy
- AmazonEC2ContainerRegistryFullAccess
- An EC2 instance with a DLC image is recommended.
- Launch an Amazon Elastic Compute Cloud (Amazon EC2) instance (CPU or GPU), preferably from a Deep Learning Base AMI. Other AMIs work but require you to install the relevant GPU drivers. If you prefer to work with a local Docker Desktop setup on your machine, you can skip the EC2-related steps.
- AWS CLI
- Python 3.11 or later
- Install uv (pip install uv) to run the MCP server locally
- Docker (included in the DLC AMI)
- Connect to your instance by using SSH. For more information about connections, see Troubleshooting Connecting to Your Instance in the Amazon EC2 User Guide.
- Install and configure the AWS Q CLI by following this guide
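For reference, the instance profile above can be created with the AWS CLI along these lines. This is a sketch: the role and profile names are placeholders, and trust-policy.json is assumed to be a trust policy that lets ec2.amazonaws.com assume the role.
# trust-policy.json must allow ec2.amazonaws.com to assume the role
aws iam create-role --role-name dlc-mcp-role \
  --assume-role-policy-document file://trust-policy.json
# Attach the two managed policies listed above
aws iam attach-role-policy --role-name dlc-mcp-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess
aws iam attach-role-policy --role-name dlc-mcp-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess
# Wrap the role in an instance profile that EC2 can use
aws iam create-instance-profile --instance-profile-name dlc-mcp-profile
aws iam add-role-to-instance-profile --instance-profile-name dlc-mcp-profile \
  --role-name dlc-mcp-role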
2. Configure DLC MCP Server
# Clone the repo
git clone https://github.com/aws-samples/sample-dlc-mcp-server.git
cd sample-dlc-mcp-server
# Build and install MCP server
python3 -m pip install -e .
# Verify that the server starts
dlc-mcp-server
# Update ~/.aws/amazonq/mcp.json, replacing the placeholder with the
# path to your local sample-dlc-mcp-server directory
cat > ~/.aws/amazonq/mcp.json << 'EOF'
{
  "mcpServers": {
    "dlc-mcp-server": {
      "command": "uv",
      "args": [
        "--directory",
        "<<Update-directory-path>>/sample-dlc-mcp-server",
        "run",
        "dlc-mcp-server"
      ],
      "env": {},
      "timeout": 120000
    }
  }
}
EOF
3. AWS Credentials and Configuration
- Configure your AWS credentials using one of these methods:
aws configure
aws configure set aws_session_token <<token>>
- Other authentication methods:
- Configuration Files: AWS CLI Configuration and Credential Files
- Authentication Methods: AWS CLI User Authentication
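For example, the environment-variable method looks like the following (values are placeholders):
# Standard AWS credential environment variables
export AWS_ACCESS_KEY_ID=<<access-key-id>>
export AWS_SECRET_ACCESS_KEY=<<secret-access-key>>
export AWS_SESSION_TOKEN=<<token>>   # only needed for temporary credentials
export AWS_DEFAULT_REGION=us-west-2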
Basic Usage Examples
List Available DLC Images
# List all available DLC images
q chat "List available DLC images"
# Filter by framework
q chat "Show me PyTorch training images"
# Filter by specific criteria
q chat "List PyTorch images with CUDA 12.1 and Python 3.10"
Build Custom Images
# Create a custom PyTorch image
q chat "Create a custom PyTorch image with scikit-learn and pandas"
# Build image with specific packages
q chat "Build a TensorFlow image with OpenCV and matplotlib"
Deploy to AWS Services
# Deploy to SageMaker
q chat "Deploy my custom image to SageMaker for inference"
# Deploy to ECS
q chat "Deploy to ECS cluster with 2 CPU and 4GB memory"
Upgrade Images
# Upgrade framework version
q chat "Upgrade my PyTorch image from 1.13 to 2.0"
# Analyze upgrade path
q chat "What's needed to upgrade my TensorFlow 2.10 image to 2.13?"
4. Configuration
The server can be configured using environment variables:
# Enable write operations (required for building and deployment)
export ALLOW_WRITE=true
# Enable access to sensitive data (for detailed logs and resource info)
export ALLOW_SENSITIVE_DATA=true
# Configure server port
export FASTMCP_PORT=8080
# Configure logging
export FASTMCP_LOG_LEVEL=INFO
export FASTMCP_LOG_FILE=/path/to/logfile.log
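Putting these together, a one-off local launch with overrides might look like this (a sketch using the dlc-mcp-server entry point installed earlier):
# Inline variables apply only to this invocation
ALLOW_WRITE=true FASTMCP_PORT=8080 FASTMCP_LOG_LEVEL=DEBUG dlc-mcp-server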
Example Prompts
Here are some example prompts you can use with the AWS DLC MCP Server:
Building Custom Images
1. "Please add the latest version of DeepSeek model to my DLC"
2. "Use latest 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training as base image and can you build a custom image that has image recognition model"
3. "Create a custom pytorch image with text recognition model"
4. "Please create a container from the latest DLC version"
Managing and Upgrading Containers
5. "Can you list available DLC images"
6. "Please update the container with the nightly version of PyTorch"
7. "Please update DLC with the latest version of CUDA"
8. "Please add Nemo toolkit (https://github.com/NVIDIA/NeMo) to my container"
Advanced Customizations
9. "Add NVIDIA NeMo Framework to my existing PyTorch DLC image"
10. "Optimize my DLC image for inference workloads on GPU instances"
11. "Create a multi-stage build for my custom DLC with minimal runtime footprint"
Deployment and Troubleshooting
12. "Deploy my custom image to SageMaker for real-time inference"
13. "Help me troubleshoot CUDA out of memory errors in my training job"
14. "What are the security best practices for deploying DLC images in production?"
15. "How can I optimize costs when running DLC containers on AWS?"
Performance and Best Practices
16. "Get performance optimization tips for PyTorch training on GPU"
17. "What are the framework-specific best practices for TensorFlow inference?"
18. "Show me deployment best practices for EKS"
19. "How do I create maintainable custom DLC images?"
20. "Check compatibility between PyTorch 1.13 and 2.0"
Getting Started Instructions
Step 1: Environment Setup
- Install Prerequisites: Ensure you have AWS Q CLI, Docker, and proper AWS credentials configured
- Install the MCP Server: From the cloned repository, run python3 -m pip install -e . (see step 2 of the installation steps)
- Configure Environment Variables: Set ALLOW_WRITE=true for building/deployment operations
Step 2: Basic Operations
- Check Configuration: Start with q chat "Check my AWS configuration"
- Explore Available Images: Use q chat "List available DLC images"
- Authenticate with ECR: Run q chat "Setup ECR authentication"
Step 3: Choose Your Workflow
- For Custom Image Building: Follow Workflow 1 below
- For Existing Image Upgrades: Follow Workflow 2 below
- For Troubleshooting: Follow Workflow 3 below
- For Distributed Training: Follow Workflow 4 below
Step 4: Deploy and Monitor
- Deploy to Your Platform: Choose from EC2, SageMaker, ECS, or EKS
- Monitor Status: Check deployment status and endpoint health
- Optimize Performance: Apply best practices and performance tips
Workflows and Usage Patterns
The AWS DLC MCP Server supports several common ML workflows:
Workflow 1: Building and Deploying Custom DLC Images
1. Discover Base Images
q chat "List available PyTorch base images for training"
2. Create Custom Dockerfile
q chat "Create a custom PyTorch image with transformers, datasets, and wandb"
3. Build Custom Image
q chat "Build the custom image and push to ECR repository 'my-pytorch-training'"
4. Deploy to AWS Service
q chat "Deploy my custom image to SageMaker for training with ml.p3.2xlarge instance"
Workflow 2: Upgrading Existing DLC Images
1. Analyze Current Image
q chat "Analyze upgrade path from my current PyTorch 1.13 image to PyTorch 2.0"
2. Generate Upgrade Dockerfile
q chat "Generate upgrade Dockerfile preserving my custom packages"
3. Perform Upgrade
q chat "Upgrade my DLC image to PyTorch 2.0 while keeping custom configurations"
Workflow 3: Troubleshooting and Optimization
1. Diagnose Issues
q chat "Help me troubleshoot 'CUDA out of memory' error in my training job"
2. Get Performance Tips
q chat "Get performance optimization tips for PyTorch training on GPU"
3. Apply Best Practices
q chat "What are the security best practices for my DLC deployment?"
Workflow 4: Distributed Training Setup
1. Configure Environment
q chat "Setup distributed training for 4 nodes with 8 GPUs each using PyTorch"
2. Run Training Container
q chat "Run my custom training container with GPU support"
Core Services
The AWS DLC MCP Server provides six comprehensive service modules:
1. Container Management Service (containers.py)
Purpose: Core container operations and DLC image management
Key Features:
- Image Discovery: List and filter available DLC images by framework, Python version, CUDA version, and repository type
- Container Runtime: Run DLC containers locally with GPU support
- Distributed Training Setup: Configure multi-node distributed training environments
- AWS Integration: Automatic ECR authentication and AWS configuration validation
- Environment Setup: Check GPU availability and Docker configuration
Available Tools:
- check_aws_config - Validate AWS CLI configuration
- setup_ecr_prod - Authenticate with ECR production account
- list_dlc_repos - List available DLC repositories
- list_dlc_images - List and filter DLC images with advanced filtering
- run_dlc_container - Run containers locally with GPU support
- setup_distributed_training - Configure distributed training setups
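For orientation, the manual equivalents of setup_ecr_prod and run_dlc_container are roughly the following; the region and image tag are illustrative, and 763104351884 is the DLC production registry referenced elsewhere in this README.
# Authenticate Docker with the DLC production registry
aws ecr get-login-password --region us-west-2 | docker login \
  --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
# Run a DLC image locally with all GPUs attached
# (requires the NVIDIA container toolkit, included in the DLC AMI)
docker run --gpus all -it \
  763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:latest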
2. Image Building Service (image_building.py)
Purpose: Create and customize DLC images for specific ML workloads
Key Features:
- Base Image Selection: Browse available DLC base images by framework and use case
- Custom Dockerfile Generation: Create optimized Dockerfiles with custom packages and configurations
- Image Building: Build custom DLC images locally or push to ECR
- Package Management: Install system packages, Python packages, and custom dependencies
- Environment Configuration: Set environment variables and custom commands
Available Tools:
- list_base_images - Browse available DLC base images
- create_custom_dockerfile - Generate custom Dockerfiles
- build_custom_dlc_image - Build and optionally push custom images to ECR
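As a sketch of what a generated Dockerfile might contain (the base-image tag and package list are illustrative, echoing the scikit-learn/pandas prompt example above):
# Write a minimal custom Dockerfile on top of a DLC base image
cat > Dockerfile << 'EOF'
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:latest
RUN pip install --no-cache-dir scikit-learn pandas
EOF
# Build and tag the custom image locally
docker build -t my-custom-pytorch:latest .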
3. Deployment Service (deployment.py)
Purpose: Deploy DLC images across AWS compute platforms
Key Features:
- Multi-Platform Deployment: Support for EC2, SageMaker, ECS, and EKS
- SageMaker Integration: Create models and endpoints for inference
- Container Orchestration: Deploy to ECS clusters and EKS clusters
- EC2 Deployment: Launch EC2 instances with DLC images
- Status Monitoring: Check deployment status and endpoint health
Available Tools:
- deploy_to_sagemaker - Deploy to SageMaker for training/inference
- deploy_to_ecs - Deploy to Amazon ECS clusters
- deploy_to_ec2 - Launch EC2 instances with DLC images
- deploy_to_eks - Deploy to Amazon EKS clusters
- get_sagemaker_endpoint_status - Monitor SageMaker endpoint status
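Behind a prompt like "Deploy my custom image to SageMaker for inference", the steps correspond roughly to these AWS CLI calls. All names, the role ARN, and the instance type below are placeholders.
# Register the container image as a SageMaker model
aws sagemaker create-model --model-name my-dlc-model \
  --primary-container Image=<<ecr-image-uri>> \
  --execution-role-arn <<sagemaker-execution-role-arn>>
# Define how the endpoint should be provisioned
aws sagemaker create-endpoint-config --endpoint-config-name my-dlc-config \
  --production-variants VariantName=primary,ModelName=my-dlc-model,InstanceType=ml.m5.xlarge,InitialInstanceCount=1
# Create the endpoint, then poll its status (cf. get_sagemaker_endpoint_status)
aws sagemaker create-endpoint --endpoint-name my-dlc-endpoint \
  --endpoint-config-name my-dlc-config
aws sagemaker describe-endpoint --endpoint-name my-dlc-endpoint \
  --query EndpointStatus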
4. Upgrade Service (upgrade.py)
Purpose: Upgrade and migrate DLC images to newer framework versions
Key Features:
- Upgrade Path Analysis: Analyze compatibility between current and target framework versions
- Migration Planning: Generate upgrade strategies with compatibility warnings
- Dockerfile Generation: Create upgrade Dockerfiles that preserve customizations
- Version Migration: Upgrade PyTorch, TensorFlow, and other frameworks
- Custom File Preservation: Maintain custom files and configurations during upgrades
Available Tools:
- analyze_upgrade_path - Analyze upgrade compatibility and requirements
- generate_upgrade_dockerfile - Create Dockerfiles for version upgrades
- upgrade_dlc_image - Perform complete image upgrades with customization preservation
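A hand-written sketch of the kind of upgrade Dockerfile this service aims to generate; the base-image tag, package list, and paths are illustrative, not the tool's actual output.
cat > Dockerfile.upgrade << 'EOF'
# Start from the newer-framework DLC base image
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310
# Re-install the custom packages carried over from the old image
RUN pip install --no-cache-dir scikit-learn pandas
# Preserve custom files and configurations from the old image
COPY custom/ /opt/ml/custom/
EOF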
5. Troubleshooting Service (troubleshooting.py)
Purpose: Diagnose and resolve DLC-related issues
Key Features:
- Error Diagnosis: Analyze error messages and provide specific solutions
- Framework Compatibility: Check version compatibility and requirements
- Performance Optimization: Get framework-specific performance tuning tips
- Common Issues: Database of solutions for frequent DLC problems
- Environment Validation: Verify system requirements and configurations
Available Tools:
- diagnose_common_issues - Analyze errors and provide solutions
- get_framework_compatibility_info - Check framework version compatibility
- get_performance_optimization_tips - Get performance tuning recommendations
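As a concrete starting point for the "CUDA out of memory" example used throughout this README, a manual first pass might look like this; PYTORCH_CUDA_ALLOC_CONF is a standard PyTorch allocator setting, and the value shown is just one common choice.
# Inspect current GPU memory usage and the processes holding it
nvidia-smi
# Reduce allocator fragmentation before resorting to a smaller batch size
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128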
6. Best Practices Service (best_practices.py)
Purpose: Provide expert guidance for optimal DLC usage
Key Features:
- Security Guidelines: Comprehensive security best practices for DLC deployments
- Cost Optimization: Strategies to reduce costs while maintaining performance
- Deployment Patterns: Platform-specific deployment recommendations
- Framework Guidance: Framework-specific best practices and optimizations
- Custom Image Guidelines: Best practices for creating maintainable custom images
Available Tools:
- get_security_best_practices - Security recommendations and guidelines
- get_cost_optimization_tips - Cost reduction strategies
- get_deployment_best_practices - Platform-specific deployment guidance
- get_framework_specific_best_practices - Framework optimization recommendations
- get_custom_image_guidelines - Custom image creation best practices
Configuration
The server can be configured using environment variables:
- FASTMCP_PORT: Port to run the server on (default: 8080)
- FASTMCP_LOG_LEVEL: Logging level (default: INFO)
- FASTMCP_LOG_FILE: Path to log file (optional)
- ALLOW_WRITE: Enable write operations (default: false)
- ALLOW_SENSITIVE_DATA: Enable access to sensitive data (default: false)
Supported Frameworks and Tools
The MCP server supports building and managing containers with:
- Deep Learning Frameworks: PyTorch, TensorFlow, MXNet, Hugging Face Transformers
- NVIDIA Tools: CUDA, cuDNN, NeMo Toolkit, TensorRT
- Popular Models: DeepSeek, LLaMA, BERT, ResNet, and custom models
- Specialized Libraries: Computer Vision, NLP, Speech Recognition, and more
Complete Tool Reference
Container Management Tools
- check_aws_config - Validate AWS CLI configuration
- setup_ecr_prod - Authenticate with ECR production account (763104351884)
- list_dlc_repos - List available DLC repositories with filtering
- list_dlc_images - List and filter DLC images by framework, Python version, CUDA version
- run_dlc_container - Run containers locally with GPU support
- setup_distributed_training - Configure multi-node distributed training
Image Building Tools
- list_base_images - Browse available DLC base images by framework and use case
- create_custom_dockerfile - Generate custom Dockerfiles with packages and configurations
- build_custom_dlc_image - Build and optionally push custom images to ECR
Deployment Tools
- deploy_to_sagemaker - Deploy to SageMaker for training/inference
- deploy_to_ecs - Deploy to Amazon ECS clusters
- deploy_to_ec2 - Launch EC2 instances with DLC images
- deploy_to_eks - Deploy to Amazon EKS clusters
- get_sagemaker_endpoint_status - Monitor SageMaker endpoint status
Upgrade Tools
- analyze_upgrade_path - Analyze upgrade compatibility and requirements
- generate_upgrade_dockerfile - Create Dockerfiles for version upgrades
- upgrade_dlc_image - Perform complete image upgrades with customization preservation
Troubleshooting Tools
- diagnose_common_issues - Analyze errors and provide solutions
- get_framework_compatibility_info - Check framework version compatibility
- get_performance_optimization_tips - Get performance tuning recommendations
Best Practices Tools
- get_security_best_practices - Security recommendations and guidelines
- get_cost_optimization_tips - Cost reduction strategies
- get_deployment_best_practices - Platform-specific deployment guidance
- get_framework_specific_best_practices - Framework optimization recommendations
- get_custom_image_guidelines - Custom image creation best practices
Development
See the repository's development documentation for instructions.
Disclaimer
The intent of this project is to share a sample DLC MCP server that demonstrates how Amazon Q and an MCP server can streamline DLC maintenance. This project is not suited for direct production use.
License
This library is licensed under the MIT-0 License.