30 Jun CodeArtifact: Storing your dependencies
CodeArtifact is a recently announced, fully managed AWS service for storing your build artifacts and code dependencies. It plugs a gap in the Amazon cloud-native developer toolkit that I first identified last December as I was packing my bags for re:Invent.
This blog will take a detailed look at the service, so please read on if you would like to know more.
During my career I have developed applications and solutions in a number of languages, including Java, Node.js and Python. All of these (and lots of other languages too) allow you to load dependencies (or libraries) from repositories: Java has Maven and Gradle, Node.js has npm, and Python has pip. Dependencies are really important because they extend the core capabilities of a language, and most developers reading this blog will have used at least one. Some examples are Spring (Java), Express (Node.js) and Flask (Python), and we must not forget the AWS SDK. These dependencies are made available from public repositories such as Maven Central, the npm registry, and PyPI.
In addition to open source libraries, most professional development teams will also create their own libraries. Be warned that there is a whole debate on how to structure these things. My advice is to make each library as small as possible and have it do one thing really well, because one massive library can create a monolith, even in a distributed solution. The issue with private libraries is that you do not want to give away your IPR (intellectual property rights) by publishing them to a public repository. Instead, you need a private repository, hosted centrally in your organisation, that allows you to share them.
The two private repositories I have used are Sonatype Nexus and JFrog Artifactory. Before I joined Inawisdom I spent a lot of time deploying Sonatype Nexus on AWS. I created a beautiful CloudFormation script that created an ALB and EC2 instances with Nexus deployed on them, implemented routine backups of the libraries to S3, set up authentication with LDAP to provide SSO, and sent logs to CloudWatch Logs. Every now and then I would have to purge old versions of dependencies from the repo to free up space, or allocate more EBS storage. Occasionally I would perform maintenance to update the version of Nexus, at which point we would incur an outage. All very painful, and the definition of undifferentiated heavy lifting. This meant that a fully managed private repository from AWS was high on my wish list.
Therefore, let’s give CodeArtifact a spin and see what it can do. I will focus on Python with pip and twine, but CodeArtifact also supports Maven, Gradle, and npm.
Creating a Repository and Domain
First, in the AWS Console go to CodeArtifact and start by creating a repository, giving it a name. I recommend adding an upstream. An upstream is a proxy to a public repository, which allows you to pull both public and private dependencies seamlessly from the same place.
Next you have to create a new, or select an existing, domain. Domains are where your private libraries are actually stored and are useful in organisations; you can create your domain in a central shared services account and then a repository in each development account. This means each development team can pull dependencies published from other development teams when using different accounts.
To complete the creation you need to review and approve it:
Once created you will see something similar to this:
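If you would rather script these console steps, the same setup can be sketched with boto3. This is a minimal sketch, not the only way to do it: the function name and the upstream repository name `pypi-store` are my own choices for illustration, and it assumes that neither the domain nor the repositories exist yet. The client is passed in so that the function can be tried against a stub before it is pointed at a real account.

```python
def create_repo_with_upstream(client, domain, repository, upstream="pypi-store"):
    """Create a domain, a PyPI-connected upstream repository and a private repository.

    `client` is expected to be a boto3 CodeArtifact client, for example
    boto3.client("codeartifact", region_name="eu-west-1").
    The default upstream name is a hypothetical choice, not a requirement.
    """
    # The domain is the shared store that holds the package assets.
    client.create_domain(domain=domain)

    # Upstream repository that proxies the public PyPI index.
    client.create_repository(domain=domain, repository=upstream)
    client.associate_external_connection(
        domain=domain,
        repository=upstream,
        externalConnection="public:pypi",
    )

    # Private repository that falls back to the upstream for public packages.
    client.create_repository(
        domain=domain,
        repository=repository,
        upstreams=[{"repositoryName": upstream}],
    )
```

With real credentials, calling `create_repo_with_upstream(client, "philipbasford", "PrivatePyPi")` should give a layout similar to the one created in the console above.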
Pulling a Public Dependency
So, let’s see if the repository and that upstream works by trying to pull pandas using pip:
(py3.7-env) phil@Phils-MBP % aws codeartifact login --tool pip --repository PrivatePyPi --domain philipbasford --domain-owner XXXXXXXXXX --profile pbasford-sandbox
Successfully logged in to codeartifact for pip.
(py3.7-env) phil@Phils-MBP lib % pip install pandas
Looking in indexes: https://aws:****@philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/simple/
Collecting pandas
  Downloading https://philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/simple/pandas/1.0.4/pandas-1.0.4-cp37-cp37m-macosx_10_9_x86_64.whl (10.0 MB)
     |████████████████████████████████| 10.0 MB 1.4 MB/s
Collecting python-dateutil>=2.6.1
  Downloading https://philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/simple/python-dateutil/2.8.1/python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
     |████████████████████████████████| 227 kB 1.9 MB/s
Collecting numpy>=1.13.3
  Downloading https://philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/simple/numpy/1.18.5/numpy-1.18.5-cp37-cp37m-macosx_10_9_x86_64.whl (15.1 MB)
     |████████████████████████████████| 15.1 MB 872 kB/s
Collecting pytz>=2017.2
  Downloading https://philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/simple/pytz/2020.1/pytz-2020.1-py2.py3-none-any.whl (510 kB)
     |████████████████████████████████| 510 kB 2.1 MB/s
Requirement already satisfied: six>=1.5 in /Users/phil/.local/share/virtualenvs/py3.7-env-xOcoyF5C/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)
Installing collected packages: python-dateutil, numpy, pytz, pandas
Successfully installed numpy-1.18.5 pandas-1.0.4 python-dateutil-2.8.1 pytz-2020.1
Note two things here:
- You have to log in first, and AWS kindly provides an example of how to do this in the console. However, I had to add a profile to the login command because I use a switch role in a multi-account setup.
- Also, the index URL https://aws:****@philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/simple/ confirms that the public library was pulled through CodeArtifact.
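To make the login step less magical, here is a rough sketch of what it amounts to for pip: request a temporary token, then point pip at an index URL with `aws` as the user name and the token as the password. The URL layout is copied from the output above. The function names and the one-hour token lifetime are my own assumptions, and `client` is expected to be a boto3 CodeArtifact client.

```python
def pip_index_url(domain, owner, region, repository, token):
    """Build the authenticated index URL that pip is pointed at."""
    host = f"{domain}-{owner}.d.codeartifact.{region}.amazonaws.com"
    # The user name is always "aws"; the password is the temporary token.
    return f"https://aws:{token}@{host}/pypi/{repository}/simple/"


def fetch_token(client, domain, owner, duration_seconds=3600):
    """Request a temporary authorization token from CodeArtifact.

    The one-hour default is an arbitrary choice made for this sketch.
    """
    response = client.get_authorization_token(
        domain=domain,
        domainOwner=owner,
        durationSeconds=duration_seconds,
    )
    return response["authorizationToken"]
```

The resulting URL can be passed to `pip install --index-url`, which is handy in a build script where you do not want to change the global pip configuration.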
Publish a Private Dependency
As we are using Python, we will need to use twine to push our private dependency. You will have to create and run your setup.py and then publish the result:
(py3.7-env) phil@Phils-MBP % aws codeartifact login --tool twine --repository PrivatePyPi --domain philipbasford --domain-owner XXXXXXXXXX --profile pbasford-sandbox
(py3.7-env) phil@Phils-MBP % twine upload --repository codeartifact dist/ecommon-lib-0.0.1.tar.gz
Uploading distributions to https://philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/
Uploading ecommon-lib-0.0.1.tar.gz
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.42k/8.42k [00:00<00:00, 9.57kB/s]
(py3.7-env) phil@Phils-MBP CapacityMicroService %
Note two things here:
- Again, you have to log in first, because the twine login is separate from the pip login.
- Also, like last time, the upload URL https://philipbasford-XXXXXXXXXX.d.codeartifact.eu-west-1.amazonaws.com/pypi/PrivatePyPi/ confirms that the private library was pushed to CodeArtifact.
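The `--repository codeartifact` argument works because twine looks that name up in a `.pypirc` file, which, as far as I can tell, is what the twine login writes for you. The sketch below renders an equivalent file by hand, which can be useful on a build server. The function name is hypothetical, and the exact file the AWS CLI produces may differ in detail.

```python
import configparser
import io


def render_pypirc(repository_url, token, section="codeartifact"):
    """Render .pypirc content so that `twine upload --repository <section>` works."""
    # RawConfigParser avoids interpolation surprises if a value contains '%'.
    config = configparser.RawConfigParser()
    config["distutils"] = {"index-servers": section}
    config[section] = {
        "repository": repository_url,
        "username": "aws",  # fixed user name; the token acts as the password
        "password": token,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

Writing the returned text to `~/.pypirc` would overwrite any existing twine configuration, so treat it as a starting point rather than a drop-in replacement for the login command.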
Finally, by refreshing and digging into the console, we can confirm the private library is uploaded and shows us the details:
Nice to haves:
The following are some features I would like to be added to CodeArtifact that I identified from my initial exploring:
- CodeBuild and CodeArtifact are integrated very well and are fully functional together. However, it would be good to be able to declare, in the ‘artifacts’ section of the build spec, that CodeBuild should push a distribution to CodeArtifact.
- CodeArtifact has a free tier every month, after which you are charged for storage and for requests made (plus data transfer if it leaves the cloud). However, with lots of versions constantly being released by open source and private projects, storage costs will mount up.
- Luckily, AWS allows you to delete libraries or versions of libraries, but I suspect this might become unmanageable at scale. I therefore think it would be great if you could set a TTL (time-to-live), at least on open source libraries, that expires when a library has not been used for x days. Once the TTL is reached, the library would be deleted or moved to infrequent-access storage on S3.
- Governance is provided with EventBridge and CloudTrail. However, I think governance could be improved further with a workflow for approving dependencies and versions before they can be pulled.
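Until something like the TTL idea exists, a rough version can be scripted. The sketch below is only the selection logic: given the publish time of each version, it returns the versions older than the TTL while always keeping the newest ones. It uses publish time as a stand-in for "last used", which is my assumption, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def versions_to_expire(published, ttl_days, keep_latest=1, now=None):
    """Return versions older than ttl_days, always keeping the newest keep_latest.

    `published` maps a version string to its timezone-aware publish time.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=ttl_days)
    # Newest first, so the first keep_latest entries are protected.
    newest_first = sorted(published, key=published.get, reverse=True)
    candidates = newest_first[keep_latest:]
    return [version for version in candidates if published[version] < cutoff]
```

The returned list could then be passed to the boto3 `delete_package_versions` call for the package in question, ideally after a dry run that only prints what would be removed.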
CodeArtifact removes the undifferentiated heavy lifting of running your own private repository and for that it is very good and trouble free. However, there are two other wider implications of CodeArtifact:
- Is the NAT now dead? At Inawisdom our RAMP product has a NAT instance to allow SageMaker notebooks to pull dependencies from public repositories using pip. I suspect many AWS networks are like this. However, CodeArtifact removes the need for this NAT if you deploy a VPC endpoint for CodeArtifact. This significantly improves your security posture and provides governance.
- CodeArtifact will dramatically simplify the creation of Lambda layers for your serverless applications. You will be able to use your private dependency in your IDE and then include it cleanly in a layer at build time using pip, instead of using some complex build process or a direct Git dependency.
This is all good, so if you are a developer on AWS I encourage you to adopt CodeArtifact today. If you’re an architect, I challenge you to simplify your architectures and remove those NATs. Happy coding!
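For the NAT point, CodeArtifact is reached through two interface endpoints, one for the API and one for the repositories. The sketch below creates both with a boto3 EC2 client. The function names are my own, DNS and endpoint policy settings are left out, and you would most likely also need an S3 gateway endpoint, because the package files themselves are served from S3.

```python
def codeartifact_endpoint_services(region):
    """Return the interface endpoint service names for CodeArtifact in a region."""
    return [
        f"com.amazonaws.{region}.codeartifact.api",
        f"com.amazonaws.{region}.codeartifact.repositories",
    ]


def create_codeartifact_endpoints(ec2, vpc_id, subnet_ids, security_group_ids, region):
    """Create both interface endpoints using a boto3 EC2 client passed in as `ec2`."""
    for service_name in codeartifact_endpoint_services(region):
        ec2.create_vpc_endpoint(
            VpcEndpointType="Interface",
            VpcId=vpc_id,
            ServiceName=service_name,
            SubnetIds=subnet_ids,
            SecurityGroupIds=security_group_ids,
        )
```

The security groups passed in need to allow HTTPS from the notebooks or build hosts that will run pip.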