[jan 21 merge] feat(blog): blog post--lessons learned on publish security #538
base: main
Conversation
    python-version: '3.9'

    # Step 3: Cache dependencies
    - name: Cache dependencies
This is not tested and could be wrong.
This is one of the uses, but oftentimes building is performed by `pypa/build`, so it'd be just `pip install build` and not multiple direct deps as one might expect from a `requirements.txt` file. So the example seems a bit unrealistic, except for some untypical projects with surprising build automation required.
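For context, a typical publish build often needs only the build frontend itself; a hedged sketch (step names are illustrative, not taken from the post):

```yaml
# Sketch: the only build-time dependency installed is pypa/build itself.
- name: Install the build frontend
  run: python -m pip install build

- name: Build sdist and wheel
  run: python -m build   # writes the distributions into ./dist
```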
oh, right, good point. What is a good example of caching dependencies (other than what we describe with Ultralytics)? I use caching when i'm publishing.
we `pip install hatch` and use `hatch build`
What do you think about removing the example and just telling people that if they do cache dependencies for any reason, don't do it when the build is for publishing?
for instance, doing it for a website build could be ok if it's a preview, etc. vs publishing a package?
hey @webknjaz @hugovk do you have a bit of time to review this post for accuracy? i'm trying to provide a more beginner-friendly overview of what we learned that our community can quickly understand and implement. We use the workflow @webknjaz helped us create for the pyosmeta package. And we want to link to all of the pypi / psf posts that are relevant too. many thanks for your time!
Here are some quick notes. I might've missed something.
Let's also ask @woodruffw for feedback.
## Is your PyPI publication workflow secure?

The recent Python package breach [involving Ultralytics](https://blog.pypi.org/posts/2024-12-11-ultralytics-attack-analysis/) has highlighted the need for secure PyPI publishing workflows. Hackers exploited a GitHub workflow (`pull_request_target`) to inject malicious code, which was published to PyPI. Users who downloaded the package unknowingly allowed their machines to be hijacked for Bitcoin mining.
> a GitHub workflow (`pull_request_target`)

`pull_request_target` is not a workflow, it's one of the events that may trigger a workflow run. A workflow may “subscribe” to different events like `push` and `pull_request`; this is just another event type with elevated privileges in its security context.
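To make the distinction concrete, here is a hedged sketch (not from the post) of the `on:` block where a workflow subscribes to events:

```yaml
# Sketch: a workflow subscribes to events in its `on:` block;
# `pull_request_target` is just one such event, not a workflow.
on:
  push:                 # runs in the context of the pushed branch
  pull_request:         # fork PRs get a read-only, unprivileged context
  pull_request_target:  # runs in the base repo's privileged context --
                        # risky when combined with untrusted PR code
```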
> unknowingly allowed their machines to be hijacked for Bitcoin mining.

Not only that, but they also did that in Google Colab, which got their accounts banned pretty swiftly...
OOOOH no way. i missed the Google Colab part!! that's a pretty bold move for a hacker? maybe they were new hackers?
See googlecolab/colabtools#4979.
It was unwitting end users who installed the hacked package on their Colab accounts, and the Google automation then suspended their accounts:
> Connection failed
> This account has been blocked from accessing Colab runtimes due to suspected abusive activity. This does not impact access to other Google products. If you believe this has been done in error, review the usage limitations and file an appeal.
Google was pretty quick in restoring accounts, but it happened to a lot of accounts: googlecolab/colabtools#4979 (comment)
geez. it's amazing how quick and far-reaching this single incident was!
### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.
- ♻️ Don’t cache dependencies in your publish workflows to prevent tampering.
That's probably an oversimplification. Not caching the deps may as well lead to something like this if the transitive deps are unpinned and get updated in an uncontrolled manner. The problem was that the cache was poisoned in a job with a security context that had high privileges, and then another job picked it up, effectively reusing what that first one “pinned” in the installed deps.
But if there's no cache, a compromised transitive dependency can be downloaded from PyPI, resulting in the same problem.
Security exists in context and not in isolation. It should come from a place of understanding how different bits fit together.
That said, I wouldn't make a call of whether using caching or not will save us all. The problem was that it was possible to compromise the cache that ended up being used.
I think it's possible to use cache securely by having separate cache keys unique to the dist building job. So caches for building the dists and testing them would not leak into each other. That's the real fix for cache.
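A hedged sketch of that idea (action version and key naming are illustrative): give the dist-building job a cache key that no other job writes to.

```yaml
# Sketch: a cache key scoped to the dist-building job, so caches written by
# test jobs (which may run less-trusted code) are never restored here.
- name: Cache pip downloads (build job only)
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: build-dists-${{ runner.os }}-${{ hashFiles('pyproject.toml') }}
```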
Another side of this is the supply chain problem, which cache ended up being a part of in this instance. But as I mentioned earlier, the supply chain could be compromised differently, without cache being a factor. The solution to this is pinning your dependencies. This means recording every single version number for every single transitive dependency there is. One of the low-level tools to use for this is `pip-tools`, in the context of `pip install` (until we have PEP 751, that is). Non-pip installers have different mechanisms. With pip, you can use `pip install -r requirements.in -c constraints.txt`, where `requirements.in` would contain a direct and unrestricted dependency list, and `constraints.txt` would set the entire tree to exact versions. A complete solution is usually integration of different aspects through multiple tools (or a "one-tool-to-rule-them-all" which some projects aim to become). I've been taking this to extremes with "DIY lock files" in https://github.com/webknjaz/ep2024-multilock-toy-template / https://github.com/ansible/awx-plugins/tree/2c2b22c/dependencies/lock-files.

Additionally, there's `--require-hashes`, which also records every single dist file hash, not just the version (this is not compatible with `-c`, though, due to a bug in pip).

To summarize, the solution here is to be in control of your deps, make sure their updates are transparent and controllable on all levels, and to know what influences them… Out of context, disallowing caching is a band-aid at best. But it may be harmful in that it would cultivate a false sense of security.
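As a hedged illustration of the pinning approach described above (file names follow the comment; a tool like `pip-compile` from pip-tools would generate the constraints file):

```yaml
# Sketch: direct deps come from requirements.in, while constraints.txt pins
# the entire transitive tree to exact versions.
# constraints.txt would typically be generated ahead of time, e.g. with:
#   pip-compile --output-file=constraints.txt requirements.in
- name: Install pinned dependencies
  run: python -m pip install -r requirements.in -c constraints.txt
```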
ok. This is super helpful. I could remove the caching part altogether, given so many moving pieces, and caching seems more nuanced than just locking down the workflow itself with better triggers and tighter permissions. But I also have a question about the dependencies. Because it is my understanding and experience that pinning all deps can also be highly problematic for users. For instance, if you have an environment with many different packages and try installing a package with pinned deps, that could lead to significant user issues.
Unless we're just talking about deps in the workflow itself, such as hatch, pypa/build, etc.
web apps make sense to pin but i can't quite wrap my head around pinning deps for a package. Could you please help me understand? And do you agree - maybe for more beginner users we just drop caching altogether as a topic?
### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.
- ♻️ Don’t cache dependencies in your publish workflows to prevent tampering.
- If you reference branches in a pull request, clean or sanitize branch names in your workflow.
Perhaps, link https://docs.github.com/en/get-started/using-git/dealing-with-special-characters-in-branch-and-tag-names and recommend that people use https://woodruffw.github.io/zizmor/.
And maybe https://securitylab.github.com/resources/github-actions-untrusted-input/.
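For the branch-name item, the pattern those resources describe boils down to never interpolating untrusted values directly into a script; a hedged sketch:

```yaml
# Sketch: pass the attacker-controllable branch name through an environment
# variable instead of expanding ${{ github.head_ref }} inside the script,
# where a crafted name could inject shell commands.
- name: Report the source branch safely
  env:
    HEAD_REF: ${{ github.head_ref }}
  run: echo "Building from branch: $HEAD_REF"
```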
Great suggestions!! I may add these links lower down in the post at the end, so the takeaways are just text... but I'll definitely include them. This is what I have now:
- If you reference branches in a pull request, clean or sanitize branch names in your workflow.
- Consider using zizmor to check your GitHub workflows for security vulnerabilities!
i haven't used zizmor yet but i saw a package that Hugo maintains uses it so i may try it out with stravalib and pyosmeta!!
### Lock down GitHub repo access
- 🔒 Restrict repository access to essential maintainers only.
- ✅ Add automated checks to ensure releases are authorized and secure.
I think this is mostly about setting an environment in the job and making sure that environment has required reviewers in the repo settings.
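Something like this, as a hedged sketch (the environment name is illustrative); the required-reviewers rule itself is configured on that environment in the repo settings:

```yaml
# Sketch: binding the publish job to a protected environment makes the run
# wait for approval from the environment's required reviewers.
jobs:
  publish:
    runs-on: ubuntu-latest
    environment: pypi   # protected environment with required reviewers
```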
Looking at the blog posts, it seemed like the hackers somehow got direct access to that repo. Did i read that wrong?
* [<i class="fa-brands fa-discourse"></i> Discourse](https://pyopensci.discourse.group/)
* [<i class="fa-brands fa-mastodon"></i> Mastodon](https://fosstodon.org/@pyopensci)
* [<i class="fa-solid fa-cloud"></i> Bluesky](https://bsky.app/profile/pyopensci.bsky.social)
Pro tip: consider updating the Bluesky username to `pyopensci.org`. This can be configured and verified in like 3 minutes (through either DNS or a `.well-known` document on the website), but adds credibility to the account. You can create a repository in the org, call it `.well-known`, and add a file with the account DID as follows: webknjaz/.well-known@5a3c310. Just don't forget to enable Pages in the repo settings before clicking `Verify`.
WOAH!! i never knew about this for bluesky!
thank you for this tip - and for all of the thoughtful feedback. i'm going to work on this some more tomorrow!
You can see how to set up GitHub Actions securely in our own [PyPI publishing GitHub workflow](https://github.com/pyOpenSci/pyosMeta/blob/main/.github/workflows/publish-pypi.yml), which follows the Trusted Publisher approach.

**Note:** Trusted Publisher workflows are currently only available for GitHub. Support for GitLab may be coming in the future—stay tuned!
In the doc you already linked, there's an example for GitLab as well: https://docs.pypi.org/trusted-publishers/adding-a-publisher/#gitlab-cicd. There are 4 platforms that support this mechanism already, and I saw some Bitbucket folks asking around whether they can integrate.
fixed the breakout and added that link!! I somehow read that GitLab support wasn't available yet, but obviously, I wasn't looking in the right place! thanks!
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
NOTES: i'm doing a big rewrite (more of a reorg) of this content. I'll start with the high-level concepts of securing the environment / GitHub and GitHub-PyPI, and then the other things like sanitizing branch names and such can come next. i'm super appreciative of all of the feedback.
Ok i've made significant changes to the post and added a new figure that tries to show trusted publisher as a concept with the 2-step ci action part. i've also added links throughout to pypa / pypi documentation. AND I've tested out zizmor!! it's great. i haven't tried it as a pre-commit yet but plan to. Also @webknjaz that Bluesky pro tip was 🔥 !! I set that up quickly, and we now have a pyopensci.org domain. I added a nod to you at the top of the post for providing so much input!!
## Don’t cache package dependencies in your publish step

Caching dependencies involves storing dependencies to be reused in future workflow runs. This approach saves time, as GitHub doesn’t need to redownload all dependencies each time the workflow runs.
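For readers who haven't seen it in a workflow file, dependency caching is often a single line on the setup step; a hedged sketch (versions are illustrative):

```yaml
# Sketch: setup-python can persist pip's download cache between runs --
# convenient for test workflows, but the post recommends against it in
# publish workflows.
- uses: actions/setup-python@v5
  with:
    python-version: '3.9'
    cache: 'pip'
```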
This is where I'm confused about pinning deps, as that doesn't seem like a good idea for users when I think about environments and dependency conflict nightmares.
For deploy workflows or anything creating release artifacts end users might use, I think I've heard from @sethmlarson that it makes sense to reduce the number of dependencies you're using where possible: you can't have problems with a hacked dependency you're not using.
That should also reduce the conflict nightmares when pinning things...
Thanks Hugo! This helps clarify what I was struggling with! I'll rewrite this sentence to discuss the concept more broadly! I see that seth also gave this a thumbs up!
Using a release-based workflow ensures that your publishing step is tightly controlled. A pull request will never accidentally trigger a publish build. This reduces your risk!
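A hedged sketch of what a release-based trigger looks like (the event and type come from GitHub's documented set):

```yaml
# Sketch: the publish workflow fires only when a release is published,
# never on pushes or pull requests.
on:
  release:
    types: [published]
```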
Add: a few sentences on deleting old tokens and secrets containing PyPI tokens... this is easy for people to do and good for them to know about.
Awesome blog post, thanks Leah! I left some comments:
Now that you have a GitHub environment setup, you can set up Trusted Publisher in your PyPI account.
A Trusted Publisher setup creates a short-lived secure link between PyPI and your repository.
Maybe remove "short-lived" from this sentence? The underlying implementation of Trusted Publishers uses short-lived credentials, but the link itself is durable!
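For reference, a hedged sketch of a Trusted Publisher publish job (environment name illustrative; assumes the built dists were placed in `./dist` by an earlier step). The OIDC credential the job requests is what's short-lived, while the publisher configuration on PyPI persists:

```yaml
# Sketch: no long-lived PyPI API token is stored anywhere; the job requests
# a short-lived OIDC token that PyPI exchanges for upload permission.
jobs:
  publish:
    runs-on: ubuntu-latest
    environment: pypi        # protected environment, illustrative name
    permissions:
      id-token: write        # required for Trusted Publishing (OIDC)
    steps:
      - name: Publish distributions to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
```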
fixed!
Named as a playful nod to Dr. Zizmor’s famous “clear skin” ads, Zizmor aims to give you “beautiful clean workflows.”

Learn more about Zizmor on the [official blog post by Yossarian](https://blog.yossarian.net/2024/10/27/Now-you-can-have-beautiful-clean-workflows).
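If it helps readers, running zizmor is a short step; a hedged sketch assuming the PyPI-distributed CLI (it can also be installed via pipx or cargo):

```yaml
# Sketch: audit the repository's workflow files with zizmor in CI.
- name: Audit GitHub Actions workflows with zizmor
  run: |
    python -m pip install zizmor
    zizmor .github/workflows/
```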
Yossarian -> William Woodruff
@sethmlarson thank you so much for the review!! I appreciate it! 🚀
@all-contributors please add @sethmlarson for review
Looking good, mostly nitpicks.
And good to see @sethmlarson here, I was going to summon him!
## 4. Use Trusted Publisher for PyPI

If you only [publish locally to PyPI using the command line](https://www.pyopensci.org/python-package-guide/tutorials/publish-pypi.html), you need to use a PyPI token. However, if you’re using GitHub Actions to automate your publishing process, setting up **Trusted Publisher** is a more secure option.
TP is more secure in that the tokens are short-lived compared to perpetual API tokens.
And it seems like the hackers got hold of forgotten API tokens in the repo to be able to make the second pair of releases in this case.
I love seeing comments by @sethmlarson, @hugovk and @webknjaz. Thank you for being part of the pyOpenSci community and sharing your expertise. 🐍 🐍
Co-authored-by: Hugo van Kemenade <[email protected]>
@all-contributors please add @hugovk for code, review
I've put up a pull request to add @hugovk! 🎉
This is the start of a draft post on best practices for publishing to PyPI. it should be super duper user-friendly and easy to read. I will need feedback!!
i will be working on some graphics and such next.