Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jan 21 merge] feat(blog): blog post--lessons learned on publish security #538

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

lwasser
Copy link
Member

@lwasser lwasser commented Dec 14, 2024

This is the start of a draft post on publishing to pypi best practices. it should be super duper user friendly and easy to read. I will need feedback!!

i will be working on some graphics and such next.


python-version: '3.9'

# Step 3: Cache dependencies
- name: Cache dependencies
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not tested and could be wrong.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the uses but oftentimes, building is performed by pypa/build so it'd be just pip install build and not multiple direct deps as one might expect from a requirements.txt file. So the example seems a bit unrealistic, except for some untypical projects with surprising build automation required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, good point. What is a good example of caching dependencies (other than what we say with ultranalytics)? I use caching when i'm publishing.

we pip install hatch and use hatch build

What do you think about removing the example and just telling people that if they do cache dependencies for any reason, don't do it when the build is for publishing?

for instance, doing it for a website build could be ok if it's a preview, etc. vs publishing a package?

@lwasser lwasser marked this pull request as ready for review December 17, 2024 18:51
@lwasser
Copy link
Member Author

lwasser commented Dec 17, 2024

hey @webknjaz @hugovk do you have a bit of time to review this post for accuracy? i'm trying to provide a more beginner-friendly overview of what we learned that our community can quickly understand and implement.

We use the the workflow @webknjaz helped us create for the pyosmeta package. And we want to link to all of the pypi / psf posts that are relevant too. many thanks for your time!

Copy link

@webknjaz webknjaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some quick notes. I might've missed something.

Let's also ask @woodruffw for feedback.


## Is your PyPI publication workflow secure?

The recent Python package breach [involving Ultralytics](https://blog.pypi.org/posts/2024-12-11-ultralytics-attack-analysis/) has highlighted the need for secure PyPI publishing workflows. Hackers exploited a GitHub workflow (`pull_request_target`) to inject malicious code, which was published to PyPI. Users who downloaded the package unknowingly allowed their machines to be hijacked for Bitcoin mining.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a GitHub workflow (pull_request_target)

pull_request_target is not a workflow, it's one of the events that may trigger a workflow run. A workflow may “subscribe” to different events like push and pull_request, this is just another event type with elevated privileges in its security context.

unknowingly allowed their machines to be hijacked for Bitcoin mining.

Not only that, but they also did that in Google Colab which got their accounts banned pretty swiftly...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOOOH no way. i missed the Google Colab part!! that's a pretty bold move for a hacker? maybe they were new hackers ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See googlecolab/colabtools#4979.

It was unwitting end users who installed the hacked package on their Colab accounts, and the Google automation then suspended their accounts:

Connection failed
This account has been blocked from accessing Colab runtimes due to suspected abusive activity. This does not impact access to other Google products. If you believe this has been done in error, review the usage limitations and file an appeal.

Google was pretty quick in restoring accounts, but it happened to a lot of accounts: googlecolab/colabtools#4979 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geez. it's amazing how quick and far-reaching this single incident was!

_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved

### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.
- ♻️ Don’t cache dependencies in your publish workflows to prevent tampering.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably an oversimplification. Not caching the deps may as well lead to something like this if the transitive deps are unpinned and get updated in uncontrolled manner. The problem was that the cache was poisoned in a job with a security context that had high privileges and then, another job picked it up, effectively reusing what that first one “pinned” in the installed deps.

But if there's no cache, a compromised transitive dependency can be downloaded from PyPI, resulting in the same problem.

Security exists in context and not in isolation. It should come from a place of understanding how different bits fit together.

That said, I wouldn't make a call of whether using caching or not will save us all. The problem was that it was possible to compromise the cache that ended up being used.

I think it's possible to use cache securely by having separate cache keys unique to the dist building job. So caches for building the dists and testing them would not leak into each other. That's the real fix for cache.

Another side of this is the supply chain problem, which cache ended up being a part of in this instance. But as I mentioned earlier, the supply chain could be compromised differently, without cache being a factor. The solution to this is pinning your dependencies. This means recording every single version number for every single transitive dependency there is. One of the low-level tools to use for this is pip-tools, in the context of pip install (until we have PEP 751, that is). Non-pip installers have different mechanisms. With pip, you can use pip install -r requirements.in -c constraints.txt, where requirements.in would contain a direct and unrestricted dependency list, and constraints.txt would set the entire tree to exact versions. A complete solution is usually integration of different aspects through multiple tools (or a "one-tool-to-rule-them-all" which some projects aim to become). I've been taking this to extremes with "DIY lock files" in https://github.com/webknjaz/ep2024-multilock-toy-template / https://github.com/ansible/awx-plugins/tree/2c2b22c/dependencies/lock-files.
Additionally, there's --require-hashes which also records every single dist file hash, not just version (this is not compatible with -c, though, due to a bug in pip).

To summarize, the solution here is to be in control of your deps, make sure their updates are transparent and controllable on all the levels, and to know what influences them… Out of the context, disallowing caching is a band-aid at best. But it may be harmful in that it would cultivate a false sense of security.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. This is super helpful. I could remove the caching part altogether, given so many moving pieces and caching seems more nuanced then just locking down the workflow itself with better triggers and tighter permissions. But I also have a question about the dependencies. Because it is my understanding and experience that pinning all deps can also be highly problematic for users. For instance, if you have an environment with many different packages and try installing a package with pinned dips, that could lead to significant user issues.

Unless we're just talking about dips in the workflow itself, such as hatch, pypa/build, etc.

web apps make sense to pin but i can't quite wrap my head around pinning deps for a package. Could you please help me understand? And do you agree - maybe for more beginner users we just drop caching altogether as a topic?

### 🔐 Secure your workflows 🔐
- 🚫 Avoid risky events like `pull_request_target` and adopt release-based workflows.
- ♻️ Don’t cache dependencies in your publish workflows to prevent tampering.
- If you reference branches in a pull request, clean or sanitize branch names in your workflow.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestions!! I may add these links lower down in the post at the end, so the takeaways are just text.. but I'll definitely include them. This is what I have now:

  • If you reference branches in a pull request, clean or sanitize branch names in your workflow.
  • Consider using zizmor to check your GitHub workflows for security vulnerabilities!

i haven't used zizmor yet but i saw a package that Hugo maintains uses it so i may try it out with stravalib and pyosmeta!!


### Lock down GitHub repo access
- 🔒 Restrict repository access to essential maintainers only.
- ✅ Add automated checks to ensure releases are authorized and secure.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is mostly about setting an environment in the job and making sure that environment has required reviewers in the repo settings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the blog posts, it seemed like the hackers somehow got direct access to that repo. Did i read that wrong?


* [<i class="fa-brands fa-discourse"></i> Discourse](https://pyopensci.discourse.group/)
* [<i class="fa-brands fa-mastodon"></i> Mastodon](https://fosstodon.org/@pyopensci)
* [<i class="fa-solid fa-cloud"></i> Bluesky](https://bsky.app/profile/pyopensci.bsky.social)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tip

Pro tip: consider updating the Bluesky username to pyopensci.org. This can be configured and verified in like 3 minutes (through either DNS or a .well-known document on the website, but adds credibility to the account). You can create a repository in the org, call in .well-known, and add a file with the account DID as follows: webknjaz/.well-known@5a3c310. Just don't forget to enable Pages in the repo settings before clicking Verify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WOAH!! i never knew about this for bluesky!.
thank you for this tip - and for all of the thoughtful feedback. i'm going to work on this some more tomorrow!

_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved

You can see how to set up GitHub Actions securely in our own [PyPI publishing GitHub workflow](https://github.com/pyOpenSci/pyosMeta/blob/main/.github/workflows/publish-pypi.yml), which follows the Trusted Publisher approach.

**Note:** Trusted Publisher workflows are currently only available for GitHub. Support for GitLab may be coming in the future—stay tuned!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the doc you already linked, there's an example for GitLab as well: https://docs.pypi.org/trusted-publishers/adding-a-publisher/#gitlab-cicd. There are 4 platforms that support this mechanism already, and I saw some Bitbucket folks asking around whether they can integrate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the breakout and added that link!! I somehow read that GitLab support wasn't available yet, but obviously, I wasn't looking in the right place! thanks!

_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
lwasser and others added 5 commits December 18, 2024 18:15
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
Co-authored-by: 🇺🇦 Sviatoslav Sydorenko (Святослав Сидоренко) <[email protected]>
@lwasser
Copy link
Member Author

lwasser commented Dec 19, 2024

NOTES: i'm doing a big rewrite (more of a reorg) of this content. I'll start with the high-level concepts of securing the environment / GitHub and GitHub-PyPI, and then the other things like sanitizing branch names and such can come next.
This gives users 2 key takeaways (plus adding Zizmor as a third!) to start with now rather than 10 things, which is a lot! so I'd avoid more reviews until I push these edits given the structure will change (to hopefully be more clear and more accurate!) 🚀

i'm super appreciative of all of the feedback.

@lwasser lwasser changed the title [DRAFT] feat: draft of lessons learned on publish security feat: draft of lessons learned on publish security Dec 19, 2024
@lwasser lwasser changed the title feat: draft of lessons learned on publish security feat(blog): blog post--lessons learned on publish security Dec 19, 2024
@lwasser
Copy link
Member Author

lwasser commented Dec 20, 2024

Ok i've made significant changes to the post and added a new figure that tries to show trusted publisher as a concept with the 2 step ci action part.

i've also added links throughout to pypa / pypi documentation. AND I've tested out gizmo!! it's great. i haven't tried as a pre-commit yet but plan to.

trusted-publisher-pypi-github

Also @webknjaz that Bluesky pro tip was 🔥 !! I set that up quickly, and we now have a pyopensci.org domain. I added a nod to you at the top of the post for providing so much input!!


## Don’t cache package dependencies in your publish step

Caching dependencies involves storing dependencies to be reused in future workflow runs. This approach saves time, as GitHub doesn’t need to redownload all dependencies each time the workflow runs.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where I'm confused about pinning deps, as that doesn't seem like a good idea for users when I think about environments and dependency conflict nightmares.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For deploy workflows or anything creating release artifacts end users might use, I think I've heard from @sethmlarson that it makes sense to reduce the number of dependencies you're using where possible: you can't have problems with a hacked dependency you're not using.

That should also reduce the conflict nightmares when pinning things...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Hugo! This helps clarify what I was struggling with! I'll rewrite this sentence to discuss the concept more broadly! I see that seth also gave this a thumbs up!


Using a release-based workflow ensures that your publishing step is tightly controlled. A pull request will never accidentally trigger a publish build. This reduces your risk!


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add: a few sentences on

delete old tokens and secrets containing pypi tokens... this is easy for people to do and good for them to know about.

Copy link

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome blog post, thanks Leah! I left some comments:

Now that you have a GitHub environment setup, you can set up Trusted Publisher in your PyPI account.
A Trusted Publisher setup creates a short-lived secure link between PyPI and your repository.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove "short-lived" from this sentence? The underlying implementation of Trusted Publishers uses short-lived credentials, but the link itself is durable!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!


Named as a playful nod to Dr. Zizmor’s famous “clear skin” ads, Zizmor aims to give you “beautiful clean workflows.”

Learn more about Zizmor on the [official blog post by Yossarian](https://blog.yossarian.net/2024/10/27/Now-you-can-have-beautiful-clean-workflows).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yossarian -> William Woodruff

@lwasser
Copy link
Member Author

lwasser commented Dec 20, 2024

@sethmlarson thank you so much for the review!! I appreciate it! 🚀
I'll address your comments next!

@lwasser
Copy link
Member Author

lwasser commented Dec 20, 2024

@all-contributors please add @sethmlarson for review

Copy link

@hugovk hugovk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, mostly nitpicks.

And good to see @sethmlarson here, I was going to summon him!


## Is your PyPI publication workflow secure?

The recent Python package breach [involving Ultralytics](https://blog.pypi.org/posts/2024-12-11-ultralytics-attack-analysis/) has highlighted the need for secure PyPI publishing workflows. Hackers exploited a GitHub workflow (`pull_request_target`) to inject malicious code, which was published to PyPI. Users who downloaded the package unknowingly allowed their machines to be hijacked for Bitcoin mining.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See googlecolab/colabtools#4979.

It was unwitting end users who installed the hacked package on their Colab accounts, and the Google automation then suspended their accounts:

Connection failed
This account has been blocked from accessing Colab runtimes due to suspected abusive activity. This does not impact access to other Google products. If you believe this has been done in error, review the usage limitations and file an appeal.

Google was pretty quick in restoring accounts, but it happened to a lot of accounts: googlecolab/colabtools#4979 (comment)


## 4. Use Trusted Publisher for PyPI

If you only [publish locally to PyPI using the command line](https://www.pyopensci.org/python-package-guide/tutorials/publish-pypi.html), you need to use a PyPI token. However, if you’re using GitHub Actions to automate your publishing process, setting up **Trusted Publisher** is a more secure option.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TP is more secure in that the tokens are short-lived compared to perpetual API tokens.

And it seems like the hackers got hold of forgotten API tokens in the repo to be able to make the second pair of releases in this case.

_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved
_posts/2024-12-13-python-packaging-security.md Outdated Show resolved Hide resolved

## Don’t cache package dependencies in your publish step

Caching dependencies involves storing dependencies to be reused in future workflow runs. This approach saves time, as GitHub doesn’t need to redownload all dependencies each time the workflow runs.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For deploy workflows or anything creating release artifacts end users might use, I think I've heard from @sethmlarson that it makes sense to reduce the number of dependencies you're using where possible: you can't have problems with a hacked dependency you're not using.

That should also reduce the conflict nightmares when pinning things...

@willingc
Copy link
Collaborator

I love seeing comments by @sethmlarson, @hugovk and @webknjaz. Thank you for being part of the pyOpenSci community and sharing your expertise. 🐍 🐍

@lwasser lwasser changed the title feat(blog): blog post--lessons learned on publish security [jan 21 merge] feat(blog): blog post--lessons learned on publish security Jan 6, 2025
Co-authored-by: Hugo van Kemenade <[email protected]>
lwasser and others added 3 commits January 8, 2025 11:23
@lwasser
Copy link
Member Author

lwasser commented Jan 8, 2025

@all-contributors please add @hugovk for code, review

Copy link
Contributor

@lwasser

I've put up a pull request to add @hugovk! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants