Support LLM based de-identification #1234
Comments
Hi @omri374 |
Hi @VMD7, thanks for this review. There are some challenges in this implementation, confidence being one of them. We can definitely evaluate other approaches for an LLM integration, such as https://github.com/vllm-project/vllm |
Hey I think this is a great idea! Could you provide some demo code on how to integrate spacy-llm? Do we need to create a customized nlp engine, similar to |
I have a draft of this, but it's not ready yet. Yes, my approach is to create a new NlpEngine. |
If you'd like to give it a try and create a PR, we can collaborate on this. |
Would love to collaborate on this! Could you draft a pull request for the work you've been focusing on? Currently I'm running some initial tests, trying to do NER with an OpenAI GPT model in a customized NLP engine:
But I'm not sure how to calculate scores for the retrieved entities. |
Let me push my branch. It's very initial, and to be honest I'm not sure that's the best way to go, but we can brainstorm on this. |
@cloudsere this is taking me longer than usual to get to, so I'll just write down my initial design/thinking:
Requirements
Alternatives
Prompt (simple just for illustration)
Output: [
{"word": "Hello", "label": "O"},
{"word": ",", "label": "O"},
{"word": "my", "label": "O"},
{"word": "name", "label": "O"},
{"word": "is", "label": "O"},
{"word": "David", "label": "PERSON"},
{"word": "Johnson", "label": "PERSON"},
{"word": "and", "label": "O"},
{"word": "I", "label": "O"},
{"word": "live", "label": "O"},
{"word": "in", "label": "O"},
{"word": "Maine", "label": "LOCATION"},
{"word": ".", "label": "O"},
{"word": "My", "label": "O"},
{"word": "credit", "label": "O"},
{"word": "card", "label": "O"},
{"word": "number", "label": "O"},
{"word": "is", "label": "O"},
{"word": "4095-2609-9393-4932", "label": "CREDIT_CARD"},
{"word": "and", "label": "O"},
{"word": "my", "label": "O"},
{"word": "crypto", "label": "O"},
{"word": "wallet", "label": "O"},
{"word": "id", "label": "O"},
{"word": "is", "label": "O"},
{"word": "16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ", "label": "CRYPTO_KEY"},
{"word": ".", "label": "O"}
]
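A token-level output like the one above still has to be mapped back to character offsets before it can be used as detection results. The following is a minimal sketch of that step, not the code from the draft branch: `Span` and `labels_to_spans` are hypothetical names, and it assumes the model returns every token verbatim and in order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    """Hypothetical container for one detected entity (end is exclusive)."""
    entity_type: str
    start: int
    end: int


def labels_to_spans(text: str, tagged: List[Dict[str, str]]) -> List[Span]:
    """Locate each returned token in the text and merge adjacent same-label tokens."""
    spans: List[Span] = []
    cursor = 0
    prev_label = "O"
    for item in tagged:
        word, label = item["word"], item["label"]
        # Search forward only, so repeated words map to the right occurrence.
        start = text.find(word, cursor)
        if start == -1:
            raise ValueError(f"token {word!r} not found after offset {cursor}")
        end = start + len(word)
        if label != "O":
            gap_is_blank = bool(spans) and text[spans[-1].end:start].strip() == ""
            if prev_label == label and gap_is_blank:
                spans[-1].end = end  # extend the previous entity
            else:
                spans.append(Span(label, start, end))
        prev_label = label
        cursor = end
    return spans


if __name__ == "__main__":
    sample = "My name is David Johnson and I live in Maine."
    output = [
        {"word": "My", "label": "O"},
        {"word": "name", "label": "O"},
        {"word": "is", "label": "O"},
        {"word": "David", "label": "PERSON"},
        {"word": "Johnson", "label": "PERSON"},
        {"word": "and", "label": "O"},
        {"word": "I", "label": "O"},
        {"word": "live", "label": "O"},
        {"word": "in", "label": "O"},
        {"word": "Maine", "label": "LOCATION"},
        {"word": ".", "label": "O"},
    ]
    for span in labels_to_spans(sample, output):
        print(span.entity_type, repr(sample[span.start:span.end]))
```

If the model rewrites or drops a token, the lookup fails with a `ValueError` instead of silently producing shifted offsets, which seems preferable for a PII detector.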
If anyone has a better idea than what I described here (I'm sure there is!) please respond to this thread. |
Initial (and experimental) code using |
@omri374 this is quite a cool idea, is it now abandoned? |
@magedhelmy1 if you (or anyone else) would be interested in taking this initial approach and continuing developing it, that would be great! |
Great point, but when I tried implementing something similar to this, I ran into tokenization compatibility issues, especially with special characters and with longer texts. So I passed in a list of tokens and asked the model to return a list of labels. The returned list can then be checked against the original list of tokens to make sure both have the same length. |
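The length check described in this comment could look roughly like the sketch below. `align_labels` is a hypothetical helper, and mapping unknown labels to "O" is an assumption added here for illustration, not something the comment specifies.

```python
from typing import List, Sequence


def align_labels(
    tokens: Sequence[str],
    labels: Sequence[str],
    allowed: Sequence[str],
) -> List[str]:
    """Check that the model returned one label per token and normalize the labels."""
    if len(labels) != len(tokens):
        # A mismatch means the labels can no longer be paired with tokens safely.
        raise ValueError(
            f"expected {len(tokens)} labels, got {len(labels)}"
        )
    allowed_set = set(allowed) | {"O"}
    # Assumption: labels outside the requested entity set are treated as non-PII.
    return [label if label in allowed_set else "O" for label in labels]
```

On a mismatch the caller can retry the request or fall back to another recognizer, rather than guessing which tokens the labels belong to.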
Nice approach. Does it affect the detection accuracy in any way? |
Is your feature request related to a problem? Please describe.
LLMs usually do well at PII detection and de-identification. Using LLMs to identify PII in text could allow users to easily expand Presidio's capabilities with arbitrary PII entities, and with PII which is a characteristic of a person rather than an identifier (e.g. "He recently got divorced" vs. "His SSN is 1234").
Describe the solution you'd like
Presidio currently supports multiple NER and NLP approaches for PII detection. Presidio provides several NlpEngine implementations for transformers, stanza and spaCy. Creating one for LLMs would be a simple way to integrate an LLM into Presidio. One possible way to achieve this is using spacy-llm, which already has integrations with many LLM frameworks and models, and takes care of things like identifying the span of a PII entity discovered by an LLM.
Describe alternatives you've considered
We can use LLMs in many steps in the de-identification pipeline. We have examples for using LLMs to generate fake data, we can use LLMs to identify PII in text, and we can use LLMs to do the end-to-end de-identification. While we can consider building all three capabilities, we should start with PII detection, in order to conform with the Presidio structure, and be able to leverage existing de-identification operators in presidio-anonymizer.
Additional context
Contributions welcome! There are plenty of docs on how the NlpEngine is structured, and existing code samples for integrating NLP frameworks into Presidio.