Unexpected silent exits of presidio application #1505

grafandreas · 2025-01-02T11:39:40Z

First of all, thanks for the great work on this project.

I am encountering the following problem: the Python app exits silently and non-deterministically during a call to anonymize_text().
Activating logging level DEBUG shows the following:

DEBUG:presidio-analyzer:Returning a total of 10 recognizers
INFO:presidio-analyzer:Fetching all recognizers for language de
DEBUG:presidio-analyzer:Returning a total of 10 recognizers

That is the last output before the application returns to the command line. Texts passed in earlier are anonymized correctly.

We do not have a custom analyzer, so this is out of the box
Running with Python 3.12.3
No error messages / stack trace shown

Any pointers / hints on what might cause this problem?

omri374 · 2025-01-02T12:07:46Z

Hi, thanks for raising this. Would it be possible to create a slightly more detailed reproducible example?
Is this running on pure Python, in Docker, or in pyspark?

grafandreas · 2025-01-03T08:53:17Z

Hi,
it is really difficult / impossible to create a concise reproducible example, since it seems non-deterministic and I cannot share the data set. A bit more information:

We are running pure Python (in a VS Code terminal)
Base setup:

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
# the languages are needed to load country-specific recognizers 
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["de"])

def anonymize_text(text: str) -> str:
    logger.info(f"Anonymizing text: {text}")
    analyzer_results = analyzer.analyze(text=text,
                            language='de')
    
    logger.info(f"Anonymizer results: {analyzer_results}")

    engine = presidio_anonymizer.AnonymizerEngine()
    result = engine.anonymize(text=text, analyzer_results=analyzer_results)
    logger.info(result)
    # Restructuring anonymizer results

    anonymization_results =  {"anonymized": result.text,"found": [entity.to_dict() for entity in analyzer_results]}
    return anonymization_results["anonymized"]

anonymize_text() is then basically called in a loop that fetches data from a SQL (MariaDB) table and writes the anonymized data into another table. Are there maybe any other trace options to get further output?
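On the question of further trace output: one generic option, independent of presidio, is Python's standard faulthandler module, which prints a traceback when the interpreter is brought down by a fatal signal such as a segmentation fault. Whether that is what happens here is only an assumption, not a diagnosis. The sketch below starts a child interpreter with `-X faulthandler` and reports how it ended; the `CHILD` one-liner is a stand-in for the real script, and `run_and_report` is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and describe how the process ended."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    code = completed.returncode
    if code == 0:
        status = "exited normally"
    elif code < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        status = f"killed by signal {-code}"
    else:
        status = f"exited with code {code}"
    # With -X faulthandler, stderr holds the traceback dump if a fatal signal occurred.
    return code, status, completed.stderr


# Stand-in for the real script: exits abruptly without any Python traceback.
CHILD = "import os; os._exit(3)"

code, status, stderr = run_and_report(
    [sys.executable, "-X", "faulthandler", "-c", CHILD]
)
print(status)
```

For the real application, the same effect comes from starting the script with `python -X faulthandler` or calling `faulthandler.enable()` at the top of the script. A "killed by signal" status would suggest a crash in native code or a kill by the operating system rather than a Python exception.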

grafandreas · 2025-01-03T09:07:52Z

I also tried to see if the problem is with one of the registered recognizers, trying to exclude some with combinations of

analyzer.registry.recognizers = analyzer.registry.recognizers[0:1]

to no avail.

janorivera · 2025-01-03T18:04:35Z

Hi,
I have the same issue:
I'm using pure Python.

Below is the function that I'm using:
It has worked once; the other times it fails at some point with no errors.
The loop basically tries to run the scrubber on all message bodies inside a transcript object.

def scrub_transcript_messages(transcript, analyzer, anonymizer, entities=None):
    if "transcript" not in transcript or "messages" not in transcript["transcript"]:
        raise ValueError("Invalid transcript format. Expected 'transcript' key with 'messages' list.")

    if entities is None:
        entities = ["PHONE_NUMBER", "PERSON"]

    scrubbed_transcript = {"transcript": {"messages": []}}
    messages_list = transcript["transcript"]["messages"]

    for message in messages_list:
        scrubbed_message = message.copy()
        try:
            log.logger.info("Processing")
            print(message["body"])
            results = analyzer.analyze(
                text=message["body"],
                entities=entities,
                language='en'
            )
            anonymized_text = anonymizer.anonymize(
                text=message["body"],
                analyzer_results=results
            )
            scrubbed_message["body"] = anonymized_text.text
            log.logger.info("Scrubbed message")
            print(anonymized_text.text)

        except Exception as e:
            scrubbed_message["body"] = f"Error anonymizing message: {e}"
        
        scrubbed_transcript["transcript"]["messages"].append(scrubbed_message)

    return scrubbed_transcript

omri374 · 2025-01-05T13:38:26Z

Thanks, we're trying to reproduce this. @janorivera in your case, I see that you're collecting exceptions into the body of the scrubbed message. Do you have instances where the scrubbed message contains an error and not the scrubbed text?
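A quick way to check this, sketched under the assumption that the output has the same shape as the return value of scrub_transcript_messages() above: scan the scrubbed transcript for bodies that start with the error prefix used in the except branch. `find_failed_messages` and the sample data are hypothetical.

```python
# Prefix written into the body by the except branch of scrub_transcript_messages().
ERROR_PREFIX = "Error anonymizing message:"


def find_failed_messages(scrubbed_transcript):
    """Return the indices of messages whose body holds an error instead of scrubbed text."""
    messages = scrubbed_transcript["transcript"]["messages"]
    return [
        index
        for index, message in enumerate(messages)
        if str(message.get("body", "")).startswith(ERROR_PREFIX)
    ]


# Made-up sample output, for illustration only.
example = {
    "transcript": {
        "messages": [
            {"body": "Hello <PERSON>"},
            {"body": "Error anonymizing message: something went wrong"},
            {"body": "Call me at <PHONE_NUMBER>"},
        ]
    }
}

failed = find_failed_messages(example)
print(failed)
```

An empty list would mean that no exception was caught, so the silent exit would have to happen outside normal Python exception handling.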

Also, it could be more scalable to use the BatchAnalyzerEngine and BatchAnonymizerEngine to run presidio on a list of texts. https://microsoft.github.io/presidio/samples/python/batch_processing/ and https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.batch_analyzer_engine.BatchAnalyzerEngine.analyze_iterator

Could you please check if this happens with batch mode too?
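A minimal sketch of what the batch variant could look like, assuming the `analyze_iterator` and `anonymize_list` methods described in the linked documentation. `scrub_bodies_in_batch` and `build_presidio_engines` are hypothetical helper names, and the engines are passed in as parameters so the loop logic stays separate from model loading.

```python
def scrub_bodies_in_batch(bodies, batch_analyzer, batch_anonymizer,
                          language="en", entities=None):
    """Analyze and anonymize a list of texts using the batch engines."""
    # One list of recognizer results per input text, in the same order.
    results = batch_analyzer.analyze_iterator(
        texts=bodies, language=language, entities=entities
    )
    # Returns the anonymized texts, again in input order.
    return batch_anonymizer.anonymize_list(
        texts=bodies, recognizer_results_list=list(results)
    )


def build_presidio_engines():
    """Create the batch engines (requires presidio and a spaCy model to be installed)."""
    from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine
    from presidio_anonymizer import BatchAnonymizerEngine

    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine())
    batch_anonymizer = BatchAnonymizerEngine()
    return batch_analyzer, batch_anonymizer


# Intended usage, where `transcript` is the input dict from the function above:
#   batch_analyzer, batch_anonymizer = build_presidio_engines()
#   bodies = [m["body"] for m in transcript["transcript"]["messages"]]
#   scrubbed = scrub_bodies_in_batch(bodies, batch_analyzer, batch_anonymizer,
#                                    entities=["PHONE_NUMBER", "PERSON"])
```

If the process still exits silently in batch mode, that would point away from the per-message loop and toward something in the analysis itself.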

omri374 · 2025-01-05T13:40:52Z

@grafandreas I'm trying to reproduce your case. I'm using this code. Is it different in any way from yours?

from logging import getLogger
logger = getLogger()

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
import presidio_anonymizer



configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
# the languages are needed to load country-specific recognizers 
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["de"])

def anonymize_text(text: str) -> str:
    logger.info(f"Anonymizing text: {text}")
    analyzer_results = analyzer.analyze(text=text,
                            language='de')
    
    logger.info(f"Anonymizer results: {analyzer_results}")

    engine = presidio_anonymizer.AnonymizerEngine()
    result = engine.anonymize(text=text, analyzer_results=analyzer_results)
    logger.info(result)
    # Restructuring anonymizer results

    anonymization_results =  {"anonymized": result.text,"found": [entity.to_dict() for entity in analyzer_results]}
    return anonymization_results["anonymized"]


text = """
Hier sind ein paar Beispielsätze, die wir derzeit unterstützen:

Hallo, mein Name ist David Johnson, und ich komme ursprünglich aus Liverpool.
Meine Kreditkartennummer ist 4095-2609-9393-4932, und meine Krypto-Wallet-ID ist 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

Am 11.10.2024 habe ich www.microsoft.com besucht und eine E-Mail an [email protected] von der IP-Adresse 192.168.0.1 gesendet.

Mein Reisepass: 191280342 und meine Telefonnummer: (212) 555-1234.

Dies ist eine gültige internationale Bankkontonummer: IL150120690000003111111. Können Sie bitte den Status des Bankkontos 954567876544 überprüfen?

Kates Sozialversicherungsnummer ist 078-05-1126. Ihr Führerschein? Er lautet 1234567A.

"""

for i in range(100000):
    if i % 100 == 0:
        print(i)
    anonymize_text(text)
