UrlRecognizer detects many false positives when analyzing code snippets #1498

gfrebello · 2024-12-12T20:53:23Z

Describe the bug
The BASE_URL_REGEX pattern of presidio_analyzer.predefined_recognizers.url_recognizer.UrlRecognizer generates false positives when analyzing text that contains code snippets, filenames, or any other non-URL token of the form <foo>.<bar>. In particular, sub-patterns such as (?:sy), (?:mt), and (?:py) match inside os.system (a Python function), zeus.mtia.local (a hostname), and rpc.py (a Python file), producing false positives such as os.sy, zeus.mt, and rpc.py.

To Reproduce
Install presidio_analyzer and run this:

from presidio_analyzer import AnalyzerEngine

text = '''
# Exploit Title: - rpc.py: Remote Code Execution (RCE)
# Google Dork: N/A
# Date: 2022-07-12
# Exploit Author: Elias Hohl
# Vendor Homepage: https://github.com/abersheeran
# Software Link: https://github.com/abersheeran/rpc.py
# Version: v0.4.2 - v0.6.0
# Tested on: Debian 11, Ubuntu 20.04
# CVE : CVE-2022-35411
import requests
import pickle
HOST = "zeus.mtia.local:65432"
HEADERS = {
"serializer": "pickle"
}
def generate_payload(cmd):
  class PickleRce(object):
    def __reduce__(self):
      import os
      return os.system, (cmd,)
  payload = pickle.dumps(PickleRce())
  print(payload)
  return payload
'''

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, entities=['URL'], language='en')
piis = []
for result in results:
    pii = text[result.start:result.end]
    piis.append(pii)

print(piis)

Expected behavior
['https://github.com/abersheeran/rpc.py', 'https://github.com/abersheeran']

Actual behavior
['https://github.com/abersheeran/rpc.py', 'https://github.com/abersheeran', 'rpc.py', 'zeus.mt', 'os.sy']

Screenshots
UrlRecognizer.BASE_URL_REGEX: (screenshot not preserved)

Additional context
If this behavior is intended, adding a negative lookahead (?![a-z]) at the end of the expression at least prevents false positives like os.sy from os.system (see the screenshots). Cases like rpc.py, which could be either a Python file or a valid Paraguayan URL, don't seem to have a simple solution though.
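
To illustrate the (?![a-z]) suggestion above, here is a minimal sketch using only Python's re module. The pattern is a heavily simplified stand-in, not the actual BASE_URL_REGEX, and the names TLDS, NAIVE, GUARDED, and SAMPLE are made up for this example.

```python
import re

# Hypothetical, simplified stand-in for the real pattern: a dotted name
# ending in one of a few two-letter alternatives. NOT the actual BASE_URL_REGEX.
TLDS = r"(?:sy|mt|py)"
NAIVE = re.compile(rf"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.{TLDS}", re.IGNORECASE)

# Same pattern, plus a negative lookahead that rejects a match when the
# two-letter suffix is immediately followed by another letter.
GUARDED = re.compile(rf"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.{TLDS}(?![a-z])", re.IGNORECASE)

SAMPLE = "HOST = zeus.mtia.local; return os.system, (cmd,); see rpc.py"

print(NAIVE.findall(SAMPLE))    # ['zeus.mt', 'os.sy', 'rpc.py']
print(GUARDED.findall(SAMPLE))  # ['rpc.py']
```

The lookahead removes the partial matches (zeus.mt, os.sy) but, as noted, rpc.py still matches because nothing in the token itself distinguishes a filename from a domain.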

omri374 · 2024-12-14T13:38:52Z

Thanks for raising this issue. Agreed, this isn't straightforward, since website.ru is a valid URL that looks just like website.py.

One way to overcome this is to set the confidence threshold above 0.5, which is the current score of the base URL pattern. For a real URL, context words around the URL text could help boost the score (but not in all cases).

If you have a suggestion for improvement, we'd be happy to review it.
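
The threshold workaround described above could be sketched roughly as follows. This assumes only that each result exposes a numeric .score attribute, as in the reproduction script's results; keep_confident and the stand-in objects with their scores are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace


def keep_confident(results, threshold=0.5):
    """Keep only results whose score is strictly above the threshold."""
    return [r for r in results if r.score > threshold]


# Stand-in objects mimicking analyzer results; labels and scores are
# invented for this example and are not real analyzer output.
fake_results = [
    SimpleNamespace(label="full URL with scheme", score=0.6),
    SimpleNamespace(label="bare foo.bar token", score=0.5),
]

kept = keep_confident(fake_results)
print([r.label for r in kept])  # ['full URL with scheme']
```

With the real analyzer, the same filter could be applied to the list returned by analyzer.analyze(...). AnalyzerEngine.analyze also accepts a score_threshold argument that should have a similar effect, though it is worth checking how it treats scores exactly equal to the threshold. Note that this drops every bare foo.bar match, including genuine scheme-less URLs.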

gfrebello changed the title ~~UrlRecognizer causes false positives when analyzing code snippets~~ UrlRecognizer detects many false positives when analyzing code snippets Dec 12, 2024
