Describe the bug
The BASE_URL_REGEX pattern of presidio_analyzer.predefined_recognizers.url_recognizer.UrlRecognizer produces false positives when analyzing text that contains code snippets, filenames, or any other non-URL token of the form <foo>.<bar>. In particular, top-level-domain alternatives such as (?:sy), (?:mt), and (?:py) match inside os.system (Python function), zeus.mtia.local (hostname), rpc.py (Python file), and similar tokens, so os.sy, zeus.mt, rpc.py, etc. are reported as URLs.
To Reproduce
Install presidio_analyzer and run this:
Expected behavior
['https://github.com/abersheeran/rpc.py', 'https://github.com/abersheeran']
Actual behavior
['https://github.com/abersheeran/rpc.py', 'https://github.com/abersheeran', 'rpc.py', 'zeus.mt', 'os.sy']
Screenshots
[Screenshot: UrlRecognizer.BASE_URL_REGEX]
Additional context
If this behavior is intended, adding the negative lookahead (?![a-z]) to the expression at least prevents false positives like os.system (see the screenshots below). Ambiguous cases such as rpc.py, which is both a plausible Python filename and a valid Paraguayan domain, don't seem to have a simple solution though.
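To make the effect of the lookahead concrete, here is a minimal sketch using only Python's re module. The patterns LOOSE and STRICT and the three-entry TLD list are hypothetical stand-ins written for this illustration. They are not Presidio's BASE_URL_REGEX, which is far longer, and the placement of the lookahead right after the TLD group is an assumption.

```python
import re

# Hypothetical, heavily simplified stand-in for a TLD-based URL pattern.
# This is NOT Presidio's BASE_URL_REGEX; only three TLDs are listed.
TLDS = r"(?:sy|mt|py)"

# Without the lookahead, a TLD may match as a prefix of a longer token.
LOOSE = re.compile(rf"\b[a-z0-9-]+\.{TLDS}", re.IGNORECASE)

# With (?![a-z]), the TLD must not be followed directly by another letter.
STRICT = re.compile(rf"\b[a-z0-9-]+\.{TLDS}(?![a-z])", re.IGNORECASE)

text = "call os.system on zeus.mtia.local, then run rpc.py"

loose_hits = LOOSE.findall(text)
strict_hits = STRICT.findall(text)

print(loose_hits)   # ['os.sy', 'zeus.mt', 'rpc.py']
print(strict_hits)  # ['rpc.py']
```

In this sketch the lookahead removes os.sy and zeus.mt, but rpc.py still matches, which is the ambiguous case described above.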
gfrebello changed the title from "UrlRecognizer causes false positives when analyzing code snippets" to "UrlRecognizer detects many false positives when analyzing code snippets" on Dec 12, 2024
Thanks for raising this issue. Agreed, this isn't straightforward: website.ru is a valid URL, and it looks much like website.py.
One way to work around this is to set the confidence threshold above 0.5, which is the current score of the base URL pattern. For a real URL, context words around the URL text could help boost the score (but not in all cases).
If you have a suggestion for improvement, we'd be happy to review it.