Create geoclip_embedding_function.py #3353
Open
Description of changes
This PR adds a GeoClipEmbeddingFunction to Chroma, which creates embeddings from geographic coordinates (latitude and longitude). It accepts string, list, and dictionary inputs, raises an error for malformed or out-of-range coordinates, and logs a warning for invalid inputs.
Test plan
The changes are covered by unit tests using pytest. The tests verify:
Successful embedding generation for valid "lat,lon" strings, [lat, lon] lists, and {"latitude": lat, "longitude": lon} dictionaries.
Correct handling of edge cases, such as coordinates at the poles and the antimeridian.
Proper error handling for invalid input formats (e.g., incorrect number of values, non-numeric values) and out-of-range coordinates.
Logging of warnings for invalid inputs.
Device handling (CPU/CUDA), tested where a CUDA device is available.
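The range and edge-case checks listed above can be illustrated with a minimal sketch. The helper name `check_range` and the logger name are hypothetical and are not taken from the PR; the sketch only assumes the usual bounds of [-90, 90] for latitude and [-180, 180] for longitude, with the poles and the antimeridian treated as valid.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("geoclip_example")


def check_range(lat: float, lon: float) -> None:
    """Log a warning and raise ValueError if a coordinate is out of range.

    Boundary values (the poles at +/-90 and the antimeridian at +/-180)
    are accepted. NaN fails both comparisons and is therefore rejected.
    """
    if not -90.0 <= lat <= 90.0:
        logger.warning("Latitude out of range: %s", lat)
        raise ValueError(f"latitude must be in [-90, 90], got {lat}")
    if not -180.0 <= lon <= 180.0:
        logger.warning("Longitude out of range: %s", lon)
        raise ValueError(f"longitude must be in [-180, 180], got {lon}")


# Edge cases from the test plan: poles and antimeridian pass silently.
for lat, lon in [(90.0, 0.0), (-90.0, 0.0), (0.0, 180.0), (0.0, -180.0)]:
    check_range(lat, lon)
```

In a pytest suite, the rejected cases could be asserted with `pytest.raises(ValueError)` and the warnings with the built-in `caplog` fixture.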
Documentation Changes
Yes. The documentation needs a section describing GeoClipEmbeddingFunction; proposed text follows.
Purpose:
The GeoClipEmbeddingFunction allows you to create embeddings from geographic coordinates (latitude and longitude) using the GeoCLIP model. These embeddings can then be used within Chroma for various geospatial applications, such as:
Similarity Search: Find locations that are geographically close to a given query location.
Clustering: Group similar locations together based on their geographic proximity.
Geographic Data Analysis: Perform analysis on datasets with geographic components, leveraging the semantic understanding of location encoded by GeoCLIP.
GeoCLIP is a CLIP-inspired model that aligns locations and images, providing a rich representation of geographic space. By using GeoClipEmbeddingFunction, you can bring this powerful model's capabilities into Chroma.
To use the GeoClipEmbeddingFunction, you first need to install the geoclip and torch Python packages:
Bash
pip install geoclip torch
Then, you can instantiate the embedding function and use it to generate embeddings from geographic coordinates. The function supports three input formats:
String: A string in the format "latitude,longitude" (e.g., "37.7749,-122.4194").
List: A list containing two floats: [latitude, longitude] (e.g., [37.7749, -122.4194]).
Dictionary: A dictionary with "latitude" and "longitude" keys (e.g., {"latitude": 37.7749, "longitude": -122.4194}).
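To make the three input formats concrete, here is a minimal sketch of how they might be normalized into a `(latitude, longitude)` pair before being passed to the model. The helper `parse_coordinates` is hypothetical and is not the PR's actual implementation; it covers only format handling (wrong number of values, non-numeric values, missing keys), not range checks.

```python
from collections.abc import Mapping
from typing import Tuple


def parse_coordinates(value) -> Tuple[float, float]:
    """Normalize a "lat,lon" string, a [lat, lon] list, or a
    {"latitude": ..., "longitude": ...} dict to a (lat, lon) tuple.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    if isinstance(value, str):
        parts = [p.strip() for p in value.split(",")]
    elif isinstance(value, Mapping):
        try:
            parts = [value["latitude"], value["longitude"]]
        except KeyError as exc:
            raise ValueError(f"missing key: {exc}") from exc
    elif isinstance(value, (list, tuple)):
        parts = list(value)
    else:
        raise ValueError(f"unsupported input type: {type(value).__name__}")

    if len(parts) != 2:
        raise ValueError(f"expected 2 values, got {len(parts)}")

    try:
        return float(parts[0]), float(parts[1])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"non-numeric coordinate in {value!r}") from exc


# All three formats yield the same pair.
print(parse_coordinates("37.7749,-122.4194"))
print(parse_coordinates([37.7749, -122.4194]))
print(parse_coordinates({"latitude": 37.7749, "longitude": -122.4194}))
```

Normalizing to a single tuple form up front keeps the embedding step itself independent of how the caller chose to express the coordinates.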
When instantiating the GeoClipEmbeddingFunction, you can optionally specify the device to use for computation ('cpu' or 'cuda'). If no device is provided, the function will automatically attempt to use a CUDA-enabled GPU if available and fall back to CPU otherwise.
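The device fallback described above could look roughly like the following sketch. The function name `resolve_device` is hypothetical, and restricting the explicit choice to exactly 'cpu' or 'cuda' is an assumption based on the description; the PR's actual code may accept other device strings. The only PyTorch call used is `torch.cuda.is_available()`.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the requested device, or pick one automatically.

    Hypothetical helper: an explicit 'cpu' or 'cuda' is returned as-is;
    with no argument, CUDA is used when available, otherwise CPU.
    """
    if device is not None:
        if device not in ("cpu", "cuda"):
            raise ValueError(f"device must be 'cpu' or 'cuda', got {device!r}")
        return device

    try:
        import torch  # imported lazily so the sketch runs without torch
    except ImportError:
        return "cpu"

    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device())       # 'cuda' on a CUDA-enabled machine, else 'cpu'
print(resolve_device("cpu"))  # an explicit choice is respected
```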