When someone knows a lot about a subject but doesn't have any experience applying that knowledge.
When someone knows a lot about a subject but also has experience applying that knowledge in real life.
When we talk about Ctrl + F, we're talking about a model that is smart but not wise. The model is a representation of a piece of knowledge, without any context.
Our first goal is to create a PoC that follows the Ctrl + F model.
Our dataset will be a book in .pdf format.
- We don't want to spend a lot of time on data preprocessing (a minimal extraction sketch follows below).
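To keep preprocessing minimal, something like the sketch below could pull the raw text out of the PDF. The pypdf library and the book.pdf file name are assumptions for illustration, not decisions we have made for the PoC.

```python
# Minimal sketch: extract plain text per page from the book PDF.
# pypdf and the "book.pdf" file name are assumptions for illustration.
from pypdf import PdfReader


def extract_pages(pdf_path: str) -> list[str]:
    """Return the plain text of each page in the PDF."""
    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        # extract_text() can return None for empty or image-only pages.
        pages.append(page.extract_text() or "")
    return pages


if __name__ == "__main__":
    pages = extract_pages("book.pdf")
    print(f"Extracted {len(pages)} pages")
```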
We will look to Azure for services that can help us build our PoC.
In what way can PoC v1 provide real value to people?
Because the Azure OpenAI service only accepts English, we are limited to English. The OpenAI embeddings API accepts a maximum of 2048 tokens per request.
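To stay under that limit, we will probably have to count tokens before calling the embeddings API. The sketch below uses tiktoken (OpenAI's tokenizer library) as an assumption on our side; it splits a piece of text into chunks of at most 2048 tokens.

```python
# Minimal sketch: split text into chunks that respect the 2048-token limit.
# tiktoken is an assumption here; it is OpenAI's tokenizer library.
import tiktoken

MAX_TOKENS = 2048  # limit mentioned for the embeddings API


def chunk_text(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
    tokens = encoding.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(encoding.decode(tokens[start:start + max_tokens]))
    return chunks
```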
In trying to understand how we can expose large sets of data to an LLM without exceeding its token limit, we formulated the following hypothesis.
Take the book as an example. We generate embedding vectors for its contents with 'text-embedding-ada-002', respecting the model's limit of 2048 tokens (roughly 2 to 3 pages of text) per request and replacing line endings with spaces. We then store the resulting vectors in a vector database; we are still evaluating which one. As we understand it now, the context we can send to the model along with a question is limited. The embedding vectors offer a shortcut: instead of the entire book, we only include the parts whose vectors match the vector generated for the question, and we add that context to the LLM so it can answer the question.
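To make the hypothesis concrete, here is a rough sketch of the flow we have in mind, written against the openai Python package's AzureOpenAI client. The deployment names, the in-memory list standing in for the vector database, and the cosine-similarity lookup are all assumptions for illustration, not final choices.

```python
# Rough sketch of the hypothesis: embed book chunks, store the vectors,
# retrieve the closest chunks for a question, and feed only those to the LLM.
# Deployment names and the in-memory "vector store" are assumptions.
import os

import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"  # assumed deployment name
CHAT_DEPLOYMENT = "gpt-35-turbo"                 # assumed deployment name


def embed(text: str) -> np.ndarray:
    """Embed a piece of text, replacing line endings with spaces."""
    response = client.embeddings.create(
        input=[text.replace("\n", " ")],
        model=EMBEDDING_DEPLOYMENT,
    )
    return np.array(response.data[0].embedding)


def build_index(chunks: list[str]) -> list[tuple[str, np.ndarray]]:
    """Stand-in for a vector database: a list of (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def top_matches(question: str, index, k: int = 3) -> list[str]:
    """Return the k chunks whose vectors are closest to the question vector."""
    q = embed(question)
    scored = [
        (float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), chunk)
        for chunk, v in index
    ]
    scored.sort(reverse=True)
    return [chunk for _, chunk in scored[:k]]


def answer(question: str, index) -> str:
    """Ask the LLM the question with only the matching chunks as context."""
    context = "\n\n".join(top_matches(question, index))
    response = client.chat.completions.create(
        model=CHAT_DEPLOYMENT,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The in-memory list is only a stand-in; once we pick a vector database, build_index and top_matches would be replaced by that database's insert and similarity-search operations.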