Embeddings with Sentence Transformers and Pinecone for Question Answering in Spanish

Eduardo Muñoz
12 min read · Jul 18, 2023

A simple how-to guide on using a vector database, semantic search and sentence-transformer for a question-answering task

Image from Leonardo.AI by the author

In this short article, we are going to show how, in a few lines of code, we can store the information of thousands of documents in a database and later extract the most relevant pieces to solve a question-answering or summarization task.
The goal is not to present the most efficient model or algorithm for question answering, but to describe how easily, using tools such as vector databases and sentence-transformer models, we can compress the information of thousands of documents and then query them quickly to retrieve those that can help us solve a task or answer a question.

Let’s start by describing what will be our “compressed” information storage space.

A vector database: Pinecone

A vector database is a database where we store and manage unstructured information, like text, images, and audio, in the form of high-dimensional vectors (embeddings) with functionalities that speed up the retrieval of such information based on similarity.

That is, we take an unstructured document, encode its information into a single vector in a high-dimensional space (an embedding), and usually store alongside it the metadata that will later be used to solve certain tasks. Subsequently, given a question or a text about a topic we want more information on, we encode it in the same way as the original documents and use it as the "base" vector for a similarity search in our vector database. As a result, we retrieve the documents that contain information on the same or a similar topic as the question.

Instead of storing documents classified by category, author, subject, etc., and then searching for keywords in that metadata, we try to compress all the information of a document into a vector in an n-dimensional space, so that documents on the same subject end up close together in that multidimensional space. When we have another text for which we want to find similar documents, we apply the same "compression" and perform a semantic search over the rest of the documents: those at a smaller distance will be more closely related and can help us with the task at hand.
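As a minimal illustration of this idea (the example texts below are invented, and the model is the same multilingual one selected later in the article), we can encode a question and two passages and compare them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

# the multilingual model selected later in this article
model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v1")

# a question and two passages, one on the same topic and one unrelated (made-up examples)
texts = [
    "¿Dónde nació Beyoncé?",
    "Beyoncé Giselle Knowles nació en Houston, Texas.",
    "La receta lleva dos huevos y un poco de harina.",
]
emb = model.encode(texts, convert_to_tensor=True)

# the question should be much closer to the passage on the same topic
print(util.cos_sim(emb[0], emb[1]))  # higher similarity
print(util.cos_sim(emb[0], emb[2]))  # lower similarity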

Vector databases are able to retrieve similar objects of a query quickly because they have already pre-calculated them. The underlying concept is called Approximate Nearest Neighbor (ANN) search, which uses different algorithms for indexing and calculating similarities. [1]

by Leonie Monigatti

To optimize the search and the information extraction process, an index is created that stores the representations or vectors of each text and speeds up the search process between documents by similarity.

Pinecone, indexes, and vectors

There is a multitude of vector databases on the market; many are offered as cloud services and others on-premise. Among them, we have selected Pinecone, which offers a free tier for evaluation and is provided as a cloud service.

The highest-level organizational element in Pinecone is the index, which receives and stores vectors, executes queries over them, and enables other operations on its contents. Each index runs on at least one pod, the basic hardware unit on which an index runs. An index can have more than one pod, which allows for more storage, lower latency, and higher throughput. In addition, billing is based on the type and number of pods we assign to our indexes.

When working with indexes, we must determine two critical factors when creating them: the length of the vectors and the similarity measure.
The vector length is determined by the model we use to create the embeddings; therefore, once we have evaluated and selected the model that will generate our embeddings, we identify the size of its output and take it as the length of our vectors. It is usually 768 for BERT-based models, 512 in other cases, or even smaller.
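As a quick check (a minimal sketch using the model selected later in this article), the sentence-transformers library can report this dimension directly:

from sentence_transformers import SentenceTransformer

# the multilingual model used later in this article
model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v1")
# the vector length we must use when creating the index (512 for this model)
print(model.get_sentence_embedding_dimension())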

Another relevant concept when working with vectors is the distance measure that will be used to determine how close two vectors are to each other, and consequently how similar they are. Pinecone allows us to work with three measures:

- Euclidean: this is the most commonly used measure and, being a distance, the smaller it is, the closer the vectors are to each other.
- Cosine: it is often used to compare two documents; its values range between -1 and 1, and higher values mean more similar vectors.
- Dot product: the product of the two vectors; like cosine similarity, the higher the value, the more similar the vectors are.

From reading several articles and examples on working with sentence transformers, the most widespread measures are cosine similarity and Euclidean distance (a small sketch of the three measures follows below).
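To make the three options concrete, here is a small sketch computing each measure with NumPy on two toy vectors (the values are made up):

import numpy as np

# two toy vectors, just for illustration
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

# Euclidean distance: the smaller it is, the closer the vectors
euclidean = np.linalg.norm(a - b)

# Cosine similarity: values in [-1, 1], higher means more similar
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: higher values mean more similar (for vectors of comparable magnitude)
dot_product = np.dot(a, b)

print(euclidean, cosine, dot_product)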

A record in an index contains different elements:
- ID: a unique identifier for the record.
- Dense vector: the embedding in a dense format.
- Sparse vector: sometimes the vector can be stored in sparse mode, reducing its size when most of its values are null or 0.
- Metadata: key-value pairs that store the relevant information we will use, when retrieving the vector, to solve the problem we are facing.

It is the metadata that will provide the information we need to solve our problem: the record closest to our query will return, for example, a metadata text containing the answer. It is therefore essential that, when creating our index and storing the vectors, we register the appropriate metadata to find the answer to the task we are solving (see the illustrative record below).
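As an illustration only (the field values below are invented, and the vector is truncated to three numbers instead of the full 512), a record of the kind we will later upsert into the index has this shape:

# a single record as (id, dense vector, metadata); the values are invented
record = (
    "42",                          # unique ID for the record
    [0.12, -0.07, 0.33],           # dense vector (really 512 floats)
    {                              # metadata returned alongside the vector
        "title": "Beyoncé",
        "context": "Beyoncé Giselle Knowles nació en Houston, Texas, ...",
    },
)
print(record)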

The aim here is not to delve into Information Retrieval techniques or the technical details of Pinecone indexes; you can consult the official Pinecone website and other sources to understand in detail what an index is and how it works [2].

Image generated in Leonardo.AI by the author

Problem description and steps

Let’s get to the point of what we want to show: we are approaching a question-answering problem in which, from a query, we search for the texts that can give us the answer. Subsequently, we take a model trained on the question-answering task and provide it with the texts obtained in the previous step as context, together with our question. Given the above, it is clear that we can use embeddings, vector databases, and semantic search techniques to find a simple solution.

We are going to list the main steps of the process, which will be detailed and shown later on:

1. Download and process the data we will work with.
2. Identify and create a model to generate our vectors.
3. Connect to the vector database and create an index.
4. Upload the text embeddings to the index.
5. Identify and create a question-answering model that will return the answer.
6. Create the embedding of the query and query the index to return the closest texts.
7. Infer the answer to the query by executing the model, taking the retrieved texts as input.

The Q&A dataset

For this demo, we will use the well-known SQuAD 2.0 dataset, in a Spanish version hosted on the Huggingface hub under the name `squad_es` [3]. The training split has more than 130,000 records, although since we are only using it as a demo, we will take just a few thousand of them.

Every record contains a title and a context; these values will be stored in our index. Since we are not going to train any model, we can drop the columns about the questions and answers (a rough sketch of this preparation step is shown below).
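A rough sketch of this preparation step might look like the following; the exact squad_es config name and the deduplication of repeated contexts are assumptions here, not taken from the repository code. The goal is simply to end up with a dataframe df of titles and contexts, which the upsert loop further below relies on:

from datasets import load_dataset
import pandas as pd

# load the Spanish SQuAD dataset from the Huggingface hub
# ("v2.0.0" is an assumed config name; check the dataset card, and newer
# versions of the datasets library may require trust_remote_code=True)
squad_es = load_dataset("squad_es", "v2.0.0", split="train")

# keep only the columns we will store in the index
df = pd.DataFrame(squad_es)[["title", "context"]]

# contexts repeat for every question, so keep one row per context passage
df = df.drop_duplicates(subset="context").reset_index(drop=True)
print(df.shape)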

Select a sentence transformer model

When selecting the model for vector generation, we first have to make sure that it supports Spanish. At the time of writing, a robust model that offers good performance with low resource requirements is the sentence-transformers model distiluse-base-multilingual-cased-v1 [4]. This model maps sentences and paragraphs to a 512-dimensional dense vector space.

# imports for the retriever model
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
# load the retriever model from the Huggingface model hub
retriever = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1', device=device)
# our sentences to encode
sentences = ["This is an example sentence", "Esta es una sentencia de ejemplo"]
# create the embeddings for our sentences
embeddings = retriever.encode(sentences, convert_to_tensor=True)
# show the final embeddings
print(embeddings)

Create an index in Pinecone

We are now ready to create our index in Pinecone, which will be our vector database. As it is a cloud service, we first need to register on the website and generate the API key that we will use to establish the connection to the cloud.
Given the high demand for this type of service, we may have to join a waitlist and wait a few days until we are given access to the free tier, which is more than enough for demos and simple tasks.

Once the connection is established, we will simply have to launch a command to create an index indicating as parameters the length of the vector and the distance metric or similarity measure that we will use to compare vectors.

# imports for the connection to Pinecone
import os
import pinecone
from dotenv import load_dotenv

# load .env file with environment variables
load_dotenv()

# connect to the pinecone environment
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west4-gcp-free"  # find next to API key in console
)

# set the index name
index_name = "extractive-question-answering"

# check if the extractive-question-answering index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=512,
        metric="cosine"
    )

# connect to the extractive-question-answering index we created
index = pinecone.Index(index_name)

Create and upload the embeddings

At this point, we can build a loop that goes through the texts, encodes the embeddings, extracts the relevant metadata, and finally uploads everything to the index. In our case, we will store the title and the context of every record as metadata. When answering a question, we will need at least the context to feed the question-answering model.

The functionalities of the vector database allow us to perform these operations with just a few lines of code.

from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64
# to minimize compute time for this demo, we limit the number of context passages we will work with
max_context = batch_size*50
print("Max number of context passages:", max_context)

# check if the index is empty
index_stats_response = index.describe_index_stats()
if index_stats_response['total_vector_count'] < 100:
    for i in tqdm(range(0, max_context, batch_size)):
        # find end of batch
        i_end = min(i+batch_size, max_context)
        # extract batch from the dataframe of titles and contexts
        batch = df.iloc[i:i_end]
        # generate embeddings for batch
        emb = retriever.encode(batch['context'].tolist()).tolist()
        # get metadata
        meta = batch.to_dict(orient='records')
        # create unique IDs
        ids = [f"{idx}" for idx in range(i, i_end)]
        # add all to upsert list
        to_upsert = list(zip(ids, emb, meta))
        # upsert/insert these records to pinecone
        _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in the index
index.describe_index_stats()
Photo by Emily Morter on Unsplash

Select and instantiate a question-answering model

The second model to select is the one that, given a context and a question, returns the answer. For this example, we have chosen a model based on DeBERTa trained in multilingual mode, known as mDeBERTa, which shares the same architecture but was trained on the multilingual CC100 corpus. The model has been fine-tuned on the SQuAD dataset [5], so our examples will be easy to solve with it.

Basically, we repeat the usual operations to load a model from the Huggingface repository and use a pipeline for the question-answering task. It is the simplest way to invoke the model and get answers.

from transformers import pipeline

# set the reader model
model_name = 'timpal0l/mdeberta-v3-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
print(reader)

Create the query embedding and search for context information

Our next step is to obtain the vector that represents our query, the question we need to answer, and take it as the reference text. We will query the vector database to return the texts that most closely match the query in semantic terms; among them should be the right answer. Fortunately, the database provides simple methods that take care of the semantic search and return the results we ask for: we do not need to write code to calculate the distance between vectors, we simply call the method and wait for the results.

Among the parameters, we can indicate the number of contexts to return and whether we want to include the metadata; we will usually set it to true and recover the original text of the vector from the metadata.

We define two helper functions to execute the inference and the retrieval of information from the vector database.

from pprint import pprint

# gets context passages from the pinecone index
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode([question]).tolist()
    # search pinecone index for context passages with the answer
    xc = index.query(xq, top_k=top_k, include_metadata=True)
    # extract the context passages from the pinecone search result
    c = [x["metadata"]['context'] for x in xc["matches"]]
    return c

# extracts the answer from the context passages
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to the answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results based on the score from the reader model
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    pprint(sorted_result)
    return sorted_result

Finally, we invoke our question-answering model and, hopefully, obtain a correct answer based on the texts retrieved from the vector database. Depending on whether it is an extractive or an abstractive process, we will build the input to the model in one way or another.

In an extractive approach, it is as simple as:

question = "¿Cómo se llama la hermana menor de Beyoncé?"
context = get_context(question, top_k=1)
extract_answer(question, context)

Result:
[{'answer': ' Solange,',
'context': 'Beyoncé Giselle Knowles nació en Houston, Texas, hija de '
'Celestine Ann "Tina" Knowles, una peluquera y dueña de salón, y '
'Mathew Knowles, un gerente de ventas de Xerox. El nombre de '
'Beyoncé es un homenaje al apellido de soltera de su madre. La '
'hermana menor de Beyoncé, Solange, también es cantante y ex '
"miembro de Destiny 's Child. Mathew es afroamericano, mientras "
'que Tina es de ascendencia criolla de Luisiana (con ascendencia '
'africana, nativa americana, francesa, cajún, y distante '
'irlandesa y española). A través de su madre, Beyoncé es '
'descendiente del líder acadiano Joseph Broussard. Fue criada en '
'un hogar metodista.',
'end': 277,
'score': 0.9883753061294556,
'start': 268}]

In the extractive process, where the response is taken directly from pieces of the context, we create a response from each context and display them to the user. In the abstractive process, where the response is generated from scratch by the system, we join all the contexts and send them to the question-answering model, which produces a completely new response based on the context passages.
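The next snippet relies on three helper functions, query_pinecone, format_query, and generate_answer, whose full definitions live in the accompanying repository. A minimal sketch of what they could look like is shown below; the generator model name is only a placeholder assumption (reference [6] points to the Spanish abstractive model actually used), and the sketch reuses the retriever, index, and device objects defined earlier:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# placeholder generator; reference [6] points to the Spanish abstractive model actually used
gen_model_name = "google/mt5-small"  # assumption, not the article's model
gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_name)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(gen_model_name).to(device)

# embed the query and retrieve the closest contexts with their metadata
def query_pinecone(query, top_k):
    xq = retriever.encode([query]).tolist()
    return index.query(xq, top_k=top_k, include_metadata=True)

# join the retrieved contexts into the "question: ... context: <P> ..." format
def format_query(query, matches):
    contexts = ["<P> " + m["metadata"]["context"] for m in matches]
    return f"question: {query} context: {' '.join(contexts)}"

# generate a free-form answer from the formatted query
def generate_answer(query):
    inputs = gen_tokenizer([query], truncation=True, return_tensors="pt").to(device)
    ids = gen_model.generate(**inputs, max_length=64)
    return gen_tokenizer.decode(ids[0], skip_special_tokens=True)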

# Set the query
query = "¿Qué presidente de Francia legalizó el matrimonio homosexual?"
result = query_pinecone(query, top_k=1)
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

Result:
('question: ¿Qué presidente de Francia legalizó el matrimonio homosexual? '
'context: <P> La Asamblea Nacional de Francia votó por legalizar el '
'matrimonio entre personas del mismo sexo.-La ley fue aprobada por 331 votos '
'a favor contra 225. De esta forma Francia se convierte en décimo cuarto país '
'en aprobar tal medida. Antes de producirse el voto, hubo escenas de caos en '
'la cámara, con el orador de orden exigiéndoles a quienes protestaban que '
'salieran del edificio. La propuesta del presidente francés Francois Hollande '
'de legalizar el matrimonio homosexual generó manifestaciones a favor y en '
'contra a lo largo de todo el país en las que participaron cientos de miles '
'de personas. La nueva ley permite a las parejas homosexuales adoptar niños, '
'algo apoyado por la mayoría de la población francesa. Final de Quizás '
'también te interese Partidos de oposición dijeron que apelarán ante el '
'Consejo Constitucional, el árbitro supremo del país en materia de leyes.')

# Ask for the answer
generate_answer(query)

Result:
'francois hollande'

We are not going to analyze the parameters of this text generation process, since that is not the objective of this article, but it is worth exploring them if we want to obtain more or less extensive and creative answers (see the example below).
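For example, reusing the gen_model, inputs, and gen_tokenizer names from the sketch above (the parameter values here are arbitrary and only meant to show where these knobs plug in):

# illustrative generation parameters; the values are arbitrary
ids = gen_model.generate(
    **inputs,
    max_length=128,          # allow longer answers
    num_beams=4,             # beam search for higher-quality outputs
    no_repeat_ngram_size=3,  # avoid repeated phrases
    early_stopping=True,     # stop when all beams have finished
)
print(gen_tokenizer.decode(ids[0], skip_special_tokens=True))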

And that’s it. As you can see, this is a simple workflow that we can build with a few lines of code, using well-known and accessible models from Huggingface. Remember that this work is just a simple demo of the basic use of embeddings and vector databases, showing how to work with them in very basic operations.

If we want to dive deeper into the problem and obtain a better solution, we must invest time in analyzing embedding models and the benchmarks on this topic, since semantic search is a key part of the problem. Nor can we forget the selection of the question-answering model, which is particularly tricky in the abstractive case, where it can easily lead us to invented and unreliable answers or, as in our example, to extremely short answers with no creativity at all.

You can check the demo notebooks and the full code explanation in my Github repo question-answering-pinecone-sts. Feel free to make suggestions for improving the code; I will really appreciate it!

References

[1]. Explaining Vector Databases in 3 Levels of Difficulty by Leonie Monigatti.

[2]. Pinecone documentation

[3]. SQUAD dataset in Spanish in Huggingface

[4]. Sentence transformer model for Spanish text in Huggingface

[5]. Question-Answering model in extractive approach from Huggingface

[6]. Abstractive Question-Answering model in Spanish from Huggingface
