Performing a semantic search in a dataset

from datasets import load_dataset
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Load the dataset - first 2000 articles
dataset = load_dataset('openwebtext', split='train[:2000]')

# Preprocess the dataset: lowercase and tokenize each article into a list of words
texts = [simple_preprocess(article['text']) for article in dataset]

# Train a Word2Vec model (200-dimensional vectors, context window of 5 words)
model = Word2Vec(texts, vector_size=200, window=5, min_count=1, workers=4)

# Find the most similar words to "politics", "global", and "economy"
similar_words = model.wv.most_similar(positive=['politics', 'global', 'economy'], topn=10)
print(similar_words)

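In this code, we train a Word2Vec model on the first 2000 articles and then perform a semantic search on the terms "politics", "global", and "economy". The most_similar call combines the three word vectors into a single query vector and returns the 10 vocabulary words whose vectors are closest to it by cosine similarity.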
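
To make the ranking step concrete, here is a minimal sketch of the idea behind that search: average the query vectors, then rank every other word by cosine similarity to the average. It is a simplified illustration in plain Python, not gensim's actual implementation; the helper names and the 3-dimensional toy vectors are hypothetical and made up for this example.

```python
import math


def unit(vector):
    """Scale a vector to length 1 so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(vectors, positive, topn=10):
    """Rank words by cosine similarity to the mean of the query word vectors."""
    normalized = [unit(vectors[word]) for word in positive]
    dim = len(normalized[0])
    # Query vector: element-wise mean of the normalized query vectors.
    query = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    # Score every word except the query words themselves.
    scores = [
        (word, cosine_similarity(query, vec))
        for word, vec in vectors.items()
        if word not in positive
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical toy embeddings (assumption: invented values, not trained vectors).
toy_vectors = {
    "politics": [0.9, 0.1, 0.0],
    "economy": [0.8, 0.3, 0.1],
    "government": [0.85, 0.2, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

results = most_similar(toy_vectors, positive=["politics", "economy"], topn=2)
print(results)
```

With these toy values, "government" ranks first because its vector points in nearly the same direction as the averaged query, while "banana" scores close to zero. A trained model works the same way, only with hundreds of dimensions and a much larger vocabulary.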

Vector Databases

What is a vector database?

Full-text search

Semantic search

How does vectorization work?

NLP Example

Vectorization algorithms

Finally, the vector database!

How does the vector database know which vectors are similar?

Let's suppose you want to search for the following text:

Performing a semantic search in a dataset

Most popular vector databases

Thank you!