#Embedding


I’m excited to share my newest blog post, "Don't use cosine similarity carelessly":

p.migdal.pl/blog/2025/01/dont-

We often rely on cosine similarity to compare embeddings; it's the "duct tape" of vector comparisons. But just like duct tape, it can quietly mask deeper problems. Sometimes embeddings pick up the "wrong kind" of similarity, matching questions to other questions instead of to their answers, or keying on formatting quirks and typos rather than the text's real meaning.
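To make that failure mode concrete, here's a minimal sketch. The sentence-transformers library and the all-MiniLM-L6-v2 model are my illustrative choices, not necessarily what the post uses:

```python
# Minimal sketch of the "wrong kind of similarity" pitfall.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "How do I reset my password?"
similar_question = "How can I change my password?"
actual_answer = "Go to Settings > Security and click 'Reset password'."

emb = model.encode([question, similar_question, actual_answer])

# Cosine similarity often ranks the paraphrased question above the real
# answer, because the embedding encodes "this is a password question"
# rather than "this text answers the query".
print(util.cos_sim(emb[0], emb[1]))  # question vs. similar question
print(util.cos_sim(emb[0], emb[2]))  # question vs. actual answer
```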

In my post, I discuss what can go wrong with off-the-shelf cosine similarity and share practical alternatives. If you’ve ever wondered why your retrieval system returns oddly matched items or how to refine your embeddings for more meaningful results, this is for you!
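One commonly used fix, shown here as my own illustration rather than the post's specific recommendation, is to re-score retrieved candidates with a cross-encoder, which reads the query and document together instead of comparing two independently computed vectors:

```python
# Illustrative alternative: rerank candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my password?"
candidates = [
    "How can I change my password?",
    "Go to Settings > Security and click 'Reset password'.",
]

# Higher score means "better answer to the query",
# not merely "similar-looking text".
scores = reranker.predict([(query, c) for c in candidates])
print(list(zip(candidates, scores)))
```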
I want to thank Max Salamonowicz and Grzegorz Kossakowski for their feedback after my flash talk at the Warsaw AI Breakfast, Rafał Małanij for inviting me to speak at the Python Summit, and everyone who asked curious questions at the conference and on LinkedIn.

p.migdal.pl · Don't use cosine similarity carelessly
"Cosine similarity - the duct tape of AI. Convenient but often misused. Let's find out how to use it better."

🆕 Encoder-only model that's a direct drop-in replacement for existing BERT models (see the sketch below)
- First major upgrade to BERT-style models in six years
- Significantly reduced processing costs for large-scale applications
- Enables longer document processing without chunking
- Better performance on retrieval tasks
- Suitable for consumer-grade GPU deployment
#llm #ai #embedding
huggingface.co/blog/modernbert

huggingface.co · Finally, a Replacement for BERT: Introducing ModernBERT
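To illustrate the drop-in claim, here's a minimal fill-in-the-mask sketch, assuming the answerdotai/ModernBERT-base checkpoint on Hugging Face and a recent transformers release:

```python
# Minimal sketch: ModernBERT used through the same masked-LM interface
# as a classic BERT checkpoint.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # swap in "bert-base-uncased" to compare
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Find the masked position and read off the top prediction.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.logits[0, mask_index].argmax(-1)
print(tokenizer.decode(predicted_id))
```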