#embeddings


Ah, my latest tool, fresh out of the oven, just in time for my summer break! It's called *Vandolie*. It's for high school students, but it may work for you as well. I'll let you discover it for yourself.

👉 jacomyma.github.io/vandolie/en/

It's like a mini CorTexT for teenagers, if you know that tool. But it runs entirely in the browser.

Entirely localized in Danish.

Consider it a beta version. Usable, but feel free to file GitHub issues for feedback & bugs.

Okay, back-of-the-napkin math:
- There are probably 100 million sites and 1.5 billion pages worth indexing in a #search engine
- It takes about 1TB to #index 30 million pages.
- We only care about text on a page.

I define a page as worth indexing if:
- It is not a FAANG site
- It has at least one referrer (no DD Web)
- It's active

So this means we need about 40TB of fast storage to make a good index of the internet. That's not "runs locally" sized, but it is nonprofit sized.

My size assumptions are basically as follows:
- #URL
- #TFIDF information
- Text #Embeddings
- Snippet

We can store an index entry for a page in about 30KB. So, in 40TB we can store a full internet index. That's about $500 in storage.
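A quick sanity check on those numbers (the per-TB price below is just the post's $500 / 40TB ratio, not a quote):

```python
# Back-of-the-napkin index sizing, using only the estimates above.
PAGES = 1_500_000_000        # pages worth indexing
BYTES_PER_PAGE = 30_000      # ~30KB: URL + TF-IDF info + quantized embedding + snippet
COST_PER_TB_USD = 12.5       # assumed from "$500 for 40TB"

total_tb = PAGES * BYTES_PER_PAGE / 1e12
print(f"Index size: ~{total_tb:.0f} TB")                     # ~45 TB, same ballpark as 40TB
print(f"Storage cost: ~${total_tb * COST_PER_TB_USD:,.0f}")  # ~$560 at that rate
```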

Access time becomes a problem. TF-IDF for the whole internet can easily fit in RAM. Even with #quantized embeddings, you can only fit about 2 million per GB of RAM.

Assuming you had enough RAM, it could be fast: TF-IDF to get 100 million candidates, #FAISS to sort those, load snippets dynamically, potentially modify rank by referrers, etc.
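A toy version of that two-stage pipeline (tiny corpus; the model name and candidate count are placeholders, not the real 100-million-candidate setup):

```python
# Stage 1: cheap TF-IDF recall. Stage 2: embedding re-rank with FAISS.
import faiss
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "How to build a search engine index at home",
    "Quantized embeddings keep RAM usage manageable",
    "Recipe for sourdough bread with rye flour",
    "FAISS makes approximate nearest-neighbor search fast",
]
query = "small self-hosted web search index"

# Stage 1: TF-IDF scores give a candidate set.
vec = TfidfVectorizer().fit(docs)
scores = (vec.transform([query]) @ vec.transform(docs).T).toarray()[0]
candidates = np.argsort(scores)[::-1][:3]

# Stage 2: re-rank the candidates by embedding similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder model
emb = model.encode([docs[i] for i in candidates], normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])              # inner product = cosine on normalized vectors
index.add(emb.astype(np.float32))
q = model.encode([query], normalize_embeddings=True).astype(np.float32)
_, order = index.search(q, len(candidates))
for rank, j in enumerate(order[0], start=1):
    print(rank, docs[candidates[j]])
```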

Six 128GB #Framework #desktops, each with 5TB drives (plus one Raspberry Pi to sort the final candidates from the six machines), are enough to replace #Google. That's about $15k. (Six × 128GB ≈ 768GB of RAM, enough for roughly 1.5 billion quantized embeddings at 2 million per GB.)

In two to three years this will be doable on a single machine for around $3k.

By the end of the decade it should run as an app on a powerful desktop.

Three years after that it can run on a #laptop.

Three years after that it can run on a #cellphone.

By #2040 it's a background process on your cellphone.

Should you use OpenAI (or other closed-source) embeddings?

1. Try the lightest embedding model first
2. If it doesn’t work, try a beefier model and do a blind comparison
3. If you are already using a relatively large model, only then try a blind test against a proprietary model (a rough sketch of such a comparison is below). If you really find that the closed-source model is better for your application, then go for it.

Paraphrased from iamnotarobot.substack.com/p/sh

I Am Not a Robot · Should you use OpenAI's embeddings? Probably not, and here's why. By Diego Basch
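A rough sketch of what that blind comparison could look like (the labeled pairs are made up, and the two model names are just stand-ins for a lighter vs. heavier open model):

```python
# Blind A/B test of two open embedding models on a tiny labeled retrieval set.
import numpy as np
from sentence_transformers import SentenceTransformer

# (query, relevant_doc) pairs — replace with your own evaluation data.
pairs = [
    ("reset my password", "How to recover account access"),
    ("refund policy", "Returns are accepted within 30 days"),
]
distractors = ["Our office hours are 9 to 5", "The API rate limit is 100 requests/min"]

def hit_rate(model_name: str) -> float:
    model = SentenceTransformer(model_name)
    docs = [d for _, d in pairs] + distractors
    d_emb = model.encode(docs, normalize_embeddings=True)
    hits = 0
    for qi, (q, _) in enumerate(pairs):
        q_emb = model.encode([q], normalize_embeddings=True)
        hits += int(np.argmax(q_emb @ d_emb.T)) == qi   # relevant doc sits at index qi
    return hits / len(pairs)

# Opaque labels keep the comparison blind until you unmask the winner.
print({"A": hit_rate("all-MiniLM-L6-v2"), "B": hit_rate("all-mpnet-base-v2")})
```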

Poster from our colleague @epoz from the UGent-IMEC Linked Data & Solid course: "Exploding Mittens - Getting to grips with huge SKOS datasets", on semantic-embedding-enhanced SPARQL queries for ICONCLASS data.
Congrats on the 'best poster' award ;-)

poster: zenodo.org/records/14887544
iconclass on GitHub: github.com/iconclass

@nfdi4culture @fiz_karlsruhe


Leveraging embedding models for deep understanding was another breakthrough moment. OpenAI's text-embedding-3 allowed us to grasp the nuances of user queries, going beyond mere keyword matches. This capability was instrumental in returning more relevant results and understanding user intent better. #OpenAI #Embeddings
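For context, a minimal sketch of that kind of setup with the OpenAI Python client (the documents, query, and small model variant here are made up for illustration, not the actual system):

```python
# Match a user query to documents by embedding similarity instead of keywords.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

docs = [
    "Troubleshooting steps when the printer will not connect to Wi-Fi",
    "Monthly pricing tiers and what each plan includes",
]
query = "my printer keeps dropping off the wireless network"

resp = client.embeddings.create(model="text-embedding-3-small", input=docs + [query])
vectors = np.array([item.embedding for item in resp.data])
doc_vecs, q_vec = vectors[:-1], vectors[-1]

# Cosine similarity: the right document wins with zero keyword overlap.
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(docs[int(np.argmax(sims))])
```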

Is there a consensus process or good paper on state of the art on using #embeddings & #LLM to do the kinds of things that were being done with topic models? I imagine for tasks with pre-defined classifications, prompts are sufficient, but any recommendations for identifying latent classes? After reading the paper below I think I'll want to use local models. #machinelearning drive.google.com/file/d/1wNDIk

Google Docs · llreplication.pdf
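One common embeddings-based stand-in for topic modeling (not necessarily what the linked paper does) is to embed documents with a local model and cluster the vectors; a minimal sketch:

```python
# Topic-model-style latent classes via local embeddings + k-means clustering.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "The senate passed the budget amendment late on Tuesday",
    "New transformer models cut inference costs in half",
    "Quarterly earnings beat analyst expectations",
    "Open-weight LLMs are closing the gap with proprietary ones",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # runs locally, no API calls
emb = model.encode(texts, normalize_embeddings=True)

k = 2                                             # number of latent classes to discover
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
for label, text in zip(labels, texts):
    print(label, text)
```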

Major Update for Vector Search in SQLite 🚀

🔍 #SQLite-vec v0.1.6 introduces powerful new features:
• Added support for #metadata columns enabling WHERE clause filtering in #KNN queries
• Implemented partition keys for 3x faster selective queries
• New auxiliary columns for efficient unindexed data storage
• Compatible with #embeddings from any provider

🎯 Key improvements:
• Store non-vector data like user_id and timestamps
• Filter searches using metadata constraints
• Optimize query performance through smart partitioning
• Enhanced data organization with auxiliary columns

⚡ Performance focus:
• Partition keys reduce search space significantly
• Metadata filtering streamlines result selection
• Auxiliary columns minimize JOIN operations
• Binary quantization options for speed optimization

🔄 #Database integration:
• Supports boolean, integer, float & text values
• Works with standard SQL queries
• Enables complex search combinations
• Maintains data consistency

Source: alexgarcia.xyz/blog/2024/sqlit

alexgarcia.xyz · sqlite-vec now supports metadata columns and filtering
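A minimal sketch of those features from Python, assuming the vec0 column syntax and filtering described in the post (table and column names here are invented):

```python
# sqlite-vec v0.1.6: partition key, metadata filtering, and an auxiliary column.
import sqlite3
import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# user_id is a partition key, created_at a filterable metadata column,
# and +body an auxiliary (stored but unindexed) column.
db.execute("""
    create virtual table notes using vec0(
        user_id integer partition key,
        embedding float[4],
        created_at integer,
        +body text
    )
""")

rows = [
    (1, [0.1, 0.2, 0.3, 0.4], 1700000000, "first note"),
    (1, [0.4, 0.3, 0.2, 0.1], 1710000000, "second note"),
    (2, [0.9, 0.8, 0.7, 0.6], 1720000000, "someone else's note"),
]
for user_id, vector, ts, body in rows:
    db.execute(
        "insert into notes(user_id, embedding, created_at, body) values (?, ?, ?, ?)",
        (user_id, sqlite_vec.serialize_float32(vector), ts, body),
    )

# KNN search restricted to one partition and filtered on metadata.
query = sqlite_vec.serialize_float32([0.1, 0.2, 0.3, 0.4])
hits = db.execute(
    """
    select body, distance from notes
    where embedding match ? and k = 2
      and user_id = 1 and created_at > 1690000000
    """,
    (query,),
).fetchall()
print(hits)
```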

🐸 Screaming Frog introduces APIs for interfacing with #OpenAI, #Google and #Ollama models
✨ It works on the HTML saved during crawling, whereas the previous version used custom JavaScript snippets executed while rendering the pages.
👉 You can generate #embeddings and content with custom prompts over selectable contexts (via predefined and custom extractors).

🔍 #txtai - All-in-one #embeddings database combining vector indexes, graph networks & relational databases

💡 Key Features:
• Vector search with SQL support, object storage, topic modeling & multimodal indexing for text, documents, audio, images & video
• Built-in #RAG capabilities with citation support & autonomous #AI agents for complex problem-solving
• #LLM orchestration supporting multiple frameworks including #HuggingFace, #OpenAI & AWS Bedrock
• Seamless integration with #Python 3.9+, built on #FastAPI & Sentence Transformers

🛠️ Technical Highlights:
• Supports multiple programming languages through API bindings (#JavaScript, #Java, #Rust, #Go)
• Easy deployment: run locally or scale with container orchestration
• #opensource under Apache 2.0 license
• Minimal setup: installation via pip or Docker

🔄 Use Cases:
• Semantic search applications
• Knowledge base construction
• Multi-model workflows
• Speech-to-speech processing
• Document analysis & summarization

Learn more: github.com/neuml/txtai

GitHub · neuml/txtai: 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
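A minimal index-and-search sketch with txtai (the documents are placeholders; the default embedding model downloads on first run):

```python
# Build a small semantic index with txtai and query it two ways.
from txtai import Embeddings

# content=True stores the original text alongside the vectors, enabling SQL queries.
embeddings = Embeddings(content=True)

docs = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
]
embeddings.index(docs)

# Plain semantic query...
print(embeddings.search("public health", 1))

# ...or SQL over the same index, since content storage is enabled.
print(embeddings.search("select id, text, score from txtai where similar('climate change')"))
```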