#embeddings


Ah, my latest tool, fresh out of the oven, just in time for my summer break! It's called *Vandolie*. It's for high school students, but it may work for you as well. I'll let you discover it for yourself.

👉 jacomyma.github.io/vandolie/en/

It's like a mini CorTexT for teenagers, if you know that tool. But it runs entirely in the browser.

Entirely localized in Danish.

Consider it a beta version. Usable, but feel free to file GitHub issues for feedback & bugs.

Okay, back-of-the-napkin math:
- There are probably 100 million sites and 1.5 billion pages worth indexing in a #search engine
- It takes about 1TB to #index 30 million pages.
- We only care about text on a page.

I define a page as worth indexing if:
- It is not a FAANG site
- It has at least one referrer (no DD Web)
- It's active

So this means we need about 40TB of fast storage to make a good index of the internet. That's not "runs locally" sized, but it is nonprofit sized.

My size assumptions are basically as follows:
- #URL
- #TFIDF information
- Text #Embeddings
- Snippet

We can store an index entry for a page in about 30KB. So, in 40TB we can store a full internet index. That's about $500 in storage.
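A quick sanity check on those numbers (the per-TB price below is just the post's $500 / 40TB ratio, not a quote):

```python
# Back-of-the-napkin index sizing, using only the estimates above.
PAGES = 1_500_000_000        # pages worth indexing
BYTES_PER_PAGE = 30_000      # ~30KB: URL + TF-IDF info + quantized embedding + snippet
COST_PER_TB_USD = 12.5       # assumed from "$500 for 40TB"

total_tb = PAGES * BYTES_PER_PAGE / 1e12
print(f"Index size: ~{total_tb:.0f} TB")                     # ~45 TB, same ballpark as 40TB
print(f"Storage cost: ~${total_tb * COST_PER_TB_USD:,.0f}")  # ~$560 at that rate
```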

Access time becomes a problem. TF-IDF for the whole internet can easily fit in RAM. Even with #quantized embeddings, you can only fit about 2 million per GB of RAM.

Assuming you had enough RAM, it could be fast: TF-IDF to get 100 million candidates, #FAISS to sort those, load snippets dynamically, potentially modify rank by referrers, etc.
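A toy version of that two-stage pipeline (tiny corpus; the model name and candidate count are placeholders, not the real 100-million-candidate setup):

```python
# Stage 1: cheap TF-IDF recall. Stage 2: embedding re-rank with FAISS.
import faiss
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "How to build a search engine index at home",
    "Quantized embeddings keep RAM usage manageable",
    "Recipe for sourdough bread with rye flour",
    "FAISS makes approximate nearest-neighbor search fast",
]
query = "small self-hosted web search index"

# Stage 1: TF-IDF scores give a candidate set.
vec = TfidfVectorizer().fit(docs)
scores = (vec.transform([query]) @ vec.transform(docs).T).toarray()[0]
candidates = np.argsort(scores)[::-1][:3]

# Stage 2: re-rank the candidates by embedding similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder model
emb = model.encode([docs[i] for i in candidates], normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])              # inner product = cosine on normalized vectors
index.add(emb.astype(np.float32))
q = model.encode([query], normalize_embeddings=True).astype(np.float32)
_, order = index.search(q, len(candidates))
for rank, j in enumerate(order[0], start=1):
    print(rank, docs[candidates[j]])
```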

Six 128GB #Framework #desktops, each with 5TB drives (plus one Raspberry Pi to sort the final candidates from the six machines), are enough to replace #Google. That's about $15k. (Six × 128GB ≈ 768GB of RAM, enough for roughly 1.5 billion quantized embeddings at 2 million per GB.)

In two to three years this will be doable on a single machine for around $3k.

By the end of the decade it should run as an app on a powerful desktop.

Three years after that it can run on a #laptop.

Three years after that it can run on a #cellphone.

By #2040 it's a background process on your cellphone.

Should you use OpenAI (or other closed-source) embeddings?

1. Try the lightest embedding model first
2. If it doesn’t work, try a beefier model and do a blind comparison
3. If you are already using a relatively large model, only then try a blind test against a proprietary model (a rough sketch of such a comparison is below). If you really find that the closed-source model is better for your application, then go for it.

Paraphrased from iamnotarobot.substack.com/p/sh

I Am Not a Robot · Should you use OpenAI's embeddings? Probably not, and here's why. By Diego Basch
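A rough sketch of what that blind comparison could look like (the labeled pairs are made up, and the two model names are just stand-ins for a lighter vs. heavier open model):

```python
# Blind A/B test of two open embedding models on a tiny labeled retrieval set.
import numpy as np
from sentence_transformers import SentenceTransformer

# (query, relevant_doc) pairs — replace with your own evaluation data.
pairs = [
    ("reset my password", "How to recover account access"),
    ("refund policy", "Returns are accepted within 30 days"),
]
distractors = ["Our office hours are 9 to 5", "The API rate limit is 100 requests/min"]

def hit_rate(model_name: str) -> float:
    model = SentenceTransformer(model_name)
    docs = [d for _, d in pairs] + distractors
    d_emb = model.encode(docs, normalize_embeddings=True)
    hits = 0
    for qi, (q, _) in enumerate(pairs):
        q_emb = model.encode([q], normalize_embeddings=True)
        hits += int(np.argmax(q_emb @ d_emb.T)) == qi   # relevant doc sits at index qi
    return hits / len(pairs)

# Opaque labels keep the comparison blind until you unmask the winner.
print({"A": hit_rate("all-MiniLM-L6-v2"), "B": hit_rate("all-mpnet-base-v2")})
```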

Poster from our colleague @epoz from the UGent-IMEC Linked Data & Solid course: "Exploding Mittens - Getting to grips with huge SKOS datasets", on semantic-embedding-enhanced SPARQL queries for ICONCLASS data.
Congrats on the 'best poster' award ;-)

poster: zenodo.org/records/14887544
iconclass on GitHub: github.com/iconclass

@nfdi4culture @fiz_karlsruhe


Leveraging embedding models for deep understanding was another breakthrough moment. OpenAI's text-embedding-3 allowed us to grasp the nuances of user queries, going beyond mere keyword matches. This capability was instrumental in returning more relevant results and understanding user intent better. #OpenAI #Embeddings
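For context, a minimal sketch of that kind of setup with the OpenAI Python client (the documents, query, and small model variant here are made up for illustration, not the actual system):

```python
# Match a user query to documents by embedding similarity instead of keywords.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

docs = [
    "Troubleshooting steps when the printer will not connect to Wi-Fi",
    "Monthly pricing tiers and what each plan includes",
]
query = "my printer keeps dropping off the wireless network"

resp = client.embeddings.create(model="text-embedding-3-small", input=docs + [query])
vectors = np.array([item.embedding for item in resp.data])
doc_vecs, q_vec = vectors[:-1], vectors[-1]

# Cosine similarity: the right document wins with zero keyword overlap.
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(docs[int(np.argmax(sims))])
```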

Is there a consensus process or good paper on state of the art on using #embeddings & #LLM to do the kinds of things that were being done with topic models? I imagine for tasks with pre-defined classifications, prompts are sufficient, but any recommendations for identifying latent classes? After reading the paper below I think I'll want to use local models. #machinelearning drive.google.com/file/d/1wNDIk

Google Docs · llreplication.pdf
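One common embeddings-based stand-in for topic modeling (not necessarily what the linked paper does) is to embed documents with a local model and cluster the vectors; a minimal sketch:

```python
# Topic-model-style latent classes via local embeddings + k-means clustering.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "The senate passed the budget amendment late on Tuesday",
    "New transformer models cut inference costs in half",
    "Quarterly earnings beat analyst expectations",
    "Open-weight LLMs are closing the gap with proprietary ones",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # runs locally, no API calls
emb = model.encode(texts, normalize_embeddings=True)

k = 2                                             # number of latent classes to discover
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
for label, text in zip(labels, texts):
    print(label, text)
```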

Major Update for Vector Search in SQLite 🚀

🔍 #SQLite-vec v0.1.6 introduces powerful new features:
• Added support for #metadata columns enabling WHERE clause filtering in #KNN queries
• Implemented partition keys for 3x faster selective queries
• New auxiliary columns for efficient unindexed data storage
• Compatible with #embeddings from any provider

🎯 Key improvements:
• Store non-vector data like user_id and timestamps
• Filter searches using metadata constraints
• Optimize query performance through smart partitioning
• Enhanced data organization with auxiliary columns

⚡ Performance focus:
• Partition keys reduce search space significantly
• Metadata filtering streamlines result selection
• Auxiliary columns minimize JOIN operations
• Binary quantization options for speed optimization

🔄 #Database integration:
• Supports boolean, integer, float & text values
• Works with standard SQL queries
• Enables complex search combinations
• Maintains data consistency

Source: alexgarcia.xyz/blog/2024/sqlit

alexgarcia.xyz · sqlite-vec now supports metadata columns and filtering
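A minimal sketch of those features from Python, assuming the vec0 column syntax and filtering described in the post (table and column names here are invented):

```python
# sqlite-vec v0.1.6: partition key, metadata filtering, and an auxiliary column.
import sqlite3
import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# user_id is a partition key, created_at a filterable metadata column,
# and +body an auxiliary (stored but unindexed) column.
db.execute("""
    create virtual table notes using vec0(
        user_id integer partition key,
        embedding float[4],
        created_at integer,
        +body text
    )
""")

rows = [
    (1, [0.1, 0.2, 0.3, 0.4], 1700000000, "first note"),
    (1, [0.4, 0.3, 0.2, 0.1], 1710000000, "second note"),
    (2, [0.9, 0.8, 0.7, 0.6], 1720000000, "someone else's note"),
]
for user_id, vector, ts, body in rows:
    db.execute(
        "insert into notes(user_id, embedding, created_at, body) values (?, ?, ?, ?)",
        (user_id, sqlite_vec.serialize_float32(vector), ts, body),
    )

# KNN search restricted to one partition and filtered on metadata.
query = sqlite_vec.serialize_float32([0.1, 0.2, 0.3, 0.4])
hits = db.execute(
    """
    select body, distance from notes
    where embedding match ? and k = 2
      and user_id = 1 and created_at > 1690000000
    """,
    (query,),
).fetchall()
print(hits)
```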

🐸 Screaming Frog introduces APIs for interfacing with #OpenAI, #Google and #Ollama models
✨ It works on the HTML saved during crawling, whereas the previous version used custom JavaScript snippets executed while rendering the pages.
👉 You can generate #embeddings and content with custom prompts over selectable contexts (via predefined and custom extractors).

🔍 #txtai - All-in-one #embeddings database combining vector indexes, graph networks & relational databases

💡 Key Features:
• Vector search with SQL support, object storage, topic modeling & multimodal indexing for text, documents, audio, images & video
• Built-in #RAG capabilities with citation support & autonomous #AI agents for complex problem-solving
• #LLM orchestration supporting multiple frameworks including #HuggingFace, #OpenAI & AWS Bedrock
• Seamless integration with #Python 3.9+, built on #FastAPI & Sentence Transformers

🛠️ Technical Highlights:
• Supports multiple programming languages through API bindings (#JavaScript, #Java, #Rust, #Go)
• Easy deployment: run locally or scale with container orchestration
• #opensource under Apache 2.0 license
• Minimal setup: installation via pip or Docker

🔄 Use Cases:
• Semantic search applications
• Knowledge base construction
• Multi-model workflows
• Speech-to-speech processing
• Document analysis & summarization

Learn more: github.com/neuml/txtai

GitHub · neuml/txtai: 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
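A minimal index-and-search sketch with txtai (the documents are placeholders; the default embedding model downloads on first run):

```python
# Build a small semantic index with txtai and query it two ways.
from txtai import Embeddings

# content=True stores the original text alongside the vectors, enabling SQL queries.
embeddings = Embeddings(content=True)

docs = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
]
embeddings.index(docs)

# Plain semantic query...
print(embeddings.search("public health", 1))

# ...or SQL over the same index, since content storage is enabled.
print(embeddings.search("select id, text, score from txtai where similar('climate change')"))
```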