sigmoid.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A social space for people researching, working with, or just interested in AI!

#Cheminformatics

🎙️ Join Franciszek Job at EuroSciPy as he presents a scalable framework to unify chemical datasets from sources like PubChem, UniChem & COCONUT.

💻 Canonicalize with RDKit
⚡ Scale via Dask
🔁 Deduplicate with InChI keys

Ideal for ML pretraining, benchmarking, and chemical data analysis.
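
The deduplication step can be sketched in a few lines. This is a minimal in-memory illustration, not the framework from the talk: it assumes each record already carries an InChIKey (so the RDKit canonicalization step is skipped), it does not use Dask, and the `merge_by_inchikey` helper and the COCONUT-style identifiers are made up for the example.

```python
from typing import Dict, Iterable, List

# Toy records from two sources. The PubChem CIDs and InChIKeys are real;
# the "coconut" identifiers are placeholders invented for this sketch.
RECORDS: List[dict] = [
    {"source": "pubchem", "id": "2244", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},  # aspirin
    {"source": "pubchem", "id": "2519", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},  # caffeine
    {"source": "coconut", "id": "EXAMPLE-1", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},  # caffeine again
]


def merge_by_inchikey(records: Iterable[dict]) -> Dict[str, dict]:
    """Collapse records that share an InChIKey, keeping every source reference."""
    merged: Dict[str, dict] = {}
    for record in records:
        key = record["inchikey"]
        entry = merged.setdefault(key, {"inchikey": key, "sources": []})
        entry["sources"].append(f'{record["source"]}:{record["id"]}')
    return merged


unified = merge_by_inchikey(RECORDS)
for entry in unified.values():
    print(entry["inchikey"], entry["sources"])
```

Keying on the InChIKey means two entries for the same structure collapse into one row while the cross-references to each source are kept.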

📅 Schedule: lnkd.in/eaAxwUN2
🎟️ Tickets: lnkd.in/end9aYzE

Most cheminformatics code that queries ChEMBL struggles with reproducibility.

chembl-downloader can help:

>>> import chembl_downloader as cd
>>> df = cd.query("""
SELECT chembl_id, pref_name
FROM molecule_dictionary
WHERE pref_name IS NOT NULL
""")
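
To try the same SQL without downloading ChEMBL, here is a small sketch that runs it against an in-memory SQLite stand-in. The two-column `molecule_dictionary` table below is a drastically reduced assumption (the real table has many more columns), and the row with a missing name uses a made-up identifier.

```python
import sqlite3

# The query from the post, unchanged.
SQL = """
SELECT chembl_id, pref_name
FROM molecule_dictionary
WHERE pref_name IS NOT NULL
"""

# Stand-in for the ChEMBL SQLite dump: only the two columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule_dictionary (chembl_id TEXT, pref_name TEXT)")
conn.executemany(
    "INSERT INTO molecule_dictionary VALUES (?, ?)",
    [
        ("CHEMBL25", "ASPIRIN"),
        ("CHEMBL113", "CAFFEINE"),
        ("CHEMBL_EXAMPLE", None),  # made-up row without a preferred name
    ],
)

rows = conn.execute(SQL).fetchall()
conn.close()

for chembl_id, pref_name in rows:
    print(chembl_id, pref_name)
```

The unnamed row is filtered out by the `WHERE` clause, which is all the query does; the reproducibility question is which ChEMBL release the same SQL is run against.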

It's even sneaking its way into @wpwalters and @dr_greg_landrum blogs :)

Code/Docs: github.com/cthoyt/chembl-downl

Preprint: arxiv.org/pdf/2507.17783

New Preprint Alert!

We're excited to share our latest work on #ChemRxiv! MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures) is a web-based platform for extracting chemical information from scientific papers.

📄 Preprint: doi.org/10.26434/chemrxiv-2025

🔗 Try it out: marcus.decimer.ai

ChemRxiv: "MARCUS: Molecular Annotation and Recognition for Curating Unravelled Structures"

The exponential growth of chemical literature necessitates the development of automated tools for extracting and curating molecular information from unstructured scientific publications into open-access chemical databases. Current optical chemical structure recognition (OCSR) and named entity recognition solutions operate in isolation, which limits their scalability for comprehensive literature curation. Here we present MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures), a tool to aid curators in performing literature curation in the field of natural products. This integrated web-based platform combines automated text annotation, multi-engine OCSR, and direct submission capabilities to the COCONUT database. MARCUS employs a fine-tuned GPT-4 model to extract chemical entities and utilises an ensemble approach integrating DECIMER, MolNexTR, and MolScribe for structure recognition. The platform aims to streamline the data extraction workflow from PDF upload to database submission, significantly reducing curation time. MARCUS bridges the gap between unstructured chemical literature and machine-actionable databases, enabling FAIR data principles and facilitating AI-driven chemical discovery. Through open-source code, accessible models, and comprehensive documentation, the web application enhances accessibility and promotes community-driven development. This approach facilitates unrestricted use and encourages the collaborative advancement of automated chemical literature curation tools. We dedicate MARCUS to Dr Marcus Ennis, the longest-serving curator of the ChEBI database, on the occasion of his 75th birthday.

I think I am going to try to recover a bit of #cheminformatics / #chemistry #history, and make the index of the Internet Journal of Chemistry (IJC) FAIR in @wikidata

While the journal no longer exists, many articles are cited quite a few times.

I did some exploration some time ago, and for some articles I found self-archived full-text versions online.

And, TIL that Web of Science has entries for the articles too, which I just added for the 9 articles already in #Wikidata: w.wiki/Eide

Anyone have views or references on the effectiveness of count/sparse #cheminformatics fingerprints compared to binary/dense fingerprints?

What about comparing methods to turn count/sparse fingerprints into binary ones? I know of several approaches, but nothing methodical or published.

I'm trying to understand how I might add sparse/count fingerprints to chemfp.
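
To make the question concrete, here is a small sketch of two ways a sparse count fingerprint might be turned into a binary one, next to the count-based (min/max) Tanimoto. It is an illustration only: the function names, the modulo folding, and the `(1, 2, 4, 8)` thresholds are assumptions for the example, not chemfp's API or a recommended scheme.

```python
from typing import Dict, Sequence

CountFP = Dict[int, int]  # feature id -> occurrence count


def count_tanimoto(a: CountFP, b: CountFP) -> float:
    """Min/max generalization of Tanimoto for count fingerprints."""
    keys = a.keys() | b.keys()
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


def binary_tanimoto(a: int, b: int) -> float:
    """Tanimoto on binary fingerprints stored as Python integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def fold_presence(counts: CountFP, nbits: int = 64) -> int:
    """Set one bit per feature that occurs at all; counts are discarded."""
    bits = 0
    for feature, count in counts.items():
        if count > 0:
            bits |= 1 << (feature % nbits)
    return bits


def fold_thresholds(counts: CountFP, nbits: int = 64,
                    thresholds: Sequence[int] = (1, 2, 4, 8)) -> int:
    """Set one bit per (feature, threshold) pair whose threshold the count reaches."""
    bits = 0
    for feature, count in counts.items():
        for i, threshold in enumerate(thresholds):
            if count >= threshold:
                bits |= 1 << ((feature * len(thresholds) + i) % nbits)
    return bits


fp_a = {3: 1, 10: 4}
fp_b = {3: 2, 10: 1}
print(count_tanimoto(fp_a, fp_b))
print(binary_tanimoto(fold_presence(fp_a), fold_presence(fp_b)))
print(binary_tanimoto(fold_thresholds(fp_a), fold_thresholds(fp_b)))
```

On this toy pair the presence-only fold scores the two fingerprints as identical, while the threshold fold keeps some of the count information, which is the kind of difference a methodical comparison would need to quantify.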

Version 5.0b1 of chemfp - the comprehensive package for binary #cheminformatics fingerprints - is out and ready for curious beta testers.

Here are the highlights. For more info and Linux install info see chemfp.com/chemfp-50b1-availab

- shardsearch to search multiple files

- simhistogram for a histogram of pairwise Tanimoto comparisons

- the FPB file size limit increased from ~250M fingerprints to well over a billion

- new Klekota-Roth fingerprint type for RDKit and OpenEye
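
For readers unfamiliar with the idea behind a pairwise Tanimoto histogram, here is a plain-Python sketch. It is not chemfp's simhistogram implementation or API; the `similarity_histogram` function, the equal-width binning, and the 4-bit toy fingerprints are assumptions made for illustration.

```python
from itertools import combinations
from typing import List


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def similarity_histogram(fingerprints: List[int], num_bins: int = 10) -> List[int]:
    """Count all pairwise Tanimoto scores into equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for a, b in combinations(fingerprints, 2):
        score = tanimoto(a, b)
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


fingerprints = [0b1111, 0b0111, 0b0011, 0b1100]
histogram = similarity_histogram(fingerprints, num_bins=4)
print(histogram)
```

This brute-force loop is quadratic in the number of fingerprints, which is why a dedicated tool matters at scale.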

chemfp.com/