**Jeremy Monat** @jemonat@fosstodon.org · 23h

I'm excited to present "Finding Tautomers" at the first North American #RDKit User Group Meeting in the #Boston #MA area on Friday April 11!

Reminder that I'm #OpenToWork so if you're in the area and hiring for #cheminformatics or #scientificSoftware development, let me know and we can meet to discuss your needs.

**Chris Swain** @macinchem@sciencemastodon.com · 3d

A vortex script for getting PDB ligand structures. https://macinchem.org/2025/04/06/vortex-script-for-getting-pdb-ligand-structures/ #cheminformatics

**Egon Willighagen** @egonw@mastodon.social · 3d

it seems I just released my first PyPI package ever.

pyBacting 0.2.14 with Bacting 1.0.5 is now out: https://pypi.org/project/pybacting/0.2.14/

This gives you access in Python to (some of) the functionality of the Chemistry Development Kit, OPSIN, ChemSpider, PubChem, InChI, Excel files, BridgeDb, and BioJava

#chemistry #openscience #bioinformatics

**Blue Obelisk** @blueobelisk@fosstodon.org · 4d

#openscience #cheminformatics dates back to the late nineties with the emerging collaborative development of JChemPaint, Jmol, and the Chemical Markup Language. Sketch of the history by Chris Steinbeck: "The evolution of open science in cheminformatics: a journey from closed systems to collaborative innovation" https://jcheminf.biomedcentral.com/articles/10.1186/s13321-025-00990-w

BioMed Central: The evolution of open science in cheminformatics: a journey from closed systems to collaborative innovation - Journal of Cheminformatics

Cheminformatics has significantly transformed over the past four decades, evolving from a field dominated by proprietary systems to one increasingly embracing open science principles. In its early years, cheminformatics was characterised by commercial software and restricted data access, limiting collaboration and reproducibility. The advent of open-source software in the late 1990s and early 2000s, including tools such as the Chemistry Development Kit (CDK) and RDKit, played a crucial role in democratising computational chemistry. Open data initiatives, such as PubChem and NMRShiftDB, further enhanced accessibility by providing freely available chemical information, fostering transparency and interoperability and introducing key standards, such as the International Chemical Identifier (InChI), revolutionised data integration and retrieval across diverse platforms. Community-driven efforts, including the Blue Obelisk movement and Open Notebook Science, have promoted open methodologies and collaborative research. More recently, national data infrastructure projects like NFDI4Chem have aimed to standardise research data management in cheminformatics, ensuring the long-term sustainability of open science practices. The increasing adoption of the FAIR (Findable, Accessible, Interoperable, Reusable) principles has further reinforced data sharing and reuse in computational chemistry. Challenges remain, particularly in overcoming resistance to data sharing and ensuring sustainable funding for open projects. However, the trajectory of cheminformatics demonstrates that embracing openness enhances scientific integrity and accelerates discovery and innovation.

**Andrew Dalke** @dalke@toots.nu · Mar 31

I finally finished implementing the newest chemfp feature - similarity histograms, both full comparison and sampled, and both NxN (upper triangular) and NxM (two datasets)

    $ chemfp simhist chembl_34.fpb --bins 10 --num-samples 1000000 --no-metadata
    start  end   count   percent
    0.0    0.1   413799  41.380
    0.1    0.2   561503  56.150
    0.2    0.3    23685   2.369
    0.3    0.4      887   0.089
    0.4    0.5      105   0.011
    0.5    0.6       13   0.001
    0.6    0.7        5   0.001
    0.7    0.8        1   0.000
    0.8    0.9        1   0.000
    0.9    1.0        1   0.000

Next, update the docs.

#cheminformatics
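
For readers who want to see the idea behind a sampled NxN similarity histogram, here is a minimal sketch in plain Python. It is not chemfp's implementation or API: the function names (`tanimoto`, `sampled_simhist`, `random_fingerprint`), the use of Python integers as fingerprints, and the synthetic random data are all assumptions made for illustration.

```python
import random


def tanimoto(a: int, b: int) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def sampled_simhist(fps, num_bins=10, num_samples=10_000, seed=0):
    """Histogram of Tanimoto scores over randomly sampled pairs with i != j."""
    rng = random.Random(seed)
    counts = [0] * num_bins
    for _ in range(num_samples):
        # Two distinct indices; Tanimoto is symmetric, so an unordered pair
        # is equivalent to one cell of the upper triangle.
        i, j = rng.sample(range(len(fps)), 2)
        score = tanimoto(fps[i], fps[j])
        # A score of exactly 1.0 is placed in the last bin.
        counts[min(int(score * num_bins), num_bins - 1)] += 1
    return counts


def random_fingerprint(rng, num_bits=256, density=0.1):
    """Synthetic stand-in fingerprint (an assumption, not real ChEMBL data)."""
    fp = 0
    for bit in range(num_bits):
        if rng.random() < density:
            fp |= 1 << bit
    return fp


data_rng = random.Random(42)
fps = [random_fingerprint(data_rng) for _ in range(500)]
counts = sampled_simhist(fps, num_bins=10, num_samples=10_000)

print("start end count percent")
for b, count in enumerate(counts):
    print(f"{b / 10:.1f} {(b + 1) / 10:.1f} {count} {100 * count / 10_000:.3f}")
```

An NxM version would draw `i` from one dataset and `j` from the other instead of drawing two indices from the same list.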

**Chris Swain** @macinchem@sciencemastodon.com · Mar 31

23 April Cambridge Cheminformatics Network Meeting https://macinchem.org/2025/03/31/cambridge-cheminformatics-meeting-on-23-april-2025/ #cheminformatics

**Blue Obelisk** @blueobelisk@fosstodon.org · Mar 30

CMLXOM 4.11 has been released: https://doi.org/10.5281/zenodo.15108779

"Minor release, reverting to (the newer) xml-apis 1.4.01, updating to Joda time 2.14, and removing unused imports, updating deprecated code, and minimal added JavaDoc."

CMLXOM is a Java library for reading and writing Chemical Markup Language files

Zenodo: CMLXOM. Full Changelog: https://github.com/BlueObelisk/cmlxom/compare/cmlxom-4.10...cmlxom-4.11

#xml #chemistry #cheminformatics

Replied to Egon Willighagen

**Andrew Dalke** @dalke@toots.nu · Mar 26

@egonw @wdscholia

#cheminformatics advertisement - chemfp has a pretty fast Butina clustering implementation, and implements several variations for handling singletons and pruning the number of clusters.

https://chemfp.com/docs/chemfp_butina_command.html

With last year's release you can compute and save the NxN matrix (for a given threshold), and quickly re-cluster using the matrix as a starting point.
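
As a rough illustration of why a saved thresholded matrix makes re-clustering quick, here is a minimal Taylor-Butina sketch in plain Python. It is not chemfp's code or API: the sparse `(i, j, score)` pair list stands in for a saved NxN matrix, the helper names `neighbor_sets` and `butina` are hypothetical, and singletons are simply reported as clusters with no members (one of several possible conventions).

```python
def neighbor_sets(n, pairs, threshold):
    """Build per-item neighbor sets from sparse (i, j, score) pairs at a threshold."""
    neighbors = [set() for _ in range(n)]
    for i, j, score in pairs:
        if score >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def butina(neighbors):
    """Taylor-Butina clustering from precomputed neighbor sets.

    Returns a list of (centroid, members) tuples; members excludes the centroid.
    """
    unassigned = set(range(len(neighbors)))
    # Candidate order is fixed up front: most neighbors first, ties by index.
    order = sorted(unassigned, key=lambda i: (-len(neighbors[i]), i))
    clusters = []
    for centroid in order:
        if centroid not in unassigned:
            continue  # already claimed by an earlier centroid
        unassigned.discard(centroid)
        members = sorted(neighbors[centroid] & unassigned)
        unassigned.difference_update(members)
        clusters.append((centroid, members))
    return clusters


# Toy "saved matrix" computed once at threshold 0.7 (made-up scores).
saved_pairs = [(0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.75), (3, 4, 0.85), (2, 3, 0.72)]

clusters_070 = butina(neighbor_sets(6, saved_pairs, 0.7))
# Re-cluster at a stricter threshold without recomputing any similarities.
clusters_080 = butina(neighbor_sets(6, saved_pairs, 0.8))

print(clusters_070)
print(clusters_080)
```

The saved pairs can only be reused for thresholds at or above the one they were computed with, since anything below that cutoff was never stored.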

Replied in thread

**Andrew Dalke** @dalke@toots.nu · Mar 24 *

@cfeldmann

25,000 samples should easily be enough.

Select 100,000 #cheminformatics fingerprints from ChEMBL at random. Compute the histogram of all 4,999,950,000 pairs in the upper triangle. 100 bins, shown as a bar chart.

Then take samples of size N∈{5K, 10K, 15K, 20K, 50K}, 20 times each, to get a distribution of samplings for each point, shown as a boxplot of percentages, one boxplot per bin.

Here's the result. 50K doesn't seem all that much better than 20K.

In the plot, the peak is at 0.15 Tanimoto similarity, and the x-axis only goes up to 0.25 similarity as the tail gets very small. The boxplot dividers are about the same for the 20K and 50K sample sizes, while they are clearly wider for 15K and smaller.
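
The repeated-sampling experiment described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the plot: the population is a synthetic Beta(4, 20) stand-in for real similarity scores, the helper names are hypothetical, and the max-minus-min range per bin is used as a crude substitute for the boxplots.

```python
import random


def bin_percentages(scores, num_bins):
    """Percentage of scores falling in each of num_bins equal-width bins on [0, 1]."""
    counts = [0] * num_bins
    for s in scores:
        counts[min(int(s * num_bins), num_bins - 1)] += 1
    return [100.0 * c / len(scores) for c in counts]


def spread_per_bin(population, sample_size, num_bins=20, repeats=20, seed=0):
    """Range (max - min) of each bin's percentage over `repeats` random samples."""
    rng = random.Random(seed)
    runs = [
        bin_percentages(rng.choices(population, k=sample_size), num_bins)
        for _ in range(repeats)
    ]
    return [max(col) - min(col) for col in zip(*runs)]


# Stand-in population of similarity scores concentrated at low values
# (an assumption for illustration; these are not ChEMBL similarities).
pop_rng = random.Random(1)
population = [pop_rng.betavariate(4, 20) for _ in range(100_000)]

worst_spread = {}
for size in (500, 5_000, 20_000):
    worst_spread[size] = max(spread_per_bin(population, size))
    print(f"N={size}: widest per-bin range = {worst_spread[size]:.3f} percentage points")
```

Since the sampling error of a proportion shrinks roughly with the square root of the sample size, going from 20K to 50K samples would be expected to narrow the ranges by only about a third, which is consistent with the two looking similar in a plot.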

**Egon Willighagen** @egonw@mastodon.social · Mar 23

I'd like to remind people: if you want fewer of my human rights, future, open science, and other opinionated posts, follow my #cheminformatics and #bioinformatics account at @egonw@social.edu.nl

Of course, I'd love for you to stay here too, because I honestly believe in a better future and am hopeful. But hoping is not enough, hence my posts here.

**The Chemistry Development Kit** @cdk@fosstodon.org · Mar 22

Jonas Schaub: "Last week, I presented my work on algorithmic substructure extraction (scaffolds, functional groups, and aglycones) at the Chemistry Development Kit User Group Meeting (#CDK25UGM) in Maastricht.

You can now find my slides on Zenodo: https://doi.org/10.5281/zenodo.15058008"

Zenodo: Scaffolds, Functional Groups, Aglycones: Algorithmic Substructure Identification with CDK. Dr Jonas Schaub's presentation about "Scaffolds, Functional Groups, Aglycones: Algorithmic Substructure Identification with CDK" held at the Chemistry Development Kit 2025 User Group Meeting in Maastricht on 10th of March 2025.

#cheminformatics #openscience

**Andrew Dalke** @dalke@toots.nu · Mar 21

Just got word that OpenEye's domain name will start migrating from eyesopen.com to cadence.com starting Monday.

The full migration will take a while: "Product, webinar and events emails will continue to be sent from @eyesopen.com until a later date"

#cheminformatics

Replied to Egon Willighagen

**Egon Willighagen** @egonw@social.edu.nl · Mar 21

I just added some 10 more. Here is a helpful SPARQL query to list all functional groups in @wikidata and their CxSMILES, if they have one: https://w.wiki/DWgp

If you want to add a few too, this list should give you a nice set of examples. Actually, the next SPARQL query gives a list that you can copy/paste into CDK Depict: https://w.wiki/DWvR

(The list has a few functional groups with links to the Japanese Wikipedia; help welcome there)

Screenshot of the VHP4Safety instance of the CDK Depict service showing the copy/pasted list of CxSMILES from the SPARQL query in a text field, and below that the first two 2D depictions.

#openscience #cheminformatics #chemistry

**Andrew Dalke** @dalke@toots.nu · Mar 19 *

Tadashi Taffee Tanimoto - a famous name in #cheminformatics - was a Japanese internee at the Poston War Relocation Center, one of "the 10 American concentration camps operated by the War Relocation Authority during World War II".

https://ireizo.org/

https://en.wikipedia.org/wiki/Poston_War_Relocation_Center

I used to live on the site of the former Justice Department detention camp in Santa Fe.

I talked with someone who was a kid there in the 1970s. Kids would sometimes find Japanese artifacts from that era.

Ireizō | National Names Monument Honoring Persons of Japanese Ancestry Incarcerated in the U.S. During WWII

**pwk2024** @pwk2024@beta.argyle.social · Mar 19

Opportunity for #cheminformatics data scientist at Drug Hunter (USA remote) #DataScience #CompChem #ChemJobs #chemsky
https://drughunter.isolvedhire.com/jobs/1444460

drughunter.isolvedhire.com: Cheminformatics Data Scientist. Drug Hunter™ (drughunter.com) is an essential web-based knowledge platform for drug discovery and development innovators turning molecules into medicines. The Scientific team at Drug Hunter™ distills the science and technology behind emerging drugs into concise searchable reports and resources with relevant transferable insights. Drug Hunter™ members include many leading biotechnology and ph...

Continued thread

**Egon Willighagen** @egonw@social.edu.nl · Mar 10

next up was Yajie Ding, an ERC from @universityofgroningen, talking about her #cheminformatics needs for glycan modifications of proteins. We discussed the formats LINUCS, WURCS, and SNFG

#CDK25UGM

**Egon Willighagen** @egonw@social.edu.nl · Mar 8

this has been fun so far!

"One Million IUPAC names" https://doi.org/10.59350/tjkf2-k1608 https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html

"Thus, the idea came up, can we create a set of 1 million unique IUPAC names found in literature? I asked on the ELIXIR Europe slack channel if Europe PMC had such a dataset. I knew they had been adding chemical named-entity recognition (NER) results in their annotation API. [...] Magnus Palmblad also replied and provided Python code to use the Europe PMC API"

Screenshot of output of the Europe PMC Annotations API, showing the complexity of NER, obvious from the prefix and postfix text, showing that several found "indole"s are actually substrings of longer IUPAC names.

The screenshot shows a table, with an example line with the prefix "5-Bromo-1H-", the found entity "indole-3-carboxylic acid", and the postfix "(2)", the last likely a citation. The table shows 8 other rows with similar content.

#cheminformatics #chemistry

**Andrew Dalke** @dalke@toots.nu · Mar 1 *

#statistics question for y'all.

How many samples do I need to generate a histogram which is close to the full comparison? (Full based on >1 trillion possible values.)

For a single value this needs about 10,000 samples. But I have a nagging feeling that the number of bins is also important.

I'm also not sure how to define "close to" a histogram.

FWIW, it's for a pairwise comparison of two #cheminformatics fingerprint datasets, each with >1M elements, and Jaccard/Tanimoto similarity 0 ≤ S ≤ 1.
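
One possible way to frame the question, offered as a sketch rather than an answer from the thread: treat each sampled pair as an independent draw, so each bin count is binomial, and use the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to bound the error of the whole empirical distribution at once. The symbols below (`n` samples, `c_b` the count in bin `b`, `p_b` the true bin fraction, `F` and `F_n` the true and sampled cumulative distributions, tolerance `ε`, failure probability `δ`) are notation introduced here.

```latex
% Single bin: standard error of the estimated fraction
\hat{p}_b = \frac{c_b}{n}, \qquad
\operatorname{SE}(\hat{p}_b) = \sqrt{\frac{p_b (1 - p_b)}{n}} \le \frac{1}{2\sqrt{n}}
\quad\Rightarrow\quad
n = 10{,}000 \text{ gives } \operatorname{SE} \le 0.005 .

% Whole distribution (DKW inequality), independent of the number of bins
\Pr\Bigl( \sup_{s} \bigl| F_n(s) - F(s) \bigr| > \varepsilon \Bigr)
\le 2 e^{-2 n \varepsilon^{2}}
\quad\Rightarrow\quad
n \ge \frac{\ln(2/\delta)}{2 \varepsilon^{2}} .

% Example: \varepsilon = 0.01, \delta = 0.05
n \ge \frac{\ln 40}{2 \times 10^{-4}} \approx 18{,}445 .
```

Each bin fraction is a difference of two values of the cumulative distribution, so under this bound every bin is within 2ε of its true value regardless of how many bins are used. Defining "close to" as the maximum gap between the cumulative distributions is one choice among several; a per-bin relative error criterion would demand far more samples for the sparsely populated tail bins.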

**The Chemistry Development Kit** @cdk@fosstodon.org · Feb 28 *

new paper: "Extending Chemoinformatics Techniques With JMolecular Energy: A Robust CDK-Based Force Field Library" https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.70071 or https://doi.org/10.1002/jcc.70071

"This paper introduces JMolecular Energy (JME), a novel, open-source Java library designed to implement MMFF94 with a robust and extendable API (Application Programming Interface) that allows for access to individual energy components."

#Chemistry #cheminformatics #openscience