#Cheminformatics


#openscience #cheminformatics dates back to the late nineties with the emerging collaborative development of JChemPaint, Jmol, and the Chemical Markup Language. Sketch of the history by Chris Steinbeck: "The evolution of open science in cheminformatics: a journey from closed systems to collaborative innovation" jcheminf.biomedcentral.com/art

BioMed Central: The evolution of open science in cheminformatics: a journey from closed systems to collaborative innovation - Journal of Cheminformatics
Cheminformatics has significantly transformed over the past four decades, evolving from a field dominated by proprietary systems to one increasingly embracing open science principles. In its early years, cheminformatics was characterised by commercial software and restricted data access, limiting collaboration and reproducibility. The advent of open-source software in the late 1990s and early 2000s, including tools such as the Chemistry Development Kit (CDK) and RDKit, played a crucial role in democratising computational chemistry. Open data initiatives, such as PubChem and NMRShiftDB, further enhanced accessibility by providing freely available chemical information, fostering transparency and interoperability and introducing key standards, such as the International Chemical Identifier (InChI), revolutionised data integration and retrieval across diverse platforms. Community-driven efforts, including the Blue Obelisk movement and Open Notebook Science, have promoted open methodologies and collaborative research. More recently, national data infrastructure projects like NFDI4Chem have aimed to standardise research data management in cheminformatics, ensuring the long-term sustainability of open science practices. The increasing adoption of the FAIR (Findable, Accessible, Interoperable, Reusable) principles has further reinforced data sharing and reuse in computational chemistry. Challenges remain, particularly in overcoming resistance to data sharing and ensuring sustainable funding for open projects. However, the trajectory of cheminformatics demonstrates that embracing openness enhances scientific integrity and accelerates discovery and innovation.

I finally finished implementing the newest chemfp feature: similarity histograms, in both full-comparison and sampled modes, for both NxN (upper triangular) and NxM (two datasets) comparisons.

$ chemfp simhist chembl_34.fpb --bins 10 --num-samples 1000000 --no-metadata
start end count percent
0.0 0.1 413799 41.380
0.1 0.2 561503 56.150
0.2 0.3 23685 2.369
0.3 0.4 887 0.089
0.4 0.5 105 0.011
0.5 0.6 13 0.001
0.6 0.7 5 0.001
0.7 0.8 1 0.000
0.8 0.9 1 0.000
0.9 1.0 1 0.000

Next, update the docs.
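
For readers who want to see what a sampled NxN similarity histogram involves, here is a minimal pure-Python sketch. It is not chemfp's implementation or API: the function names (`tanimoto_bin`, `sampled_simhist`, `format_simhist`) are hypothetical, the fingerprints are synthetic random bit sets stored as Python ints, and the convention that two empty fingerprints score 0.0 is an assumption. It only mirrors the column layout of the output above.

```python
import random


def tanimoto_bin(a: int, b: int, num_bins: int) -> int:
    """Histogram bin index for the Tanimoto similarity of two int-encoded fingerprints."""
    intersection = bin(a & b).count("1")
    union = bin(a | b).count("1")
    if union == 0:
        return 0  # assumed convention: two empty fingerprints score 0.0
    # Integer arithmetic avoids float rounding at bin edges; a score of 1.0
    # is placed in the last bin.
    return min(intersection * num_bins // union, num_bins - 1)


def sampled_simhist(fps, num_bins=10, num_samples=100_000, seed=0):
    """Histogram of Tanimoto scores for randomly sampled pairs (i != j)."""
    rng = random.Random(seed)
    counts = [0] * num_bins
    indices = range(len(fps))
    for _ in range(num_samples):
        i, j = rng.sample(indices, 2)  # two distinct indices, so never the diagonal
        counts[tanimoto_bin(fps[i], fps[j], num_bins)] += 1
    return counts


def format_simhist(counts):
    """Render counts as 'start end count percent' rows (one decimal suits 10 bins)."""
    total = sum(counts)
    num_bins = len(counts)
    lines = ["start end count percent"]
    for k, count in enumerate(counts):
        start, end = k / num_bins, (k + 1) / num_bins
        lines.append(f"{start:.1f} {end:.1f} {count} {100.0 * count / total:.3f}")
    return "\n".join(lines)


# Synthetic data: 2000 fingerprints, each with 32 of 256 bits set at random.
rng = random.Random(42)
fps = [sum(1 << bit for bit in rng.sample(range(256), 32)) for _ in range(2000)]
print(format_simhist(sampled_simhist(fps, num_samples=20_000)))
```

Random fingerprints like these cluster at low similarity, so most of the counts land in the first few bins. Real fingerprint sets will have a different shape.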

CMLXOM 4.11 has been released: doi.org/10.5281/zenodo.1510877

"Minor release, reverting to (the newer) xml-apis 1.4.01, updating to Joda time 2.14, and removing unused imports, updating deprecated code, and minimal added JavaDoc."

CMLXOM is a Java library for reading and writing Chemical Markup Language files

Zenodo: CMLXOM. Minor release, reverting to (the newer) xml-apis 1.4.01, updating to Joda time 2.14, and removing unused imports, updating deprecated code, and minimal added JavaDoc. Full Changelog: https://github.com/BlueObelisk/cmlxom/compare/cmlxom-4.10...cmlxom-4.11
Replied to Egon Willighagen

@egonw @wdscholia

#cheminformatics advertisement - chemfp has a pretty fast Butina clustering implementation, and implements several variations for handling singletons and pruning the number of clusters.

chemfp.com/docs/chemfp_butina_

With last year's release you can compute and save the NxN matrix (for a given threshold), and quickly re-cluster using the matrix as a starting point.
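
As a rough illustration of the general technique, here is a minimal textbook-style sketch of Butina clustering in pure Python. It is not chemfp's implementation and does not use its API. The function names are hypothetical, the six fingerprints are toy values, and the singleton handling shown (each singleton becomes its own cluster) is just one of the possible variations. The neighbor lists are built once and passed to the clustering step separately, which loosely mirrors computing the thresholded NxN matrix first and clustering from it afterwards.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def neighbor_lists(fps, threshold):
    """Thresholded NxN 'matrix' as neighbor sets: j is in nbrs[i] if sim(i, j) >= threshold."""
    n = len(fps)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; similarity is symmetric
            if tanimoto(fps[i], fps[j]) >= threshold:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs


def butina(nbrs):
    """Greedy clustering: visit items by decreasing neighbor count.

    Each unassigned item becomes a cluster center and claims its still
    unassigned neighbors. Items with no remaining neighbors end up as
    singleton clusters (one possible singleton policy).
    """
    order = sorted(range(len(nbrs)), key=lambda i: (-len(nbrs[i]), i))
    assigned = set()
    clusters = []
    for center in order:
        if center in assigned:
            continue
        members = [center] + sorted(j for j in nbrs[center] if j not in assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints (hypothetical data).
fps = [0b1111, 0b1110, 0b0111, 0b110000, 0b100000, 0b1000000]
nbrs = neighbor_lists(fps, threshold=0.5)  # the expensive step, done once
print(butina(nbrs))  # → [[0, 1, 2], [3, 4], [5]]
```

The quadratic neighbor search dominates the cost, so keeping `nbrs` around makes it cheap to try different clustering variations on the same data.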

chemfp.com: chemfp butina - chemfp documentation 4.2 documentation
Replied in thread

@cfeldmann

25,000 samples should easily be enough.

Select 100,000 #cheminformatics fingerprints from ChEMBL at random. Compute the histogram of all 4,999,950,000 pairs in the upper triangle. 100 bins, shown as a bar chart.

Then, for sample sizes N ∈ {5K, 10K, 15K, 20K, 50K}, sample 20 times each to get a distribution of samplings for each point, shown as a boxplot of percentages, one boxplot per bin.

Here's the result. 50K doesn't seem all that much better than 20K.
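
The shape of that experiment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the plot: the "population" is 100,000 synthetic values drawn from a beta distribution standing in for similarity scores, it uses 10 bins instead of 100, sampling is with replacement, and the function names are hypothetical. Instead of boxplots it prints, for each sample size, the median and maximum over the 20 repeats of the largest per-bin deviation from the full histogram.

```python
import random
import statistics


def histogram_percent(values, num_bins):
    """Percentage of values in each of num_bins equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for v in values:
        counts[min(int(v * num_bins), num_bins - 1)] += 1  # 1.0 goes in the last bin
    return [100.0 * c / len(values) for c in counts]


def sampling_spread(population, sample_sizes, num_bins=10, repeats=20, seed=0):
    """For each sample size, the largest per-bin deviation (percentage points) in each repeat."""
    rng = random.Random(seed)
    full = histogram_percent(population, num_bins)
    spread = {}
    for n in sample_sizes:
        worst = []
        for _ in range(repeats):
            sample = rng.choices(population, k=n)  # sampling with replacement
            sampled = histogram_percent(sample, num_bins)
            worst.append(max(abs(s - f) for s, f in zip(sampled, full)))
        spread[n] = worst
    return full, spread


# Synthetic stand-in for similarity scores (an assumption, not ChEMBL data).
rng = random.Random(1)
population = [rng.betavariate(2, 12) for _ in range(100_000)]
full, spread = sampling_spread(population, [5_000, 10_000, 15_000, 20_000, 50_000])
for n, worst in spread.items():
    print(f"N={n:>6}  median worst-bin error={statistics.median(worst):.3f}  "
          f"max={max(worst):.3f} (percentage points)")
```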

Just got word that OpenEye's domain name will start migrating from eyesopen.com to cadence.com starting Monday.

The full migration will take a while: "Product, webinar and events emails will continue to be sent from @eyesopen.com until a later date"

Replied to Egon Willigh☮gen 🟥

I just added some 10 more. Here is a helpful SPARQL query to list all functional groups in @wikidata and their CxSMILES, if they have one: w.wiki/DWgp

If you want to add a few too, this list should give you a nice set of examples. Actually, the next SPARQL query gives a list that you can copy/paste into CDK Depict: w.wiki/DWvR

(The list has a few functional groups with links to the Japanese Wikipedia; help welcome there)

Tadashi Taffee Tanimoto - a famous name in #cheminformatics - was a Japanese internee at the Poston War Relocation Center, one of "the 10 American concentration camps operated by the War Relocation Authority during World War II".

ireizo.org/

en.wikipedia.org/wiki/Poston_W

I used to live on the site of the former Justice Department detention camp in Santa Fe.

I talked with someone who was a kid there in the 1970s. Kids would sometimes find Japanese artifacts from that era.

Ireizō: Ireizō | National Names Monument Honoring Persons of Japanese Ancestry Incarcerated in the U.S. During WWII

this has been fun so far!

"One Million IUPAC names" doi.org/10.59350/tjkf2-k1608 chem-bla-ics.linkedchemistry.i

"Thus, the idea came up, can we create a set of 1 million unique IUPAC names found in literature? I asked on the ELIXIR Europe slack channel if Europe PMC had such a dataset. I knew they had been adding chemical named-entity recognition (NER) results in their annotation API. [...] Magnus Palmblad also replied and provided Python code to use the Europe PMC API"

#statistics question for y'all.

How many samples do I need to generate a histogram which is close to the full comparison? (Full based on >1 trillion possible values.)

For a single value this needs about 10,000 samples. But I have a nagging feeling that the number of bins is also important.

I'm also not sure how to define "close to" a histogram.

FWIW, it's for a pairwise comparison of two #cheminformatics fingerprint datasets, each with >1M elements, and Jaccard/Tanimoto similarity 0 ≤ S ≤ 1.
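
One possible way to frame the question, offered as a sketch rather than a definitive answer: treat each sampled similarity as an independent draw and define "close" as a bound on the cumulative distribution. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality says that with n samples, the empirical CDF is within ε of the true CDF everywhere with probability at least 1 − 2·exp(−2nε²). A bin's fraction is a difference of two CDF values, so every bin is then within 2ε no matter how many bins there are. The function names below are hypothetical, and reading "about 10,000 samples" as roughly a ±1 percentage point margin at 95% confidence is an assumption.

```python
import math


def bin_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for one bin fraction p (normal approximation)."""
    return z * math.sqrt(p * (1.0 - p) / n)


def dkw_samples(eps: float, alpha: float = 0.05) -> int:
    """Smallest n with 2*exp(-2*n*eps**2) <= alpha.

    With that many independent samples, the empirical CDF is within eps of the
    true CDF everywhere with probability >= 1 - alpha, so every bin fraction
    is within 2*eps regardless of the number of bins.
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * eps * eps))


# Single value: worst case p = 0.5 with 10,000 samples is about +/- 1 percentage point.
print(f"{bin_margin(0.5, 10_000):.4f}")

# Whole histogram: samples needed for a CDF error of at most 0.01 at 95% confidence.
print(dkw_samples(0.01))
```

Under this definition the number of bins does not change the absolute error bound. What it does change is the relative error: a sparsely populated bin (say 0.001% of pairs) can be within 2ε in absolute terms and still be off by a large factor.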

new paper: "Extending Chemoinformatics Techniques With JMolecular Energy: A Robust CDK-Based Force Field Library" onlinelibrary.wiley.com/doi/fu or doi.org/10.1002/jcc.70071

"This paper introduces JMolecular Energy (JME), a novel, open-source Java library designed to implement MMFF94 with a robust and extendable API (Application Programming Interface) that allows for access to individual energy components."