Karsten Schmidt<p>Recently I've combined various functions which I've been using in other projects (e.g. my personal PKM toolchain) and published them as a new library <a href="https://thi.ng/text-analysis" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">https://</span><span class="">thi.ng/text-analysis</span><span class="invisible"></span></a> for better re-use:</p><p>- customizable, composable & extensible tokenization (transducer based)<br>- ngram generation<br>- Porter-stemming & stopword removal<br>- vocabulary (bi-directional index) creation<br>- dense & sparse multi-hot vector encoding/decoding<br>- histograms (incl. sorted versions)<br>- tf-idf (term frequency & inverse document frequency), multiple strategies<br>- k-means clustering (with k-means++ initialization & customizable distance metrics)<br>- similarity/distance functions (dense & sparse versions)<br>- central terms extraction</p><p>The attached code example (also in the project readme) uses this package to create a clustering of all ~210 <a href="https://mastodon.thi.ng/tags/ThingUmbrella" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ThingUmbrella</span></a> packages, based on their assigned tags/keywords...</p><p>The library is not intended to be a full-blown NLP solution, but I keep running into these functions/concepts quite often, and maybe you'll find them useful too...</p><p><a href="https://mastodon.thi.ng/tags/Text" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Text</span></a> <a href="https://mastodon.thi.ng/tags/Analysis" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Analysis</span></a> <a href="https://mastodon.thi.ng/tags/Cluster" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Cluster</span></a> <a href="https://mastodon.thi.ng/tags/KMeans" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>KMeans</span></a> <a 
href="https://mastodon.thi.ng/tags/TFIDF" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TFIDF</span></a> <a href="https://mastodon.thi.ng/tags/Ngram" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Ngram</span></a> <a href="https://mastodon.thi.ng/tags/Vector" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Vector</span></a> <a href="https://mastodon.thi.ng/tags/TypeScript" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TypeScript</span></a> <a href="https://mastodon.thi.ng/tags/JavaScript" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>JavaScript</span></a></p>
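<p>For readers new to a few of the concepts listed above (vocabulary index, dense multi-hot encoding, tf-idf and a similarity function), here is a minimal plain-TypeScript sketch. It does not use or mirror the thi.ng/text-analysis API: every name in it (<code>buildVocab</code>, <code>encodeDense</code>, <code>tfidf</code>, <code>cosine</code>) is hypothetical, the tag lists are made up, and the tf-idf variant shown (relative term frequency times <code>ln(N / df)</code>) is just one common strategy picked for illustration.</p>

```typescript
// Plain TypeScript sketch. None of these names are the thi.ng/text-analysis API.

type Vocab = Map<string, number>;

/** Assigns each distinct term a numeric ID, in order of first appearance. */
const buildVocab = (docs: string[][]): Vocab => {
  const vocab: Vocab = new Map();
  for (const doc of docs) {
    for (const term of doc) {
      if (!vocab.has(term)) vocab.set(term, vocab.size);
    }
  }
  return vocab;
};

/** Dense multi-hot vector: 1 at the index of every known term in the doc. */
const encodeDense = (vocab: Vocab, doc: string[]): number[] => {
  const vec = new Array<number>(vocab.size).fill(0);
  for (const term of doc) {
    const id = vocab.get(term);
    if (id !== undefined) vec[id] = 1;
  }
  return vec;
};

/** One tf-idf variant (an assumption): tf = count / docLength, idf = ln(N / df). */
const tfidf = (vocab: Vocab, docs: string[][]): number[][] => {
  // document frequency: number of docs containing each term at least once
  const df = new Array<number>(vocab.size).fill(0);
  for (const doc of docs) {
    for (const term of Array.from(new Set(doc))) {
      const id = vocab.get(term);
      if (id !== undefined) df[id]++;
    }
  }
  return docs.map((doc) => {
    const vec = new Array<number>(vocab.size).fill(0);
    if (!doc.length) return vec;
    for (const term of doc) {
      const id = vocab.get(term);
      if (id !== undefined) vec[id] += 1 / doc.length;
    }
    return vec.map((tf, id) =>
      tf > 0 ? tf * Math.log(docs.length / df[id]) : 0
    );
  });
};

/** Cosine similarity of two equal-length dense vectors (0 if either is all zeros). */
const cosine = (a: number[], b: number[]): number => {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return magA > 0 && magB > 0 ? dot / Math.sqrt(magA * magB) : 0;
};

// Made-up tag lists standing in for per-package keywords
const docs: string[][] = [
  ["vector", "math", "geometry"],
  ["matrix", "math"],
  ["text", "tokenizer", "ngram"],
];

const vocab = buildVocab(docs);
const multiHot = encodeDense(vocab, docs[1]);
const weighted = tfidf(vocab, docs);

console.log(multiHot);
console.log(cosine(weighted[0], weighted[1])); // share "math", so non-zero
console.log(cosine(weighted[0], weighted[2])); // no shared tags
```

<p>Vectors like <code>weighted</code> are the kind of input a k-means step would then group by distance; the library's own functions and signatures are documented in its readme.</p>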