Text Classification With Document Embeddings

1

Flair | Repository | 55/100

via “text classification with document-level embeddings and feed-forward networks”

PyTorch NLP framework with contextual embeddings.

Unique: Integrates with Flair's embedding system, so any supported embedding type can serve as input; includes native multi-label classification, with label imbalance handled automatically through weighted sampling; supports both single-task and multi-task learning, where one classifier learns several classification tasks over shared embedding layers

vs others: Faster to train and deploy than transformer-based classifiers (BERT) with comparable accuracy on small-to-medium datasets; more flexible than scikit-learn classifiers by supporting deep learning and custom architectures; tighter integration with NLP preprocessing (tokenization, embedding) than generic PyTorch approaches
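The weighted sampling mentioned for label imbalance can be illustrated with a minimal sketch. This is not Flair's implementation: the function name `inverse_frequency_weights` and the toy labels are hypothetical, and the sketch only assumes the common approach of weighting each example by the inverse of its label's frequency.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per example: 1 / (count of its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy, imbalanced label list (illustrative values only).
labels = ["spam", "ham", "ham", "ham"]
weights = inverse_frequency_weights(labels)

# Draw a resampled training stream; each class now has equal total weight,
# so rare and frequent labels are drawn about equally often.
rng = random.Random(0)
resampled = rng.choices(labels, weights=weights, k=1000)
```

Because every class ends up with the same total weight, the rare label is no longer drowned out during training, at the cost of seeing its few examples repeatedly.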

2

all-MiniLM-L12-v2 | Model | 54/100

via “semantic-clustering-and-document-organization”

Sentence-similarity model. 2,825,304 downloads.

Unique: Provides high-quality semantic representations suitable for clustering without task-specific fine-tuning; 384-dimensional space balances expressiveness with computational tractability for clustering algorithms; works with standard scikit-learn clustering implementations without custom distance metrics

vs others: More semantically meaningful than TF-IDF clustering; simpler than topic modeling (LDA), with less hyperparameter complexity; supports both centroid-based clustering (K-means) and density-based clustering (HDBSCAN) with a single embedding model
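As a rough illustration of clustering over sentence embeddings, the sketch below runs K-means (Lloyd's algorithm) on made-up 2-dimensional vectors. The vectors stand in for real 384-dimensional model output, the function `kmeans` is a hypothetical helper written for this example, and in practice a library implementation such as scikit-learn's would normally be used instead.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in initial_centroids]  # copy; do not mutate input
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Toy "embeddings": two documents near (1, 0), two near (0, 1).
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
cluster_ids = kmeans(embeddings, [embeddings[0], embeddings[2]])
```

The first two toy vectors end up in one cluster and the last two in the other; with real embeddings the same loop applies unchanged, only the dimensionality grows.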

3

e5-base-v2 | Model | 49/100

via “semantic clustering with embedding-based grouping”

Sentence-similarity model. 1,778,169 downloads.

Unique: Embeddings are optimized for clustering through contrastive learning, where semantically similar texts are pulled together in embedding space. The 768-dimensional space provides sufficient capacity for fine-grained clustering without the curse of dimensionality affecting algorithms like K-means.

vs others: Semantic clustering using embeddings is more robust to vocabulary variation and synonymy than keyword-based clustering, and requires no manual feature engineering unlike TF-IDF or BM25 clustering.

4

donut-base | Model | 41/100

via “visual-encoder-to-embedding-conversion”

Image-to-text model. 150,036 downloads.

Unique: Implements a document-specific visual encoder that preserves spatial layout information through patch-based embeddings, enabling the downstream decoder to maintain awareness of document structure and text positioning rather than treating the image as a generic visual input

vs others: More layout-aware than generic vision encoders (CLIP, ViT) because it's trained specifically on document images, and more efficient than pixel-level processing because it operates on patch embeddings rather than raw pixels

5

gensim | Repository | 29/100

via “doc2vec document embeddings (paragraph vector)”

Python framework for fast Vector Space Modelling

Unique: Implements Paragraph Vector (Doc2Vec) with both DM and DBOW variants, extending Word2Vec architecture with document ID tokens to learn document-level semantic representations through the same neural training objective

vs others: Simpler and faster to train than transformer-based document encoders; however, it produces non-contextual embeddings and needs an iterative inference step (additional training passes) to embed each unseen document, whereas a transformer encoder such as BERT embeds new text in a single forward pass

6

flair | Repository | 25/100

via “text-classification-with-document-embeddings”

A very simple framework for state-of-the-art NLP

Unique: Flair's text classification decouples embedding computation from classification, allowing users to swap embedding sources (Flair contextual, BERT, GloVe, etc.) without retraining the classifier. This modular design enables rapid experimentation with different embedding strategies on the same classification task.

vs others: Flair's text classification is more flexible than spaCy's text categorizer (supports arbitrary embeddings) and simpler than HuggingFace transformers (no tokenizer configuration needed), while maintaining competitive accuracy through strong pre-trained embeddings.

7

colbert-ai | Repository | 25/100

via “token-level document encoding with contextual bert embeddings”

Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Unique: Uses token-level matrix representations instead of pooled single vectors, enabling MaxSim late-interaction matching in which each query token is compared independently against all document tokens. This preserves fine-grained semantic interactions that are lost in single-vector approaches like DPR

vs others: Achieves higher precision than single-vector dense retrievers (DPR, Sentence-BERT) while maintaining sub-100ms latency through efficient MaxSim computation; sparse BM25, by contrast, trades semantic understanding for speed
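The MaxSim late-interaction score described for this entry can be sketched in a few lines. This is a simplified illustration, not the colbert-ai code: the helper names `dot` and `maxsim_score` are hypothetical, the token vectors are made-up 2-dimensional values, and the sketch assumes the vectors are already normalized so that a dot product acts as a cosine similarity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (unit-length, illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match per query token
doc_b = [[0.6, 0.8]]                          # one token, partial match only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here `doc_a` scores higher than `doc_b` because each query token finds its own best match, which is the fine-grained signal a single pooled vector would average away.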

8

Nomic Embed Text (137M) | Model | 24/100

via “document similarity and clustering analysis”

Nomic's embedding model for semantic search and similarity.

Unique: Enables local clustering and similarity analysis without external services by providing embeddings compatible with standard Python ML libraries (scikit-learn, scipy). The model's 137M-parameter size makes embedding large collections feasible on CPU-only systems.

vs others: More flexible than cloud-based clustering services (no API rate limits, full control over algorithms) while requiring less infrastructure than building custom similarity systems; compatible with standard ML tooling without proprietary extensions.

9

wink-embeddings-sg-100d | Model | 21/100

via “vector-based document or sentence embedding aggregation”

100-dimensional English word embeddings for wink-nlp

Unique: Integrates with wink-nlp's tokenization pipeline to ensure consistent preprocessing of multi-word sequences, and provides simple aggregation strategies suitable for lightweight JavaScript environments without requiring sentence-level transformer models

vs others: Significantly faster and lighter than sentence-level embedding models (Sentence-BERT, Universal Sentence Encoder) for document-level tasks, though with lower semantic quality — suitable for resource-constrained environments or rapid prototyping
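One simple aggregation strategy of the kind described here is mean pooling: average the word vectors of the tokens that have an embedding. The sketch below is written in Python for consistency with the other examples on this page rather than in JavaScript, does not use the wink-nlp API, and relies on a hypothetical helper `mean_pool` with made-up 3-dimensional vectors in place of the real 100-dimensional ones.

```python
def mean_pool(tokens, word_vectors):
    """Average the vectors of all tokens found in the vocabulary.

    Returns None when no token has an embedding.
    """
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Toy vocabulary (illustrative values only); "the" is out of vocabulary.
toy_vectors = {"cat": [1.0, 0.0, 0.0], "sat": [0.0, 1.0, 0.0]}
doc_vector = mean_pool(["the", "cat", "sat"], toy_vectors)
```

Mean pooling ignores word order and context, which is the source of the lower semantic quality noted above, but it needs no model inference beyond a table lookup.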

10

OpenAI Cookbook | Repository | 21/100

via “classification, clustering, and semantic search patterns”

Examples and guides for using the OpenAI API.

11

quivr | Product

via “semantic document embedding”
