Multilingual Code To Code Translation Dataset Construction

1

xCodeEval | Benchmark | 65/100

via “code translation task evaluation with language-pair validation”

Multilingual code evaluation across 17 languages.

Unique: Validates code translation by executing both source and target code against identical unit tests and comparing outputs, ensuring functional equivalence rather than syntactic similarity. Uses language-specific compiler mappings to handle the complexity of 17 different compilation environments and their idiosyncrasies.

vs others: More rigorous than BLEU-score-based translation metrics because it validates actual functional correctness through execution, and covers more language pairs (17 vs typical 2-4) with explicit compiler integration.
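The execution-based check described in this entry can be sketched as follows. This is a minimal illustration, not xCodeEval's actual harness: the `Runner` type, `command_runner`, `functionally_equivalent`, and `token_overlap` are hypothetical names, the three toy programs stand in for compiled source and translated code, and `token_overlap` is a crude unigram score used here only as a stand-in for BLEU.

```python
import subprocess
from typing import Callable, Iterable, Sequence

# A "runner" takes the text fed to stdin and returns the program's stdout.
Runner = Callable[[str], str]


def command_runner(cmd: Sequence[str], timeout: float = 5.0) -> Runner:
    """Wrap a compiled or interpreted program (hypothetical command) as a Runner."""
    def run(stdin: str) -> str:
        result = subprocess.run(list(cmd), input=stdin, capture_output=True,
                                text=True, timeout=timeout, check=True)
        return result.stdout
    return run


def normalize(output: str) -> str:
    """Ignore trailing whitespace so formatting noise does not count as a mismatch."""
    return "\n".join(line.rstrip() for line in output.strip().splitlines())


def functionally_equivalent(source: Runner, target: Runner,
                            test_inputs: Iterable[str]) -> bool:
    """Return True only if both programs agree on every test input."""
    return all(normalize(source(x)) == normalize(target(x)) for x in test_inputs)


def token_overlap(a: str, b: str) -> float:
    """Fraction of whitespace tokens in `a` that also occur in `b`."""
    tokens_a, tokens_b = a.split(), set(b.split())
    return sum(t in tokens_b for t in tokens_a) / len(tokens_a)


# Toy stand-ins for a source program and two candidate translations.
def source_prog(stdin: str) -> str:
    a, b = map(int, stdin.split())
    return f"{a + b}\n"


def good_translation(stdin: str) -> str:
    a, b = (int(t) for t in stdin.split())
    return str(a + b)


def bad_translation(stdin: str) -> str:
    a, b = (int(t) for t in stdin.split())
    return str(a - b)  # one-token bug: textually close, functionally wrong


tests = ["1 2", "10 -3", "0 0"]
print(functionally_equivalent(source_prog, good_translation, tests))
print(functionally_equivalent(source_prog, bad_translation, tests))
print(token_overlap("return a + b", "return a - b"))
```

The buggy translation shares three of its four tokens with the reference yet fails the first test case, which is the gap between surface-similarity metrics and execution-based validation that the entry points to. In practice each runner would come from `command_runner` with a language-specific compile-and-run command.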

2

LAION-5B | Dataset | 60/100

via “language-aware dataset organization and filtering across 100+ languages”

5.85 billion image-text pairs foundational for image generation.

Unique: Pre-organized into language clusters (2.3B English, 2.2B multilingual across 100+ languages) enabling direct access to language-specific subsets without re-processing; supports non-English vision-language model training at scale

vs others: Larger multilingual coverage than most open datasets; however, language assignment reliability is lower than human-curated datasets, and language distribution is skewed toward English and high-resource languages

3

StarCoderData | Dataset | 58/100

via “multi-language code representation and tokenization”

783 GB curated code dataset (roughly 250B tokens) for StarCoder training.

Unique: Explicitly supports 86 languages with language-aware metadata, enabling models to learn language-specific syntax and patterns. Preserves raw code rather than pre-tokenizing, allowing flexible tokenizer choices downstream.

vs others: Broader language coverage than CodeSearchNet (6 languages) and more flexible than pre-tokenized datasets, letting researchers experiment with different tokenization strategies and language-specific fine-tuning.

4

OpenAssistant Conversations (OASST) | Dataset | 58/100

via “multilingual conversation dataset with 35 language support and cross-lingual sampling”

161K human-written messages in 35 languages with quality ratings.

Unique: Covers 35 languages including low-resource ones (Swahili, Vietnamese, Polish) with human-written conversations, not machine-translated. Enables genuine cross-lingual preference learning rather than synthetic translation.

vs others: Broader language coverage than English-centric datasets (e.g., ShareGPT, HH-RLHF), though with language imbalance requiring careful sampling. Larger low-resource language component than most instruction datasets.
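One common way to handle the language imbalance mentioned above is temperature-based sampling, where each language is drawn with probability proportional to its message count raised to 1/T. The sketch below is an assumption about how such sampling could be done, not something the dataset prescribes; the per-language counts are made up and `sampling_weights` is a hypothetical helper.

```python
import random
from typing import Dict


def sampling_weights(counts: Dict[str, int], temperature: float = 2.0) -> Dict[str, float]:
    """Weights proportional to count ** (1 / temperature), normalized to sum to 1.

    temperature == 1 reproduces the raw proportions; larger values flatten
    the distribution and up-weight smaller languages.
    """
    scaled = {lang: n ** (1.0 / temperature) for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}


# Made-up message counts for illustration only.
counts = {"en": 9000, "es": 900, "sw": 100}
weights = sampling_weights(counts, temperature=2.0)

# Draw the language of each example in a batch according to the weights.
rng = random.Random(0)
batch_langs = rng.choices(list(weights), weights=list(weights.values()), k=8)

for lang, w in weights.items():
    print(f"{lang}: raw={counts[lang] / sum(counts.values()):.3f} sampled={w:.3f}")
```

With T = 2 the smallest language here moves from 1% of the raw data to roughly 7% of sampled examples, at the cost of repeating its messages more often.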

5

StarCoder Data | Dataset | 57/100

via “multi-language code representation with language-specific tokenization”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Explicit language-specific representation across 86 languages with language-aware tokenization, rather than treating code as generic text — enables models to learn language idioms and syntax-specific patterns

vs others: More comprehensive language coverage (86 languages) than CodeSearchNet (6 languages) and more language-aware than generic code datasets, improving multilingual code generation

6

CodeLlama 70B | Model | 57/100

via “multi-language code translation and porting”

Meta's 70B specialized code generation model.

Unique: Supports code translation across 15+ languages with understanding of language-specific idioms and standard library patterns, enabling more idiomatic translations than generic seq2seq models. The code-specific pretraining enables better preservation of algorithm semantics during translation.

vs others: Produces more idiomatic and functionally correct translations than GPT-3.5 or general-purpose models due to code-specific training, while remaining openly available under a license that permits commercial use.

7

WildChat | Dataset | 57/100

via “multilingual conversation corpus extraction and analysis”

1M+ real user-AI conversations with demographic metadata.

Unique: Includes real-world multilingual conversations from production ChatGPT/GPT-4 deployments, capturing authentic non-English user interactions and code-switching patterns, though limited in coverage and requiring language detection for explicit language identification

vs others: More authentic multilingual examples than synthetic multilingual datasets, though smaller and less balanced than purpose-built multilingual corpora like FLORES or mC4

8

Granite | Repository | 56/100

via “multilingual code generation across 116 programming languages”

IBM's enterprise-focused open foundation models.

Unique: Trained on 116 programming languages with unified tokenization and no language-specific architectural branches, enabling cross-language code generation from a single model rather than language-specific fine-tunes. Uses a two-phase training approach (3-4T code tokens + 500B mixed tokens) to balance code-specific patterns with natural language understanding for better instruction following.

vs others: Broader language coverage than Codex (92 languages) and more balanced multilingual performance than Copilot, which optimizes primarily for Python/JavaScript; Granite's enterprise data filtering and PII redaction make it safer for regulated industries than models trained on raw GitHub.

9

Mistral: Devstral Small 1.1 | Model | 26/100

via “multi-language code understanding and translation”

Devstral Small 1.1 is a 24B parameter open-weight language model for software engineering agents, developed by Mistral AI in collaboration with All Hands AI. Finetuned from Mistral Small 3.1 and...

Unique: Trained on parallel code corpora across 10+ languages with explicit focus on semantic equivalence rather than syntactic mapping, enabling idiomatic translations that respect target language conventions and libraries

vs others: Produces more idiomatic translations than rule-based transpilers by understanding semantic intent and applying language-specific best practices, though still requires manual review for production code

10

Z.ai: GLM 4 32B | Model | 26/100

via “multi-language translation with context preservation”

GLM 4 32B is a cost-effective foundation language model. It can efficiently perform complex tasks and has significantly enhanced capabilities in tool use, online search, and code-related intelligent tasks. It...

Unique: GLM 4 32B uses multilingual embeddings trained on diverse parallel corpora, enabling it to handle low-resource language pairs better than models trained primarily on English — this is a training data advantage rather than architectural

vs others: More cost-effective than specialized translation APIs while maintaining competitive quality through multilingual training, with better handling of technical and code-related content than generic translation services

11

xCodeEval | Dataset | 25/100

via “multilingual code-to-code translation dataset construction”

Dataset by NTU-NLP-sg. 665,024 downloads.

Unique: Combines expert-generated annotations with found code sources to create 696K+ translation pairs across 6+ programming languages, using token-classification and text-retrieval task formulations to enable both fine-grained alignment learning and semantic matching — a scale and diversity not matched by earlier code translation datasets

vs others: Larger and more diverse than CodeXGLUE's translation subset and includes expert validation of translation quality, whereas most prior datasets rely on automated alignment or single-language-pair focus
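As a rough illustration of how code-to-code translation pairs can be assembled, the sketch below groups accepted solutions by problem and pairs them across languages. It is a simplified assumption, not the documented xCodeEval pipeline: the record fields (`problem_id`, `lang`, `code`) and the function `build_translation_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, Iterable, List


def build_translation_pairs(solutions: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Pair solutions to the same problem across different languages.

    Each input record is assumed to be an accepted solution with the
    hypothetical fields `problem_id`, `lang`, and `code`.
    """
    by_problem: Dict[str, Dict[str, str]] = defaultdict(dict)
    for s in solutions:
        # Keep the first solution seen for each (problem, language).
        by_problem[s["problem_id"]].setdefault(s["lang"], s["code"])

    pairs = []
    for problem_id, by_lang in by_problem.items():
        # Directed pairs: (python -> cpp) and (cpp -> python) are both kept.
        for src, tgt in permutations(sorted(by_lang), 2):
            pairs.append({
                "problem_id": problem_id,
                "source_lang": src,
                "target_lang": tgt,
                "source_code": by_lang[src],
                "target_code": by_lang[tgt],
            })
    return pairs


# Toy records for illustration only.
solutions = [
    {"problem_id": "p1", "lang": "python", "code": "print(sum(map(int, input().split())))"},
    {"problem_id": "p1", "lang": "cpp", "code": "int main(){int a,b;std::cin>>a>>b;std::cout<<a+b;}"},
    {"problem_id": "p1", "lang": "java", "code": "class Main { /* reads a, b; prints a + b */ }"},
    {"problem_id": "p2", "lang": "python", "code": "print(input()[::-1])"},
]

pairs = build_translation_pairs(solutions)
print(len(pairs))
```

Problems solved in only one language contribute no pairs, so the number of usable pairs depends heavily on how many problems have accepted solutions in several languages.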

12

Llama 3.1 (8B, 70B, 405B) | Model | 25/100

via “multilingual text generation and translation”

Meta's Llama 3.1 — high-quality text generation and reasoning

Unique: Unified multilingual model eliminates need for separate language-specific models or external translation APIs. Supports code-switching and maintains context across language boundaries within a single forward pass, unlike pipeline approaches that translate then re-process.

vs others: Faster and cheaper than calling Google Translate or DeepL APIs for bulk translation, and runs entirely locally without data leaving your infrastructure; however, translation quality is likely inferior to specialized translation models trained on parallel corpora.

13

fineweb-edu-translated | Dataset | 24/100

via “low-resource language dataset augmentation via translation”

Dataset by Helsinki-NLP. 348,667 downloads.

Unique: Systematically translates high-quality educational content to 19 languages including underrepresented European languages, creating synthetic training data at scale for low-resource NLP — most competing datasets focus on high-resource languages or provide limited coverage for low-resource languages

vs others: Provides significantly more training data for low-resource languages than native-language corpora alone; broader language coverage than language-specific datasets

14

Qwen 2.5 Coder (1.5B, 3B, 7B, 32B) | Model | 24/100

via “multi-language code generation with unified interface”

Alibaba's Qwen 2.5 variant specialized for code generation and understanding.

Unique: Training on code from diverse language ecosystems enables the model to understand language-agnostic algorithmic concepts and translate them into language-specific idioms. The unified interface eliminates the need for separate language-specific tools or models.

vs others: More efficient than maintaining separate code generators for each language because a single model handles all languages, and more consistent than manual translation because the model applies learned conventions from each language's training data.

15

StarCoder 2 (3B, 7B, 15B) | Model | 22/100

via “multilingual code generation”

BigCode's StarCoder 2, a multilingual code generation model.

Unique: Trained on The Stack v2, a corpus covering a wide variety of programming languages, which strengthens its multilingual capabilities relative to models focused on a single language.

vs others: More versatile than GitHub Copilot in generating code across multiple languages due to its extensive training on diverse programming languages.

16

DeepSeek Coder V2 (16B, 236B) | Model | 22/100

via “programming language translation and code transformation”

DeepSeek's Coder V2, specialized for code generation and understanding.

17

DeepSeek-R1 | Product

via “multi-language code translation”

18

JIT.codes | Product

via “multi-language code translation”

19

Devassistant.ai | Product

via “multi-language code translation”

20

CodeCompanion | Product

via “cross-language code translation”

Unique: Integrates language translation directly into the IDE workflow without requiring separate tools or manual mapping; a free tier lets developers experiment with cross-language code reuse at no cost

vs others: More accessible than manual code translation or hiring developers fluent in multiple languages, but its output needs more review and adaptation before production use than human-written implementations do
