language distance calculator - Aaron Graves, PhDude Replica

Tip: Use equivalent content (for example, translations of the same sentence) for more meaningful comparisons.

What this language distance calculator does

This tool estimates how similar or different two language samples are. It is useful for learners, translators, and curious readers who want a quick numeric signal of how much two texts overlap in spelling and vocabulary.

The calculator returns a distance score from 0 to 100:

0 means the texts are extremely close (or identical after normalization).
100 means the texts are highly different in terms of characters and word usage.
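The page does not spell out what "normalization" covers, so the following is only a minimal sketch of one plausible version. It assumes Unicode NFKC normalization, case folding, punctuation and symbol removal, and whitespace collapsing; the function name `normalize` is hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization steps (not confirmed by the tool itself)."""
    # Unify equivalent Unicode forms and ignore letter case.
    text = unicodedata.normalize("NFKC", text).casefold()
    # Replace punctuation (P*) and symbols (S*) with spaces; keep letters,
    # digits, and combining marks so non-Latin scripts stay intact.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Hello,  World!"))
```

Under these assumptions, two samples that differ only in capitalization, punctuation, or spacing normalize to the same string and would score 0.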

How the score is computed

Language distance is not one single thing, so this calculator combines three lightweight similarity signals. The more the two samples overlap on these signals, the lower the distance score:

1) Normalized Levenshtein similarity

Counts how many single-character edits (insertions, deletions, and substitutions) are needed to transform one text into the other. The count is normalized by text length and turned into a similarity, so short and long samples are more comparable.
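As an illustration, here is a minimal sketch of a normalized Levenshtein similarity. It assumes the edit count is divided by the length of the longer text, which is a common convention but not something the page confirms; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1.0 for identical texts, 0.0 when every character must change."""
    longest = max(len(a), len(b))  # assumed normalizer: the longer text
    if longest == 0:
        return 1.0  # two empty samples are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

For "kitten" and "sitting" the distance is 3 and the longer text has 7 characters, so the similarity under this convention is 1 - 3/7, roughly 0.57.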

2) Bigram Dice similarity

Looks at two-character chunks (bigrams) and compares overlap. This captures spelling patterns and orthographic style better than plain edit distance alone.
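A rough sketch of the bigram comparison follows. It assumes bigrams are counted as a multiset (repeated bigrams count more than once) and that the standard Dice formula, twice the overlap divided by the total number of bigrams, is used; the page does not say which variant the tool applies, and the function names are hypothetical.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count every pair of adjacent characters in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams: 2 * |A ∩ B| / (|A| + |B|)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Texts shorter than two characters have no bigrams to compare.
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * overlap / total


print(dice_similarity("night", "nacht"))  # 0.25
```

English "night" and German "nacht" share only the bigram "ht" out of four bigrams each, giving 2 * 1 / 8 = 0.25.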

3) Word-set Jaccard similarity

Compares overlap between the unique words in each sample. This provides a vocabulary-level signal and can highlight words the two samples share exactly, such as loanwords or cognates with identical spelling.
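The word-level signal can be sketched as below. This assumes words are split on whitespace after lowercasing, which is a simplification; the tool's actual tokenization is not described, and the function names are hypothetical.

```python
def word_set(text: str) -> set[str]:
    """Unique lowercase words, split on whitespace (assumed tokenization)."""
    return set(text.lower().split())


def jaccard_similarity(a: str, b: str) -> float:
    """Shared unique words divided by all unique words across both texts."""
    words_a, words_b = word_set(a), word_set(b)
    union = words_a | words_b
    if not union:
        return 1.0  # two empty samples are treated as identical
    return len(words_a & words_b) / len(union)


print(jaccard_similarity("o gato come peixe", "el gato come pescado"))
```

In this Portuguese and Spanish pair, "gato" and "come" are shared out of six unique words in total, so the similarity is 2/6, about 0.33. Near-matches such as "peixe" and "pescado" contribute nothing here, which is why the character-level signals are also useful.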

How to interpret your result

0–20: Very close (often dialectal variation, near-identical phrasing, or the same language).
21–40: Close (many shared forms, common roots, or closely related languages).
41–60: Moderate distance (mixed overlap and divergence).
61–80: Far apart (limited overlap at text level).
81–100: Very far apart (different scripts, vocabulary, and form patterns).
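To make the scale concrete, here is a sketch of how three similarity values between 0 and 1 could be blended into a 0 to 100 distance and mapped to the bands above. The equal weights are purely an assumption, since the page does not publish the tool's weighting, and `distance_score` and `interpret` are hypothetical names.

```python
BANDS = [
    (20, "Very close"),
    (40, "Close"),
    (60, "Moderate distance"),
    (80, "Far apart"),
    (100, "Very far apart"),
]


def distance_score(lev_sim, dice_sim, jaccard_sim, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Blend three similarities (each 0..1) into a 0..100 distance.

    Equal weights are an assumption, not the tool's documented behavior.
    """
    sims = (lev_sim, dice_sim, jaccard_sim)
    blended = sum(w * s for w, s in zip(weights, sims)) / sum(weights)
    # High similarity means low distance; clamp against rounding drift.
    return min(100.0, max(0.0, 100.0 * (1.0 - blended)))


def interpret(score: float) -> str:
    """Return the band label whose upper bound first covers the score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: the last band covers 100")


score = distance_score(0.57, 0.25, 0.33)
print(round(score), interpret(score))
```

A fractional score such as 20.5 falls into the next band up in this sketch; rounding the score before the lookup is an equally reasonable choice.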

Best practices for accurate comparisons

Use equivalent meaning

Compare texts that express the same idea. If one sample is technical and the other is casual, the score may reflect topic differences more than language distance.

Use enough text

One or two words are noisy. A sentence or short paragraph gives a more stable estimate.

Avoid mixed-language snippets

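If your input includes borrowed terms, names, or code-switching, the distance may be distorted, because tokens shared by both samples raise the measured overlap regardless of how related the languages are.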

Limitations you should know

This is a practical text-comparison calculator, not a full historical linguistics model. It does not estimate deep phylogenetic relationships, sound shifts over centuries, morphology, syntax trees, or semantic alignment. It simply quantifies observed overlap in your two samples.

For serious research, use larger corpora and methods such as cognate detection pipelines, language family trees, or embedding-based cross-lingual similarity models.

Quick example ideas

Compare Spanish vs Portuguese translations of the same sentence.
Compare Norwegian Bokmål and Danish news snippets.
Compare English and German product descriptions.
Compare two dialect spellings within the same language.