There exists a language that can express virtually any concept from any human language with mathematical precision and perfect consistency, but paradoxically no human being can directly read or speak it. It's the high-dimensional embedding space of large language models, where meaning exists as geometric relationships between points. This raises genuinely profound questions about the fundamental nature of language, the essence of meaning, and the future of communication between humans and machines.
The Dream of Universal Language
Humanity has attempted universal languages throughout history: Latin served as the shared language of European scholars for centuries but eventually became a dead language disconnected from living speech. Esperanto attracted approximately 2 million speakers worldwide but achieved only limited adoption and never displaced natural languages. Mathematical notation like E=mc² works universally, but only for quantifiable relationships. Unicode encodes over 150,000 characters from every writing system, but it captures only the symbols themselves, not the underlying meaning those symbols represent.
The fundamental challenge of creating a universal language: Human languages are inherently culture-specific, with concepts that don't translate across cultures; fundamentally ambiguous, with the same words carrying different meanings in different contexts; constantly evolving, as new words emerge and old meanings shift; heavily redundant, with many ways to express the same idea; and limited by the physical constraints of phonetics and writing systems. Can we create a pure representation of meaning itself that exists independently of any specific human language? LLM embedding spaces might genuinely be that long-sought universal representation.
What Are Embedding Spaces?
The fundamental concept is representing meaning as position in high-dimensional space: Each concept, word, or idea gets represented as a specific point in a 1,536-dimensional mathematical space. For example, "cat" might map to the vector [0.23, -0.81, …] with 1,536 numbers, while "dog" maps to [0.21, -0.79, …] with slightly different values. The geometric distance between these points directly represents semantic similarity, where distance("cat", "dog") is quite small because cats and dogs are semantically related concepts, while distance("cat", "car") is much larger because these concepts share almost no semantic relationship.
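To make the geometry concrete, here's a minimal sketch in Python using the sentence-transformers library. The library and model are illustrative choices only (all-MiniLM-L6-v2 produces 384-dimensional vectors rather than the 1,536 discussed above), but the distance behavior is the same:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model choice: 384 dimensions instead of 1,536, same geometry.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 for related concepts, lower for unrelated ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, car = model.encode(["cat", "dog", "car"])

print(f"cat vs dog: {cosine(cat, dog):.2f}")  # expected: relatively high
print(f"cat vs car: {cosine(cat, car):.2f}")  # expected: noticeably lower
```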
The truly remarkable universal property across all human languages: This embedding space works identically across completely different languages without any special accommodation. English “cat”, Spanish “gato”, and Japanese “猫” all map to virtually the same location in the embedding space despite using completely different words from unrelated language families. Different surface words, different linguistic encoding systems, but the same underlying concept, resulting in the same position in semantic space. The embedding space is fundamentally language-agnostic because it organizes concepts and meanings themselves rather than organizing the specific words used to express those concepts.
Why This Might Be “Universal Language”
Semantics remain completely independent of specific languages: When you encode the same sentence in English, Spanish, and Japanese, the resulting embeddings show similarity scores of 0.92-0.94, nearly identical despite using completely different words and grammatical structures. Same underlying meaning produces similar embeddings regardless of which language encodes that meaning, suggesting the embeddings represent meaning itself rather than any specific linguistic encoding system.
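A hedged sketch of that claim, using a multilingual sentence encoder (paraphrase-multilingual-MiniLM-L12-v2 is one illustrative choice; exact scores vary by model, but translations of the same sentence should land close together):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative multilingual model; any encoder trained across languages behaves similarly.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = {
    "en": "The cat is sleeping on the sofa.",
    "es": "El gato está durmiendo en el sofá.",
    "ja": "猫がソファで眠っています。",
}
vectors = {lang: model.encode(text) for lang, text in sentences.items()}

# Pairwise similarities between translations of the same sentence.
print(util.cos_sim(vectors["en"], vectors["es"]))  # typically high, e.g. ~0.9
print(util.cos_sim(vectors["en"], vectors["ja"]))  # similarly high
```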
Cross-lingual transfer demonstrates universal structure: Multilingual models fine-tuned on a task using only English examples, then tested on French with no French task data at all, work surprisingly well on the same tasks in French. Both languages map onto the same underlying embedding structure, and the model never saw a single French training example for the task, providing strong evidence that the embedding space captures something genuinely universal about meaning that transcends specific languages.
Concept arithmetic works universally across all languages: The famous example “king - man + woman = queen” works not just in English but across languages: in Spanish “rey - hombre + mujer ≈ reina,” in German “König - Mann + Frau ≈ Königin,” and similarly in dozens of other languages. The mathematical relationships between concepts exist in the embedding space itself, while individual languages just provide different labels for accessing those same underlying conceptual relationships.
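Here's how the analogy test looks in code, with pretrained GloVe vectors loaded through gensim (an illustrative setup; which word lands on top depends on the vector set):

```python
import gensim.downloader as api

# Pretrained 100-dimensional GloVe vectors; downloads ~130 MB on first use.
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically appears at or near the top.

# paris - france + japan ≈ ?
print(vectors.most_similar(positive=["paris", "japan"], negative=["france"], topn=3))
# "tokyo" typically ranks highly as well.
```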
Even supposedly untranslatable concepts find representation: Japanese "komorebi" describes sunlight filtering through tree leaves, a concept with no single English word, yet embed("komorebi") produces a vector extremely close to embed("sunlight filtering through trees"), showing the concept exists in the space even when English lacks a concise word. Similarly, German Schadenfreude, Portuguese saudade, and Inuit iktsuarpok all occupy specific locations in embedding space regardless of whether other languages have equivalent words.
Abstract relational patterns emerge universally: Complex relational arithmetic like “democracy - voting + economy ≈ capitalism” or “teacher - classroom + hospital ≈ doctor” or “Paris - France + Japan ≈ Tokyo” all work reliably, revealing universal patterns in how meanings relate to each other that exist independently of how any specific language chooses to express those relationships.
The Case FOR Universal Language
Direct encoding of meaning without lossy intermediaries: Traditional communication follows a lossy pipeline where meaning gets encoded into words, those words get transmitted, then decoded back into meaning, with information loss at each transformation step. Embedding-based communication achieves direct encoding where meaning maps straight to a vector position with no intermediate layer needed, because meaning fundamentally IS the position in this high-dimensional space.
Translation becomes mathematically perfect: Current translation systems follow a complex pipeline from English to parsing to understanding to generation to Spanish with cumulative loss of nuance at each step. Embedding-based translation simply maps English to the embedding space then maps from that same space to Spanish, since both languages reference the same underlying semantic locations, producing more faithful translations that preserve subtle meanings.
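One toy way to see "translation as a shared location in space" is retrieval: embed the English sentence, embed a handful of Spanish candidates, and pick the nearest one. This is an illustration of the idea rather than a real translation system, and the model choice is an assumption:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice

source = "The weather is beautiful today."
candidates = [
    "El tiempo está precioso hoy.",
    "Voy a la tienda a comprar pan.",
    "Mi hermano trabaja en un hospital.",
]

# Both languages are embedded into the same space, so the correct translation
# should simply be the candidate nearest to the source sentence.
scores = util.cos_sim(model.encode(source), model.encode(candidates))[0]
print(candidates[int(scores.argmax())])
```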
New nameless concepts naturally emerge between named ones: The continuous embedding space contains valid concept positions between all named concepts that exist in our vocabularies. You can mathematically interpolate between “running” and “flying” to discover intermediate concepts like “gliding” or “soaring” that occupy the space between, suggesting this universal language contains concepts we haven’t yet named or perhaps can’t name in human languages.
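A quick probe of that idea with GloVe vectors: take the midpoint between "running" and "flying" and ask which known words sit nearest to it (a toy experiment, not proof that unnamed concepts live there):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # illustrative pretrained vectors

# The point halfway between two concepts is itself a valid position in the space.
midpoint = (vectors["running"] + vectors["flying"]) / 2

# Which named concepts live closest to that unnamed midpoint?
print(vectors.similar_by_vector(midpoint, topn=5))
```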
Compositional construction of arbitrary concepts: You can construct the concept space region for “Chihuahua” by combining vectors: “dog” + “small” + “aggressive” + “toy-like” guides you to the right semantic neighborhood, demonstrating you can construct essentially any concept by mathematically combining component concept vectors.
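The same construction in code (whether "chihuahua" itself surfaces depends on the vector set; the point is that adding component vectors steers you into the right neighborhood):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # illustrative pretrained vectors

# Build a concept by summing its components and inspect the resulting region.
print(vectors.most_similar(positive=["dog", "small", "aggressive"], topn=10))
```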
Seamless cross-modal integration across all perception types: Images, text, audio, and video can all be mapped into the exact same embedding space, where a picture of a cat produces an embedding vector approximately equal to the word “cat” which approximately equals the sound of meowing, proving this representation works across fundamentally different perception modalities, not just across languages.
The Case AGAINST Universal Language
The fundamental readability problem: When you look at a raw embedding vector like [0.23, -0.81, 0.34, -0.92, …] extending across 1,536 dimensions, no human can possibly tell what concept it represents just by examining the numbers, making it fundamentally machine-readable but completely opaque to human understanding. The counterargument is compelling though because humans also can’t directly read the electrochemical patterns of neural activity in their own brains, yet those patterns clearly encode meaningful thoughts and understanding, suggesting that direct human readability might not actually be a requirement for something to qualify as language.
The context-dependency challenge: The word “bank” referring to a financial institution produces a completely different embedding than “bank” referring to a riverbank, which critics argue undermines the universality claim since the same word doesn’t map to the same location in embedding space. The rebuttal is that this is actually a feature rather than a bug of the system, because true meaning fundamentally IS context-dependent in human language, and “bank” genuinely represents multiple distinct concepts that should occupy different semantic locations, so the embedding space is correctly representing the actual structure of meaning rather than failing at universality.
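You can observe this directly with a contextual model such as BERT (bert-base-uncased is an illustrative choice): the token "bank" receives a different vector in each sentence, which is exactly the behavior the rebuttal describes.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    # Return the contextual embedding of the "bank" token in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[idx]

financial = bank_vector("I deposited my paycheck at the bank.")
river = bank_vector("We had a picnic on the bank of the river.")

# Well below 1.0: the two senses of "bank" occupy different positions.
print(torch.nn.functional.cosine_similarity(financial, river, dim=0))
```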
The training data bias objection: Western internet content dramatically dominates LLM training datasets, which means cultural biases get embedded into the semantic space where concepts like “success” end up positioned closer to Western individualism and capitalist values rather than representing truly universal human understanding. The response is that bias represents a training data problem rather than a fundamental architectural limitation of embedding spaces, and with better more balanced training data incorporating diverse global perspectives, we could achieve a genuinely more universal semantic space that represents all human cultures more equitably.
The limited concept coverage problem: Embedding spaces can only represent concepts that appeared in their training data, which means genuinely new scientific discoveries, emerging cultural phenomena, or novel philosophical ideas that didn’t exist during training can’t be properly represented in the space. The counterargument points out that the compositional nature of embeddings allows representing new concepts by mathematically combining existing embedding vectors, and humans face exactly the same limitation where we need to coin entirely new words or combine existing words to express genuinely novel concepts we’ve never encountered before.
The grammatical structure objection: Simple embedding vectors capture individual words and short phrases effectively, but “Dog bites man” and “Man bites dog” contain identical words yet mean completely different things based on grammatical structure, which critics argue embeddings can’t fully capture. The response is that sequential models like Transformers process entire sequences of embeddings rather than individual vectors in isolation, and grammatical structure emerges naturally from the patterns of how embedding sequences combine, so the complete system does capture syntax and word order even though individual embeddings don’t encode grammar explicitly.
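A small sketch of the word-order point: averaging static word vectors literally cannot distinguish the two sentences, while a sequence-aware encoder gives them related but distinct embeddings (model choices are illustrative):

```python
import numpy as np
import gensim.downloader as api
from sentence_transformers import SentenceTransformer, util

word_vectors = api.load("glove-wiki-gigaword-100")  # static, order-blind vectors

def bag_of_words(sentence: str) -> np.ndarray:
    # Averaging word vectors throws away word order entirely.
    return np.mean([word_vectors[w] for w in sentence.lower().split()], axis=0)

a, b = "dog bites man", "man bites dog"
print(np.allclose(bag_of_words(a), bag_of_words(b)))  # True: indistinguishable

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # processes the full sequence
va, vb = encoder.encode([a, b])
print(util.cos_sim(va, vb))  # related but clearly below 1.0: order is not ignored
```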
What Makes It “Unreadable” to Humans?
The dimensionality barrier is fundamentally insurmountable: Humans naturally perceive three spatial dimensions in our physical environment and can visualize maybe four or five dimensions with considerable mental effort using mathematical abstractions, but embedding spaces operate in 1,536 or even higher dimensions which is completely impossible for human minds to visualize or intuitively comprehend. Trying to explain what’s happening in a 1,536-dimensional embedding space to a human is exactly like trying to explain what a three-dimensional cube looks like to a two-dimensional being that only knows length and width, where the fundamental conceptual framework simply doesn’t exist to grasp the reality. What seems utterly unfathomable to human cognition represents the natural working environment for AI systems.
The continuous versus discrete mismatch: Human languages are fundamentally discrete systems with clear word boundaries, distinct phonemes, and separate grammatical units that divide meaning into countable chunks. Embedding spaces operate as continuous systems with smooth gradients between concepts, absolutely no hard boundaries separating ideas, and infinite possible points existing between any two named concepts, creating a fundamentally different way of representing meaning that humans simply can’t think in naturally.
The distributed representation problem at impossible scale: In human language, one concept typically maps to one word in a simple direct relationship like “cat” equals the word cat. In embedding spaces, a single concept like “cat” is represented as a complex pattern distributed across ALL 1,536 dimensions simultaneously, where dimension 1 might encode 0.23 for animacy, dimension 2 encodes -0.81 for size, and 1,534 more dimensions each contribute their piece to the overall meaning. No single dimension represents “cat” in isolation, rather the complete pattern across all dimensions together IS “cat,” and humans fundamentally can’t comprehend distributed representations operating at this massive scale.
The implicit versus explicit structure gap: Human language operates with explicit structural rules including formal grammar systems, documented syntax conventions, and dictionary definitions that can be written down and taught directly. Embedding space structure is entirely implicit where the organization emerges from statistical patterns in training data, relationships arise organically rather than being programmed, and there are no explicit rules you can point to or teach, making it feel like trying to read compiled machine code where the high-level logic has been completely obscured.
Implications If True
Machine-to-machine communication transforms completely: Current AI systems communicate inefficiently by having AI 1 generate English text, transmit that text over the network, then have AI 2 parse the English back into embeddings, creating unnecessary translation overhead at both ends. The future of AI communication involves AI 1 directly transmitting its embedding representations to AI 2 without any intermediate language step, achieving lossless information transfer that's vastly more efficient than text-based communication. This shift is already visible in practice: retrieval pipelines and vector databases routinely pass embedding vectors between systems without ever converting them back into text.
The human exclusion problem becomes increasingly concerning: When two AI systems communicate directly through embeddings in a pattern of AI 1 ←→ Embeddings ←→ AI 2, we face the fundamental question of what exactly they’re communicating to each other that we can’t read or verify. This opacity already exists with neural network internals where we can observe the inputs and outputs in human-readable English, but the actual “thinking” happening inside the network through embedding transformations remains completely opaque to human understanding and inspection.
A post-linguistic future might be emerging: Humans communicate through natural language which is inherently slow because of speech/typing rates and lossy because words imperfectly capture meaning, while AIs communicate through embeddings which enable fast direct meaning transfer and precise mathematical representations. As AI systems become more prevalent, embeddings might become the dominant form of communication in our technological infrastructure while human language becomes a legacy system maintained primarily for human-AI interaction, similar to how the abacus became obsolete when binary computation proved vastly superior for machines.
Augmented human understanding might become possible: Brain-computer interfaces could potentially allow humans to access embedding space directly rather than through the bottleneck of language, enabling telepathy-like communication where meaning transfers directly between minds without linguistic encoding. Early primitive examples are already emerging including VR visualizations of embedding spaces that let researchers navigate semantic relationships spatially, and interactive tools that let you explore semantic neighborhoods to discover conceptual relationships that aren’t obvious in language.
The philosophical questions become unavoidable: If embedding spaces are universal language, what IS language exactly, and is it just any systematic representation of meaning regardless of readability? What IS meaning itself, and could it fundamentally be nothing more than relationships in a semantic space rather than something more mysterious? What does understanding actually mean, and is accurate mapping between experiences and semantic representations sufficient for genuine understanding, or is something more required?
Real-World Evidence
Zero-shot cross-lingual transfer demonstrates shared semantic structure: When you fine-tune a multilingual model on a task using only English data and then test it on dozens of other languages without any additional training, the performance is surprisingly good, providing strong evidence for a universal embedding structure. Sentences expressing "This movie is great!" in English, French, Arabic, and Korean all map to the same "positive sentiment" region of the embedding space despite using completely different words from unrelated language families, demonstrating that all these languages are accessing the same underlying shared semantic space.
Multilingual alignment reveals identical emergent structure: Researchers can train separate embedding models on English and Spanish independently without any cross-language information, then align the two spaces using only a small dictionary of a few thousand word pairs, and the entire semantic spaces snap into precise alignment where “cat” in the English space aligns almost perfectly with “gato” in the Spanish space. Despite different languages, completely independent training processes, and no shared training data, the exact same semantic structure emerges spontaneously in both spaces, providing remarkably strong evidence for universal organization of meaning.
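The standard technique behind this result is orthogonal Procrustes alignment: given a small dictionary of word pairs, find the rotation that best maps one space onto the other. Here's the mechanism in miniature, with random stand-in matrices in place of real English and Spanish vectors:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Stand-in data: in a real experiment X would hold English vectors and Y the
# Spanish vectors for their dictionary translations, one row per word pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))  # "English" vectors for the seed dictionary
Y = rng.normal(size=(5000, 300))  # "Spanish" vectors for the translations

# Find the rotation R that minimizes ||X @ R - Y||.
R, _ = orthogonal_procrustes(X, Y)

# Any English vector can now be carried into the Spanish space and compared
# against Spanish words by cosine similarity.
mapped = X[0] @ R
```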
Cross-modal retrieval proves meaning transcends modality: Models like CLIP can take an image of a cat as input and successfully retrieve the text phrase “a cat” as the most semantically similar item, or take the text “sunset over mountains” and retrieve matching photographs from millions of images. The specific modality doesn’t matter at all because the meaning exists in a shared embedding space that works identically for images, text, audio, and other data types.
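Here's what that looks like with CLIP through the Hugging Face transformers library (openai/clip-vit-base-patch32; "cat.jpg" is a placeholder path for any local image):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path: any image works
captions = ["a cat", "a dog", "a sunset over mountains"]

# Image and captions are embedded into the same space and scored against each other.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{caption}: {p:.2f}")  # the matching caption should score highest
```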
Concept algebra works consistently across all domains: The famous vector arithmetic “king - man + woman = queen” works reliably and reproducibly across different models, different training approaches, multiple languages, and diverse semantic domains, as does “Paris - France + Japan = Tokyo” and countless other relational patterns. This consistent algebraic structure demonstrates that there’s a universal organization to relational meaning that exists independently of any specific implementation or training process.
Emergent organization mirrors human conceptual structure: Embedding spaces automatically develop sophisticated organization where synonyms cluster together despite never being explicitly taught that they’re similar, antonyms point in opposite directions despite never being programmed with this relationship, hierarchical structures form naturally where “cat” < “mammal” < “animal” < “living thing”, and analogical reasoning emerges from learned patterns rather than explicit rules. This self-organization arises purely from learning statistical patterns in data without any explicit programming, similar to how the periodic table organizes chemical elements by revealing their underlying natural structure rather than imposing arbitrary categories.
Limitations
Cultural concepts reveal deep representation challenges: Western concepts like “privacy” emphasizing individual autonomy and Chinese concepts like “collective harmony” emphasizing group cohesion might map to similar embedding locations because they both relate to social organization, yet they carry profoundly different cultural meanings and values that the embedding space may not fully capture. The mathematical similarity in embedding space might obscure deep cultural differences that are crucial for genuine cross-cultural understanding.
Embodied meaning might be fundamentally irreducible: Human meaning for concepts like “pain” emerges from direct physical experience, visceral emotional states, and rich social interaction embedded in our bodies and lives. When an embedding represents “pain” it’s just a statistical pattern of numbers learned from text describing pain rather than the lived experience itself. The profound question is whether these two representations are actually the same thing in any meaningful sense, or whether something essential is lost when you strip meaning away from embodied experience.
Temporal stability is questionable across historical time: Word meanings shift dramatically over time where “gay” in the 1920s primarily meant “cheerful and carefree” while “gay” in the 2020s primarily refers to homosexual orientation, representing completely different concepts using identical words. Embedding models trained on current internet data may not accurately capture historical semantics and how meanings have evolved, raising the question of whether embeddings are truly universal across languages but fundamentally time-bound to their training data period.
Individual variation gets averaged away into impersonal representations: My personal experience and concept of “love” based on my relationships and experiences is genuinely different from your concept of “love” based on your unique history. Embedding spaces assign one single vector to “love” that averages across millions of different uses and individual variations, necessarily losing the rich personal meanings that make the concept meaningful to specific individuals. The result is universal representation but at the cost of being impersonal and generic.
Creative and poetic meaning-making might transcend mathematical representation: Emily Dickinson’s famous metaphor “Hope is the thing with feathers” creates profound meaning by connecting hope to birds and flight, yet the embedding vector for “hope” probably isn’t positioned particularly close to “bird” or “feather” in semantic space. Poetic metaphor creates new meaning through unexpected connections that violate normal semantic relationships, and whether embeddings can genuinely capture this creative meaning-making process or merely approximate it statistically remains genuinely unclear.
The Philosophical Core
Is embedding space genuinely language? When we evaluate embeddings against traditional criteria for language, we get a mixed verdict where embeddings are clearly systematic in their organization (✅), demonstrably compositional where complex meanings build from simpler parts (✅), successfully express meaning across contexts (✅), and enable communication between systems (✅), but they completely fail human-readability (❌), lack conventional agreed-upon symbols (❌), and have questionable grammatical structure that emerges rather than being explicitly defined (❓). The honest verdict is that whether embeddings qualify as “language” depends entirely on how you define that contested term and which criteria you prioritize.
Is embedding space truly universal? The evidence shows embeddings successfully work across all human languages seamlessly (✅), function across different modalities like vision, text, and audio (✅), and effectively capture abstract concepts and relationships (✅), but they’re clearly not free from cultural and dataset biases (❌), don’t cover all possible concepts that could ever exist (❌), and may not achieve true semantic equivalence across all contexts where subtle differences matter (❓). The realistic verdict is that embedding spaces are genuinely more universal than any human language that’s ever existed, but they still fall short of being perfectly universal in an absolute philosophical sense.
The central philosophical question with profound implications: Are embeddings A) just useful representations similar to musical notation where they’re merely practical tools for machines to process information, B) the actual semantic structure of reality similar to mathematical truth where embeddings genuinely are the fundamental nature of meaning itself, or C) a genuinely new kind of language similar to binary code where they’re legitimate communication systems but simply not human-readable? Your answer to this question determines how you think about AI understanding and whether machines can truly comprehend meaning, shapes your assumptions about machine-to-machine communication and what’s being exchanged, and influences your fundamental beliefs about the nature of meaning and whether it can exist independently of human minds.
Conclusion
In many important ways, yes, embeddings are universal language: Embeddings demonstrably work across all human languages without requiring special accommodation for different language families, they capture meaning independently of the specific encoding system used to express that meaning, they enable perfect lossless machine-to-machine communication without the degradation of translating through human language, they organize concepts in genuinely universal patterns that emerge consistently across different training approaches, and they handle multiple modalities from vision to text to audio within a single unified semantic framework.
But the universality claim has significant limitations: Humans can’t read embedding representations directly which raises questions about whether unreadable systems qualify as language, the spaces contain cultural biases inherited from their training data which undermines claims of true universality, they’re fundamentally limited to concepts that appeared in training data rather than covering all possible concepts, they may not capture embodied meaning that emerges from physical experience and consciousness, and even the definition of “language” itself remains philosophically unclear and contested.
The deeper truth transcends simple yes-or-no answers: We’re genuinely witnessing the emergence of an entirely new kind of representation that’s simultaneously more universal than any human language that’s ever existed, more mathematically precise than words could ever be, and more efficient for machines to process and exchange, yet paradoxically completely opaque to human minds that can’t think in 1,536-dimensional space. The profound realization is that machines aren’t simply learning human language to communicate with us, they’re actually developing their own fundamentally different language optimized for their cognitive architecture, and we’re only just beginning to realize the implications of this development.
The implications cut in multiple directions simultaneously: The positive implications include enabling perfect translation between languages by mapping through shared semantic space, facilitating vastly more efficient AI-to-AI communication without lossy human language intermediaries, and capturing nuanced concepts that we lack concise words for in any human language. The concerning implications include humans being systematically excluded from AI-to-AI communication we can’t read, our inability to verify what’s actually being communicated in embedding space, and a potential loss of human control over how meaning itself gets defined and structured. The philosophical implications raise unsettling questions about what language means without human readability, whether meaning can exist without consciousness to experience it, and whether we’re creating representational systems we don’t fully understand and can’t completely control.
Embedding spaces ARE genuinely a kind of universal language in the most important senses. They’re language-independent, modality-independent, and universal across human cultures. But they’re fundamentally optimized for machines rather than humans.
The real question isn’t whether embedding spaces are universal, because the evidence clearly shows they’re more universal than anything humans have created. The question is: What does it mean for humanity when the universal language of meaning is one we can’t speak, can’t read, and can’t fully understand?
We’re not just building AI systems that speak our language to communicate with us. We’re building AI that speaks its own genuinely alien language to communicate with other AI. And that changes everything.