Every modality starts with simple codes. The text modality starts with one-character word codes, and the visual modality starts with position codes. Let's look at the graph representation of the most primitive codes.
When a modality is empty, we see only the root (1) and the starting clusters for zero (2, 3) and one (5, 6).
Suppose we need to encode 0 as the vertical position of a dot in an image. The graph will look like this.
Vertex 8 was added to the cluster representing zero. This vertex stands for position 0 and can be used in other layers of a multimodal graph.
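To make the bookkeeping concrete, here is a minimal Python sketch of the step just described. It tracks only which vertex IDs belong to which cluster; the edges of the real graph are not described in the text, so they are left out. The names `empty_modality` and `add_vertex` are hypothetical and exist only for this illustration.

```python
# Hypothetical bookkeeping only: which vertex IDs belong to which cluster.
# The edges of the real graph are not described in the text, so they are omitted.
ROOT_ID = 1


def empty_modality():
    """Return cluster membership for an empty modality (IDs taken from the text)."""
    return {"zero": [2, 3], "one": [5, 6]}


def add_vertex(clusters, cluster_name, vertex_id):
    """Attach a new vertex ID to an existing cluster and return that ID."""
    already_used = vertex_id == ROOT_ID or any(
        vertex_id in members for members in clusters.values()
    )
    if already_used:
        raise ValueError(f"vertex {vertex_id} already exists")
    clusters[cluster_name].append(vertex_id)
    return vertex_id


clusters = empty_modality()
position_zero = add_vertex(clusters, "zero", 8)  # vertex 8 means position 0
print(clusters["zero"])  # [2, 3, 8]
```

The vertex ID is passed in explicitly rather than generated, because the text does not say how IDs are assigned.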
Now let's compare the previous graph with a graph that contains position 1 in an image.
Both graphs are symmetric.
Now consider a number that takes two digits in binary. Here is the graph for 3. The vertex that actually carries the meaning 3 has ID 11 here.
Now a number with three digits in binary: 4. Its vertex is 14 below.
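As a side note on the numbers used in these examples, the short Python sketch below lists their binary digits. It shows only the digit sequences; how each digit maps onto vertices of the zero and one clusters is not spelled out in the text, so the sketch makes no claim about that. The helper name `binary_digits` is hypothetical.

```python
def binary_digits(n):
    """Return the binary digits of a non-negative integer, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return [int(ch) for ch in format(n, "b")]


for value in (0, 1, 3, 4):
    print(value, binary_digits(value))
# 0 [0]
# 1 [1]
# 3 [1, 1]
# 4 [1, 0, 0]
```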