Inter-module communication is an essential part of a modular robotic system. Without it, there is no room for emergent behavior to form. There are two possible means of communication: wireless and wired. Each has its advantages and disadvantages.
Wired vs. Wireless
Wireless communication is easy to implement, as there is no need to establish a physical connection between the modules. The absence of physical wiring simplifies the connector mechanism and also allows communication between separate systems of modules. However, by its nature it does not scale well: all the modules share the same medium, and only a single module can transmit at a time. Also, standard networking technologies like WiFi or Bluetooth are not suitable for handling possibly hundreds of devices in one place, especially in an embedded environment.
On the other hand, wired communication can scale well, as every two connected modules can communicate without being limited by the other modules. It can also be more power efficient than wireless communication. As there is already a need for a physical connection, a power-sharing interface can be added at minimal cost, so the connected modules can share and distribute energy. Unlike wireless communication, wired communication needs routing, as there is no shared medium and only adjacent modules can communicate directly.
RoFI Communication
The RoFI platform relies primarily on wired communication between every two connected modules and optionally allows for wireless communication. This design choice allows for fast and parallel communication inside a single system and also makes it possible to establish communication with a remote system.
With the RoFI platform, we do not want to reinvent the wheel and, e.g., come up with a custom communication and routing protocol based on RS-485, like M-TRAN, or on CAN, as in the HyMod project. For example, the CAN bus is fault tolerant and provides prioritized messages; however, for the purposes of modular robots, it has several disadvantages:
- it has a limited bandwidth of 1 Mb/s,
- it does not scale well as all devices share the same medium,
- it cannot be cyclic and therefore needs bus switches to remove cycles,
- it needs an adjustable impedance based on the topology (see the HyMod paper).
Therefore, the RoFI platform leverages a traditional computer network based on the TCP/IP protocol stack. TCP/IP allows for seamless operation with existing computer networks and services. The users can use traditional sockets, port OpenMPI to their projects, or use the network for sending debug logs from the distributed environment. It potentially opens the door to global communication of the RoFI systems. Using TCP/IP also makes wired and wireless connections indistinguishable from each other. Routing and robustness also come for free. The downside of the TCP/IP stack is its complexity; however, this is partially compensated by the existence of ready-to-use implementations even for embedded platforms (e.g., lwIP).
There are two possibilities to implement TCP/IP in the platform: either use Ethernet to connect the modules (making each module act either as a hub or a switch) or introduce a custom L1/L2 layer of the ISO/OSI model.
See RoFICoM for the communication protocol of our connector. The connection of two RoFICoMs allows us to pass arbitrary packets of data between two modules. Considering a concrete implementation in lwIP, these two operations, sending and receiving a packet (possibly with a custom header), are all that is necessary to implement a custom device driver. "Device driver" is lwIP terminology and is basically an equivalent of L2 of the ISO/OSI model.
We chose to implement a custom physical layer for the RoFI platform for several reasons:
- Ethernet switching ICs usually feature only up to 5 interfaces, which is not enough for the universal module;
- routing a board with such a chip is challenging;
- the connectors cannot share a bus and many wires are required;
- Ethernet requires electromagnetic coupling for galvanic isolation, which is space consuming, and the isolation does not make sense in our setup due to the presence of power-sharing.
The usage of TCP/IP also allows us to easily change our choice later, e.g., if we find the current solution to be too slow, without affecting existing software. Only a small portion of the robot driver would have to change, and we can also combine multiple physical layers.
Implementation
The primary MCU of our modules is the ESP32; therefore, we leverage the lwIP library to implement networking. To make lwIP compatible with the connectors, we have to provide a new network interface. All connectors in the module appear as a single interface. Therefore, the implementation performs simple switching to determine the destination connector.
Connector Integration
Technically speaking, we have to provide the following functions to lwIP to implement a new network interface:
- err_t rofi_if_init(struct netif *netif) to initialize the interface, and
- err_t rofi_if_output(struct netif *netif, struct pbuf *p, ip_addr_t *ipaddr) to wrap a datagram in custom headers and to transmit the datagram to the connectors.
After a system start, our driver calls netif_add. The call registers a new network interface. After the setup, all incoming datagrams are passed to the rofi_netif->input callback. The callback is set in the structure by netif_add. Providing these functions is all that is needed to integrate the connectors into the lwIP library.
Network Interface Implementation
The implementation of the network interface is straightforward and somewhat technical. Only a few aspects are worth further comment:
- mapping IP datagrams to the connector interface,
- pairing destination address to connectors, and
- the relation between switching and routing.
The connectors can pass arbitrary binary blobs, marked by a content type specifier, to the mating side. The specifier allows multiple protocols to be transmitted through the same interface. In the RoFI driver, there are so far four content type specifiers: 0 for IP datagrams, 1 for an address mapping protocol, 2 for low-level logging, and 3 for a firmware update protocol.
In order to determine which connector should be used for datagram transmission, a mapping between destination IP addresses and the outgoing connector is needed. In Ethernet-based networks, the mapping is performed by the ARP protocol; however, ARP is not a good choice in the RoFI environment for the following reasons:
- the connection between connectors is point-to-point, and
- the ARP protocol relies on MAC addresses of physical interfaces, which the connectors do not provide.
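The demultiplexing by content type specifier can be sketched as follows. This is only an illustration: the Blob structure, the 16-bit width of the specifier, and the dispatch function are assumptions made for the example, not the RoFI driver's actual types. Only the four numeric values come from the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Content type specifiers as listed in the text.
enum class ContentType : std::uint16_t {
    IpDatagram = 0,
    AddressMapping = 1,
    LowLevelLog = 2,
    FirmwareUpdate = 3,
};

// Hypothetical representation of a blob received from a connector.
// The 16-bit width of the specifier is an assumption.
struct Blob {
    std::uint16_t contentType;
    std::vector<std::uint8_t> payload;
};

// Returns the name of the protocol handler responsible for the blob.
// A real driver would call the handler instead of naming it.
std::string dispatch(const Blob& blob) {
    switch (static_cast<ContentType>(blob.contentType)) {
        case ContentType::IpDatagram:     return "ip";
        case ContentType::AddressMapping: return "mapping";
        case ContentType::LowLevelLog:    return "log";
        case ContentType::FirmwareUpdate: return "firmware";
    }
    return "dropped";  // unknown specifiers are ignored in this sketch
}
```

Keeping the specifier outside the payload means a new protocol can be added without touching the existing handlers.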
Therefore, we design a simple address mapping protocol instead. The protocol is used to find IP addresses and GUIDs of the module neighbors. There are two messages in the protocol: a mapping call and a mapping response.
The mapping response follows the format:
Each module keeps a table with IP addresses and module GUIDs for each connector. The table can contain multiple entries for each connector even though the connectors provide only point-to-point communication. There are two reasons for this: it allows modules to be clustered in a system, with packet switching performed inside the cluster if further development finds it suitable (e.g., because packet switching has lower latency), and there can be simple modules in the system that are not capable of routing (e.g., an accumulator module). The overhead caused by the extended packet format is negligible and is a fair price for the added extensibility.
When connectors connect, the module actively sends a mapping response message to the mating side. If a change invalidates previously sent information (e.g., the module has changed its IP address), a new mapping response is sent. This event-driven approach allows the mapping to be established quickly; however, the module should periodically use the mapping call to refresh the entries in the table so that they do not go out of date.
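A minimal sketch of such a per-connector table and of the lookup that selects the outgoing connector is given below. The class and member names are hypothetical, IPv4 addresses are stored as plain 32-bit integers, and the 64-bit GUID is a placeholder, since the text does not state the GUID width.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Ip4 = std::uint32_t;    // IPv4 address as a 32-bit integer
using Guid = std::uint64_t;   // placeholder width, an assumption

struct NeighborEntry {
    Ip4 ip;
    Guid guid;
};

// Hypothetical table: every connector may hold several entries.
class NeighborTable {
public:
    // Replace all entries of a connector, e.g. after a mapping response.
    void update(int connector, std::vector<NeighborEntry> entries) {
        table_[connector] = std::move(entries);
    }

    // Forget a connector, e.g. after it disconnects.
    void disconnect(int connector) { table_.erase(connector); }

    // Returns the connector that reaches the destination, or -1 when the
    // address is unknown and the datagram has to be routed instead.
    int connectorFor(Ip4 destination) const {
        for (const auto& slot : table_)
            for (const auto& entry : slot.second)
                if (entry.ip == destination) return slot.first;
        return -1;
    }

private:
    std::map<int, std::vector<NeighborEntry>> table_;
};
```

A linear scan is sufficient here because a module has only a handful of connectors and, typically, few entries per connector.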
The mapping call follows the format:
Note that under normal conditions the table cannot go out-of-date; this can only happen when a malfunction occurs (e.g., a module resets).
Module Address Configuration and Routing
Each module has to have a unique and valid IP address to communicate in the IP network. The modules have unique GUIDs; however, a GUID is not a good candidate for an IP address for several reasons. First, it is longer than an IPv4 address. Second, it does not follow the standard format of an IPv6 address, which reflects the structure of the network, and therefore a GUID does not allow for efficient routing. There is no central authority in a network of RoFIs like a router in a traditional network; therefore, traditional solutions for obtaining a network address, like a DHCP server, cannot be used. Both address configuration and routing in large networks with unstable topology are still areas of active research, and there is no standardized solution available yet.
As RoFI is built on top of TCP/IP networking, nearly all newly proposed algorithms for routing (e.g., APM) can be adopted, as their experimental implementations rely on TCP/IP. The same goes for IP address configuration, e.g., in the form of a Distributed DHCP server. However, in the first versions of the RoFI driver, we rely on the basic address autoconfiguration implemented in lwIP and a simple routing algorithm. These algorithms work fine in small networks. In the future, we plan to adopt the Distributed DHCP server and the APM routing algorithm.
Low-Level Logging
When firmware for a microcontroller is developed, programmers usually use a simple communication interface, typically UART, to print debugging messages. Using print-based debugging is useful, since a debugger cannot pause most systems as it is impossible to pause the surrounding environment. UART is usually used as it is stateless, has nearly no code dependencies and is easy to set up.
The same approach can be used to debug a single unit; however, collecting data from multiple UARTs on multiple modules is rather impractical. TCP/IP sockets can be used instead of UART; however, since they rely on networking, a debug message cannot be delivered in case of a routing malfunction or a system crash. Therefore, we introduce low-level logging in the RoFI driver: a simple and robust protocol for sending messages in the RoFI system.
The low-level logging is implemented directly on top of the connector interface and does not rely on TCP/IP networking. The goal is to emit a string and propagate it to all modules or a special debugging adapter in the form of a connector, where it can be transmitted to the programmer. The protocol should be used only for debugging purposes and should be disabled in release builds as it is built on a naive flooding algorithm.
Low-level logging uses content type specifier 2. There is only one type of message:
The string can have a length up to 65519 bytes. Each module keeps a set of message hashes, which stores the hashes of all received messages. When a module receives a message, it calculates its hash. If the hash is not present in the set, it is inserted, and the message is resent to all other connectors in the module. The hashes in the set expire after a given period, after which they are removed from the set. This algorithm ensures that all emitted messages spread through the whole system. The random salt is necessary to allow the same message to be sent repeatedly. Note that if the expiration period is too short, a message may circulate through the system indefinitely. However, for debugging purposes, we do not perceive this as a limitation.
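The duplicate suppression behind the flooding can be sketched as follows. The sketch makes several assumptions that are not taken from the RoFI driver: the class name FloodFilter is hypothetical, std::hash stands in for whatever hash function the driver uses, the caller supplies the current time in milliseconds, and the hashed message is the raw message bytes with the salt and the string together.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical duplicate filter for flooded log messages.
class FloodFilter {
public:
    explicit FloodFilter(std::uint64_t lifetimeMs) : lifetimeMs_(lifetimeMs) {}

    // Returns true when the message has not been seen recently and should
    // be resent to all other connectors; false when it is a duplicate.
    bool shouldForward(const std::string& message, std::uint64_t nowMs) {
        expire(nowMs);
        const std::size_t hash = std::hash<std::string>{}(message);
        if (seen_.count(hash) != 0) return false;
        seen_[hash] = nowMs + lifetimeMs_;  // remember the expiration time
        return true;
    }

private:
    // Remove all hashes whose expiration time has passed.
    void expire(std::uint64_t nowMs) {
        for (auto it = seen_.begin(); it != seen_.end();) {
            if (it->second <= nowMs) it = seen_.erase(it);
            else ++it;
        }
    }

    std::uint64_t lifetimeMs_;
    std::unordered_map<std::size_t, std::uint64_t> seen_;  // hash -> expiry
};
```

Because the salt is part of the hashed bytes, two emissions of the same string with different salts are treated as distinct messages, while a copy returning over a cycle in the topology is dropped.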
Automatic Firmware Distribution
It is highly desirable that all modules of the same type run the same firmware version. If the modules can synchronize firmware automatically, the development process can speed up. The firmware distribution can be achieved in two ways – either there is a central authority publishing the firmware, or the modules check the firmware of their neighbors and update the firmware from them.
The ESP32 is a suitable microcontroller for remote firmware upgrades. Traditional microcontrollers can update their firmware only using a bootloader, a small program residing in a reserved area of flash memory that is executed after the microcontroller boots. If a custom update protocol is needed, a custom bootloader has to be written. However, bootloaders have to be self-contained and have a restricted binary size; therefore, sophisticated update protocols are hard to implement. This is not the case for the ESP32. Its flash memory can be divided into partitions (e.g., a firmware or a virtual file system partition). In the default configuration, there are two partitions for firmware. These partitions are used to implement over-the-air (OTA) updates. One partition is marked as active, and upon boot, the bootloader loads user firmware from the active partition. Then, during regular operation of the microcontroller, firmware can be written to the second, inactive partition. Once the firmware is written and validated, the active flag of the partitions is swapped. When the microcontroller reboots, the new firmware is used. OTA updates allow the existing codebase in the firmware to be used to perform firmware updates. Therefore, complicated protocols, like an update from an HTTP server, can be implemented.
ESP-IDF provides an implementation of an OTA update from an HTTP server and an interface for implementing custom updates. The update from an HTTP server works by specifying a URL; the microcontroller then periodically polls for a new firmware version. The interface consists of three functions: one for starting the update, one for processing a binary blob with a piece of firmware, and one for committing the changes.
The HTTP server update is an example of a central update. We find it unsuitable for the RoFI systems for the following reasons:
- each module downloads the firmware separately and therefore the protocol does not scale well, and
- it relies on working TCP/IP connection and routing. If a firmware update breaks routing, the firmware update stops working.
Therefore, we introduce a decentralized firmware update protocol relying only on a connector connection. By flashing a single module, its firmware distributes among the other modules in a system. We do not restrict the flashing procedure – it can be either flashed by a cable from a PC, or the module can download its firmware from a central authority (therefore, a hybrid approach can be achieved).
The protocol works as follows. Each firmware has two symbols built into the binary: ROFI_FW_TYPE and ROFI_FW_TIMESTAMP. The user specifies the firmware type; the timestamp is added automatically during a build. The firmware type is a 16-bit unsigned integer, and the timestamp is a 64-bit unsigned integer. There are no restrictions on the meaning of the numbers; it is recommended that the timestamp be the number of milliseconds since the UNIX epoch. A module updates only to firmware of the same type as the one it currently runs and only to a newer timestamp. The firmware type allows the system to consist of various types of modules.
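The update rule can be written down as a small predicate. This is a sketch only: the FirmwareId structure and the shouldUpdate function are names invented for the example. Only the field widths and the rule itself come from the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory form of the two symbols built into the firmware.
struct FirmwareId {
    std::uint16_t type;       // value of ROFI_FW_TYPE
    std::uint64_t timestamp;  // value of ROFI_FW_TIMESTAMP
};

// A module accepts only firmware of its own type with a newer timestamp.
bool shouldUpdate(const FirmwareId& running, const FirmwareId& offered) {
    return offered.type == running.type &&
           offered.timestamp > running.timestamp;
}
```

Since the rule compares timestamps strictly, a module never reinstalls the firmware it already runs, and a build with an older timestamp cannot replace a newer one.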
The firmware update protocol uses content type specifier 3. There are four types of messages:
A firmware request is sent by a module to find out the firmware version of its neighbors. The answer is a firmware announcement. The message has the following format:
A firmware announcement is broadcast by a module to its direct neighbors when it learns of new firmware. The firmware might not be completely known to the module; therefore, the message contains a field with the known size. When more of the firmware becomes known, a new message is broadcast. The message has the following format:
A firmware chunk request is sent by a module to obtain a new chunk of the firmware. It is usually a response to a firmware announcement. The chunks are 1024 bytes long (except the last one), and the allowed offsets are multiples of 1024. The message has the following format:
A firmware chunk response is the answer to a firmware chunk request. The message has the following format:
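The chunk arithmetic implied by the chunk request can be sketched as follows. The function names and the 32-bit width of offsets and sizes are assumptions for the example. The 1024-byte chunk size and the rule that offsets are multiples of it come from the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kChunkSize = 1024;  // chunk size given in the text

// An offset is valid when it is chunk-aligned and lies inside the firmware.
bool isValidOffset(std::uint32_t offset, std::uint32_t firmwareSize) {
    return offset % kChunkSize == 0 && offset < firmwareSize;
}

// Length of the chunk starting at the given offset; only the last chunk
// may be shorter than kChunkSize.
std::uint32_t chunkLength(std::uint32_t offset, std::uint32_t firmwareSize) {
    assert(isValidOffset(offset, firmwareSize));
    return std::min(firmwareSize - offset, kChunkSize);
}

// Number of chunk requests needed to transfer the whole firmware.
std::uint32_t chunkCount(std::uint32_t firmwareSize) {
    return firmwareSize / kChunkSize + (firmwareSize % kChunkSize != 0 ? 1 : 0);
}
```

For instance, a firmware image of 2500 bytes is transferred in three chunks at offsets 0, 1024, and 2048, the last of which carries 452 bytes.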
First, we illustrate the protocol operation in a system composed of a single module type. Then, we extend it to arbitrary RoFI systems.
When the first module receives new firmware, it broadcasts a firmware announcement. Each module keeps a record of the source of the newest firmware available, which can be either the module itself or one of its neighbors. The firmware sources are ordered by the timestamp and the known size; for now, we ignore firmware types. When a module receives a firmware announcement, it checks whether the announced firmware is newer than any source known so far. If not, the message is ignored. Otherwise, the module updates its firmware source. When the firmware source contains newer firmware than the firmware the module currently runs, the module starts the upgrade procedure: it uses a firmware chunk request to ask the firmware source for a new firmware chunk. If a newer source of firmware appears, the current firmware update is aborted and a new update is started. During the update, the module itself broadcasts a firmware announcement message as new chunks of firmware are received. In this fashion, the new firmware spreads through the system like a wave. Note that messages can get lost in the system, as connectors are allowed to drop datagrams. Therefore, if a response times out, the module broadcasts a firmware request to find a new firmware source.
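The bookkeeping of the best firmware source can be sketched as below. The text only says that sources are ordered by the timestamp and the known size. Comparing the timestamp first and the known size second is an assumption of this sketch, as are the structure and function names.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical record of where the newest known firmware can be obtained.
struct FirmwareSource {
    int connector;            // -1 stands for the module itself
    std::uint64_t timestamp;  // firmware timestamp
    std::uint32_t knownSize;  // number of bytes the source already holds
};

// Assumed ordering: newer timestamp wins; on a tie, the larger known size.
bool isBetter(const FirmwareSource& candidate, const FirmwareSource& current) {
    return std::tie(candidate.timestamp, candidate.knownSize) >
           std::tie(current.timestamp, current.knownSize);
}

// Processes an announcement. Returns true when the best source changed,
// false when the announcement is ignored.
bool onAnnouncement(FirmwareSource& best, const FirmwareSource& announced) {
    if (!isBetter(announced, best)) return false;
    best = announced;
    return true;
}
```

With this ordering, a neighbor that announces more bytes of the same firmware replaces the previous record, which lets a module continue downloading as its source receives further chunks.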
To support multiple firmware types, the module keeps track of the newest firmware available for each type. If a firmware announcement concerns a firmware type other than the module's own and updates the module's record for that type, the module broadcasts the announcement to all neighbors. If a firmware chunk request for another firmware type is received, it is redirected to the source of that type of firmware, and the module remembers the request. When a response in the form of a firmware chunk is received, the module redirects the message to the author of the request.
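The relaying of chunk requests for foreign firmware types can be sketched as a table of pending requests. The ChunkRelay class, the use of the pair of firmware type and offset as a key, and the 32-bit offset are all assumptions of this example. The text does not specify how the module remembers a request.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical bookkeeping of redirected chunk requests.
class ChunkRelay {
public:
    // Remember that a connector asked for a chunk that was redirected
    // to the source of that firmware type.
    void remember(std::uint16_t type, std::uint32_t offset, int requester) {
        pending_[std::make_pair(type, offset)].push_back(requester);
    }

    // Called when the chunk response arrives. Returns the connectors that
    // are waiting for this chunk and forgets them.
    std::vector<int> resolve(std::uint16_t type, std::uint32_t offset) {
        auto it = pending_.find(std::make_pair(type, offset));
        if (it == pending_.end()) return {};
        std::vector<int> requesters = std::move(it->second);
        pending_.erase(it);
        return requesters;
    }

private:
    std::map<std::pair<std::uint16_t, std::uint32_t>, std::vector<int>> pending_;
};
```

Because connectors may drop datagrams, a real implementation would also need to expire entries whose response never arrives. That is omitted here.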