FESTA Hardware

Created by Joseph Joy

AXI4 Lite Interface

This is the accelerator's control and status register interface. It allows a processor to manage the hardware.

Purpose: To give a processor read/write access to internal registers.
What it Controls:
- Control Registers (Write): A processor can write to specific addresses to set the algorithm parameters and start processing:
  - 0x10: zeta parameter
  - 0x18: epsilon parameter
  - 0x20: delta parameter
  - 0x00 (Control Register): A specific bit in this register acts as the start signal for the FSM.
- Status Registers (Read): The processor can read from other addresses to check the status:
  - 0x00 (Status Register): Read a bit to see if the core is busy or done.
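
The register map above can be modelled in a few lines of host-side Python. This is a minimal sketch, not the actual driver: the offsets come from the list above, while the bit positions (START_BIT, BUSY_BIT, DONE_BIT), the 32-bit register width, and the RegisterFile class standing in for memory-mapped access are all assumptions made for illustration.

```python
# Offsets taken from the register list above.
REG_CTRL_STATUS = 0x00
REG_ZETA = 0x10
REG_EPSILON = 0x18
REG_DELTA = 0x20

# Assumed bit positions; the register list does not specify them.
START_BIT = 1 << 0
BUSY_BIT = 1 << 1
DONE_BIT = 1 << 2


class RegisterFile:
    """Hypothetical stand-in for memory-mapped AXI4-Lite access."""

    def __init__(self):
        self._regs = {}

    def write(self, offset, value):
        self._regs[offset] = value & 0xFFFFFFFF  # assume 32-bit registers

    def read(self, offset):
        return self._regs.get(offset, 0)


def configure_and_start(regs, zeta, epsilon, delta):
    """Write the three parameters, then set the start bit."""
    regs.write(REG_ZETA, zeta)
    regs.write(REG_EPSILON, epsilon)
    regs.write(REG_DELTA, delta)
    regs.write(REG_CTRL_STATUS, regs.read(REG_CTRL_STATUS) | START_BIT)


def is_busy(regs):
    """Check the (assumed) busy bit of the status register."""
    return bool(regs.read(REG_CTRL_STATUS) & BUSY_BIT)


regs = RegisterFile()
configure_and_start(regs, zeta=3, epsilon=5, delta=7)
```

On real hardware the read and write methods would be replaced by accesses to the accelerator's mapped base address; the parameter encoding (integer or fixed-point) is not stated here and would need to match the RTL.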

Pipelining

The coordination between the FSM, AXI interfaces, and the algorithm cores is the key to achieving the high-throughput, pipelined performance described in the paper. Think of it as a perfectly synchronized factory assembly line.

The FSM (Finite State Machine) is the Factory Manager 🧠.
The AXI Interfaces are the Conveyor Belts ⚙️.
The Algorithm Cores are the Workstations 🛠️.

Here is how they work together to pipeline the processing of LiDAR frames.

The Roles of Each Component

AXI-Lite Slave (The Manager's Office): Before anything happens, the main processor uses this interface to talk to the FSM. It's used for slow, one-time tasks like:
- Writing the configuration parameters (zeta, epsilon, delta) into registers.
- Writing to a control register to issue the main start command that kicks off the entire process.
AXI-Stream Interfaces (The Conveyor Belts): These are responsible for the continuous, high-speed movement of data.
- AXI-Stream Slave (Input Belt): Feeds a constant stream of LiDAR points into the first workstation. It uses the TLAST signal to tell the FSM, "This is the last piece of the current job (frame)."
- AXI-Stream Master (Output Belt): Takes the finished product (the 1-bit classification flags) from the second workstation and streams it out of the factory.
FSM (The Factory Manager): The FSM doesn't do any point processing itself. Instead, it directs the traffic. It has a simple set of states (IDLE, CLEARING_GRID, PROCESSING) and tells the two algorithm cores when to work based on the signals from the AXI interfaces.
Algorithm Cores (The Workstations):
- Cell Grid Core (Workstation 1): Takes raw materials (points) from the input belt and does the first processing step (populating the BRAMs).
- Ground Segmentation Core (Workstation 2): Takes the semi-finished product from Workstation 1 (via the BRAMs) and completes the final assembly (classification).

The Pipelined "Symphony" in Action

Let's trace the flow of two consecutive frames, Frame N and Frame N+1, to see the pipeline in action.

Step 1: Processing Frame N Begins

The system is IDLE. The processor has already configured the parameters via AXI-Lite.
Points for Frame N start arriving at the AXI-Stream Slave.
The FSM sees the data arriving, moves to its PROCESSING state, and enables the Cell Grid Core (Workstation 1).
The Cell Grid Core processes each point of Frame N, writing to the 'A' ports of the Grid BRAM and Points BRAM.
During this entire time, the Ground Segmentation Core (Workstation 2) is idle. It has nothing to do yet.

Step 2: The Critical Hand-Off (The Pipelining Magic)

The very last point of Frame N arrives at the AXI-Stream Slave. The TLAST signal is asserted for one clock cycle.
The FSM sees the TLAST signal. This is the trigger for the entire pipeline. It immediately does two things in the very next clock cycle:
- It sends a start signal to the Ground Segmentation Core (Workstation 2), telling it, "The data for Frame N is ready in the BRAMs. Begin your work."
- It keeps the Cell Grid Core (Workstation 1) enabled. Why? Because the points for the next frame, Frame N+1, are arriving on the AXI-Stream input right behind Frame N.

Step 3: Full Pipeline Operation

This is the state where the system achieves maximum throughput. For the entire duration that Frame N+1 is being processed:

The Cell Grid Core is busy processing points from Frame N+1. It reads from the AXI-Stream Slave and writes to the 'A' ports of the BRAMs.
The Ground Segmentation Core is simultaneously busy processing the data from Frame N. It reads from the 'B' ports of the BRAMs and sends the results to the AXI-Stream Master.

This parallel operation is only possible because the Block RAMs are dual-port. One core can write to one side of the memory while the other core reads from the other side without conflict.

The FSM's job is simply to manage these start/stop signals at the frame boundaries. The AXI interfaces ensure the data keeps flowing smoothly, and the algorithm cores just focus on their dedicated tasks. This elegant coordination allows the accelerator to start processing a new frame long before it has finished with the previous one, which is the very definition of a pipelined architecture.
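
The overlap between the two cores can be illustrated with a toy schedule. The sketch below assumes, purely for illustration, that time is divided into frame-sized slots of equal length; the function name pipeline_schedule is hypothetical.

```python
def pipeline_schedule(num_frames):
    """Return, per frame slot, which frame each core is working on.

    In slot t the Cell Grid Core handles frame t while the Ground
    Segmentation Core handles frame t - 1. None means the core is idle.
    """
    schedule = []
    for slot in range(num_frames + 1):
        grid_frame = slot if slot < num_frames else None
        seg_frame = slot - 1 if slot >= 1 else None
        schedule.append((grid_frame, seg_frame))
    return schedule


for slot, (grid_frame, seg_frame) in enumerate(pipeline_schedule(3)):
    print(f"slot {slot}: cell_grid={grid_frame} ground_seg={seg_frame}")
```

Only the first and last slots have an idle core; in every slot in between, both cores are busy on consecutive frames.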

FSM

Purpose: This is the brain of the accelerator. It orchestrates the entire process by controlling when the other modules are active.
Functionality:
- Initializes the system and waits for incoming data.
- Manages the two main phases: Grid Creation and Ground Segmentation.
- Generates control signals (e.g., start, done, reset) for all other modules.
- Controls the clearing of the Grid BRAM at the beginning of each new frame processing.
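
A software model of the controller might look like the following. The three state names come from this document, but the transition conditions (start, clear_done, tlast) are assumptions: the text names the states without spelling out exactly when each transition fires.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CLEARING_GRID = auto()
    PROCESSING = auto()


def step(state, start, clear_done, tlast):
    """Return (next_state, seg_start) for one clock cycle.

    seg_start models the start pulse sent to the Ground Segmentation Core.
    All transition conditions here are assumed, not taken from the RTL.
    """
    if state is State.IDLE:
        return (State.CLEARING_GRID if start else State.IDLE), False
    if state is State.CLEARING_GRID:
        return (State.PROCESSING if clear_done else State.CLEARING_GRID), False
    # PROCESSING: TLAST ends the frame and triggers the segmentation core.
    if tlast:
        return State.CLEARING_GRID, True
    return State.PROCESSING, False


state = State.IDLE
trace = []
for inputs in [(True, False, False), (False, True, False), (False, False, True)]:
    state, seg_start = step(state, *inputs)
    trace.append((state, seg_start))
```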

Points BRAM

Size is determined by the maximum number of points in a frame.

Purpose: A memory block to buffer the essential point information needed for the second stage. This decouples the two processing stages.
Configuration: A dual-port Block RAM.
- Port A (Write Port): Connected to the Cell Grid Core. It writes the is_valid, point_z and cell_index for every point in the frame. The memory address corresponds to the point's index in the frame.
- Port B (Read Port): Connected to the Ground Segmentation Core.
Size: max_points_per_frame (e.g., ~70k for VLP-16) x (1 + Z_coord_width + cell_index_width), where the extra bit holds the is_valid flag.
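
One way to picture a Points BRAM entry is as a packed word. The sketch below assumes a 16-bit signed point_z and a 17-bit cell_index (enough for 512 x 256 cells), with the fields ordered {is_valid, point_z, cell_index} from the most significant bit down. These widths and the field order are illustrative assumptions.

```python
Z_WIDTH = 16           # assumed width of point_z (signed)
CELL_INDEX_WIDTH = 17  # assumed: enough for 512 x 256 = 131072 cells
ENTRY_BITS = 1 + Z_WIDTH + CELL_INDEX_WIDTH

Z_MASK = (1 << Z_WIDTH) - 1
INDEX_MASK = (1 << CELL_INDEX_WIDTH) - 1


def pack_entry(is_valid, point_z, cell_index):
    """Pack {is_valid, point_z, cell_index} into one integer word."""
    z_bits = point_z & Z_MASK  # two's-complement truncation
    idx_bits = cell_index & INDEX_MASK
    valid_bit = int(is_valid) << (Z_WIDTH + CELL_INDEX_WIDTH)
    return valid_bit | (z_bits << CELL_INDEX_WIDTH) | idx_bits


def unpack_entry(word):
    """Inverse of pack_entry; restores the sign of point_z."""
    cell_index = word & INDEX_MASK
    point_z = (word >> CELL_INDEX_WIDTH) & Z_MASK
    if point_z >= 1 << (Z_WIDTH - 1):
        point_z -= 1 << Z_WIDTH
    is_valid = bool((word >> (Z_WIDTH + CELL_INDEX_WIDTH)) & 1)
    return is_valid, point_z, cell_index


word = pack_entry(True, -120, 4242)
```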

AXI4 Stream Slave Interface

Purpose: To provide a standardized input port for receiving the LiDAR point cloud data from an external source, such as a processor or a DMA controller.
Functionality:
- Receives a continuous stream of data packets. Each packet contains the (X, Y, Z) coordinates of a single point.
- Implements the AXI4-Stream protocol handshake (TVALID, TREADY, TDATA, TLAST). TLAST is used to signal the end of a frame.
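
The handshake rule is that a beat is transferred only in a cycle where TVALID and TREADY are both high. The following sketch models that rule in software and groups transferred beats into frames at TLAST; the function name collect_frames and the tuple layout are illustrative.

```python
def collect_frames(cycles):
    """Group transferred beats into frames.

    `cycles` is an iterable of (tvalid, tready, tdata, tlast) tuples, one per
    clock cycle. A beat is transferred only when TVALID and TREADY are both
    high in the same cycle; TLAST on a transferred beat closes the frame.
    """
    frames, current = [], []
    for tvalid, tready, tdata, tlast in cycles:
        if not (tvalid and tready):
            continue  # no transfer this cycle
        current.append(tdata)
        if tlast:
            frames.append(current)
            current = []
    return frames


cycles = [
    (1, 1, (1.0, 2.0, 0.1), 0),
    (1, 0, (3.0, 4.0, 0.2), 0),  # stalled: TREADY low, data must be held
    (1, 1, (3.0, 4.0, 0.2), 1),  # same beat accepted, ends the frame
    (0, 1, None, 0),             # idle: TVALID low
    (1, 1, (5.0, 6.0, 0.3), 1),  # single-point frame
]
frames = collect_frames(cycles)
```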

Cell Grid Core

Responsible for the first step: creating and populating the grid.

Processes each point from the input LiDAR frame to determine which grid cell it belongs to.

Functionality:

Receives one point at a time from the AXI-Stream Slave Interface.
Internal Logic:
- Coordinate Normalization Unit: Performs the additions/subtractions to normalize point coordinates.
- Cell Index Calculator: Implements the division (via bit-shifting) and multiplication-addition to calculate the cell_index.
- The cell_index_calculator also determines if the point is inside the grid boundaries. This generates the is_valid flag.
  - If the point is valid: It writes {is_valid=1, point_z, cell_index} to the Points BRAM. It also proceeds with the read-modify-write operation on the Grid BRAM.
  - If the point is invalid (discarded): It writes {is_valid=0, DONT_CARE, DONT_CARE} to the Points BRAM. It does not perform any operation on the Grid BRAM. The point is effectively ignored by the grid creation logic, as intended.
- Memory Interface Logic:
  - Writes the (is_valid, point_z, cell_index) tuple to the Points BRAM.
  - Performs a read-modify-write operation on the Grid BRAM to update the Z_min and Z_max values for the corresponding cell_index.
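
The normalization, shift-based division, and multiply-add steps can be sketched as follows. The grid dimensions are the example values used in this document; the cell size (CELL_SHIFT), the offsets that place the sensor at the grid centre, and the use of integer fixed-point coordinates are assumptions for illustration.

```python
GRID_WIDTH = 512    # cells along X (example dimension from this document)
GRID_LENGTH = 256   # cells along Y
CELL_SHIFT = 4      # assumed: cell size of 2**4 coordinate units
X_OFFSET = (GRID_WIDTH << CELL_SHIFT) // 2   # assumed: sensor at grid centre
Y_OFFSET = (GRID_LENGTH << CELL_SHIFT) // 2


def cell_index(x, y):
    """Return (is_valid, cell_index) for integer fixed-point coordinates."""
    # Coordinate normalization: shift so in-grid values are non-negative.
    xn = x + X_OFFSET
    yn = y + Y_OFFSET
    # Division by the cell size, implemented as a right shift.
    col = xn >> CELL_SHIFT
    row = yn >> CELL_SHIFT
    is_valid = 0 <= col < GRID_WIDTH and 0 <= row < GRID_LENGTH
    # Multiply-add flattens (row, col) into a single BRAM address.
    index = row * GRID_WIDTH + col if is_valid else 0
    return is_valid, index
```

For example, cell_index(0, 0) lands in the centre cell, while a point far outside the grid comes back with is_valid set to False.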
Start of Frame: The Top-Level Controller detects the start of a new frame from the AXI-Stream Slave Interface and asserts a signal to clear the Grid BRAM, setting all Z_min values to +infinity and Z_max to -infinity.

Point Streaming: Raw (X, Y, Z) points from Frame 'N' are streamed into the Cell Grid Core one by one.
Parallel Processing: For each incoming point i, the Cell Grid Core performs two actions simultaneously:
- Action A (Update Grid BRAM): It calculates the cell_index. It then reads the existing (Z_min, Z_max) from the Grid BRAM at address cell_index, compares them with the current point's Z coordinate, and writes back the updated values if necessary.
- Action B (Buffer to Points BRAM): It writes the point's is_valid flag, Z coordinate (point_z), and calculated cell_index into the Points BRAM at address i, the point's index in the frame.
End of Phase 1: This continues until the last point of Frame 'N' is processed, indicated by the TLAST signal. The Top-Level Controller now knows that the Grid BRAM holds the complete elevation map for Frame 'N' and the Points BRAM holds all the necessary point data.
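
The first phase can be modelled in software as a single pass over the frame. In this sketch the BRAMs are plain Python lists, and toy_cell_of is a hypothetical 4 x 4 grid mapping used only to make the example self-contained.

```python
import math


def build_grid(points, num_cells, cell_of):
    """Phase 1 sketch: fill Grid BRAM and Points BRAM models for one frame.

    `points` is a list of (x, y, z); `cell_of(x, y)` returns
    (is_valid, cell_index).
    """
    # Start of frame: clear every cell to (Z_min=+inf, Z_max=-inf).
    grid = [(math.inf, -math.inf)] * num_cells
    points_bram = [None] * len(points)
    for i, (x, y, z) in enumerate(points):
        is_valid, idx = cell_of(x, y)
        if is_valid:
            # Action A: read-modify-write of (Z_min, Z_max).
            z_min, z_max = grid[idx]
            grid[idx] = (min(z_min, z), max(z_max, z))
            # Action B: buffer the point at address i.
            points_bram[i] = (True, z, idx)
        else:
            points_bram[i] = (False, None, None)  # don't-care fields
    return grid, points_bram


def toy_cell_of(x, y):
    """Hypothetical 4 x 4 grid with unit-sized cells."""
    col, row = math.floor(x), math.floor(y)
    valid = 0 <= col < 4 and 0 <= row < 4
    return valid, (row * 4 + col if valid else 0)


points = [(0.5, 0.5, 1.0), (0.2, 0.9, -0.5), (9.0, 0.0, 2.0)]
grid, points_bram = build_grid(points, 16, toy_cell_of)
```

In hardware, Actions A and B happen in parallel; the sequential loop here only captures the resulting memory contents.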

Ground Segmentation Core

Performs the second step: classifying points as ground or non-ground. It uses the pre-calculated grid information (Z_min, Z_max) to make its decision for each point.

After the Cell Grid Core has processed the entire frame, the Ground Segmentation Core starts reading the point data that was stored in the Points BRAM sequentially.

Data Fetch: The core iterates from address 0 to N-1 of the Points BRAM. In each cycle, it reads one entry: is_valid, p.z (the point's Z value), and p.cell_index.
It checks the is_valid flag first.
- If is_valid == 1: It proceeds with the normal FESTA classification logic (fetches Z_min/Z_max from the Grid BRAM using the cell index, runs the comparisons, etc.) and outputs the resulting 0 or 1 flag.
- If is_valid == 0: It bypasses all the classification logic. It immediately outputs a default classification flag. The most logical and safe default for a point outside the known area is ALFA_LABEL_NO_GROUND (0).
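
The is_valid bypass can be sketched as below. The FESTA classification rule itself is not reproduced in this document, so the classifier is passed in as a function; placeholder_rule is a made-up threshold used only to make the example run and is not the FESTA rule.

```python
ALFA_LABEL_NO_GROUND = 0  # default label named in the text above


def segment(points_bram, grid, classify):
    """Phase 2 sketch: one output flag per Points BRAM entry, in order."""
    flags = []
    for is_valid, point_z, idx in points_bram:
        if not is_valid:
            flags.append(ALFA_LABEL_NO_GROUND)  # bypass classification
            continue
        z_min, z_max = grid[idx]  # fetch from the Grid BRAM model
        flags.append(classify(point_z, z_min, z_max))
    return flags


def placeholder_rule(point_z, z_min, z_max):
    """NOT the FESTA rule: a stand-in threshold for demonstration only."""
    return 1 if point_z - z_min < 0.2 else 0


grid = {7: (0.0, 1.5)}
points_bram = [(True, 0.1, 7), (True, 1.4, 7), (False, None, None)]
flags = segment(points_bram, grid, placeholder_rule)
```

Because an invalid entry still produces a flag, the output keeps exactly one flag per input point.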

Start Segmentation: The Top-Level Controller signals the Ground Segmentation Core to begin.
- Point Classification: The Ground Segmentation Core reads the Points BRAM sequentially, from address 0 to the last point. For each entry (is_valid, point_z, cell_index):
- If the entry is valid, it uses the cell_index to read the Grid BRAM and fetch the final, correct Z_min and Z_max values for that cell.
- It executes the FESTA classification logic using these values.
- The resulting 1-bit is_ground flag is sent to the AXI-Stream Master Interface.
- Output Streaming: The AXI-Stream Master Interface streams the classification results out of the FPGA.
- End of Frame: When the last point has been classified, the Top-Level Controller marks the processing of

AXI4 Stream Master Interface

Purpose: To stream the final processed results out of the FPGA.
Functionality:
- Takes the classification flag from the Ground Segmentation Core.
- It can be configured to output only the flags or to combine the flags with the original point data (which would require another BRAM to store the full point cloud).
- Implements the AXI4-Stream protocol handshake to send the data to a downstream module.

Annotation to Segmentation

The Core Concept: Synchronization is Everything ⛓️

The fundamental principle is that the output stream of flags is perfectly synchronized with the original input point cloud.

The 1st flag out corresponds to the 1st point that went in.
The 100th flag out corresponds to the 100th point that went in.
The Nth flag out corresponds to the Nth point that went in.

The AXI stream of 0s and 1s is essentially a tag or a label for each point. The downstream system's job is to correlate these tags with the original point data, which it has kept a copy of.

The System-Level Workflow

Here is the step-by-step process that happens outside the FESTA accelerator, typically on the host processor (like the ARM core in a Zynq SoC).

Step 1: Buffer the Original Point Cloud

Before the point cloud frame is sent to the FPGA, the host processor makes sure it has a complete copy of it stored in main memory (DDR RAM). Let's call this OriginalPointCloud_FrameN.

Step 2: Stream Data and Receive Flags

The processor configures a DMA (Direct Memory Access) controller to perform two tasks:

Write Task: Read OriginalPointCloud_FrameN from the DDR RAM and stream it to the FESTA accelerator's AXI-Stream Slave input.
Read Task: Simultaneously, configure another DMA channel to listen to the FESTA accelerator's AXI-Stream Master output. It captures the incoming stream of 1-bit flags and writes them to a separate buffer in DDR RAM. Let's call this the FlagBuffer_FrameN.

Step 3: The "Merge" or "Correlation" Step (The Magic)

Once the DMA transfer is complete, the processor has two arrays in memory:

OriginalPointCloud_FrameN: Contains the full (X, Y, Z, intensity, etc.) data for every point.
FlagBuffer_FrameN: Contains the corresponding ground/non-ground flag for every point.

Now, the software can perform the actual segmentation by simply iterating through both arrays simultaneously.
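
A minimal sketch of that merge step follows, assuming the flags have already been unpacked into one integer per point and that 1 means ground (matching the is_ground naming used earlier). The function name split_by_flag is hypothetical.

```python
def split_by_flag(points, flags):
    """Pair each point with its flag by position and split into two lists."""
    if len(points) != len(flags):
        raise ValueError("point and flag counts differ")
    ground = [p for p, f in zip(points, flags) if f == 1]
    non_ground = [p for p, f in zip(points, flags) if f == 0]
    return ground, non_ground


original_point_cloud = [(1.0, 2.0, 0.05), (1.5, 2.5, 1.80), (3.0, 0.5, 0.02)]
flag_buffer = [1, 0, 1]
ground, non_ground = split_by_flag(original_point_cloud, flag_buffer)
```

The length check guards the one assumption this step relies on: exactly one flag per input point, in the same order.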

Grid BRAM

Size is determined by the grid dimensions (e.g., 512x216 cells).

Purpose: The central data structure of the algorithm, storing the elevation map (Z_min, Z_max) for the entire grid.
Configuration: A dual-port Block RAM.
- Port A (Read/Write Port): Connected to the Cell Grid Core for the read-modify-write updates.
- Port B (Read-Only Port): Connected to the Ground Segmentation Core to fetch the Z_min and Z_max values for classification.
Size: grid_width x grid_length (e.g., 512x256) x (Z_min_width + Z_max_width).
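
As a worked example of the size formula, the sketch below evaluates it for the 512x256 grid, assuming 16-bit Z_min and Z_max fields. The widths are an assumption; the formula itself is the one stated above.

```python
def bram_bits(depth, word_bits):
    """Total storage in bits for a memory of `depth` words."""
    return depth * word_bits


GRID_WIDTH, GRID_LENGTH = 512, 256  # example dimensions from this document
Z_MIN_WIDTH = Z_MAX_WIDTH = 16      # assumed field widths

grid_bits = bram_bits(GRID_WIDTH * GRID_LENGTH, Z_MIN_WIDTH + Z_MAX_WIDTH)
grid_kib = grid_bits // 8 // 1024
print(f"{grid_bits} bits = {grid_kib} KiB")
```

Under these assumptions the grid needs 4,194,304 bits, or 512 KiB, which is worth checking against the Block RAM capacity of the target device.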