
FPGA implementation of a memory-mapped multicore coprocessor

Tutorial 12 on Dedicated systems

Teacher: Giuseppe Scollo

University of Catania
Department of Mathematics and Computer Science
Graduate Course in Computer Science, 2019-20

Table of Contents

  1. FPGA implementation of a memory-mapped multicore coprocessor
  2. tutorial outline
  3. project workflow
  4. multicore coprocessor design
  5. coprocessor hardware interface
  6. coprocessor register map
  7. Nios II system with coprocessor and Performance Counter
  8. software driver
  9. test and performance measurement programs
  10. performance measurement outcomes
  11. references

tutorial outline

this tutorial deals with:

project workflow

development main phases:

multicore coprocessor design

two-step production of the VHDL description of the multicore coprocessor:

the respective sources delay_collatz.vhd and multicore_delay_collatz.vhd are available in folder vhdl of the attached archive, as well as in the VHDL/code/e12 folder of the reserved lab area

the coprocessor has an n-bit core_select input, which encodes the core that the I/O operation addresses, while the done outputs of the individual cores are exposed as a global status on a 2^n-bit parallel output port
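a minimal C sketch of how software might read the global status word, assuming n = 5 (32 cores, the number used by the test programs of this tutorial) and assuming that bit j of the status carries the done output of core j; the names N_SELECT_BITS, N_CORES, core_done and all_done are hypothetical and not taken from the provided sources:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed configuration: n = 5 core_select bits, hence 2^5 = 32 cores. */
#define N_SELECT_BITS 5u
#define N_CORES (1u << N_SELECT_BITS)

/* True if the done bit of the given core is set in the status word.
   Assumption: bit j of the status carries the done output of core j. */
bool core_done(uint32_t status, unsigned core)
{
    assert(core < N_CORES);
    return ((status >> core) & 1u) != 0u;
}

/* True if every core reports done.
   Comparing against UINT32_MAX is only valid because N_CORES == 32. */
bool all_done(uint32_t status)
{
    return status == UINT32_MAX;
}
```

with these assumptions, core_done(0x4, 2) holds while core_done(0x4, 3) does not; the actual bit ordering should be checked against multicore_delay_collatz.vhd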

the empty folders delay_collatz, mc_delay_collatz, and mc_interface are meant to host the compilation and simulation projects for the two sources mentioned above and for the interface source introduced next; the folders with the same names under tests provide the corresponding input files for simulation

coprocessor hardware interface

an instance of the multicore coprocessor component is embedded in the Avalon memory-mapped interface described by multicore_delay_collatz_avalon_interface.vhd and accesses the following Avalon bus signals:

gathering the 64-bit input for the coprocessor thus takes two bus cycles, so the interface must store the first-cycle data and later concatenate it with the second-cycle data; this leads to the classical two-process structure of the description:

on the other hand, returning the 16-bit datum produced by a coprocessor core on the 32-bit bus requires zero-extending it, which is done by the interface
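the data path described above can be modelled in software as follows; this is only an illustrative C sketch, not the VHDL of the interface: the type x0_latch_t and the functions latch_low, concat_high and zero_extend16 are hypothetical, and the write order (low word first, high word second) is an assumption:

```c
#include <stdint.h>

/* Hypothetical software model of the interface data path. */
typedef struct {
    uint32_t low_word;   /* data stored during the first bus cycle */
} x0_latch_t;

/* First bus cycle: store the low 32 bits (assumed to arrive first). */
void latch_low(x0_latch_t *latch, uint32_t writedata)
{
    latch->low_word = writedata;
}

/* Second bus cycle: concatenate the incoming high word with the stored
   low word to form the 64-bit coprocessor input. */
uint64_t concat_high(const x0_latch_t *latch, uint32_t writedata)
{
    return ((uint64_t)writedata << 32) | latch->low_word;
}

/* Read path: zero-extend the 16-bit core output to the 32-bit bus width. */
uint32_t zero_extend16(uint16_t delay)
{
    return (uint32_t)delay;
}
```

in the hardware description the stored word corresponds to a register updated by the clocked process, while the concatenation and the zero-extension are purely combinational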

inspecting multicore_delay_collatz_avalon_interface.vhd shows the relationships between the I/O signals of the computational component and the Avalon interface signals

coprocessor register map

the Qsys construction of a Nios II system with the coprocessor component, similar to that of the previous lab tutorial, assigns the coprocessor a base address and, starting at it, a memory area for its I/O registers

the following register map also shows the coprocessor component signals determined by the corresponding register offsets, indexed by the value of core_select in parentheses, where k = 2^n is the number of parallel cores; legend: ro = register offset, ao = address offset in bytes (ao = 4 · ro):

ro       signal            ao
0        x0(0)[31..0]      0
1        x0(0)[63..32]     4
...      ...               ...
2(k-1)   x0(k-1)[31..0]    8(k-1)
2k-1     x0(k-1)[63..32]   8k-4
2k       delay(0)          8k
...      ...               ...
3k-1     delay(k-1)        12k-4
3k       status            12k
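the byte offsets in the register map follow a regular pattern that can be written down as a small C sketch; the function names are hypothetical (they are not the macros of the provided HAL driver), and MDC_N_CORES is set to 32 to match the test programs of this tutorial:

```c
#include <assert.h>
#include <stdint.h>

#define MDC_N_CORES 32u   /* k = 2^n parallel cores (32 assumed here) */

/* Byte offset of x0(core)[31..0]: register offset 2*core. */
uint32_t x0_low_offset(unsigned core)
{
    assert(core < MDC_N_CORES);
    return 8u * core;
}

/* Byte offset of x0(core)[63..32]: register offset 2*core + 1. */
uint32_t x0_high_offset(unsigned core)
{
    assert(core < MDC_N_CORES);
    return 8u * core + 4u;
}

/* Byte offset of delay(core): register offset 2k + core. */
uint32_t delay_offset(unsigned core)
{
    assert(core < MDC_N_CORES);
    return 8u * MDC_N_CORES + 4u * core;
}

/* Byte offset of the status register: register offset 3k. */
uint32_t status_offset(void)
{
    return 12u * MDC_N_CORES;
}
```

with k = 32 the delay registers occupy byte offsets 256 to 380 and the status register sits at byte offset 384; adding these offsets to the base address assigned by Qsys gives the absolute register addresses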

Nios II system with coprocessor and Performance Counter

the subsequent development phases are similar to those of the previous lab tutorial:

the Qsys construction of the Nios II system is quicker if performed as a modification of the Qsys system from the previous lab tutorial, by removing the delay_collatz_avalon_interface component and adding an instance of the multicore_delay_collatz_avalon_interface component

software driver

the TCL scripts for the generation of the software driver in the project BSP, provided in folder codesign/ip/multicore_delay_collatz_avalon_interface of the attached archive, are similar to those of the previous lab tutorial

the C sources of the software driver, provided in folder HAL under the same path, differ from those of the previous lab tutorial in the following aspects:

test and performance measurement programs

the test and performance measurement programs provided in folders codesign/amp* of the attached archive compute the delay for 2M initial points, starting with X_BASE = 1128784494896128

in both versions of the test, the program assigns core j the delay computation for the initial points x0 in the congruence class j = x0 mod MDC_N_CORES, thus for 2M/32 = 64K trajectories (on average in the second version); the difference between the versions in codesign/amp_s* and those in codesign/amp_t* is as follows:

project creation parameters for the Monitor Program are summarized in the attached file MulticoreMonitorNotes.txt
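the assignment of initial points to cores by congruence class can be checked with a short C sketch; it assumes that the 2M initial points are the consecutive integers starting at X_BASE, and the names N_POINTS, core_for and count_per_core are hypothetical rather than taken from the provided test programs:

```c
#include <stdint.h>

#define MDC_N_CORES 32u
#define N_POINTS (2u * 1024u * 1024u)   /* 2M initial points */
#define X_BASE 1128784494896128ull

/* Core in charge of initial point x0: its congruence class mod MDC_N_CORES. */
unsigned core_for(uint64_t x0)
{
    return (unsigned)(x0 % MDC_N_CORES);
}

/* Count how many of the N_POINTS initial points are assigned to each core.
   Assumption: the initial points are X_BASE, X_BASE + 1, ... */
void count_per_core(uint32_t counts[MDC_N_CORES])
{
    for (unsigned j = 0; j < MDC_N_CORES; j++) {
        counts[j] = 0;
    }
    for (uint64_t i = 0; i < N_POINTS; i++) {
        counts[core_for(X_BASE + i)]++;
    }
}
```

under this assumption every core receives exactly 2M/32 = 65536 initial points, since 2M is a multiple of 32; X_BASE itself is a multiple of 32, so it falls to core 0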

performance measurement outcomes

compilation, loading on the FPGA and execution of program sequential_multicore_delay_collatz_timing.c, in the two projects codesign/amp_s and codesign/amp_s_o3, produces the Performance Counter Reports in the figure

Performance Report for the sequential version, optimization O1

Performance Report for the sequential version, optimization O3

the next Performance Counter Reports come out of the execution of program statustest_multicore_delay_collatz_timing, in the two projects codesign/amp_t and codesign/amp_t_o3

Performance Report for the status-tested version, optimization O1

Performance Report for the status-tested version, optimization O3

references

useful materials for the proposed lab experience: