Design and implementation of a memory-mapped multicore coprocessor

Lecture 12 on Dedicated systems

Teacher: Giuseppe Scollo

University of Catania
Department of Mathematics and Computer Science
Graduate Course in Computer Science, 2018-19

Table of Contents

  1. Design and implementation of a memory-mapped multicore coprocessor
  2. lecture topics
  3. design space
  4. codesign evolution of the delay computation
  5. structure of a multicore coprocessor
  6. hardware interface constraints
  7. project workflow
  8. hardware design of the multicore coprocessor
  9. coprocessor hardware interface
  10. coprocessor register map
  11. Nios II system with coprocessor and Performance Counter
  12. software driver
  13. test and performance measurement programs
  14. performance measurement outcomes
  15. references

lecture topics

outline:

design space

possible evolutions of the codesign implemented in the previous lab tutorial may be conceived along two orthogonal development directions:

an example that combines both directions is given by the following objectives for a first project, which is dealt with here:

more significant functional extensions may be identified by considering functions defined over Collatz trajectories, other than the delay, whose computation requires trajectory generation anyway

codesign evolution of the delay computation

a first design alternative to be considered about the replication of the hardware unit for the delay computation is:

the latter option is preferable in view of possible further extensions that would require access by the different instances to shared data, e.g. defined as configuration parameters

other design decisions relate to the number of parallel instances of the computational component, henceforth termed cores, and the size of the coprocessor I/O data

structure of a multicore coprocessor

the 64-bit extension of the single core input is easily obtained by a straightforward modification of the Gezel source from the previous lab tutorial, with the same correction to the VHDL output of the fdlvhd translator
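As a point of comparison for the hardware core, the following is a minimal software reference model with a 64-bit input and a 16-bit output. It assumes that the delay is the number of Collatz steps (x/2 for even x, 3x+1 for odd x) needed to reach 1; the exact step counting of the core is fixed by the Gezel source of the previous lab tutorial and may differ. The name collatz_delay_ref is hypothetical.

```c
#include <stdint.h>

/* Software reference model of one core (hypothetical helper).
 * Assumption: delay = number of steps x -> x/2 (x even) or
 * x -> 3x+1 (x odd) needed to reach 1 from x0. */
uint16_t collatz_delay_ref(uint64_t x0)
{
    uint64_t x = x0;
    uint16_t delay = 0;

    while (x > 1) {
        /* 3x+1 may wrap around for very large 64-bit values */
        x = (x & 1u) ? 3 * x + 1 : x >> 1;
        delay++;
    }
    return delay;
}
```

Such a model is only meant for cross-checking simulation or test outputs on a host machine, not as a description of the hardware.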

the multicore coprocessor then ought to have circuits for the correct dispatching of I/O data between the interface and a core selected by the processor:

description in VHDL of multiplexing is simple if the core outputs are concatenated in a 2^n×16-bit vector, since a selection operator on the vector is then enough

the more complex description in VHDL of demultiplexing is feasible by means of a logical shift operator, as exemplified for a generic decoder in Zwoliński, 4.2.3
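The two dispatching circuits can be illustrated by a small behavioural model in C rather than VHDL. This is a sketch under assumptions: N_SEL = 5 is a hypothetical value of n, the array models the concatenated vector of 16-bit core outputs, and the function names are not taken from the course sources.

```c
#include <stdint.h>

#define N_SEL   5u                /* hypothetical n: width of core_select */
#define N_CORES (1u << N_SEL)     /* 2^n cores */

/* Demultiplexing: a one-hot enable vector obtained by a logical
 * left shift of 1 by the selected core index (generic decoder). */
uint32_t decode_one_hot(unsigned core_select)
{
    return (uint32_t)1 << (core_select & (N_CORES - 1u));
}

/* Multiplexing: selection of one 16-bit slice out of the
 * concatenated core outputs, modelled here as an array. */
uint16_t select_delay(const uint16_t delays[N_CORES], unsigned core_select)
{
    return delays[core_select & (N_CORES - 1u)];
}
```

In the model, bit j of the decoder output stands for the write enable of core j, so exactly one core receives the input data in a write operation.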

hardware interface constraints

signal exchange at the multicore coprocessor I/O ports is to be adapted to the available signals of the Avalon-MM interface, taking a few constraints on these into account, such as:

since the Nios II processor may transfer at most 32 bits in a single bus transaction, this is taken by convention as the coprocessor word width at the Avalon interface, that is to say, the width of the writedata and readdata signals

the register address space of the coprocessor is thus the interval [0, 3×2^n] (two input words and one output word for each of the 2^n cores, plus one address for the status register), hence the address signal is (n+2) bits wide

project workflow

development main phases:

hardware design of the multicore coprocessor

two-step production of the VHDL description of the multicore coprocessor:

the respective sources delay_collatz.vhd and multicore_delay_collatz.vhd are available in folder vhdl of the attached archive

the coprocessor is endowed with the n-bit core_select input, which encodes the core that the I/O operation is addressed to, while the done outputs of the individual cores are exposed as global status in a 2^n-bit parallel output port
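From the software side, the global status can be viewed as a bit vector with one done flag per core. The sketch below assumes 32 cores, so that the status fits in one 32-bit word, and assumes that bit j of the status corresponds to core j; both helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: bit j of the status word is the done output of core j. */
int core_done(uint32_t status, unsigned j)
{
    return (int)((status >> (j & 31u)) & 1u);
}

/* Index of the lowest-numbered core whose done flag is set, or -1. */
int first_done_core(uint32_t status)
{
    for (unsigned j = 0; j < 32u; j++) {
        if ((status >> j) & 1u)
            return (int)j;
    }
    return -1;
}
```

A program that polls the status could use such a test to decide which core to read from next, instead of waiting on the cores in a fixed order.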

folders delay_collatz, mc_delay_collatz, and mc_interface are meant to host compilation and simulation projects for the two mentioned sources and the next one; folders with the same names under tests provide respective input files for simulation

coprocessor hardware interface

an instance of the multicore coprocessor component is embedded in the Avalon memory-mapped interface described by multicore_delay_collatz_avalon_interface.vhd and accesses the following Avalon bus signals:

the gathering of the 64-bit input for the coprocessor thus takes two bus cycles, therefore the interface must store the first-cycle data and later concatenate it with the second-cycle data; this leads to the classical two-process structure of the description:

on the other hand, the 32-bit output of the 16-bit datum produced by a coprocessor core requires zero-extension of the latter, which is done by the interface
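The data path of the interface can be mimicked in C as follows. This is only a behavioural sketch: the structure and function names are hypothetical, and it assumes that the low word x0[31..0] is written in the first bus cycle and the high word x0[63..32] in the second, which the text does not state explicitly.

```c
#include <stdint.h>

/* Hypothetical model of the interface state: the word stored
 * in the first bus cycle (assumed to be the low word). */
typedef struct {
    uint32_t low_word;
} mdc_if_model;

/* First cycle: store the 32-bit writedata. */
void store_low_word(mdc_if_model *m, uint32_t writedata)
{
    m->low_word = writedata;
}

/* Second cycle: concatenate the new word with the stored one. */
uint64_t concat_input(const mdc_if_model *m, uint32_t writedata_high)
{
    return ((uint64_t)writedata_high << 32) | m->low_word;
}

/* Read: zero-extend the 16-bit core output to the 32-bit readdata. */
uint32_t zero_extend_delay(uint16_t delay)
{
    return (uint32_t)delay;
}
```

In the VHDL description the stored word corresponds to a register updated in the clocked process, while concatenation and zero-extension are purely combinational.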

inspection of multicore_delay_collatz_interface.vhd shows the relationships between the I/O signals of the computational component and the Avalon interface signals

coprocessor register map

the Qsys construction of a Nios II system with the coprocessor component, similar to that of the previous lab tutorial, assigns the coprocessor a base address and, starting at it, a memory area for its I/O registers

the following register map also shows the coprocessor component signals determined by the corresponding register offsets, indexed by the value of core_select in parentheses, where k = 2^n is the number of parallel cores; in the table, ro is the register offset in 32-bit words and ao is the corresponding byte address offset (ao = 4×ro):

ro       signal            ao
0        x0(0)[31..0]      0
1        x0(0)[63..32]     4
...      ...               ...
2(k-1)   x0(k-1)[31..0]    8(k-1)
2k-1     x0(k-1)[63..32]   8k-4
2k       delay(0)          8k
...      ...               ...
3k-1     delay(k-1)        12k-4
3k       status            12k
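The register map can be expressed as a few offset computations. The sketch below takes k = 32 for MDC_N_CORES, a value suggested by the 2M/32 figure given for the test programs; the helper names are hypothetical and are not those of the provided driver.

```c
#include <stdint.h>

#define MDC_N_CORES 32u   /* k = 2^n; assumed value, n = 5 */

/* Register offsets (in 32-bit words) for core j, 0 <= j < k. */
uint32_t mdc_ro_x0_low(uint32_t j)  { return 2u * j; }        /* x0(j)[31..0]  */
uint32_t mdc_ro_x0_high(uint32_t j) { return 2u * j + 1u; }   /* x0(j)[63..32] */
uint32_t mdc_ro_delay(uint32_t j)   { return 2u * MDC_N_CORES + j; }
uint32_t mdc_ro_status(void)        { return 3u * MDC_N_CORES; }

/* Byte address offset of a register offset: ao = 4 * ro. */
uint32_t mdc_ao(uint32_t ro)        { return 4u * ro; }
```

With k = 32 the status register sits at register offset 96, which fits in the (n+2) = 7 address bits of the interface.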

Nios II system with coprocessor and Performance Counter

the subsequent development phases are similar to those of the previous lab tutorial:

the Qsys construction of the Nios II system is quicker if carried out as a modification of the Qsys system from the previous lab tutorial, by removing the delay_collatz_avalon_interface component and adding an instance of the multicore_delay_collatz_avalon_interface component

software driver

the TCL scripts for the generation of the software driver in the project BSP, provided in folder codesign/ip/multicore_delay_collatz_avalon_interface of the attached archive, are similar to those of the previous lab tutorial

the C sources of the software driver, provided in folder HAL under the same path, differ from those of the previous lab tutorial in the following aspects:

test and performance measurement programs

the test and performance measurement programs provided in folders codesign/amp* of the attached archive compute the delay for 2M initial points, starting with X_BASE = 1128784494896128

in both versions of the test, the program assigns core j the delay computation for the initial points in the congruence class j mod MDC_N_CORES, thus for 2M/32 = 64K trajectories (on average in the second version); the difference between the versions in codesign/amp_s* and those in codesign/amp_t* is as follows:
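The assignment of initial points to cores by congruence class can be sketched as follows. The sketch assumes that the 2M initial points are the consecutive integers starting at X_BASE and that MDC_N_CORES is 32, as the 2M/32 figure suggests; N_POINTS and the two functions are hypothetical names, not those of the provided programs.

```c
#include <stdint.h>

#define MDC_N_CORES 32u                    /* assumed number of cores */
#define X_BASE      1128784494896128ULL    /* first initial point */
#define N_POINTS    (2u * 1024u * 1024u)   /* 2M points (hypothetical name) */

/* Core in charge of initial point x: its congruence class mod MDC_N_CORES. */
unsigned core_of(uint64_t x)
{
    return (unsigned)(x % MDC_N_CORES);
}

/* Number of initial points in [X_BASE, X_BASE + N_POINTS) assigned to core j. */
uint32_t count_points_for_core(unsigned j)
{
    uint32_t count = 0;
    for (uint32_t i = 0; i < N_POINTS; i++) {
        if (core_of(X_BASE + i) == j)
            count++;
    }
    return count;
}
```

Under these assumptions every core gets exactly 64K initial points, and since X_BASE is a multiple of 32, core 0 gets the first one.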

project creation parameters for the Monitor Program are summarized in the attached file MonitorNotes.txt

performance measurement outcomes

compilation, loading on the FPGA and execution of program sequential_multicore_delay_collatz_timing.c, in the two projects codesign/amp_s and codesign/amp_s_o3, produces the Performance Counter Reports in the figure

Performance Report for the sequential version, optimization O1

Performance Report for the sequential version, optimization O3

the next Performance Counter Reports result from the execution of program statustest_multicore_delay_collatz_timing, in the two projects codesign/amp_t and codesign/amp_t_o3

Performance Report for the status-tested version, optimization O1

Performance Report for the status-tested version, optimization O3

references

useful materials for the proposed lab experience: