FPGA implementation of a memory-mapped coprocessor

Tutorial 11 on Dedicated systems

Teacher: Giuseppe Scollo

University of Catania
Department of Mathematics and Computer Science
Graduate Course in Computer Science, 2018-19

Table of Contents

  1. FPGA implementation of a memory-mapped coprocessor
  2. tutorial outline
  3. design decisions for a hardware acceleration case study
  4. Avalon interface and programming model for the case study
  5. project workflow
  6. coprocessor hardware interface
  7. coprocessor as a Qsys component (1)
  8. coprocessor as a Qsys component (2)
  9. coprocessor as a Qsys component (3)
  10. Nios II system with coprocessor and Performance Counter
  11. mapping to FPGA and compilation
  12. software driver
  13. test and performance measurement programs (1)
  14. test and performance measurement programs (2)
  15. test with blocking acceleration
  16. test with nonblocking acceleration
  17. references

tutorial outline

this tutorial deals with a hardware acceleration case study

design decisions for a hardware acceleration case study

the previous lab tutorial presented a software implementation of the delay computation of a Collatz trajectory with given start point
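as a reminder of what is being accelerated, a minimal C sketch of the delay computation follows; it is not the source of the previous lab tutorial, and both the function name collatz_delay and the definition of delay as the number of steps needed to reach 1 are assumptions made here for illustration

```c
#include <assert.h>
#include <stdint.h>

/* Collatz delay of x0: number of steps needed to reach 1
 * (assumed definition; the tutorial's own convention may differ). */
unsigned int collatz_delay(uint32_t x0)
{
    assert(x0 >= 1);          /* the trajectory is only defined for positive start points */

    uint64_t x = x0;          /* 64-bit working value: 3x+1 may exceed 32 bits */
    unsigned int steps = 0;

    while (x != 1) {
        x = (x % 2 == 0) ? x / 2 : 3 * x + 1;
        steps++;
    }
    return steps;
}
```

for instance, start point 6 follows the trajectory 6, 3, 10, 5, 16, 8, 4, 2, 1, hence a delay of 8 under this definition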

hardware implementations of the same function were the subject of previous lab experiences

the performance measurements carried out on the software implementation show that the delay computation consumes almost all of the program execution time

a first choice to evaluate: should the hardware function be integrated as a custom instruction or as a memory-mapped coprocessor?

other design decisions depend on this first decision, as follows

Avalon interface and programming model for the case study

the VHDL description of the circuit that computes the function is to be embedded into a component equipped with Avalon interfaces for the Clock, Reset, and Avalon MM Slave signals, so that it receives the initial data through a write operation and returns the result in reply to a read operation

addressing of the coprocessor: since the (initial data) write and (final result) read operations take place at different times and have the same data size, a single address suffices

software driver: two macros and a function may be defined for the bus access software interface: DC_RESET(d), DC_START(d,x0), unsigned int delay(d), where d is the address assigned to the coprocessor
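one possible shape of this interface is sketched below in C; only the three names DC_RESET, DC_START and delay come from the tutorial, everything else is an assumption: the helper macro DC_WORD is hypothetical, the register is assumed to be 32 bits wide, and resetting by writing 0 is a guess at the protocol; plain volatile accesses are used to keep the sketch portable, whereas on a Nios II with a data cache the HAL macros IOWR and IORD from io.h would be the usual way to bypass the cache

```c
#include <stdint.h>

/* The single 32-bit word at address d (hypothetical helper, assumed register width). */
#define DC_WORD(d)      (*(volatile uint32_t *)(d))

/* Assumption: writing 0 brings the coprocessor back to its idle state. */
#define DC_RESET(d)     (DC_WORD(d) = 0u)

/* Write the start point x0; the write itself launches the computation. */
#define DC_START(d, x0) (DC_WORD(d) = (uint32_t)(x0))

/* Read the result back from the same address. */
static inline unsigned int delay(volatile uint32_t *d)
{
    return (unsigned int)DC_WORD(d);
}
```

on the target, d would be the base address that Qsys assigns to the coprocessor; if the macros are tried on a host with d pointing to an ordinary variable, the read simply returns the value last written, since no coprocessor sits behind that address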

project workflow

development main phases:

coprocessor hardware interface

two VHDL sources implement the memory-mapped coprocessor:

both files are available in the vhdl folder of the attached archive, which is also located in the Nios II folder of the reserved lab area

the delay_collatz_interface.vhd source shows how the I/O signals of the computational component relate to the Avalon interface signals

coprocessor as a Qsys component (1)

folder codesign in the attached archive is preset to host the project development

after creation of project delay_collatz_codesign, with top-level entity having the same name, the construction of the custom component delay_collatz_interface may proceed

the new component type definition is shown in the figure

definition of a new component type delay_collatz_avalon_interface

coprocessor as a Qsys component (2)

the next step is the assignment of VHDL files that describe the component and their analysis, as shown in the figure

definition and analysis of files for synthesis of the component

coprocessor as a Qsys component (3)

finally, the new component definition ends with the definition of its Avalon interfaces and placement of its signals under the appropriate interfaces, as shown in the figure

definition of Avalon signals and interfaces of the component

Nios II system with coprocessor and Performance Counter

structure of the hardware system built with Qsys

address map following the Qsys assignments to system components

mapping to FPGA and compilation

for the construction of the Nios II system shown in the previous figures it may be useful to consult the Qsys introduction tutorial

the final steps to map the system to the FPGA are as follows:

in Qsys:

exit Qsys, then in Quartus:

software driver

folder script in the attached archive contains two TCL scripts for the generation of the software driver in the BSP for the project

these two scripts are to be copied into folder codesign/ip/delay_collatz_avalon_interface

the TCL scripts were written by analogy with the TCL script for the software driver of the Performance Counter, available in the Quartus Prime Lite 16.1 distribution under path
$SOPC_KIT_NIOS2/../ip/altera/sopc_builder_ip/altera_avalon_performance_counter

the motivation for this, perhaps unorthodox, way of producing the software driver lies in the twofold fact that

together with a somewhat reasonable level of operational analogy between the two components

test and performance measurement programs (1)

folder src in the attached archive contains the subject programs, which are to be copied into the provided folders for the creation of test and performance measurement projects under the Monitor Program, as follows:

project creation parameters are summarized in the attached file MonitorNotes.txt

main differences between the source of lab tutorial 10 and the present sequential version:

test and performance measurement programs (2)

the pipelined version of the program exhibits much stronger differences with respect to the program of lab tutorial 10:

the synchronization mechanism is very simple, thanks to properties of the custom component and of the waitrequest signal of the Avalon MM protocol:

test with blocking acceleration

compilation, loading on the FPGA and execution of program delay_collatz_sequential_timing.c, in the two projects codesign/amp_s and codesign/amp_s_o3, produces the Performance Counter Reports in the figure

Performance Report for the sequential version, optimization O1

Performance Report for the sequential version, optimization O3

comparison with the performance data of the software computation in lab tutorial 10, at the same optimization levels, shows a speed-up of about an order of magnitude

Performance Report for the software version, optimization O1

Performance Report for the software version, optimization O3

test with nonblocking acceleration

it is sensible to expect a further performance gain from the nonblocking execution of the computation by the custom hardware
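the idea behind this expectation can be sketched in C as follows; this is not the tutorial's pipelined program, and all names (dc_model_t, dc_start, dc_delay, max_delay_pipelined) are hypothetical: a host-side software model stands in for the coprocessor, so nothing actually runs concurrently here, and the search for the maximum delay over a range is just an example workload; the point is only the ordering of operations, where the next start point is written before the previous result is processed

```c
#include <stdint.h>

/* Host-side stand-in for the coprocessor (hypothetical): one word, as on the bus. */
typedef struct { uint32_t x0; } dc_model_t;

/* Models the write that launches the computation for start point x0. */
static void dc_start(dc_model_t *d, uint32_t x0) { d->x0 = x0; }

/* Models the read of the result; on the real hardware the bus read
 * would be held (waitrequest) until the result is ready. */
static unsigned int dc_delay(const dc_model_t *d)
{
    uint64_t x = d->x0;
    unsigned int steps = 0;
    while (x > 1) {
        x = (x & 1u) ? 3 * x + 1 : x / 2;
        steps++;
    }
    return steps;
}

/* Maximum delay over [first, last], assuming 1 <= first <= last < UINT32_MAX. */
unsigned int max_delay_pipelined(dc_model_t *d, uint32_t first, uint32_t last)
{
    unsigned int best = 0;

    dc_start(d, first);                    /* prime the pipeline */
    for (uint32_t x = first; x <= last; x++) {
        unsigned int cur = dc_delay(d);    /* collect the pending result */
        if (x < last)
            dc_start(d, x + 1);            /* launch the next computation at once */
        if (cur > best)                    /* software work that, on hardware, */
            best = cur;                    /* would overlap with the coprocessor */
    }
    return best;
}
```

with blocking acceleration, by contrast, each start would be followed immediately by its own read, so the processor would sit idle for the whole duration of every hardware computation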

comparing the following Performance Counter Reports with the corresponding data for the all-software implementation yields a 21x speed-up with the default optimization O1 and a 16x speed-up with optimization O3; the corresponding speed-up values with blocking acceleration are 15x with O1 and 13x with O3

Performance Report for the pipelined version, optimization O1

Performance Report for the pipelined version, optimization O3
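the gain of nonblocking over blocking acceleration follows from the quoted speed-ups; writing S for the speed-up with respect to the all-software time (notation introduced here for illustration), the ratio of the two accelerated execution times equals the ratio of the speed-ups; since the quoted figures are rounded, the results are only approximate

```latex
\[
\frac{T_{\text{blocking}}}{T_{\text{nonblocking}}}
  = \frac{S_{\text{nonblocking}}}{S_{\text{blocking}}},
\qquad
\text{O1: } \frac{21}{15} = 1.4,
\qquad
\text{O3: } \frac{16}{13} \approx 1.23
\]
```

that is, the nonblocking version runs roughly 1.4 times faster than the blocking one with O1, and roughly 1.2 times faster with O3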

references

useful materials for the proposed lab experience: