Merge pull request #38 from level2fast/wes237c-sdaniels-dft
Wes237c sdaniels dft
rck289 authored Dec 8, 2023
2 parents 2ee470d + 093eea3 commit ec52198
Showing 3 changed files with 5 additions and 5 deletions.
2 changes: 1 addition & 1 deletion cordic.tex
@@ -230,7 +230,7 @@ \section{Background}
Table \ref{table:cordic} provides the statistics for the first seven iterations of a CORDIC. The first row is the ``zeroth'' rotation (i.e., when $i=0$), which is a $45^{\circ}$ rotation. It performs a scaling of the vector by a factor of $1.41421$. The second row does a rotation by $2^{-1} = 0.5$. This results in a rotation by $\theta = \arctan 2^{-1} = 26.565^{\circ}$. This rotation scales the vector by $1.11803$. The CORDIC gain is the overall scaling of the vector. In this case, it is the scaling factor of the first two rotations, i.e., $1.58114 = 1.41421 \cdot 1.11803$. This process continues by incrementing $i$ which results in smaller and smaller rotating angles and scaling factors. Note that the CORDIC gain starts to stabilize to $\approx 1.64676025812107$ as described in Equation \ref{eq:cordicgain}. Also, note as the angles get smaller, they have less effect on the most significant digits.

\begin{exercise}
- Describe the effect if the $i$th iteration on the precision of the results? That is, what bits does it change? How does more iterations change the precision of the final result, i.e., how do the values of $\sin \phi$ and $\cos \phi$ change as the CORDIC performs more iterations?
+ Describe the effect of the $i$th iteration on the precision of the results. That is, what bits does it change? How do more iterations change the precision of the final result, i.e., how do the values of $\sin \phi$ and $\cos \phi$ change as the CORDIC performs more iterations?
\end{exercise}


6 changes: 3 additions & 3 deletions dft.tex
@@ -104,7 +104,7 @@ \section{Fourier Series}
\end{figure}


- Both of these relationships can be visualized as vectors in the complex plane as shown in Figure \ref{fig:sin_cos_exp}. Part a) shows the cosine derivation. Here we add the two complex vectors $e^{jx}$ and $e^{-jx}$. Note that the sum of these two vectors results in a vector on the real (in-phase or I) axis. The magnitude of that vector is $2 \cos(x)$. Thus, by dividing the sum of these two complex exponentials by $2$, we get the value $\cos (x)$ as shown in Equation \ref{eq:cos_exp}. Figure \ref{fig:sin_cos_exp} b) shows the similar derivation for sine. Here we are adding the complex vectors $e^{jx}$ and $-e^{-jx}$. The result of this is a vector on the imaginary (quadrature or Q) axis with a magnitude of $2 \sin (x)$. Therefore, we must divide by $2j$ in order to get $\sin (x)$. Therefore, this validates the relationship as described in Equation \ref{eq:sin_exp}.
+ Both of these relationships can be visualized as vectors in the complex plane, as shown in Figure \ref{fig:sin_cos_exp}. Part a) shows the cosine derivation. Here we add the two complex vectors $e^{jx}$ and $e^{-jx}$. Note that the sum of these two vectors results in a vector on the real (in-phase or I) axis. The magnitude of that vector is $2 \cos(x)$. Thus, by dividing the sum of these two complex exponentials by $2$, we get the value $\cos (x)$ as shown in Equation \ref{eq:cos_exp}. Figure \ref{fig:sin_cos_exp} b) shows a similar derivation for sine. Here we are adding the complex vectors $e^{jx}$ and $-e^{-jx}$. The result of this is a vector on the imaginary (quadrature or Q) axis equal to $2j \sin (x)$. Therefore, we must divide by $2j$ in order to get $\sin (x)$. This validates the relationship as described in Equation \ref{eq:sin_exp}.


\section{\gls{dft} Background}
@@ -257,7 +257,7 @@ \section{Storage Tradeoffs and Array Partitioning}

Up until this point, we have assumed that the data in arrays (\lstinline|V_In[]|, \lstinline|M[][]|, and \lstinline|V_Out[]|) is accessible at any time. In practice, however, the placement of the data plays a crucial role in performance and resource usage. In most processor systems, the memory architecture is fixed and we can only adapt the program to best make use of the available memory hierarchy, taking care to minimize register spills and cache misses, for instance. In HLS designs, we can also explore and leverage different memory structures, and we often try to find the memory structure that best matches a particular algorithm. Typically, large amounts of data are stored in off-chip memory, such as DRAM, flash, or even network-attached storage. However, data access times are typically long, on the order of tens to hundreds (or more) of cycles. Off-chip storage also requires relatively large amounts of energy to access, because large amounts of current must flow through long wires. On-chip storage, in contrast, can be accessed quickly and with much less energy. However, it is more limited in the amount of data that can be stored. A common pattern is to load data into on-chip memory in a block, where it can then be operated on repeatedly. This is similar to the effect of caches in the memory hierarchy of general-purpose CPUs.

- The primary choices for on-chip storage on in embedded memories (e.g., block RAMs) or in flip-flops (FFs). These two options have their own tradeoffs. Flip-flop based memories allow for multiple reads at different addresses in a single clock. It is also possible to read, modify, and write a Flip-flop based memory in a single clock cycle. However, the number of FFs is typically limited to around 100 Kbytes, even in the largest devices. In practice, most flip-flop based memories should be much smaller in order to make effective use of other FPGA resources. Block RAMs (BRAMs) offer higher capacity, on the order Mbytes of storage, at the cost of limited accessibility. For example, a single BRAM can store more than 1-4 Kbytes of data, but access to that data is limited to two different addresses each clock cycle. Furthermore, BRAMs are required to have a minimum amount of pipelining (i.e. the read operation must have a latency of at least one cycle). Therefore, the fundamental tradeoff boils down to the required bandwidth versus the capacity.
+ The primary choices for on-chip storage are embedded memories (e.g., block RAMs) or flip-flops (FFs). These two options have their own tradeoffs. Flip-flop-based memories allow for multiple reads at different addresses in a single clock cycle. It is also possible to read, modify, and write a flip-flop-based memory in a single clock cycle. However, the total FF capacity is typically limited to around 100 Kbytes, even in the largest devices. In practice, most flip-flop-based memories should be much smaller in order to make effective use of other FPGA resources. Block RAMs (BRAMs) offer higher capacity, on the order of Mbytes of storage, at the cost of limited accessibility. For example, a single BRAM can store 1-4 Kbytes of data, but access to that data is limited to two different addresses each clock cycle. Furthermore, BRAMs are required to have a minimum amount of pipelining (i.e., the read operation must have a latency of at least one cycle). Therefore, the fundamental tradeoff boils down to the required bandwidth versus the capacity.

If throughput is the number one concern, all of the data would be stored in FFs. This would allow any element to be accessed as many times as it is needed each clock cycle. However, as the size of arrays grows large, this is not feasible. In the case of matrix-vector multiplication, storing a 1024 by 1024 matrix of 32-bit integers would require about 4 MBytes of memory. Even using BRAM, this storage would require about 1024 BRAM blocks, since each BRAM stores around 4KBytes. On the other hand, using a single large BRAM-based memory means that we can only access two elements at a time. This obviously prevents higher performance implementations, such as in Figure \ref{fig:matrix_vector_unroll_inner_dfg}, which require accessing multiple array elements each clock cycle (all eight elements of \lstinline|V_In[]| along with 8 elements of \lstinline|M[][]|). In practice, most designs require larger arrays to be strategically divided into smaller BRAM memories, a process called \gls{arraypartitioning}. Smaller arrays (often used for indexing into larger arrays) can be partitioned completely into individual scalar variables and mapped into FFs. Matching pipelining choices and array partitioning to maximize the efficiency of operator usage and memory usage is an important aspect of design space exploration in HLS.
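The combination of pipelining and partitioning described above can be sketched in the HLS style the book uses. The pragmas follow Vivado HLS conventions, the partitioning choice (complete partitioning of \lstinline|V_In[]| and of the second dimension of \lstinline|M[][]|) is one reasonable option among several, and an ordinary C compiler ignores the pragmas, so the function also runs in software.

```c
#define SIZE 8

typedef int BaseType;

/* Matrix-vector multiply. Partitioning M along its second dimension
 * and V_In completely lets a pipelined outer loop read all eight
 * operand pairs in the same clock cycle; the pragmas are Vivado HLS
 * directives and are ignored by a standard C compiler. */
void matrix_vector(BaseType M[SIZE][SIZE], BaseType V_In[SIZE],
                   BaseType V_Out[SIZE]) {
#pragma HLS array_partition variable=M dim=2 complete
#pragma HLS array_partition variable=V_In complete
    for (int i = 0; i < SIZE; i++) {
#pragma HLS pipeline II=1
        BaseType sum = 0;
        for (int j = 0; j < SIZE; j++) {
            sum += M[i][j] * V_In[j];
        }
        V_Out[i] = sum;
    }
}
```

Pipelining the outer loop causes the inner loop to be unrolled, which is what creates the demand for eight simultaneous reads that the partitioning directives satisfy.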

@@ -430,7 +430,7 @@ \section{\gls{dft} optimization}
Derive a formula for the access pattern of the 1D array $S'$ given as input the row number $i$ and column number $j$ corresponding to the array $S$. That is, how do we index into the 1D $S'$ array to access element $S(i,j)$ of the 2D $S$ array?
\end{exercise}

- To increase performance further we can apply techniques that are very similar to the matrix-vector multiply. Previously, we observed that increasing performance of matrix-vector multiply required partitioning the \lstinline|M[][]| array. Unfortunately, representing the $S$ matrix using the $S'$ means that there is no longer an effective way to partition $S'$ to increase the amount of data that we can read on each clock cycle. Every odd row and column of $S$ includes every element of $S'$. As a result, there is no way to partition the values of $S'$ like were able to do with $S$. The only way to increase the number of read ports from the memory that stores $S'$ is to replicate the storage. Fortunately, unlike with a memory that must be read and written, it is relatively easy to replicate the storage for an array that is only read. In fact, \VHLS will perform this optimization automatically when instantiates a \gls{rom} for an array which is initialized and then never modified. One advantage of this capability is that we can simply move the $sin()$ and $cos()$ calls into an array initialization. In most cases, if this code is at the beginning of a function and only initializes the array, then \VHLS is able to optimize away the trigonometric computation entirely and compute the contents of the ROM automatically.
+ To increase performance further, we can apply techniques that are very similar to those used in the matrix-vector multiply. Previously, we observed that increasing the performance of matrix-vector multiply required partitioning the \lstinline|M[][]| array. Unfortunately, representing the $S$ matrix using $S'$ means that there is no longer an effective way to partition $S'$ to increase the amount of data that we can read on each clock cycle. Every odd row and column of $S$ includes every element of $S'$. As a result, there is no way to partition the values of $S'$ like we were able to do with $S$. The only way to increase the number of read ports from the memory that stores $S'$ is to replicate the storage. Fortunately, unlike with a memory that must be read and written, it is relatively easy to replicate the storage for an array that is only read. In fact, \VHLS performs this optimization automatically when it instantiates a \gls{rom} for an array that is initialized and then never modified. One advantage of this capability is that we can simply move the $\sin()$ and $\cos()$ calls into an array initialization. In most cases, if this code is at the beginning of a function and only initializes the array, then \VHLS is able to optimize away the trigonometric computation entirely and compute the contents of the ROM automatically.

\begin{exercise}
Devise an architecture that utilizes $S'$ -- the 1D version of the $S$ matrix. How does this affect the required storage space? Does this change the logic utilization compared to an implementation using the 2D $S$ matrix?
2 changes: 1 addition & 1 deletion fir.tex
@@ -363,7 +363,7 @@ \section{Complex FIR Filter}
\begin{figure}
\centering
\includegraphics[width=6in]{images/complex_fir}
- \caption{A complex FIR filter built from four real FIR filters. The input I and Q samples are feed into four different real FIR filters. The FIR filters hold the in-phase (FIR I) and quadrature (FIR Q) complex coefficients. }
+ \caption{A complex FIR filter built from four real FIR filters. The input I and Q samples are fed into four different real FIR filters. The FIR filters hold the in-phase (FIR I) and quadrature (FIR Q) complex coefficients. }
\label{fig:complex_fir}
\end{figure}


1 comment on commit ec52198

@rck289 rck289 commented on ec52198 May 15, 2024

Thanks for the fixes!
