Computation and memory bandwidth in deep neural networks

Eugenio Culurciello, Aliasger Zaidy and Vinayak Gokhale

Memory bandwidth and data re-use in deep neural network computation can be estimated with a few simple simulations and calculations.

Deep neural network computation requires weight data and input data. Weights are the neural network parameters, and input data (maps, activations) is the data you want to process from one layer to the next.

In general, if a computation re-uses data, it will require less memory bandwidth. Re-use takes two forms: sending more inputs to be processed by the same weights, or sending more weights to process the same inputs. If there is no input or weight data re-use, then the bandwidth is at a maximum for a given application.

Linear layers: here a weight matrix of M by M is used to process a vector of M values with b bits each. Total data transferred is b(M+M²), or ~bM².

If the linear layer is used for only one vector, the entire M² matrix of weights has to be sent as the computation occurs. If your system has T operations/second of performance, then the time to perform the computation is bM²/T. Given that bandwidth BW = total data transferred / time, in the case of linear layers BW = T: the required bandwidth grows in step with the compute rate.

This means that if your system has 128 G-ops/s of performance, you will need a bandwidth of more than 128 GB/s to perform the operation at full system efficiency (provided, of course, that the system can do this!).

Of course, if you have multiple inputs for the same linear layer (multiple vectors that need to be multiplied by the same matrix), then BW = T/B, where B is the number of vectors, or batch size.

Convolutions: for convolution operations, the bandwidth requirements are usually lower, because the same input map data can be used in several convolution operations in parallel, and convolution weights are relatively small.

For example: a 13 x 13 pixel map in a 3x3 convolution operation from 192 input maps to 192 output maps (as, for example, in AlexNet layer 3) requires ~4MB of weight data and ~0.1MB of input data from memory. This may require about 3.2 GB/s to be performed on a 128 G-ops/s system with ~99% efficiency (SnowFlake Spring 2017 version). The bandwidth usage is low because the same input data is used to compute 192 outputs, albeit with different small weight matrices.

RNN: the memory bandwidth requirement of recurrent neural networks is one of the highest. The Deep Speech 2 system and similar systems use 4 RNN layers of size 400. In a GRU model, each layer uses the equivalent of 3 linear-layer-like matrix multiplications. During inference the input batch is only 1 or a small number, so running these networks requires the highest amount of memory bandwidth, so high that it is usually not possible to fully utilize even efficient hardware.

Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. It's much more useful to relate memory bandwidth to, say, the number of clock cycles or instructions being executed, to get a feel for what you can (and can't) get away with. Let's start with a historical example: the MOS 6502, first released in 1975 (42 years ago), and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline.
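As a minimal sketch of the linear-layer estimate, the two relations from the text, total data b(M+M²) and BW = T/B, can be written out in Python. The function names are hypothetical and chosen only for illustration; the bandwidth is returned in the same units as T, following the text's convention of equating 128 G-ops/s with 128 GB/s for a single vector.

```python
def linear_layer_bits(m, b):
    """Bits moved for one M x M weight matrix applied to one M-vector: b(M + M^2)."""
    return b * (m + m * m)


def linear_layer_bandwidth(ops_per_second, batch=1):
    """Required bandwidth following the text's BW = T / B.

    The result is in the same units as ops_per_second (assumption: the text
    treats one unit of data per operation, e.g. 128 G-ops/s -> 128 GB/s).
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    return ops_per_second / batch


if __name__ == "__main__":
    T = 128e9  # 128 G-ops/s, the system size used in the text
    for batch in (1, 4, 32):
        bw = linear_layer_bandwidth(T, batch)
        print(f"batch={batch:>2}: ~{bw / 1e9:.0f} GB/s")
```

Under these assumptions, growing the batch is the only lever in this model: each extra vector re-uses the weights already fetched, so the bandwidth needed to keep the system busy falls in proportion to B.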