IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 1, JANUARY 1996

A Board System for High-Speed Image Analysis and Neural Networks
Eduard Säckinger, Member, IEEE, and Hans-Peter Graf, Member, IEEE

Abstract—Two ANNA neural-network chips are integrated on a 6U VME board to serve as a high-speed platform for a wide variety of algorithms used in neural-network applications as well as in image analysis. The system can implement neural networks of variable sizes and architectures, but can also be used for filtering and feature-extraction tasks that are based on convolutions. The board contains a controller implemented with field-programmable gate arrays (FPGA), memory, and bus interfaces, all designed to support the high compute power of the ANNA chips. Compared to a previous evaluation board [1], this new system is designed for maximum speed and is roughly 10 times faster than the previous board. The system has been tested for such tasks as text location, character recognition, and noise removal, as well as for emulating cellular neural networks (CNN). A sustained speed of up to 2 billion connections per second (GC/s) and a recognition speed of 1000 characters per second have been measured.

Keywords—Image Analysis, Pattern Recognition, CNN, Neural-Network Chip.

I. Introduction

Many neural-network chips for high-speed processing have been built and operated in test settings. To take full advantage of the speed of these chips, they must be integrated into a system that can support their speed and, in particular, can manage the overwhelming data flow to and from the chip. The importance of such systems lies in applications that require very high performance, for example reading 1000 characters per second or more. This is typically the case in industrial applications, where speed is often the most critical factor determining the usefulness of a solution. The ANNA board, presented in this paper, has been designed for such applications. It is the second system-level design with the ANNA chip; a previous board [1] was designed as an evaluation platform to test the usefulness of these chips. Several algorithms were implemented on this previous board, which proved the robustness and the power of these mixed analog-digital neural-network chips. System concepts that were chosen for their simplicity limited that previous board to a recognition speed of about one hundred characters per second. The ANNA chip, however, has been designed to process a thousand characters per second [2]. The goal of the new board, described here, is to unlock the chip's full speed and make it available to industrial applications. To achieve this ambitious goal, particular care had to be taken that no bottleneck in the data flow external to the chips would limit the performance.

To put the present system into perspective, we divide existing neurocomputers into two categories: special-purpose systems and programmable systems. Special-purpose systems have a fixed network topology and in some cases even fixed weights. In these systems, networks are often fully implemented, meaning that each neuron is implemented by a special piece of hardware. The Synaptics I-1000 check-reader chip [3] is an example of this type. On the other hand are the programmable systems with programmable topology and weights. Programmable topology means that the number of layers, the neurons per layer, the arrangement of the neurons in the layer (e.g., fully connected or convolutional), the feedback paths, and many more features can be defined by the user. Programmable systems typically use a time-multiplexed implementation, i.e., one hardware neuron implements many neurons in the network. One benefit of the multiplexed implementation is that the maximum size of the network is not limited by the number of neurons available in hardware; larger networks just take longer to evaluate. Adaptive Solutions' CNAPS system [4] and the ANNA system presented here are of the programmable kind. The ANNA chip, described in detail in an earlier paper [2], is a programmable data-path chip and requires an external controller to form a complete system. In the system presented here, two ANNA chips are combined with a controller and a host interface implemented with four Xilinx 4005-5PG156 field-programmable gate arrays (FPGA). Depending on the topology of the neural network, a sustained speed of hundreds of millions to 2 billion connections per second is achieved.

Section II presents the hardware architecture of the system. Section III first introduces the type of neural network used in our applications and then presents four specific applications (printed character recognition, text location & segmentation, stroke-based noise removal, and morphological operations with cellular neural networks), including graphical results and speed figures obtained with the ANNA board.

Manuscript received November 17, 1993; revised September 23, 1994 and June 9, 1995. The authors are with AT&T Bell Laboratories, Holmdel, NJ 07733 USA. Publisher Item Identifier S 1045-9227(96)00155-5.

II. Architecture of the ANNA Board System

The block diagram of the ANNA board is shown in Fig. 1 and a photograph of the 6U VME board is presented in Fig. 2. The performance data and the instruction set of the ANNA board are given in Tables I and II, respectively. The two ANNA neural-network chips in the system operate as a single-instruction multiple-data (SIMD) array, i.e., they receive the same instructions but have different weight data stored internally. Each ANNA chip evaluates 8 neurons with up to 256 synapses per neuron in 4 clock cycles

Fig. 1. Block diagram of the ANNA board system.

(4 × 50 ns). The evaluation of a neuron involves the dot product between a weight and a state vector, followed by a squashing function. Additionally, the chip contains a barrel shifter for processing the input data. A detailed description of this chip can be found in [2]. The controller has two main tasks: (i) It has to supply the ANNA chips with a stream of instructions from the code memory. (ii) It has to supply the ANNA chips with raw input data and store the processed data in the state memory.
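The per-neuron operation can be sketched in a few lines of Python. The tanh shape, the scaling factor, and the 0-7 encoding of the 3-bit states below are illustrative assumptions, not the chip's exact transfer characteristic:

```python
import numpy as np

def squash(x, scale=64.0):
    """Saturating squashing function, quantized to an assumed 3-bit state
    range 0..7 (the exact shape on the ANNA chip differs)."""
    y = np.tanh(x / scale)                       # smooth saturation to [-1, 1]
    return np.clip(np.round((y + 1.0) * 3.5), 0, 7).astype(int)

def evaluate_neuron(weights, states):
    """Dot product of a 6-bit weight vector with a 3-bit state vector,
    followed by the squashing function -- the core ANNA operation."""
    assert len(weights) <= 256                   # at most 256 synapses per neuron
    return squash(int(np.dot(weights, states)))
```

On the board, two chips perform 2 × 8 of these evaluations every four 50-ns clock cycles; the sketch only illustrates the arithmetic of a single neuron.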

Fig. 2. Photograph of the 6U VME board containing the ANNA system.

A. The Code and State Memory

The code memory (32 K × 64) stores the instructions for the ANNA chips and the controller as well as the synaptic weight values. This memory is written by the host to configure the board for a particular network topology and weight set.

The state memory (2 × 64 K × 3) contains the network states (neuron inputs/outputs) and can be read and written by the host. To avoid I/O bottlenecks, two banks of state memory are used in a ping-pong fashion: one bank is accessed by the host for data transfer, while the other one is used by the ANNA chips for data processing. Note that in Fig. 1 the state memory banks are physically organized as 16 K × 12 to enhance the upload/download bandwidth; to the ANNA chip they appear, however, as 64 K × 3 memories.

B. The Host Interface

All data transfers between the host computer and the ANNA board are routed through a bidirectional FIFO of 32-bit width on the VME bus side. A high data rate is achieved by packing 8 state values, each 3 bits, into one 32-bit word. The host interface automatically unpacks the data when storing it into the memory. In one VME bus cycle, 8 neuron state values or 1/2 of an instruction word can be transferred. Four 32-bit registers control the reading and writing of the memories, the starting of the sequencer, and the switching of the ping-pong memory. These registers are mapped into the host's VME address space. The VME interface supports an A32-D32-I7 slave capability.

C. The Controller

The tasks of the controller are the generation of addresses for the code memory (sequencing), and the address computation necessary to transfer data to and from the state memory. See Table II for the instruction set of the controller. The sequencer part is controlled by a program counter pc and two loop counters lc1 and lc2. Both loop counters can be set to a value provided in the instruction word. A special instruction decrements these counters and branches if zero has not been reached.
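The 8-states-per-word packing used by the host interface can be illustrated as follows; the bit ordering within the 32-bit word is an assumption, only the 8 × 3 = 24-bit packing density comes from the text:

```python
def pack_states(states):
    """Pack eight 3-bit neuron states into one 32-bit word, as the host
    interface does on the VME side (bit layout is an assumption)."""
    assert len(states) == 8 and all(0 <= s < 8 for s in states)
    word = 0
    for i, s in enumerate(states):
        word |= s << (3 * i)          # 24 bits used, upper 8 bits stay zero
    return word

def unpack_states(word):
    """Inverse operation: recover the eight 3-bit states from the word."""
    return [(word >> (3 * i)) & 0x7 for i in range(8)]
```

One such word per VME bus cycle yields the 8-states-per-cycle transfer rate mentioned above.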

TABLE I
Performance of the ANNA board system. Numbers in parentheses are for a two-chip system. The unit MS stands for mega-states; one ANNA state is 3 bits, so 1 MS = 10^6 × 3 bits.

Parameter                                 Value
Computational Peak Performance:           10 GC/s (20 GC/s)
ANNA to State-Memory Peak Bandwidth:      20 MS/s
State-Memory to VME Peak Bandwidth:       80 MS/s
Peak Instruction Rate:                    45 MIPS
On-Chip Weight Memory:                    4 K × 6 (8 K × 6)
State Memory (2 Banks):                   2 × 64 K × 3
Instruction Memory:                       32 K × 64
Weight Precision:                         6 bits
State (Neuron I/O) Precision:             3 bits
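The computational peak in Table I can be checked against the chip parameters given in Section II (8 neurons × 256 synapses every 4 clock cycles of 50 ns):

```python
# Peak connection rate of one ANNA chip, from the Section II numbers.
neurons, synapses = 8, 256
cycles, t_cycle = 4, 50e-9
peak_per_chip = neurons * synapses / (cycles * t_cycle)   # connections per second
print(peak_per_chip / 1e9)   # about 10 GC/s, i.e. Table I's figure (20 GC/s for two chips)
```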

TABLE II
Instruction set of the ANNA board system. Up to three instructions (one controller and two ANNA instructions) can be issued per clock cycle.

Type        Mnemonic   Description
ANNA        CALC       Evaluate 8 neurons with up to 256 synapses.
            SHIFT      Shift up to 4 input states into the barrel shifter.
            STORE      Move data vectors from the shifter into on-chip registers.
            RFSH       Refresh 2 weight-storage capacitors.
            NOP        No operation.
Controller  SET        Set data pointers and loop counters.
            GET        Get data from state memory and post-increment pointer.
            PUT        Store data in state memory and post-increment pointer.
            JMP        Jump unconditionally.
            DJNZ       Decrement counter and jump if not zero.
            STOP       Stop sequencer.
            NOP        No operation.

These counters are typically used to loop through 2-dimensional images. The state address computation is supported by two data pointers dpi and dpo that can be post-incremented by a signed constant provided in the instruction word. This addressing mechanism provides the flexibility necessary for scanning multiple 2-d images into the ANNA chips and writing the resulting images or feature maps back.

D. The ANNA Interface

Two pipeline registers (S-register and Z-register in Fig. 1) allow the ANNA chips to compute while at the same time the controller reads new data from the state memory or stores results into the state memory. A multiplexer (MUX) and a de-multiplexer (DEMUX) convert the parallel data streams of the ANNA chips into state-serial ones for the state memory, suitable for random-access read and write operations. The 64-bit code word is split into two halves: the lower half supplies the instructions and weight values to the ANNA chips, while the upper half is used for the controller (it contains values for loading sequencer registers, jump addresses, and increments/decrements for sequencer registers). A decoder (DECODE) translates the ANNA part of the code word into 34 control (CTRL) and 12 weight-value lines. Both ANNA chips are connected to the same controller and share the state input, weight input, and control lines, except for two weight-refresh enable lines used for loading different weight sets into the two ANNA chips. Because of the dynamic weight storage employed on the ANNA chip, a periodic refresh (every 1 to 10 ms) is necessary. This operation is done under program control, i.e., a refresh instruction is provided that charges the weight-storage capacitors to the desired value.

III. Applications of the ANNA Board System

In the following section a number of representative applications and their speed performance on the new ANNA board are discussed. All networks used in the following are based on the so-called 2-d convolutional layer, a simple form of which has been introduced in [5, pp. 348]. This layer type and its parameters will be defined first.

A. The 2-D Convolutional Layer

The fully connected neural-network layer is well known and has been widely used. To make the relationship to the convolutional layer clear, the transfer function of the fully connected layer shall be stated first:

  z_ν = f( Σ_{μ=0}^{i−1} w_νμ · x_μ + b_ν ),   with ν = 0, …, o−1,      (1)

where x_μ are the inputs, z_ν the outputs, w_νμ the weights, b_ν the biases, and f(·) is a squashing function (often a sigmoid function). The topology of this layer is determined by specifying the number of inputs (i) and outputs (o). For image-processing tasks the fully connected layer type is not well suited for several reasons: (i) It has no concept of locality, i.e., adjacent pixels are treated the same way as very distant ones. (ii) It has no concept of shift invariance, i.e., a slightly shifted input pattern can cause a drastically different response at the output. (iii) If applied to large images such as 512 × 512 pixels, it has an overwhelming number of weights (e.g., 512^4 ≈ 7 × 10^10). The convolutional layer introduces constraints such that locality and shift invariance are enforced, and at the same time the number of free parameters is drastically reduced. The latter is not only important for hardware implementations but also makes learning from examples effective (capacity control). The 2-d convolutional layer, also known as a layer with local receptive fields and weight sharing, is mathematically defined by

  z_φλκ = f( Σ_{μ=0}^{fi−1} Σ_{v=0}^{vr−1} Σ_{h=0}^{hr−1} w_φμvh · x_μ,(vs·λ+v),(hs·κ+h) + b_φ )      (2)

with
  φ = 0, …, fo−1
  κ = 0, …, ⌈(hi−hr+1)/hs⌉ − 1
  λ = 0, …, ⌈(vi−vr+1)/vs⌉ − 1.

The inputs x_μλκ and outputs z_φλκ of this layer are best visualized as three-dimensional blocks: the first index indicates the feature, the second the vertical pixel location, and the third the horizontal pixel location. Apart from the nonlinear function f(·), Eq. (2) corresponds to a correlation/convolution operation where the convolution kernels are composed of the synaptic weights. The topology of this layer type is characterized by 8 independent parameters:

- hi – number of horizontal input pixels
- vi – number of vertical input pixels
- fi – number of input feature maps
- hr – horizontal extension of the local receptive field (sometimes called the horizontal kernel size)
- vr – vertical extension of the local receptive field (sometimes called the vertical kernel size)
- hs – horizontal subsampling ratio: the ratio of horizontal input to horizontal output size
- vs – vertical subsampling ratio: the ratio of vertical input to vertical output size
- fo – number of output feature maps

The horizontal and vertical sizes of the output block follow from these parameters as ho = ⌈(hi−hr+1)/hs⌉ and vo = ⌈(vi−vr+1)/vs⌉. Similarly to the fully connected layer, several 2-d convolutional layers can be combined into a multilayer network

Fig. 3. Address block location. The upper picture shows the image of an envelope scanned at 67 dpi. The text lines that have been found by the algorithm are marked in the lower picture. Blocks of text are then determined with a clustering algorithm and scored for being the destination address. The block with the highest score is shown in the lower picture.

or made recurrent by means of a feedback path. Convolutional and fully connected layers can also be mixed.
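A direct (unoptimized) implementation of Eq. (2) may clarify the role of the eight parameters; the variable names follow the symbols defined above:

```python
import numpy as np

def conv2d_layer(x, w, b, hs=1, vs=1, f=np.tanh):
    """2-d convolutional layer of Eq. (2).
    x: input block,  shape (fi, vi, hi)
    w: weight block, shape (fo, fi, vr, hr)
    b: biases,       shape (fo,)
    Returns z with shape (fo, vo, ho), where vo = ceil((vi-vr+1)/vs)
    and ho = ceil((hi-hr+1)/hs)."""
    fi, vi, hi = x.shape
    fo, _, vr, hr = w.shape
    vo = -(-(vi - vr + 1) // vs)          # ceiling division
    ho = -(-(hi - hr + 1) // hs)
    z = np.empty((fo, vo, ho))
    for phi in range(fo):
        for lam in range(vo):
            for kap in range(ho):
                # local receptive field: fi maps, vr x hr window
                patch = x[:, vs*lam : vs*lam + vr, hs*kap : hs*kap + hr]
                z[phi, lam, kap] = f(np.sum(w[phi] * patch) + b[phi])
    return z
```

For example, a 1-map 3 × 3 input convolved with a single 2 × 2 all-ones kernel (hs = vs = 1) yields a 2 × 2 output map, matching the ho and vo formulas above.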
B. Text Location & Segmentation

In the following application a text block, or more generally the region of interest, must be located in a large image. See Fig. 3 for an example where the address block is automatically located in the image of an envelope [6]. In addition to the text-block location, information for segmenting the text into words and characters must be extracted. These processing steps must be carried out before character recognition can take place. A feature-detection network consisting of one 2-d convolutional layer can provide the relevant features in a robust manner. Figure 4 shows the input image and output feature maps of such a network applied to a piece of text taken from the address block in Fig. 3. The detector kernels (weights) are hand-crafted text-line detectors and derivatives of 2-d Gaussians acting as vertical and horizontal edge detectors. The biases are set to large negative numbers such that the neuron outputs turn on only if the feature is clearly present. The 16 output maps of the net-

Fig. 4. Neuron states of the text block location and segmentation network. The output maps contain the following features: maps 1-6: text-line detectors for various text heights (4 to 9 pixels), maps 9 and 10: character-pitch detectors (horizontal left and right edges), maps 12 and 13: word beginning/ending detectors (horizontal left and right edges on a larger scale), maps 15 and 16: base- and top-line detectors (vertical lower and upper edges).


work contain the features listed in the caption of Fig. 4. The following table characterizes the one-layer network and shows the execution speed (including time for refresh) measured on the ANNA board (one-ANNA-chip configuration):

  hi    vi    hr    vr    hs    vs    fi    fo
  130   47    16    16    1     1     1     16

  Connect.    Time      Speed
  15.1 M      7.6 ms    2.0 GC/s

Because of the 16 × 16 kernels, each neuron uses the maximum of 256 synapses provided by the ANNA chip, which leads to an excellent utilization of the parallel resources and the very high speed of 2.0 GC/s, or 2 billion connections per second. This is more than 100 times faster than what is achieved on a dual-processor SPARCstation 10 Model 41. Applied to a 512 × 512 image, the ANNA board processes about 2 frames per second.

Fig. 5. Neuron states of the optical character recognizer. The unknown pattern (32 × 32 pixels) is applied to the input; the output indicates to which of the 73 classes the character belongs. The output states correspond, from left to right and top to bottom, to the 10 digits, 26 uppercase characters, 26 lowercase characters, and 11 punctuation characters.
C. Optical Character Recognition

A fast optical-character-recognition engine is required in several applications and was the driving force behind this ANNA board system. For example, reading address blocks from envelopes for sorting purposes requires a speed of approximately 1000 characters per second (cps) to keep up with the speed of the Postal Service's scanning and sorting machines. In another application, a page reader converts printed text into ASCII for keyword searching. Since each page may easily contain a few thousand characters, a speed of 1000 cps is desirable. A neural network for optical character recognition (OCR) of printed characters (uppercase, lowercase, digits, and punctuation) will be described as an example. A similar, but smaller, network for the recognition of handwritten digits has been reported in [7], [8]. Figure 5 shows the neuron states in the network during the recognition of a


printed character `A'. The first hidden layer (H1) extracts four feature maps from the character image, the second layer (H2) performs averaging and reduces the resolution by four (subsampling), the third layer (H3) combines the four low-resolution feature maps and extracts 12 higher-order feature maps, the fourth layer (H4) reduces the resolution similarly to the second layer, and the last layer is a simple fully connected network with 73 output units. The output contains the classification result (letters, digits, and punctuation) in the form of a 1-out-of-73 code. All weights in the network were learned from examples using back-propagation training.

The first three layers of this network have been implemented on the ANNA board. The remaining layers require higher precision and are evaluated on the host.¹ A significant speed-up over an all-software version is achieved because most of the computation (86%) is carried out by the first three layers, which run on the fast hardware. The following table characterizes the first three layers (all 2-d convolutional) and shows the execution speeds (including time for refresh) measured on the ANNA board (one-ANNA-chip configuration):

  Layer   hi    vi    hr    vr    hs    vs    fi    fo
  1       32    32    5     5     1     1     1     4
  2       28    28    2     2     2     2     4     4
  3       14    14    5     5     1     1     4     12

  Layer   Connect.   Time       Speed
  1       78,400     0.44 ms    180 MC/s
  2       3,136      0.26 ms    12 MC/s
  3       60,000     0.32 ms    190 MC/s

The total of 141,536 connections is evaluated in 1.02 ms, thus achieving a sustained speed of 140 MC/s.² A dual-processor SPARCstation 10 Model 41, for comparison, takes 12.3 ms for the same task.
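The connection counts in the table can be reproduced from the layer parameters. The formula below assumes full connectivity between all input and output maps; the subsampling layer (layer 2) connects each output map to only one input map, and layer 3 is partially connected, so their table entries are smaller than this formula predicts:

```python
import math

def layer_connections(hi, vi, hr, vr, hs, vs, fi, fo):
    """Connections in a fully connected 2-d convolutional layer:
    every output unit has fi * hr * vr synapses."""
    ho = math.ceil((hi - hr + 1) / hs)
    vo = math.ceil((vi - vr + 1) / vs)
    return fo * vo * ho * fi * hr * vr

print(layer_connections(32, 32, 5, 5, 1, 1, 1, 4))        # OCR layer 1: 78,400
print(layer_connections(130, 47, 16, 16, 1, 1, 1, 16))    # text locator: ~15.1 M
```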
D. Stroke-Based Noise Removal

Images taken from envelopes or parcels often contain optical noise. The noise can be caused by textured paper or by patterns shining through the paper. Such noise can be reduced or removed by convolving the image with a 2-d Gaussian function of the appropriate size (low-pass filter), followed by thresholding. This function can be implemented with the same network topology as described above. Here, we shall consider a more sophisticated noise-removal method based on stroke detection. The network consists of two 2-d convolutional layers: the first layer detects strokes at different orientations and the second layer forms the logical `or' of these detectors. The neuron states of this network for a noisy input image are shown in Fig. 6. The stroke detectors are implemented as
¹Reference [8] compares implementations with three and four layers running on the ANNA chip.
²This system speed is identical to the chip speed reported in [8] for a comparable task assuming an ideal controller. Faster ANNA programming techniques compensated for the somewhat lower data and instruction rates achieved in the practical system.

Fig. 6. Neuron states of the stroke-based noise remover. The 8 hidden maps correspond to various stroke directions, starting from horizontal (map 1) and progressing counter-clockwise in 22.5° increments. The output image is the logical `or' of the hidden maps.

2-d Gaussian functions with a large variance in the direction of the stroke and a small variance orthogonal to the stroke direction. Such detectors have the advantage over radial ones that they suppress the noise without blurring the stroke. To capture all possible stroke directions, 8 maps are produced at increments of 22.5°. For instance, map 1 detects horizontal strokes and map 5 vertical strokes. The maps are combined into one image with a `soft or' function using neurons that have eight equally weighted inputs and the appropriate bias. For comparison, the same filtering task has also been tried with radial kernels, and the stroke-based method turned out to be clearly better. The following table characterizes the network and shows the execution speeds of each layer (including time for refresh) measured on the ANNA board (one-ANNA-chip configuration):

  Layer   hi    vi    hr    vr    hs    vs    fi    fo
  1       130   59    8     8     1     1     1     8
  2       123   52    1     1     1     1     8     1

  Layer   Connect.   Time      Speed
  1       3.3 M      5.7 ms    580 MC/s
  2       51 K       3.5 ms    14 MC/s

The total of 3.33 million connections is evaluated in 9.3 ms, thus achieving a sustained speed of 360 MC/s. Applied to a 512 × 512 image, the ANNA board can process 2.7 frames per second.
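The oriented detectors can be sketched as anisotropic Gaussians; the sigma values, kernel size, and normalization below are illustrative assumptions, not the weights used on the board:

```python
import numpy as np

def stroke_kernel(size, angle_deg, sigma_along=4.0, sigma_across=1.0):
    """Oriented 2-d Gaussian: large variance along the stroke direction,
    small variance across it (sigma values are assumptions)."""
    half = (size - 1) / 2.0
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    theta = np.deg2rad(angle_deg)
    # rotate coordinates so u runs along the stroke, v across it
    u = xs * np.cos(theta) + ys * np.sin(theta)
    v = -xs * np.sin(theta) + ys * np.cos(theta)
    k = np.exp(-(u**2 / (2 * sigma_along**2) + v**2 / (2 * sigma_across**2)))
    return k / k.sum()

# eight orientations in 22.5-degree steps: map 1 horizontal, map 5 vertical
kernels = [stroke_kernel(8, 22.5 * i) for i in range(8)]
```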


Fig. 7. Neuron states of the CNN hole filler after complete relaxation (60 iterations). The hidden maps 1 and 2 are computed with the hole-filling templates of Eq. (4), map 3 contains the difference between input and output, thus acting as a hole detector, and map 4 is unused in this example.

E. Morphological Operations with Cellular Neural Networks

Cellular neural networks (CNN) [9], [10], [11] have been the subject of numerous studies because of their image-processing capabilities and their suitability for (analog) parallel implementation. The neurons in a CNN are usually arranged on a 2-d rectangular grid. All neurons have the same weight vectors, regardless of their position in the grid. Each neuron connects to a neighborhood in the input image (u_ij) as well as to the outputs of the neighboring neurons (y_ij). The neighborhood is often chosen 3 × 3 but can be larger. The neuron dynamics of the discrete-time CNN is given by

  y_ij(t+1) = f( Σ_{(k,l)∈N} [ A_kl · y_{i+k,j+l}(t) + B_kl · u_{i+k,j+l} ] + I ),      (3)

where N defines the neighborhood and the non-linear function is f(x) = (1/2)(|x+1| − |x−1|). The so-called "cloning templates" A and B and the bias I determine the operation

carried out by the CNN. For instance, the values

  A = [0 1 0; 1 2 1; 0 1 0],   B = [0 0 0; 0 4 0; 0 0 0],   I = −1      (4)

produce a hole filler [12]. Interestingly, Eq. (4) for the CNN is equivalent to a 2-d convolutional layer with feedback. The layer extracts from two maps (fi = 2), the input and the current output states, using the kernels B and A respectively, and produces the new output states. The equivalence of the original formulation in Eq. (4) and a recurrent 2-d convolutional layer shows that CNNs can be implemented easily on the ANNA board. Since the CNN neighborhood translates to the kernel size for ANNA, a neighborhood of up to 15 × 15 can be easily realized. In fully implemented CNN realizations, neighborhoods of 5 × 5 are already challenging due to the many interconnections between the cells. Due to the multiplexed implementation of the ANNA system, the number of cells for the ANNA implementation is limited only by the external state memory and can easily be several thousands. This is in contrast to the few hundred cells of a current fully implemented version [11].

A more general, two-layer version of the basic CNN introduced above has been implemented on the ANNA board. The first layer produces four hidden feature maps and the second layer combines these maps into one. The neurons in the first layer have 50 inputs connected to a 5 × 5 neighborhood in the input and output maps; the neurons in the second layer have four inputs, one for each map. This two-layer CNN is more powerful than the original one since it can realize non-monotonic cell functions [13]. Many popular CNN cloning templates, such as those for noise removal, hole filling, corner and border extraction, and shadow and connected-component detection, have successfully been tested on the ANNA board (see [10] for a list of the templates). Figure 7 illustrates the hole-filler CNN running on the ANNA board as an example. The hidden maps 1 and 2 are computed with the standard hole-filling templates of Eq. (4), map 3 contains the difference between input and output, thus acting as a hole detector, and map 4 is unused in this example. Figure 7 shows the network after complete relaxation (60 iterations). The following table characterizes the network and shows the execution speeds of each layer (including time for refresh) measured on the ANNA board (one-ANNA-chip configuration):

  Layer   hi    vi    hr    vr    hs    vs    fi    fo
  1       130   88    5     5     1     1     2     4
  2       126   84    1     1     1     1     4     1

  Layer   Connect.   Time      Speed
  1       2.1 M      5.8 ms    360 MC/s
  2       42 K       3.2 ms    13 MC/s

The total of 2.16 million connections in 10,584 (= 84 × 126) cells is evaluated in 9.0 ms (one iteration), thus achieving a sustained speed of 240 MC/s.
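The hole-filler dynamics of Eqs. (3) and (4) can be reproduced directly in software. The zero-padded border and the sign convention (black = +1, white = −1) are assumptions of this sketch:

```python
import numpy as np

# hole-filler cloning templates of Eq. (4)
A = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]], dtype=float)
B = np.array([[0, 0, 0], [0, 4, 0], [0, 0, 0]], dtype=float)
I = -1.0

def f(x):
    """CNN output non-linearity f(x) = (|x+1| - |x-1|)/2, saturating at +/-1."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def cnn_step(y, u):
    """One discrete-time CNN update of Eq. (3), zero-padded borders."""
    yp, up = np.pad(y, 1), np.pad(u, 1)
    out = np.empty_like(y)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            s = np.sum(A * yp[i:i+3, j:j+3]) + np.sum(B * up[i:i+3, j:j+3]) + I
            out[i, j] = f(s)
    return out

def hole_fill(u, iterations=60):
    """Relax the CNN from the standard all-ones initial state."""
    y = np.ones_like(u)
    for _ in range(iterations):
        y = cnn_step(y, u)
    return y
```

Applied to a closed black contour with a white interior pixel, the interior relaxes to +1 (the hole is filled) while the background stays at −1.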


F. Speed Performance

The measurements for the above applications have shown that the speed, measured in mega-connections per second, depends strongly on the topology of the network. Two orders of magnitude in speed variation have been found between the fastest and slowest layer for the same hardware and programming style. This variation is caused by (i) the varying utilization of the parallel ANNA-chip hardware and (ii) the varying ratio of computation to I/O transactions. Small convolution kernels cause neurons to have much fewer than the maximum 256 synapses provided by the ANNA chip. For example, the second layer of the CNN network uses only 4 synapses, whereas the text locator uses all 256 synapses. The unused synapses do not contribute to the computational speed and thus cause a loss in performance. Furthermore, layers with small convolution kernels have a lower computation-to-I/O ratio than those with large kernels. This means that to achieve a certain speed in MC/s (computation), a higher I/O bandwidth is required. The speed of such a layer is therefore more likely to be limited by the I/O bandwidth than by the computational capabilities. The measurements reported above confirm that the kernel size vr × hr is the main factor influencing the speed of the system.
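The utilization argument can be stated as a rough model. It ignores the I/O bound and refresh overhead, so it gives an upper limit rather than the measured sustained speeds:

```python
def utilization_limited_speed(fi, hr, vr, peak_cps=10e9, max_synapses=256):
    """Upper limit on layer speed: only the fi * hr * vr synapses a neuron
    actually uses contribute, so small kernels waste most of the 256
    hardware synapses. peak_cps is the one-chip peak from Table I."""
    synapses = min(fi * hr * vr, max_synapses)
    return peak_cps * synapses / max_synapses

# text locator (16 x 16 kernel): full utilization, 10 GC/s upper limit
# CNN layer 2 (4 synapses): only 4/256 of the hardware used, ~156 MC/s limit
```

The measured 2.0 GC/s and 13 MC/s for these two layers fall below the model's limits, as expected once I/O and refresh time are included.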

IV. Conclusions

The presented board system runs neural-network and image-analysis applications at a speed 10 to 100 times faster than a dual-processor SPARCstation 10. This speed is obtained by integrating the custom ANNA neural-network chip into a fast board system carefully designed to avoid data bottlenecks. The ANNA board is controlled through the VME bus by a host system. The board is fully programmable and can be configured for many applications, including optical character recognition, feature extraction/detection, noise removal, and morphological operations, e.g., with cellular neural networks (CNN).

Acknowledgments

The authors would like to thank, among others, Steve Deiss of Applied Neurodynamics for logic, circuit-board, and FPGA design and manufacturing, Richard Wurth for VME interface design, Jane Bromley for training the character recognizer, and John Shamilian for system integration of the ANNA board.

References

[1] Eduard Säckinger, Bernhard Boser, and Lawrence D. Jackel. A neurocomputer board based on the ANNA neural network chip. In J. M. Moody, S. J. Hanson, and R. P. Lippman, editors, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[2] Bernhard Boser, Eduard Säckinger, Jane Bromley, Yann LeCun, and Lawrence D. Jackel. An analog neural network processor with programmable network topology. IEEE J. Solid-State Circuits, 26(12):2017-2025, December 1991.
[3] Dan Hammerstrom. Neural networks at work. IEEE Spectrum, pages 26-32, June 1993.
[4] Matthew Griffin, Gary Tahara, Kurt Knorpp, Ray Pinkham, and Bob Riley. An 11-million transistor neural network execution engine. In ISSCC Dig. Tech. Papers, pages 180-181. IEEE Int. Solid-State Circuits Conference, 1991.
[5] David E. Rumelhart and James L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press, Cambridge, Mass., 1986.
[6] H. P. Graf, C. R. Nohl, and J. Ben. Image recognition with an analog neural net chip. In Proceedings 11th IAPR International Conference on Pattern Recognition, pages D-11 to D-14. International Association for Pattern Recognition, 1992.
[7] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In David S. Touretzky, editor, Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann Publishers, San Mateo, CA, 1990.
[8] Eduard Säckinger, Bernhard Boser, Jane Bromley, Yann LeCun, and Lawrence D. Jackel. Application of the ANNA neural network chip to high-speed character recognition. IEEE Trans. Neural Networks, 3(3):498-505, May 1992.
[9] Leon O. Chua and L. Yang. Cellular neural networks. IEEE Trans. Circuits Syst., 35:1257-1290, 1988.
[10] Angel Rodríguez-Vázquez, Servando Espejo, Rafael Domínguez-Castro, Jose L. Huertas, and E. Sánchez-Sinencio. Current-mode techniques for the implementation of continuous- and discrete-time cellular neural networks. IEEE Trans. Circuits Syst., 40:132-146, March 1993.
[11] Servando Espejo, Angel Rodríguez-Vázquez, Rafael Domínguez-Castro, Jose L. Huertas, and E. Sánchez-Sinencio. Smart-pixel cellular neural networks in analog current-mode CMOS technology. IEEE J. Solid-State Circuits, 29:895-905, August 1994.
[12] Takashi Matsumoto, Leon O. Chua, and R. Furukawa. CNN cloning template: Hole-filler. IEEE Trans. Circuits Syst., 37:635-638, 1990.
[13] Leon O. Chua, Tamás Roska, and Péter L. Venetianer. The CNN is universal as the Turing machine. IEEE Trans. Circuits Syst., 40:289-291, 1993.



