Spatial attention-based residual network for human burn … – Nature.com

Accurate diagnosis of human burns requires a sensitive model. Machine learning (ML) and deep learning (DL) are commonly employed in medical imaging for disease diagnosis. ResNeXt, AlexNet, and VGG16 are state-of-the-art DL models frequently used for medical image diagnosis. In this study, we evaluated and compared the performance of these models for diagnosing burn images. However, they showed limited effectiveness in accurately diagnosing burn degree and distinguishing grafts from non-grafts.

ResNeXt, a deep residual model, consists of 50 layers, while AlexNet and VGG16 are sequential models with eight and 16 layers, respectively. These layers extract features from the burn images during the models' training process. Unfortunately, distinguishing between deep dermal and full-thickness burns can be challenging, as they share similar white, dark red, and brown colors. Consequently, highly sensitive and stringent methods are required for accurate differentiation. AlexNet and VGG16, being sequential models, mainly extract low-level features, whereas ResNeXt excels at extracting high-dimensional features. A limitation is that these models can only learn positive weight features because of the ReLU activation function, which may hinder their ability to precisely identify critical burn characteristics. The DL models AlexNet, ResNeXt, VGG16, and InceptionV3 are widely used for medical image diagnosis; however, they encounter challenges in accurately categorizing burn degrees and differentiating grafts from non-grafts. Finding effective ways to handle these challenges and improve feature extraction could lead to more sensitive and reliable burn diagnosis models.

The BuRnGANeXt50 model is derived from the ResNeXt model [33]. To construct BuRnGANeXt50, the original ResNeXt model's topology is modified. The original ResNeXt was created to classify images into several categories at a high computation cost. In this study, the method performs both a multiclass and a binary classification task. Multiclass classification is used to assess burn severity based on burn depth. Based on depth, burns are then divided into two types: graft and non-graft. The first change to the original ResNeXt design is reducing the first-layer filter size from \(7 \times 7\) to \(5 \times 5\), because the larger filter size resulted in lower pixel intensity in the burnt region, which increased the number of false negatives for both grafts and non-grafts. Furthermore, the convolution sizes of Conv1, Conv2, Conv3, Conv4, and Conv5 are changed to reduce the computation cost while maintaining cardinality. We also applied Leaky ReLU instead of the ReLU activation for faster model convergence. Table 2 shows that Conv2, Conv3, and Conv4 are reduced in size. After implementing all modifications, the number of neurons decreased from \(23 \times 10^{6}\) to \(5 \times 10^{6}\), as shown in Table 3. The detailed architecture of the proposed model is shown in Fig. 1.

Topology of BuRnGANeXt50 for human burn diagnosis.

This model has several essential building blocks, including convolution, residual, ReLU activation, softmax, and flatten layers. The results of the grouped convolutions of neurons inside the same kernel map are summarized by pooling layers, which reduce the input dimensionality and enhance model performance. Figure 2 describes the structure of the model's convolution layer. The pooling units form a grid, with each unit summarizing a \(z \times z\) neighborhood centered at its location. A standard CNN sets the stride to \(S = z\); in the proposed model we instead use \(S < z\) to increase overlap and decrease overfitting [34]. The proposed architecture was developed to handle the unique issues of burn diagnosis, with an emphasis on decreasing overfitting and enhancing model accuracy.
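The difference between standard pooling (\(S = z\)) and overlapping pooling (\(S < z\)) can be illustrated with a small NumPy sketch. The function name `max_pool2d`, the use of max pooling, the toy \(7 \times 7\) input, and the window and stride values are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def max_pool2d(x, z, s):
    """Max-pool a 2D map with a z x z window and stride s (no padding)."""
    h, w = x.shape
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each pooling unit summarizes one z x z neighborhood.
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

# Hypothetical 7 x 7 feature map, used only for illustration.
x = np.arange(49, dtype=float).reshape(7, 7)

standard = max_pool2d(x, z=3, s=3)     # S = z: windows do not overlap
overlapping = max_pool2d(x, z=3, s=2)  # S < z: neighboring windows share pixels
```

With \(S < z\), adjacent windows share a row or column of pixels, so the pooled map is denser than in the non-overlapping case.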

The pooling layers are convolutions in a grouped manner.

The inner product is an essential operation performed by neurons and is the foundation of an artificial neural network's convolutional and fully connected layers. The inner product can be viewed as an aggregating transformation, as illustrated in Eq. (1).

$$\sum\limits_{i = 1}^{K} w_{i} \rho_{i}$$

(1)

Here \(\rho\) represents the neuron's \(K\)-channel input vector, and \(w_{i}\) is the filter weight for the \(i\)-th channel. This model replaces the elementary transformation \(\left( w_{i} \rho_{i} \right)\) with a more generic function. By expanding along a new dimension, this generic function reduces depth. The model calculates the aggregated transformations as follows:

$$\Im\left( \rho \right) = \sum\limits_{i = 1}^{\mathbb{C}} \Upsilon_{i} \left( \rho \right)$$

(2)

The function \(\Upsilon_{i}(\rho)\) can be an arbitrary function. Like a simple neuron, \(\Upsilon_{i}\) projects \(\rho\) into a low-dimensional embedding and then transforms it. \(\mathbb{C}\) represents the number of transformations to be summed in Eq. (2) and is known as the cardinality [35]. The aggregated transformation in Eq. (2) serves as the residual function [36] (Fig. 3):

$$x = \rho + \sum\limits_{i = 1}^{\mathbb{C}} \Upsilon_{i} \left( \rho \right)$$

(3)

where \(x\) is the model's predicted result.
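Equations (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: each branch \(\Upsilon_{i}\) is modeled as a linear projection to a low-dimensional embedding, a Leaky ReLU, and a projection back, and the sizes `D`, `d`, the weight scale, and the helper names are hypothetical. Only the cardinality of 32 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, CARDINALITY = 64, 4, 32  # D and d are hypothetical; cardinality 32 as in the text

def make_branch():
    """Random weights for one branch: project D -> d, then d -> D."""
    w_down = rng.normal(scale=0.1, size=(d, D))
    w_up = rng.normal(scale=0.1, size=(D, d))
    return w_down, w_up

def upsilon(rho, branch, slope=0.01):
    """One transformation: low-dimensional embedding, Leaky ReLU, projection back."""
    w_down, w_up = branch
    h = w_down @ rho
    h = np.where(h > 0, h, slope * h)  # Leaky ReLU
    return w_up @ h

branches = [make_branch() for _ in range(CARDINALITY)]

def aggregated_residual(rho):
    aggregated = sum(upsilon(rho, b) for b in branches)  # Eq. (2)
    return rho + aggregated                              # Eq. (3)

rho = rng.normal(size=D)
x = aggregated_residual(rho)
```

All branches share the same topology, so the cardinality can be changed without redesigning the individual transformations.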

Channel and spatial attention modules are depicted in (A) and (B), respectively, in these schematic illustrations.

Finally, a flatten layer and a global average pooling layer are added at the top of the model. The softmax activation classifies burns in both the binary and the multiclass setting. The softmax function uses the exponential of each output-layer value to convert logits to probabilities [37]. The vector \(\Phi\) is the system input, representing the feature set. Our study uses \(k\)-class classification, with \(k=3\) for the three levels of burn severity and \(k=2\) for graft versus non-graft. For predicting classification results, the bias \(W_{0} X_{0}\) is added at each iteration.

$$p(\rho = i \mid \Phi^{\left( j \right)} ) = \frac{{e^{{\Phi^{\left( j \right)} }} }}{{\sum\nolimits_{i = 0}^{k} e^{{\Phi_{k}^{\left( j \right)} }} }}$$

(4)

$$\text{In}\;\text{which}\;\Phi = W_{0} X_{0} + W_{1} X_{1} + \ldots + W_{k} X_{k}$$

(5)
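The conversion from logits to class probabilities in Eqs. (4) and (5) can be sketched as follows. The feature vector, weight matrix, and bias are random placeholders (assumptions for illustration), and the subtraction of the maximum logit is a standard numerical-stability step that the text does not mention.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # stability shift; does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=8)   # hypothetical feature vector X
W = rng.normal(size=(3, 8))     # hypothetical weights for k = 3 severity classes
bias = np.zeros(3)              # plays the role of the W_0 X_0 term

probs = softmax(W @ features + bias)
predicted_class = int(np.argmax(probs))
```

For the graft versus non-graft task, the same function applies with a weight matrix of two rows (\(k=2\)).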

The residual attention block, which allows attention to be routed across groups of separate feature maps, is shown in Fig. 3. Furthermore, the channel's extra feature map groups combine the spatial information of all groups via the spatial attention module, boosting the CNN's capacity to represent features. The block comprises feature map groups, feature transformation channels, spatial attention algorithms, and so on. Convolution operations can be performed on feature groups, and the cardinality specifies the number of feature map groups. A new parameter, \(S\), indicates the total number of groups in the channel set [38] and the number of subgroups in each of the \(N\) input feature groups. A channel scheduler optimizes the processing of incoming data through channels; this method transforms feature subsets. The total number of feature groups is \(G = N \times S\).

Using Eq. (6), we apply a basic feature transformation to the subgroups inside each group after channel shuffling.

$$g\left( {r,i,j} \right) = \left[ {\begin{array}{*{20}c} {\cos \frac{r\pi }{2}} & { - \sin \frac{r\pi }{2}} \\ {\sin \frac{r\pi }{2}} & {\cos \frac{r\pi }{2}} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} i \\ j \\ \end{array} } \right]$$

(6)

Here \(0 \le r < 4\), and \(\left( i,j \right)\) stands for the original matrix's coordinates. \(K\) represents the \(3 \times 3\) convolution of the bottleneck block, and the output is written as \(y_{s}\). Then, for each input \(x_{s}\), we have:

$$y_{s} = \left\{ {\begin{array}{*{20}c} {K\left( {g_{r} \left( {x_{s} } \right)} \right)r,} & {s = 0} \\ {K\left( {g_{r} \left( {x_{s} } \right)} \right) \odot y_{0} } & {0 < r = s < 4} \\ \end{array} } \right.$$

(7)

\(g_{r}\) here denotes the transformation of Eq. (6) applied to the input \(x_{s}\), and \(\odot\) corresponds to element-wise multiplication in the matrix's related feature transformation. The features of \(x\) being transformed are shared across the three \(3 \times 3\) convolution operators \(K\).
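Equations (6) and (7) can be sketched as follows. This is an illustrative reading under stated assumptions: `np.rot90` is used as a stand-in for applying the rotation \(g_{r}\) to a whole feature map, the subgroups are square single-channel maps so that rotated outputs stay shape-compatible with \(y_{0}\), and one randomly initialized \(3 \times 3\) kernel plays the role of the shared operator \(K\). All function names are hypothetical.

```python
import numpy as np

def rotate_coords(r, i, j):
    """Apply the rotation matrix of Eq. (6) to the coordinates (i, j)."""
    theta = r * np.pi / 2
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    return np.rint(rot @ np.array([i, j])).astype(int)

def conv3x3(x, kernel):
    """Zero-padded 3 x 3 convolution (cross-correlation) that keeps the input size."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros((h, w))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

def transform_subgroups(xs, kernel):
    """Eq. (7): y_0 = K(g_0(x_0)); y_s = K(g_s(x_s)) * y_0 for 0 < s < 4."""
    y0 = conv3x3(np.rot90(xs[0], k=0), kernel)
    ys = [y0]
    for s in range(1, len(xs)):
        rotated = np.rot90(xs[s], k=s)           # rotate by s * 90 degrees
        ys.append(conv3x3(rotated, kernel) * y0)  # element-wise product with y_0
    return ys

rng = np.random.default_rng(0)
xs = [rng.normal(size=(6, 6)) for _ in range(4)]  # four hypothetical subgroups
kernel = rng.normal(size=(3, 3))                  # shared 3 x 3 weights
ys = transform_subgroups(xs, kernel)
```

Because the same kernel is reused for every rotation, the subgroups see the shared weights from four orientations without adding parameters.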

Semantic-specific feature representations can be improved by exploiting the interdependencies among channel maps. We use the feature map's channels as individual detectors. Figure 3A depicts how we send the feature map of the \(n\)-th group, \(G^{n} \in R^{C/N \times H \times W}\) with \(n \in \{1, 2, \ldots, N\}\), to the channel attention module. As a first step, we use global average pooling (GAP) to gather global context information linked to channel statistics [39]. The 1D channel attention maps \(C^{n} \in R^{C/N}\) are then inferred using the shared fully connected layers.

$$C^{n} = D_{sigmoid} \left( {D_{ReLU} \left( {GAP\left( {G^{n} } \right)} \right)} \right)$$

(8)

\(D_{sigmoid}\) and \(D_{ReLU}\) represent fully connected layers that use Sigmoid and ReLU as activation functions, respectively. The Hadamard product is then used to combine each group's attention map with the corresponding input features, and the weighted components from each group are added together to produce the output feature. The final channel attention map \(C \in R^{C/N \times H \times W}\) is

$$C = \sum\limits_{n = 1}^{N} \left( {C^{n} \odot G^{n} } \right)$$

(9)

Each group's \(1 \times 1\) convolution kernel weight is multiplied by the \(3 \times 3\) kernel weight from the subgroup's convolutional layer. The global feature dependency is preserved by adding the groups' channel attention weights, which all add up to the same value.
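The channel attention of Eqs. (8) and (9) can be sketched in NumPy as below. The tensor sizes, the hidden width of the shared fully connected layers, the absence of bias terms, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(groups, w1, w2):
    """groups: (N, C/N, H, W); w1: (hidden, C/N) and w2: (C/N, hidden) are shared."""
    out = np.zeros(groups.shape[1:])
    for g in groups:
        gap = g.mean(axis=(1, 2))           # global average pooling -> (C/N,)
        hidden = np.maximum(w1 @ gap, 0.0)  # D_ReLU
        c_n = sigmoid(w2 @ hidden)          # D_sigmoid, Eq. (8)
        out += c_n[:, None, None] * g       # Hadamard product and sum, Eq. (9)
    return out

rng = np.random.default_rng(0)
N, channels_per_group, H, W = 4, 8, 5, 5    # hypothetical sizes
groups = rng.normal(size=(N, channels_per_group, H, W))
w1 = rng.normal(size=(2, channels_per_group))
w2 = rng.normal(size=(channels_per_group, 2))

C = channel_attention(groups, w1, w2)
```

Each group receives one scalar weight per channel, and the weighted groups are summed into a single map of shape \(C/N \times H \times W\).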

A spatial attention module is used to synthesize spatial links and increase the spatial size of associated features. This component is separate from the channel attention module. The spatial information of feature maps is first aggregated using global average pooling (GAP) and global max pooling (GMP) [39] to obtain two distinct contextual descriptors. Next, \(GAP(C) \in R^{1 \times H \times W}\) and \(GMP(C) \in R^{1 \times H \times W}\) are concatenated to obtain \(S_{c} \in R^{2 \times H \times W}\).

$$S_{c} = GAP\left( C \right) + GMP\left( C \right)$$

(10)

The plus sign (+) denotes concatenation of the feature maps. Finally, a regular convolutional layer retrieves the spatial weight information \(S_{conv}\). The final spatial attention map \(S \in R^{C/N \times H \times W}\) is obtained by element-wise multiplying \(S_{conv}\) with the input feature map \(C\).

$$S = Conv_{3 \times 3} \left( {S_{c} } \right) \odot C$$

(11)

\(Conv_{3 \times 3}\) denotes a regular \(3 \times 3\) convolution, while Sigmoid denotes the activation function.
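The spatial attention of Eqs. (10) and (11) can be sketched as follows. Two points are assumptions rather than statements from the text: the pooling is taken across the channel axis so that each descriptor has shape \(1 \times H \times W\), and the Sigmoid is applied to the convolution output before the element-wise product. The kernel values and tensor sizes are random placeholders.

```python
import numpy as np

def spatial_attention(c, kernels):
    """c: (C/N, H, W) feature map; kernels: (2, 3, 3) weights of one 3 x 3 conv."""
    # Eq. (10): stack the average- and max-pooled descriptors -> (2, H, W).
    s_c = np.stack([c.mean(axis=0), c.max(axis=0)])
    h, w = c.shape[1:]
    padded = np.pad(s_c, ((0, 0), (1, 1), (1, 1)))
    s_conv = np.zeros((h, w))
    for ch in range(2):
        for di in range(3):
            for dj in range(3):
                s_conv += kernels[ch, di, dj] * padded[ch, di:di + h, dj:dj + w]
    weights = 1.0 / (1.0 + np.exp(-s_conv))  # Sigmoid (assumed placement)
    return weights[None, :, :] * c           # Eq. (11): broadcast over channels

rng = np.random.default_rng(0)
c = rng.normal(size=(8, 5, 5))               # hypothetical channel-attended map
kernels = rng.normal(size=(2, 3, 3))
S = spatial_attention(c, kernels)
```

The output keeps the shape of the input, with every spatial position rescaled by a single weight shared across channels.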

Deep learning models based on the Leaky ReLU activation do not require input normalization to avoid saturation, and neurons in this model are more efficient at learning from negative inputs. Even so, a normalization step facilitates generalization. The neural activity \(\alpha_{u,v}^{i}\) is calculated at a point \((u,v)\) by applying the kernel \(i\), and the ReLU nonlinearity is then applied. The response-normalized activity \(b_{u,v}^{i}\) is determined using Eq. (12).

$$b_{u,v}^{i} = \frac{{\alpha_{u,v}^{i} }}{{\left( {t + \alpha \sum\nolimits_{j = \max (0,\,i - n/2)}^{\min (N - 1,\,i + n/2)} (\alpha_{u,v}^{j} )^{2} } \right)^{\beta } }}$$

(12)

where \(N\) is the total number of kernels in the layer and \(t, \alpha, n, \beta\) are constants. The sum runs over the \(n\) neighboring kernel maps at the same spatial position [40]. We trained the network using \(100 \times 100 \times 3\) images and the original ResNeXt CNN topology's cardinality hyper-parameter \(\mathbb{C}=32\). The algorithm of the proposed method is shown below.

Algorithm of the proposed method.
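The response normalization of Eq. (12) can be sketched in NumPy as below. The default constants (`n=5`, `t=2.0`, `alpha=1e-4`, `beta=0.75`) are commonly used values and are assumptions here, since the text does not state the values used. The sum is taken over neighboring kernel maps, clipped to the valid index range.

```python
import numpy as np

def local_response_norm(a, n=5, t=2.0, alpha=1e-4, beta=0.75):
    """a: (N, H, W) activations, one map per kernel; returns normalized maps."""
    num_kernels = a.shape[0]
    b = np.empty(a.shape, dtype=float)
    for i in range(num_kernels):
        lo = max(0, i - n // 2)                # lower limit of the sum
        hi = min(num_kernels - 1, i + n // 2)  # upper limit of the sum
        denom = (t + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

rng = np.random.default_rng(0)
activations = rng.normal(size=(16, 10, 10))    # hypothetical activations
normalized = local_response_norm(activations)
```

Maps near the first or last kernel index are normalized over fewer neighbors because of the clipping in the limits of the sum.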

All authors contributed to the conception and design of the study. All authors read and approved the final manuscript.
