1.4. Linear Layer#
A linear layer (also known as a fully connected or dense layer) applies an affine transformation to its input: a matrix multiplication followed by the addition of a bias. This transformation can be represented mathematically as:
\( y = xA^T + b \)
where:
\(x\) is the input,
\(A\) is the layer’s weight matrix,
and \(b\) is the bias term.
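In PyTorch, \(A\) is stored with shape (out_features, in_features), which is why the formula multiplies by its transpose. As a quick sanity check, here is a minimal sketch (the sizes are arbitrary) reproducing nn.Linear by hand:
# Reproduce nn.Linear manually to confirm y = x A^T + b
import torch
from torch import nn

linear = nn.Linear(3, 2)  # A has shape (2, 3); b has shape (2,)
x = torch.randn(4, 3)     # a batch of 4 inputs, each with 3 features
manual = x @ linear.weight.T + linear.bias
print(torch.allclose(manual, linear(x)))  # True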
The linear layer is a foundational component in neural networks and is often used to map high-dimensional features into a lower-dimensional space, such as the class scores of a final classification output.
Convolutional Architectures#
In traditional CNN architectures, such as AlexNet and VGG, linear layers are typically found at the end of the network. After a series of convolutional layers, which extract features from the input data, the network often includes a few fully connected layers to interpret these features and make predictions. In AlexNet and VGG, the last three layers are fully connected and are responsible for the final classification.
In more recent architectures like ResNet, linear layers are less prominent, with only the final classification layer being linear. This final layer maps the network’s feature space to the required output nodes (e.g., for ImageNet, the linear layer provides outputs for 1,000 different classes). This keeps the network parameter-efficient: most of the computation happens in the convolutional layers, while the single linear layer simply provides the final output mapping.
Let’s see an example:
# Import necessary components from PyTorch
import torch
from torch import nn
# Define a linear layer with 2048 input features and 1000 output features
# This example simulates a final classification layer mapping 2048 features to 1000 classes
linear_layer = nn.Linear(2048, 1000)
# Create a simulated input tensor of size (128, 2048), where 128 represents batch size
# and 2048 represents the feature vector length for each input
simulated_input = torch.randn(128, 2048)
# Apply the linear transformation to the simulated input
output = linear_layer(simulated_input)
# Print the output shape to verify the transformation result
print("Output size:", output.size()) # Expected shape: (128, 1000)
Output size: torch.Size([128, 1000])
In this example:
We define a linear layer with 2,048 input features and 1,000 output features, which could represent a typical setup for a classification task on a dataset with 1,000 classes.
We simulate an input batch of size 128, where each item has 2,048 features.
The layer transforms this input into an output of shape (128, 1000), showing how each input is mapped to 1,000 possible output classes.
Transformer Architectures#
Transformer models, such as BERT and Vision Transformers (ViTs), make extensive use of linear layers, especially within their Multilayer Perceptron (MLP) blocks. These linear layers project features to higher or lower dimensions, supporting the attention mechanism, information integration, and classification. We’ll explore these aspects in greater depth in the upcoming chapter focused on transformer architectures.
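As a brief preview, here is a minimal sketch of how attention relies on linear projections. The embedding size of 768 and the sequence length are illustrative choices (768 matches BERT-base and ViT-Base):
# Linear layers produce the query, key, and value projections used by attention
import torch
from torch import nn

embed_dim = 768  # illustrative embedding size, matching BERT-base and ViT-Base

q_proj = nn.Linear(embed_dim, embed_dim)  # maps tokens to queries
k_proj = nn.Linear(embed_dim, embed_dim)  # maps tokens to keys
v_proj = nn.Linear(embed_dim, embed_dim)  # maps tokens to values

tokens = torch.randn(1, 16, embed_dim)  # (batch size, sequence length, features)
q, k, v = q_proj(tokens), k_proj(tokens), v_proj(tokens)
print(q.shape)  # torch.Size([1, 16, 768])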
For now, let’s walk through a simple example to see how an MLP can be implemented in PyTorch. In this example, we’ll use dimensions similar to those in the ViT architecture, where inputs of size 768 are mapped to 3072 dimensions in the hidden layer.
# Import necessary libraries
import torch
from torch import nn
# Define the Multilayer Perceptron (MLP) class
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Define a sequential layer stack
        # This mirrors the MLP block of the ViT (Vision Transformer) architecture:
        # - An input layer mapping 768 features to a hidden layer with 3072 features
        # - A non-linear activation (ViT itself uses GELU; ReLU is used here for simplicity)
        # - An output layer that returns to the original 768 features
        self.layers = nn.Sequential(
            nn.Linear(768, 3072, bias=True),  # Linear layer with bias term
            nn.ReLU(),                        # ReLU activation to introduce non-linearity
            nn.Linear(3072, 768, bias=True)   # Linear layer returning to 768 dimensions
        )

    # Define the forward pass of the MLP
    def forward(self, x):
        return self.layers(x)  # Apply the sequential layers to input x
# Instantiate the MLP model
MLP_model = MLP()
print(MLP_model) # Display the model structure
MLP(
  (layers): Sequential(
    (0): Linear(in_features=768, out_features=3072, bias=True)
    (1): ReLU()
    (2): Linear(in_features=3072, out_features=768, bias=True)
  )
)
In this MLP example:
The first linear layer expands the input from 768 dimensions to 3072 dimensions.
The ReLU activation introduces a non-linearity, allowing the network to model more complex relationships.
The final layer reduces the dimensionality back to 768, ensuring compatibility with the input’s original size if needed for later processing in the transformer model.
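To see this shape round-trip in practice, we can pass a batch of token embeddings through the model; the batch and sequence sizes below are arbitrary:
# Pass a batch of token embeddings through the MLP
tokens = torch.randn(32, 196, 768)  # e.g., 32 inputs with 196 tokens of 768 features
output = MLP_model(tokens)
print("Output size:", output.size())  # torch.Size([32, 196, 768])
Because nn.Linear operates on the last dimension, the MLP is applied independently to every token in the sequence.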
This basic MLP structure demonstrates how transformers manage feature transformations within their architectures. We’ll explore transformers and their role in detail in later sections.
Discussion#
The remarkable success of artificial neural networks today can be traced back to foundational ideas from the early days of computational modelling. This journey began with the perceptron, developed by Frank Rosenblatt in 1957: one of the first models that could learn to distinguish patterns through a simple linear transformation. It represented a breakthrough by introducing a learning algorithm driven by data.
The perceptron’s limitations, however, restricted it to solving only linearly separable problems; it cannot, for example, represent the XOR function, as the sketch below illustrates. This led to further innovations in neural modelling, including the multilayer perceptron (MLP), proposed by Ivakhnenko and Lapa. The MLP stacks multiple layers of perceptrons with non-linear activation functions, allowing networks to solve more complex, non-linear problems by learning intricate patterns in data.
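As an illustration, here is a small network whose weights are set by hand (chosen for the demonstration rather than learned) that computes XOR, something no single linear layer can do:
# A hand-wired two-layer network computing XOR
import torch
from torch import nn

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # all XOR inputs

hidden = nn.Linear(2, 2)
hidden.weight.data = torch.tensor([[1., 1.], [1., 1.]])
hidden.bias.data = torch.tensor([0., -1.])

output = nn.Linear(2, 1)
output.weight.data = torch.tensor([[1., -2.]])
output.bias.data = torch.tensor([0.])

with torch.no_grad():
    y = output(torch.relu(hidden(x)))
print(y.squeeze())  # tensor([0., 1., 1., 0.])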
For an in-depth historical perspective, we recommend reading The Road to Modern AI, which covers the milestones that have shaped artificial intelligence into what we see today. This document highlights how each step in neural network development, from Rosenblatt’s perceptron to contemporary deep learning models, has contributed to our current capabilities in AI.