Activation Functions in Neural Networks: Types, Examples & Code

Published on June 17, 2026 - Updated on June 20, 2026

Figure: Activation functions sit inside neurons and shape how signals move through a neural network.

Quick Answer

An activation function is a mathematical gate at the end of a neuron. The neuron first combines inputs with weights and a bias, then the activation function transforms that raw value into the signal passed to the next layer.

z = (w1 * x1) + (w2 * x2) + ... + b
a = activation(z)

That small transformation changes the entire personality of a neural network. It controls whether the model behaves like a simple straight-line calculator or a flexible learner that can understand curved boundaries, context, and high-dimensional patterns.

What is an Activation Function?

In a neural network, each neuron receives numbers from the previous layer. Those numbers are multiplied by learned weights, added together with a bias, and turned into one raw score. This raw score is often called a pre-activation value; in the output layer it is usually called a logit.

The activation function answers a simple question: how much of this signal should continue forward? Some functions compress values into a small range. Some remove negative values. Some convert scores into probabilities.

Hidden-layer activations: Help the model learn useful internal features, such as edges, word relationships, or tabular patterns.

Output-layer activations: Shape the final prediction, such as a probability, a class distribution, or a continuous number.

Training behavior: Affects gradient flow, convergence speed, and how efficiently the model learns from errors.

Why Non-Linearity Matters

The primary purpose of an activation function is to introduce non-linearity. Without non-linearity, a deep network is mathematically no stronger than a single linear layer.

linear(linear(linear(x))) = another_linear_function(x)

Imagine a model trying to separate two classes that curve around each other in feature space. A purely linear model can only draw a straight boundary. Once hidden layers use non-linear activations, the network can bend, fold, and combine features into richer shapes.
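The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the layer shapes, the random seed, and the variable names are assumptions chosen for the demo, not part of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Three stacked linear layers with no activation (shapes are arbitrary)
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)
W3, b3 = rng.normal(size=(3, 2)), rng.normal(size=2)

x = rng.normal(size=4)
stacked = ((x @ W1 + b1) @ W2 + b2) @ W3 + b3

# The same mapping written as ONE linear layer
W = W1 @ W2 @ W3
b = (b1 @ W2 + b2) @ W3 + b3
single = x @ W + b

print(np.allclose(stacked, single))  # True

def relu(v):
    return np.maximum(0, v)

# With ReLU between the layers, the single-layer shortcut generally
# stops matching, because negative hidden values are zeroed out.
nonlinear = relu(relu(x @ W1 + b1) @ W2 + b2) @ W3 + b3
print(np.allclose(nonlinear, single))
```

The first check holds for any weights, since a product of matrices is still a matrix. The second usually fails as soon as at least one hidden value is negative.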

This is why activation functions are critical when building AI algorithms or small language models from scratch. They influence both learning capability and computational efficiency.

How a Neuron Uses Activation

Suppose a model receives two features for a support ticket: urgency and customer value. A neuron can combine those features into a score, then ReLU can decide whether that score is strong enough to become an active signal.

import numpy as np

features = np.array([0.8, 0.4])      # urgency, customer value
weights = np.array([1.2, -0.5])
bias = 0.1

z = np.dot(features, weights) + bias
a = max(0, z)  # ReLU activation

print(round(z, 3))  # 0.86
print(round(a, 3))  # 0.86

If the weighted score were negative, ReLU would output zero. That means this neuron would stay quiet for that input. In larger networks, many neurons specialize this way: some activate for one pattern, while others activate for another.

Types of Activation Functions

Different activation functions are useful in different parts of a network. The best choice depends on whether the layer is hidden or final, whether the task is classification or regression, and whether speed or smooth gradients matter most.

1. Linear Activation

Linear activation returns the input unchanged. It is usually not useful in hidden layers because it does not add non-linearity, but it is common in regression output layers where the model predicts an unrestricted number.

linear(z) = z

Example use case: predicting tomorrow's temperature, house price, or demand forecast as a continuous value.

2. Sigmoid

Sigmoid squeezes any real number into the range 0 to 1. Because the output looks like a probability, sigmoid is a natural fit for binary classification.

sigmoid(z) = 1 / (1 + exp(-z))

Example: a fraud model can output 0.92 to mean "high probability of fraud." The main drawback is saturation: inputs that are large in magnitude, whether positive or negative, push the output toward 1 or 0, where the gradient is close to zero and learning slows down.
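Saturation is easier to see by looking at the derivative of sigmoid, which is sigmoid(z) * (1 - sigmoid(z)). The following is a minimal sketch; the helper name sigmoid_grad and the sample inputs are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of sigmoid with respect to its input
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(z)

# The gradient peaks at 0.25 when z = 0 and shrinks to roughly 4.5e-5
# at |z| = 10, so saturated neurons pass back almost no learning signal.
for value, grad in zip(z, grads):
    print(f"z = {value:6.1f}   gradient = {grad:.6f}")
```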

3. Tanh

Tanh is similar to sigmoid but maps values into the range -1 to 1. Since it is centered around zero, it can be easier for some networks to optimize than sigmoid.

tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))

Example: tanh can be useful when a hidden representation should express both negative and positive evidence. Like sigmoid, it can still saturate at extreme values.
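One way to see the relationship between the two functions is that tanh is a rescaled and shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1. The sketch below checks that identity and the zero-centering claim on a small, symmetric set of inputs chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 7)  # symmetric inputs: -3, -2, ..., 3

# tanh is a rescaled, shifted sigmoid
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True

# Over symmetric inputs, tanh outputs average to 0,
# while sigmoid outputs average to 0.5
print(np.isclose(np.tanh(z).mean(), 0.0))  # True
print(np.isclose(sigmoid(z).mean(), 0.5))  # True
```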

4. ReLU

ReLU, short for Rectified Linear Unit, is the default hidden-layer choice for many feed-forward and convolutional networks. It keeps positive values and clips negative values to zero.

relu(z) = max(0, z)

ReLU is fast and simple. It also helps reduce the vanishing-gradient problem because the gradient is exactly 1 for every positive input. Its weakness is the "dead neuron" problem: the gradient is zero for negative inputs, so a neuron whose pre-activation is negative for every training example receives no updates and can stop learning.
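The dead-neuron effect follows directly from the ReLU gradient. Here is a small sketch; the two batches of pre-activations are made-up values, and relu_grad is a hypothetical helper name.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    # The value at exactly z = 0 is a convention; 0 is used here.
    return (z > 0).astype(float)

# Hypothetical pre-activations of one neuron across a batch of five inputs
z_alive = np.array([-1.0, 0.5, 2.0, -0.3, 1.2])
z_dead = np.array([-1.0, -0.5, -2.0, -0.3, -1.2])

print(relu_grad(z_alive))  # [0. 1. 1. 0. 1.]
print(relu_grad(z_dead))   # [0. 0. 0. 0. 0.]
```

When every gradient in the batch is zero, no error signal reaches the neuron's weights, so they do not change on that update.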

5. Leaky ReLU and PReLU

Leaky ReLU is a small modification of ReLU. Instead of turning negative values into zero, it lets a tiny negative slope pass through. PReLU makes that slope learnable.

leaky_relu(z) = z if z > 0 else 0.01 * z

Example: if a feature is useful even when it is below zero, Leaky ReLU can keep that weak signal alive instead of deleting it completely.
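To make the "learnable slope" idea concrete, the sketch below writes PReLU as a function of both the input and the slope, plus the gradient with respect to the slope. The names prelu and prelu_grad_alpha and the starting slope of 0.25 are assumptions for illustration; a real framework would handle the update through automatic differentiation.

```python
import numpy as np

def prelu(z, alpha):
    # Same shape as Leaky ReLU, but alpha is treated as a parameter
    return np.where(z > 0, z, alpha * z)

def prelu_grad_alpha(z):
    # d prelu / d alpha: z on the negative side, 0 on the positive side
    return np.where(z > 0, 0.0, z)

z = np.array([-2.0, -0.5, 1.0])
alpha = 0.25  # assumed starting slope

print(prelu(z, alpha).tolist())      # [-0.5, -0.125, 1.0]
print(prelu_grad_alpha(z).tolist())  # [-2.0, -0.5, 0.0]
```

Only negative inputs contribute to the slope's gradient, which is why the slope adapts to how the negative side of the signal is being used.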

6. ELU and SELU

ELU smooths the negative side of ReLU instead of using a hard zero. SELU is a self-normalizing variant that can help certain deep feed-forward networks keep activations well-scaled.

These functions are less universal than ReLU, but they are useful when you want smoother negative behavior or more stable activation statistics.
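For reference, ELU and SELU can be sketched in NumPy as follows. This assumes the common definition elu(z) = z for z > 0 and alpha * (exp(z) - 1) otherwise, with SELU as a scaled ELU using the fixed constants from the SELU paper; treat it as an illustration, not a replacement for a framework's implementation.

```python
import numpy as np

def elu(z, alpha=1.0):
    # expm1(z) computes exp(z) - 1 accurately for small z.
    # np.minimum keeps the unused branch from overflowing for large z.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0)))

# Fixed constants from the SELU paper
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    return SELU_SCALE * elu(z, SELU_ALPHA)

z = np.array([-2.0, 0.0, 2.0])
print("elu: ", np.round(elu(z), 3))
print("selu:", np.round(selu(z), 3))
```

Unlike ReLU, both functions return small negative values for negative inputs and level off (toward -alpha for ELU) as the input becomes very negative.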

7. GELU and Swish

GELU and Swish are smooth activations that do not use the hard cutoff of ReLU. GELU is especially common in transformer-style models; it multiplies the input by the standard normal CDF of that input, which acts as a probability-like gate.

swish(z) = z * sigmoid(z)

Example: language models often benefit from smoother activation curves because the hidden states carry subtle contextual information instead of simple on/off signals.
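A NumPy sketch of both functions is shown below. It assumes the standard definition gelu(z) = z * Phi(z), where Phi is the standard normal CDF, and also includes the widely used tanh approximation. The function names gelu_exact and gelu_tanh are chosen here for illustration.

```python
import math
import numpy as np

# math.erf works on scalars, so vectorize it for arrays
_erf = np.vectorize(math.erf)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def gelu_exact(z):
    # z * Phi(z), with Phi written in terms of the error function
    return 0.5 * z * (1 + _erf(z / math.sqrt(2)))

def gelu_tanh(z):
    # Common tanh-based approximation of GELU
    inner = math.sqrt(2 / math.pi) * (z + 0.044715 * z**3)
    return 0.5 * z * (1 + np.tanh(inner))

z = np.linspace(-3, 3, 13)
print("swish:     ", np.round(swish(z), 3))
print("gelu exact:", np.round(gelu_exact(z), 3))
print("gelu tanh: ", np.round(gelu_tanh(z), 3))
```

Both curves dip slightly below zero for small negative inputs and approach the identity for large positive inputs, which is the smooth alternative to ReLU's hard corner at zero.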

8. Softmax

Softmax is normally used in the output layer for multiclass classification. It converts a vector of raw class scores into probabilities that add up to 1.

softmax(z_i) = exp(z_i) / sum(exp(z_j))

Example: if a document classifier predicts scores for "invoice", "receipt", and "contract", softmax can turn those scores into a clean class distribution.

logits = [2.0, 1.0, 0.1]
softmax(logits) = [0.659, 0.242, 0.099]

invoice: 65.9%
receipt: 24.2%
contract: 9.9%

Code Examples

Here is a compact NumPy implementation of the most common activation functions. This is useful when you are learning neural networks from scratch because it shows that the math is approachable.

import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(logits):
    shifted = logits - np.max(logits)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

print("sigmoid:", np.round(sigmoid(x), 3))
print("tanh:   ", np.round(tanh(x), 3))
print("relu:   ", relu(x))
print("leaky:  ", leaky_relu(x))
print("softmax:", np.round(softmax(np.array([2.0, 1.0, 0.1])), 3))

Expected output:

sigmoid: [0.119 0.269 0.5   0.731 0.881]
tanh:    [-0.964 -0.762  0.     0.762  0.964]
relu:    [0. 0. 0. 1. 2.]
leaky:   [-0.02 -0.01  0.    1.    2.  ]
softmax: [0.659 0.242 0.099]
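The softmax above subtracts the largest logit before exponentiating. That shift does not change the result, but it prevents overflow when logits are large. The comparison below is a sketch; softmax_naive and the large logits are assumptions made to trigger the problem on purpose.

```python
import numpy as np

def softmax_naive(logits):
    exp_values = np.exp(logits)
    return exp_values / np.sum(exp_values)

def softmax_stable(logits):
    shifted = logits - np.max(logits)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

small = np.array([2.0, 1.0, 0.1])
large = np.array([1000.0, 1001.0, 1002.0])

# For moderate logits the two versions agree
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True

# exp(1000) overflows to inf in float64, so the naive version returns nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
print(np.isnan(naive_large).all())  # True

stable_large = softmax_stable(large)
print(np.round(stable_large, 3).tolist())  # [0.09, 0.245, 0.665]
```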

Mini Network Example

The next example shows one hidden layer with ReLU and one softmax output layer. It is not a full training loop, but it demonstrates how activations appear during forward propagation.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(logits):
    shifted = logits - np.max(logits)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

# Two input features: pages and keyword density
x = np.array([0.6, 0.9])

# Hidden layer: 2 inputs -> 3 neurons
W1 = np.array([
    [0.8, -0.4, 0.3],
    [0.2, 0.9, -0.5],
])
b1 = np.array([0.1, 0.0, 0.2])

# Output layer: 3 hidden activations -> 3 classes
W2 = np.array([
    [1.1, -0.3, 0.2],
    [-0.2, 0.8, 0.4],
    [0.5, 0.1, 0.7],
])
b2 = np.array([0.0, 0.1, -0.1])

hidden_raw = x @ W1 + b1
hidden_active = relu(hidden_raw)
logits = hidden_active @ W2 + b2
probabilities = softmax(logits)

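print("hidden raw:   ", np.round(hidden_raw, 3))     # approx. 0.76, 0.57, -0.07
print("hidden active:", np.round(hidden_active, 3))  # approx. 0.76, 0.57, 0.0 (ReLU zeroed the negative value)
print("class probs:  ", np.round(probabilities, 3))  # approx. 0.432, 0.291, 0.277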

Choosing the Right Activation

A practical rule is to choose hidden-layer activations for learning behavior and output-layer activations for prediction meaning.

Hidden layers in simple networks - start with ReLU
Hidden layers in transformer-style models - try GELU or Swish
Binary classification output - use sigmoid
Multiclass classification output - use softmax
Multi-label classification output - use one sigmoid per label
Regression output - use linear activation unless the target range must be constrained
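The difference between the multiclass and multi-label rows can be sketched with the same logits passed through both functions. The label names, logits, and the 0.5 threshold below are hypothetical values picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    shifted = logits - np.max(logits)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

# Hypothetical logits for three tags that may apply at the same time
labels = ["urgent", "billing", "bug"]
logits = np.array([2.0, 1.5, -1.0])

independent = sigmoid(logits)  # one probability per label
competing = softmax(logits)    # probabilities forced to sum to 1

threshold = 0.5  # assumed decision threshold
predicted = [name for name, p in zip(labels, independent) if p >= threshold]

print("sigmoid:", np.round(independent, 3))
print("softmax:", np.round(competing, 3))
print(predicted)  # ['urgent', 'billing']
```

With sigmoid, two labels can both score above the threshold. With softmax, raising one probability necessarily lowers the others, which suits one-of-many problems but not independent tags.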

Common Mistakes

Using sigmoid everywhere - it can saturate and slow hidden-layer learning
Applying softmax before cross-entropy - many library loss functions (for example, PyTorch's CrossEntropyLoss) expect raw logits and apply a stable log-softmax internally, so an extra softmax distorts the loss
Using softmax for multi-label tasks - softmax assumes classes compete, while multi-label tasks need independent probabilities
Ignoring the output range - the final activation should match what the prediction means
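The softmax and cross-entropy mistake is easier to spot with a small example of what "cross-entropy from logits" means. This is a NumPy sketch under assumed names (log_softmax, cross_entropy_from_logits), not the internals of any specific library.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

def log_softmax(logits):
    # Stable log-softmax: shift by the max, then subtract log-sum-exp
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy_from_logits(logits, target_index):
    # Negative log-probability of the correct class
    return -log_softmax(logits)[target_index]

logits = np.array([2.0, 1.0, 0.1])

correct = cross_entropy_from_logits(logits, target_index=0)
print(round(float(correct), 3))  # 0.417

# Mistake: passing probabilities where logits are expected,
# which applies softmax twice and changes the loss value
double = cross_entropy_from_logits(softmax(logits), target_index=0)
print(round(float(double), 3))
```

The second value differs from the first because the probabilities were squashed a second time, so the loss no longer reflects the model's actual confidence.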

Activation Function FAQ

What is an activation function in a neural network?

An activation function is a mathematical operation that transforms a neuron's weighted input into the output passed to the next layer. It gives neural networks the non-linear behavior they need to learn complex patterns.

Which activation function should I use in hidden layers?

ReLU is a strong default for many feed-forward and convolutional neural networks. GELU and Swish are common choices in transformer-style models, especially when smoother gradients are useful.

When should I use softmax?

Use softmax in the output layer when exactly one class should be selected from multiple classes, such as image classification, document classification, or intent prediction.

Conclusion

Activation functions are small mathematical choices with large consequences. They decide how neurons pass information forward, how gradients move backward, and what kind of patterns the network can represent.

For most projects, start with ReLU or GELU in hidden layers. Use sigmoid for binary outputs, softmax for one-of-many classification, and linear activation for open-ended numeric predictions. Once that foundation is clear, experimenting with alternatives becomes much easier.
