<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Anurag builds things</title>
<link>https://anuragbuildsthings.com/</link>
<atom:link href="https://anuragbuildsthings.com/index.xml" rel="self" type="application/rss+xml"/>
<description>Projects, experiments, and systems built in public.</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Wed, 18 Mar 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Self-Supervised Learning</title>
  <link>https://anuragbuildsthings.com/posts/self-supervised-learning.html</link>
  <description><![CDATA[ 





<section id="what-it-is" class="level2">
<h2 class="anchored" data-anchor-id="what-it-is">What It Is</h2>
<p>Self-supervised learning (SSL) trains models on unlabeled data by generating supervision from the data itself. Instead of human-provided labels, the model solves a <em>pretext task</em> – a proxy objective that forces it to learn useful structure.</p>
<p>Examples of pretext tasks:</p>
<ul>
<li><strong>Contrastive</strong>: pull augmented views of the same image together, push different images apart (SimCLR, MoCo)</li>
<li><strong>Masked prediction</strong>: mask part of the input and predict it (BERT, MAE)</li>
<li><strong>Predictive</strong>: predict future frames, next tokens, or missing patches</li>
</ul>
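<p>The masked-prediction idea can be sketched without any model at all. A toy NumPy example (function name hypothetical) showing how unlabeled data supplies its own supervision: masked positions are hidden in the input, and their original values become the targets.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_prediction_pair(x, mask_ratio=0.25):
    """Turn unlabeled data into a (corrupted input, target) pair.

    The supervision comes from the data itself: masked positions are
    zeroed in the model input, and their original values become the
    reconstruction targets -- no human labels involved.
    """
    mask = rng.random(x.shape) < mask_ratio   # True where values are hidden
    corrupted = np.where(mask, 0.0, x)        # what the model sees
    targets = x[mask]                         # what the model must predict
    return corrupted, mask, targets

x = rng.normal(size=(4, 16))                  # a batch of unlabeled vectors
corrupted, mask, targets = masked_prediction_pair(x)
```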
</section>
<section id="intuition" class="level2">
<h2 class="anchored" data-anchor-id="intuition">Intuition</h2>
<p>Labels are expensive. Structure is free.</p>
<p>Images have spatial coherence. Text has sequential coherence. Video has temporal coherence. SSL exploits these natural regularities to learn representations that capture what matters in the data – without anyone telling the model what to look for.</p>
<p>The key insight: a model that can solve a hard pretext task (e.g., reconstruct a masked image region) must have learned something meaningful about the domain.</p>
</section>
<section id="simple-example" class="level2">
<h2 class="anchored" data-anchor-id="simple-example">Simple Example</h2>
<p>Take an image. Crop it twice and apply different augmentations to each crop. The model must learn that both crops came from the same source. To do this, it has to understand <em>content</em> (what’s in the image) and ignore <em>style</em> (color jitter, rotation, scale).</p>
<p>The result: an encoder that maps semantically similar inputs to nearby points in embedding space – without ever seeing a label.</p>
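<p>A toy NumPy illustration of that target geometry. The encoder is faked here by adding small noise to a shared “content” vector; a real encoder would produce the embeddings, but the goal is the same: views of one source land close together on the unit sphere, unrelated inputs do not.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(v):
    return v / np.linalg.norm(v)

# Stand-ins for encoder outputs: two "views" share content; one image is unrelated.
content = rng.normal(size=64)
view_a = l2_normalize(content + 0.1 * rng.normal(size=64))  # augmentation noise
view_b = l2_normalize(content + 0.1 * rng.normal(size=64))
other = l2_normalize(rng.normal(size=64))                   # a different image

pos_sim = float(view_a @ view_b)   # same source: cosine similarity near 1
neg_sim = float(view_a @ other)    # different source: near 0
```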
</section>
<section id="why-it-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-it-matters">Why It Matters</h2>
<ul>
<li><strong>Scale</strong>: unlabeled data is orders of magnitude more available than labeled data</li>
<li><strong>Transfer</strong>: SSL representations often transfer better than supervised ones to new domains</li>
<li><strong>Foundation models</strong>: GPT, CLIP, DINO – the most capable models are pretrained with self-supervision</li>
<li><strong>Cost</strong>: eliminates the annotation bottleneck, especially for domains where labeling requires expertise (medical imaging, satellite data)</li>
</ul>
<p>SSL is not a niche technique. It is the default pretraining paradigm for modern AI systems.</p>


</section>

]]></description>
  <category>self-supervised-learning</category>
  <category>representation-learning</category>
  <category>fundamentals</category>
  <guid>https://anuragbuildsthings.com/posts/self-supervised-learning.html</guid>
  <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Learning Representations Without Labels (SimCLR)</title>
  <link>https://anuragbuildsthings.com/posts/ssl-representation.html</link>
  <description><![CDATA[ 





<section id="goal" class="level2">
<h2 class="anchored" data-anchor-id="goal">Goal</h2>
<p>Learn visual representations without labels using contrastive learning. Specifically, implement SimCLR (Simple Framework for Contrastive Learning of Visual Representations) from scratch and evaluate the quality of learned embeddings on CIFAR-10.</p>
</section>
<section id="plan" class="level2">
<h2 class="anchored" data-anchor-id="plan">Plan</h2>
<p>Implement SimCLR on CIFAR-10 and validate:</p>
<ul>
<li>Can the model learn useful representations without labels?</li>
<li>How sensitive is performance to augmentations?</li>
<li>How does batch size affect learning?</li>
</ul>
</section>
<section id="initial-setup" class="level2">
<h2 class="anchored" data-anchor-id="initial-setup">Initial Setup</h2>
<ul>
<li>Encoder: ResNet-18</li>
<li>Projection head: 2-layer MLP</li>
<li>Loss: NT-Xent</li>
<li>Dataset: CIFAR-10</li>
</ul>
<p>Hyperparameters will be tuned incrementally during experiments.</p>
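<p>For reference while implementing, here is one minimal NumPy sketch of NT-Xent under the common convention that consecutive rows are the two views of one image. This is an unverified sketch, not the training code:</p>

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent loss for 2N L2-normalized embeddings, ordered so that
    z[2k] and z[2k+1] are the two views of the same image."""
    n = z.shape[0]
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)   # a sample is never its own negative
    partner = np.arange(n) ^ 1       # positive index: 0<->1, 2<->3, ...
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(n), partner]))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))         # toy batch: 4 images x 2 views
z /= np.linalg.norm(z, axis=1, keepdims=True)
loss = nt_xent(z)
```

<p>Since the positive pair also appears in the denominator, the loss is strictly positive; it shrinks as partner views align and negatives spread apart.</p>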
</section>
<section id="what-i-expect" class="level2">
<h2 class="anchored" data-anchor-id="what-i-expect">What I Expect</h2>
<ul>
<li>Augmentations will be critical for learning meaningful representations</li>
<li>Larger batch sizes may improve performance (more negative samples)</li>
<li>Training stability may depend on temperature and normalization</li>
</ul>
</section>
<section id="next-steps" class="level2">
<h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2>
<ul>
<li>Implement data pipeline and augmentations</li>
<li>Implement NT-Xent loss</li>
<li>Run first small-scale training</li>
</ul>
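<p>The two-view step of the data pipeline can be prototyped in plain NumPy before wiring up real transforms (actual training would use torchvision-style crop, flip, color jitter, and blur; the helpers below are hypothetical stand-ins):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Crop a random size x size window from an HWC image."""
    top = rng.integers(0, img.shape[0] - size + 1)
    left = rng.integers(0, img.shape[1] - size + 1)
    return img[top:top + size, left:left + size]

def random_flip(img, p=0.5):
    """Flip horizontally with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def two_views(img, size=24):
    """Augment the same image twice, independently: the SimCLR positive pair."""
    return (random_crop(random_flip(img), size),
            random_crop(random_flip(img), size))

img = rng.random((32, 32, 3))        # a CIFAR-10-sized dummy image
v1, v2 = two_views(img)
```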
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><a href="https://arxiv.org/abs/2002.05709">SimCLR Paper (Chen et al., 2020)</a></li>
<li><a href="https://arxiv.org/abs/1708.03888">LARS Optimizer (You et al., 2017)</a></li>
</ul>


</section>

]]></description>
  <category>self-supervised-learning</category>
  <category>contrastive-learning</category>
  <category>computer-vision</category>
  <guid>https://anuragbuildsthings.com/posts/ssl-representation.html</guid>
  <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Why Augmentations Matter in Contrastive Learning</title>
  <link>https://anuragbuildsthings.com/posts/why-augmentations-matter.html</link>
  <description><![CDATA[ 





<section id="status" class="level2">
<h2 class="anchored" data-anchor-id="status">Status</h2>
<p>Planned experiment – not yet executed.</p>
</section>
<section id="hypothesis" class="level2">
<h2 class="anchored" data-anchor-id="hypothesis">Hypothesis</h2>
<p>Augmentations are the most critical design choice in contrastive learning. They define what invariances the model learns – get them wrong and the representations are useless, regardless of architecture or training budget.</p>
</section>
<section id="setup" class="level2">
<h2 class="anchored" data-anchor-id="setup">Setup</h2>
<p>Train SimCLR on CIFAR-10 with different augmentation configurations and measure linear probe accuracy:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Config</th>
<th>Augmentations</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Minimal</td>
<td>Random crop only</td>
</tr>
<tr class="even">
<td>Moderate</td>
<td>Crop + horizontal flip + grayscale</td>
</tr>
<tr class="odd">
<td>Full</td>
<td>Crop + color jitter + flip + grayscale + blur</td>
</tr>
<tr class="even">
<td>Aggressive</td>
<td>Full + extreme crop ratios + strong color distortion</td>
</tr>
</tbody>
</table>
<p>All other hyperparameters held constant (ResNet-18, batch 512, 200 epochs, temperature 0.5).</p>
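<p>The evaluation metric can be sketched cheaply: a linear probe fits a classifier on frozen embeddings and scores held-out data. A least-squares version (a simplification of the usual logistic-regression probe, shown here on synthetic embeddings) looks like:</p>

```python
import numpy as np

def linear_probe_accuracy(train_z, train_y, test_z, test_y, n_classes=10):
    """Fit a linear classifier on frozen embeddings by least squares
    against one-hot targets, then score it on held-out embeddings."""
    targets = np.eye(n_classes)[train_y]          # one-hot labels
    w, *_ = np.linalg.lstsq(train_z, targets, rcond=None)
    preds = (test_z @ w).argmax(axis=1)
    return float((preds == test_y).mean())

# Synthetic "embeddings" clustered by class, standing in for encoder outputs.
rng = np.random.default_rng(0)
means = rng.normal(size=(10, 64))                 # one center per class
y_train = rng.integers(0, 10, size=500)
z_train = means[y_train] + 0.1 * rng.normal(size=(500, 64))
y_test = rng.integers(0, 10, size=200)
z_test = means[y_test] + 0.1 * rng.normal(size=(200, 64))

acc = linear_probe_accuracy(z_train, y_train, z_test, y_test)
```

<p>Well-clustered embeddings probe near 100% accuracy; the interesting signal is how far each augmentation config falls short of that.</p>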
</section>
<section id="expected-outcome" class="level2">
<h2 class="anchored" data-anchor-id="expected-outcome">Expected Outcome</h2>
<ul>
<li><strong>Minimal</strong> augmentations will produce weak representations – the contrastive task becomes too easy (trivial shortcuts, such as matching low-level color statistics between overlapping crops, suffice)</li>
<li><strong>Full</strong> augmentations will force the model to learn semantic features, producing the strongest embeddings</li>
<li><strong>Aggressive</strong> augmentations may hurt early training by making the pretext task too hard</li>
</ul>
</section>
<section id="why-this-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-this-matters">Why This Matters</h2>
<p>Most SSL papers treat augmentations as a hyperparameter table in the appendix. In practice, they are the experiment. The augmentation pipeline implicitly defines what the model treats as “same” vs “different” – which is the entire learning signal in contrastive methods.</p>
<p>Understanding this connection between augmentations and learned invariances is essential before scaling to harder domains (video, medical imaging) where the right invariances are less obvious.</p>


</section>

]]></description>
  <category>self-supervised-learning</category>
  <category>contrastive-learning</category>
  <category>augmentations</category>
  <guid>https://anuragbuildsthings.com/posts/why-augmentations-matter.html</guid>
  <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
