This project focuses on classifying light curves of astronomical transients using recurrent neural networks.
If you are interested in the implementation and results and would like to skip the motivation and introduction to RNNs and LSTMs, you can jump directly to the project section HERE. In addition, the actual code used for this project is on my GitHub.
Motivation
Astronomical transients are sources that suddenly appear in the sky and evolve on timescales of days to months (technically, these are often referred to as slow transients). Because we never know in advance where they will occur, the only practical way to discover them is through wide-field surveys that repeatedly scan the same parts of the sky and report photometric measurements. What we get from these surveys is a light curve (flux as a function of time). Different transients evolve differently, and with experience you start to develop intuition just by looking at their temporal behavior. But intuition is not classification.
To truly determine the physical origin of a transient we usually need optical spectroscopy (flux as a function of wavelength) and we look for specific emission or absorption features. The problem is that spectroscopy is expensive. Modern surveys discover far more candidates than we can realistically follow up. That means we have to decide which objects are worth spending precious telescope time on.
Two important classes of optical transients are Supernovae (SNe) and Tidal Disruption Events (TDEs). In this post I focus only on their optical light curves (if you are curious about the physics behind them, see my Research page). TDEs are rare compared to SNe. As part of my work, I regularly go through newly discovered sources and try to assess whether they are promising TDE candidates. One of the main contaminants in TDE searches is SNe, and in particular Type Ia SNe (as opposed to Type II, Ib, and Ic; yes, astronomers are extremely creative with naming conventions).
There are a few recurring differences between SNe Ia and TDEs that we rely on:
- **Timescale.** SNe Ia typically peak about 10–20 days after first light and decline relatively quickly. TDEs can peak later and evolve more slowly.
- **Color evolution (g−r).** When observed in two optical bands (for example g and r), their color evolution differs. TDEs tend to remain blue for longer periods, while SNe Ia cool significantly after peak, which manifests as measurable changes in their color.
In practice, most classification pipelines rely on engineered features and heuristic cuts, such as proximity to the center of the host galaxy or constraints on the rise time (e.g., TDEscore (1), NEEDLE (2)). These approaches work, but they encode specific physical assumptions.
At some point I started asking a simpler question:
If I ignore everything except the alert photometry light curve itself — no host galaxy information, no spatial priors — how far can we get?
That question is what led me to explore recurrent neural networks for transient classification.
RNNs and LSTMs
If the only information I allow myself to use is the light curve (a sequence of measurements ordered in time) then the model I choose must respect that temporal structure.
Recurrent Neural Networks (RNNs) are designed exactly for this setting. Unlike standard feed-forward networks, which treat each input independently, RNNs process sequences step by step while maintaining an internal hidden state. This hidden state acts as a form of memory, allowing the model to incorporate information from previous time steps when interpreting the current one. I like to think of a vanilla RNN as applying the same small neural network repeatedly across time, passing along a compressed summary of what it has seen so far.
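The "same small network applied repeatedly" picture can be made concrete in a few lines. This is an illustrative sketch of a vanilla RNN step (not the project's code): the same weight matrices are reused at every time step, and the hidden state carries a compressed summary of everything seen so far.

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim = 3, 5
# The SAME parameters are applied at every time step
W_xh = torch.randn(hidden_dim, input_dim) * 0.1   # input -> hidden
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden -> hidden
b_h = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = torch.zeros(hidden_dim)               # initial memory
sequence = torch.randn(10, input_dim)     # a toy 10-step sequence
for x_t in sequence:
    h = rnn_step(x_t, h)                  # h summarizes the sequence so far
```

The final `h` is the only thing the network remembers about the whole sequence, which is exactly why gradients flowing back through many applications of `W_hh` can vanish or explode.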
However, vanilla RNNs struggle when sequences become long. During training, gradients can either shrink exponentially (vanishing gradients) or grow uncontrollably (exploding gradients). In practice, this makes it difficult for the network to learn long-range dependencies - exactly the type of behavior we care about when light curves evolve over tens to hundreds of days. This limitation motivated more sophisticated recurrent architectures, most notably the Long Short-Term Memory (LSTM) network.
The key idea behind LSTMs is controlled memory. Instead of blindly updating a hidden state, the network learns how much information to keep, forget, and add at each time step through gating mechanisms.
Following the PyTorch convention, an LSTM computes the following gates at time step $t$:
Input gate \(i_t = \sigma \left( W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi} \right)\)
Forget gate \(f_t = \sigma \left( W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf} \right)\)
Cell (candidate) gate \(g_t = \tanh \left( W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg} \right)\)
Output gate \(o_t = \sigma \left( W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho} \right)\)
Here, $\sigma$ is the sigmoid activation and $x_t$ is the input at time $t$, while $h_{t-1}$ is the previous hidden state.
The cell state is then updated as
\[c_t = f_t \odot c_{t-1} + i_t \odot g_t\]

This equation captures the essence of the LSTM:
- $f_t$ determines what to forget from the previous memory,
- $i_t$ controls how much new information to add,
- $g_t$ proposes candidate content to write into memory.
Finally, the hidden state, the quantity that is passed forward and eventually used for prediction, is computed as
\[h_t = o_t \odot \tanh(c_t)\]

In other words, the LSTM maintains an explicit memory vector ($c_t$) that can persist over long timescales, while the hidden state ($h_t$) acts as the filtered, task-relevant representation of that memory.
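Since the equations above follow the PyTorch convention, they can be checked numerically against `torch.nn.LSTMCell`, which stacks the four gates' weights in the order (i, f, g, o). A minimal verification sketch:

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 4, 6
cell = torch.nn.LSTMCell(input_dim, hidden_dim)

x_t = torch.randn(1, input_dim)
h_prev = torch.zeros(1, hidden_dim)
c_prev = torch.zeros(1, hidden_dim)

# PyTorch stores the gate parameters stacked in the order (i, f, g, o)
W_ih = cell.weight_ih.chunk(4)   # (W_ii, W_if, W_ig, W_io)
W_hh = cell.weight_hh.chunk(4)   # (W_hi, W_hf, W_hg, W_ho)
b_ih = cell.bias_ih.chunk(4)
b_hh = cell.bias_hh.chunk(4)

def gate(idx, act):
    return act(x_t @ W_ih[idx].T + b_ih[idx] + h_prev @ W_hh[idx].T + b_hh[idx])

i_t = gate(0, torch.sigmoid)     # input gate
f_t = gate(1, torch.sigmoid)     # forget gate
g_t = gate(2, torch.tanh)        # candidate gate
o_t = gate(3, torch.sigmoid)     # output gate

c_t = f_t * c_prev + i_t * g_t   # cell-state update
h_t = o_t * torch.tanh(c_t)      # hidden state

# The hand-written step matches PyTorch's implementation
h_ref, c_ref = cell(x_t, (h_prev, c_prev))
assert torch.allclose(h_t, h_ref, atol=1e-6)
assert torch.allclose(c_t, c_ref, atol=1e-6)
```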
Classification of TDEs based on alert photometry light curves
I wanted to test how far we could push classification using only alert photometry - no host galaxy information, no spatial priors, no engineered physical features. Just the light curve.
The dataset
Code for this section can be found here.
For this project I use public alert photometry from the Zwicky Transient Facility (ZTF), an all-sky optical survey with a cadence of roughly two days. I intentionally restrict myself to alert photometry only, since the goal is to evaluate how much information is encoded purely in the temporal evolution of the source.
The dataset consists of 66 TDEs and 555 Type Ia SNe. This immediately introduces two challenges:
- The sample size is small, especially for TDEs.
- The classes are highly imbalanced.
The small number of TDEs is a real limitation, and any conclusions here should be interpreted with that caveat in mind. That said, upcoming surveys such as LSST will dramatically increase the number of detected transients, making approaches like this much more powerful in the near future.
Each object is observed in two bands (g and r). For each band we have:
- Time (Julian date)
- Flux
- Flux uncertainty
This gives six raw features per time step across the two bands. I also want to note here that since color (g-r) is an important feature, I tried to pass it explicitly (with a combined mask); however, this resulted in a magnificent failure. I speculate that this is because the data is very noisy, and simply subtracting the g and r bands is not enough to account for that.
A major practical issue is that every object is sampled differently. Some have dense early coverage, others sparse late-time data, and the temporal baseline varies significantly. Neural networks, however, expect uniform tensors.
To standardize the data across objects, I apply the following procedure:
- Find the earliest detection across both bands and set it to $t=0$.
- Construct a common time grid up to 300 days with 1-day cadence.
- Interpolate the observed fluxes onto this grid.
- Create binary masks that indicate whether a given time step corresponds to an actual observation (1) or an interpolated value (0).
This masking step is important: I want the network to benefit from interpolation for tensor consistency, but I do not want it to treat interpolated values as real measurements.
In addition, I allow optional per-object normalization and log-scaling to increase dynamic range stability during training.
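The gridding and masking steps can be sketched for a single band as follows. This is an illustration with made-up observation times and fluxes, not the repository's actual pipeline; the real code handles both bands and the full feature tensor.

```python
import numpy as np

# Hypothetical detections in one band: days since first detection, and flux
t_obs = np.array([0.0, 3.2, 7.4, 20.1, 55.0])
f_obs = np.array([1.0, 4.0, 6.5, 3.0, 1.2])

# Common time grid: 300 days at 1-day cadence, with t=0 at first detection
grid = np.arange(0, 300, 1.0)
flux = np.interp(grid, t_obs, f_obs)   # linear interpolation onto the grid

# Binary mask: 1 where a grid point is within half a day of a real observation,
# 0 where the value is purely interpolated
mask = (np.abs(grid[:, None] - t_obs[None, :]).min(axis=1) <= 0.5).astype(np.float32)

# Optional per-object normalization for training stability
flux = flux / np.abs(flux).max()
```

The network then receives both `flux` and `mask`, so it can use the uniform grid while knowing which entries are real measurements.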
After preprocessing, each time step contains seven features:
- Normalized time
- Flux and flux uncertainty in g
- Flux and flux uncertainty in r
- A mask for g
- A mask for r
This tensor representation is what ultimately enters the network.
The network
Code for this section can be found here.
I use an LSTM network as a binary classifier to distinguish between TDEs and SNe. The overall structure of the model is relatively simple (see the GitHub repository for the exact hyperparameters used in this setup):
- First, the input features are passed through a projection layer - a linear transformation from the input dimension to the hidden dimension. This allows the network to learn an internal representation before entering the recurrent stage.
- The projected features are then passed through the LSTM described above, which processes the sequence step by step and builds a temporal representation of the light curve.
- After the LSTM, I apply masked mean pooling. Concretely, I multiply the LSTM outputs by the observational masks to suppress contributions from time steps that correspond only to interpolated values. I then average over the valid time steps. This ensures that the network focuses only on actual measurements while still benefiting from a uniform tensor structure.
- Finally, the pooled representation is passed through a small feed-forward head consisting of:
  - A linear layer
  - A ReLU activation
  - A dropout layer
  - A final linear layer that produces the logits
The output of the network is therefore a single logit per object, which is later converted into a probability of being a TDE.
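The architecture described above can be sketched as a small `nn.Module`. The hyperparameter values here (hidden size, dropout rate) are illustrative placeholders; the actual values are in the repository.

```python
import torch
import torch.nn as nn

class LightCurveLSTM(nn.Module):
    def __init__(self, input_dim=7, hidden_dim=64, dropout=0.3):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)            # projection layer
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(                              # feed-forward head
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1),                      # single logit
        )

    def forward(self, x, mask):
        # x: (batch, time, features); mask: (batch, time), 1 = real observation
        h, _ = self.lstm(self.proj(x))
        # Masked mean pooling: average LSTM outputs over observed steps only
        m = mask.unsqueeze(-1)
        pooled = (h * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        return self.head(pooled).squeeze(-1)

model = LightCurveLSTM()
x = torch.randn(2, 300, 7)                      # batch of 2 gridded light curves
mask = (torch.rand(2, 300) < 0.1).float()       # ~10% of steps are real detections
logits = model(x, mask)                         # one logit per object
```

The `clamp(min=1.0)` guards against division by zero for an object with no valid steps in a band.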
The training
Code for this section can be found here.
During training, I combine a sigmoid activation with a binary cross-entropy loss, implemented using torch.nn.BCEWithLogitsLoss. This formulation is numerically stable and directly optimizes the binary classification objective.
The optimizer used is AdamW with weight decay (see the GitHub repository for the exact hyperparameters used in this setup).
To account for the strong class imbalance, I use stratified splits to preserve the TDE/SN ratio in both training and validation sets. To further mitigate class imbalance, I use the pos_weight argument in BCEWithLogitsLoss, scaling the loss contribution of the minority (TDE) class.
I choose a 70/30 train/validation split instead of the more traditional 80/20. Because the number of TDEs is small, reducing the validation set too much would result in very few TDEs per fold, making performance estimates noisy. With a 30% validation fraction, we retain a meaningful number of TDEs in validation. Finally, I use 5-fold cross-validation to obtain more robust performance statistics and reduce sensitivity to a particular data split.
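The loss and optimizer setup can be sketched as follows. The `pos_weight` value shown (the SN/TDE count ratio) and the learning rate are illustrative choices; the exact hyperparameters are in the repository, and a plain linear layer stands in for the LSTM model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Upweight the minority (TDE) class by the class ratio
n_sne, n_tde = 555, 66
pos_weight = torch.tensor([n_sne / n_tde])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

model = nn.Linear(7, 1)   # stand-in for the LSTM classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# One illustrative training step on random data
features = torch.randn(16, 7)
labels = torch.randint(0, 2, (16,)).float()   # 1 = TDE, 0 = SN Ia

optimizer.zero_grad()
logits = model(features).squeeze(-1)
loss = criterion(logits, labels)              # sigmoid + BCE, numerically stable
loss.backward()
optimizer.step()
```

Because `BCEWithLogitsLoss` fuses the sigmoid into the loss, the model outputs raw logits and probabilities are only computed at evaluation time.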
Results
All of the analysis presented here is computed in this notebook.
Using the method described above, we are able to distinguish Type Ia SNe from TDEs with a balanced accuracy of $0.89 \pm 0.02$ across folds.
Because the dataset is highly imbalanced, raw accuracy is not an informative metric. Instead, we use balanced accuracy, defined as
\[\frac{1}{2} (TPR + TNR),\]

where

\[TPR = \frac{TP}{TP + FN}, \quad TNR = \frac{TN}{TN + FP}.\]

Here, we define the positive class as TDE and the negative class as SN Ia.
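Balanced accuracy follows directly from the confusion-matrix counts. The counts below are hypothetical, chosen only to roughly match the fold-averaged metrics reported here; they are not the actual validation counts.

```python
# Hypothetical confusion-matrix counts for one fold (TDE = positive class)
tp, fn = 18, 2       # TDEs correctly / incorrectly classified
tn, fp = 140, 18     # SNe Ia correctly / incorrectly classified

tpr = tp / (tp + fn)                  # recall on TDEs
tnr = tn / (tn + fp)                  # recall on SNe Ia
balanced_accuracy = 0.5 * (tpr + tnr)

# With strong class imbalance, precision is dragged down even at high recall
precision = tp / (tp + fp)
```

Note how a handful of SN Ia false positives is enough to halve the TDE precision, simply because SNe Ia outnumber TDEs by nearly an order of magnitude.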
The recall for each class is relatively high ($0.90 \pm 0.04$ for TDEs and $0.88 \pm 0.06$ for SNe Ia). However, the precision for TDEs is lower (approximately $0.5$), resulting in an F1 score of $0.63 \pm 0.08$. Given that the scientific goal is to avoid missing TDEs, this tradeoff is acceptable: we recover roughly 90% of TDEs while keeping the number of false positives at a manageable level for follow-up.
The classifier threshold also plays an important role. By increasing the decision threshold from $0.5$ to $0.7$, the TDE precision improves to approximately $0.6$, at the cost of lowering TDE recall to about $0.78$. This illustrates the fundamental precision–recall tradeoff and highlights that the optimal operating point depends on the scientific objective.
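A threshold sweep of this kind is easy to sketch. The probabilities below are synthetic (drawn from beta distributions so that TDEs skew toward high scores), purely to illustrate the mechanics of the tradeoff, not the classifier's actual outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic validation-set probabilities: 60 TDEs skew high, 500 SNe Ia skew low
probs = np.concatenate([rng.beta(5, 2, 60), rng.beta(2, 5, 500)])
labels = np.concatenate([np.ones(60), np.zeros(500)])

results = {}
for thresh in (0.5, 0.7):
    pred = probs >= thresh
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    results[thresh] = {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```

Raising the threshold can only shrink the set of predicted positives, so TDE recall is guaranteed to be non-increasing, while precision typically improves as the less confident (and more often wrong) candidates drop out.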
Conclusions and future efforts
In this work, I used an LSTM-based classifier to distinguish between TDEs and Type Ia SNe using highly imbalanced and sparsely sampled optical light curves. Despite the limited sample size and the intentionally restricted feature set (alert photometry only), the classifier achieves strong balanced accuracy and high recall for the minority class.
Many operational classifiers used by survey teams incorporate substantially more information, including host galaxy properties, spatial offsets, and engineered physical features, and reach comparable overall accuracy. These approaches often achieve higher precision, particularly for partial light curves.
The goal here, however, was slightly different: to isolate the information content of the temporal evolution itself. The fact that a relatively simple recurrent architecture can perform this well suggests that a large fraction of the discriminative signal is already encoded in the light curve shape and color evolution.
Looking forward, there are several natural extensions:
- **Training on partial light curves.** A major practical improvement would be to train the model to classify objects at different stages of their evolution. Early-time classification is scientifically valuable, as it enables rapid follow-up and time-critical observations.
- **Exploring alternative architectures.** Transformer-based models have recently been applied to transient classification (e.g., ATCAT (3)) and have demonstrated comparable recall with improved precision and overall F1 score. It would be interesting to directly compare recurrent and attention-based approaches under identical data constraints.
- **Using Large Language Models (LLMs) for structured scientific reasoning.** At first glance, LLMs may seem like overkill for transient classification. However, unlike standard neural classifiers, they can generate explicit reasoning in natural language: for example, describing rise time behavior, color evolution, or deviations from typical templates. While such reasoning does not guarantee physical correctness, it offers an intriguing possibility: combining automated classification with interpretable, human-readable justification. If carefully constrained and validated, this approach could complement purely numerical classifiers and potentially change how we prioritize follow-up decisions.
Ultimately, the motivation is not just to improve a metric, but to better allocate limited observational resources. As time-domain surveys continue to scale, automated and physically informed classification tools will become increasingly essential.
References
(1) Robert Stein et al. 2024, ApJL, 965, L14
(2) Xinyue Sheng et al. 2024, MNRAS, 531, 2
(3) Zora Tung 2025, arXiv:2511.00614