Voxel - Image Clip Alignment

The ability to read a person’s mind offers numerous potential applications. Imagine the incredible possibilities for people who have lost the use of some limbs to effectively control robotic prostheses as if they were their original limbs. This technology could also be especially beneficial for those with locked-in syndrome, providing them with a means of communication with the world beyond simply relying on eye tracking. While it’s true that, as with many new advances in expanding the modalities machine learning algorithms are interacting with, there is the potential for negative consequences in the future, currently the advances in this field are leading to incremental positives. Therefore, it is worth pursuing this avenue further.

In this project, we tackled the task of matching brain scans to images that people were viewing during the scan, effectively “reading their minds.” To accomplish this, we leveraged contrastive loss and the CLIP model framework. CLIP, “efficiently learns visual concepts from natural language supervision” and “can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized” by encoding both images and text into their own latent spaces where the property holds that corresponding image and text embeddings will have a high cosine similarity. The inspiration for our method came from the LIT paper, which found that by freezing one embedding model in CLIP and substituting a new, untrained embedding model for the other, the performance matched the original model. This implies that the two embedding models do not need to be trained in parallel and the embeddings produced can be more generally applicable. We extended the method in the LIT paper by substituting one of the embedding towers with a tower that encodes a completely different modality - in our case, an intensity grid of 3D voxels produced during an MRI. Our hope was that by training this new model to align with the pre-trained image embedding model, we would achieve high performance with the limited amount of training data we had.

The task of training a mind-reading model requires a specific and difficult-to-acquire dataset. Gathering datasets for this kind of task is especially difficult due to the high bar of entry for taking high resolution brain scans. To overcome this obstacle, we used the Natural Scenes dataset which provided us with scan-image pairs for 8 subjects, resulting in a total of 8000 pairs. However, this limited dataset forced us to make design choices for the network that could potentially hinder its performance. For example, to prevent overfitting, we greatly reduced the depth of the convolutional head and found that low batch sizes and high dropout rates yielded the best results. We hypothesize that the only reason it generalized as well as it did was because we were aligning with a powerful CLIP embedding.

In the development of our network, the voxel encoder network named VoxelCLIP, we went through multiple iterations before arriving at our final version. Our initial version, a fully convolutional network with no residual connections, achieved only a 10% top 5 recall on the validation set. Our second attempt, a full transformer acting on a flattened representation, was too large and overfit the data. We eventually landed on a 4-layer convolution head for feature extraction, followed by two attention layers with residual connections to increase the receptive field. This version achieved our best results, with a top 5 recall score of 60% given 500 options, meaning the network was able to predict the image being seen within 5 guesses 60% of the time. Additionally, this network allows us to perform clustering tasks, providing insights into the similarity of images in the MRI latent space.

The next step in utilizing the trained MRI encoder for generative tasks is to apply a technique similar to DALLE2, which would convert the MRI embeddings into image embeddings that can then be decoded into actual images. While I was not able to pursue this further, members of my team are still working towards this goal, demonstrating the potential for even more advanced applications of this technology.

The technology developed in this project has the potential to be incredibly useful in the future. By training a new encoder for MRI scans, we are able to “read” people’s visual thoughts. This has numerous practical applications, such as aligning images with text to determine what words a person is thinking. While this project was primarily an exploration of model design and the trade-offs that must be made when working with a constrained dataset, it is easy to see how this technology could be applied in the real world. Additionally, this project allowed me to gain valuable skills in model design and working with constrained datasets.

Work in progress - Please forgive the dust…