Anyscale and deepsense.ai have jointly developed a fashion image retrieval system built on multimodal AI. The collaboration aims to give e-commerce platforms a scalable way to let users search for products with either text or image inputs.
Introduction
According to Anyscale, the project employs a modular and service-oriented design, facilitating easy customization and scalability. The core technology revolves around Contrastive Language-Image Pre-training (CLIP) models to generate text and image embeddings, which are then indexed using Pinecone.
Application Overview
E-commerce search results are often inaccurate because product metadata is inconsistent. By adding text-to-image and image-to-image search, the system matches a query against the product images themselves rather than relying on that metadata alone, narrowing the gap between user intent and available inventory. The application runs on scalable data pipelines and backend services powered by Anyscale, which are designed to keep performance steady under heavy load.
Multi-modal Embeddings
The backend generates embeddings with CLIP models and stores them in a vector database, which enables efficient similarity search. The process includes:
- Preparing the dataset, such as the InFashAIv1 dataset, which includes images and descriptions.
- Creating text and image embeddings using CLIP.
- Indexing these embeddings in Pinecone.
Both OpenAI's original CLIP model and fine-tuned variants such as FashionCLIP are used. Each captures the nuances of a different domain, which improves search accuracy.
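The similarity search that these embeddings enable can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking metric and uses hypothetical three-dimensional vectors in place of real CLIP embeddings; the names `cosine_similarity`, `top_k`, and the product IDs are illustrative only and do not come from the Anyscale project. A vector database such as Pinecone performs this kind of ranking at scale, so the plain dictionary here is only a stand-in.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Rank every indexed vector against the query and keep the best k."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors standing in for CLIP image embeddings.
index = {
    "red-dress": [0.9, 0.1, 0.0],
    "blue-jeans": [0.0, 0.2, 0.9],
    "red-skirt": [0.8, 0.3, 0.1],
}

# Hypothetical embedding of a text query such as "red clothing".
query_embedding = [1.0, 0.0, 0.0]
results = top_k(query_embedding, index, k=2)
print([item_id for item_id, _ in results])
```

Because text and image embeddings from a CLIP model share one vector space, the same ranking function would serve both text-to-image and image-to-image queries.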
A Scalable Data Pipeline
Ray Data is used for distributed data processing. The pipeline covers data ingestion, processing, embedding generation, and vector upserting. Distributing these stages keeps the pipeline scalable and efficient as datasets grow.
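The four stages can be sketched as a single-process pipeline. This is not Ray Data code: it is a plain-Python stand-in, with hypothetical function names and a placeholder embedding (character and word counts) where a real pipeline would call a CLIP model. The in-memory dictionary stands in for the vector database.

```python
from itertools import islice


def ingest(records):
    """Stage 1: yield raw product records (stand-in for reading a dataset)."""
    yield from records


def preprocess(record):
    """Stage 2: normalize the description text."""
    return {"id": record["id"], "text": record["description"].strip().lower()}


def embed_batch(batch):
    """Stage 3: placeholder embedding; a real pipeline would run CLIP here."""
    return [
        {"id": r["id"], "values": [float(len(r["text"])), float(len(r["text"].split()))]}
        for r in batch
    ]


def batched(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_pipeline(records, store, batch_size=2):
    """Stage 4: upsert each embedding, inserting or overwriting by ID."""
    cleaned = (preprocess(r) for r in ingest(records))
    for batch in batched(cleaned, batch_size):
        for vector in embed_batch(batch):
            store[vector["id"]] = vector["values"]
    return store


records = [
    {"id": "a", "description": "  Red Dress "},
    {"id": "b", "description": "Blue denim jeans"},
    {"id": "c", "description": "Hat"},
]
store = run_pipeline(records, {})
print(store)
```

Processing in batches matters because embedding models are usually far more efficient on batched input. In a distributed framework, the batches would additionally be spread across workers.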
Application Architecture
The application is built using Ray Serve deployments, allowing for easy scaling and maintenance. The architecture includes several components:
- GradioIngress: The frontend service providing an intuitive UI for users.
- Multimodal Similarity Search Service: The backend API handling search requests.
- Image and Text Search Services: Independent services for processing image and text queries.
- Pinecone: The vector database storing embeddings for efficient search.
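The way these components fit together can be sketched in plain Python. The class names, the toy encoders, and the in-memory store below are all hypothetical stand-ins: in the real application each service would be a Ray Serve deployment, the encoders would be CLIP models, and the store would be Pinecone.

```python
class InMemoryVectorStore:
    """Stand-in for the vector database: ranks items by dot product."""

    def __init__(self, vectors):
        self.vectors = vectors

    def query(self, vector, top_k):
        def score(entry):
            return sum(a * b for a, b in zip(vector, entry[1]))

        ranked = sorted(self.vectors.items(), key=score, reverse=True)
        return [item_id for item_id, _ in ranked[:top_k]]


class TextSearchService:
    """Independent service that embeds text queries and searches the store."""

    def __init__(self, store, encode_text):
        self.store = store
        self.encode_text = encode_text

    def search(self, text, top_k=3):
        return self.store.query(self.encode_text(text), top_k)


class ImageSearchService:
    """Independent service that embeds image queries and searches the store."""

    def __init__(self, store, encode_image):
        self.store = store
        self.encode_image = encode_image

    def search(self, image, top_k=3):
        return self.store.query(self.encode_image(image), top_k)


class MultimodalSearchService:
    """Backend API that routes each request to the matching search service."""

    def __init__(self, text_service, image_service):
        self.text_service = text_service
        self.image_service = image_service

    def search(self, query_type, payload, top_k=3):
        if query_type == "text":
            return self.text_service.search(payload, top_k)
        if query_type == "image":
            return self.image_service.search(payload, top_k)
        raise ValueError(f"unsupported query type: {query_type}")


# Toy encoders (assumptions): a real system would call CLIP here.
def toy_encode_text(text):
    return [1.0, 0.0] if "red" in text.lower() else [0.0, 1.0]


def toy_encode_image(channel_means):
    return list(channel_means)


store = InMemoryVectorStore({"red-dress": [1.0, 0.0], "blue-jeans": [0.0, 1.0]})
app = MultimodalSearchService(
    TextSearchService(store, toy_encode_text),
    ImageSearchService(store, toy_encode_image),
)
print(app.search("text", "red summer dress", top_k=1))
print(app.search("image", (0.1, 0.9), top_k=1))
```

Keeping the text and image services separate means each one could be scaled or swapped independently, for example to serve a different CLIP variant, without touching the routing layer.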
Using Fine-tuned vs. Original CLIP
Incorporating both the original and fine-tuned CLIP models allows for comprehensive search results. While OpenAI's CLIP focuses on specific clothing items, FashionCLIP offers a holistic understanding of outfits, capturing the overall vibe and style.
Conclusion
This collaboration between Anyscale and deepsense.ai provides a practical roadmap for building scalable, efficient, and intuitive image retrieval systems for e-commerce. By leveraging advanced AI models and scalable infrastructure, the solution addresses the challenges of metadata inconsistency and enhances the user experience.
Future Work
Future efforts will focus on exploring new multi-modal models like LLaVA and PaliGemma to further improve retail and e-commerce systems. These advancements aim to enhance personalized recommendations, product insights, and customer interactions.