🎯 Objective
Master the "eyes and ears" of the device. You will learn to combine the Vision and Speech frameworks
with Foundation Models to create multimodal experiences.
Daily Breakdown
Day 1
Advanced Vision & Object Tracking
Topic: Real-time visual understanding.
Tasks:
- Implement VNRecognizeTextRequest and VNDetectBarcodesRequest using the latest Swift Concurrency APIs (see the sketch below).
- Use Object Tracking to maintain state across video frames.
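A minimal sketch of the text and barcode tasks, run from an async context; the helper name and the detached-task wrapper are illustrative, and a CGImage from the camera or photo library is assumed. The same flow maps onto the newer async-native Vision request types where available.

```swift
import Vision

// Run text recognition and barcode detection off the main actor.
// Hypothetical helper; assumes a CGImage already obtained elsewhere.
func recognizeTextAndBarcodes(in image: CGImage) async throws -> (text: [String], barcodes: [String]) {
    try await Task.detached(priority: .userInitiated) {
        let textRequest = VNRecognizeTextRequest()
        textRequest.recognitionLevel = .accurate

        let barcodeRequest = VNDetectBarcodesRequest()

        // VNImageRequestHandler.perform is synchronous, so it runs inside the detached task.
        let handler = VNImageRequestHandler(cgImage: image, options: [:])
        try handler.perform([textRequest, barcodeRequest])

        let lines = (textRequest.results ?? []).compactMap { $0.topCandidates(1).first?.string }
        let payloads = (barcodeRequest.results ?? []).compactMap { $0.payloadStringValue }
        return (text: lines, barcodes: payloads)
    }.value
}
```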
Day 2
Image Analysis for LLMs
Topic: Transforming Pixels to Tokens.
Tasks:
- Extract visual embeddings from an image (e.g., Vision's image feature prints).
- Pass image descriptions/metadata into the LanguageModelSession to "ask questions" about a photo (see the sketch below).
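One way to wire this up, as a sketch: use Vision classification labels as the metadata and hand them to the model as plain text. LanguageModelSession and respond(to:) follow the Foundation Models API as publicly described; the helper name, confidence threshold, and the response.content access are assumptions to verify against the current SDK.

```swift
import Vision
import FoundationModels   // Foundation Models framework; API shape assumed

// Classify the photo, then ask the on-device model a question about the labels.
func askAboutPhoto(_ image: CGImage, question: String) async throws -> String {
    // 1. Pixels -> coarse metadata (classification labels).
    let classify = VNClassifyImageRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([classify])
    let labels = (classify.results ?? [])
        .filter { $0.confidence > 0.3 }   // illustrative threshold
        .prefix(5)
        .map(\.identifier)

    // 2. Metadata -> tokens: the model never sees the pixels, only the text.
    let session = LanguageModelSession()
    let prompt = """
    A photo contains: \(labels.joined(separator: ", ")).
    Question: \(question)
    """
    return try await session.respond(to: prompt).content
}
```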
Day 3
Local Speech-to-Text (STT) & Audio Analysis
Topic: Transcription and Sound Classification.
Tasks:
- Use SFSpeechRecognizer for real-time, on-device transcription (see the sketch below).
- Implement SoundAnalysis framework to detect specific environmental triggers (e.g., a baby crying or a siren).
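A sketch of the streaming, on-device transcription side, assuming microphone and speech-recognition permissions have already been granted and the audio session is configured elsewhere; the class name is illustrative.

```swift
import Speech
import AVFoundation

// Streams microphone audio into an on-device speech recognizer.
final class LiveTranscriber {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func start(onUpdate: @escaping (String) -> Void) throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        request.requiresOnDeviceRecognition = true   // keep audio on the device
        self.request = request

        // Feed microphone buffers into the recognition request.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        task = recognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result {
                onUpdate(result.bestTranscription.formattedString)
            }
            if error != nil || (result?.isFinal ?? false) {
                self?.stop()
            }
        }
    }

    func stop() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.cancel()
    }
}
```

The SoundAnalysis half follows the same streaming pattern: an SNAudioStreamAnalyzer receives the same tapped buffers, and an SNClassifySoundRequest reports detections (e.g., crying, sirens) to a results observer.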
Day 4
On-Device Image Generation (Generative Playground)
Topic: Controlling Image Diffusion.
Tasks:
- Use the Image Playgrounds API to generate assets based on app context.
- Implement "Image-to-Image" workflows where a user sketch is refined by the on-device model (see the sketch below).
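A SwiftUI sketch covering both tasks, assuming the imagePlaygroundSheet presentation modifier from the ImagePlayground framework; the exact parameter names (concept:, sourceImage:) should be checked against the current SDK, and the view's properties are placeholders.

```swift
import SwiftUI
import ImagePlayground

// Presents the system Image Playground, seeded with app context and the
// user's sketch so the on-device model refines it (image-to-image).
struct SketchRefinerView: View {
    @State private var showPlayground = false
    @State private var generatedImageURL: URL?

    let userSketch: Image     // e.g. rendered from a PencilKit canvas
    let appContext: String    // e.g. "a cozy reading-app mascot"

    var body: some View {
        Button("Refine my sketch") { showPlayground = true }
            .imagePlaygroundSheet(
                isPresented: $showPlayground,
                concept: appContext,       // app context steers the generation
                sourceImage: userSketch    // start from the user's sketch
            ) { url in
                generatedImageURL = url    // file URL of the generated asset
            }
    }
}
```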
Day 5
Latency & Thermal Management
Topic: Production-grade AI.
Tasks:
- Use MetricKit to monitor the impact of Vision tasks on battery life.
- Implement "Frame Skipping" and "Low Power Mode" logic for AI processing (see the sketch below).
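A sketch combining both ideas: a MetricKit subscriber to observe CPU/GPU cost after the fact, plus a per-frame gate driven by thermal state and Low Power Mode. The class name and the stride thresholds are illustrative.

```swift
import Foundation
import MetricKit

// Observes daily MetricKit payloads and decides, per camera frame,
// whether the AI pipeline should run at all.
final class AIWorkloadGovernor: NSObject, MXMetricManagerSubscriber {
    private var frameCounter = 0

    override init() {
        super.init()
        MXMetricManager.shared.add(self)   // payloads arrive at most once per day
    }

    deinit { MXMetricManager.shared.remove(self) }

    // Log CPU/GPU time attributed to the app; in production, persist or upload it.
    func didReceive(_ payloads: [MXMetricPayload]) {
        for payload in payloads {
            if let cpu = payload.cpuMetrics { print("Cumulative CPU time:", cpu.cumulativeCPUTime) }
            if let gpu = payload.gpuMetrics { print("Cumulative GPU time:", gpu.cumulativeGPUTime) }
        }
    }

    /// Frame skipping: process only every Nth frame, skipping more aggressively under pressure.
    func shouldProcessFrame() -> Bool {
        frameCounter += 1
        var stride: Int
        switch ProcessInfo.processInfo.thermalState {
        case .nominal:  stride = 2
        case .fair:     stride = 4
        case .serious:  stride = 8
        case .critical: return false          // stop AI work entirely when critical
        @unknown default: stride = 8
        }
        if ProcessInfo.processInfo.isLowPowerModeEnabled { stride *= 2 }
        return frameCounter % stride == 0
    }
}
```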
🧪 Friday Lab: "The Seeing Assistant"
Project: Build an app that uses the camera to "describe" the world to a user.
- Input: Live camera feed.
- Logic: Use Vision to detect 3 objects, then pipe those labels into a Foundation Model to generate a poetic one-sentence description (see the end-to-end sketch below).
- Output: Use AVSpeechSynthesizer to speak the description aloud.
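A condensed sketch of the whole pipeline for a single camera frame. The Vision and AVSpeechSynthesizer calls are standard; the LanguageModelSession usage assumes the Foundation Models API shape and should be checked against the SDK, and the class name is illustrative.

```swift
import Vision
import AVFoundation
import FoundationModels   // API shape assumed

// Camera frame -> three Vision labels -> one poetic sentence -> spoken aloud.
final class SeeingAssistant {
    private let synthesizer = AVSpeechSynthesizer()   // keep a strong reference while speaking
    private let session = LanguageModelSession()

    func describe(frame pixelBuffer: CVPixelBuffer) async throws {
        // 1. Detect the three most confident objects in the frame.
        let classify = VNClassifyImageRequest()
        try VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([classify])
        let labels = (classify.results ?? [])
            .sorted { $0.confidence > $1.confidence }
            .prefix(3)
            .map(\.identifier)

        // 2. Ask the on-device model for a single poetic sentence.
        let prompt = "In one poetic sentence, describe a scene containing: \(labels.joined(separator: ", "))."
        let description = try await session.respond(to: prompt).content

        // 3. Speak the description aloud.
        synthesizer.speak(AVSpeechUtterance(string: description))
    }
}
```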