🎯 Objective

Master the "eyes and ears" of the device. You will learn to combine the Vision and Speech frameworks with Foundation Models to create multimodal experiences.

Daily Breakdown

Day 1

Advanced Vision & Object Tracking

Topic: Real-time visual understanding.

Tasks:
  • Implement VNRecognizeTextRequest and VNDetectBarcodesRequest using the latest Swift Concurrency APIs (see the sketch below).
  • Use Object Tracking to maintain state across video frames.
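
A minimal sketch for the first task: the classic request types bridged into async/await with a checked continuation (newer SDKs also offer natively async, struct-based Vision requests). The `analyzeFrame` name and the single-`CGImage` input are illustrative; a real pipeline would feed frames from a capture session.

```swift
import Foundation
import Vision
import CoreGraphics

/// Run text and barcode recognition on a single frame, off the calling actor.
func analyzeFrame(_ image: CGImage) async throws -> (text: [String], barcodes: [String]) {
    try await withCheckedThrowingContinuation { continuation in
        DispatchQueue.global(qos: .userInitiated).async {
            let textRequest = VNRecognizeTextRequest()
            textRequest.recognitionLevel = .accurate

            let barcodeRequest = VNDetectBarcodesRequest()

            let handler = VNImageRequestHandler(cgImage: image, options: [:])
            do {
                try handler.perform([textRequest, barcodeRequest])
                // Collapse observations into plain strings for downstream use.
                let text = (textRequest.results ?? [])
                    .compactMap { $0.topCandidates(1).first?.string }
                let barcodes = (barcodeRequest.results ?? [])
                    .compactMap { $0.payloadStringValue }
                continuation.resume(returning: (text: text, barcodes: barcodes))
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}
```
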
Day 2

Image Analysis for LLMs

Topic: Transforming Pixels to Tokens.

Tasks:
  • Extract "Visual Embeddings" from an image (e.g., Vision feature prints via VNGenerateImageFeaturePrintRequest).
  • Pass image descriptions/metadata into the LanguageModelSession to "ask questions" about a photo (see the sketch below).
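
A minimal sketch of the second task, assuming the iOS 26 Foundation Models API. Coarse Vision classification labels stand in for richer image metadata here; `describePhoto`, the instructions, and the prompt wording are all illustrative.

```swift
import Vision
import FoundationModels
import CoreGraphics

/// Classify a photo with Vision, then ask the on-device LLM about it,
/// using the detected labels as textual context.
func describePhoto(_ image: CGImage, question: String) async throws -> String {
    // 1. Get coarse labels for the image (synchronous for brevity;
    //    see the Day 1 sketch for bridging onto a background queue).
    let request = VNClassifyImageRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
    let labels = (request.results ?? [])
        .filter { $0.confidence > 0.3 }
        .prefix(5)
        .map(\.identifier)
        .joined(separator: ", ")

    // 2. Feed the labels into the on-device language model as context.
    let session = LanguageModelSession(
        instructions: "You answer questions about a photo using the detected labels."
    )
    let prompt = "Detected objects: \(labels). Question: \(question)"
    let response = try await session.respond(to: prompt)
    return response.content
}
```
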
Day 3

Local Speech-to-Text (STT) & Audio Analysis

Topic: Transcription and Sound Classification.

Tasks:
  • Use SFSpeechRecognizer for real-time, on-device transcription (see the sketch below).
  • Use the SoundAnalysis framework to detect specific environmental triggers (e.g., a baby crying or a siren).
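
A minimal sketch of the live transcription task. It assumes microphone and speech-recognition permissions are already granted and the audio session is configured; `LiveTranscriber` is an illustrative name.

```swift
import Speech
import AVFoundation

/// Streams microphone audio into an on-device speech recognizer and
/// reports partial transcriptions as they arrive.
final class LiveTranscriber {
    private let audioEngine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func start(onUpdate: @escaping (String) -> Void) throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.requiresOnDeviceRecognition = true   // keep audio on the device
        request.shouldReportPartialResults = true
        self.request = request

        // Feed microphone buffers into the recognition request.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        task = recognizer?.recognitionTask(with: request) { result, _ in
            if let result { onUpdate(result.bestTranscription.formattedString) }
        }
    }

    func stop() {
        audioEngine.inputNode.removeTap(onBus: 0)
        audioEngine.stop()
        request?.endAudio()
        task?.cancel()
    }
}
```

The same tap can also feed an SNAudioStreamAnalyzer for the sound-classification task, so one audio engine serves both.
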
Day 4

On-Device Image Generation (Generative Playground)

Topic: Controlling Image Diffusion.

Tasks:
  • Use the Image Playground API to generate assets based on app context (see the sketch below).
  • Implement "Image-to-Image" workflows where a user sketch is refined by the on-device model.
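
A minimal sketch of the first task, assuming iOS 18.1+ and a device that supports Image Playground. The concept string and view/property names are illustrative; supplying the sheet's `sourceImage` parameter instead is one route into the image-to-image workflow.

```swift
import SwiftUI
import ImagePlayground

/// Presents the system Image Playground sheet seeded with app context
/// and displays the generated asset.
struct AssetGeneratorView: View {
    @State private var showPlayground = false
    @State private var generatedImageURL: URL?

    var body: some View {
        VStack {
            // Show the most recently generated asset, if any.
            AsyncImage(url: generatedImageURL) { image in
                image.resizable().scaledToFit()
            } placeholder: {
                Text("No asset yet")
            }
            Button("Generate Asset") { showPlayground = true }
        }
        // App context becomes the seed concept for generation.
        .imagePlaygroundSheet(
            isPresented: $showPlayground,
            concept: "A friendly robot mascot for a recipe app"
        ) { url in
            generatedImageURL = url   // file URL of the generated image
        }
    }
}
```
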
Day 5

Latency & Thermal Management

Topic: Production-grade AI.

Tasks:
  • Use MetricKit to monitor the impact of Vision tasks on battery life.
  • Implement "Frame Skipping" and "Low Power Mode" logic for AI processing (see the sketch below).
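
A minimal sketch combining both tasks: a MetricKit subscriber for after-the-fact CPU/GPU analysis, plus a per-frame gate driven by thermal state and Low Power Mode. The stride values are illustrative, not tuned.

```swift
import Foundation
import MetricKit

/// Tracks the energy cost of AI work and decides when to skip frames.
final class AIWorkloadGovernor: NSObject, MXMetricManagerSubscriber {
    private var frameCounter = 0

    override init() {
        super.init()
        MXMetricManager.shared.add(self)   // system delivers aggregated payloads (roughly daily)
    }

    deinit { MXMetricManager.shared.remove(self) }

    // Inspect CPU/GPU metrics to correlate Vision usage with energy impact.
    func didReceive(_ payloads: [MXMetricPayload]) {
        for payload in payloads {
            if let cpu = payload.cpuMetrics {
                print("Cumulative CPU time:", cpu.cumulativeCPUTime)
            }
            if let gpu = payload.gpuMetrics {
                print("Cumulative GPU time:", gpu.cumulativeGPUTime)
            }
        }
    }

    /// Decide whether the current camera frame should enter the Vision pipeline.
    func shouldProcessCurrentFrame() -> Bool {
        frameCounter += 1

        // Skip more frames as the device heats up; pause entirely when critical.
        let frameStride: Int
        switch ProcessInfo.processInfo.thermalState {
        case .nominal:  frameStride = 1
        case .fair:     frameStride = 2
        case .serious:  frameStride = 4
        case .critical: return false
        @unknown default: frameStride = 4
        }

        // Halve the processing rate again in Low Power Mode.
        let effective = ProcessInfo.processInfo.isLowPowerModeEnabled ? frameStride * 2 : frameStride
        return frameCounter % effective == 0
    }
}
```

Call `shouldProcessCurrentFrame()` from the capture output callback before dispatching a Vision request, so skipped frames never touch the AI pipeline.
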
🧪

Friday Lab: "The Seeing Assistant"

Project: Build an app that uses the camera to "describe" the world to a user (an end-to-end skeleton is sketched after the steps below).

  1. Input: Live camera feed.
  2. Logic: Use Vision to detect 3 objects, then pipe those labels into a Foundation Model to generate a poetic 1-sentence description.
  3. Output: Use AVSpeechSynthesizer to speak the description aloud.
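
An end-to-end skeleton of the lab pipeline, assuming the iOS 26 Foundation Models API. Type and function names are illustrative, and a real app would feed frames from an AVCaptureSession rather than a single CGImage.

```swift
import Vision
import FoundationModels
import AVFoundation
import CoreGraphics

/// Camera frame -> Vision labels -> poetic sentence -> spoken aloud.
final class SeeingAssistant {
    private let session = LanguageModelSession(
        instructions: "Turn a short list of detected objects into one poetic sentence."
    )
    private let synthesizer = AVSpeechSynthesizer()

    func describe(frame: CGImage) async throws {
        // 1. Input/Logic: classify the frame and keep the three strongest labels.
        let request = VNClassifyImageRequest()
        try VNImageRequestHandler(cgImage: frame, options: [:]).perform([request])
        let labels = (request.results ?? [])
            .sorted { $0.confidence > $1.confidence }
            .prefix(3)
            .map(\.identifier)

        // 2. Logic: turn the labels into a one-sentence description.
        let response = try await session.respond(
            to: "Objects in view: \(labels.joined(separator: ", "))."
        )

        // 3. Output: speak the description aloud.
        let utterance = AVSpeechUtterance(string: response.content)
        synthesizer.speak(utterance)
    }
}
```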