#vision-language-model — quidproquo

ai deep-dive May 23, 2026

Midscene.js: Betting on Pure Vision for Cross-Platform UI Automation

An MIT-licensed open-source UI automation framework from ByteDance (~13k GitHub stars). It drives the UI purely by feeding screenshots to vision-language models (Qwen3-VL / Doubao / Gemini-3 / UI-TARS), with no DOM parsing. A single JS API works across Web / Android / iOS / desktop, and as of v1.0 the DOM action mode has been removed entirely. The trade-off: each step is slower and costs more tokens.

#midscene #ui-automation #vision-language-model #mcp #agent #bytedance

ai deep-dive May 9, 2026

DeepSeek-OCR: The 10x Compression Experiment That Turns Long Context into Images

DeepSeek-OCR's paper is titled "Contexts Optical Compression": OCR is just the means. What it actually validates is that rendering text as images and feeding them to a VLM can reach about 10x token compression at roughly 97% accuracy. If that holds up in practice, it could substantially change token costs for long-context LLMs and RAG.
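
To make the 10x figure concrete, here is a back-of-the-envelope sketch of the token arithmetic. The function name, the 100,000-token context, and the assumption that the ratio applies uniformly across a whole context are illustrative only and are not taken from the paper.

```python
import math


def vision_tokens_needed(text_tokens: int, compression_ratio: float = 10.0) -> int:
    """Estimate vision tokens needed to carry `text_tokens` of text.

    Hypothetical helper: assumes one fixed text-to-vision compression
    ratio holds for the whole context, which is a simplification.
    """
    if text_tokens < 0:
        raise ValueError("text_tokens must be non-negative")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return math.ceil(text_tokens / compression_ratio)


# Illustrative numbers only: a 100,000-token context at a 10x ratio.
context_tokens = 100_000
compressed_tokens = vision_tokens_needed(context_tokens)
fraction_saved = 1 - compressed_tokens / context_tokens

print(f"{context_tokens} text tokens -> {compressed_tokens} vision tokens")
print(f"fraction of tokens saved: {fraction_saved:.0%}")
```

The saving is only the input-token side of the trade. The accuracy figure says some of the text is not recovered exactly, so whether the exchange is worthwhile depends on how much loss the downstream task tolerates.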

#ocr #deepseek #vision-language-model #long-context #context-compression #rag