It's an MoE model with 196B total parameters and 11B active parameters, and it handles a 256K-token context window.
> by employing a 3:1 Sliding Window Attention (SWA) ratio—integrating three SWA layers for every one full-attention layer. This hybrid approach ensures consistent performance across massive datasets or long codebases while significantly reducing the computational overhead typical of standard long-context models.
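To make that 3:1 interleaving concrete, here's a minimal sketch of what such a layer schedule and its attention masks could look like (layer count, window size, and sequence length are made-up values, not the model's actual configuration):

```python
# Minimal sketch of a 3:1 SWA / full-attention hybrid stack.
# Layer count, window size, and sequence length are illustrative
# assumptions, not Step 3.5 Flash's real configuration.
import numpy as np

NUM_LAYERS = 32   # assumed total transformer layers
SWA_RATIO = 3     # three sliding-window layers per full-attention layer
WINDOW = 4096     # assumed sliding-window size in tokens

def layer_kind(layer_idx: int) -> str:
    """Every (SWA_RATIO + 1)-th layer gets full attention; the rest use SWA."""
    return "full" if layer_idx % (SWA_RATIO + 1) == SWA_RATIO else "swa"

def attention_mask(seq_len: int, kind: str, window: int = WINDOW) -> np.ndarray:
    """Boolean causal mask: True where a query position may attend to a key position."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if kind == "full":
        return causal
    # Sliding window: also drop keys more than `window` positions behind the query.
    offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    return causal & (offsets < window)

schedule = [layer_kind(i) for i in range(NUM_LAYERS)]
print(schedule[:8])  # ['swa', 'swa', 'swa', 'full', 'swa', 'swa', 'swa', 'full']
print(attention_mask(seq_len=6, kind="swa", window=3).astype(int))
```

The periodic full-attention layers are what let information propagate across the whole context even though most layers only look a few thousand tokens back.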
I wonder how this model performs on the needle-in-a-haystack benchmark.
Another interesting bit is the following:
> Step 3.5 Flash brings elite level intelligence to local environments. It runs securely on high-end consumer hardware (e.g., Mac Studio M4 Max, NVIDIA DGX Spark), ensuring data privacy without sacrificing performance
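As a rough sanity check on that claim, some back-of-the-envelope weight-memory arithmetic (the quantization levels and the ~128 GB unified-memory figure are my assumptions, and this ignores KV cache and runtime overhead):

```python
# Back-of-the-envelope weight-memory estimate for a 196B-parameter model.
# Bytes-per-parameter values are assumed quantization levels; this ignores
# KV cache, activations, and runtime overhead, so real usage is higher.
TOTAL_PARAMS = 196e9

BYTES_PER_PARAM = {
    "bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

for fmt, bpp in BYTES_PER_PARAM.items():
    print(f"{fmt:>5}: ~{TOTAL_PARAMS * bpp / 1e9:.0f} GB of weights")
# bf16: ~392 GB, int8: ~196 GB, 4-bit: ~98 GB -- only the 4-bit figure
# fits in the ~128 GB of unified memory these machines ship with (assumption).
```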
Pricing for step-3.5-flash:
- Input: $0.1/1M tokens
- Output: $0.3/1M tokens
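At those rates, a quick worked example of per-request cost (the token counts are made up):

```python
# Rough per-request cost at the listed step-3.5-flash rates.
# The token counts in the example are made up, not benchmarks.
INPUT_PER_M = 0.10   # USD per 1M input tokens
OUTPUT_PER_M = 0.30  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# e.g. filling most of the 256K context and generating a 4K-token answer:
print(f"${request_cost(250_000, 4_000):.4f}")  # ~$0.0262
```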
Chinese labs are pushing hard to democratize LLMs!