Preface#
Although the D9000 has reached the end of its life cycle by now (it has been a year, after all), and rehashing old topics may not be very meaningful, it is still worth writing up.
Previously written, with errors, just bear with it#
The Dimensity 9000 was released on December 16, 2021, and is manufactured on TSMC N4. The first device to ship with it was the Find X5 Pro Dimensity Edition. This can, I think, be counted as the first floorplan analysis. Although the dieshot was released quite early, getting it costs money (and where would a poor student find money?), so it was only recently that I had time to do the annotated drawing. I had actually drawn it back when the dieshot came out, but I was unclear about some parts at the time, so it was delayed until now. It can still be considered a first release of sorts.
So here is the first release of the dieshot.
As you can see, it's very blurry, but I made some annotations, increased the contrast, and enhanced the resolution.
After all, it’s still black and white, so the contrast can be increased a bit for a more comfortable viewing experience and better identification.
In this image, the GPU cluster is clearly visible in the upper right corner: the Mali-G710 MC10. The design theoretically scales up to 16 cores; note that MC (multi-core) and MP (multi-processor) mean the same thing here, with no difference.
You can also see that the G710 has two sets of ALU clusters and three groups of GPU cores (the orange part is the GPU cluster I/O), arranged 4+4+2, though I don't know why it was laid out this way. I briefly looked at the dieshot of the D81 released around the same time, and it doesn't have this layout, which is rather odd.
The lower left corner shows the GPU cache (yellow), which should be 3 MiB.
It should be noted that, compared with the previous-generation G78 (both are Valhall-architecture GPUs), each shader core here contains two execution engines.
This doubles the shader count per core and theoretically allows scaling up to MC16.
In terms of GPU architecture, the G710 is an iteration of the G78, the G510 of the G57, and the G310 of the G31; all are based on the Valhall architecture that has been used since the G77. Compared with the previous Bifrost generation, the Valhall core has a new superscalar engine (improving IPC and perf/W), a simplified ISA with a new, more compiler-friendly instruction set, dynamic instruction scheduling, and new data structures that suit APIs like Vulkan. For example, Bifrost execution was 4-wide/8-wide: the G72's execution units were 4-wide scalar SIMD with a warp size of 4, and the G76 moved to two 4-wide units with a warp size of 8. Such narrow warps make it hard for the scheduler to keep the lanes filled with enough threads, whereas Valhall widens the warp to 16-wide, raising ALU utilization.
In Valhall, the three execution engines were merged into one large engine, but the ALU inside it still consists of two halves: 2x 16-wide FMA.
In Bifrost, each execution engine had its own datapath control logic, scheduler, and instruction cache, which was relatively wasteful of resources. The G710, compared with its predecessor, puts two execution engines in each shader core, doubling the shader count.
Each engine still consists of two processing units, with slight changes: with the warp width and integer throughput unchanged, the processing units in a G710 core are divided 4x4-wide, and each engine has its own dedicated resources, doubling the FMA throughput per core per cycle. The new TMUs can process 8 bilinear texels per cycle.
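As a rough sanity check of that doubling claim (my own arithmetic, based only on the 2x 16-wide FMA per engine mentioned above):

```latex
\text{FMA per core per cycle} \;=\; \underbrace{2}_{\text{engines}} \times \underbrace{2 \times 16}_{\text{FMA lanes per engine}} \;=\; 64
```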
The G710 replaces Mali's original Job Manager with the so-called CSF (Command Stream Frontend), which is responsible for scheduling and draw calls.
The G610 is essentially a G710 with fewer than 7 cores.
Going by the headline parameters, the G510 and G310 look quite formidable (a 100% gain, indeed), but that is mostly because the G31 had not been updated in ages.
The G510's shader core gains an additional execution engine, and each execution engine can optionally carry 2 clusters of processing units, similar to the G710; however, one of the two engines in the G510 can be configured with only a single processing unit, giving an optional FMA throughput of 48-64 per cycle. In addition, the texture units can be configured for 4 or 8 texels per cycle, and the GPU can be built with 2-6 cores and a configurable L2.
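My reading of where that 48-64 range comes from (an assumption on my part, taking each processing unit as 16-wide like the G710's):

```latex
\text{full config: } 2 \times 2 \times 16 = 64 \ \text{FMA/cycle}, \qquad
\text{reduced config: } (2 + 1) \times 16 = 48 \ \text{FMA/cycle}
```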
The memory bandwidth is 60 GB/s (4 x 16-bit x 3750 MHz x 2 ÷ 8), and the SLC is 6 MiB.
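Spelling out the arithmetic in that parenthetical:

```latex
\frac{4 \times 16\ \text{bit} \times 3750\ \text{MHz} \times 2}{8\ \text{bit/byte}} = 60{,}000\ \text{MB/s} = 60\ \text{GB/s}
```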
Although the G510 looks like a significant improvement, it is an update against a G57 that had barely changed. In other words, the seemingly huge upgrade exists largely because the line had not been touched in ages and was then suddenly handed to the Cambridge team for a revamp. The same goes for the G310: the G31 really had not moved in ages. On paper this update shows a 100% performance increase, but compare it against contemporary, or even earlier, competitors and the improvement looks rather lonely. That is ARM's cleverness.
Let's talk about the CPU part. The above image shows the CPU cluster of the D9K, located at the lower part of the SoC.
This processor uses the ARMv9 instruction set architecture; ARMv9 is an update built on v8.5 and includes all of the v8.5 feature subsets.
It adds the SVE2 extension. SVE2 is essentially an extension of SVE, but where SVE was aimed purely at HPC, SVE2 is compatible with the NEON instruction set; in a sense, SVE2 is positioned as NEON's successor and allows more flexible data access. SVE itself was already absorbed into v8.2, but in practice it was only used in server IP such as the V1; GEMM and BF16 arrived in v8.6, so their support only comes with v9.1. NEON is fixed at 128 bits wide, while SVE/SVE2 starts at 128 bits and can scale up to 2048 bits. NEON's fixed-128-bit approach is not unworkable, but it is cumbersome, especially for scalability: with SVE, software looks at the size of the SVE registers to decide how the data is laid out.
By contrast, SVE's approach is scalable and easy to deploy. SVE2's performance matches that of HPC-oriented SVE, and for DSP and multimedia workloads it delivers 1.2x at 128 bits, 2x at 256 bits, and 3.5x at 512 bits.
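To make the vector-length-agnostic point concrete, here is a minimal sketch (my own illustrative code, not from the original article) using ARM's SVE ACLE intrinsics; the function and its name are made up for illustration, but the intrinsics themselves are standard. The same loop runs unchanged whether the hardware vector is 128 bits or wider, because svcntw() and the predicate adapt to the register size at run time:

```c
// Build for an SVE-capable target, e.g. -march=armv8-a+sve (or armv9-a).
#include <arm_sve.h>
#include <stdint.h>

// dst[i] = b[i] + a[i] * k, for i in [0, n) -- vector-length agnostic.
void scale_add(float *dst, const float *a, const float *b, float k, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {           // svcntw() = 32-bit lanes per vector
        svbool_t    pg = svwhilelt_b32_s64(i, n);          // predicate masks off the tail
        svfloat32_t va = svld1_f32(pg, a + i);             // predicated loads
        svfloat32_t vb = svld1_f32(pg, b + i);
        svfloat32_t vr = svmla_n_f32_x(pg, vb, va, k);     // vb + va * k
        svst1_f32(pg, dst + i, vr);                        // predicated store
    }
}
```

With fixed-width NEON, the same loop would hard-code 4 floats per iteration and need a separate scalar tail loop; that is exactly the flexibility difference described above.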
The CPU is a 1+3+4 tri-cluster. From early aSMP, to big.LITTLE, and now to DynamIQ: DynamIQ allows up to 32 clusters with up to 8 cores each, each running on its own voltage curve, although that scale is really meant for servers. MTK's engineers use an HP + BP + HE split, i.e. the tri-cluster, where the HP core runs at 3.4 GHz to cover high-performance needs. It should be noted that current domestic software still relies primarily on single-core performance.
The BP cores balance power and performance, while the HE cores are positioned as ultra-low-power, ultra-low-voltage standby cores. The ARMv9 generation introduces the DSU-110, which allows workloads to be shifted across the clusters for the best power efficiency.
Within the DSU-110, the Cortex-X2 sits in the upper left corner, and the core architecture has changed: compared with the previous X1, the pipeline is one stage shorter, and in the D9K it is set to 3.4 GHz. Roughly estimating, that might perform like an X1 at 3.75 GHz (though 3.4 GHz was the engineering-sample frequency; retail units run at 3.05 GHz). Meanwhile the L3 is 8 MiB (the theoretical maximum is 16 MiB, but ARM's slides were aiming at the 1135G7); the neighboring SDM part only provides up to 4 MiB.
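Back-of-the-envelope for that estimate (my own arithmetic, assuming roughly a 10% per-clock gain in this particular comparison):

```latex
3.4\ \text{GHz} \times 1.10 = 3.74\ \text{GHz} \approx \text{the } 3.75\ \text{GHz X1-equivalent figure above}
```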
Although at the same process node and low frequency the X2's efficiency is slightly worse than the X1's, once performance is pushed up the X2's efficiency comes out ahead. In other words, ARM's recommendation is to run the big core at high frequency to realize that efficiency, but MTK's mass-production parts never got the high frequency and only shipped at 3.05 GHz. So it is expected that there will be a D9000 Plus later, similar to special bins like the 8250AB.
Although it sounds like the X2 is simply more efficient, at the same frequency the X2 will definitely burn more power than the X1. In ARM's slides, efficiency is not given as the usual perf/W but as a performance/power curve, i.e. efficiency is compared at equal performance. At the same performance level the X2 can run at a lower frequency and take advantage of lower voltage; based on its 16% IPC lead (measured with different cache configurations), that works out to roughly a 30% efficiency gain at low frequencies (give or take a few points). At the same frequency, of course, the X2's power is higher; that part is tucked away in the corner of the slide.
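Restating the logic of that curve in one line (my own restatement, not ARM's):

```latex
f_{X2} \approx \frac{f_{X1}}{1.16} \ \text{at equal performance}, \qquad P_{dyn} \propto C\,V^{2} f
```

The lower clock then allows a lower voltage, so power drops faster than frequency; pin both cores to the same frequency and that advantage disappears.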
For the A710, meanwhile, a well-balanced PPAC is the design language: the A710 is a minor revision of the A78. Its front end is modified less than the X2's but still gains meaningfully, with the branch-prediction window/cache doubled and a larger TLB (32 to 48 entries). The uop cache has been cut down, dispatch has narrowed from 6-wide to 5-wide, and the pipeline is one cycle shorter.
In the slides, an A710 with 8 MiB of L3, compared with an A78 with 4 MiB, gains 10% performance or cuts power by 30% at the same performance. The slides also make the point that the A710 is suited to high frequencies, whereas the A78 was not.
The A510 is ARM's most significant recent change. First, decode has gone from 2-wide to 3-wide, and branch prediction has been added. The A510 can be built as a dual-core complex or a single-core complex; the L2, L2 TLB, and the VPU datapath can be configured as 2x64-bit or 2x128-bit (presumably the 128-bit option is meant for single-core complexes). AArch32 has been removed: adding AArch32 support to the little core would cost power, which is why ARM cautiously kept AArch32 only in the A710.
In terms of efficiency, at low frequencies the A510 is actually worse than the A55, and it only beats the A55 at high frequencies. But who runs little cores at high frequency? The A510's most distinctive feature is the complex arrangement: a dual-core complex shares the L2 cache, L2 TLB, and VPU, while a single-core A510 complex has all of those to itself.
On the back end, the integer side has:
3 integer ALUs, 1 complex MAC/DIV unit, 1 branch port, plus the LSU and a dedicated store-only unit. The VPU side has a PALU and a shared VPU (crypto unit, VALU, VMAC, VMC); in the 128-bit VPU configuration that is 1 crypto pipe, 1 VALU, and 1 VMAC.
For memory access there are 2 load / 1 store pipelines, 2x128 bits wide. The front end is 3-wide in-order decode with branch prediction and a 128-bit fetch pipe that brings in 4 instructions per clock. The VPU datapath can be 2x64-bit or 2x128-bit, and the L1d is separate from the MMU.
In the upper left corner is the modem part.
There isn't much to say about the modem part, but compared to others' modem parts, it's quite impressive.
These are all, ah, cropped from the original images. The CPU-to-modem scale is consistent within each crop, but comparing CPU against CPU directly would be wrong, because what is shown is the CPU/modem area ratio within each respective die. It is still quite striking. Bear in mind, though, that SDM's modem has to support mmWave while the MTK M80 apparently does not, which is also an area consideration. Best to treat this as a quick look only; a detailed interpretation would require funding. Ah, and I'm not a professional, so bear with me.
Speaking of the X65, it has a large modem cache, reportedly shared with the ISP, though that is not clear.
The lower left corner is the video (more accurately, streaming-media) decode block. The decoder also goes big.LITTLE, with two large codec units and two small ones; I don't know what the significance of that is, but it looks very, ah, professional. Video decoding is usually a job for the CPU/GPU, but this time an APU was also brought in for, ah, auxiliary computation.
This is the APU, in the lower right area. It is really an NPU; MTK just calls it an APU. You can see that its area is quite large. According to the slides, the APU takes on a lot of additional computation for blocks such as the GPU, ISP, CPU, and the decoder. The yellow part is the 3.5-4 MiB APU cache. This generation uses a 4-big-plus-2-little NPU design: 4 Performance cores and 2 Flexible cores. There seem to be some RISC-V elements inside, though ARM is also a possibility.
Above the streaming-media decode block are some units I can't fully identify, such as the Imagiq 790, which is the ISP. This generation supports simultaneous processing of three cameras at 32 MP each, 18-bit HDR video, a throughput of 9 billion pixels per second, and sensors up to 320 MP. Part of the processing is offloaded to the APU. The red area is the ISP core, which you can see directly.
Below, MiraVision 790 is the display processing unit, but specific analysis is not possible. The upper right corner has some unknown units; my skills are lacking.
The lower right corner has the usual USB buffer, with two USB I/O blocks here and one more near the GPU, three in total. The memory interface is LPDDR5X, 4x16-bit, supporting up to 7500 Mbps, and that's about it.
Finally, let's talk about TSMC N4. Ah, a single N4 wafer costs around $26,000 to $30,000; after all, N5 already went from $26,000 to $28,000. N4 is really just a small iteration of N5. Ti's view is that it counts as a node, but it clearly is not; Ti treats any change in CPP/MMP as a new node. TSMC's 5nm is a new node with five sub-nodes: N5, N5P, N4, N4P, and N4X.
N5: TSMC offers three libraries for N5: a 6-track UHD library, a 7.5-track HD library, and a 9-track HP library. The 6T UHD (density) library has a cell height of 180 nm (6 x 30 nm) at 137.6 MTr/mm²; the 7.5T HD (performance) library is 225 nm (7.5 x 30 nm) at 92.3 MTr/mm²; the 9T HP library is 270 nm (9 x 30 nm). CPP is 48 nm and MMP is 30 nm. The theoretical maximum density of N5 LPE, i.e. the 6T UHD library, can reach 137.6 MTr/mm² (a projection). The actual maximum density is...
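The cell heights above are simply the track count multiplied by the minimum metal pitch (MMP):

```latex
h_{6T} = 6 \times 30\ \text{nm} = 180\ \text{nm}, \qquad
h_{7.5T} = 7.5 \times 30\ \text{nm} = 225\ \text{nm}, \qquad
h_{9T} = 9 \times 30\ \text{nm} = 270\ \text{nm}
```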
The biggest problem with N5 is thermal density: with 1.8x the density but power only dropping to 0.7x, heat per unit area rises by roughly 1.8 x 0.7 ≈ 1.26x, which is very unfavorable for high-performance scenarios.
N5 introduces 7 new VT options (SVTLL, SVT, LVTLL, LVT, uLVTLL, uLVT, eLVT); eLVT adds an extra ~10% power. Thanks to via pillars and back-end metal optimizations, the overall improvement is about 35% (N5 HPC's uLVT compared with N7's uLVT: frequency up 25% at unchanged power).
N5P mainly reduces power consumption, using the same design rules (DRC) and being fully compatible with N5. It optimizes power reduction in FEOL and MOL.
N4: Compared with N5, the MTr density changes very little; the roughly 7% MTr gain comes purely from shortening the MMP... For example, the X2 in the D9000 uses a 210 nm-height library; in fact, all cores of the D9K's CPU subsystem are on N5's 210 nm uLVT cells.
The N4 family consists of N4, N4P, and N4X, with the same three libraries: 6-track UHD, 7.5-track HD, and 9T HP. The N4 HP library is shared with N5's HP library. N4/N6 are better suited to running at low frequencies. The 6T UHD library has a cell height of 168 nm (6 x 28 nm) at 146 MTr/mm²; the 7.5T is 210 nm (7.5 x 28 nm) at 97.8 MTr/mm². CPP stays at 48 nm, while MMP shrinks by 2 nm, from 30 nm down to 28 nm, relying on tighter metal pitch and gate spacing, and the number of metal layers...
The theoretical maximum density of N4 LPE, i.e. the 6T UHD library, can reach 146 MTr/mm².
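A quick cross-check of those two density figures (my own arithmetic): with CPP unchanged, the cell area shrinks only with the MMP, and the quoted densities track that ratio fairly closely.

```latex
\frac{30\ \text{nm}}{28\ \text{nm}} \approx 1.071, \qquad
\frac{146\ \text{MTr/mm}^2}{137.6\ \text{MTr/mm}^2} \approx 1.061
```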
N4P is mainly a transitional product aimed at reducing mask count by using more EUV layers (6 additional layers).
N4X is the ultimate 5nm derivative, aimed at high performance under high voltage, and it remains compatible with 5nm design kits... That said, one cannot judge purely by MTr density, since different companies define MTr differently.
Moreover, raising the voltage drives power consumption up steeply. At a high fixed frequency, dynamic power accounts for the majority (and because a core behaves differently on different processes, the designed frequency will always differ from the frequency actually realized at the target; I'll write up the specifics in a few days).
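For reference, the usual first-order relation behind that point (a textbook formula, not something from the article):

```latex
P_{dyn} = \alpha\, C\, V^{2} f
\qquad\Rightarrow\qquad
\text{at fixed } f:\ (1.1)^{2} \approx 1.21
```

That is, a 10% bump in voltage alone already costs about 21% more dynamic power, before counting the extra frequency it is meant to buy and the leakage that also rises with voltage.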