Aros/Developer/Docs/HIDD/ATI

Introduction

 * 2006 ati from radeon 7000 to x600
 * 2011 ati from r400 and r500 added
 * 20?? ati HD r600 and up
 * 20?? ati south islands r800 and up - could look at vulkan api for this

Initial 2d ati driver written mid 2000s

Some work was done to implement 2d acceleration for r4xx and r5xx card, nearly the same functions as older card and has not been tested. There were some problem switching resolution with the Ati x1300

Hit or miss luck with igp's with this 2011 update so far. Some work better than others.

Gallium 3D

 * Before HD chipsets no gallium
 * During and after HD gallium support
 * SI vulkan api with OpenGL wrapper

https://docs.mesa3d.org/gallium/index.html

Trying to port DRM, to get 3d acceleration to work... Must get 3d acceleration to work to get 2d acceleration to work with r600-evergreen card. You output 2d with 3d. 2d acceleration was removed from r600 evergreen and beyond cards.

For the moment there only software rendering for r600 r700 card, some of the 2d functions do not work properly either.

DRM
Next step, port most of the code of drm/radeon/ so I can compile it with the radeon-driver. http://www.kernel.org/ /drivers/gpu/drm/radeon/     /drivers/gpu/drm/

Now trying to set-up the GART table to get the ringbuffer working. The function r600_do_init_cp(... in file “r600_cp.c”. /drivers/gpu/drm/radeon/r600_cp.c

The codes in workbench/hidd/hidd.nouveau contain BOTH nouveau and generic drm code. You should be able to reuse a large portion of generic stuff in radeon driver. As for GART, what basically need to do it provide to the driver memory allocated at 4k boundary simulating pages. The nouveau driver already has the necessary code to map this into cards address space. Assuming the same is true for radeon driver.

Dri1 - locking
Hardware Locking for the Direct Rendering Infrastructure Rickard E. Faith, Jens Owen, Kevin E. Martin Precision Insight, Inc. $Date: 1999/05/11 22:45:24 $, $Revision: 1.6 $

This paper examines the locking requirements of typical DMA-based hardware currently available for the PC platform with respect to the Direct Rendering Infrastructure (DRI). A locking algorithm is described. This algorithm enhances a typical kernel-call-based blocking lock with the single optimization that an entity that held the lock most recently can require the lock without using the kernel call. The algorithm relies on atomic instructions and is constant in time and space. Familiarity with the DRI design documents is assumed [OM98, MFOA99].

1. Preamble

1.1 Copyright

Copyright © 1999 by Precision Insight, Inc., Cedar Park, Texas. All Rights Reserved.

Permission is granted to make and distribute verbatim copies of this document provided the copyright notice and this permission notice are preserved on all copies.

1.2 Trademarks

OpenGL is a registered trademark of Silicon Graphics, Inc. Unix is a registered trademark of The Open Group. The `X' device and X Window System are trademarks of The Open Group. XFree86 is a trademark of The XFree86 Project. Linux is a registered trademark of Linus Torvalds. Intel is a registered trademark of Intel Corporation. All other trademarks mentioned are the property of their respective owners.

2. Locking Requirements

Although some cards may support some subset of simultaneous accesses, typical Intel-based PC graphics hardware (costing less than US$2000 in 1998) does not allow simultaneous, interleaved access to the frame buffer memory, the MMIO-based command FIFO, and the DMA-based command FIFO. For example, typical cards will allow frame buffer and command FIFO activity while processing updates to the MMIO registers that control the hardware cursor position. Some cards will atomically add commands sent via DMA to the command FIFO, permitting some kinds of simultaneous MMIO and DMA. However, other cards will intermingle MMIO and DMA commands in the command FIFO on a word-by-word basis, thereby completely corrupting the command FIFO. Other cards are not robust enough to permit frame buffer access while other operations are occurring.

Because of these limitations, typical hardware will require a single (per-device) lock that cooperatively restricts access to the hardware by the X server, the kernel, and 3D direct-rendering clients. This section briefly outlines how the hardware lock would typically be used by each of the three types of DRI entity. Some hardware may not require locking in all of the instances listed in this section.

2.1 Kernel

DMA Buffer Dispatch

When the hardware is ready for another DMA dispatch from the kernel's DMA buffer queue, the kernel will obtain the hardware lock. If the notification that another DMA dispatch is possible came via a hardware interruption, then the kernel may have to reset the interrupt on the hardware. If the current GLXContext is different from the GLXContext required by the next DMA buffer to be dispatched, then the kernel will have to update the hardware graphics context, possibly via a callback to the X server (see below).

Graphics Context Save and Restore

If the X server is performing hardware graphics context switches on behalf of the kernel, then the kernel will hold the hardware lock on behalf of the X server while the context is switched. The X server may perform the context switch by doing MMIO or by issuing a request to dispatch DMA immediately (without obtaining another lock), and will then issue an ioctl to tell the kernel that the context has been switched and it is now safe to proceed with DMA dispatches.

Vertical Retrace

When an operation (e.g., a DMA buffer dispatch) must be synchronized with the next vertical retrace, the kernel will obtain the lock and poll or wait for an interruption before performing the operation.

2.2 3D Client Library: Software Fallbacks

The 3D client library will obtain the hardware lock when performing software fallbacks that require direct hardware access (e.g., access to the frame buffer). During this time, the client may issue high-priority blocking DMA requests that bypass the normal kernel DMA buffer queues (and that do not require the kernel to obtain the lock).

The 3D client library assumes that the kernel DMA buffer queue for the current GLXContext has been flushed, that other kernel DMA buffer queues are halted (implied by taking the lock), that the hardware is quiescent, and that the current full graphics context is available (including textures and display lists). The initial DRI implement will assume that this context is stored on the hardware, not in the SAREA.

2.3 X Server

Hardware Cursor

Depending on the requirements of the graphics hardware, the lock may be needed to move the hardware cursor. (At the time of this writing, we have not identified any hardware that requires locking for hardware cursor movement, but we note the possibility so that future driver implementors can check for it.) 2D Rendering Without Changing Window Offsets

2D rendering by the X server will require the lock, and may require quiescent hardware (depending on the operation being performed and the hardware being used).

Region Modifications

Region modifications (e.g., moving windows) require the lock and quiescent hardware. Further, all of the kernel DMA buffer queues must be flushed (it may be possible to compute an optimized set of DMA queues that must be flushed, based on the window to be moved and the GLXContexts touched by that window). After the region modification, the window-changed ID will be updated (this update may not have to be locked).

DGA

When the application program makes use of the XFree86-DGA protocol extension, the X server will hold the lock on behalf of the DGA client. XFree86-DGA must be supported for clients that access the hardware directly but that do not have knowledge of the DRI.

OpenGL/Fullscreen

A new protocol extension that allows fullscreen OpenGL access within the framework of the DRI will be provided. The server will issue an ioctl to halt all kernel DMA buffer queues for existing GLXContexts. From that point onward, all newly created GLXContexts will have operational (i.e., unhalted) kernel DMA buffer queues. This implies that the client must issue the OpenGL/Fullscreen request before creating any GLXContexts.

3. Optimization Opportunities

3.1 Analysis

The X server, 3D clients, and the kernel will share the lock cooperatively. Since client processes can die while holding the lock, some process must detect client death and reclaim the lock. The X server can detect client death when the connection used for the X protocol fails. However, in the case of a UNIX Domain Socket (i.e., a named pipe), timely notification requires active I/O on the connection. The kernel-level device driver, however, knows about client death as soon as the client closes the connection to the kernel device driver. Timely notification of client death, together with other kernel-level issues (e.g., handling of SIGSTOP [FM99]), make the use of the kernel as the ultimate lock arbiter compelling.

However, obtaining the hardware lock via an ioctl is a heavyweight operation. If an entity is performing several short operations, the lock will have to be taken and released for each operation in order to provide user-perceived interactivity. The next section explores methods for avoiding an ioctl for each locking operation by introducing a two-tiered lock. This section outlines the requirements for the two-tiered lock design by exploring the transitions between common entity states and identifying the transitions that are performance-critical according to the following general criteria:

Interaction is optimized for a single GLXContext. User-perceived responsiveness of the X server is maintained.

For purposes of analysis, the system has the following states: DA: The kernel dispatching DMA buffers for GLXContext A.   DB: The kernel dispatching DMA buffers for GLXContext B.    SA: A 3D client doing software fallbacks for GLXContext A. Software fallbacks are assumed to be slow operations, such that the overhead of obtaining the hardware lock, regardless of implementation, is negligible. SB: A 3D client doing a software fallbacks for GLXContext B.   2D: The X server performing 2D rendering. HC: The X server moving the hardware cursor. This operation is lightweight, but a human is involved, so responsiveness is more important than overall speed. RM: The X server performing region modifications. The importance of optimization will be ranked on the following scale:

Optimization is critical to meet performance goals. Optimization will help meet performance goals, but is not critical. Optimization may help meet performance goals, but may not have a large performance impact.

The rankings are shown below for transitions between any two states: DA      DB       SA       SB       2D       HC       RM

DA       1        2        3        3        2        2        3

DB       2        1        3        3        2        2        3

SA       3        3        3        3        2        2        3

SB       3        3        3        3        2        2        3

2D       2        2        2        2        1        2        3

HC       2        2        2        2        2        2        3

RM       3        3        3        3        3        3        3

3.2 Discussion

Since our goals are to optimize single GLXContext throughput as well as X server responsiveness, it is most important that locking for DMA dispatch and for 2D rendering be optimized. If the kernel holds the lock for long periods of time, responsiveness is compromised. If the X server holds the lock for long periods of time, rendering throughput is compromised. Hence, the kernel and the X server must not hold the lock for more than minimum necessary amount of time. However, because both DMA dispatch and 2D rendering operations are relatively short, the lock will be taken and released with great frequency: heavyweight ioctl-based locking may account for a significant percentage of the time used to perform the operations. In the case of advanced MMIO-based graphics hardware (i.e., vs. available DMA-based hardware), the overhead of the ioctl-based lock will be prohibitive.

Because of these requirements and observations, the design of a two-tiered lock was undertaken. We adopt operating system terminology and call the two methods of obtaining the lock ``heavyweight and ``lightweight. The key observation from the above discussion is that locking must be optimized for a single GLXContext taking and releasing the lock many consecutive times (for this discussion, the X server will be considered as a special GLXContext). In particular, the lightweight lock will not be designed to be shared between GLXContexts, because such design could possibly complicate the algorithm at the expense of the more critical case (e.g., by requiring additional work in the lightweight routines to set flags or check for additional state that is important, but seldom or never used in the single GLXContext case).

4. Lock Design

4.1 Introduction

The following assumptions will simplify lock design:

Locking will be performed using the GLXContext ID to determine which entity last had the lock (or a hash of this ID that makes at least two bits (e.g., the high bits) and one ID (e.g., zero) available as flags.   The heavyweight lock will use the ioctl interface to the kernel (an ioctl has the approximate overhead of a system call).    The heavyweight lock will be used to resolve all cases of contention.    The lightweight lock will be stored in a user-space shared memory segment that is available to all locking entities.    A pointer-sized compare-and-set (CAS) atomic primitive is available. This is true for most modern processors, including Intel processors starting with the Intel486 (a double-wide CAS instruction is available starting with the Pentium). Similar lightweight algorithms can be designed using other atomic primitives. (For older hardware, such as the Intel386, which will have extremely poor 3D graphics throughput, the lightweight lock may simply fallback to the heavyweight lock.)

4.2 Previous Work

[MCS91] discusses synchronization algorithms that make use of busy waiting and atomic operations, as abstracted by the fetch-and-phi primitive, but do not discuss a two-tiered locking mechanism.

[BKMS98] describes a ``thin lock that can be ``inflated to a ``fat lock''. However, a fat lock cannot ``deflate'' to a thin lock, so these methods do not apply to our work.

[L87] describes a method that, in the absence of contention, uses 7 memory accesses to provide mutual exclusion. If Lamport's algorithm is modified to provide for a kernel-level fallback in the case of contention, an algorithm with fewer reads and writes may be possible. However, since all modern architectures provide atomic fetch-and-phi primitives, there is limited value in exploring an algorithm that depends only on atomic reads and writes.

[LA93] explore "two-phase" algorithms that combine busy-waiting with blocking. The two-tiered lock described in this paper is a Lim and Agarwal two-phase lock with the polling value set to 1. At this time, however, we impose the restriction that only the last process to have the lock can obtain the lock by checking a synchronization variable -- all other lock acquisitions must block. In the future, after we obtain more experience with DRI contention, we may extend our locking algorithm to be a completely general two-phase lock.

Lim an Agarwal note that the idea of combining the advantages of polling and signalling was first suggested by Ousterhout in [O82]. Unfortunately, we were not able to obtain a copy of this paper.

4.3 Locking Algorithm

Algorithms and structures are presented in a C-like programming language.

Lock Structure

The lock structure is a simple cache-line aligned integer. To avoid processor bus contention on a multiprocessor system, there should not be any other data stored in the same cache line. typedef struct lock_s { unsigned int context; unsigned int padding[3]; } lock_t;

Flags

Bits in the lock word will be used to claim the lock, and to notify the client that the kernel must be involved in contention resolution.
 * 1) define LOCK_HELD 0x80000000
 * 2) define LOCK_CONT 0x40000000

Compare-and-Swap (CAS)

This is a standard 32-bit compare-and-swap (CAS) routine. It can be implemented atomically with a single Intel486 instruction. On a RISC processor, CAS is usually implemented with the instruction pair load-linked/store-conditional. int CAS(int *address,           int compare_value,            int update_value) {       if (*address == compare-value) { *address = update-value; return SUCCEED; } else return FAIL; }

Get Lightweight Lock void get_lightweight_lock(lock_t *L, int my_context) {       if (CAS(&L->context, my_context, LOCK_HELD|my_context) == FAIL) { // Contention, so we use the kernel to arbitrate get_heavyweight_lock(L, my_context); }   }

Release Lightweight Lock void release_lightweight_lock(lock_t *L, int my_context) {       if (CAS(&L->context, LOCK_HELD|my_context, my_context) == FAIL) { // Kernel is requesting the lock release_heavyweight_lock(L, my_context); }   }

Get Heavyweight Lock void get_heavyweight_lock(lock_t *L, int my_context) {       for  { do { // If lock held, mark it as contended // Otherwise, try to get it              current_context = L->context; if (current_context & LOCK_HELD) new = LOCK_CONT | current_context; else new = LOCK_HELD | my_context; } while (CAS(&L->context, current_context, new)); if (new == LOCK_HELD|my_context) break; // Have lock // Didn't get lock // Suspend process until lock available place_current_process_on_queue; schedule; // blocks until wake_up_queued_processes called // Loop and try to obtain lock again }   }

Release Heavyweight Lock void release_heavyweight_lock(lock_t *L, int my_context) {       L->context = 0; // CAS can be used to detect multiple unlocks wake_up_queued_processes; }

Discussion

Both getting and releasing a lightweight lock requires a CAS instruction, which may cause a processor bus LOCK cycle. The bus LOCK is avoided on Pentium Pro and later processors if the cache line is resident in the cache of the current processor. Since this is typically the case for the use of this lock, scalability should be good in the number of processors. On RISC processors, the load-linked/store-conditional instruction pair is used to implement CAS. This instruction pair does not cause bus locks, and scales well in the number of processors.

If the lock release uses a simple write operation instead of a CAS, then there will be a race condition between the time the process determines that the kernel wants the lock, and the time the lock is released. During this time, the kernel could set the bit making the request. If the process does not realize the kernel is waiting for the lock, then the kernel will not obtain the lock until the same process or another process requests the lock again. Since this may never happen, deadlock could result.

5. Acknowledgements

Thanks to James H. Anderson for discussing lock implementation issues.

6. References

[AM97] James H. Anderson and Mark Moir. Universal constructions for large objects. Submitted to IEEE Transactions on Parallel and Distributed Systems, June 1997. Available from http://www.cs.pitt.edu/~moir/Papers/anderson-moir-tpds97.ps. (An earlier version of this paper was published in Proceedings of the Ninth International Workshop on Distributed Algorithms, Lecture Notes in Computer Science 972, Springer-Verlag, pp. 168-182, September 1995.)

[BKMS98] David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin locks: featherweight synchronization for Java. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (Montreal, Canada, 17-19 June 1998). Published as SIGPLAN Notices 33, 5 (May 1998), 258-268.

[FM99] Rickard E. Faith and Kevin E. Martin. A Security Analysis of the Direct Rendering Infrastructure. Cedar Park, Texas: Precision Insight, Inc., 1999.

[L87 Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems 5, 1 (Feb. 1987), 1-11.

[LA93] Beng-Hong Lim and Anat Agarwal. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Transactions on Computer Systems 11, 3 (Aug. 1993), 253-294.

[M92] Henry Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services. Ph.D. dissertation, published as Technical Report CUCS-039-92. Graduate School of Arts and Sciences, Columbia University, 1992, 71-91. Available from ftp://ftp.cs.columbia.edu/reports/reports-1992/cucs-039-92.ps.gz.

[MCS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems 9, 1 (Feb. 1991), 21-65.

[MFOA99] Kevin E. Martin, Rickard E. Faith, Jens Owen, Allen Akin. Direct Rendering Infrastructure, Low-Level Design Document. Cedar Park, Texas: Precision Insight, Inc., 1999.

[O82] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, 1982, pp. 22-30.

[OM98] Jens Owen and Kevin E. Martin. A Multipipe Direct Rendering Architecture for 3D. Cedar Park, Texas: Precision Insight, Inc., 15 September 1998. Available from http://www.precisioninsight.com/dr/dr.html. 10°

Vulkan
Graphics Core Next (GCN) architecture from Southern Islands (GCN 1.0) and up supports Vulkan (the Radeon 7000 series is Vulkan 1.0, while 8000 series Sea Islands is GCN 2 and supports Vulkan 1.1). Earlier non GCN architectures shouldn’t support Vulkan natively.

Vulkan could be the only API that is necessary at the driver level. OpenGL and Direct3D can be used through wrappers.

Radeon video memory structure, GPU have its own virtual memory and page table like CPU.

Render loop needed and other major parts:


 * Initialization and acquiring information.
 * Graphics memory allocation (AMDGPU_GEM_CREATE).
 * GPU virtual memory mapping (AMDGPU_GEM_VA).
 * Sending command buffers to GPU ring buffers (AMDGPU_CS).
 * Handling interrupts to know when command execution completed (AMDGPU_WAIT_CS)

Radeon GPU has it own MMU unit so each process using 3D acceleration have its own GPU virtual address space. GPU use 2 level page translation table to map virtual addresses to GPU physical addresses.

On Southern Islands Radeon GPU physical memory layout is: 0…VRAM size: VRAM (video RAM) VRAM size…GTT end: GTT (CPU memory mapped to GPU) Ring buffer needs to manage the read writes to from sys ram mmu to gpu mmu with compiled shaders and drawing commands to be frd down gpu pipelines

Reference counted GPU buffer class and GPU memory manager. Memory manager can allocate memory of 3 types:
 * VRAM mappable to CPU,
 * VRAM not mappable to CPU,
 * CPU memory mapped to GPU (GTT). GTT buffer

A GART page table that maps CPU memory to GPU GTT memory range. So CPU memory mapping is in userland without special kernel driver. Making sure the GPU DMA engine can write to GTT memory

Indirect buffers allows to execute commands on ring buffer without copying it to ring buffer. Instead execute indirect buffer command is written to ring buffer with indirect buffer address and size as parameter. Vulkan driver prepares and sends commands in indirect buffers.

GTT buffer test. bufAdr is OS area address mapped to GPU by GART and written by GPU DMA engine.