Hardware/API improvements to help
Molly Rocket Forum Index -> Sparse Virtual Texturing
sylvan



Joined: 29 Feb 2008
Posts: 8

Posted: Sat Mar 01, 2008 12:17 am

It would be cool if the GPU could basically do this whole thing for us. I.e. it just spits out a list of page faults each frame (and automatically tags that page as "invalid" in the TLB to avoid too many duplicates and then moves down the mipchain), then the CPU reads this and periodically uploads one page at a time to the GPU. These pages wouldn't have any notion of texels or anything, it would just be a chunk of memory (4K, 8K?)

This would be very nice because we wouldn't have to worry about filtering and all that noise. The GPU would just say "oi, I need 4K of data starting at virtual address A", and the CPU would look this address up (in some range set structure) to see which texture is responsible for that virtual address. It would then load that bit of data from disk (or a compressed intermediate cache), grab a page from a pool somewhere to stick it in (using "second chance" or something similar to find an available page), and post it to the GPU, which would then enter it into its page table.
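
A minimal sketch of the CPU-side lookup described above, assuming 4K pages and a sorted list of non-overlapping address ranges. The `RangeSet` class, the texture names, and the base addresses are all hypothetical and only illustrate the idea:

```python
import bisect

PAGE_SIZE = 4096  # 4K pages, one of the sizes floated above


class RangeSet:
    """Maps non-overlapping virtual address ranges to texture names (hypothetical)."""

    def __init__(self):
        self._starts = []  # sorted range start addresses
        self._ranges = []  # parallel list of (start, end, texture)

    def add(self, start, size, texture):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._ranges.insert(i, (start, start + size, texture))

    def lookup(self, address):
        # Find the last range whose start is <= address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, end, texture = self._ranges[i]
        if address >= end:
            return None  # address falls in a gap between ranges
        return texture, address - start


ranges = RangeSet()
ranges.add(0x100000, 64 * PAGE_SIZE, "terrain_albedo")  # made-up textures
ranges.add(0x200000, 16 * PAGE_SIZE, "rock_normal")


def resolve_fault(fault_address):
    """Return (texture, byte offset of the faulting page within it), or None."""
    page_start = fault_address & ~(PAGE_SIZE - 1)  # round down to the page boundary
    return ranges.lookup(page_start)
```

The returned offset tells the loader which 4K chunk of the texture's data to fetch; nothing in this step needs to know about texels or formats.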

Even more flexible would be to obviously just get an interrupt on the CPU and handle it however you want to, but that may be slow...

Obviously if we get more flexible GPUs (Larrabee?) we could basically write this ourselves.
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Sat Mar 01, 2008 9:03 pm

My guess is that TomF and company have already implemented much of this on Larrabee. :)

How does this system choose which pages to replace? Does the GPU or CPU decide? Aside from that, it sounds much like the asynchronous query system. I'm not sure why you would rather traffic in bytes than pixels.

It would be nice if you didn't have to do anything special to deal with filtering, which means the hardware would have to do some funky stuff at tile boundaries, where some of the filter samples could come from tiles that are from radically different mip levels. I imagine this would be similar to what you would need to support texture borders or filtering across cubemap faces.

Speaking of which, borders and cubemaps might be an interesting thing to generalize; for example, allow users to stitch 2-D texture borders in various patterns, although I'm not sure how you would do a page map for this.

It would also be nice if the indirection provided by the page map could be used to control rendering to physical textures as well (including masking chunks), to make it easier/faster to do something like rendering to adaptive shadow maps. You'd have to have different edge increments or whatever for rendering at different resolutions simultaneously.
sylvan



Joined: 29 Feb 2008
Posts: 8

Posted: Sat Mar 01, 2008 9:23 pm

I'd like the CPU to decide which pages to replace. Basically you'd just keep a pool of physical pages, and then send up information about what page goes where to the GPU, which will enter it into its page table. The various bits for each page have to be accessible by the CPU for this to work though (e.g. the "referenced" bit would be needed for the page replacement strategy). So it may be better to actually keep the page table on the CPU and do all of this logic yourself. A TLB miss would then trigger an interrupt on the CPU, and your job then is to quickly return the physical address for the virtual address in question (if it's there). If it's not there we still need a way for the page fault to "fail", which would cause the GPU to try the next MIP level instead. And I guess the GPU can keep an "invalid" bit in the TLB for this address so it won't trigger an interrupt for that same page again (unless that TLB entry happens to be overwritten). Then when the CPU has loaded a page it would just flush the TLB entry for the virtual address which caused the page fault, which will cause that address to trigger another interrupt, which means you can just return the newly allocated physical address for that page.
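
A rough sketch of the "fail and try the next MIP level" behaviour, assuming a CPU-side page table keyed by (mip, page_x, page_y) and assuming the coarsest level is always resident. The `PageTable` class and this keying scheme are hypothetical simplifications; a byte-addressed table would look different:

```python
class PageTable:
    """Hypothetical CPU-side page table with fallback to coarser mip levels."""

    def __init__(self, coarsest_mip):
        self.coarsest_mip = coarsest_mip
        self.entries = {}  # (mip, page_x, page_y) -> physical page index
        self.faults = []   # pages that were requested but not resident

    def map(self, mip, page_x, page_y, physical_page):
        self.entries[(mip, page_x, page_y)] = physical_page

    def translate(self, mip, page_x, page_y):
        """Return (mip_used, physical_page), stepping down the mip chain on failure."""
        while mip <= self.coarsest_mip:
            key = (mip, page_x, page_y)
            if key in self.entries:
                return mip, self.entries[key]
            if key not in self.faults:
                self.faults.append(key)  # remember the miss once, for later loading
            # Assumption: each coarser level covers the same area with half
            # as many pages per axis.
            mip += 1
            page_x //= 2
            page_y //= 2
        raise LookupError("coarsest mip level must always be resident")
```

Once the CPU has loaded a missing page it would call `map` for it, and the next `translate` for that page would succeed at the finer level.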

More flexible, but if we have to go to the CPU every time the TLB misses that could be costly (the goal would be that the vast majority of TLB misses would be actual page faults, rather than just running out of TLB entries). Though with lots of threads to hide latency (and assuming this only happens a few times per frame) it may be totally acceptable. It may be that a large TLB is cheaper than implementing a bunch of page logic on the GPU, though... Any hardware people care to chime in?

I'd prefer to deal directly with bytes rather than texels because it simplifies things greatly. You don't have to worry about different formats, different compression schemes etc., just chunks of bytes.
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Sat Mar 01, 2008 10:49 pm

Well, things like texture format/layout are exactly what you want the GPU to worry about, not the application. At least this is the OpenGL style.

I don't think CPU interrupts for GPU virtual texture faults would be good for interactive performance, since those things have enough trouble masking the latency from GPU memory, let alone some arbitrary CPU callback. If the interrupt doesn't actually block the GPU, then why make it an interrupt rather than an asynchronous query? Something like a GPU callback might make sense here; conditional rendering can/should be triggered on faults as well as occlusion queries. This could be how you implement lazy, sparse rendering of things like reflection maps without much CPU intervention. But you'd still likely have a frame of delay.

Here's an idea: allow conditional renders to be specified to be delayed a frame, so that in an AFR system a separate GPU performs the conditional rendering overlapped with the current frame.

Also, I think discussions about TLBs confuse the issue somewhat, since the application developer should really only have to worry about page maps and not TLBs, and the API should transparently invalidate the TLB when necessary. Not to say that a TLB isn't important (could double fillrate or better), but it is an implementation detail. But if you insist, you might as well start the debate about whether texture caches are virtually or physically addressed, and how this might interact with render-to-sparse-texture or explicit page map control.
sylvan



Joined: 29 Feb 2008
Posts: 8

Posted: Sat Mar 01, 2008 11:18 pm

Yes, precisely, the GPU should be the one that worries about texture format and layout (not the CPU), which is why the pages should be bytes, not texels. Not sure if you were agreeing with me there or if I'm just misunderstanding your argument. It seems reasonably simple to stick a mostly standard virtual memory system (with the notable addition of "failure") on the GPU, that just requests regular "chunks of bytes" in the standard way, and leave everything else the same (except, as you point out, extra logic to do filtering properly when the initial addressing of a sample returns "failure" and you have to go down the mip chain).

The GPU does need to block on the page fault if the page table is on the CPU side of things (because it needs to know the physical address before it can continue, if the page is indeed available, else it needs to be told that the page is not available), which is why it needs to be an interrupt. If the page fault is handled on the GPU entirely then it could (should) be asynchronous (like I said in my original post - you just read it at the end of the frame). You'd really just need a fixed buffer of page fault entries, as it's actually completely okay to ignore a page fault when the buffer is full (they'll just get handled next frame).
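
A small sketch of the fixed-size fault buffer idea, assuming the CPU drains it once per frame. The `FaultBuffer` name and its methods are hypothetical; the point is only that a full buffer can drop faults safely because they will be raised again next frame:

```python
class FaultBuffer:
    """Fixed-capacity list of faulting virtual pages (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []
        self.dropped = 0  # faults ignored this frame because the buffer was full

    def report(self, virtual_page):
        """Record a fault. Returns False if it was dropped."""
        if virtual_page in self.entries:
            return True  # duplicate, already queued
        if len(self.entries) >= self.capacity:
            self.dropped += 1  # fine to ignore: it will fault again next frame
            return False
        self.entries.append(virtual_page)
        return True

    def drain(self):
        """Called by the CPU at the end of the frame to collect the faults."""
        faults, self.entries = self.entries, []
        self.dropped = 0
        return faults
```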

The problem is that if we want the CPU to be able to decide which pages to replace (and I guess we do? But maybe not, it could just implement a reasonable strategy on the GPU and I'd be happy with that) then the CPU needs to gather some form of usage statistics (typically a "referenced" bit), so if the GPU needs to pass data over to the CPU about page usage anyway, it may make sense to just keep the page table on the CPU for extra flexibility... But yeah, latency would be an issue. Though I'm not convinced it would be prohibitive, if you have enough threads to hide the latency on the GPU and a TLB large enough so that it doesn't hit the CPU interrupt needlessly (i.e. when the page is available, and was used fairly recently)... I'm just brainstorming though, I may be completely off...
TomF



Joined: 18 Feb 2007
Posts: 107
Location: Seattle

Posted: Sun Mar 02, 2008 3:09 am

Quote:
How does this system choose which pages to replace?

The same way a normal OS does - not very well :( It's surprisingly difficult to know which pages were used last, but you can play tricks with access/dirty bits or putting pages into "purgatory" (marking them invalid when they're not) to have a fairly decent idea. So although it's a bunch of heuristic hacks, in practice it works pretty well. It doesn't really matter if the GPU or the CPU does this, as ideally you're not doing it very often (absolutely limited by PCIe upload bandwidth, though in practice there are limits that kick in sooner). Of course if your GPU is a CPU, it's a moot question, but from my 3Dlabs days I remember that all the page-table manipulation was done on the host CPU and it worked just fine.
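
One common way to use an access bit is the "second chance" (clock) policy mentioned earlier in the thread. The sketch below is only an illustration of that textbook policy, not a description of what any particular OS or the 3Dlabs driver did; the `SecondChancePool` class is hypothetical:

```python
class SecondChancePool:
    """Pool of physical pages with second-chance (clock) replacement."""

    def __init__(self, num_pages):
        self.owner = [None] * num_pages        # virtual page held by each physical page
        self.referenced = [False] * num_pages  # stand-in for the hardware access bit
        self.hand = 0                          # clock hand: next page to inspect

    def touch(self, physical_page):
        # In hardware the access bit would be set by the page walk, not by a call.
        self.referenced[physical_page] = True

    def allocate(self, virtual_page):
        """Pick a victim page. Returns (physical_page, evicted_virtual_page)."""
        while True:
            page = self.hand
            self.hand = (self.hand + 1) % len(self.owner)
            if self.owner[page] is not None and self.referenced[page]:
                self.referenced[page] = False  # give it a second chance
                continue
            evicted = self.owner[page]
            self.owner[page] = virtual_page
            self.referenced[page] = True
            return page, evicted
```

The loop always terminates: after one full sweep every access bit has been cleared, so the second sweep finds a victim.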

If you really want a not-very-smart GPU to handle its own page faults, the CPU could give it a list of free pages it can use when required, and keep that list topped up each frame.
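
A sketch of that free-list handshake, assuming the CPU refills the list once per frame and the GPU pops from it on a fault. `FreePageList`, `top_up`, `take`, and the `reclaim_page` callback are hypothetical names:

```python
from collections import deque


class FreePageList:
    """List of free physical pages that the CPU keeps topped up (hypothetical)."""

    def __init__(self, target):
        self.target = target  # how many free pages to have ready each frame
        self.pages = deque()

    def top_up(self, reclaim_page):
        """CPU side, once per frame. reclaim_page() returns a page index or None."""
        while len(self.pages) < self.target:
            page = reclaim_page()
            if page is None:
                break  # nothing left to reclaim this frame
            self.pages.append(page)

    def take(self):
        """GPU side, on a fault. Returns a free page, or None if the list ran dry."""
        return self.pages.popleft() if self.pages else None
```

If `take` returns None the fault simply stays unserviced until the next top-up, which fits the "drop a mip level and retry later" approach.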

Quote:
which means the hardware would have to do some funky stuff at tile boundaries

The simple thing is to say that if any of the pages for the texels that make up a sample are missing, you drop a mipmap level. This only grows the size of missing pages by a texel, and is far simpler to implement than some half-on-half-off scheme.
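
A sketch of that rule for a 2x2 bilinear footprint, assuming each page holds a 32x32 texel tile, clamp addressing, and a set of resident (mip, page_x, page_y) keys. These layout details and function names are assumptions made for the example:

```python
import math

PAGE_TEXELS = 32  # hypothetical: each page holds a 32x32 texel tile


def footprint_pages(u, v, mip, width, height):
    """Pages touched by the 2x2 bilinear footprint at (u, v) in [0, 1] for one mip."""
    w = max(1, width >> mip)
    h = max(1, height >> mip)
    x0 = math.floor(u * w - 0.5)  # left/top texel of the 2x2 footprint
    y0 = math.floor(v * h - 0.5)
    pages = set()
    for ty in (y0, y0 + 1):
        for tx in (x0, x0 + 1):
            cx = min(max(tx, 0), w - 1)  # clamp addressing at the texture edge
            cy = min(max(ty, 0), h - 1)
            pages.add((mip, cx // PAGE_TEXELS, cy // PAGE_TEXELS))
    return pages


def pick_mip(u, v, wanted_mip, coarsest_mip, width, height, resident):
    """Finest mip at or below the wanted detail whose whole footprint is resident."""
    for mip in range(wanted_mip, coarsest_mip + 1):
        if footprint_pages(u, v, mip, width, height) <= resident:
            return mip
    return None  # nothing resident, not even the coarsest level
```

A sample that straddles a page boundary needs both pages; if either is missing, the whole sample drops to the next level rather than blending across levels.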

Quote:
make it easier/faster to do something like rendering to adaptive shadow maps

Yes, it does :)

Quote:
I'd prefer to deal directly with bytes rather than texels because it simplifies things greatly.

The main reason to do this is that nothing but the texture sampler cares about texel sizes. So all the code dealing with page tables is agnostic - code, data, heap, textures, vertex buffers, whatever - the OS and hardware doesn't know and doesn't care.

Remember that most GPUs swizzle memory in various ways so that a contiguous chunk of memory is a square or rectangular piece of texture data.
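
To illustrate why a contiguous page can map to a square of texels, here is a sketch using Morton (Z-order) swizzling with 4-byte texels and 4K pages. Real GPU layouts are vendor-specific and usually differ from plain Morton order, so treat the layout, the texel size, and the function names as assumptions:

```python
PAGE_SIZE = 4096
BYTES_PER_TEXEL = 4  # assumed uncompressed 32-bit texels


def morton_encode(x, y):
    """Interleave the bits of x and y, with x in the even bit positions."""
    code = 0
    bit = 0
    while (x >> bit) or (y >> bit):
        code |= ((x >> bit) & 1) << (2 * bit)
        code |= ((y >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return code


def page_of_texel(x, y):
    """Index of the 4K page that holds texel (x, y) in the swizzled layout."""
    return morton_encode(x, y) * BYTES_PER_TEXEL // PAGE_SIZE
```

With these numbers a page holds 1024 texels, and the first 1024 Morton codes are exactly the texels with x and y below 32, so page 0 is the 32x32 square at the origin.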

The idea of dropping mipmap levels on a fault and servicing it later is what I call "just too late" rendering, instead of "just in time" (well, more like "just hang on a sec"). Plenty of people use this in a coarser manner on the consoles, and when you do it right it works fine (Halo 2 and Mass Effect did not do it right unfortunately). But even JIT is not too bad - the first page-faulting 3Dlabs hardware had no way to continue until the page fault was serviced by the CPU, and it worked amazingly well.

A TLB is just a cache of the page-walk. It should be transparent to the coder, and managed by the OS when required - just like any other cache.
sylvan



Joined: 29 Feb 2008
Posts: 8

Posted: Sun Mar 02, 2008 12:40 pm

TomF wrote:
the first page-faulting 3Dlabs hardware had no way to continue until the page fault was serviced by the CPU, and it worked amazingly well.


I think this would work well in situations where you have lots of data on the CPU side, and all you have to worry about on a page fault is finding the correct page in system memory and uploading it. In those cases you can probably get away with very little GPU memory and just upload the required pages as needed (i.e. you could have like 32-64MB of GPU memory and get away with it). I don't think it would work when the data you're rendering overwhelms even the system memory and has to be loaded from disk though...

Along those lines I've been thinking about how feasible it would be for the next generation of consoles to add another level to the virtual memory hierarchy (between RAM and DVD/hard drive) that's several times larger than system memory (say 16-32 gigs or thereabouts), and much slower, but still fast enough that you could easily grab a small number of pages from it every frame in a JIT page fault manner. This in combination with some sort of conservative coarse prefetching (i.e. on a per-portal cell basis or something) that grabs data from disk into this secondary memory could completely eliminate popping while still giving us unique and very high-res texturing...
Not sure exactly what that storage would be, possibly just "the cheapest RAM we could find". But I'm not a hardware designer, so I'm not clear on whether there are actually any good candidates that would be cheap enough to make this worthwhile (more so than just spending that money on more primary memory)...
All times are GMT