Molly Rocket
Molly Rocket games and everything even tangentially related
Hardware/API improvements to help
Molly Rocket Forum Index -> Sparse Virtual Texturing
icastano



Joined: 01 Jul 2007
Posts: 32

Posted: Fri Feb 22, 2008 8:19 am    Post subject: Hardware/API improvements to help

Sean, your lecture was pretty good, the best I've seen so far at GDC this year. I'm sorry I had to leave in the middle of it, but I had to catch my train back home. I didn't get the chance to ask you some questions, but here it goes:

- What extensions to the graphics pipeline/api would you propose to accelerate this application?

- Something that comes to mind is a mechanism to find out what texture data you need, and avoid the CPU readback. Do you have any suggestion of what that mechanism should look like?
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Sun Feb 24, 2008 3:58 am

icastano wrote:
Sean, your lecture was pretty good, the best I've seen so far at GDC this year.


Thank you!

Quote:
What extensions to the graphics pipeline/api would you propose to accelerate this application?


I'm not sure anything specific is needed on the API front. A lot of the things in the implementation work in a way that seems to be really the obvious "right" way. It would be cool if there was some way to avoid needing to pad the edge of pages, because the hardware could fetch from the other pages (at greater cost, but it's infrequent), but the complexity of that is huge and doesn't seem worth it.

I'm not sure where the biggest costs in the pipeline are, since I didn't work very hard on optimization. One obvious thing for high quality is that I take the partial derivatives of the original texture coordinates, and then use those to drive the sampling of a later stage (after a simple rescaling). I assume that hardware computing mip levels for real textures doesn't use finite differences of the coordinates, but derives them analytically. It would be nice if the pipeline allowed it, and the compiler was smart enough, to forward the analytic derivatives from the texture coordinates to the later sample operation, so they wouldn't need to be recomputed. (Maybe they already do this, I don't know.)
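
To make the "partials drive the sampling after a simple rescaling" step concrete, here is a minimal Python sketch. It assumes the common textbook rule that the level of detail is log2 of the longer pixel-footprint axis measured in texels; the function name and parameters are made up for this illustration, and a real GPU may combine the derivatives differently.

```python
import math

def mip_level_from_partials(dudx, dvdx, dudy, dvdy, tex_w, tex_h):
    """Estimate a mip level from screen-space partials of normalized UVs.

    Assumption: LOD = log2 of the longer pixel-footprint axis in texels.
    This is a textbook formula, not necessarily what any given GPU does.
    """
    # Rescale normalized-UV partials into texel units.
    len_x = math.hypot(dudx * tex_w, dvdx * tex_h)
    len_y = math.hypot(dudy * tex_w, dvdy * tex_h)
    rho = max(len_x, len_y)
    if rho <= 0.0:
        return 0.0
    # Clamp at 0: magnification still uses the finest level.
    return max(0.0, math.log2(rho))
```

For example, a pixel that covers 4 texels along each screen axis of a 1024x1024 texture lands on level 2 under this rule.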

If we omit the final texture fetch and just talk about the cost in the shader of remapping an input virtual coordinate to a final physical coordinate, the best known shader is only two instructions (this is Carmack's bilinear shader that I hand-waved about). However, that doesn't handle trilinear, and doing trilinear well adds a lot of bloat, so the stuff above could bring that way back down.
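
As a rough picture of what "remapping a virtual coordinate to a physical coordinate" involves, here is a Python sketch. It is not the shader mentioned above; it assumes a layout in which each page-table entry stores a scale and a bias, so that after the lookup the remap is a single multiply-add. The names `virtual_to_physical` and `make_entry`, and the omission of page padding and mip levels, are simplifications made for this example.

```python
def make_entry(px, py, slot_x, slot_y, pages_per_side, slots_per_side):
    """Build a (scale, bias_u, bias_v) entry for virtual page (px, py)
    stored in physical cache slot (slot_x, slot_y). Hypothetical layout."""
    scale = pages_per_side / slots_per_side
    bias_u = (slot_x - px) / slots_per_side
    bias_v = (slot_y - py) / slots_per_side
    return scale, bias_u, bias_v

def virtual_to_physical(vu, vv, page_table, pages_per_side):
    """Map a virtual UV in [0, 1) to a physical UV via a page-table lookup.

    Assumption: physical = virtual * scale + bias, with scale and bias read
    from the page-table entry covering the virtual coordinate.
    """
    px = min(int(vu * pages_per_side), pages_per_side - 1)
    py = min(int(vv * pages_per_side), pages_per_side - 1)
    scale, bias_u, bias_v = page_table[py][px]
    return vu * scale + bias_u, vv * scale + bias_v
```

With a 4x4 grid of virtual pages and a 2x2 physical cache, a point halfway across virtual page (2, 1) that lives in slot (1, 0) maps to a point halfway across that slot.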

Quote:
Something that comes to mind is a mechanism to find out what texture data you need, and avoid the CPU readback. Do you have any suggestion of what that mechanism should look like?


I really have no good ideas. Maybe it's possible you could try to do the page management on the GPU, but it seems crazy. I guess what you could do on the GPU is try to reduce your data--there's going to be tons of exactly identical items (the same page ID and mipmap level), since pages are generally covering large numbers of texels. So some kind of GPU sort-and-find-uniques algorithm might allow reducing the data you need to read back.
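
The data reduction described here can be illustrated on the CPU side. The Python sketch below collapses a per-pixel feedback buffer into its unique (page, mip) requests; a GPU version would do the equivalent with a sort or compaction pass before readback. The tuple layout and the function name are assumptions for illustration only.

```python
from collections import Counter

def reduce_feedback(feedback_pixels):
    """Collapse a per-pixel feedback buffer into unique page requests.

    feedback_pixels: iterable of (page_x, page_y, mip) tuples, one per pixel.
    Returns the unique requests, most frequently referenced first.
    """
    counts = Counter(feedback_pixels)
    return [page for page, _ in counts.most_common()]
```

Because one page usually covers many pixels, the output is typically far smaller than the input, which is the point of doing the reduction before the readback.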

What I think would work reasonably well is to be a frame behind--allow the readback to be delayed and only available a frame later. At high frame rates, the quality loss would be minimal, but this would clean up a lot of potential pipelining issues (I think). I don't know if there's actually a way to do that and make it go faster, though (can you parallelize the readback?)
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Sun Feb 24, 2008 4:33 am

The other thing that hardware should always do is be orthogonal.

For example, as I said it would be nice if we could just directly get at the analytic derivatives of the texture coordinates, assuming they're computed that way. This is mostly a performance orthogonality; we can compute them by hand, but if that uses finite differencing it's probably slower and less accurate. Is that really worth burning transistors on making this orthogonal (allowing that data to be forwarded from the texture unit)? Maybe not.

But here's an orthogonality issue that I've encountered twice while working on SVT: Mipmapping beyond 1x1. (Uh, assuming hardware hasn't started doing this, or been doing it all along, and I just never found out.)

Now wait, you say, how can you mipmap beyond 1x1? And if you could, is that really orthogonality?

The mipmap level above a 1x1 mipmap level would be 1x1. This generalizes the practice already visible from non-square textures, whose mipmaps go:

  • 16 x 4
  • 8 x 2
  • 4 x 1
  • 2 x 1
  • 1 x 1


Either coordinate, if it drops to only one texel wide, simply stays 1 texel wide for the remaining mipmaps. So the "orthogonality" here is in having that just always be true, even if you hit 1x1.
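
The clamping rule can be written down directly. This is a small Python sketch; the `extra_levels` parameter models the hypothetical "keep going past 1x1" behaviour discussed in this post and does not correspond to any existing API.

```python
def mip_chain(width, height, extra_levels=0):
    """List mip dimensions, clamping each axis at 1 texel.

    extra_levels appends further 1x1 levels past the usual end of the
    chain (the hypothetical "mipmapping beyond 1x1").
    Assumes width and height are positive integers.
    """
    levels = [(width, height)]
    while (width, height) != (1, 1):
        width, height = max(1, width // 2), max(1, height // 2)
        levels.append((width, height))
    levels.extend([(1, 1)] * extra_levels)
    return levels
```

Calling `mip_chain(16, 4)` reproduces the list above, and `mip_chain(1, 1, extra_levels=2)` gives three 1x1 levels.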

The two cases where I've encountered it: with a page size of, say, 128, the coarsest mipmap your SVT can have is not 1x1, but 128x128... because the coarsest page table you can have is 1x1. If you could have additional "more coarse" tables above 1x1, you could use those as page table entries referencing additional pages, which would be incomplete. (E.g. the next page would only use a 64x64 region of the 128x128 page, etc.)
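
The arithmetic for the first case can be spelled out with a short Python sketch, assuming square power-of-two sizes. The function name and the `allow_partial_pages` flag are invented for this example; the flag stands for the hypothetical coarser-than-1x1 page tables.

```python
def svt_level_sizes(virtual_size, page_size, allow_partial_pages=False):
    """Texel widths of the mip levels a virtual texture can address.

    Without partial pages the chain stops at one full page. With them, it
    continues down to 1 texel, each extra level using only part of a page.
    """
    sizes = []
    size = virtual_size
    while size >= page_size:
        sizes.append(size)
        size //= 2
    if allow_partial_pages:
        while size >= 1:
            sizes.append(size)
            size //= 2
    return sizes
```

For a 1024-texel virtual texture with 128-texel pages this gives 1024, 512, 256, 128, and with partial pages allowed it continues with 64, 32, and so on down to 1.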

Another application for this is using a mipmap to compute a trilinear factor, by storing the mip level in the texture and sampling it with trilinear filtering. If you need to support 10 mip levels, this requires a 1K x 1K texture, which is wasteful (and inefficient on the fine levels), and allowing many more mip levels quickly becomes infeasible. Imagine if it was 1K x 1K, but once you hit the 1x1 mip level, you could keep going "up" with coarser mip levels, each of which is still 1x1, but that's fine because its only job is to store the mip level to be trilineared. In fact, forget 1K x 1K; you'd just make EVERY level be 1x1, and the thing would use teensy amounts of memory.
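
Here is a small Python simulation of that trick, assuming mip level i of the lookup texture stores the value i and that trilinear filtering blends the two nearest levels linearly by the fractional LOD. The function name is hypothetical, and real hardware filtering precision will differ.

```python
import math

def sample_level_texture(lod, num_levels):
    """Simulate trilinear sampling of a texture whose mip level i stores i.

    Blending the two nearest levels by the fractional LOD returns the LOD
    itself, clamped to the levels that actually exist.
    """
    lod = min(max(lod, 0.0), num_levels - 1)
    lo = int(math.floor(lod))
    hi = min(lo + 1, num_levels - 1)
    frac = lod - lo
    stored_lo, stored_hi = float(lo), float(hi)  # level i stores the value i
    return stored_lo * (1.0 - frac) + stored_hi * frac
```

The clamp shows the limitation being described: with an 11-level chain the result saturates at 10, while a chain allowed to continue with more 1x1 levels would keep reporting larger values.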

(Currently I actually use one of these textures in the optimized shader that I ran on my laptop for the first pass (pre-readback) to compute the needed mipmap level.)

This is just a way of computing a function per mip level. You could also allow just passing the mip-level computation from the texture coordinates straight into the shader, and do an arbitrary function on that. That works for this case, but is less general and less flexible.

And of course you can just use partials to compute the expected mip level by hand if you have partials available (which is what I do in GLSL), so in some sense this is irrelevant to future hardware, but I wanted to point out that it COULD be running better on the older hardware if they had just bothered to make this feature orthogonal instead of assuming 'well, why would you ever want to go up past 1x1'? Thinking that you wouldn't is absolutely failing to think about your texture+shader as a general computation unit, and falling into thinking about them
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Sun Feb 24, 2008 6:19 am

Well, it seems that a lot of the complexity would go away if GPUs supported virtual texture memory (as Carmack has wanted for some time). Sean alludes to an important difference when he says that, unlike normal virtual memory, you can fudge things with placeholders, which suggests that what you need is some smart kind of non-blocking faulting behavior.

What would be cool is if you could tell the GPU to use some massive virtual texture, and as it rendered it, it used the best MIP level available. Then, you could query the hardware for the texture tiles that "faulted" and the texture tiles that "idled" and leave it to software to deal with the replacement/upload or even perform prefetch.
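
A software model of that behaviour might look like the Python sketch below. Everything in it is hypothetical (the tile tuples, the `faults` and `used` sets, the function name); it only illustrates "use the best resident mip, remember what faulted, and know what idled", not any real hardware or API.

```python
def resolve_tile(x, y, mip, resident, max_mip, faults, used):
    """Return the finest resident tile covering the requested one.

    resident: set of (x, y, mip) tiles currently in the cache.
    faults:   set collecting requested tiles that were missing.
    used:     set collecting resident tiles that were actually sampled.
    Walks toward coarser mips (halving coordinates) without blocking;
    returns None if nothing up to max_mip is resident.
    """
    if (x, y, mip) not in resident:
        faults.add((x, y, mip))
    while mip <= max_mip:
        if (x, y, mip) in resident:
            used.add((x, y, mip))
            return (x, y, mip)
        x, y, mip = x // 2, y // 2, mip + 1
    return None
```

After a frame, `faults` is the list of tiles to consider uploading and `resident - used` is the set of idle tiles that are candidates for replacement.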
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Sun Feb 24, 2008 6:23 am

Yeah, in my original draft I mentioned DX10 Virtual Textures, which appear to just be Textures using virtual memory, where the GPU memory is backed by the CPU memory (compare VM where the CPU memory is backed by disk).

That's fine (although I guess irrelevant on consoles or 3d cards that have unified memory), but it's just the most trivial thing and isn't particularly streaming/procedural generation friendly. If you tried to stream with such a system, you'd still have to chop your stuff up into textures that could fit in CPU memory.

Also, I'm not sure why this needs to be "in" DX, I guess they added something to the driver model. Presumably hardware in OpenGL at least could have been doing this all along.
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Sun Feb 24, 2008 6:51 am

I think there is an important distinction that SVTM has over straightforward virtual texture memory, and that is: faults don't block rendering. And yes, you want things to be streaming/procedurally friendly, too.

I suppose with virtual texture memory you could have something like a per-tile LOD clamp, but you would still have no way of knowing what tiles were resident since you don't actually have control over that. Which comes back to the idea that faulting (and idling) textures should be queryable, so you could let the app manage the bandwidth to the card.
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Sun Feb 24, 2008 7:03 am

Oh yeah, I forgot to comment on that.

I agree this would be cool. It seems like a giant new architecture explicitly for this specific thing, so I don't know, maybe it doesn't make sense, to be able to stream this extra data out... I'd hate to add something specifically for this technology since it's not clear how useful it will turn out to be in practice. (Maybe people use it for three years and abandon it.)

Ok, here you go: what you need is support for multiple render targets, with different targets having different levels of AA (dunno if that exists yet).

Now, here's how it works. We'll combine the passes that render the actual frame and render the data-to-read-back. In one render target, we do our regular rendering, with whatever data is available (subbing mipmaps etc). In the other render target, we'll write out what page we wanted to access.

This is pretty much exactly what you're asking for, we're just using existing facilities to implement it.

Of course, we incur a one-frame latency on the readback. That's fine for moving but it sucks for mouselook rotation. But maybe that's just the price of doing business.

The only problem is that it's actually BETTER if we render it in a separate pass because we can decouple what we render in the two cases.

Here are some ways a decoupled readback render is useful:

  • We can predictively render the scene a few frames in advance of the camera, which a combined pass no longer allows. This might be crucial for disk streaming.
  • It won't help for mouselook, but we could help rotation a little by actually rendering with a different camera, with a wider FOV, so we're always downloading stuff 20-30 degrees outside the FOV, meaning we can handle a rotation rate of 20 degrees/frame (say, at 30 hz, that's 600 degrees/second) without visible loss.
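
The rotation-rate arithmetic in the second bullet is simple enough to write as a Python one-liner. The function name and the `latency_frames` parameter are assumptions added for this example.

```python
def max_rotation_rate(margin_degrees, frame_rate_hz, latency_frames=1):
    """Degrees/second of rotation that a readback FOV margin can absorb.

    Assumes the margin must cover everything the camera can sweep during
    the readback latency.
    """
    return margin_degrees * frame_rate_hz / latency_frames
```

A 20-degree margin at 30 hz with one frame of latency gives the 600 degrees/second quoted above; with two frames of latency the same margin covers only 300.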


If we do it all in one pass, we pay a frame of latency and give up the ability to do the above decoupling. Now, if that frame of latency got us from 30hz to 60hz, that would be totally worth it in general. But if it gets us from 30hz to 60hz and it then turns out we need the predictive/extra-spread camera for visual quality or disk streaming, it sucks.

Still, it seems reasonable to try.
TomF



Joined: 18 Feb 2007
Posts: 107
Location: Seattle

Posted: Sun Feb 24, 2008 6:28 pm

sean wrote:
Maybe it's possible you could try to do the page management on the GPU, but it seems crazy.

Oi! Who you calling crazy?

sean wrote:
I don't know if there's actually a way to do that and make it go faster, though (can you parallelize the readback?)

It's totally sensible to do this asynchronously. We already have things like asynchronous visibility queries that return their results a frame (or so) late - this feedback of which pages were missing would use the same mechanism. Ideally, the hardware would report a page missing and then use a smaller mipmap level that was present. Ideally, we'd have an API that supported such a concept.
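
On the application side, results that arrive a frame (or so) late can be modelled with a small queue. This Python sketch is only an illustration of that bookkeeping; the class and method names are invented and do not correspond to any real query API.

```python
from collections import deque

class DelayedFeedback:
    """Hand back page-fault feedback a fixed number of frames late."""

    def __init__(self, latency_frames=1):
        # Pre-fill so the first results come out only after the latency.
        self._pending = deque([None] * latency_frames)

    def end_frame(self, faults_this_frame):
        """Queue this frame's faults and return the ones issued
        latency_frames ago (None while the pipeline is still filling)."""
        self._pending.append(faults_this_frame)
        return self._pending.popleft()
```

Each call to `end_frame` never waits on the current frame's results, which is what lets the readback overlap with rendering.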
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Sun Feb 24, 2008 7:51 pm

Yeah, the more I think about it, it is definitely the logical place for it, and I think I envisioned the same async query interface. It would also help to know which resident tiles were not used (or a per tile fragment counter) so that you could do something reasonable for replacement. Although this count would necessarily be delayed by a frame, so it might make sense to keep this info on the card, and have tiles automatically replace tiles with the lowest fragment counts, which gives you an LFU-like replacement policy. But that's starting to sound very specialized.
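
The replacement policy sketched here is easy to express in software. The Python below assumes a hypothetical per-tile fragment counter, represented as a dict, and picks the least-sampled tiles as eviction victims; the names are made up for illustration.

```python
def pick_victims(fragment_counts, num_needed):
    """Choose tiles to evict, lowest fragment count first (LFU-like).

    fragment_counts: dict mapping tile id -> number of fragments that
    sampled the tile during the last counted frame.
    """
    ranked = sorted(fragment_counts, key=lambda tile: fragment_counts[tile])
    return ranked[:num_needed]
```

Tiles with a count of zero are the "idle" tiles and go first; ties keep the dict's insertion order because Python's sort is stable.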

Thinking in terms of queries, there is also NVX_conditional_render, which allows you to associate a rendering "callback" predicated on the result of a query. This might be useful, if you are doing something like procedurally generating SVTM GPU-side, or if you want to do some kind of processing or filtering on the queries before feeding them back to the CPU.
TomF



Joined: 18 Feb 2007
Posts: 107
Location: Seattle

Posted: Mon Feb 25, 2008 1:04 am

Won wrote:
But that's starting to sound very specialized.

Or extremely general-purpose :)
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Mon Feb 25, 2008 1:22 am

Well, to clarify: the query interface would be nice and general. It was just the auto-LFU business that I thought was too specific.
bengarney



Joined: 25 Feb 2008
Posts: 7

Posted: Tue Feb 26, 2008 3:31 am

sean wrote:

The mipmap level above a 1x1 mipmap level would be 1x1. This generalizes the practice already visible from non-square textures, whose mipmaps go:


This raises a question:

Why couldn't you get by with a 1024x1px texture instead of a square one for the mipmap lookup texture?
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Tue Feb 26, 2008 3:42 am

Because it depends on the partials of both coordinates. However, you can probably get away with two 1024x1 1D textures (well, one that you sample from twice, maybe that's what you meant), and take a min or a max or something, if the way the actual hardware mip computation combines them is min or max... which I think it is. I did mention that somewhere... probably in the rough draft and it got cut. :)
bengarney



Joined: 25 Feb 2008
Posts: 7

Posted: Tue Feb 26, 2008 8:55 am

Quote:
Because it depends on the partials of both coordinates.


That seems weird to me somehow, so I did a quick test and if I scaled my UV coordinates correctly it seems like mip selection works as normal - that is, if I make the mip texture 1024px by 1px, then scale the V coordinates by 1024 with wrapping on, it appears to select mips normally.
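
One way to see why this could work: if mip selection depends on the derivatives measured in texels, then a 1024x1 texture with V scaled by 1024 sees the same texel-space footprint as a 1024x1024 texture with unscaled V. The Python sketch below checks that under the assumed log2-of-longest-axis rule; the function names are invented, and actual hardware may use a different formula.

```python
import math

def footprint_in_texels(dudx, dvdx, dudy, dvdy, tex_w, tex_h):
    """Longest pixel-footprint axis in texels (assumed input to the LOD)."""
    len_x = math.hypot(dudx * tex_w, dvdx * tex_h)
    len_y = math.hypot(dudy * tex_w, dvdy * tex_h)
    return max(len_x, len_y)

def lod_square(dudx, dvdx, dudy, dvdy, size=1024):
    """LOD for a size x size texture sampled with the original UVs."""
    return math.log2(footprint_in_texels(dudx, dvdx, dudy, dvdy, size, size))

def lod_strip(dudx, dvdx, dudy, dvdy, size=1024):
    """LOD for a size x 1 texture with V multiplied by size before sampling
    (scaling V scales its derivatives by the same factor)."""
    return math.log2(
        footprint_in_texels(dudx, dvdx * size, dudy, dvdy * size, size, 1)
    )
```

Both textures also have the same number of mip levels, since the 1024x1 chain runs 1024x1, 512x1, and so on down to 1x1.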

You can see a shot at http://farm3.static.flickr.com/2315/2292843639_7cc1169d28_o.png. The mip levels are color coded: white is 1024px, black is 512px, then goes red, green, yellow, and I think the rest are invisible.

Does this only work because I'm not doing more complex filtering than trilinear?

Thanks for bugging me to look at the original talk - old_talk.txt was very informative, you really looked at this stuff thoroughly. :)

I did have a couple questions on areas where your outline was sparse.. assuming you're interested in answering them, should I ask here or kick them out into another thread?
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Tue Feb 26, 2008 9:06 am

Oh, hey, you may be right. I was thinking you meant a 1D texture, but if you do a 2D texture that's 1024x1, yeah, I guess that just works. Nice!

If it's not about hardware/api performance stuff, yeah, we should move it elsewhere. If you have a bunch of questions, maybe just start a general question thread.
All times are GMT