Molly Rocket
Molly Rocket games and everything even tangentially related

SVT in Vertex Shaders

 
bengarney



Joined: 25 Feb 2008
Posts: 7

Posted: Mon Feb 25, 2008 9:13 am    Post subject: SVT in Vertex Shaders

First off - Sean, great talk. I really enjoyed it. Doing the presentation in-demo was genius. :) It was great to see someone getting the tech together then pulling off an awesome talk with it. I'm excited to see these forums up so everyone can talk about it.

First post on these forums so I should introduce myself. I'm Ben Garney, I work at GarageGames and have done a lot of work with the Torque engines. I did some prototyping work on a system similar to SVT recently, along with a coworker of mine named Brian Richardson, and I wanted to share our results.

We do several things differently:

1. We preprocess the geometry into chunks by clipping it with a grid in UV space. Given you have to unwrap the world anyway, this seems like an acceptable change.

2. We calculate bounding spheres for each chunk and use that to analytically estimate what needs texels. (Thanks to TomF for math.)

3. We offset and bias each chunk's UV coordinates in the vertex shader, so that our pixel shaders can just do normal sampling - no dependent read or complex math.

4. We allow variable size allocations into the scratch texture. So a chunk might have a 4px or a 2048px tile allocated to it, or anywhere in between. This lets us keep coarser paging information. This does impact cache management complexity; I like the same-size-tile approach better.
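
A minimal sketch of item 3, with Python standing in for the vertex shader. It assumes each chunk covers an axis-aligned rectangle in the world's UV space and has been handed an axis-aligned rectangle in the scratch texture. The names (`Rect`, `chunk_uv_transform`, `remap_uv`) are hypothetical and not taken from the prototype. The scale and offset would be computed once per chunk on the CPU and passed in as shader constants, so the per-vertex work is a single multiply-add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in normalized [0, 1] texture space (hypothetical type)."""
    u0: float
    v0: float
    u1: float
    v1: float


def chunk_uv_transform(chunk: Rect, tile: Rect):
    """Return (scale_u, scale_v, offset_u, offset_v) mapping chunk UVs onto the tile."""
    scale_u = (tile.u1 - tile.u0) / (chunk.u1 - chunk.u0)
    scale_v = (tile.v1 - tile.v0) / (chunk.v1 - chunk.v0)
    # Chosen so that chunk.u0 lands on tile.u0 (and likewise for v).
    offset_u = tile.u0 - chunk.u0 * scale_u
    offset_v = tile.v0 - chunk.v0 * scale_v
    return scale_u, scale_v, offset_u, offset_v


def remap_uv(u, v, xform):
    """What the vertex shader would do per vertex: one multiply-add per axis."""
    scale_u, scale_v, offset_u, offset_v = xform
    return u * scale_u + offset_u, v * scale_v + offset_v


# Example: a chunk covering [0.25, 0.5] x [0.5, 0.75] of the world UV space,
# allocated the top-left eighth of the scratch texture.
xform = chunk_uv_transform(Rect(0.25, 0.5, 0.5, 0.75), Rect(0.0, 0.0, 0.125, 0.125))
corner = remap_uv(0.5, 0.75, xform)
```

Because the remap is affine, the interpolated UVs arriving at the pixel shader already address the scratch texture directly, which is why no dependent read is needed there.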

The benefits? No readback or pre-render, and no extra cost in the pixel shader. Cool!

I'm interested to hear what people think about this - does it seem viable? Our prototype does too much dumb stuff to be indicative of speed. Sampling issues might also be a concern - we haven't tested against anisotropy or any other exotic scenarios.
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Mon Feb 25, 2008 4:25 pm

In the rough draft of the talk, I basically pitched that the big value of doing it in the pixel shader is that it's geometry independent. You can do polygonal LOD, you can repaint the same texture on multiple surfaces, etc. They behave EXACTLY like regular textures for all purposes.

For example, you don't have to use these draped over terrain. That isn't really the point, it's just the easiest way to talk about it, and I had to cut a ton of little stuff like this from the talk and just focus on the simplest application.

So instead of a big terrain, for example, you can make 64 4K x 4K textures, each of which has its own page table but shares the physical texture. And you can turn on wrap mode on the page tables, and draw an indoor scene.
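
To make the shared-physical-texture idea concrete, here is a rough sketch of the lookup the pixel shader would perform, written in Python for readability. The layout is an assumption made for illustration: each virtual texture owns a square page table mapping a virtual page (x, y) to a page slot (x, y) in one shared physical texture, pages are uniform in size, and mip levels are ignored. `VirtualTexture` and `virtual_to_physical` are hypothetical names. Wrap mode is modeled by keeping only the fractional part of the virtual coordinate.

```python
import math
from dataclasses import dataclass


@dataclass
class VirtualTexture:
    """One virtual texture with its own page table (hypothetical layout)."""
    pages_per_side: int   # the page table is pages_per_side x pages_per_side
    page_table: dict      # (page_x, page_y) -> (phys_page_x, phys_page_y)


def virtual_to_physical(vt, u, v, phys_pages_per_side):
    """Translate a virtual UV into a UV in the shared physical texture."""
    # Wrap addressing on the page table: keep the fractional part only.
    u -= math.floor(u)
    v -= math.floor(v)
    # Scale into page space. The integer part picks the page, the
    # fractional part is the position inside that page.
    pu = u * vt.pages_per_side
    pv = v * vt.pages_per_side
    page_x = min(int(pu), vt.pages_per_side - 1)
    page_y = min(int(pv), vt.pages_per_side - 1)
    # First read: the page table.
    phys_x, phys_y = vt.page_table[(page_x, page_y)]
    # The returned address feeds the second, dependent read of the physical texture.
    return ((phys_x + (pu - page_x)) / phys_pages_per_side,
            (phys_y + (pv - page_y)) / phys_pages_per_side)


# Two virtual textures with separate page tables sharing one 4x4-page physical texture.
wall = VirtualTexture(2, {(0, 0): (3, 1), (1, 0): (0, 0), (0, 1): (2, 2), (1, 1): (1, 3)})
floor = VirtualTexture(1, {(0, 0): (2, 0)})
wall_uv = virtual_to_physical(wall, 1.25, 0.25, 4)    # wraps to (0.25, 0.25)
floor_uv = virtual_to_physical(floor, 0.5, 0.5, 4)
```

Nothing in the lookup refers to geometry, which is the generality being described: any mesh with ordinary UVs can sample any of the virtual textures.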

That said, if you can do it all in the vertex shader and it works for you, that's certainly cool. I think it's less expensive, but less general.
bengarney



Joined: 25 Feb 2008
Posts: 7

Posted: Mon Feb 25, 2008 9:42 pm

Is the cost for a dependent texture read only a halving in fill rate? Or is it worse? I don't have performance numbers for that, and it'd be helpful in determining the relative wins of these approaches. To be clear, I think that having a straight-up virtualized texture is a huge win - one reason I bring up this other approach is to see whether it's worth pursuing, or whether it's a small enough win that it's not worth the trouble.

Also, I see in some other threads there's been discussion of doing analytic estimation of required texture data. What are your thoughts on that? The readback is elegant but costly (much cheaper on consoles).
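
For readers wondering what an analytic estimate from a bounding sphere might look like, here is one plausible sketch. It is NOT TomF's math, which this thread doesn't give, and every name in it is hypothetical. The assumptions: the chunk's texture spans roughly the sphere's diameter in world space, we want about one texel per screen pixel at the point of the sphere nearest the camera, and tile sizes are powers of two between 4 and 2048 (the range mentioned earlier in the thread). Surface orientation and UV stretching are ignored, so this tends to overestimate.

```python
import math


def required_tile_size(center_dist, radius, fov_y_radians, screen_height_px,
                       min_size=4, max_size=2048):
    """Estimate a power-of-two tile size for a chunk from its bounding sphere.

    Hypothetical heuristic: one texel per pixel at the nearest point of the sphere.
    """
    nearest = center_dist - radius
    if nearest <= 0.0:
        # Camera is inside the sphere: ask for the largest tile.
        return max_size
    # Screen pixels covered by one world unit at distance `nearest`.
    pixels_per_unit = screen_height_px / (2.0 * nearest * math.tan(fov_y_radians / 2.0))
    # The chunk is assumed to span about one sphere diameter.
    texels = min(2.0 * radius * pixels_per_unit, max_size)
    # Round up to the next power of two, starting from the smallest tile.
    size = min_size
    while size < texels:
        size *= 2
    return size


# A chunk of radius 1 seen from 11 units away, 90 degree vertical FOV, 1024 px tall screen.
size = required_tile_size(11.0, 1.0, math.pi / 2.0, 1024)
```

Since this runs on the CPU from camera and chunk data alone, it needs no readback, at the price of being only as good as the bounding-sphere approximation.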
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Mon Feb 25, 2008 11:12 pm

Is it even a halving, if you're already doing multiple reads (diffuse, gloss, normal map)? They all become dependent on the page table computation, but I don't know how it all stacks up. With the GPUs able to hide latency well using lots of little threads, I'm not sure the dependency hurts that much.
TomF



Joined: 18 Feb 2007
Posts: 107
Location: Seattle

Posted: Tue Feb 26, 2008 7:42 am

bengarney wrote:
Is the cost for a dependent texture read only a halving in fill rate? Or is it worse? I don't have performance numbers for that, and it'd be helpful in determining the relative wins of these approaches.

It's kinda orthogonal. It's yet another bottleneck in the GPU pipeline with a moderately fixed resource size - the number of dependent phases vs the number of hardware registers. If you blow that out, you start to suck, if you don't blow it out, you don't notice.

WARNING! Utterly fictitious numbers ahead. I just pulled these out of my arse to illustrate the process. They're not meant to be representative of any real hardware (and certainly not meant to represent any not-yet-real hardware - it does things a slightly different way).

The actual tradeoff is probably best summarised like so. Imagine the hardware has a fixed-size register set that is pretty large - let's say it's 256 registers. Now think about it looking at a shader and seeing how many registers it needs for that shader - let's say it needs 8 registers. So it knows it can run 256/8 = 32 copies of the shader at once. Each texture lookup needs a certain number of clocks to hide the latency of memory access and the filtering and so on - let's say it's 1024 clocks. So if the shader is more than 1024/32 = 32 clocks long, you can successfully hide the latency by executing instructions from the other shaders. If it's less than 32 clocks, then you can't, and you stall waiting for the texture accesses. Actually, the number is 1024/(32-1) = 33 clocks, because you can only run *other* shaders, not the one that did the request. And actually real hardware tends to group shaders together, e.g. it runs 4 at once, so you can't run any of those shaders while waiting for the texture fetch to come back, so it's actually 1024/(32-4) = 37 clocks.

Simultaneous texture reads don't affect this much - the latency is basically the same if you make 1 request or 4. But dependent reads can't be done at the same time. So if you have a shader with a read then another dependent read, then the shader has to find the time to absorb two reads sequentially, i.e. 2048 clocks. So now the rest of the shader has to be 2048/(32-4) = 73 clocks long to avoid being limited by texture reads.

So the obvious thing to say is - long complex shaders absorb dependent reads better right? Well yes, except that they also require more registers. So a simple shader with a single texture read might only require 4 registers, while a complex one with three dependent texture reads might require 12. So you can only run a third as many shader instances. Crunching the maths:

Simple shader: 256/4=64 instances. One texture read, so (1024*1)/(64-4) = 17 clocks is the breakeven point.

Complex shader: 256/12=21 instances. Three dependent reads, so (1024*3)/(21-4) = 181 clocks is the breakeven point.

So instinctively it looks like we just tripled the complexity of the shader, but actually we moved the breakeven point up by 10x. Obviously that is all a huge approximation, with tapeworm numbers, but you can see that dependent reads can be pretty costly unless you have some pretty big complex shaders.
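
The arithmetic in this post can be wrapped in a small function to play with. This is only a sketch of the toy model described above, using the same made-up numbers (256 registers, 1024 clocks per fetch, groups of 4). The function and parameter names are hypothetical and it does not describe any real GPU.

```python
def breakeven_clocks(regs_per_shader, sequential_reads,
                     total_regs=256, fetch_latency=1024, group_size=4):
    """Shader length (in clocks) needed to hide texture latency in the toy model.

    sequential_reads counts fetches that must complete one after another:
    1 for any number of independent reads, N for a chain of N dependent reads.
    """
    # How many shader instances fit in the register file at once.
    instances = total_regs // regs_per_shader
    # Instances in the same group as the waiting one cannot run, so only
    # the remaining instances can cover the wait.
    others = instances - group_size
    return round(fetch_latency * sequential_reads / others)


print(breakeven_clocks(8, 1))    # 37: the 8-register example, one read
print(breakeven_clocks(8, 2))    # 73: same shader, read then dependent read
print(breakeven_clocks(4, 1))    # 17: simple shader
print(breakeven_clocks(12, 3))   # 181: complex shader, three dependent reads
```

The last two lines are the comparison made above: three times the registers and three times the sequential reads push the breakeven point from 17 to 181 clocks, because the penalty is paid twice, once in longer total latency and once in fewer instances available to hide it.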
sean



Joined: 01 Feb 2005
Posts: 1392
Location: Kirkland WA

Posted: Tue Feb 26, 2008 7:51 am

Jesus, those hardware designers totally fucked the compiler writers.

I mean, now they have to optimize to minimize shader length AND minimize register usage, but of course the two affect each other and it's a mess to make the trade-off right.
bengarney



Joined: 25 Feb 2008
Posts: 7

Posted: Tue Feb 26, 2008 9:46 am

I did a quick test with the nVidia 8600 in my MacBook Pro and got:

# dependent samples (always at least one sample) - fps
0 - 27
1 - 25
2 - 21
3 - 17

Even having 4 non-dependent samples gets the same FPS as just 1 sample does.

My methodology is crap, of course, but it appears that you start losing performance when you go over 1 dependent read.
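
FPS is a nonlinear scale, so it can help to convert the table to frame times before comparing rows. The snippet below does only that arithmetic on the four measurements above. The function names are hypothetical, and the numbers describe the whole frame of this one test, not the isolated cost of a texture fetch.

```python
# Measurements from the table above: number of dependent samples -> fps.
fps_by_dependent_samples = {0: 27, 1: 25, 2: 21, 3: 17}


def frame_times_ms(fps_table):
    """Convert frames per second to milliseconds per frame."""
    return {n: 1000.0 / fps for n, fps in fps_table.items()}


def added_cost_ms(fps_table):
    """Extra frame time of each row compared with the row before it."""
    times = frame_times_ms(fps_table)
    keys = sorted(times)
    return {k: times[k] - times[prev] for prev, k in zip(keys, keys[1:])}


for n, extra in added_cost_ms(fps_by_dependent_samples).items():
    print(f"dependent sample #{n}: +{extra:.1f} ms per frame")
```

In frame-time terms the first dependent sample adds roughly 3 ms, the second roughly 7.6 ms and the third roughly 11.2 ms, so each further dependent sample costs more than the one before it in this test.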

I wonder how hard it would be to throw together a little benchmarking framework that could test stuff like this and give us real answers to questions like "what is the cost trade-off for dependent reads vs. normal reads"... It would also help avoid methodology problems. :)
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Tue Feb 26, 2008 3:28 pm

How much does your GPU bench?

http://graphics.stanford.edu/projects/gpubench/

A start. Open source, and some interesting results posted.

Edit: OK, the results don't post dependent fetch benchmarks, but the benchmark itself is capable of doing it.

http://graphics.stanford.edu/projects/gpubench/test_fetchcosts.html
Won



Joined: 21 Sep 2005
Posts: 506
Location: New York

Posted: Tue Feb 26, 2008 3:50 pm

sean wrote:
Jesus, those hardware designers totally fucked the compiler writers.


It used to be that another thing that used your register space was the interpolated attributes. Dunno if this is still true.

But I don't think provisioning registers is a particularly easy problem to solve in hardware, and I think they actually did a good (or at least reasonable) job with that. As far as I know, register usage has always been an issue in programming fragment shaders (the real reason why half-precision was faster on the original GeForce FX), and there were tools that told you how many registers you were using so that you could restructure your code to deal with it.

Anyway, it is just another point on the "store v. recompute" tradeoff that compilers typically have to make anyway.
All times are GMT.