sean
Joined: 01 Feb 2005 Posts: 1392 Location: Kirkland WA
|
| |
Posted: Thu Aug 07, 2008 1:49 pm Post subject: |
 |
|
| TomF wrote: | | do something smarter to make those pages exist, i.e. commit physical memory to them and fill them with interesting data yourself. |
That's fine for procedural texturing but doesn't work for disk streaming, obviously, where you're just-too-late instead of just-in-time.
But ok, cool, 64-bit. Although this is still misleading:
| Quote: | | You shouldn't have to do any of that. |
Obviously if you can do this, you avoid having to put down borders on your tiles, and you avoid having to make a first pass to determine what you need to do. However, that then prevents you from smartly load-balancing what you need to do, so I'm not sure it's really going to turn out to be a good idea in the long run, as opposed to e.g. just having a one-frame lag in determining what you need.
| Quote: | | As everyone knows, you should not be using floating-point numbers for world positions! <insert rant here> |
Floating point world representations are the worst possible choice, except all the others.
| Code: | Pentium 4 (CPUID 0F2n)
IMUL: 14 / 3 [x86 integer mul, 32x32->64 ]
FMUL: 7 / 2 [x87 floating point mul, full 80-bit precision I guess]
MULSS: 6 / 2 [single-precision mul in SSE]
MULPS: 6 / 2 [4 parallel single-precision muls in SSE]
MULSD: 6 / 2 [double-precision mul in SSE2]
MULPD: 6 / 2 [2 parallel double-muls in SSE2]
PMULLW: 8 / 1 [16-bit int muls, 4 parallel in MMX]
PMULLW: 8 / 2 [16-bit int muls, 8 parallel in SSE2]
PMULUDQ: 8 / 1 [MMX, 32-bit * 32-bit -> 64 bit ]
PMULUDQ: 8 / 2 [SSE2, two of the above] |
|
|
|
 |
TomF
Joined: 18 Feb 2007 Posts: 107 Location: Seattle
|
| |
Posted: Thu Aug 07, 2008 5:13 pm Post subject: |
 |
|
You don't have to make those pages exist immediately. You can just resample with a mipmap bias/clamp, use that data this frame, then when the data comes from host/disk/whatever, fill in the page then. We expect just-too-late to be the default mode of operation. You can still load-balance just fine - only fill in X number of pages per frame.
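As a rough illustration of the "only fill in X pages per frame" idea, here is a minimal sketch. The `PageStreamer` class, its method names, and the tuple page ids are all hypothetical, and marking a page resident stands in for the real disk/host read.

```python
from collections import deque


class PageStreamer:
    """Toy just-too-late streamer: faults queue up, a fixed budget is filled per frame."""

    def __init__(self, pages_per_frame):
        self.pages_per_frame = pages_per_frame  # the "X" budget (assumed tunable)
        self.pending = deque()                  # faulting pages, oldest first
        self.queued = set()                     # avoids queueing the same page twice
        self.resident = set()                   # pages that have been filled in

    def request(self, page):
        """Record a fault; the caller keeps rendering from a coarser mip this frame."""
        if page not in self.resident and page not in self.queued:
            self.pending.append(page)
            self.queued.add(page)

    def end_of_frame(self):
        """Fill at most pages_per_frame pages and return the ones filled."""
        filled = []
        while self.pending and len(filled) < self.pages_per_frame:
            page = self.pending.popleft()
            self.queued.discard(page)
            self.resident.add(page)  # stand-in for the actual load
            filled.append(page)
        return filled


streamer = PageStreamer(pages_per_frame=2)
for page in [(0, 0, 0), (0, 1, 0), (0, 0, 0), (1, 0, 0)]:
    streamer.request(page)
frame1 = streamer.end_of_frame()
frame2 = streamer.end_of_frame()
```

Anything over budget simply waits for a later frame, which is what keeps the per-frame cost bounded.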
All those instruction timings are fascinating, but why do you care about multiplying absolute coordinates together? BSPing the whole world is a bit 1990s. All you normally do with positions is add and subtract them, and "add rax, rbx" is something like quad-issue on modern cores. |
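A small sketch of the add/subtract point, under stated assumptions: the 1/65536 m fixed-point scale and the 1000 km distance are picked purely for illustration, and `to_f32` emulates single precision by round-tripping through `struct`. Python integers are arbitrary precision, so this only mimics what a 64-bit integer would do.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


UNITS_PER_METRE = 1 << 16  # assumed fixed-point scale: 1/65536 of a metre


def to_fixed(metres):
    """Convert metres to integer fixed-point units (fits easily in 64 bits here)."""
    return round(metres * UNITS_PER_METRE)


far = 1_000_000.0  # a position 1000 km from the origin (illustrative)
step = 0.01        # try to move by one centimetre

# Single-precision float: the step is smaller than half the spacing
# between adjacent float32 values at this magnitude, so it is lost.
float_moved = to_f32(to_f32(far) + to_f32(step)) - to_f32(far)

# Fixed point: integer add and subtract are exact at any distance.
fixed_moved = (to_fixed(far) + to_fixed(step) - to_fixed(far)) / UNITS_PER_METRE
```

This only covers addition and subtraction of positions; it says nothing about the multiply-heavy cases (rotations and so on) raised elsewhere in the thread.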
|
|
 |
sean
Joined: 01 Feb 2005 Posts: 1392 Location: Kirkland WA
|
| |
Posted: Thu Aug 07, 2008 10:24 pm Post subject: |
 |
|
Lord knows we never rotate any vectors!
| Quote: | | You don't have to make those pages exist immediately. You can just resample with a mipmap bias/clamp, use that data this frame, then when the data comes from host/disk/whatever, fill in the page then. |
This is not substantially different from non-Larrabee, though! |
|
|
 |
casey Site Admin
Joined: 18 Dec 2004 Posts: 1768 Location: Seattle
|
| |
Posted: Fri Aug 08, 2008 12:02 am Post subject: |
 |
|
| sean wrote: | | Lord knows we never rotate any vectors! |
Of course not - that would break the pre-computed lighting, which is the only reason we have all these texels in the first place!
- Casey |
|
|
 |
Won
Joined: 21 Sep 2005 Posts: 506 Location: New York
|
| |
Posted: Fri Aug 08, 2008 5:00 am Post subject: |
 |
|
| sean wrote: | | This is not substantially different from non-Larrabee, though! |
Well, without padding, you can more easily reuse the stream of faults for multiple layers of textures with different resolutions. That being said, I'm guessing you might have to be kind of careful in what you do when the texture fetch causes a page fault. It is risky to issue another texture fetch, since that can cause a double fault, which I assume is still a very bad thing. |
|
|
 |
sean
Joined: 01 Feb 2005 Posts: 1392 Location: Kirkland WA
|
| |
Posted: Fri Aug 08, 2008 6:56 am Post subject: |
 |
|
I agree it should let you get away without needing tile padding.
That's nice, but I'm not sure it's actually that big a deal.
Edit: I guess I come off as too negative-nancy here, but basically, since SVT basically just works if you do pad and do have gradients, and there are some grey issues with Larrabee's advantages, it doesn't seem totally huge. Clearly a bunch of complexity is thrown out if you can use native pages and not have to fake the pages yourself, etc. So e.g. 16-tap anisotropy without needing tiles padded by 8 samples all around, that's probably great, and you don't need two copies of the mipmaps so maybe you save another 25% there.
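For a rough sense of scale on the padding cost, here is a back-of-envelope sketch. It assumes a square tile with some payload size per side plus a border on every side. The 128- and 256-texel payload sizes are illustrative assumptions; only the 8-sample border comes from the paragraph above.

```python
def padding_overhead(payload, border):
    """Extra storage, as a fraction of the payload, for a square tile
    with `border` texels of padding on every side."""
    padded = payload + 2 * border
    return (padded * padded) / (payload * payload) - 1.0


# Illustrative payload sizes (assumed), each with an 8-texel border.
for payload in (128, 256):
    print(payload, f"{padding_overhead(payload, 8):.1%}")
```

Under these assumptions the border costs about 26.6% extra storage for a 128-texel tile and about 12.9% for a 256-texel tile, so the saving from dropping the padding depends heavily on the tile size.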
Except nobody's actually sat down and implemented it, so does it really work? What do you do on a page miss caused by significant anisotropy? You still have to drop the mip level of the whole sample, right? Is the cache writing viable? What's the Larrabee page size, and is it as effective a size as you can do by hand? Etc.
So, from a practical standpoint, it seems like all the "workarounds" for the classic GPU just work, and maybe Larrabee is simpler on some fronts but more complex on other fronts. It will almost certainly save memory usage and instruction count, but not necessarily a meaningful amount in practice in a complex shader. |
|
|
 |
jeffatrad
Joined: 24 Feb 2008 Posts: 126
|
| |
Posted: Fri Aug 08, 2008 12:10 pm Post subject: |
 |
|
I think the problem is that it's difficult to talk about the specifics of why LRB will be good for this, when we can only speak in generalities. This is lame (especially in this case), when you know your technique very well and just want to know the exact ways it will be better instead of platitudes.
The general reason why I think this will be a good LRB situation is that I feel like there are lots of corner cases that you need to handle once you roll out SVT in the large. This is awkward with the standard GPU model since you have to figure out ways to make the fixes for the corner cases affect just those areas. LRB allows you to handle those types of issues better, imo, since hey, it's just normal C code.
It's kind of like the pattern matching model in C++ templates. Yes, you can make cool things like Boost once you handle all the weirdnesses with counter rules and specialization tricks, but it's never not awkward. And you chafe in all the places where you just want to write what you want directly instead of trying to figure out how to phrase it in terms of how the compiler is going to match against your template definitions.
So, in your and JohnC's case, you have done all the heavy lifting to handle the weirdnesses that you have encountered so far, but when I hear about the crazy lengths you have to go to, my intuition just says that a more programmable and customizable system is going to make this massively more robust, faster and simpler.
But that's just intuition (about something I can't talk about in specifics in relation to an area that I haven't written any code for), while you have real world experience, so I dunno. <grin> In a few weeks, you'll be able to make the call yourself, I suppose...
->Jeff |
|
|
 |
TomF
Joined: 18 Feb 2007 Posts: 107 Location: Seattle
|
| |
Posted: Sat Aug 09, 2008 6:05 am Post subject: |
 |
|
I think from the replies here that we're talking at cross-purposes some of the time - probably because of a misunderstanding about what the hardware does and doesn't do. I can't actually give you the hardware spec - I like my balls where they are - so you might have to read between the lines a bit here.
But software I can totally talk about. As of last Monday the gloves are off, and I'm doing a talk at Siggraph on Thursday that mentions all this stuff (though it's only 20 minutes, so it doesn't go into this sort of detail).
| Quote: | | So, from a practical standpoint, it seems like all the "workarounds" for the classic GPU just work |
If you need cross-platform first and foremost, this hardware support gives you nothing but extra speed. It's quite a large amount of extra speed, but if you're already running fast enough, who cares, right? So let's leave that aside and assume we're in a world where you do care. Which is the assumption every discussion kinda has to start with, otherwise what's the point, right?
So if you just want to switch from a "software" SVT system such as Sean's or John's, to a "hardware" system (they both use a bit of both of course, I'm just using these as typing shorthand), but you don't want to change the fundamental game engine or artwork, here's where the speed increase comes from:
-No need for the first pass.
-No feedback of that data to the host to manage the page tables.
-In the second pass (which is now the only pass), you don't need 14(?) shader instructions and 2 texture reads, you just read the texture with a perfectly normal texture instruction. If it works, it gives the shader back filtered RGBA data just like a non-SVT system. If it doesn't (which we hope is rare), it gives the shader back the list of faulting addresses, and the shader puts them in a queue for later and does whatever backup plan it wants (usually resampling at a lower mipmap level).
-No memory padding needed.
-No speed penalty when you hit pages that are present.
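The single-pass flow described in the list above could be sketched roughly like this. This is a toy model, not the actual hardware or any real API: `SparseTexture`, `shade`, the dict-per-mip layout, and the assumption that the coarsest level is always resident are all made up for illustration.

```python
class SparseTexture:
    """Toy mip chain: mips[level] maps a page id (x, y) to data for resident pages only."""

    def __init__(self, num_levels):
        self.mips = [dict() for _ in range(num_levels)]
        self.fault_queue = []  # (level, page) faults to service later


def shade(tex, level, page):
    """Sample at `level`; on a miss, queue the fault and retry one mip coarser.

    Returns (value, level_actually_used).
    """
    x, y = page
    for lvl in range(level, len(tex.mips)):
        value = tex.mips[lvl].get((x, y))
        if value is not None:
            return value, lvl                  # hit: same cost as a non-SVT read
        tex.fault_queue.append((lvl, (x, y)))  # miss: remember it for later
        x, y = x // 2, y // 2                  # page coordinates halve per level
    raise LookupError("no resident page at any level")


tex = SparseTexture(num_levels=3)
tex.mips[2][(0, 0)] = "coarse"            # assume the coarsest level is resident
first = shade(tex, 0, (3, 2))             # falls back twice, queues two faults
tex.mips[0][(3, 2)] = "fine"              # the page arrives later
second = shade(tex, 0, (3, 2))            # now hits immediately
```

The fallback here walks down one mip at a time; a real shader might jump straight to a known-resident level instead.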
All that is going to add up to some pretty serious speed improvements. My understanding is that Rage's lighting model is severely constrained because every texture sample costs them so much performance that they can only read a diffuse and a normal map - they just can't afford anything fancier. But if texture samples are now the same cost with SVT as without, that's got to be worth a fair bit.
Note that the page-fault handler system is basically the same in both cases - you get a fault, and you can decide to either service it then, or service it later (end of frame, next frame, whatever). It's a hugely important and gnarly part of the system - lots of prefetch heuristics and knowing what to evict and so on, but it's basically the same algorithm whichever system you use.
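A minimal sketch of the "what to evict" half of such a handler, using plain least-recently-used eviction as a stand-in for the real heuristics (which this thread does not spell out). `PagePool` and the `load` callback are hypothetical names.

```python
from collections import OrderedDict


class PagePool:
    """Fixed-size pool of physical pages with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # page -> data, least recently used first

    def touch(self, page):
        """Mark a page as used this frame; returns False if it is not resident."""
        if page in self.resident:
            self.resident.move_to_end(page)
            return True
        return False

    def service_fault(self, page, load):
        """Make `page` resident via load(page); returns the evicted page, if any."""
        if self.touch(page):
            return None
        evicted = None
        if len(self.resident) >= self.capacity:
            evicted, _ = self.resident.popitem(last=False)  # drop the LRU page
        self.resident[page] = load(page)
        return evicted


pool = PagePool(capacity=2)
load = lambda page: f"data for {page}"   # stand-in for the disk/host read
pool.service_fault("a", load)
pool.service_fault("b", load)
pool.touch("a")                          # "a" was sampled again, "b" was not
evicted = pool.service_fault("c", load)  # pool is full, so "b" goes
```

The same structure works whether the fault is serviced immediately or deferred to the end of the frame; only the call site of `service_fault` changes.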
So that's "why is my game faster?" But there's a bunch of extra benefits that you just can't get with the software SVT in a practical manner. Again, this is only relevant if you don't mind that some hardware can't do this yet, but let's just pretend for a second:
-No significant changes to any shader or rendering code (compared to a non-SVT engine).
-Works with all filtering/wrap/clamp modes.
-Works with all texture types (cube maps, volumes, etc.).
-Works with translucency and multiple layers.
Those are pretty useful things, whether you already have an SVT system up and running or not.
| Quote: | | Except nobody's actually sat down and implemented it, so does it really work? |
Yes, it works. I was at 3Dlabs writing drivers when they implemented it and shipped hardware and wrote the drivers and so on, so yes it works in practice in real machines with real apps. And that was a far more primitive system than the Larrabee implementation - the host had to do an absurd amount of work on page faults. But it still did work. And of course Intel has people writing and testing SVT systems right now - it seems to work just fine. |
|