Monday, November 30, 2009

Languages and vGPU

Now that, despite all the protein supplements, I've lost weight again and got myself a beard (grew, not bought), I feel like I really nailed the homeless pirate look.

In the spirit of camaraderie I spent a few minutes yesterday fixing one of the longest-standing bugs we had in KDE: the inability to spell check Indic, Middle Eastern, Asian and a few other languages. The issue was that the text segmentation algorithm, due to my uncanny optimism, was defining words as "things that are composed of letters". That isn't even true for English (e.g. "doesn't" is a word composed of more than just letters, and breaking it at the non-letter produces two non-words) and it was utterly and completely wrong for pretty much every language that doesn't use a Latin-like alphabet. I switched the entire codebase to the Unicode text boundary specification, tr29-11, as implemented in QTextBoundaryFinder, and it's fixed now.
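For the curious, here's a minimal sketch of the idea; the wordsIn helper is made up for illustration, the actual fix lives in the KDE spell-checking segmentation code:

#include <QString>
#include <QStringList>
#include <QTextBoundaryFinder>

// Split text into words using the Unicode (tr29) boundary rules instead
// of naively treating every non-letter character as a word separator.
QStringList wordsIn(const QString &text)
{
    QStringList words;
    QTextBoundaryFinder finder(QTextBoundaryFinder::Word, text);
    int start = 0;
    while (finder.toNextBoundary() != -1) {
        const int end = finder.position();
        const QString chunk = text.mid(start, end - start).trimmed();
        // Skip chunks that are pure whitespace/punctuation.
        if (!chunk.isEmpty() && chunk.at(0).isLetterOrNumber())
            words << chunk;   // "doesn't" comes out as one word here
        start = end;
    }
    return words;
}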

In other news we've released the Gallium3D driver for VMware's vGPU (right now called svga, but we'll rename it as soon as we have some time). It means that we can accelerate all the APIs supported by Gallium inside the virtual machine, which is fairly impressive.

Tuesday, September 01, 2009

Accelerating X with Gallium

Spring is coming. I'm certain of it because I know for a fact that it's not Spring right now and seasons have a tendency to wrap around. Obviously that means I need to work a bit on the acceleration code for X11. It's just like going to a doctor, sure, it sucks, but at the back of your mind you know you need to do it.

So what's different about this code? Well, it's not the acceleration architecture; we're reusing Exa. It's the implementation.

Currently if you want to accelerate X11, or more precisely Xrender, you implement some acceleration architecture in your X11 driver. It may be Exa, Uxa, Xaa or something else (surely also with an X and an A in the name, because that seems to be a requirement here). Xrender is a bit too complicated for any 2D engine out there, which means that any code which accelerates Xrender will use the same paths OpenGL acceleration does. In turn this means that if we had an interface which exposed how the hardware works, we could put OpenGL and Xrender acceleration on top of the same driver. One driver, accelerating multiple APIs.

That's exactly what Gallium is all about, so it shouldn't be a big surprise that this is what we're doing. We're simply writing an Xorg state tracker for Gallium in which Exa is implemented through the Gallium interface.
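To make that concrete, here's a purely conceptual sketch (GalliumContext, drawQuad and ExaSolidOp are hypothetical stand-ins, not the actual xorg state tracker code): the Exa solid-fill hook doesn't poke the hardware directly, it just issues work through the same context the OpenGL driver uses.

#include <cstdio>

struct GalliumContext { /* the driver's pipe context would live here */ };

// Stand-in for "submit a constant-color quad through the 3D pipeline".
static void drawQuad(GalliumContext *, int x1, int y1, int x2, int y2,
                     unsigned color)
{
    std::printf("quad (%d,%d)-(%d,%d) color 0x%08x\n", x1, y1, x2, y2, color);
}

struct ExaSolidOp {
    GalliumContext *ctx;
    unsigned fgColor;          // would be set up in the PrepareSolid hook

    // Called for every rectangle Exa wants filled.
    void solid(int x1, int y1, int x2, int y2)
    {
        drawQuad(ctx, x1, y1, x2, y2, fgColor);
    }
};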

So with a Gallium driver, besides OpenGL, OpenGL ES and OpenVG, you'll also get X11 acceleration. I'm specifically saying X11 here because after implementing Exa we'll move on to accelerating Xv, which will give us a nice set of accelerated operations for a typical X application.

Friday, August 14, 2009

More 2D in KDE

An interesting question is: if the raster engine is faster for the gross majority of graphics, and the Qt X11 engine falls back on quite a few features anyway, why shouldn't we make raster the default until the OpenGL implementations/engine are stable enough?

There are two technical reasons for that. Actually that's not quite true, there are more, but these two are the ones that will bother a lot of people.

The first is that we'd kill KDE performance over the network. Everyone who uses X/KDE over the network would suddenly realize that their setup became unusable, because their sessions would suddenly send images for absolutely everything, all the time. As you can imagine, institutions/schools/companies who use KDE exactly in this way wouldn't be particularly impressed if updating their installations suddenly rendered them unusable.

The second reason is that I kinda like being able to read and sometimes even write text. Text tends to be pretty helpful during the process of reading. Especially things like emails and web browsing get a lot easier with text. I think a lot of people share that opinion with me. To render text we need fonts, which in turn are composed of glyphs. To make text rendering efficient, we need to cache the glyphs. When running with the X11 engine we render the text using Xrender, which means that there's a central process that can manage all the glyphs used by applications running on a desktop. That process is the X server. With the raster engine we take the X server out of the equation, and suddenly every single application on the desktop needs to cache the glyphs for all the fonts it's using. This implies that every application suddenly uses megs and megs of extra memory: they all need to individually cache the same glyphs, even if they all use the same font. It tends to work OK for languages with a few glyphs, e.g. 40+ for English (26 letters + 10 digits + a few punctuation marks). It doesn't work at all for languages with more. So unless it's decided that KDE can only be used by people whose alphabets contain about 30 letters or less, I'd hold off on making the raster engine the default.
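To put some rough numbers on it, here's a back-of-envelope sketch; every value in it is an assumption for illustration, not a measurement:

#include <cstdio>

int main()
{
    const int glyphCount         = 20000;  // order of magnitude for a CJK font
    const int glyphWidth         = 24;     // pixels, a single size and face
    const int glyphHeight        = 24;
    const int bytesPerGlyphPixel = 1;      // 8-bit alpha coverage mask

    const long long perApp =
        1LL * glyphCount * glyphWidth * glyphHeight * bytesPerGlyphPixel;
    const int apps = 20;                   // a modest desktop session

    std::printf("per application: ~%lld KB\n", perApp / 1024);
    std::printf("whole session  : ~%lld MB\n", perApp * apps / (1024 * 1024));
    return 0;
}

With those made-up but plausible numbers a single font at a single size costs over 10MB per process, and a twenty-application session wastes a couple of hundred megabytes on duplicated glyphs that a central cache in the X server simply wouldn't.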

While the latter problem could be solved with some clever shared memory usage or by forcing Xrender on top of the raster engine (actually I shouldn't state that as a fact, I haven't looked at font rendering in the raster engine in a while and maybe that was implemented lately), it's worth noting that the X11 engine is simply fast enough that a few frames one way or the other don't matter. Those few frames you'd gain would mean an unusable KDE for others.

And if you think that neither of the above two points bothers you and you'd still want to use the raster engine by default, you'll have to understand that I just won't post instructions on how to do that here. If you're a developer, you already know how to do it, and if not, there are trivial ways of finding it out from the Qt sources. If you're not a developer then you really should stick to the global defaults and can simply test applications with the -graphicssystem switches.

2D in KDE

So it seems a lot of people are wondering about this. By this I mean why dwarfs always have beards. Underground, big ears would probably be a better evolutionary trait, but elves got dibs on those.

Qt, and therefore KDE, deals with three predominant ways of rendering graphics. I don't feel like bothering with transitions today, so find your own way from beards and dwarfs to Qt/KDE graphics. Those three ways are:
  • On the CPU with no help from the GPU using the raster engine
  • Using X11/Xrender with the X11 engine
  • Using OpenGL with the OpenGL engine
There are a couple of ways in which the decision about which of those engines is used gets made.

First there's the default global engine. This is what you get when you open a QPainter on a QWidget and its derivatives. So whenever you have code like

void MyWidget::paintEvent(QPaintEvent *)
{
    QPainter p(this);
    ...
}

you know the default engine is being used. The rules for that are as follows:
  • GNU/Linux : X11 engine is being used
  • Windows : Raster engine is being used
  • Application has been started with the -graphicssystem option:
    • -graphicssystem=native the rules above apply
    • -graphicssystem=raster the raster engine is being used by default
    • -graphicssystem=opengl the OpenGL engine is being used by default
Furthermore, depending on which QPaintDevice is being used, different engines will be selected. The rules for that are as follows:
  • QWidget the default engine is being used (picked as described above)
  • QPixmap the default engine is being used (picked as described above)
  • QImage the raster engine is being used (always, no matter which engine has been selected as the default; see the sketch right after this list)
  • QGLWidget, QGLFramebufferObject, QGLPixelBuffer the OpenGL engine is being used (always, it doesn't matter what engine has been selected as the default)
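As a small illustration of the QImage rule, the following sketch paints through the raster engine regardless of the globally selected graphics system and then blits the result with the default engine. MyWidget is the same hypothetical widget as in the earlier snippet:

#include <QImage>
#include <QPaintEvent>
#include <QPainter>
#include <QWidget>

void MyWidget::paintEvent(QPaintEvent *)
{
    QImage buffer(size(), QImage::Format_ARGB32_Premultiplied);
    buffer.fill(0);

    QPainter ip(&buffer);              // raster engine, always
    ip.setRenderHint(QPainter::Antialiasing);
    ip.drawEllipse(rect().adjusted(10, 10, -10, -10));
    ip.end();

    QPainter wp(this);                 // default engine (X11 on GNU/Linux)
    wp.drawImage(0, 0, buffer);
}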
Now here's where things get tricky: if an engine doesn't support certain features it will have to fall back to the one engine that is sure to work on all platforms and has all the features required by the QPainter API, and that is the raster engine. This was done to ensure that all engines have the same feature set.

While the OpenGL engine should in general never fall back, that is not the case for X11, and there are fallbacks. One of the biggest immediate optimizations you can do to make your application run faster is to make sure you don't hit fallbacks. A good way to check for that is to export QT_PAINT_FALLBACK_OVERLAY and run your application against a debug build of Qt; this way the region which caused a fallback will be highlighted (the other method is to set a gdb breakpoint in QPainter::draw_helper). Unfortunately this will only detect fallbacks in Qt.
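If you just want a quick sanity check of which engine your painter ended up on (which engine, not whether individual calls fell back), something like this hypothetical helper does the trick:

#include <QPainter>
#include <QPaintEngine>
#include <QtDebug>

// Log the paint engine backing a given painter.
static void reportEngine(const QPainter &p)
{
    switch (p.paintEngine()->type()) {
    case QPaintEngine::X11:    qDebug() << "X11 engine";    break;
    case QPaintEngine::Raster: qDebug() << "raster engine"; break;
    case QPaintEngine::OpenGL: qDebug() << "OpenGL engine"; break;
    default:                   qDebug() << "engine type"
                                        << int(p.paintEngine()->type());
    }
}

For the actual fallbacks you still need the debug-build overlay or the draw_helper breakpoint described above.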

All of those engines also use drastically different methods of rendering primitives:
  • The raster engine rasterizes primitives directly.
  • The X11 engine tessellates primitives into trapezoids, that's because Xrender composites trapezoids.
  • The GL engine either uses the stencil method (described in this blog a long time ago) or shaders to decompose the primitives and the rest is handled by the normal GL rasterization rules.

Tessellation is a fairly complicated process (also described long ago in this blog). To handle degenerate cases, the first step of the algorithm is to find the intersections of the primitive; in the simplest form think about rendering a figure 8. There's no way of testing whether a given primitive is self-intersecting without actually running the algorithm.
To render with anti-aliasing on the X11 engine we have to tessellate, because Xrender requires trapezoids to render anti-aliased primitives. So if the X11 engine is being used and the rendering is anti-aliased, whether you're rendering a line, a heart or a moose, we have to tessellate.

Someone was worried that it's an O(n²) process, which is of course completely incorrect. We're not using a brute force algorithm here; the process is O(n log n). O(n log n) complexity on the CPU side is something that both the raster and X11 engines need to deal with. The question is what happens next, and what happens in the subsequent calls.

While the raster engine can deal with all of it while rasterizing, the X11 engine can't. It has to tessellate, send the results to the server and hope for the best. If the X11 driver doesn't implement composition of trapezoids (and realistically speaking most of them don't), this operation is done by Pixman. In the raster engine the sheer spatial locality almost forces better cache utilization than what could realistically be achieved by the "application tessellates -> server rasterizes" process that the X11 engine has to deal with. So without all-out acceleration the X11 engine can't compete with the raster engine in this case. Simplifying a lot, it's worth remembering that in terms of cycles a register access is most likely smaller than or equal to 1 cycle, access to the L1 data cache is about 3 cycles, L2 is about 14 cycles, while main memory is about 240 cycles. So for CPU-based graphics, efficient memory utilization is one of the most crucial undertakings.

With that in mind, this is also the reason why a heavily optimized, purely software-based OpenGL implementation would be a lot faster than the raster engine at 2D graphics. In terms of memory usage the OpenGL pipeline is simply a lot better at handling memory than the API QPainter provides.

So what you should take away from this is that in a perfect world the GL engine is so much better than absolutely anything else Qt/KDE have that it's not even funny, X11 follows it, and the raster engine trails far behind.

The reality you're dealing with is that when using the X11 engine, due to the fallbacks, you will also be using the raster engine (either on the application side with Qt's raster engine or on the server side with Pixman), and unfortunately in this case "the more the better" doesn't apply and you will suffer tremendously. Our X11 drivers don't accelerate chunks of Xrender, the applications don't have good means of testing what is accelerated, so what Qt does is simply not use many of Xrender's features. So even if the driver did accelerate, for example, gradient fills and source picture transformations, it wouldn't help you, because Qt simply doesn't use them and always falls back to the raster engine. It's a bit of a chicken-and-egg problem: Qt doesn't use it because it's slow, and it's slow because no one uses it.

The best solution to that conundrum is to try running your applications with -graphicssystem=opengl and report any problems you see to both the Qt software and the Mesa3D/DRI bugzillas, because the only way out is to make sure that both our OpenGL implementations and the OpenGL usage in the rendering code on the application side work efficiently and correctly. The quicker we get the rendering stack to work on top of OpenGL, the better off we'll be.

Friday, May 15, 2009

OpenGL ES

Two facts: one you didn't need to know, the other you should know. I'm the World Champion in being Vegetarian. Did you know that? Of course not, you probably didn't even know it's a competition. But it is. And I won it. My favorite vegetable? Nachos.
Technically they're a fourth or fifth generation vegetable. But vegetables are like software, you never want to work with them on the first iteration. You wouldn't want to eat corn as-is, but mash it, broil/fry it, and you've got something magical going on in the form of tortilla chips.

The other fact: the OpenGL ES state trackers for Gallium just went public. That includes both OpenGL ES 1.x and 2.x. Brian just pushed them into the opengl-es branch. Together with OpenVG they should be part of the Mesa3D 7.6 release.

At this point Mesa3D almost becomes the Khronos SDK and hopefully soon we'll add an OpenCL state tracker and start working on bumping the OpenGL code to version 3.1.

Sunday, May 10, 2009

OpenVG Release

I've been procrastinating lately. Where I define procrastination as: doing actual work instead of writing blog entries. "Can you do that? Redefine words because you feel like it?" Oh, sure you can. As long as you don't care about anyone understanding what you're saying, it's perfectly fine. In fact it's good for you: the fewer people understand you, the more likely it is that they'll classify you as a genius. Life is awesome like that. That's right, you come here for technical stuff and get blessed with some serious soul searching, enjoy.

We have released the OpenVG state tracker for Gallium.

It's a complete implementation of OpenVG 1.0. While my opinion on 2D graphics and the long-term usefulness of OpenVG has changed drastically during the last year, I'm very happy we were able to release the code. After the Mesa3D 7.5 release we're going to merge it into master, and 7.6 will be the first release that includes both OpenGL and OpenVG.


[You're exuberant about that]. I figured I'd start including stage directions in the blog so that you know how you feel about reading this.

I don't have much to write about the implementation itself. If you have a Gallium driver, then you're good to go and have accelerated 2D and 3D. It's great to see more state trackers being released for Gallium and this whole concept of "accelerating multiple APIs" on top of one driver architecture becoming a reality. [You love it and, with a smile, wave bye]

Monday, March 23, 2009

Intermediate representation

I've spent the last two weeks working from our office in London. The office is right behind Tate Modern with a great view of London which is really nice. While there I was mainly trying to get the compilation of shaders rock solid for one of our drivers.

In Gallium3D we're currently using the TGSI intermediate representation. It's a fairly neat representation, with instruction semantics matching those from the relevant GL extensions/specs. If you're familiar with D3D you know that there are subtle differences between seemingly similar instructions in the two APIs. Modern hardware tends to follow the D3D semantics more closely than the GL ones.

Some of the differences include:
  • Indirect addressing offsets: in GL offsets are between -64 and +63, while D3D wants them >= 0.
  • Output to temporaries: in D3D certain instructions can only output to temporary registers, while TGSI doesn't have that requirement.
  • Branching: D3D has both if_comp and setp instructions, which can perform essentially any of the comparisons, while we imitate them with an SLT (set on less than) or similar followed by an IF instruction with the semantics of the one in GL_NV_fragment_program2.
  • Looping.
None of this is particularly difficult, but we had to implement almost exactly the same code in another driver of ours, and that makes me a little uneasy. There are a number of ways to fix this. We could change the semantics of TGSI to match D3D, or we could implement transformation passes that operate on TGSI and do this transformation directly on the IR before it gets to the driver, or we could radically change the IR. Strangely enough it's solution #3, which while appearing the most bizarre, is the one I like the most. That's because it goes hand in hand with the usage of LLVM inside Gallium.

Especially because there are a lot of generic transformations that really should be done on the IR itself. A good example is code injection due to missing pieces of API state on a given GPU, e.g. shadow texturing: the compare mode, compare function and shadow ambient value, which, if the samplers on the given piece of hardware don't support them, need to be emulated by injecting code after each sampling operation that uses a sampler with that state set on it.
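Here's a purely illustrative sketch of the kind of lowering pass described above. The types and names (Opcode, Instruction, ShaderIR, lowerShadowSampling) are hypothetical, not the real TGSI/Gallium interfaces; the point is just the shape of the transformation:

#include <utility>
#include <vector>

enum class Opcode { TEX, MOV, SGE, SLT /* ... */ };

struct Instruction {
    Opcode op;
    int dst;              // destination register index
    int src0, src1;       // source register indices
    int sampler;          // only meaningful for TEX
};

struct ShaderIR {
    std::vector<Instruction> insts;
};

// Inject a depth-compare after every texture sample that uses a sampler
// with shadow-compare state set, for hardware that can't do it natively.
void lowerShadowSampling(ShaderIR &ir, const std::vector<bool> &isShadowSampler)
{
    std::vector<Instruction> out;
    for (const Instruction &inst : ir.insts) {
        out.push_back(inst);
        if (inst.op == Opcode::TEX && isShadowSampler[inst.sampler]) {
            // dst = (texel >= ref) ? 1.0 : 0.0, i.e. a LEQUAL-style compare
            // against the reference value carried in src0.
            out.push_back({Opcode::SGE, inst.dst, inst.dst, inst.src0, -1});
        }
    }
    ir.insts = std::move(out);
}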

Unfortunately right now the LLVM code in Gallium is far from finished. Due to looming deadlines on various projects I never got to spend the amount of time required to get it into shape.
I'll be able to piggyback some of the LLVM work on top of the OpenCL state tracker code, which is exciting.

Also, has anyone noticed that the last two blogs had no asides or any humor in them? I'm experimenting with a "just what I want to say and nothing more" style of writing, wondering whether it will indeed be easier to read.

Thursday, March 12, 2009

KDE graphics benchmarks

This is really a public service announcement. KDE folks, please stop writing "graphics benchmarks". It's especially pointless if you qualify your blog/article with "I'm not a graphics person but...".

What you're doing is:

timer start
issue some draw calls
timer stop

This is completely and utterly wrong.
I'll give you an analogy that should make it a lot easier to understand. Let's say you have a 1Mbit and a 100Mbit LAN and you want to write a benchmark to compare how fast you can download a 1GB file on each of them, so you do:

timer start
start download of a huge file
timer stop

Do you see the problem? Obviously the file hasn't been downloaded by the time you stopped the timer. What you effectively measured is the speed at which you can execute function calls.
And yes, while your original suspicion that the 100Mbit line is a lot faster is still likely true, your test in no way proves that. In fact it does nothing short of making some poor individuals very sad about the state of computer science. Also, "so what that the test is wrong, it still feels faster" is not a valid excuse, because the whole point is that the test is wrong.

To give your tests some substance, always make your applications run for at least a few seconds and report the frames per second.
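A minimal sketch of what that looks like with Qt (BenchWidget and the numbers in it are made up for illustration):

#include <QPaintEvent>
#include <QPainter>
#include <QTime>
#include <QWidget>
#include <QtDebug>

class BenchWidget : public QWidget
{
public:
    BenchWidget() : frames(0) { clock.start(); }

protected:
    void paintEvent(QPaintEvent *)
    {
        QPainter p(this);
        p.setRenderHint(QPainter::Antialiasing);
        for (int i = 0; i < 100; ++i)
            p.drawEllipse(10 + i % 50, 10 + i % 50, 200, 200);

        ++frames;
        if (clock.elapsed() > 5000) {        // run for at least 5 seconds
            qDebug("%.1f fps", frames * 1000.0 / clock.elapsed());
            frames = 0;
            clock.restart();
        }
        update();                            // schedule the next frame
    }

private:
    QTime clock;
    int frames;
};

It still only amortizes the asynchronous work instead of explicitly syncing with the server, but averaged over a few thousand frames that's a far more honest number than timing a handful of draw calls.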

Or even better, don't write them. Somewhere out there, some people who actually know what's going on have those tests written. And those people, who just happen to be "graphics people", have reasons for making certain things the default. So while you may think you've made an incredible discovery, you really haven't. There's a kde-graphics mailing list where you can pose graphics questions.
So, to the next person who wants to write a KDE graphics related blog/article: please, please go through the kde-graphics mailing list first.

Monday, February 09, 2009

Video and other APIs

I read a short article today about video acceleration in X. First of all I was already unhappy because I had to read. Personally I think all news sites should come in comic form. With as few captions as possible. Just like Perl, I'm a write-only entity. Then I read that Gallium isn't an option when it comes to video acceleration because it exposes a programmable pipeline, and I got dizzy.

So I got off the carousel and I thought to myself: "Well, that's wrong", and apparently no one else got that. Clearly the whole "we're connected" thing is a lie because it doesn't matter how vividly I think about stuff, others still don't see what's in my head (although to the person who sent me cheese when I was in Norway - that was extremely well done). So cumbersome as it might be I'm doing the whole "writing my thoughts down" thing. Look at me! I'm writing! No hands! sdsdfewwwfr vnjbnhm nhn. Hands back on.

I think the confusion stems from the fact that the main interface in Gallium is the context interface, which does in fact model the programmable pipeline. Because of the incredibly flexible nature of the programmable pipeline, a huge set of APIs is covered just by reusing the context interface. But in modern GPUs there are still some fixed-function parts that are not easily addressable through the programmable pipeline interface. Video is a great example of that. To a lesser degree so is basic 2D acceleration (lesser because some modern GPUs don't have 2D engines at all anymore).

But, and it's a big but ("and I cannot lie" <- song reference, which I point out to let everyone know that I'm all about music), nothing stops us from adding interfaces which deal exclusively with the fixed-function parts of modern GPUs. In fact it has already been done: work on a simple 2D interface has already started.

The basic idea is that state trackers which need some specific functionality use the given interface. For example the Exa state tracker would use the Gallium 2D interface instead of the main context interface. In this case the Gallium hardware driver has a choice: it can either implement the given interface directly in hardware, or it can use the default implementation.

The default implementation is something Gallium will provide as part of the auxiliary libraries. It will use the main context interface to emulate the entire functionality of the other interface.
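A conceptual sketch of that split (Pipe2D, Pipe3DContext and Default2D are hypothetical names, not the real Gallium interfaces):

struct Pipe3DContext;   // stand-in for the main (programmable) context

// A small fixed-function-style interface a driver may implement directly.
struct Pipe2D {
    virtual ~Pipe2D() {}
    virtual void fill(/* dst surface, color, rect */) = 0;
    virtual void copy(/* src surface, dst surface, rects */) = 0;
};

// The default implementation: emulate the 2D operations with solid-colored
// or textured quads drawn through the 3D context.
class Default2D : public Pipe2D {
public:
    explicit Default2D(Pipe3DContext *ctx) : ctx(ctx) {}
    void fill() { /* draw a solid-colored quad via ctx */ }
    void copy() { /* draw a textured quad via ctx */ }
private:
    Pipe3DContext *ctx;
};

A hardware driver with a real 2D/blit engine returns its own Pipe2D; everything else gets Default2D, and the state tracker on top (Exa, for example) never needs to know which one it got.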

A video decoding framework would use the same semantics. So it would be an additional interface (or interfaces) with a default implementation on top of the 3D pipeline. Obviously some parts of video support are quite difficult to implement on top of the 3D pipeline, but that's the whole point: for hardware that supports it you get the whole shabangabang, for hardware that doesn't you get a reasonable fallback. Plus, in the latter case the driver authors don't have to write a single line of hardware-specific code.

So a very nice project for someone would be to take VDPAU, VA-API or any video framework of your choice, implement a state tracker for that API on top of Gallium, and design the interface(s) that could be added to Gallium to implement the API in a way that makes full use of the fixed-function video units found in GPUs. I think this is the way our XvMC state tracker is heading.
This is the moment where we break into a song.

Wednesday, February 04, 2009

Latest changes

I actually went through all my blog entries and removed spam. That means you won't be able to find links to stuff that can enlarge your penis here anymore. I hope this action will not shatter your lives and that you'll find consolation in all the spam you're getting via email anyway. And if not, I saved some of the links. You never know, I say.
I also changed the template. I'd slap something ninja related on it, but I don't have anything that fits. Besides, nowadays everyone is a graphics ninja. I'm counting the hours until the term is added to the dictionary. So my new nickname will be "The Lost Son of Norway, Duke of Poland and King of England". Aim high is my motto.

As a proud owner of exactly zero babies I got lots of time to think about stuff. Mostly about squirrels, goats and the letter q. So I wanted to talk about some of the things I've been thinking about lately.

Our friend ("our" as in the KDE community's; if you're not a part of it, then obviously not your friend, in fact he told me he doesn't like you at all) Ignacio Castaño has a very nice blog entry about 10 things one might do with tessellation.
The graphics pipeline continues evolving and while reading Ignacio's entry I realized that we haven't been that good about communicating the evolution of Gallium3D.
So here we go.

I've been slowly working towards support for geometry shaders in Gallium3D. Interface-wise the changes are quite trivial, but a slightly bigger issue is that some (quite modern) hardware, while perfectly capable of emitting geometry in the pipeline, is not quite capable of supporting all of the features of the geometry shader extension. The question of how to handle that is an interesting one, because simple emission of geometry from a shader is a very desirable feature on its own (for example path rendering in OpenVG would profit from it).

I've been doing some small API cleanups lately. Jose made some great changes to the concept of surfaces, which became pure views on textures. As a follow-up, over the last few days we have disassociated buffers from them, to make that really explicit. It gives drivers the opportunity to optimize a few things and, with some changes Michel is working on, avoid some redundant copies.

A lot of work went into winsys. Winsys, which is a bit of a misnomer, was a part of Gallium that did too much. It was supposed to be a resource manager, handle command submission and handle integration with windowing systems and OSes. We've been slowly chopping parts of it away, making it a lot smaller, and over the weekend we managed to hide it completely from the state tracker side.

Keith extracted the Xlib and DRI code from winsys and put it into separate state trackers, meaning that, just like the WGL state tracker, the code is actually sharable between all the drivers. That is great news.

Brian has been fixing so many bugs and implementing so many features that people should start writing folk songs about him. Just the fact that we now support GL_ARB_framebuffer_object deserves at least a poem (not a big one, but certainly none of that white, non-rhyming stuff; we're talking full-fledged rhymes and everything... You can tell that I know a lot about poetry, can't you).

One thing that never got a lot of attention is that Thomas (who did get one of them baby thingies lately) released his window systems buffer manager code.

Another thing that didn't get a lot of attention is Alan's xf86-video-modesetting driver. It's a dummy X11 driver that uses DRM for modesetting and Gallium3D for graphics acceleration. Because of that it's hardware independent, meaning that all hardware that has a DRM driver and Gallium3D driver automatically works and is accelerated under X11. Very neat stuff.

Alright, I feel like I'm cutting into your youtube/facebook time and like all "Lost Sons of Norway, Dukes of Poland and Kings of England" I know my place, so that's it.

Sunday, February 01, 2009

OpenCL

I missed you so much. Yes, you. No, not you. You. I couldn't blog for a while, and I ask you (no, not you): what's the point of living if one can't blog? Sure, there's the world outside of computers, but it's a scary place filled with people who, god forbid, might try interacting with you. Who needs that? It turns out that I do. I've spent the last week in Portland at the OpenCL working group meeting, which was a part of the Khronos F2F.

For those who don't know (oh, you poor souls), OpenCL is what could be described as "the shit". That's the official spelling, but substituting "da" for the word "the" is considered perfectly legal. A longer description includes the expansion of the term OpenCL to "Open Computing Language", with an accompanying Wikipedia entry. OpenCL has all the ingredients, including the word "Open" right in the name, to make it one of the most important technologies of the coming years.

OpenCL allows us to tap into the tremendous power of modern GPUs. Not only that, but one can also use OpenCL with accelerators (like physics chips or Cell SPUs) and CPUs. On top of that hardware, OpenCL provides both task-based and data-based parallelism, making it a fascinating option for those who want to accelerate their code. For example if you have a canvas (Qt graphicsview) and you spend a lot of time doing collision detection, or if you have an image manipulation application (Krita) and you spend a lot of time in effects and general image manipulation, or if you have a scientific chemistry application with an equation solver (Kalzium) and want to make it all faster, or if you have wonky hair and like to dance polka... OK, the last one is a little fuzzy, but you get the point.

Make no mistake, OpenCL is a little bit more complicated than just "write your algorithm in C". Albeit well hidden, the graphics pipeline is still at the forefront of the design, so there are some limitations (remember that for a number of really good and a few inconvenient reasons GPUs do their own memory management, so you cannot just move data structures between main and graphics memory). It's one of the reasons why you won't see a GCC-based OpenCL implementation any time soon. OpenCL requires run-time execution, it allows sharing of buffers with OpenGL (e.g. an OpenCL image data type can be constructed from GL textures or renderbuffers) and it requires code generation for a number of different targets (GPUs, CPUs, accelerators). All those things need to be integrated. For sharing of buffers between OpenGL and OpenCL the two APIs need to go through some kind of a common framework, be it a real library or some utility code that exposes addresses and their meaning to both the OpenGL and OpenCL implementations.
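To make the buffer-sharing point concrete, here's a hedged sketch using the standard CL/GL interop entry points. Error handling and the GL-sharing context setup are omitted; the cl_context is assumed to have been created with GL sharing enabled:

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

// Wrap an existing GL texture as a CL image and bracket its use.
cl_mem wrapGLTexture(cl_context ctx, cl_command_queue queue, GLuint tex)
{
    cl_int err = CL_SUCCESS;

    // The CL image aliases the GL texture's storage; no copy is made.
    cl_mem image = clCreateFromGLTexture2D(ctx, CL_MEM_READ_WRITE,
                                           GL_TEXTURE_2D, 0, tex, &err);
    (void)err;                                   // error handling omitted

    glFinish();                                  // make sure GL is done
    clEnqueueAcquireGLObjects(queue, 1, &image, 0, 0, 0);
    // ... enqueue kernels that read/write the image here ...
    clEnqueueReleaseGLObjects(queue, 1, &image, 0, 0, 0);

    return image;
}

This is exactly the kind of plumbing that needs a common layer underneath both APIs, which is where Gallium comes in.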

Fortunately we already have that layer. Gallium3D maps perfectly to the buffer and command management in OpenCL, which shouldn't be surprising given that they both care about the graphics pipeline. So all we need is a new state tracker with a compiler framework integrated to parse and generate code from the OpenCL C language. LLVM is the obvious choice here because, unlike GCC, LLVM has libraries that we can use for both (to be more specific, Clang and LLVM). So yes, we started on an OpenCL state tracker, but of course we are far, far away from actual conformance. Being part of a large company means that we have to go through extensive legal reviews before releasing something in the open, so right now we're patiently waiting.
My hope is that once the state tracker is public the work will continue at a lot faster pace. I'd love to see our implementation pass conformance with at least one GPU driver by summer (which is /hard/ but definitely not impossible).