'Cpp' category


Casting pointers to references

Casting a pointer (like Foo *) to a reference (like Foo &) via reinterpret_cast or a C-style cast probably doesn't do what you want.

References ("refs") exist so that you can make libraries with user-defined constructs that "feel like" a built-in language abstraction. Refs are definitely confusing if you've transitioned from C to C++ — they're "pointerish" in the sense that the compiler ultimately boils them down to pointer values, but "not" in the sense that the language semantics restrict their use. [*]

I came across one such casting bug today, and wondered what the compiler actually emits for it.

As it turns out, GCC warns when you cast a pointer to its corresponding ref type:

test.cpp:12:23: warning: casting ‘int*’ to ‘int&’ does not dereference pointer

Unfortunately, if you cast it to a corresponding const ref type it stays silent. Consider this snippet of C++ code:

#include <stdio.h>

extern int SomeGlobal;

void DumpValue(const int &value)
{
    printf("%d\n", value);
}

int main() {
    int *pval = &SomeGlobal;
    DumpValue((const int &) pval);
    return 0;
}

Note that the correct approach is to use the deref operator (*) on pval to turn it into an int &, which is compatible with the const int & signature of DumpValue.

After a quick give-me-the-assembly command line sequence:

g++ -o test.o -c test.cpp
objdump -d -r test.o # Get assembly with inline linker relocation directives.

We can see the resulting x64 assembly:

0000000000000025 <main>:
  25:   55                      push   %rbp
  26:   48 89 e5                mov    %rsp,%rbp
  29:   48 83 ec 10             sub    $0x10,%rsp
  2d:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
  34:   00
                        31: R_X86_64_32S    SomeGlobal
  35:   48 8d 45 f8             lea    -0x8(%rbp),%rax
  39:   48 89 c7                mov    %rax,%rdi
  3c:   e8 00 00 00 00          callq  41 <main+0x1c>
                        3d: R_X86_64_PC32   _Z9DumpValueRi-0x4
  41:   b8 00 00 00 00          mov    $0x0,%eax
  46:   c9                      leaveq
  47:   c3                      retq

Walking through it step by step:

  • Instruction 2d is placing the address of SomeGlobal into the stack frame, at location -0x8(%rbp). [†] It currently has $0x0 as a value, with a note for the linker to replace that with the address for SomeGlobal when the linking process figures out where SomeGlobal lives.

  • Instruction 35 computes the address of that stack slot with a lea instruction (which is like a fancy-pants add).

  • Instructions 35 and 39 make that address of the stack slot into the first argument (%rdi) to DumpValue.

So, the argument won't contain the address of SomeGlobal, like we were hoping to provide to DumpValue, but the stack slot address instead. [‡] The cast resulted in a pointer to its operand, which is the behavior that you would expect if you took a value type and cast it to a ref, like so:

#include <stdio.h>

struct MyStruct {
    int foo, bar;
};

void DumpValues(const MyStruct &ms)
{
    printf("%d %d\n", ms.foo, ms.bar);
}

int main(void) {
    MyStruct ms = {42, 1024};
    DumpValues(reinterpret_cast<const MyStruct &>(ms));
    return 0;
}

Footnotes

[*]

See ISO C++ (14882:2003) 8.3.2 #4:

  • You can't have references to references, arrays of references, or pointers to references

  • You can't have uninitialized references

  • A null reference technically can't exist in a "well defined" program, because dereffing the null pointer causes undefined behavior

[†]

Recall that on x64, the stack grows "down" in memory space; i.e. as you push more function frames due to function invocation, the value in %rsp gets smaller. The base pointer is at the start of the frame, in the highest address, and the stack pointer %rsp is at the end of the frame, in the lowest address. The return address is at 8(%rbp), the previous frame's %rbp value is at 0(%rbp), and the first local stack slot for this function is -8(%rbp).

[‡]

On an LP64 system like my x64 Linux machine we can see half of the stack slot value through this reference.

Remembrance for Codington

I marched past the broken-down memory fences into Codington, white-knuckle clutching my .vimrc caliber sawed-off. It was a ghost town — not another developer within picoseconds.

"Can't get distracted," I mumbled into the abyss, "Sweep the perimeter, get a feel for what we're dealing with."

All it took was a few femtos of strafing; what had happened in Codington hit you in the face like a bag of bricks. All you could see was mutables and const_casts strewn around among the faux-silver bullet casings and streaks of blood that trail indiscernibly into the distance.

The plague had descended upon Codington as it had so many cities before. I spat in disdain and snarled.

"Const disease."

Code ☃ Unicode

Let's come to terms: angle brackets and forward slashes are overloaded. Between relational operators, templates, XML tags, (HTML/squiggly brace language) comments, division, regular expressions, and path separators, what don't they do?

I think it's clear to everyone that XML is the best and most human readable markup format ever conceived (data serialization and database backing store applications heartily included), so it's time for all that crufty old junk from yesteryear to learn its place. Widely adopted web standards (such as Binary XML and E4X) and well specified information exchange protocols (such as SOAP) speak for themselves through the synergy they've utilized in enterprise compute environments.

The results of a confidential survey I conducted conclusively demonstrate beyond any possibility of refutation that you type more angle brackets in an average markup document than you will type angle-bracket relational operators for the next ten years.

In conclusion, your life expectancy decreases as you continue to use the less-than operator and forward slash instead of accepting XML into your heart as a first-class syntax. I understand that some may not enjoy life or the pursuit of happiness and that they will continue to use deprecated syntaxes. To each their own.

As a result, I have contributed a JavaScript parser patch to rectify the situation: the ☃ operator is a heart-warming replacement for the (now XML-exclusive) pointy-on-the-left angle bracket and the commonly seen tilde diaeresis ⍨ replaces slash for delimiting regular expressions. I am confident this patch will achieve swift adoption, as it decreases the context sensitivity of the parser, which is a clear and direct benefit for browser end users.

The (intolerably whitespace-sensitive) Python programming language nearly came to a similar conclusion to use Unicode more pervasively, while simultaneously making it a real programming language by way of the use of types, but did not have the wherewithal to see it through.

Another interesting benefit: because JavaScript files may be UTF-16 encoded, this increases the utilization of bytes in the source text by filling the upper octets with non-zero values. This, in the aggregate, will increase the meaningful bandwidth utilization of the Internet as a whole.

Of course, I'd also recommend that C++ solve its nested template delimiter issue with ☃ and ☼ to close instead of increasing the context-sensitivity of the parser. [*] It follows the logical flow of start/end delimiting.

As soon as emoji are accepted as proper Unicode code points, I will revise my recommendation to suggest using the standard poo emoticon for a template start delimiter, because increased giggling is demonstrated to reduce the likelihood of head-and-wall involved injuries during C++ compilation, second only to regular use of head protection while programming.

Footnotes

[*]

Which provides a direct detriment to the end user — optimizing compilers spend most of their time in the parser.

Thoughts on C++ in small memory footprint embedded development

Background

During my senior year I took on an ECE491 Independent Project course to follow up on an ECE476 Microcontrollers project. On the completion of ECE491 we had created a Low Speed USB 2.0 stack library for the Atmel Mega32 Microcontroller using ~$6 worth of hardware and ~6000 standard lines of C.

Everybody in ECE476 used the CodeVisionAVR IDE, but we were a unique group in using avr-gcc. Though most students were okay with it, there were some minor features missing from the CodeVision compiler at the time, such as the ability to allocate objects on the heap. ;)

We rewrote the ECE476 code base in ECE491, again using avr-gcc, because we realized that the USB protocol was a lot more complex than the stack we had originally written. One of my main gripes in ECE491 was that I was writing highly object oriented code in a language which didn't support any of the syntax. I'm starting to hack on the code base again, and a port to C++ seems like a good idea (since avr-g++ is also available), but there are some significant trade-offs running through my mind.

The Trade-offs

Things I want from C++ in the project:

  • Namespaces — I hate worrying about global namespace issues. Though I'm not a huge fan of C++'s namespacing implementation, I'll sure take it over no namespaces. :)

  • Templates — Casting to and from void *s in "containers" and haphazardly faking (efficient) tuples with void **s is extremely dangerous and causes really annoying bugs.

  • Class syntax — It's ugly faking object orientation in C. Supported syntax is important to me because writing object oriented code in C "requires" you to prefix every function with a class name, like so:

    void data_packet_recalculate_crc16(data_packet_t self) { ... }
    

    Plus, cool tools like doxygen don't pick up on the fact you're writing in an object oriented style and adjust output accordingly (not that I'd expect them to).

  • Exceptions — If you've ever written a large amount of C code, you learn how precious basic exceptions are. gotos and cleanup code tend to get old after a short period of time.

  • Default arguments — I'm a fan of default arguments since it cuts down on the number of wrapper functions you have to write and maintain (though keyword arguments are even cooler :).

Things I don't want from C++ in the project:

  • Larger binaries — The Atmel ATMega32 has 32KB of memory to fit my entire software stack, which I've found to be extremely confining, even with my C implementation (which manages reuse through void *s). The fact that templates cause the compiler to replicate code makes me worry.

  • Don't really care, but maybe worth mentioning: vtables — I want inheritance with vptrs because it makes maintenance significantly easier for the straightforward class hierarchies; however, 16MHz chips are slow. Damn slow. Having vtables for some of the most primitive data structures seems ominous; however, I'm pretty sure that performance is a lot lower on the totem pole than maintainability in this instance.

The First Google Hit Says...

I've read through Reducing C++ Code Bloat and found it thought provoking. Though the article writes about gcc 3.4 and I'm using gcc 4.2, I can't imagine that the underlying code-bloat concepts have changed much. I'm betting a lot of the compiler directive advice is taken care of by gcc's -Os, but I'll make a note to check it out.

It seems sensible to give up on exceptions ahead of time, but there seems to be some hope that the compiler can figure out good code reuse for the templates. I'm compiling to ELF, then performing an objcopy to turn it into Intel Hex object format. I'm hoping that the conversion is trivial and the good ELF compilation referenced in the article will stick.

In the end it seems like I'm just gambling on how much template reuse will occur. I sure hope that if I do all the porting-to-C++ work it optimizes well — template hoisting looks like one of those idioms I'd prefer to leave alone. :(

Experimenting with C++ inheritance diamonds and method resolution order

It seems like g++ uses the order indicated in the class's base specifier list (its derivation list) over the order indicated in the constructor's member initializer list, as in the below example:

#include <iostream>
#include <string>

using namespace std;

class Animal {
public:
    Animal() {
        cout << "Animal!" << endl;
    }
};

class Man : public virtual Animal {
public:
    Man(string exclamation) {
        cout << "Man " << exclamation << "!" << endl;
    }
};

class Bear : public virtual Animal {
public:
    Bear(string exclamation) {
        cout << "Bear " << exclamation << "!" << endl;
    }
};

class Pig : public virtual Animal {
public:
    Pig(string exclamation) {
        cout << "Pig " << exclamation << "!" << endl;
    }
};

class ManBearPig : public Man, public Bear, public Pig {
public:
    ManBearPig(string exclamation)
        : Pig(exclamation), Bear(exclamation), Man(exclamation)
    {
        cout << "ManBearPig " << exclamation << "!" << endl;
    }
};

int main() {
    ManBearPig mbp("away");
    return 0;
}

cdleary@gamma:~/projects/sandbox/sandbox_cpp$ g++ diamond.cpp && ./a.out
Animal!
Man away!
Bear away!
Pig away!
ManBearPig away!

Note that this experiment is a pretty (very) weak basis for the conclusion — it could be using lexicographic order, order based on the day of month, or any number of other unlikely heuristics :) A lot more experimentation is necessary before getting a discernible pattern, but I just felt like messing around.

Edit (07/27/08): Using correct "base specifier list" terminology instead of my made-up "class initializer list" terminology.