Understanding Objective-C by transpiling it to C++


TL;DR: This article describes how to transpile Objective-C to C++, and uses that approach to gain a better understanding of the Objective-C runtime.

Apple heavily pushes for Swift as the programming language for its platforms. However, Objective-C is not going anywhere yet. A 2023 study reveals that “Objective-C is still at the core of iOS and is used directly or indirectly by most apps”. Also, most frameworks shipped on macOS (as we saw in a previous post) are still written in Objective-C.

As you probably know, Objective-C is a superset of C. In fact, the Objective-C runtime is a plain C library. An awesome trick that Wojciech Reguła recently introduced me to is to transpile Objective-C to C++. This is a great way to learn more about the Objective-C runtime, and how Objective-C works under the hood.

In this article, we will transpile an example Objective-C program to C++, highlight some interesting parts of the generated code, and explore some of the history and current status of this work in the LLVM project.

Example: Transpiling “Hello World”

Let’s look at an example, based on the following sample Objective-C program:

// main.m
#import <Foundation/Foundation.h>

int main() {
  @autoreleasepool {
    NSLog(@"Hello World");
  }

  return EXIT_SUCCESS;
}

To transpile this Objective-C program to C++, we can use Clang’s -rewrite-objc option, along with the -Wno-everything option to silence warnings that are irrelevant for this post, and the -fno-ms-extensions option to disable Microsoft-specific extensions (more on this later):

$ xcrun clang main.m -o main.cc -rewrite-objc -Wno-everything -fno-ms-extensions

The main.cc output will be a pretty big C++ file (over 60k lines on my system) that looks something like this:

#ifndef __OBJC2__
#define __OBJC2__
#endif
struct objc_selector; struct objc_class;
struct __rw_objc_super {
    struct objc_object *object;
    struct objc_object *superClass;
    __rw_objc_super(struct objc_object *o, struct objc_object *s) : object(o), superClass(s) {}
};

// ...

int main() {
  /* @autoreleasepool */ { __AtAutoreleasePool __autoreleasepool;
    NSLog((NSString *)&__NSConstantStringImpl__var_folders_sy_wb_f149x2v9_j6xdhfrtr9c00000gn_T_main_fca8a5_mi_0);
  }
  return 0;
}
static struct IMAGE_INFO { unsigned version; unsigned flag; } _OBJC_IMAGE_INFO = { 0, 2 };

Let’s explore some interesting parts of the resulting code, starting with a simple one.

While we won’t showcase it in this article, -rewrite-objc can also be used to transpile Objective-C++ to C++.

Inspecting NSString static strings

Here is our initial simple NSLog invocation:

NSLog(@"Hello World");

Which the re-writer translated to:

NSLog((NSString *)&__NSConstantStringImpl__var_folders_sy_wb_f149x2v9_j6xdhfrtr9c00000gn_T_main_6b2f4b_mii_0);

Our “Hello World” constant string is statically allocated as a __NSConstantStringImpl:

static __NSConstantStringImpl __NSConstantStringImpl__var_folders_sy_wb_f149x2v9_j6xdhfrtr9c00000gn_T_main_6b2f4b_mii_0 __attribute__ ((section ("__DATA, __cfstring"))) = {__CFConstantStringClassReference,0x000007c8,"Hello World",11};

The __NSConstantStringImpl structure looks like this:

struct __NSConstantStringImpl {
  int *isa;
  int flags;
  char *str;
#if _WIN64
  long long length;
#else
  long length;
#endif
};

Cross-referencing this with the brace initialization of our __NSConstantStringImpl instance, we can determine that the isa member of the object points to __CFConstantStringClassReference, that it has the flags 0x000007c8, that the actual string is Hello World, and that its length is 11. If you are curious about the flags integer, the CFString implementation, part of the Core Foundation framework, tells us that it is an immutable, UTF-8 string that uses the default allocator, and whose contents are not freed up.

The (section ("__DATA, __cfstring")) attribute specifies that the string must be stored in the __cfstring section of the __DATA (read/write) segment of the resulting Mach-O executable. To better understand this, let’s compile the “Hello World” Objective-C program (in the usual way) and inspect it using the open-source MachOView desktop application.

Inspecting __DATA_CONST.__cfstring and __TEXT.__cstring with MachOView

In this example, C string literals are stored at specific offsets of the __cstring section of the __TEXT (read-only) segment, and the CFString objects are stored in the __cfstring section of the __DATA_CONST segment, pointing back at the offsets of the C strings.

Note that the Clang Objective-C to C++ re-writer does not add a const qualifier to the __NSConstantStringImpl instance, resulting in the object being stored in the __DATA segment, instead of the __DATA_CONST segment as the normal Objective-C compilation process seems to do. We will touch on why these differences exist later in the post.

Even more interestingly, we can see the members of the __NSConstantStringImpl structure being laid out in the executable. The first entry corresponds to the isa offset, the second entry corresponds to the flags integer, the third entry corresponds to the str C string offset (as we saw before), and the fourth entry corresponds to the length of the string.

Mach-O example of __NSConstantStringImpl

Coming back to the generated C++ code, before invoking NSLog, the address of the __NSConstantStringImpl instance is cast to NSString *, where NSString is defined as follows:

// @class NSString;
#ifndef _REWRITER_typedef_NSString
#define _REWRITER_typedef_NSString
typedef struct objc_object NSString;
typedef struct {} _objc_exc_NSString;
#endif

According to the above definition, NSString is an alias (typedef) for struct objc_object, which is how the Objective-C runtime represents an arbitrary Objective-C object. A pointer to objc_object is the well-known id Objective-C type. In fact, the generated C++ code defines id like this:

typedef struct objc_class *Class;
struct objc_object {
    Class _Nonnull isa __attribute__((deprecated));
};
typedef struct objc_object *id;

Inspecting @autoreleasepool blocks

Since the introduction of ARC (Automatic Reference Counting), the NSAutoreleasePool class cannot be used directly, and was replaced by @autoreleasepool blocks.

If we take a look at the generated C++ code, we can see that Clang re-wrote the @autoreleasepool block as follows:

/* @autoreleasepool */ { __AtAutoreleasePool __autoreleasepool;
  NSLog((NSString *)&__NSConstantStringImpl__var_folders_sy_wb_f149x2v9_j6xdhfrtr9c00000gn_T_main_fca8a5_mi_0);
}

The key here is the __AtAutoreleasePool class, defined close to the beginning of the generated file:

struct __AtAutoreleasePool {
  __AtAutoreleasePool() {atautoreleasepoolobj = objc_autoreleasePoolPush();}
  ~__AtAutoreleasePool() {objc_autoreleasePoolPop(atautoreleasepoolobj);}
  void * atautoreleasepoolobj;
};

This is a C++ RAII (Resource Acquisition Is Initialization) wrapper over the objc_autoreleasePoolPush and objc_autoreleasePoolPop private C functions of the runtime.

These functions are not covered by the Apple documentation, and are not declared in the public headers of the Objective-C runtime, which you can confirm with the following grep(1) command:

$ grep objc_autorelease $(xcode-select --print-path)/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/objc/*

In a previous article, we explored how to extract the dyld shared cache of your system libraries. Assuming your extracted cache is located at $HOME/dyld-cache-arm64e, you can confirm objc_autoreleasePoolPush and objc_autoreleasePoolPop are globally exposed symbols of libobjc.A.dylib using nm(1):

$ nm -g $HOME/dyld-cache-arm64e/usr/lib/libobjc.A.dylib | grep objc_autorelease
00000001800a4afc T __objc_autoreleasePoolPop
00000001800a4b00 T __objc_autoreleasePoolPrint
00000001800a4af8 T __objc_autoreleasePoolPush
0000000180075850 T _objc_autorelease
00000001800739ec T _objc_autoreleasePoolPop
00000001800738ac T _objc_autoreleasePoolPush
0000000180076b8c T _objc_autoreleaseReturnValue

You can also find references to these functions in the TBD (text-based stub) file that declares the exported symbols of libobjc.A.dylib:

$ grep objc_autorelease < $(xcode-select --print-path)/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib/libobjc.A.tbd
           __objc_atfork_parent, __objc_atfork_prepare, __objc_autoreleasePoolPop,
           __objc_autoreleasePoolPrint, __objc_autoreleasePoolPush, __objc_beginClassEnumeration,
           _objc_allocateProtocol, _objc_autorelease, _objc_autoreleasePoolPop,
           _objc_autoreleasePoolPush, _objc_autoreleaseReturnValue, _objc_begin_catch,

Coming back to our generated code, these private functions that are not declared in the Objective-C runtime headers are consumed like this:

extern "C" __declspec(dllimport) void * objc_autoreleasePoolPush(void);
extern "C" __declspec(dllimport) void objc_autoreleasePoolPop(void *);

Microsoft Extensions

You might be puzzled by the seemingly Windows-specific __declspec(dllimport) attribute.

Let’s dig a bit into it. I’m running AppleClang 1500.0.40.1 (Xcode 15.0.1), which corresponds to LLVM 16. In LLVM 16, the Objective-C re-writer we are using is implemented in clang/lib/Frontend/Rewrite/RewriteModernObjC.cpp.

You might have noticed clang/lib/Frontend/Rewrite/RewriteObjC.cpp, which corresponds to the old -rewrite-legacy-objc Clang option. That re-writer is deprecated and should not be used anymore.

Taking a look into RewriteModernObjC.cpp, we can see that the re-writer has various conditionals around LangOpts.MicrosoftExt for performing Microsoft-specific rewrites. For example, lines 5930 to 5935 contain the following logic:

if (LangOpts.MicrosoftExt) {
  Preamble += "#define __OBJC_RW_DLLIMPORT extern \"C\" __declspec(dllimport)\n";
  Preamble += "#define __OBJC_RW_STATICIMPORT extern \"C\"\n";
}
else
  Preamble += "#define __OBJC_RW_DLLIMPORT extern\n";

As you might expect, this is the reason we initially passed the -fno-ms-extensions option. However, these Microsoft-specific conditionals are not consistently handled at the moment. For example, you might find FIXME comments like the one in lines 1012 to 1014:

// FIXME. Is this attribute correct in all cases?
Setr = "\nextern \"C\" __declspec(dllimport) "
"void objc_setProperty (id, SEL, long, id, bool, bool);\n";

More specific to our case, the re-writer (incorrectly?) hardcodes __declspec(dllimport) for objc_autoreleasePoolPush and objc_autoreleasePoolPop in lines 6045 to 6046:

Preamble += "extern \"C\" __declspec(dllimport) void * objc_autoreleasePoolPush(void);\n";
Preamble += "extern \"C\" __declspec(dllimport) void objc_autoreleasePoolPop(void *);\n\n";

Is Objective-C just a transpiler?

If you got this far, you might be wondering how LLVM makes use of this Objective-C re-writer. When you compile Objective-C, this re-writer is not used.

Instead, LLVM has an Objective-C frontend that directly compiles to LLVM IR (Intermediate Representation), which is transformed to machine code by the LLVM backend. You can peek into the production-ready Objective-C frontend for LLVM 16 at clang/lib/CodeGen/CGObjC.cpp.

Limitations of the re-writer

The fact that normal Objective-C compilation follows a different process explains some inconsistencies we saw with the re-writer in this article, like the fact that static strings are put in the __DATA segment instead of in the __DATA_CONST segment and missing conditionals around Microsoft-specific extensions and dllimport.

Apart from minor inconsistencies, the re-writer seems to have many other issues. Unless you provide trivial examples that do not make use of the Foundation framework, the generated C++ code does not compile. For example, while experimenting with the “Hello World” program presented at the beginning of this article, I found references to wrong structure names, some Objective-C @property declarations not being re-written, invalid typedef aliases, and more.

If we take a detour into LLVM again, Clang’s README states that “Clang is useful for a number of things beyond just compiling code: we intend for Clang to be host to a number of different source-level tools.” It turns out that the Objective-C re-writer is just a best-effort side experiment started in 2007 by Chris Lattner, creator of LLVM and Swift.

Over the last 15 years, this re-writer experiment has received a steady stream of casual contributions and gained a growing end-to-end test suite. Even if it is still not perfect, you can already learn many things about Objective-C with it!

HN Discussion: https://news.ycombinator.com/item?id=38498934.