| 1 | ================ |
| 2 | Initializer List |
| 3 | ================ |
| 4 | This discussion took place in https://reviews.llvm.org/D35216 |
| 5 | "Escape symbols when creating std::initializer_list". |
| 6 | |
| 7 | It touches problems of modelling C++ standard library constructs in general, |
| 8 | including modelling implementation-defined fields within C++ standard library |
| 9 | objects, in particular constructing objects into pointers held by such fields, |
| 10 | and separation of responsibilities between analyzer's core and checkers. |
| 11 | |
| 12 | **Artem:** |
| 13 | |
| 14 | I've seen a few false positives that appear because we construct |
| 15 | C++11 std::initializer_list objects with brace initializers, and such |
| 16 | construction is not properly modeled. For instance, if a new object is |
| 17 | constructed on the heap only to be put into a brace-initialized STL container, |
| 18 | the object is reported to be leaked. |
| 19 | |
| 20 | Approach (0): This can be trivially fixed by this patch, which causes pointers |
| 21 | passed into initializer list expressions to immediately escape. |
| 22 | |
| 23 | This fix is overly conservative though. So i did a bit of investigation as to |
| 24 | how model std::initializer_list better. |
| 25 | |
| 26 | According to the standard, ``std::initializer_list<T>`` is an object that has |
| 27 | methods ``begin(), end(), and size()``, where ``begin()`` returns a pointer to continuous |
| 28 | array of ``size()`` objects of type T, and end() is equal to begin() plus size(). |
| 29 | The standard does hint that it should be possible to implement |
| 30 | ``std::initializer_list<T>`` as a pair of pointers, or as a pointer and a size |
| 31 | integer, however specific fields that the object would contain are an |
| 32 | implementation detail. |
| 33 | |
| 34 | Ideally, we should be able to model the initializer list's methods precisely. |
| 35 | Or, at least, it should be possible to explain to the analyzer that the list |
| 36 | somehow "takes hold" of the values put into it. Initializer lists can also be |
| 37 | copied, which is a separate story that i'm not trying to address here. |
| 38 | |
| 39 | The obvious approach to modeling ``std::initializer_list`` in a checker would be to |
| 40 | construct a SymbolMetadata for the memory region of the initializer list object, |
| 41 | which would be of type ``T*`` and represent ``begin()``, so we'd trivially model ``begin()`` |
| 42 | as a function that returns this symbol. The array pointed to by that symbol |
| 43 | would be ``bindLoc()``ed to contain the list's contents (probably as a ``CompoundVal`` |
| 44 | to produce less bindings in the store). Extent of this array would represent |
| 45 | ``size()`` and would be equal to the length of the list as written. |
| 46 | |
| 47 | So this sounds good, however apparently it does nothing to address our false |
| 48 | positives: when the list escapes, our ``RegionStoreManager`` is not magically |
| 49 | guessing that the metadata symbol attached to it, together with its contents, |
| 50 | should also escape. In fact, it's impossible to trigger a pointer escape from |
| 51 | within the checker. |
| 52 | |
| 53 | Approach (1): If only we enabled ``ProgramState::bindLoc(..., notifyChanges=true)`` |
| 54 | to cause pointer escapes (not only region changes) (which sounds like the right |
| 55 | thing to do anyway) such checker would be able to solve the false positives by |
| 56 | triggering escapes when binding list elements to the list. However, it'd be as |
| 57 | conservative as the current patch's solution. Ideally, we do not want escapes to |
| 58 | happen so early. Instead, we'd prefer them to be delayed until the list itself |
| 59 | escapes. |
| 60 | |
| 61 | So i believe that escaping metadata symbols whenever their base regions escape |
| 62 | would be the right thing to do. Currently we didn't think about that because we |
| 63 | had neither pointer-type metadatas nor non-pointer escapes. |
| 64 | |
| 65 | Approach (2): We could teach the Store to scan itself for bindings to |
| 66 | metadata-symbolic-based regions during scanReachableSymbols() whenever a region |
| 67 | turns out to be reachable. This requires no work on checker side, but it sounds |
| 68 | performance-heavy. |
| 69 | |
| 70 | Approach (3): We could let checkers maintain the set of active metadata symbols |
| 71 | in the program state (ideally somewhere in the Store, which sounds weird but |
| 72 | causes the smallest amount of layering violations), so that the core knew what |
| 73 | to escape. This puts a stress on the checkers, but with a smart data map it |
| 74 | wouldn't be a problem. |
| 75 | |
| 76 | Approach (4): We could allow checkers to trigger pointer escapes in arbitrary |
| 77 | moments. If we allow doing this within ``checkPointerEscape`` callback itself, we |
| 78 | would be able to express facts like "when this region escapes, that metadata |
| 79 | symbol attached to it should also escape". This sounds like an ultimate freedom, |
| 80 | with maximum stress on the checkers - still not too much stress when we have |
| 81 | smart data maps. |
| 82 | |
| 83 | I'm personally liking the approach (2) - it should be possible to avoid |
| 84 | performance overhead, and clarity seems nice. |
| 85 | |
| 86 | **Gabor:** |
| 87 | |
| 88 | At this point, I am a bit wondering about two questions. |
| 89 | |
| 90 | * When should something belong to a checker and when should something belong to the engine? |
| 91 | Sometimes we model library aspects in the engine and model language constructs in checkers. |
| 92 | |
| 93 | * What is the checker programming model that we are aiming for? Maximum freedom or more easy checker development? |
| 94 | |
| 95 | I think if we aim for maximum freedom, we do not need to worry about the |
| 96 | potential stress on checkers, and we can introduce abstractions to mitigate that |
| 97 | later on. |
| 98 | If we want to simplify the API, then maybe it makes more sense to move language |
| 99 | construct modeling to the engine when the checker API is not sufficient instead |
| 100 | of complicating the API. |
| 101 | |
| 102 | Right now I have no preference or objections between the alternatives but there |
| 103 | are some random thoughts: |
| 104 | |
| 105 | * Maybe it would be great to have a guideline how to evolve the analyzer and |
| 106 | follow it, so it can help us to decide in similar situations |
| 107 | |
| 108 | * I do care about performance in this case. The reason is that we have a |
| 109 | limited performance budget. And I think we should not expect most of the checker |
| 110 | writers to add modeling of language constructs. So, in my opinion, it is ok to |
| 111 | have less nice/more verbose API for language modeling if we can have better |
| 112 | performance this way, since it only needs to be done once, and is done by the |
| 113 | framework developers. |
| 114 | |
| 115 | **Artem:** These are some great questions, i guess it'd be better to discuss |
| 116 | them more openly. As a quick dump of my current mood: |
| 117 | |
| 118 | * To me it seems obvious that we need to aim for a checker API that is both |
| 119 | simple and powerful. This can probably by keeping the API as powerful as |
| 120 | necessary while providing a layer of simple ready-made solutions on top of it. |
| 121 | Probably a few reusable components for assembling checkers. And this layer |
| 122 | should ideally be pleasant enough to work with, so that people would prefer to |
| 123 | extend it when something is lacking, instead of falling back to the complex |
| 124 | omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly |
| 125 | similar situation: matchers are not omnipotent, but they're so nice. |
| 126 | |
| 127 | * Separation between core and checkers is usually quite strange. Once we have |
| 128 | shared state traits, i generally wouldn't mind having region store or range |
| 129 | constraint manager as checkers (though it's probably not worth it to transform |
| 130 | them - just a mood). The main thing to avoid here would be the situation when |
| 131 | the checker overwrites stuff written by the core because it thinks it has a |
| 132 | better idea what's going on, so the core should provide a good default behavior. |
| 133 | |
| 134 | * Yeah, i totally care about performance as well, and if i try to implement |
| 135 | approach, i'd make sure it's good. |
| 136 | |
| 137 | **Artem:** |
| 138 | |
| 139 | > Approach (2): We could teach the Store to scan itself for bindings to |
| 140 | > metadata-symbolic-based regions during scanReachableSymbols() whenever |
| 141 | > a region turns out to be reachable. This requires no work on checker side, |
| 142 | > but it sounds performance-heavy. |
| 143 | |
| 144 | Nope, this approach is wrong. Metadata symbols may become out-of-date: when the |
| 145 | object changes, metadata symbols attached to it aren't changing (because symbols |
| 146 | simply don't change). The same metadata may have different symbols to denote its |
| 147 | value in different moments of time, but at most one of them represents the |
| 148 | actual metadata value. So we'd be escaping more stuff than necessary. |
| 149 | |
| 150 | If only we had "ghost fields" |
| 151 | (https://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have |
| 152 | been much easier, because the ghost field would only contain the actual |
| 153 | metadata, and the Store would always know about it. This example adds to my |
| 154 | belief that ghost fields are exactly what we need for most C++ checkers. |
| 155 | |
| 156 | **Devin:** |
| 157 | |
| 158 | In this case, I would be fine with some sort of |
| 159 | AbstractStorageMemoryRegion that meant "here is a memory region and somewhere |
| 160 | reachable from here exists another region of type T". Or even multiple regions |
| 161 | with different identifiers. This wouldn't specify how the memory is reachable, |
| 162 | but it would allow for transfer functions to get at those regions and it would |
| 163 | allow for invalidation. |
| 164 | |
| 165 | For ``std::initializer_list`` this reachable region would the region for the backing |
| 166 | array and the transfer functions for begin() and end() yield the beginning and |
| 167 | end element regions for it. |
| 168 | |
| 169 | In my view this differs from ghost variables in that (1) this storage does |
| 170 | actually exist (it is just a library implementation detail where that storage |
| 171 | lives) and (2) it is perfectly valid for a pointer into that storage to be |
| 172 | returned and for another part of the program to read or write from that storage. |
| 173 | (Well, in this case just read since it is allowed to be read-only memory). |
| 174 | |
| 175 | What I'm not OK with is modeling abstract analysis state (for example, the count |
| 176 | of a NSMutableArray or the typestate of a file handle) as a value stored in some |
| 177 | ginned up region in the store. This takes an easy problem that the analyzer does |
| 178 | well at (modeling typestate) and turns it into a hard one that the analyzer is |
| 179 | bad at (reasoning about the contents of the heap). |
| 180 | |
| 181 | I think the key criterion here is: "is the region accessible from outside the |
| 182 | library". That is, does the library expose the region as a pointer that can be |
| 183 | read to or written from in the client program? If so, then it makes sense for |
| 184 | this to be in the store: we are modeling reachable storage as storage. But if |
| 185 | we're just modeling arbitrary analysis facts that need to be invalidated when a |
| 186 | pointer escapes then we shouldn't try to gin up storage for them just to get |
| 187 | invalidation for free. |
| 188 | |
| 189 | **Artem:** |
| 190 | |
| 191 | > In this case, I would be fine with some sort of ``AbstractStorageMemoryRegion`` |
| 192 | > that meant "here is a memory region and somewhere reachable from here exists |
| 193 | > another region of type T". Or even multiple regions with different |
| 194 | > identifiers. This wouldn't specify how the memory is reachable, but it would |
| 195 | > allow for transfer functions to get at those regions and it would allow for |
| 196 | > invalidation. |
| 197 | |
| 198 | Yeah, this is what we can easily implement now as a |
| 199 | symbolic-region-based-on-a-metadata-symbol (though we can make a new region |
| 200 | class for that if we eg. want it typed). The problem is that the relation |
| 201 | between such storage region and its parent object region is essentially |
| 202 | immaterial, similarly to the relation between ``SymbolRegionValue`` and its parent |
| 203 | region. Region contents are mutable: today the abstract storage is reachable |
| 204 | from its parent object, tomorrow it's not, and maybe something else becomes |
| 205 | reachable, something that isn't even abstract. So the parent region for the |
| 206 | abstract storage is most of the time at best a "nice to know" thing - we cannot |
| 207 | rely on it to do any actual work. We'd anyway need to rely on the checker to do |
| 208 | the job. |
| 209 | |
| 210 | > For std::initializer_list this reachable region would the region for the |
| 211 | > backing array and the transfer functions for begin() and end() yield the |
| 212 | > beginning and end element regions for it. |
| 213 | |
| 214 | So maybe in fact for std::initializer_list it may work fine because you cannot |
| 215 | change the data after the object is constructed - so this region's contents are |
| 216 | essentially immutable. For the future, i feel as if it is a dead end. |
| 217 | |
| 218 | I'd like to consider another funny example. Suppose we're trying to model |
| 219 | |
| 220 | .. code-block:: cpp |
| 221 | |
| 222 | std::unique_ptr. Consider:: |
| 223 | |
| 224 | void bar(const std::unique_ptr<int> &x); |
| 225 | |
| 226 | void foo(std::unique_ptr<int> &x) { |
| 227 | int *a = x.get(); // (a, 0, direct): &AbstractStorageRegion |
| 228 | *a = 1; // (AbstractStorageRegion, 0, direct): 1 S32b |
| 229 | int *b = new int; |
| 230 | *b = 2; // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b |
| 231 | x.reset(b); // Checker map: x -> SymRegion{conj_$0<int *>} |
| 232 | bar(x); // 'a' doesn't escape (the pointer was unique), 'b' does. |
| 233 | clang_analyzer_eval(*a == 1); // Making this true is up to the checker. |
| 234 | clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker. |
| 235 | } |
| 236 | |
| 237 | The checker doesn't totally need to ensure that ``*a == 1`` passes - even though the |
| 238 | pointer was unique, it could theoretically have ``.get()``-ed above and the code |
| 239 | could of course break the uniqueness invariant (though we'd probably want it). |
| 240 | The checker can say that "even if ``*a`` did escape, it was not because it was |
| 241 | stuffed directly into bar()". |
| 242 | |
| 243 | The checker's direct responsibility, however, is to solve the ``*b == 2`` thing |
| 244 | (which is in fact the problem we're dealing with in this patch - escaping the |
| 245 | storage region of the object). |
| 246 | |
| 247 | So we're talking about one more operation over the program state (scanning |
| 248 | reachable symbols and regions) that cannot work without checker support. |
| 249 | |
| 250 | We can probably add a new callback "checkReachableSymbols" to solve this. This |
| 251 | is in fact also related to the dead symbols problem (we're scanning for live |
| 252 | symbols in the store and in the checkers separately, but we need to do so |
| 253 | simultaneously with a single worklist). Hmm, in fact this sounds like a good |
| 254 | idea; we can replace checkLiveSymbols with checkReachableSymbols. |
| 255 | |
| 256 | Or we could just have ghost member variables, and no checker support required at |
| 257 | all. For ghost member variables, the relation with their parent region (which |
| 258 | would be their superregion) is actually useful, the mutability of their contents |
| 259 | is expressed naturally, and the store automagically sees reachable symbols, live |
| 260 | symbols, escapes, invalidations, whatever. |
| 261 | |
| 262 | > In my view this differs from ghost variables in that (1) this storage does |
| 263 | > actually exist (it is just a library implementation detail where that storage |
| 264 | > lives) and (2) it is perfectly valid for a pointer into that storage to be |
| 265 | > returned and for another part of the program to read or write from that |
| 266 | > storage. (Well, in this case just read since it is allowed to be read-only |
| 267 | > memory). |
| 268 | |
| 269 | > What I'm not OK with is modeling abstract analysis state (for example, the |
| 270 | > count of a NSMutableArray or the typestate of a file handle) as a value stored |
| 271 | > in some ginned up region in the store.This takes an easy problem that the |
| 272 | > analyzer does well at (modeling typestate) and turns it into a hard one that |
| 273 | > the analyzer is bad at (reasoning about the contents of the heap). |
| 274 | |
| 275 | Yeah, i tend to agree on that. For simple typestates, this is probably an |
| 276 | overkill, so let's definitely put aside the idea of "ghost symbolic regions" |
| 277 | that i had earlier. |
| 278 | |
| 279 | But, to summarize a bit, in our current case, however, the typestate we're |
| 280 | looking for is the contents of the heap. And when we try to model such |
| 281 | typestates (complex in this specific manner, i.e. heap-like) in any checker, we |
| 282 | have a choice between re-doing this modeling in every such checker (which is |
| 283 | something analyzer is indeed good at, but at a price of making checkers heavy) |
| 284 | or instead relying on the Store to do exactly what it's designed to do. |
| 285 | |
| 286 | > I think the key criterion here is: "is the region accessible from outside |
| 287 | > the library". That is, does the library expose the region as a pointer that |
| 288 | > can be read to or written from in the client program? If so, then it makes |
| 289 | > sense for this to be in the store: we are modeling reachable storage as |
| 290 | > storage. But if we're just modeling arbitrary analysis facts that need to be |
| 291 | > invalidated when a pointer escapes then we shouldn't try to gin up storage |
| 292 | > for them just to get invalidation for free. |
| 293 | |
| 294 | As a metaphor, i'd probably compare it to body farms - the difference between |
| 295 | ghost member variables and metadata symbols seems to me like the difference |
| 296 | between body farms and evalCall. Both are nice to have, and body farms are very |
| 297 | pleasant to work with, even if not omnipotent. I think it's fine for a |
| 298 | FunctionDecl's body in a body farm to have a local variable, even if such |
| 299 | variable doesn't actually exist, even if it cannot be seen from outside the |
| 300 | function call. I'm not seeing immediate practical difference between "it does |
| 301 | actually exist" and "it doesn't actually exist, just a handy abstraction". |
| 302 | Similarly, i think it's fine if we have a ``CXXRecordDecl`` with |
| 303 | implementation-defined contents, and try to farm up a member variable as a handy |
| 304 | abstraction (we don't even need to know its name or offset, only that it's there |
| 305 | somewhere). |
| 306 | |
| 307 | **Artem:** |
| 308 | |
| 309 | We've discussed it in person with Devin, and he provided more points to think |
| 310 | about: |
| 311 | |
| 312 | * If the initializer list consists of non-POD data, constructors of list's |
| 313 | objects need to take the sub-region of the list's region as this-region In the |
| 314 | current (v2) version of this patch, these objects are constructed elsewhere and |
| 315 | then trivial-copied into the list's metadata pointer region, which may be |
| 316 | incorrect. This is our overall problem with C++ constructors, which manifests in |
| 317 | this case as well. Additionally, objects would need to be constructed in the |
| 318 | analyzer's core, which would not be able to predict that it needs to take a |
| 319 | checker-specific region as this-region, which makes it harder, though it might |
| 320 | be mitigated by sharing the checker state traits. |
| 321 | |
| 322 | * Because "ghost variables" are not material to the user, we need to somehow |
| 323 | make super sure that they don't make it into the diagnostic messages. |
| 324 | |
| 325 | So, because this needs further digging into overall C++ support and rises too |
| 326 | many questions, i'm delaying a better approach to this problem and will fall |
| 327 | back to the original trivial patch. |
| 328 | |