InitializerLists.rst source code [clang_source_code/docs/analyzer/developer-docs/InitializerLists.rst]

1	================
2	Initializer List
3	================
4	This discussion took place in https://reviews.llvm.org/D35216
5	"Escape symbols when creating std::initializer_list".
6
7	It touches problems of modelling C++ standard library constructs in general,
8	including modelling implementation-defined fields within C++ standard library
9	objects, in particular constructing objects into pointers held by such fields,
10	and separation of responsibilities between analyzer's core and checkers.
11
12	Artem:
13
14	I've seen a few false positives that appear because we construct
15	C++11 std::initializer_list objects with brace initializers, and such
16	construction is not properly modeled. For instance, if a new object is
17	constructed on the heap only to be put into a brace-initialized STL container,
18	the object is reported to be leaked.
19
20	Approach (0): This can be trivially fixed by this patch, which causes pointers
21	passed into initializer list expressions to immediately escape.
22
23	This fix is overly conservative though. So i did a bit of investigation as to
24	how model std::initializer_list better.
25
26	According to the standard, ``std::initializer_list<T>`` is an object that has
27	methods ``begin(), end(), and size()``, where ``begin()`` returns a pointer to continuous
28	array of ``size()`` objects of type T, and end() is equal to begin() plus size().
29	The standard does hint that it should be possible to implement
30	``std::initializer_list<T>`` as a pair of pointers, or as a pointer and a size
31	integer, however specific fields that the object would contain are an
32	implementation detail.
33
34	Ideally, we should be able to model the initializer list's methods precisely.
35	Or, at least, it should be possible to explain to the analyzer that the list
36	somehow "takes hold" of the values put into it. Initializer lists can also be
37	copied, which is a separate story that i'm not trying to address here.
38
39	The obvious approach to modeling ``std::initializer_list`` in a checker would be to
40	construct a SymbolMetadata for the memory region of the initializer list object,
41	which would be of type ``T*`` and represent ``begin()``, so we'd trivially model ``begin()``
42	as a function that returns this symbol. The array pointed to by that symbol
43	would be ``bindLoc()``ed to contain the list's contents (probably as a ``CompoundVal``
44	to produce less bindings in the store). Extent of this array would represent
45	``size()`` and would be equal to the length of the list as written.
46
47	So this sounds good, however apparently it does nothing to address our false
48	positives: when the list escapes, our ``RegionStoreManager`` is not magically
49	guessing that the metadata symbol attached to it, together with its contents,
50	should also escape. In fact, it's impossible to trigger a pointer escape from
51	within the checker.
52
53	Approach (1): If only we enabled ``ProgramState::bindLoc(..., notifyChanges=true)``
54	to cause pointer escapes (not only region changes) (which sounds like the right
55	thing to do anyway) such checker would be able to solve the false positives by
56	triggering escapes when binding list elements to the list. However, it'd be as
57	conservative as the current patch's solution. Ideally, we do not want escapes to
58	happen so early. Instead, we'd prefer them to be delayed until the list itself
59	escapes.
60
61	So i believe that escaping metadata symbols whenever their base regions escape
62	would be the right thing to do. Currently we didn't think about that because we
63	had neither pointer-type metadatas nor non-pointer escapes.
64
65	Approach (2): We could teach the Store to scan itself for bindings to
66	metadata-symbolic-based regions during scanReachableSymbols() whenever a region
67	turns out to be reachable. This requires no work on checker side, but it sounds
68	performance-heavy.
69
70	Approach (3): We could let checkers maintain the set of active metadata symbols
71	in the program state (ideally somewhere in the Store, which sounds weird but
72	causes the smallest amount of layering violations), so that the core knew what
73	to escape. This puts a stress on the checkers, but with a smart data map it
74	wouldn't be a problem.
75
76	Approach (4): We could allow checkers to trigger pointer escapes in arbitrary
77	moments. If we allow doing this within ``checkPointerEscape`` callback itself, we
78	would be able to express facts like "when this region escapes, that metadata
79	symbol attached to it should also escape". This sounds like an ultimate freedom,
80	with maximum stress on the checkers - still not too much stress when we have
81	smart data maps.
82
83	I'm personally liking the approach (2) - it should be possible to avoid
84	performance overhead, and clarity seems nice.
85
86	Gabor:
87
88	At this point, I am a bit wondering about two questions.
89
90	* When should something belong to a checker and when should something belong to the engine?
91	Sometimes we model library aspects in the engine and model language constructs in checkers.
92
93	* What is the checker programming model that we are aiming for? Maximum freedom or more easy checker development?
94
95	I think if we aim for maximum freedom, we do not need to worry about the
96	potential stress on checkers, and we can introduce abstractions to mitigate that
97	later on.
98	If we want to simplify the API, then maybe it makes more sense to move language
99	construct modeling to the engine when the checker API is not sufficient instead
100	of complicating the API.
101
102	Right now I have no preference or objections between the alternatives but there
103	are some random thoughts:
104
105	* Maybe it would be great to have a guideline how to evolve the analyzer and
106	follow it, so it can help us to decide in similar situations
107
108	* I do care about performance in this case. The reason is that we have a
109	limited performance budget. And I think we should not expect most of the checker
110	writers to add modeling of language constructs. So, in my opinion, it is ok to
111	have less nice/more verbose API for language modeling if we can have better
112	performance this way, since it only needs to be done once, and is done by the
113	framework developers.
114
115	Artem: These are some great questions, i guess it'd be better to discuss
116	them more openly. As a quick dump of my current mood:
117
118	* To me it seems obvious that we need to aim for a checker API that is both
119	simple and powerful. This can probably by keeping the API as powerful as
120	necessary while providing a layer of simple ready-made solutions on top of it.
121	Probably a few reusable components for assembling checkers. And this layer
122	should ideally be pleasant enough to work with, so that people would prefer to
123	extend it when something is lacking, instead of falling back to the complex
124	omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly
125	similar situation: matchers are not omnipotent, but they're so nice.
126
127	* Separation between core and checkers is usually quite strange. Once we have
128	shared state traits, i generally wouldn't mind having region store or range
129	constraint manager as checkers (though it's probably not worth it to transform
130	them - just a mood). The main thing to avoid here would be the situation when
131	the checker overwrites stuff written by the core because it thinks it has a
132	better idea what's going on, so the core should provide a good default behavior.
133
134	* Yeah, i totally care about performance as well, and if i try to implement
135	approach, i'd make sure it's good.
136
137	Artem:
138
139	> Approach (2): We could teach the Store to scan itself for bindings to
140	> metadata-symbolic-based regions during scanReachableSymbols() whenever
141	> a region turns out to be reachable. This requires no work on checker side,
142	> but it sounds performance-heavy.
143
144	Nope, this approach is wrong. Metadata symbols may become out-of-date: when the
145	object changes, metadata symbols attached to it aren't changing (because symbols
146	simply don't change). The same metadata may have different symbols to denote its
147	value in different moments of time, but at most one of them represents the
148	actual metadata value. So we'd be escaping more stuff than necessary.
149
150	If only we had "ghost fields"
151	(https://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have
152	been much easier, because the ghost field would only contain the actual
153	metadata, and the Store would always know about it. This example adds to my
154	belief that ghost fields are exactly what we need for most C++ checkers.
155
156	Devin:
157
158	In this case, I would be fine with some sort of
159	AbstractStorageMemoryRegion that meant "here is a memory region and somewhere
160	reachable from here exists another region of type T". Or even multiple regions
161	with different identifiers. This wouldn't specify how the memory is reachable,
162	but it would allow for transfer functions to get at those regions and it would
163	allow for invalidation.
164
165	For ``std::initializer_list`` this reachable region would the region for the backing
166	array and the transfer functions for begin() and end() yield the beginning and
167	end element regions for it.
168
169	In my view this differs from ghost variables in that (1) this storage does
170	actually exist (it is just a library implementation detail where that storage
171	lives) and (2) it is perfectly valid for a pointer into that storage to be
172	returned and for another part of the program to read or write from that storage.
173	(Well, in this case just read since it is allowed to be read-only memory).
174
175	What I'm not OK with is modeling abstract analysis state (for example, the count
176	of a NSMutableArray or the typestate of a file handle) as a value stored in some
177	ginned up region in the store. This takes an easy problem that the analyzer does
178	well at (modeling typestate) and turns it into a hard one that the analyzer is
179	bad at (reasoning about the contents of the heap).
180
181	I think the key criterion here is: "is the region accessible from outside the
182	library". That is, does the library expose the region as a pointer that can be
183	read to or written from in the client program? If so, then it makes sense for
184	this to be in the store: we are modeling reachable storage as storage. But if
185	we're just modeling arbitrary analysis facts that need to be invalidated when a
186	pointer escapes then we shouldn't try to gin up storage for them just to get
187	invalidation for free.
188
189	Artem:
190
191	> In this case, I would be fine with some sort of ``AbstractStorageMemoryRegion``
192	> that meant "here is a memory region and somewhere reachable from here exists
193	> another region of type T". Or even multiple regions with different
194	> identifiers. This wouldn't specify how the memory is reachable, but it would
195	> allow for transfer functions to get at those regions and it would allow for
196	> invalidation.
197
198	Yeah, this is what we can easily implement now as a
199	symbolic-region-based-on-a-metadata-symbol (though we can make a new region
200	class for that if we eg. want it typed). The problem is that the relation
201	between such storage region and its parent object region is essentially
202	immaterial, similarly to the relation between ``SymbolRegionValue`` and its parent
203	region. Region contents are mutable: today the abstract storage is reachable
204	from its parent object, tomorrow it's not, and maybe something else becomes
205	reachable, something that isn't even abstract. So the parent region for the
206	abstract storage is most of the time at best a "nice to know" thing - we cannot
207	rely on it to do any actual work. We'd anyway need to rely on the checker to do
208	the job.
209
210	> For std::initializer_list this reachable region would the region for the
211	> backing array and the transfer functions for begin() and end() yield the
212	> beginning and end element regions for it.
213
214	So maybe in fact for std::initializer_list it may work fine because you cannot
215	change the data after the object is constructed - so this region's contents are
216	essentially immutable. For the future, i feel as if it is a dead end.
217
218	I'd like to consider another funny example. Suppose we're trying to model
219
220	.. code-block:: cpp
221
222	std::unique_ptr. Consider::
223
224	void bar(const std::unique_ptr<int> &x);
225
226	void foo(std::unique_ptr<int> &x) {
227	int *a = x.get(); // (a, 0, direct): &AbstractStorageRegion
228	*a = 1; // (AbstractStorageRegion, 0, direct): 1 S32b
229	int *b = new int;
230	b = 2; // (SymRegion{conj_$0<int >}, 0 ,direct): 2 S32b
231	x.reset(b); // Checker map: x -> SymRegion{conj_$0<int *>}
232	bar(x); // 'a' doesn't escape (the pointer was unique), 'b' does.
233	clang_analyzer_eval(*a == 1); // Making this true is up to the checker.
234	clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker.
235	}
236
237	The checker doesn't totally need to ensure that ``*a == 1`` passes - even though the
238	pointer was unique, it could theoretically have ``.get()``-ed above and the code
239	could of course break the uniqueness invariant (though we'd probably want it).
240	The checker can say that "even if ``*a`` did escape, it was not because it was
241	stuffed directly into bar()".
242
243	The checker's direct responsibility, however, is to solve the ``*b == 2`` thing
244	(which is in fact the problem we're dealing with in this patch - escaping the
245	storage region of the object).
246
247	So we're talking about one more operation over the program state (scanning
248	reachable symbols and regions) that cannot work without checker support.
249
250	We can probably add a new callback "checkReachableSymbols" to solve this. This
251	is in fact also related to the dead symbols problem (we're scanning for live
252	symbols in the store and in the checkers separately, but we need to do so
253	simultaneously with a single worklist). Hmm, in fact this sounds like a good
254	idea; we can replace checkLiveSymbols with checkReachableSymbols.
255
256	Or we could just have ghost member variables, and no checker support required at
257	all. For ghost member variables, the relation with their parent region (which
258	would be their superregion) is actually useful, the mutability of their contents
259	is expressed naturally, and the store automagically sees reachable symbols, live
260	symbols, escapes, invalidations, whatever.
261
262	> In my view this differs from ghost variables in that (1) this storage does
263	> actually exist (it is just a library implementation detail where that storage
264	> lives) and (2) it is perfectly valid for a pointer into that storage to be
265	> returned and for another part of the program to read or write from that
266	> storage. (Well, in this case just read since it is allowed to be read-only
267	> memory).
268
269	> What I'm not OK with is modeling abstract analysis state (for example, the
270	> count of a NSMutableArray or the typestate of a file handle) as a value stored
271	> in some ginned up region in the store.This takes an easy problem that the
272	> analyzer does well at (modeling typestate) and turns it into a hard one that
273	> the analyzer is bad at (reasoning about the contents of the heap).
274
275	Yeah, i tend to agree on that. For simple typestates, this is probably an
276	overkill, so let's definitely put aside the idea of "ghost symbolic regions"
277	that i had earlier.
278
279	But, to summarize a bit, in our current case, however, the typestate we're
280	looking for is the contents of the heap. And when we try to model such
281	typestates (complex in this specific manner, i.e. heap-like) in any checker, we
282	have a choice between re-doing this modeling in every such checker (which is
283	something analyzer is indeed good at, but at a price of making checkers heavy)
284	or instead relying on the Store to do exactly what it's designed to do.
285
286	> I think the key criterion here is: "is the region accessible from outside
287	> the library". That is, does the library expose the region as a pointer that
288	> can be read to or written from in the client program? If so, then it makes
289	> sense for this to be in the store: we are modeling reachable storage as
290	> storage. But if we're just modeling arbitrary analysis facts that need to be
291	> invalidated when a pointer escapes then we shouldn't try to gin up storage
292	> for them just to get invalidation for free.
293
294	As a metaphor, i'd probably compare it to body farms - the difference between
295	ghost member variables and metadata symbols seems to me like the difference
296	between body farms and evalCall. Both are nice to have, and body farms are very
297	pleasant to work with, even if not omnipotent. I think it's fine for a
298	FunctionDecl's body in a body farm to have a local variable, even if such
299	variable doesn't actually exist, even if it cannot be seen from outside the
300	function call. I'm not seeing immediate practical difference between "it does
301	actually exist" and "it doesn't actually exist, just a handy abstraction".
302	Similarly, i think it's fine if we have a ``CXXRecordDecl`` with
303	implementation-defined contents, and try to farm up a member variable as a handy
304	abstraction (we don't even need to know its name or offset, only that it's there
305	somewhere).
306
307	Artem:
308
309	We've discussed it in person with Devin, and he provided more points to think
310	about:
311
312	* If the initializer list consists of non-POD data, constructors of list's
313	objects need to take the sub-region of the list's region as this-region In the
314	current (v2) version of this patch, these objects are constructed elsewhere and
315	then trivial-copied into the list's metadata pointer region, which may be
316	incorrect. This is our overall problem with C++ constructors, which manifests in
317	this case as well. Additionally, objects would need to be constructed in the
318	analyzer's core, which would not be able to predict that it needs to take a
319	checker-specific region as this-region, which makes it harder, though it might
320	be mitigated by sharing the checker state traits.
321
322	* Because "ghost variables" are not material to the user, we need to somehow
323	make super sure that they don't make it into the diagnostic messages.
324
325	So, because this needs further digging into overall C++ support and rises too
326	many questions, i'm delaying a better approach to this problem and will fall
327	back to the original trivial patch.
328

Clang Project