Use string views instead of passing std::wstring by const&

(giodicanio.com)

37 points | by Orochikaku 3 days ago

54 comments

  • delta_p_delta_x a day ago

    The zero-terminated string is by far C's worst design decision. It is single-handedly the cause for most performance, correctness, and security bugs, including many high-profile CVEs. I really do wish Pascal strings had caught on earlier and platform/kernel APIs used it, instead of an unqualified pointer-to-char that then hides an O(n) string traversal (by the platform) to find the null byte.

    There are then questions about the length prefix, with a simple solution: make this a platform-specific detail and use the machine word. 16-bit platforms get strings of length ~2^16, 32-bit platforms get 2^32 (which is a 4 GB-long string, more than 1000× as long as the entire Lord of the Rings trilogy), 64-bit platforms get 2^64 (which is ~10^19).

    Edit: I think a lot of commenters are focusing on the 'Pascalness' of Pascal strings, which I was using as an umbrella terminology for length-prefixed strings.

    • david2ndaccount a day ago

      Pascal strings might be the only string design worse than C strings. C strings at least let you take a zero copy substring of the tail. Pascal strings require a copy for any substring! Strings should be two machine words - length + pointer (aka what is commonly called a string view). This is no different than any other array view. Strings are not a special case.

      • Joker_vD a day ago

        Yeah, I too feel that storing the array's length glued to the array's data is not that good of an idea, it should be stored next to the pointer to the array aka in the array view. But the allure of passing around only a single pointer is quite strong.

        • Someone a day ago

          > I too feel that storing the array's length glued to the array's data is not that good of an idea, it should be stored next to the pointer to the array aka in the array view.

          That’s not cache-friendly, though. I think the short string optimization (keeping short strings alongside the string length, but allocating a separate buffer for longer strings. See https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=10... for how various C++ compilers implement that) may be the best option.

          • Joker_vD 20 hours ago

            > That’s not cache-friendly, though.

            How so? The string implementations in that post are pretty much that:

                struct string
                {
                    char* ptr;
                    size_t size;
                    union {
                        size_t capacity;
                        char buf[16];
                    };
                };

            The pointer and the size are stored together, and they may optionally be located right next to the string's actual data, but only for very small, locally-allocated, short-lived strings; in normal usage, that pointer points somewhere into the heap.

            • Someone 4 hours ago

              > they may optionally be located right next to the string's actual data, but only for very small, locally-allocated, short-lived strings

              Only for small strings. Locally allocated and short-lived aren’t required for the short string optimization to take effect.

              Also, I can’t find a good reference, but “only for small strings” in many programs means “for most strings”.

        • kstrauser a day ago

          Is there a reason for the string not to be a struct, so that you're still just passing around a pointer to that struct (or even just passing it by value)?

          • tczMUFlmoNk a day ago

            I might guess that GP is referring not to interface ergonomics (for which a struct is a perfectly satisfactory solution, as you describe), but to implementation efficiency. A pointer is one word. A slice / string view is two words: a length and a pointer. A pointer to a slice is one word, but requires an additional indirection. I personally agree that slices are probably the best all-around choice, but taking double the memory (and incurring double the register pressure, etc.) is a trade-off that's fair to mention.

      • delta_p_delta_x a day ago

        > C Strings at least let you take a zero copy substring of the tail

        This is a special-case optimisation that I'm happy to lose in favour of the massive performance and security benefits otherwise.

        Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.

        I think what was unsaid in your second point is that we really need to type-differentiate constant strings, dynamic strings, and string 'views', which Rust does in-language, and C++ does with the standard library. I prefer Rust's approach.

        • vlovich123 a day ago

          If I recall correctly, a Pascal string has the length before the string. I.e., to get the length you dereference the pointer and look backwards N bytes. A Pascal string is still a single pointer.

          You cannot cheaply take an arbitrary view of the interior string - you can only truncate cheaply (and oob checks are easier to automate). That’s why pointer + length is important because it’s a generic view. For arrays it’s more complicated because you can have a stride which is important for multidimensional arrays.

        • masklinn a day ago

          > Isn't length + pointer... Basically a Pascal string? Unless I am mistaken.

          Length + pointer is a record string; a Pascal string has the length at the head of the buffer, behind the pointer.

          • nopurpose a day ago

            Many years ago when reading Redis code I saw the same pattern: they pass around a simple pointer to data, but there is fixed-length metadata just before it.

            • masklinn a day ago

              I assume it’s either Antirez’s sds or a variant / ancestor thereof, yes. It stores a control block at the head of the string, but the pointer points past that block, so it has metadata but “is” a C string.

        • undefined a day ago
          [deleted]
        • LoganDark a day ago

          Pascal strings store the string's length by its data, whereas fat pointers store the length by the address of the data.

          The main difference is that if a string's length is by its data, you can't easily construct a pointer to part of that data without copying it into a new string, whereas if instead the length is by the data's address, you can cheaply construct pointers to any substring (by coming up with new length+address pairs) without having to construct entire new strings.

      • gizmo686 a day ago

        C strings also allow you to do a zero-copy split by replacing all instances of the delimiter with null (although you need to keep track of the end-of-list separately).

        • masklinn a day ago

          You also need to own the buffer otherwise you’re corrupting someone else’s data, or straight up segfaulting.

          • theamk 18 hours ago

            As long as you clearly document that the incoming data is going to be modified, it's not a problem. And in a lot of cases, the data either comes from the network or is read from the file - so the buffer is going to be discarded at the end anyway... why not reuse it?

            And yes, today it would be easier to make a copy of the data... but remember we are talking about 90's, where RAM is measured in megabytes and your L1 cache may be only 8KB or so.

      • hasley a day ago

        The "zero copy substring" in C is in general not a valid C string since it is not guaranteed to be zero-terminated. For both languages one could define a string view as a struct with a pointer plus size information. So, I do not see why Pascal is worse in this regard than C.

      • undefined a day ago
        [deleted]
      • theamk a day ago

        x86 had 6 general-purpose working registers total. Using length + pointers would have caused a lot of extra spills.

        • masklinn a day ago

          “Sure your software crashes and your machines get owned, but at least they’re not-working very fast!”

          • tialaramex a day ago

            Right. This is so often the excuse for terrible designs in C and C++. It's wrong, "But it's faster". No, it's just wrong, only for correct answers does it matter whether you were faster. If just any answer was fine there's no need to write any of this software.

    • theamk a day ago

      The first common 32-bit system was Win 95, which required 4MB of RAM (not GB!). The 4-byte prefix would be considered extremely wasteful in those times - maybe not for a single string, but anytime a list of strings is involved, such as a constants list. (As a point of reference, Turbo Pascal's default strings still had a 1-byte length field.)

      Plus, C-style strings allow a lot of optimizations - if you have a mutable buffer with data, you can make strings out of it with zero copy and zero allocations. strtok(3) is an example of such an approach, but I've implemented plenty of similar parsers back in the day. INI, CSV, JSON, XML - query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.

      Compared to this, Pascal strings would be incredibly painful to use... So you query file size, allocate, read it, and then what? A 1-byte length is too short, and for 2+ byte lengths, you need a secondary buffer to copy the string to. And how big should this buffer be? Are you going to be dynamically resizing it or wasting some space?

      And sure, _today_ I no longer write code like that, I don't mind dropping std::string into my code, it's just a meg or so of libraries and 3x overhead for short strings - but that's nothing these days. But back when those conventions were established, it was really, really important.

      • kstrauser a day ago

        > First common 32 bit system was Win 95

        We're just going to ignore Amigas, and any Unix workstations?

      • amluto a day ago

        > query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.

        I have also done this, but I would argue that, even at the time, the design was very poor. A much, much better solution would have been wide pointers — pass around the length of the string separately from the pointer, much like string_view or Rust’s &str. Then you could skip the NULL-writing part.

        Maybe C strings made sense on even older machines which had severely limited registers — if you have an accumulator and one register usable as a pointer, you want to minimize the number of variables involved in a computation.

      • delta_p_delta_x a day ago

        > zero copy and zero allocations

        This is a red herring, because when you actually read the strings out, you still need to iterate through the length for each string—zero copy, zero allocation, but linear complexity.

        > query file size, allocate buffer once, read it into the buffer, drop some NULL's into strategic positions, maybe shuffle some bytes around for that rare escape case, and you have a whole bunch of C strings, ready to use, and with no length limits.

        I write parsers in a very different way—I keep the file buffer around as read-only until the end of the pipeline, prepare string views into the buffer, and pipe those along to the next step.

        • theamk 18 hours ago

          I don't see what's "red herring" about it - for a reasonable format, any parsing will normally be O(n) complexity, so all we can do is decrease the constant factor.

          So _today_ I write parsers in a very different way as well - copying strings is very cheap (today), and avoiding it is not worth the extra complexity.

          But remember we are talking about the past, when those conventions were being established. And back in the 90's, zero copy and zero allocations were a real advantage. Not in the theoretical CS sense, but in a very practical one - remember there was _no_ "dynamically resizing vector" in C's (or Pascal's) stdlib, it's just raw malloc() and realloc(), and it was up to you to assemble a vector from them as needed. And free()/malloc() overhead was non-trivial, so you had to re-use and grow the buffer as needed. And if you want to store the parsed data, storing a separate length would double your index size! So a parse-in-place + null-terminated strings approach would give you both smaller code and smaller runtime, at the expense of a few sharp corners. But we were all running with scissors back then.

        • dh2022 a day ago

          I think the concern was conserving memory (which was scarce back then) and not iterating through each substring.

          • delta_p_delta_x a day ago

            I am very sceptical about that. Much safer and cleaner languages like ML and Lisp were contemporary to C, and were equally developed on memory-scarce hardware.

            • theamk 18 hours ago

              Maybe on the high-end machines in some fancy lab somewhere?

              All I saw were 386's and 486's, and I am pretty sure every piece of software I ever used was either C or Turbo Pascal or direct assembly. In the mid-90s, Java appeared and I remember how horribly slow those Java apps were compared to C/Pascal code.

            • kelnos a day ago

              They were also comparatively slow, no? And their runtimes used up much more of that scarce memory than a C program did.

          • priceishere a day ago

            But does it even conserve memory? Copying a string when you have the length is 2 bytes of machine code on x86 (rep movsb).

            Remember, code takes up memory too.

      • priceishere a day ago

        How do you drop nulls in the middle of a string without requiring O(N) extra space to restore the original characters?

      • dev-ns8 a day ago

        Besides my DA/Algo classes in College, I've never used C seriously. And you know, it's semantics like this that really make me go WTF lol....

        From strtok man page... "The first time that strtok() is called, str should be specified; subsequent calls, wishing to obtain further tokens from the same string, should pass a null pointer instead."

        Really?? a null pointer.. This is valid code:

          char str[] = "C is fucking weird, ok? I said it, sue me.";
          char *result = strtok(str, ",");
          char *res = strtok(NULL, ",");
        
        Why is that ok?

        • kelnos a day ago

          You have to understand the context, and the time period. Memory and CPU cycles were precious. All computers being 24/7 networked wasn't a thing, so security wasn't much of a concern. API design tended to reflect that.

          • dev-ns8 a day ago

            Not mentioned in my initial comment, but yeah, I'm viscerally aware of the effect the time period and the resources available at the time had on API design in C and other languages from that era.

            The null pointer in place of the operand here just seemed like a really good quirk to point out.

        • tialaramex a day ago

          It's like this because the 1970s C programmer, typically a single individual, is expected to maintain absolute knowledge of the full context of everything at all times. So these functions (the old non-re-entrant C functions) just assume you - that solo programmer - will definitely know which string you're currently tokenising and would never have say, a sub-routine which also needs to tokenize strings.

          All of this is designed before C11, which means that hilariously it's actually always Undefined Behaviour to write multi-threaded code in C. There are no memory ordering rules yet in the language, and if you write a data race (how could you not in multi-threaded code) then the Sequentially Consistent if Data Race Free proof, SC/DRF, does not apply and in C all bets are off if you lose Sequential Consistency.† So in this world that's enough, absolute mastery and a single individual keeping track of everything. Does it work? Not very well but hey, it was cheap.

          † This is common and you should assume you're fucked in any concurrent language which doesn't say otherwise. In safe Rust you can't write a data race so you're always SC, in Java losing SC is actually guaranteed safe (you probably no longer understand your program, but it does have a meaning and you could reason about it) but in many languages which say nothing it's game over because it was game over in C and they can't do better.

    • jmyeet a day ago

      The C string and C++'s backwards compatibility supporting it is why I think both C and C++ are irredeemable. Beyond the bounds overflow issue, there's no concept of ownership. Like if you pass a string to a C function, who is responsible for freeing it? You? The function you called? What if freeing it is conditional somehow? How would you know? What if an error prevents that free?

      C++ strings had no choice but to copy the underlying string because of this unknown ownership, and then added more ownership issues by letting you grab the naked pointer within to pass it to C functions. In fact, that's an issue with pretty much every C++ container, including the smart pointers: you can just call get() and break out of the lifecycle management in unpredictable ways.

      string_view came much later onto the scene and doesn't have ownership so you avoid a sometimes unnecessary copy but honestly it just makes things more complex.

      I honestly think that as long as we continue to use C/C++ for crucial software and operating systems, we'll be dealing with buffer overflow CVEs until the end of time.

      • hrmtst93837 a day ago

        Irredeemable is a bit much. C APIs often bury ownership in docs or naming, so callers guess whether the callee borrows the buffer or takes it, and that guess causes a lot of the pain.

        string_view helps, but only because it states "non-owning" in the type. You can still hand out a view into dead storage and get the same bug with nicer syntax.

  • Surac a day ago

    I work on embedded computers with mostly around 64K RAM using C99. Any form of alloc is forbidden. So I implemented a string lib that works with what are called views here. I hold length and content in preallocated arrays. Each string has exactly 127 characters and is also zero-terminated to fulfill C-API needs, and my tables can hold between 16 and 64 strings depending on the project. There is even a safety zero at index 127 enforced in every operation. This system allows for a fast, non-copy workflow, and ownership is always obvious: a string is not owned. I even have different "arenas" for different parts of the system that can be cleared independently. I use this approach also in a desktop context, albeit scaled up in length and number. This combines views, zero delimiters, ownership, and arena-like management altogether.

  • tialaramex a day ago

    Starting by defining the non-owning slice references, rather than the owning container types is such a massive advantage that it's very telling that Stroustrup didn't get this right in his language.

    Today Bjarne will insist that the correct understanding of ownership and lifetimes was always inherent in his language and if you point out that it basically only warrants a brief mention in his early books about the language he'll say that the newer books have much more about this, as if that's not a confession...

    Because &str (the non-owning reference to a slice of UTF-8 encoded text) in Rust is so ubiquitous, it's completely reasonable in Rust to use anything from the simple standard library owning String type, which is literally a growable array, Vec<u8> plus the promise about UTF-8 encoding, through to text which lives only on the stack as a re-interpreted array, or at the other extreme a raw pointer-sized short-string optimisation https://crates.io/crates/cold-string -- where unless it fits inline the string is length-prefixed on the heap like Pascal. All these approaches fit different niches and since a mere reference to the string is the same for all of them they're compatible. In C++ you will run into too many problems and regret trying.

    Edited: Correct that String lives in the stdlib -- it isn't baked in to the core language because your tiny embedded system may not have an allocator to go around creating fancy growable arrays, but it still might want the non-owning string slice, &str which truly is in the core language.

  • abcde666777 a day ago

    Man, I really don't miss working in C++. Used to be my daily driver until I ended up in C# land. I understand why C++ is the way it is, I understand why it's still around and the purposes it serves, but in terms of the experience of using the language... I wouldn't want to go back.

  • Panzerschrek a day ago

    System APIs requiring passing a null-terminated string are also painful to use from other languages, where strings are not null-terminated by default. They basically require taking a copy of a string and adding a null-terminator before performing a call.

  • chirsz a day ago

    It's best to avoid using std::wstring and other wchar_t-related facilities, as they are highly non-portable across different platforms. If you need to interact with the Win32 API, use char16_t and std::u16string, so that anyone knows it contains a UTF-16 encoded string and knows how to use and process it.

    • ack_complete a day ago

      The Windows API uses WCHAR = wchar_t, so if you use char16_t, you have to convert back and forth to avoid running afoul of strict aliasing rules. This imposes conversion costs without benefits; both using wchar_t directly or converting to/from UTF-8 are better.

  • breuwi a day ago

    [Deleted, misread]

    • Matheus28 a day ago

      Since C++11, data() is also required to be null terminated. Per your own source and cppreference.

      • breuwi a day ago

        LOL, I need to learn to click on the more modern tabs. Will delete comment.

    • beached_whale a day ago

      std::string since C++ 11 guarantees the buffer is zero terminated. The reasoning being thread safety of const members. https://eel.is/c++draft/basic.string#general-3

  • quotemstr a day ago

    It's usually the case that the more strident someone is in a blog post decrying innovation, the more wrong he is. The current article is no exception.

    It's possible to define your own string_view workalike that has a c_str() and binds to whatever is stringlike and has a c_str(). It's a few hundred lines of code. You don't have to live with the double indirection.

    • apparatur a day ago

      I think the article is trying to address the question from the title. Defining your own string_view workalike is probably possible, but I'm not sure if it's OK to use it in a public API, for example. Choosing to use const string & may be more suitable.

    • pwdisswordfishy a day ago

      Or wait until P3655 ships, which will bring std::wcstring_view.