Planet CQ

April 24, 2024

Things on a content management system - Jörg Hoh

AEM CS & Mongo exceptions

If you are an avid log checker on your AEM CS environments you might have come across messages like this in your authoring logs:

02.04.2024 13:37:42:1234 INFO [cluster-ClusterId{value='6628de4fc6c9efa', description='MongoConnection for Oak DocumentMK'}-cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017] org.mongodb.driver.cluster Exception in monitor thread while connecting to server cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net:27017 com.mongodb.MongoSocketException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net 
at com.mongodb.ServerAddress.getSocketAddresses(ServerAddress.java:211) [org.mongodb.mongo-java-driver:3.12.7]
at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:75) [org.mongodb.mongo-java-driver:3.12.7]
...
Caused by: java.net.UnknownHostException: cmp57428e1324330cluster-shard-00-02.2rgq1.mongodb.net

And you might wonder what is going on. I get this question every now and then, often with the assumption that this is something problematic. Because we have all learned that stacktraces normally indicate problems. And at first sight this does look like a problem: a specific hostname cannot be resolved. Is there a DNS problem in AEM CS?

Actually this message does not indicate any problem. The reason behind it is the way MongoDB implements scaling operations. If you up- or downscale the Mongo cluster, this does not happen in-place; instead you get a new Mongo cluster of the new size, with the same content. And this new cluster comes with a new hostname.

So in this situation there was a scaling operation: AEM CS connected to the new cluster and then lost the connection to the old cluster, because the old cluster was stopped and its DNS entry removed. Which is of course expected. And for that reason the message is logged on level INFO, and not as an ERROR.

Unfortunately this log message is created by the Mongo driver itself, so it cannot be changed on the Oak level, for example by removing the stacktrace or rewording the message. And for that reason you will continue to see it in the AEM CS logs, until an improved Mongo driver changes that.
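If you want to double-check that such entries are harmless, a quick scan of the log can help. Below is a minimal sketch in Python which counts the hostnames these messages refer to; the log file name and layout are assumptions based on the example above, and you should only see the hostname of the decommissioned cluster in the output:

    import re
    from collections import Counter

    # Matches the INFO message shown above and captures the unresolvable hostname.
    # The file name "aemerror.log" is an assumption for a locally downloaded log.
    pattern = re.compile(r"org\.mongodb\.driver\.cluster .*MongoSocketException: (\S+)")

    hosts = Counter()
    with open("aemerror.log", encoding="utf-8") as log:
        for line in log:
            match = pattern.search(line)
            if match:
                hosts[match.group(1)] += 1

    for host, count in hosts.most_common():
        print(f"{count:6d}  {host}")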

by Jörg at April 24, 2024 10:52 AM

March 04, 2024

Things on a content management system - Jörg Hoh

Performance test modelling (part 5)

This is part 5 and the final post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In the previous posts I discussed the impact of the system under test, how the modelling of the test and the test content influences the result of the performance test, and how you implement the most basic performance test scenario.

In this blog post I want to discuss the predicted result of a performance test versus its actual outcome, and what you can do when these do not match (actually they rarely do on the first execution). I also want to discuss the situation where, after go-live, you find that the performance test delivered the expected results but did not match the observed behavior in production.

Scenario 1: The performance test does not match the expected results

In my experience every performance test, no matter how good or bad the basic definition is, contains at least 2 relevant data points:

  1. the number of concurrent users (we discussed that already in part 1)
  2. and an expected result, for example that the transaction must be completed within N seconds.

What if you don’t meet the performance criteria in point 2? This is typically the time when customers on AEM as a Cloud Service start to raise questions to Adobe about the number of pods, hardware details etc., as if the problem can only be the hardware sizing on the backend. If you don’t have a clear understanding of all the implications and details of your performance tests, this often seems to be the most natural thing to ask.

But if you have built a good model for your performance test, your first task should be to compare the assumptions with the results. Do you have your expected cache-hit ratio on the CDN? Were some assumptions in the model overly optimistic or pessimistic? As you have actual data to validate your assumptions you should do exactly that: go through your list of assumptions and check each one of them. Refine them. And when you have done that, modify the test and start another execution.
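How such a check looks in practice depends on your CDN and its log format. As a minimal sketch (assuming a space-separated CDN log with a cache-status field containing values like HIT/MISS/PASS; the field position, file name and the assumed ratio are placeholders), you could compare the measured hit ratio with the value written down in your model:

    from collections import Counter

    ASSUMED_HIT_RATIO = 0.90            # the value documented in the test model

    def cache_hit_ratio(log_lines, status_field=8):
        """Count cache statuses and return (hit ratio, counts)."""
        counts = Counter()
        for line in log_lines:
            parts = line.split()
            if len(parts) > status_field:
                counts[parts[status_field]] += 1
        total = sum(counts.values())
        return (counts.get("HIT", 0) / total if total else 0.0), counts

    with open("cdn-access.log") as log:                 # hypothetical file name
        ratio, counts = cache_hit_ratio(log)

    print(f"measured hit ratio: {ratio:.2%} (assumed: {ASSUMED_HIT_RATIO:.0%})", counts)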

And at some point you might come to the conclusion that all assumptions are correct: you have the expected cache-hit ratio, but the latency of the cache misses is too high (in which case the required action is performance tuning of individual requests). Or you have already reduced the cache MISSes (and cache PASSes) to the minimum possible and the backend is still not able to handle the load (in which case the expected outcome is an upscale); or it can be both.

That’s fine, and then it’s the perfect time to talk to Adobe and share your test model, execution plan and results. I wrote in part 1:

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the test is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience.

But in this situation, when you have a good test model and have done your homework already, it’s possible to have a meaningful discussion directly, without the need to uncover all the hidden assumptions first. Also, if you have that model at hand, I assume that performance tests are not an afterthought, and that there are still reasonable options to make some changes which will either completely fix the situation or at least remediate the worst symptoms, without impacting the go-live and the go-live date too much.

So while this is definitely not the outcome we all work, design, build and ultimately hope for, it’s still much better than the second scenario below.

I hope that I don’t need to talk about unrealistic expectations in your performance tests, for example demanding a p99.9 of 200 ms latency while at the same time requiring that a good number of requests are always handled by the AEM backend. You should have detected such unrealistic assumptions much earlier, mostly during design and then in the first runs during the evolution phase of your test.

Scenario 2: After go-live the performance is not what it’s supposed to be

In this scenario a performance test was either not done at all (don’t blame me for it!) or the test passed, but its results did not match the observed reality. This often shows up as outages in production or unbearable performance for users. This is the worst-case scenario, because everyone assumed the contrary, as the performance test results were green. Neither the business nor the developer team is prepared for it, and there is no time for any mitigation. This normally leads to an escalated situation with conference calls, involvement from Adobe, and in general a lot of stress for all parties.

The entire focus is on mitigation, and we (I am speaking now as a member of the Adobe team, which is often involved in such situations) will try to do everything to mitigate that situation by implementing workarounds. As in many cases the most visible bottleneck is on the backend side, upscaling the backend is indeed the first task. And often this helps to buy you some time to perform other changes. But there are even cases where an upscale of 1000% would be required to somehow mitigate the situation (which is possible, but also very short-lived, as every traffic spike on top will require an additional 500% …); also it’s impossible to speed up a single-threaded request taking 20 seconds by adding more CPU. These cases are not easy to solve, the workaround often takes quite some time and is often very tailored; and there are cases where a workaround is not even possible. In any case it’s normally not a nice experience for any of the involved parties.

I refer to all of these actions as “workarounds“. In bold. Because they are not the solution to the challenge of performance problems. They cannot be a solution, because this situation proves that the performance test was testing some scenarios, but not the scenario which shows up in the production environment. It also raises valid concerns about the reliability of other aspects of the performance tests, and especially about the underlying assumptions. Anyway, we are all trying to do our best to get the system back on track.

As soon as the workarounds are in place and the situation is somehow mitigated, 2 types of questions will come up:

  1. What does a long-term solution look like?
  2. Why did that happen? What was wrong with the performance test and the test results?

While the response to (1) is very specific (and definitely out of scope of this blog post), the response to (2) is interesting. If you have a well-documented performance test model, you can compare its assumptions with the situation in which the production performance problem happened. You have the chance to spot the incorrect or missing assumption, adjust your model and then the performance test itself. And with that you should be able to reproduce your production issue in a performance test!

And if you have a failing performance test, it’s much easier to fix the system and your application, and to apply specific changes which make this failing test pass. It gives you much more confidence that you changed the right things to make the production environment handle the same situation in a much better way. Interestingly, this also answers question (1) to a large extent.

If you don’t have such a model in this situation, you are badly off. Because then you either start building the performance test model and the performance test from scratch (which takes quite some time), or you switch to the “let’s test our improvements in production” mode. Most often the production testing approach is used (along with some basic testing on stage to avoid making the situation worse), but even that takes time and a high number of production deployments. While you can say it’s agile, others might say it’s chaos and hoping for the best… the actual opposite of good engineering practice.

Summary

In summary, when you have a performance test model, you are likely to have fewer problems when your system goes live. Mostly because you have invested time and thought into that topic. And because you acted on it. It will not prevent you from making mistakes, forgetting relevant aspects and such, but if that happens you have a good basis to quickly understand the problem and a good foundation to solve it.

I hope that you learned in these posts some aspects of performance tests which will help you to improve your test approach and test design, so you ultimately have fewer unexpected problems with performance. And if you have fewer problems with that, my life in the AEM CS engineering team is much easier 🙂

Thanks for staying with me throughout this first planned series of blog posts. It was a bit of an experiment, although the structure required by this topic led to some interesting additions to the overall outline (the first outline covered just 3 posts, now we are at 5). But I think even that is not enough; some aspects deserve a blog post of their own.

by Jörg at March 04, 2024 09:22 PM

February 26, 2024

Things on a content management system - Jörg Hoh

Performance test modelling (part 4)

This is the 4th post of the blog post series about performance test modelling; see part 1 for an overview and the links to all articles of this series.

In the parts 2 and 3 I outlined relevant aspects when it comes to model your performance tests:

  • The modelling of the expected load, often expressed as “concurrent users”.
  • The realistic modelling of the system where we want to conduct the performance tests, mostly regarding the relevant content and data.

In this blog post I want to show how you deduce from that data which specific scenarios you should cover with performance tests. Because there is no single test which tells you whether the resulting end-user performance is good or not.

The basic performance test scenario

Let’s start with a very simple model, where we assume that the traffic rate is more or less identical for the whole day; and therefore the performance test resembles that model:

At first sight this is quite simple to model, because your performance test will execute requests at a constant rate for the whole period of time.

But as I outlined in part 3, even if it seems that simple, you have to include at least some background noise. You also have to take into account that the cache-hit ratio is poor at the beginning, so you have to implement a cache-warmup phase (normally implemented as a ramp-up phase, in which the load increases up to the planned plateau) and only start to measure after that.

So our revised plan rather looks like this:

Such a test execution (with the proper modelling of users, requests and requested data) can give you pretty good results if your model assumes a pretty constant load.

What if your model requires a much more fluctuating request rate, for example if your users/visitors are primarily located in North America, you have almost no traffic during the night, and traffic starts to increase heavily in the American morning hours? In that case you probably model the warmup so that it resembles the morning increase in traffic, both in frequency and rate. That shouldn’t be hard, but it requires a bit more explicit modelling than just a simple ramp-up.
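To make this a bit more concrete, here is a minimal sketch of such a ramp-up-then-plateau load shape, assuming Locust as the load testing tool; the URL, user count and durations are placeholders which have to come from your model:

    from locust import HttpUser, LoadTestShape, between, task

    class SiteVisitor(HttpUser):
        wait_time = between(1, 5)        # think time between page loads

        @task
        def browse(self):
            self.client.get("/")          # placeholder; use pages from your model

    class RampUpThenPlateau(LoadTestShape):
        warmup = 30 * 60                  # 30 minutes ramp-up / cache warming
        plateau = 2 * 60 * 60             # 2 hours at the planned plateau
        target_users = 200                # placeholder "concurrent users"

        def tick(self):
            run_time = self.get_run_time()
            if run_time < self.warmup:
                users = max(1, int(self.target_users * run_time / self.warmup))
                return users, 10
            if run_time < self.warmup + self.plateau:
                return self.target_users, 10
            return None                   # stop the test

A morning-traffic-style warmup is then just a different curve in the tick() method.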

To give you some practical hints towards some basic parameters:

  • Such a performance test should run at least 2-3 hours, and even if you see that the results are not what you expect, not terminating it can reveal interesting results.
  • The warmup phase should cover at least 30 minutes; not only to give the caches time to warm up, but also to give the backend systems time to scale to their “production sizing”; when you don’t execute performance tests all the time, the system might have scaled down, because there is no sense in having many systems idle when there is no load.
  • It can make sense not to start with 100% of the targeted load, but with smaller numbers and increase from there. Because only then can you see which bottleneck your test hits first. If you start right away with 100% you might just see a lot of things blocking, but you don’t know which one is the most impeding.
  • When you are implementing a performance test in the context of AEM as a Cloud Service, I recommend also using my checklist for performance testing on AEM CS, which gives some more practical hints on how to get your tests right; a few aspects covered there are covered in more depth in this post series as well.

When you have such a test passing, the biggest part of the work is done; and based on your model you can execute a number of different tests to answer more questions.

Variations of the basic performance test

The above model just covers a totally average day. But of course it’s possible to vary the created scenario to answer some more questions:

  • What happens if the load of the day is not 100%, but for some reasons 120%, with identical assumptions about user behavior and traffic distribution? That’s quite simple, because you just increase a number in the performance test.
  • The basic performance test runs just for a few hours and stops then. It gives you the confidence that the system can operate for at least that many hours, but a few issues might go unnoticed. For example memory leaks accumulating over time might only become visible after many hours of load. For that reason it makes sense to run your test for 24-48 hours continuously to validate that there is no degradation over that time.
  • What’s the behavior when the system goes into overload? An interesting question (but only if it does not break already when hitting the anticipated load) which is normally answered by a break test: you increase the load more and more, until the situation really gets out of hand (see the sketch below). If you have enough time, that’s indeed something you can try, but let’s hope that’s not very relevant 🙂
  • How does the system behave when your backend systems are not available? What if they come online again?

And probably many more interesting scenarios which you can think of. But you should only perform these when you have the basic test right.
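For the break test mentioned in the list above, the same load-shape idea can be reused: instead of holding a plateau you keep adding users step by step until the system gives in. A minimal sketch, again assuming Locust (reusing a user class like the SiteVisitor from the earlier sketch), with all numbers as placeholders:

    from locust import LoadTestShape

    class StepwiseBreakTest(LoadTestShape):
        step_users = 50                  # users added per step
        step_duration = 10 * 60          # hold each step for 10 minutes
        max_users = 2000                 # safety limit so the test eventually stops

        def tick(self):
            step = int(self.get_run_time() // self.step_duration) + 1
            users = step * self.step_users
            if users > self.max_users:
                return None              # stop once the safety limit is reached
            return users, 20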

When you have your performance tests passing, the question is still: How does it compare to production load? Are we actually testing the right things? In the next and last post of this series I will cover the options you have when the performance test does not match the expected results, and also the worst-case scenario: What happens if you find out after go-live that your performance tests were good, but the production environment behaves very differently?

by Jörg at February 26, 2024 01:53 PM

February 20, 2024

Things on a content management system - Jörg Hoh

CDN and dispatcher – 2 complementary caching layers

I sometimes hear the question how to implement cache invalidation for the CDN. Or the question is why AEM CS still operates with a dispatcher layer when it now has a more powerful CDN in front of it.

The questions are very different, but the answer is the same in both cases: the CDN is no replacement for the dispatcher, and the dispatcher does not replace the CDN. They serve different purposes, and the combination of the two can be a really good package. Let me explain this.

The dispatcher is a very traditional cache. It fronts the AEM systems and the cache status is actively maintained by cache invalidation, so it always delivers current data. But from an end-user perspective this cache is often far away in terms of network latency. If my AEM systems are hosted in Europe, and end-users from Australia are reaching them, the latency can get huge.

The CDN is the opposite: it serves content from many locations across the world, as close to the end-user as possible. But CDN cache invalidation is cumbersome, and for that reason most often TTL-based expiration is used. That means you have to accept the chance that new content is already available, but the CDN still delivers old content.

Not everyone is happy with that; and if that’s a real concern, short TTLs (in the range of a few minutes) are the norm. That means that many files on the CDN will get stale every few minutes, which results in cache misses; and a cache miss on the CDN goes back to origin. But of course the reality is that not many pages change every 10 minutes; actually very few. But customers want that low TTL just in case a page was changed, and that change needs to get visible to all end-users as soon as possible.

So you have a lot of cache misses on the CDN, which trigger a re-fetch of the file from origin, and because many of the files have not changed, you re-fetch exactly the same binary which got stale seconds ago. Actually a waste of resources, because your origin system delivers the same content over and over again to the CDN as a consequence of these misses. So you could keep your AEM instances busy all the time, re-rendering the same requests over and over, always creating the same response.

Enter the dispatcher cache, fronting the actual AEM instance. If the file has not changed, the dispatcher will deliver the same file (or just HTTP 304 Not Modified, which even avoids sending the content again). And it’s fast, much faster than letting AEM render the same content again. And if the file has actually changed, it’s rendered once and then reused for all future CDN cache misses.
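To illustrate what such a revalidation between CDN and origin looks like, here is a minimal sketch using Python's requests library against a placeholder URL; whether you actually get an ETag or Last-Modified header depends on your setup:

    import requests

    url = "https://www.example.com/content/site/en.html"   # placeholder URL

    first = requests.get(url)
    etag = first.headers.get("ETag")

    # A CDN revalidating a stale object sends a conditional request; if the file
    # has not changed, the dispatcher can answer "304 Not Modified" without
    # re-rendering in AEM and without sending the body again.
    second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
    print(second.status_code)   # 304 if unchanged (and an ETag was sent), otherwise 200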

The combination of these 2 caching approaches helps you to deliver content from the edge while at the same time having a reasonable latency for content updates (that means the time between replicating a change to the publish instances until all users across the world can see it), without the need for a huge number of AEM instances in the background.

So as a conclusion, using the CDN and the dispatcher cache is a good combination, if set up properly.

by Jörg at February 20, 2024 05:22 PM

February 09, 2024

Things on a content management system - Jörg Hoh

Performance tests modelling (part 3)

This is post 3 in my series about Performance Test Modelling. See the first post for an overview of this topic.

In the previous 2 posts I discussed the importance of having a clearly defined model of the performance tests, and that a good definition of the load factors (typically measured in “concurrent users”) is required to build a realistic test.

In this post I cover the influence of the test system and test data on the performance test and its result, and why you should spend effort to create a test with a realistic set of data/content. We will do a few thought experiments, and to judge the result of each experiment we will use the cache-hit ratio of a CDN as a proxy metric.

Let’s design a performance test for a very simple site: It just consists of 1 page, 5 images and 1 CSS and 1 JS file; 8 files in total. Plus there is a CDN for it. So let’s assume that we have to test with 100, 500 and 1000 concurrent users. What’s the test result you expect?

Well, easy. You will get the same test result for all tests irrespective of the level of concurrency; mostly because after the first requests all files will be delivered from the CDN. That means no matter with what concurrency we test, the files are delivered from the CDN, which we assume will always deliver very fast. We do not test our system, but rather the CDN, because the cache-hit ratio is quite close to 100%.

So what’s the reason to do this test at all, knowing that it just validates the performance promises of the CDN vendor? There is none. The only reason why we would ever execute such a test is that during test design we did not pay attention to the data which we use to test, and someone decided that these 8 files are enough to satisfy the constraints of the performance test. But the results do not tell us anything about the performance of the site, which in production will consist of tens of thousands of distinct files.

So let us do a second thought experiment: this time we test with 100,000 files, 100 concurrent users requesting these files randomly, and a CDN which is configured to cache files for 8 hours (TTL=8h). With regard to the cache-hit ratio, what is the expectation?

We expect that the cache-hit ratio starts low for quite some time; this is the cache-warming phase. Then it starts to increase, but it will never hit 100%, as after some time cache entries will expire on the cache and start to produce cache misses. This is a much better model of reality, but it still has a major flaw: in reality, requests are not randomly distributed; normally there are hotspots.

A hotspot consists of files which are requested much more often than average. Normally these are homepages or other landing pages, plus other pages which users are normally directed to. This set of files is usually quite small compared to the total number of files (in the range of 1-2%), but it makes up 40-60% of the overall requests; you can easily assume a Pareto distribution (the famous 80/20 rule), where 20% of the files are responsible for 80% of the requests. That means we have a hotspot and a long-tail distribution of the requests.

If we modify the same performance test to take that distribution into account, we end up with a higher cache-hit ratio, because now the hotspot can be delivered mostly from the CDN. On the long tail we will have more cache misses, because those files are requested so rarely that they can expire on the CDN without being requested again. But in total the cache-hit ratio will be better than with the random distribution, especially on the often-requested pages (which are normally the ones we care about most).
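This kind of thought experiment can even be made executable. The following is a rough simulation sketch (all numbers are assumptions matching the examples above, not measurements) which compares the cache-hit ratio of an 8-hour TTL cache under a purely random request distribution with a hotspot distribution where half of the requests go to 1% of the files:

    import random

    def simulate(num_files=100_000, requests_per_hour=50_000, hours=24,
                 ttl_hours=8.0, hotspot=False, seed=42):
        """Very rough model of a TTL-based CDN cache; returns the cache-hit ratio."""
        rng = random.Random(seed)
        expires = {}                  # file id -> hour at which the cached copy expires
        hits = misses = 0
        for step in range(hours * requests_per_hour):
            now = step / requests_per_hour
            if hotspot and rng.random() < 0.5:
                file_id = rng.randrange(num_files // 100)   # hotspot: 1% of the files
            else:
                file_id = rng.randrange(num_files)          # long tail: any file
            if expires.get(file_id, -1.0) > now:
                hits += 1
            else:
                misses += 1
                expires[file_id] = now + ttl_hours
        return hits / (hits + misses)

    print("random  :", round(simulate(hotspot=False), 3))
    print("hotspot :", round(simulate(hotspot=True), 3))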

Let’s translate this into a graph which displays the response time.

This test is now quite realistic, and if we only focus on the 95th percentile (p95; that means that if we take 100 requests, 95 of them are faster than this value) the result would meet the criteria; but beyond that the response time gets higher, because there are a lot of cache misses.
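As a side note, the percentile itself is easy to compute from the measured response times; a tiny sketch with made-up numbers, where 95% of the samples are fast cache hits and the slowest 5% are cache misses:

    def p95(latencies_ms):
        """Return the value below which 95% of the measured response times fall."""
        ordered = sorted(latencies_ms)
        index = max(0, int(len(ordered) * 0.95) - 1)
        return ordered[index]

    # 95 fast responses (cache hits) and 5 slow ones (cache misses), all made up
    samples = ([90, 95, 100, 105, 110, 115, 120, 125, 130, 135,
                140, 145, 150, 155, 160, 165, 170, 175, 180] + [2400]) * 5
    print(p95(samples), "ms")      # prints 180 ms; the slow 5% stay invisible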

This level of realism in the test results comes with a price: the performance test model as well as the test preparation and execution are much more complicated now.

Until now we only considered users, but what happens when we add random internet noise and the search engines (the unmodelled users from an earlier part of this series) to the scenario? These will add more (relative) weight to the long tail, because these requests do not necessarily follow the usual hotspots; we rather have to assume a more random distribution for them.

That means that the cache-hit ratio will be lower again, as there will be many more cache misses now; and of course this will also increase the p95 response time. And: it will complicate the model even further.

So let’s stop here. As I have outlined above, the simplest model is totally unrealistic, but making it more realistic makes the model more complex as well. And at some point the model is no longer helpful, because we cannot transform it into a test setup without too much effort (creating test data/content, complex rules to implement the random and hotspot-based requests, etc). That means that especially in the case of test data and test scenarios we need to find the right balance between the investment we want to make into tests and how closely they should mirror reality.

I also tried to show you how far you can get without doing any kind of performance test. Just based on some assumptions we were able to build a basic understanding of how the system will behave, and how some changes of the parameters will affect the result. I use this technique a lot, and it helps me to quickly refine models and define the next steps or the next test iteration.

In the next post I will discuss various scenarios which you should consider in your performance test model, including some practical recommendations on how to include them in your test.

by Jörg at February 09, 2024 01:57 PM

February 01, 2024

Things on a content management system - Jörg Hoh

Performance tests modelling (part 2)

This is the second blog post in the series about performance test modelling. You can find the overview of this series and links to all its articles in the post “Performance tests modelling (part 1)“.

In this blog post I want to cover the aspect of “concurrent users”, what it means in the context of a performance test and why it’s important to clearly understand its impact.

“Concurrent users” is an often-used measure to indicate the load put on a system, expressed as the number of users using that system at the same time. For that reason many performance tests state as their quantitative requirement: “The system should be able to handle 200 concurrent users”. While that seems to be a good definition at first sight, it leaves many questions open:

  • What does “concurrent” mean?
  • And what does “user” mean?
  • Are “200 concurrent users” enough?
  • Do we always have “200 concurrent users”?

Definition of concurrent

Let’s start with the first question: What does “concurrent” really mean on a technical level? How can we measure that our test indeed does “200 concurrent users” and not just 20 or 1000?

  • Are there any server-side sessions which we can count and which directly give us this number? And do we set up our test in a way to hit that number?
  • Or do we have to rely on vaguer definitions like “users are considered concurrent when they do a page load less than 5 minutes apart”? And do we design our test accordingly?

Actually it does not matter at all which definition you choose. It’s just important that you explicitly state which definition you use, and which metric you use to verify that you hit that number. This is an important definition when it comes to implementing your test.

And as a side note: many commercial tools have their own definition of “concurrent”, and here the exact definition does not matter either, as long as you are able to articulate it.
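For illustration, here is a minimal sketch of the second definition above (users are concurrent if they loaded a page within the last 5 minutes), computed from a list of (timestamp, user id) request records; the records and the field layout are made up:

    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)

    def concurrent_users(requests, at):
        """Count users with at least one page load in the 5 minutes before 'at'."""
        return len({uid for ts, uid in requests if at - WINDOW <= ts <= at})

    # made-up request records, e.g. parsed from an access log
    log = [
        (datetime(2024, 2, 1, 10, 0), "user-a"),
        (datetime(2024, 2, 1, 10, 2), "user-b"),
        (datetime(2024, 2, 1, 10, 8), "user-a"),
    ]
    print(concurrent_users(log, datetime(2024, 2, 1, 10, 4)))   # -> 2
    print(concurrent_users(log, datetime(2024, 2, 1, 10, 9)))   # -> 1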

What is a user?

The next question is about “the user” which is modeled in the test; to simplify the test and the test executions, one or more “typical” user personas are created which visit the site and perform some actions. That is definitely helpful, but it’s just that: a simplification, because otherwise our model would explode because of the sheer complexity and variety of user behavior. Also, sometimes we don’t even know what a typical “user” does on our site, because the system will be brand-new.

So this is a case where we have a huge variance in the behavior of the users, which we should outline as a risk in our model: the model is only valid if the majority of the users behave more or less as we assumed.

But is this all? Do really all users perform at least 10% of the actions we assume they do?

Let’s brainstorm a bit and try to find answers for these questions:

  • Does the google bot behave like that? All the other bots of the search engines?
  • What about malware scanners which try to hit a huge list of WordPress/Drupal/… URLs on your site?
  • Other systems performing (random?) requests towards your site?

You could argue that this traffic has little or no business value, and for that reason we don’t test for it. It could also be assumed that this is just a small fraction of the overall traffic and can be ignored. But that is just an assumption, and nothing more. You just assume that it is irrelevant. But often these requests are not irrelevant, not at all.

I encountered cases where it was not the “normal users” who brought down a system, but rather this non-normal type of “user”. One example: the custom 404 handler was very slow, so the basic undocumented assumption “We don’t need to care about 404s, as they are very fast” was violated and brought down the site. All performance tests passed, but the production system failed nevertheless.

So you need to think about “user” in a very broad sense. And even if you don’t implement the constant background noise of the internet in your performance test, you should list it as a factor. If you know that a lot of this background noise will trigger an HTTP status code 404, you are more likely to check that this 404 handler is fast.
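A quick way to check that assumption is to request a handful of URLs which are guaranteed not to exist and look at the response times; a small sketch, with the host name as a placeholder:

    import time
    import uuid

    import requests

    BASE = "https://www.example.com"          # placeholder host

    for _ in range(5):
        url = f"{BASE}/{uuid.uuid4()}.html"   # guaranteed to not exist
        start = time.monotonic()
        response = requests.get(url)
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{response.status_code}  {elapsed_ms:6.1f} ms  {url}")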

Are “200 concurrent users” enough?

One piece of information every performance test has is the number of concurrent users which the system must be able to handle. But even if we assume that “concurrent” and “users” are both well defined, is this enough?

First, what data is this number based on? Is it based on data derived from another system which the new system should replace? That’s probably the best data you can get. Or, when you build a new system, is it based on good marketing data (which would be okay-ish), on assumptions about the expected usage, or just on numbers we would like to see (because we assume that a huge number of concurrent users means a large audience and a high business value)?

So this is probably the topic which will be discussed the most. But the number, and the way that number is determined, should be challenged and vetted, because it’s one of the corner-stones of the whole performance test model. It does not make sense to build a high-performance and scalable system when afterwards you find out that the business numbers were grossly overrated, and a smaller and cheaper solution would have delivered the same results.

What about time?

A more important aspect, which is often overlooked, is timing: how many users are working on the site at any given moment? Do you need to expect the maximum number for 8 hours every day, or just during the peak days of the year? Do you have more or less constant usage, or usage only during business hours in Europe?

This heavily depends on the type of your application and the distribution of your audience. If you build an intranet site for a company located only in Europe, the usage during the night is pretty much zero; it will start to increase at 06:00 in the morning (probably the Germans going to work early :-)), hit the maximum usage between 09:00 and 16:00 and go back to zero at the latest at 22:00. The contrast to that is a site visited world-wide by customers, where we can expect a higher and almost flat line, of course with variations depending on the number of people being awake.

This influences your tests as well, because in both cases you don’t need to simulate spikes, meaning a 500% increase of users within 5 minutes. On the other hand, if you plan for large marketing campaigns addressing millions of users, this might be exactly the situation you need to plan and test for. Not to mention if you book a slot during the Super Bowl break.

Why is this important? Because you only need to test scenarios which you expect to see in production, and you can ignore scenarios which don’t have any value for you. For example it’s a waste of time and investment to test for a sudden spike in the above-mentioned intranet case for the European company, while for marketing campaigns it’s essential to test a scenario where such a spike comes on top of the normal traffic.
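One way to make such a profile explicit in your model is to write down, per hour of the day, which fraction of the daily page views you expect, and derive the request rate your test has to produce. A small sketch with made-up numbers for the intranet example:

    # Hypothetical hourly profile for a Europe-only intranet: fraction of the
    # daily page views expected in each hour (the 24 values add up to 1.0).
    hourly_share = [0, 0, 0, 0, 0, 0, 0.01, 0.04, 0.08, 0.11, 0.11, 0.10,
                    0.08, 0.10, 0.11, 0.11, 0.08, 0.04, 0.02, 0.01, 0, 0, 0, 0]

    daily_page_views = 500_000                # made-up business number

    for hour, share in enumerate(hourly_share):
        rate = daily_page_views * share / 3600    # requests per second in that hour
        print(f"{hour:02d}:00  {rate:7.1f} req/s")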

Summary

“N concurrent users” itself is not much information; and while it can serve as input, your performance test model should contain a more detailed understanding of that definition and what it means to the performance test. Otherwise you will focus just on a given number of users of this idealistic type and ignore every other scenario and case.

In the next blog post I will cover how the system and the test data itself will influence the result of the performance test.

by Jörg at February 01, 2024 06:26 PM

January 26, 2024

Things on a content management system - Jörg Hoh

Performance tests modelling (part 1)

In my last blog post about performance tests I outlined best practices for building and executing a performance test with AEM as a Cloud Service. But I intentionally left out a huge aspect of the topic:

  • What should your test look like?
  • What is a realistic test?
  • And what can a test result tell you about the behavior of your production environment?

These are hard questions, and I often find that they are not asked. Or people are not aware that these questions should be asked.

This is the first post in a series of blog posts in which I want to dive a bit deeper into performance testing in the context of AEM and AEM CS (many aspects can probably be generalized to other web applications as well). Unlike my other blog posts it addresses topics on a higher level (I will not refer to any AEM functionality or API, and won’t even mention AEM that often), because I learned over time that very often performance tests are done based on a lot of assumptions. And it is very hard to discuss the details of a performance test if these assumptions are not documented explicitly. I had such discussions in these 2 contexts:

  • The result of a performance test (in AEM as a Cloud Service) is poor and the customer wants to understand what Adobe will do.
  • After golive severe performance problems show up on production; and the customer wants to understand how this can happen as their tests showed no problems.

As you can imagine, if I am given just a few diagrams with test results and test statistics as preparation for this call with the customer … this is not enough, and very often more documentation about the tests is not available. Which often leads to a lot of discussions about some very basic things and that adds even more delay to an already late project and/or bad customer experience. So you can also consider this blog series as a kind of self-defense. If you were asked to read this post, now you know 🙂

I hope that this series will also help you improve your way of doing performance tests, so we all will have less of these situations to deal with.

This post series consists of these individual posts:

And a word upfront about the term “performance test”: I summarize a number of different test types under that term, which are executed with different intentions and which come with many names: “performance tests”, “load tests”, “stress tests”, “endurance tests”, “soak tests”, and many more. Their intention and execution differ, but in the end they can all benefit from the same questions which I want to cover in this blog series. So if you read “performance test”, all of these other tests are meant as well.

What is a performance test? And why do we do them?

A performance test is a tool to predict the future, more specifically how a certain system will behave in a more-or-less defined scenario.

And that already outlines two problems which performance tests have.

  • It is a prediction of the future. Unlike a science experiment it does not try to understand the present and extrapolate into the future. It does not have the same quality as “tomorrow we will have a sunrise, even if the weather is clouded”, but rather goes into the direction of “if my 17 year old son wants to celebrate his birthday party with his friends at home, we better plan a full cleaning of the house for the day after”. That means no matter how well you know your son and his friends (or the system you are building), there is still an element of surprise and unknown in it.
  • The scenario which we want to simulate is somehow “defined”. In quotes, because in many cases the definitions of that scenario are pretty vague. We normally base these definitions on previous experience and some industry best practices.

So it’s already clear from these 2 items, that this prediction is unlikely to be exact and 100% accurate. But it does not need to be accurate, it just needs to be helpful.

A performance test is helpful if it delivers better results than our gut feeling; and the industry has learned that our gut feeling is totally unreliable when it comes to the behaviour of web applications under production load. That’s why many enterprise go-live procedures require a performance test, which will always deliver a more reliable result than gut feeling. But just creating and executing a performance test does not make it a helpful performance test.

So a helpful performance test is also a test which mimics reality closely enough that you don’t need to change your plans immediately after your system goes live and hits reality. Unfortunately you only know whether your performance test was helpful after you went live. It shares this situation with other test approaches; for example 100% unit-test coverage does not mean that your code has no bugs, it’s just less likely.

What does that mean for performance tests and their design?

First, a performance test is based on a mental model of your system and the to-be reality, which must be documented. All its assumptions and goals should be explicitly documented, because only then a review can be done. And a review helps to uncover blind spots in our own mental model of the system, its environment and the way how it is used. It helps to clearly outline all known factors which influence the test execution and also its result.

Without that model it is impossible to compare the test result with reality and to understand which factor or aspect in the test was missing, misrepresented or not fully understood, and led to a gap between test result and reality. If you don’t have a documented model, it’s possible to question everything, from the model to the correct test execution and the results. If you don’t have a model, the result of a performance test is just a PDF with little to no meaning.

Also you must be aware that this mental model is a massive simplification, as it is impossible to factor in all aspects of the reality, also because the reality changes every day. You will change your application, new releases of AEM as a Cloud Service will be deployed, you add more content, and so on.

Your mental model will never be complete and probably also never be up-to-date, and that will be reflected in your performance test. But if you know that, you can factor it in. For example, you might know that in 3 months’ time the amount of content will have doubled, and you can decide whether it’s helpful to redo the performance test with changed parameters. It’s now a “known unknown”, and no longer an “unknown unknown”. You can even decide to ignore factors if they do not seem relevant to you, but of course you should document that.

When you have designed and documented such a model, it is much easier to implement the test, execute it and reason about the results. Without such a model, there is much more uncertainty in every piece of the test execution. It’s like developing software without a clear and shared understanding of what exactly you want to develop.

That’s enough for this post. As promised, this is more abstract than usual, but I hope you liked it and it helps to improve your tests. In the next blog posts I will look into a few relevant aspects which should be covered by your model.

by Jörg at January 26, 2024 05:27 PM

January 23, 2024

CQ5 Blog - Inside Solutions

12 Steps to Migrate AEM from On-Premise to the Cloud

Is your organization harnessing the full potential of Adobe Experience Manager (AEM) to deliver exceptional digital experiences across various channels?

If you currently run your websites on AEM On-Premise or rely on Adobe’s Managed Services, it’s time to embark on a journey into the Cloud.

In 2020, Adobe introduced the next generation of its CMS with AEM as a Cloud Service. It’s time for you and your company to prepare for this transition and embrace this next-generation CMS in the Cloud.

At One Inside – A Vass Company, we’ve worked with several large enterprises, helping them move from AEM on-premise to AEM Cloud Service, executing seamless migrations in less than three months.

Within this comprehensive guide, our AEM experts have compiled their knowledge and address the following questions:

  • How can you smoothly transition from AEM on-premise to AEM Cloud?
  • What are the critical steps for a successful migration to AEM Cloud?
  • What common pitfalls should you avoid?

But before delving into the steps, let’s explore the compelling advantages of embracing AEM Cloud.

What are the benefits of moving to AEM Cloud?

As with any enterprise project, it is essential to demonstrate the clear benefits of migrating your AEM installations to the Cloud to your organization and board.

Let’s explore why this transition is a necessary step.

Moving from AEM On-Premises or managed services to AEM Cloud offers numerous advantages, including:

Reduced Cost of Ownership and Mid-term ROI

The total cost of ownership with AEM Cloud is drastically reduced. Your company might see savings in several areas:

  • License: Licensing costs may decrease since the new pricing model is usage-based. Additionally, transitioning to the Cloud provides you with a fresh opportunity to engage in price negotiations with Adobe.
  • Operational Costs: AEM Cloud simplifies many operational aspects, such as environment management and automated version updates.
  • Infrastructure and Hosting: If you previously hosted AEM on your premises, you’ll see substantial savings on infrastructure and hosting expenses, since the cost of maintaining the infrastructure is eliminated.
  • Workforce: The number of full-time employees (FTEs) required for the project will decrease, resulting in cost reductions.

While the migration project incurs initial expenses, our team has successfully migrated websites to AEM Cloud in less than three months.

The timeline can vary depending on integration complexity and the number of websites and domains involved.

Based on our analysis, the return on investment (ROI) for such a project typically falls below three years. In other words, migrating to AEM Cloud is a worthwhile investment.

Your CMS is always up-to-date, ensuring you have access to the latest features.

With AEM as a Cloud Service, you can say goodbye to version upgrade projects.

Adobe automatically updates the CMS with the latest features, eliminating the concept of versions. It operates like any other Software as a Service, ensuring you are always working with the most current version.

It’s more secure

Security is a primary concern for large enterprises, and AEM as a Cloud Service could offer enhanced security compared to your current setup.

The solution is continuously monitored, and regular patches are applied promptly whenever a security issue is detected.

Read this document about Adobe Cloud Service Security Overview for more details.

99.9% Uptime

With AEM Cloud, your website will always be online. This solution can efficiently scale horizontally and vertically to consistently maintain this high level of service, effectively managing even the most intensive traffic loads.

What are the main benefits of Adobe Experience Manager as a Cloud Service?

No Learning Curve

One significant advantage of transitioning to AEM Cloud is that your marketing team will find the tool familiar.

Despite significant changes in architecture, release processes, and operations, the end-user experience remains unchanged.

Content editors won’t notice any differences following the migration if you use the latest on-premise version.

This means you won’t need to invest time and resources in managing this change or providing extensive training to your team.

Focus on Innovation and Achieve a Faster Time to Market

Managing the operation of an Enterprise CMS is a practice rooted in the past. It’s time for your organization to embrace this new reality.

With AEM Cloud, you can accelerate innovation for several reasons:

  • Your workforce can be fully dedicated to projects that create value.
  • You gain access to the latest innovations from Adobe.

Thanks to our extensive experience with AEM Cloud Service and collaboration with multiple clients, we have witnessed a significantly improved time to market. Projects are completed swiftly, and new websites can be launched within months.

When your company has a new product or service to showcase, you’ll reap the benefits of working with this new generation CMS.

Moving from AEM On-Premise to AEM as a Cloud Service step-by-step

This section will guide you through migrating from AEM On-Premise to AEM as a Cloud Service.

Each step is carefully designed to ensure a smooth and successful transition to the Cloud, covering critical aspects from initial analysis to going live.

AEM On Premise to Cloud Migration Project Steps

Step 1 – Analyze, Plan, and Estimate the Effort

The initial step in this journey is to understand AEM as a Cloud Service and the associated changes and deprecated features.

Some noteworthy changes include:

  • Architecture changes with automatic horizontal scaling
  • Project code structure
  • Asset storage
  • Built-in CDN
  • Dispatcher configuration
  • Network and API connections, including IP whitelisting
  • DNS & SSL certificate configuration
  • CI/CD pipelines
  • AEM author access with Adobe account
  • User groups & permissions

Additionally, it’s crucial to evaluate your current AEM installation, particularly in terms of connections and integrations with other services:

  • APIs or endpoints within the internal network
  • Third-party services, especially those protected by IP whitelisting
  • Any data import services to AEM
  • Login with closed user group (CUG)

These elements should be carefully reviewed, as some adjustments may be necessary.

Another critical aspect is effective communication with current stakeholders, partners, and the Adobe team. Onboarding these parties from the project’s outset is essential, with clear task assignments and timeframes.

For example, you will later discover that the involvement of your internal IT team is required. Informing them in advance is crucial to prevent project delays.

Furthermore, it’s essential to review your licensing agreements with Adobe and ensure that you have the appropriate subscriptions for AEM as a Cloud Service.

While this initial step may only take a few days, it is vital in assessing critical aspects of your installation, defining the project plan and effort, and sharing this information with key stakeholders.

Step 2 – Prepare the code for AEM as a Cloud Service

This step aims to ensure your current AEM installation and its code base are ready for the Cloud while remaining compatible with your existing on-premise instances.

While we won’t go deep into all the structural changes required for AEM Cloud in this article, we’ll provide an overview to keep it easily digestible for all readers.

Adobe offers a helpful tool called the Adobe Best Practices Analyzer designed to evaluate your current AEM implementation and offer guidance on improvements to align with best practices and Adobe standards.

The report generated by this tool covers:

  • Application functionality in need of refactoring.
  • Repository items that should be relocated to supported locations.
  • Legacy user interface dialogs and components that require modernization.
  • Deployment and configuration issues.
  • AEM 6.x features that have been replaced by new functionality or are currently unsupported on AEM as a Cloud Service.

It’s important to note that an AEM expert should review the Adobe Best Practices Analyzer report, as the tool cannot fully comprehend the entire codebase and its implications.

Following the assessment, an AEM architect or developer can restructure the codebase and apply new practices per the latest AEM Archetype.

A recommended practice is further refactoring and reviewing outdated features from your current codebase.

Since comprehensive testing of the entire website and application will be necessary later on, taking the opportunity to eliminate technical debt and establish a more robust foundation is advantageous.

Step 3 – Prepare AEM Cloud Environments

This step aims to prepare the cloud environment and set up AEM Cloud Manager, the backbone of AEM as a Cloud Service. Importantly, this step can be conducted concurrently with the previous one.

Adobe Cloud Manager offers a user-friendly interface that simplifies configuring environments, setting up pipelines, and configuring certificates, DNS, and other essential services.

Step 4 – Migrate Your Projects and Code to AEM Cloud

By this stage, your code has been refactored, and any changes incompatible with the on-premise setup have been implemented and migrated to make it cloud-ready.

Additionally, all necessary environments (test, staging, production) have been appropriately configured and are ready to host your code.

This step is relatively straightforward and involves pushing your code to the Cloud Git repository. During this phase and until the go-live, it is advisable to enforce a feature freeze.

However, if you cannot afford to freeze features in your production environment or if critical changes must be applied to your on-premise installation, it is feasible to backport the code to the Cloud later.

At One Inside, we have experience handling such situations, but it’s essential to understand that a code freeze can help mitigate the risks of project delays and increased complexity.

Ready to Move to AEM Cloud?

Don’t wait any longer! Reach out to our experts now, and let’s make your move seamless and successful!

Step 5 – Validate Integration with Core Services or External APIs

Chances are, your website relies on data from third-party services or internal applications.

To ensure seamless integration with these services, specific network configurations must be carried out using the Cloud Manager.

Furthermore, AEM as a Cloud Service offers a static IP address that must be whitelisted on your end to enable connectivity with your on-premise applications.

This step is crucial for establishing a secure and uninterrupted connection between your AEM Cloud environment and your core services or external APIs.

Step 6 – Integrate Adobe Target, Adobe Analytics, and the Adobe Experience Cloud Suite

Since you are already utilizing AEM for your websites, it’s probable that you also rely on other solutions within the Adobe Experience Cloud suite, including Adobe Analytics and Adobe Target.

The integration of these solutions is typically straightforward, and they should seamlessly operate within your web pages.

Your existing usage of AEM makes it easier to extend the integration to other Adobe Experience Cloud components, enhancing your ability to analyze and optimize your digital experiences.

Step 7 – Migrate Content

Content migration is an important step, but it doesn’t have to be overly concerning. The structure of the content between your on-premise website and the newly created AEM Cloud website remains the same.

To make this process sound less daunting, you can think of it as a content move, similar to transferring content from your staging environment to the production environment.

Additionally, Adobe offers various tools to streamline this task, such as the Content Transfer Tool, which is specially designed for migrating existing content from your AEM On-Premise instance to AEM Cloud, and the Package Manager, which facilitates the import and export of repository content.

When we refer to content migration, it encompasses more than just pages; it includes all content within your repository, including:

  • Page content
  • Assets
  • User and group data

Furthermore, since you may continue to create content on your production site while performing the migration, the tool supports differential content top-up.

You can transfer only the changes made since the last content migration, ensuring an efficient and up-to-date transition.

Step 8 – Test, Test, Test

We are approaching the final stages of the migration journey. Although some testing has occurred throughout the various steps, it’s now time for a comprehensive User Acceptance Testing (UAT) session.

Your dedicated testing team and business users should actively participate in this critical phase. It’s essential to have a detailed test strategy in place before commencing UAT.

Including authors in the testing process serves multiple purposes.

Not only does it expedite their familiarity with the new environment, but they are also the individuals most acquainted with how the components should function.

Their input, knowledge, and support are pivotal in ensuring your digital presence remains clear and distinctive.

Conducting thorough testing ensures your migration to AEM Cloud is successful, and your website operates seamlessly in its new environment.

Step 9 – Redirect Domains

This is the final step before going live, and it’s the point where your IT network team plays a key role.

They will manage certificates, DNS configurations, and domain redirection.

As emphasized at the beginning of this guide, it’s crucial that your IT stakeholders were informed from day one of this project about these critical milestones, and tasks were allocated accordingly.

They should be well-prepared and aware of what needs to be done, as preparations for this phase have been ongoing for several weeks.

Effective coordination in this step is essential to prevent delays in the overall process and the go-live date.

By ensuring a smooth domain redirection, your website seamlessly transitions to its new AEM Cloud environment.

Step 10 – Go Live

This step might seem the most stressful, but paradoxically, it’s also the simplest.

Your website has undergone extensive testing, and everything functions seamlessly in the cloud environment. It’s time for the final transition, shifting from your AEM On-Premise instance to the AEM Cloud instance.

The switch will be seamless for your end-users, and they won’t experience any interruptions in service. With careful planning and execution, this step should mark the successful culmination of your migration to AEM Cloud.

“The migration to AEM Cloud is a source of great satisfaction for both the business and IT stakeholders to see the website actively running in the Cloud, moving into a new era of better performance and exciting possibilities to enhance the customer experience.”
Martyna Wilczynska

Project Manager at One Inside – A VASS Company

Step 11 – Train your Team

Your editors won’t require specific training as the admin interface remains the same.

However, it’s important to note that a new essential tool, Adobe Cloud Manager, has been introduced.

Your IT or DevOps teams should manage this tool, or you can delegate site maintenance to your Adobe Partner.

Our AEM experts can offer training to ensure your IT team possesses the necessary skills and knowledge to handle critical tasks related to SSL Certificates, domain linking, whitelisting, and account management.

Step 12 – Decommission the On-Premise Instance

As a final recommendation, keeping your on-premise server running for 2 to 4 weeks after the migration is advisable.

This precaution provides a safety net in case of any critical situations where you might need to switch back to the on-premise instance.

While, based on our experience, such a reversal is rarely necessary, it’s prudent to manage this potential risk.

Once the hyper-care phase is concluded, you can confidently shift your entire focus to your new AEM as a Cloud Service instance, knowing you have a contingency plan in place if needed.

Need personal guidance with our experts?

Ready to explore your AEM Cloud migration? Book some time with us, so we can evaluate your needs and help you prepare for a seamless move!

Lessons Learned from AEM as a Cloud Service Migration Projects and Best Practices

After several successful migrations to AEM as a Cloud Service, our team has gathered excellent knowledge, and we would like to share some best practices that will help you mitigate the risk in this project.

Start with a Thorough Analysis

Begin your Cloud Migration project with a comprehensive analysis. Avoid rushing the assessment of your current AEM On-Premise setup. It’s crucial to carefully evaluate dependencies and the elements that require refactoring.

If this is your first migration, invest time in research and documentation for a project of this nature.

Even if you have an internal team handling AEM, consider seeking support from an experienced Adobe Partner. Their expertise can prove invaluable in ensuring a successful migration.

Manage Stakeholders’ Dependencies

Taking care of stakeholders’ dependencies early in the project is crucial. Multiple members of your organization will play pivotal roles at significant project milestones.

We’ve already mentioned the IT team’s role in managing the network, but other groups may be involved, such as security and quality assurance.

At the project’s start, it’s essential to communicate your expectations clearly with these teams and provide them with precise dates for their involvement.

This proactive approach helps prevent delays and ensures a smooth progression of the project.

Not your typical Scrum project

What may come as a surprise is that a Cloud Migration project does not fully correspond to your typical Scrum-managed IT project.

In the regular framework, we focus on delivering the highest presentable value in the shortest amount of time, and we present our solutions to the clients, constantly asking for feedback.

An AEM Cloud Migration project primarily involves refactoring the backend code, which may not be presentable to the stakeholders until the website is in the acceptance environment in the Cloud and ready for testing.

Regular Team and Stakeholder Meetings

As the three-month timeline swiftly progresses, staying in sync with your team and key stakeholders is essential.

We highly recommend establishing a weekly update routine to track progress, identify and address risks, and implement mitigation plans.

During these weekly reviews, pay particular attention to dependencies with other teams and assess the advancement of their activities. This proactive approach ensures everyone is aligned and swiftly responds to evolving project needs.

“Clear communication with clients is key to risk mitigation, issue identification, and progress updates during the migration. It alleviates client stress and ensures transparency in their digital journey.”
Michael Kleger

Project Manager at One Inside

Relationship with Adobe

License negotiations must be completed to gain access to the cloud environment.

Equally important is negotiating with your Adobe account manager to keep a standby server on-premise for a specified period as a fallback.

From our experience, initiating such conversations as early as possible allows for negotiating the most advantageous and flexible transition away from the on-premise infrastructure.

Furthermore, in the event of unexpected issues, you may require support from Adobe’s team. It’s possible that certain features may not function properly when refactored for the Cloud.

To expedite the response time of Adobe Support, it is essential to collaborate with an Adobe Partner who maintains a strong relationship with the Adobe team.

For instance, at One Inside, we have cultivated a partnership with Adobe spanning over a decade, and our office is located within 30km of the AEM team responsible for building AEM as a Cloud Service.

This close relationship can be invaluable in certain situations. Over the years, we have developed a robust relationship with Adobe as a company and its talented individuals.

This gives us an advantage in problem-solving, as we possess intimate knowledge of whom to contact without navigating multiple support levels.

Avoid Developing on the On-Premise Instance During Migration

Whenever possible, avoid introducing new developments to your live websites while the migration progresses. This practice helps prevent numerous issues.

However, we acknowledge that implementing a three-month code freeze is often impractical.

To mitigate potential problems, ensure that the code on both environments is synchronized and optimized for the Cloud before making any further enhancements to your on-premise branch.

This alignment minimizes complications during the migration process.

Leverage the Opportunity to Enhance Design Flaws

During the migration process, you’ll have the opportunity to test your entire website thoroughly.

Seize this moment to enhance various aspects of your site, including architecture, code refactoring, and minor design adjustments.

In our migration projects, we’ve successfully incorporated improvements such as image rendition generation, frontend enhancements, and optimizations related to performance and caching.

This migration window allows you to transition to the Cloud and enhance your website’s overall quality and functionality.

Key Takeaways for Your AEM as a Cloud Service Migration

In conclusion, migrating to AEM as a Cloud Service is a transformative journey that requires careful planning and execution.

AEM Cloud Service is the future of AEM, and this migration sets the foundation.

Throughout this article, we’ve shared valuable insights and best practices from successful AEM Cloud migrations. From analyzing dependencies to fostering solid relationships with Adobe, from weekly team updates to optimizing design flaws, these lessons can guide you toward a successful migration.

Embrace the challenges and opportunities of transitioning to the Cloud, and remember that a well-executed migration can lead to a more efficient, secure, and innovative digital experience for your organization and its users.

With the right approach and the support of experienced partners, you can confidently navigate this journey and deliver excellent results.

We would like to express our gratitude to the talented individuals within our company who contributed to this article, including Martyna Wilczynska, Basil Kohler, Michael Kleger and Samuel Schmitt.

Samuel Schmitt

Digital Solution Expert

The post 12 Steps to Migrate AEM from On-Premise to the Cloud appeared first on One Inside.

by Samuel Schmitt at January 23, 2024 10:37 AM

January 12, 2024

Things on a content management system - Jörg Hoh

Sling Model Exporter & exposing ResourceResolver information

Welcome to 2024. I will start this new year with a small piece of advice regarding Sling Models, which I hope you can implement very easily on your side.

The Sling Model Exporter is based on the Jackson framework, and it can serialize an object graph, with the root being the requested Sling Model. For that it recursively serializes all public & protected members and return values of all simple getters. Properly modeled this works quite well, but small errors can have large consequences. While missing data is often quite obvious (if the JSON powers an SPA, you will find it not properly working), too much data being serialized is spotted less frequently (normally not at all).

I am currently exploring options to improve performance, and I am a big fan of the ResourceResolver.getPropertyMap() API to implement a per-resourceresolver cache. While testing such a potential improvement I found customer code in which the ResourceResolver is serialized via the Sling Model Exporter into JSON. In that case the code looked like this:

import javax.annotation.PostConstruct;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.Self;

@Model(adaptables = Resource.class)
public class MyModel {
 @Self
 Resource resource;

 ResourceResolver resolver;

 @PostConstruct
 public void init() {
   resolver = resource.getResourceResolver();
 }
}

(see this good overview at Baeldung of the default serialization rules of Jackson.)

And that’s bad in 2 different aspects:

  • Security: The serialized ResourceResolver object contains, in addition to the data returned by the public getters (e.g. the search paths, the userId and potentially other interesting data), also the complete propertyMap. And this serialized cache is probably nothing you want to expose to the consumer of this JSON.
  • Exceptions: If the getProperty() cache contains instances of classes which are not publicly exposed (that means these class definitions are hidden within some implementation packages), you will encounter ClassNotFound exceptions during serialization, which will break the export. And instead of a JSON you get an internal server error or a partially serialized object graph.

In short: It is not a good idea to serialize a ResourceResolver. And honestly, I have not found a reason why this should be possible at all. So right now I am a bit hesitant to use the propertyMap as a cache, especially in contexts where the Sling Model Exporter might be used. And that blocks me from working on some interesting performance improvements 😦

To unblock this situation, we have introduced a 2 step mechanism, which should help to overcome this situation:

  1. In the latest AEM as a Cloud Service release 14697 (both in the cloud as well as in the SDK) a new WARN message has been added when your Model definition causes a ResourceResolver to be serialized. Search the logs for this message: “org.apache.sling.models.jacksonexporter.impl.JacksonExporter A ResourceResolver is serialized with all its private fields containing implementation details you should not disclose. Please review your Sling Model implementation(s) and remove all public accessors to a ResourceResolver.”
    It should also contain a reference to the request path where this is happening, so it should be easily possible to identify the Sling Model class which triggers this serialization and change that piece of code so the ResourceResolver is not serialized anymore. Note that the above message is just a warning; the behavior remains unchanged.
  2. As a second measure, functionality has also been implemented which allows blocking the serialization of a ResourceResolver via the Sling Model Exporter completely. Enabling this is a breaking change for all AEM as a Cloud Service customers (even if I am 99.999% sure that it won’t break any functionality), and for that reason we cannot enable this change on the spot. But at some point this step is necessary to guarantee that the 2 problems listed above will never happen.

Right now the first step is enabled, and you will see this log message. If you see this log message, I encourage you to adapt your code (the core components should be safe) so ResourceResolvers are no longer serialized.
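
A minimal sketch of one possible fix (the class name is hypothetical, and it assumes the resolver is only needed internally): keep the field private so the Jackson-based exporter does not pick it up, and optionally mark it with Jackson’s @JsonIgnore to make the intent explicit.

import javax.annotation.PostConstruct;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.Self;

import com.fasterxml.jackson.annotation.JsonIgnore;

@Model(adaptables = Resource.class)
public class MyFixedModel {

  @Self
  @JsonIgnore
  private Resource resource;

  // private visibility keeps the exporter from serializing the field;
  // @JsonIgnore documents the intent explicitly
  @JsonIgnore
  private ResourceResolver resolver;

  @PostConstruct
  protected void init() {
    resolver = resource.getResourceResolver();
  }
}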

In parallel we need to implement step 2; right now the planning is not done yet, but I hope to activate step 2 some time later in 2024 (not before the middle of the year). But before this is done, there will be formal announcements in the AEM release notes. And I hope that with this blog post and the release notes all customers will have adapted their implementation, so that setting this switch will not change anything.

Update (January 19, 2024): There is now a piece of official AEM documentation covering this situation as well.

by Jörg at January 12, 2024 06:23 PM

December 16, 2023

Things on a content management system - Jörg Hoh

A review of 2023

It’s again December, and so time to review a bit my activities of 2023 in this blog.

I have to admit, I am not a reliable writer, as I write very infrequently. And it’s not because of a lack of time, but rather because I rarely find content which (in my opinion) is worth writing about. I don’t have large topics which I split up into a series of posts. If you ever saw a smaller series of posts, that mostly happened by accident. I was just working on aspects of the system, which at some point I wrote about and afterwards started to understand better. That has been the default for the last 15 years … (OMG, have I really been blogging here for that long already? Probably, the first post “Why use the dispatcher?” went live on December 22, 2008. So this is the 15th anniversary.)

I started 2023 with 4 posts until the end of September:

But something changed in October. I had already prepared 2 postings, but I started to come up with more topics within days; it ended with 6 blog posts in October and November, which is quite a pace for this blog:

It felt incredible to be able to announce a new blog post every few days. I don’t think that I can keep up that frequency, but I will try to write more often in 2024. I have already noted a few topics for the next posts, stay tuned 🙂

Also, if you are reading this blog because you found the link to it somewhere, but you are interested in the topics I write about: You can get notified of new posts immediately by providing me (well, WordPress) your email address (you should see it on the right rail of this page). Alternatively if you are old-style, you can also subscribe to the RSS Feed of this blog, which also contains the full content of the postings. That might be interesting for you, as I normally reference new posts on other channels with some delay, and sometimes I even skip it completely (or simply forget).

Thanks for your attention, and I wish you all a successful and happy 2024.

by Jörg at December 16, 2023 06:03 PM

November 25, 2023

Things on a content management system - Jörg Hoh

Thoughts on performance testing on AEM CS

Performance is an interesting topic on its own, and I already wrote a bit about it in this blog (see the overview). I have not written yet about performance testing in the context of AEM CS. It’s not that it is fundamentally different, but there are some specifics, which you should be aware of.

  • Perform your performance tests on the Stage environment. The Stage environment is kept at the same sizing as the production environment, so it should deliver the same behavior as your PROD environment, if you have the same content and your test is realistic.
  • Use a warmup phase. As the Stage environment is normally downscaled to the minimum (because there is no regular traffic), it can take a bit of time until it has upscaled (automatically) to the same number of instances as your PROD is normally operating with. That means that your test should have a proper warmup period, during which you increase the traffic to the normal 100% level of production. This warmup phase should take at least 30 minutes.
  • I think that any test should take at least 60-90 minutes (including warmup); even if you see early that the result is not what you expect it to be, there is often something to learn even from such incorrect/faulty situations. I had the case that a customer was constantly terminating the test after about 20-25 minutes, claiming that something was not working server-side as they expected it to. Unfortunately the situation had not yet settled, so I was not able to get any useful information from the system.
  • AEM CS comes with a CDN bundled to the environment, and that’s the case also for the Stage environment. But that also means that your performance test should contain all requests, which you expect to be delivered from the CDN. This is important because it can show if your caching is working as intended. Also only then you can assess the impact of the cache misses (when files expire on the CDN) on the overall performance.
  • While you are at it, you can run a stage pipeline during the performance test and deploy new code. You should not see any significant change in performance during that time.
  • Oh yes, also do some content activations in that time. That makes your test much more realistic and also reveals potential performance problems when updating content (e.g. because you constantly invalidate the complete dispatcher cache).
  • You should focus on a large content set when you do the performance test. If you only test a handful of pages/assets/files, you are mostly testing caches (at all levels).
  • “Campaign traffic” is rarely tested. This is traffic which has some query strings attached (e.g. “utm_source”, “gclid” and such) to support traffic attribution. These parameters are ignored while rendering, but they often bypass all caching layers, hitting AEM. And while a regular performance test only tests without these parameters, if your marketing department runs a Facebook campaign, the traffic from that campaign looks much different, and then the results of your performance tests are not valid anymore.

Some words as precaution:

  • A performance test can look like a DOS, and your requests can get blocked for that reason. This can happen especially if these requests are originating from a single source IP. For that reason you should distribute your load injector and use multiple source IP addresses. In case you still get blocked, please contact support so we can adapt accordingly.
  • AEM CS uses an affinity cookie to indicate that requests of a user-agent are handled by a specific backend system. If you use the same affinity cookie throughout all your performance tests, you just test a single backend system; and that effectively disables any loadbalancing and renders the performance test results unusable. Make sure that you design your performance tests with that in mind.

In general I much prefer helping you during the performance testing phase over handling escalations because of bad performance and potential outages. I hope that you think the same way.

by Jörg at November 25, 2023 11:45 AM

November 19, 2023

Things on a content management system - Jörg Hoh

If you have curl, every problem looks like a request

If you are working in IT (or in a craft) you should know the saying: “When you have a hammer, every problem looks like a nail”. It describes the tendency of people who have a tool which reliably solves a specific problem to try to use this tool on every other problem, even if it does not fit at all.

Sometimes I see this pattern in AEM as well, but not with a hammer, but with “curl”. Curl is a commandline HTTP client, and it’s quite easy to fire a request against AEM and do something with the output of it. It’s something every AEM developer should be familiar with, also because it’s a great tool to automate things. And if you talk about “automating AEM”, the first thing people often come up with is “curl”…

And there the problem starts: Not every problem can be automated with curl. For example take a periodic data export from AEM. The immediate reaction of most developers (forgive me if I generalize here, but I have seen this pattern too often!) is to write a servlet to pull all this data together, create a CSV list and then use curl to request this servlet every day/week.

Works great, doesn’t it? Good, mark that task as done, next!

Wait a second, on prod it takes 2 minutes to create that list. Well, not a problem, right? Until it takes 20 minutes, because the number of assets is growing. And until you move to AEM CS, where the timeout of requests is 60 seconds, and your curl is terminated with a statuscode 503.

So what is the problem? It is not the timeout of 60 seconds; and it’s also not the constantly increasing number of assets. It’s the fact that this is a batch operation, and you use a communication pattern (request/response) which is not well suited for batch operations. It’s the fact that you start with curl in mind (a tool which is built for the request/response pattern) and therefore you build the implementation around this pattern. You have curl, so every problem is solved with a request.

What are the limits of this request/response pattern? Definitely the runtime is a limit, and actually for 3 reasons:

  • The timeout for requests on AEM CS (or basically any other loadbalancer) is set for security reasons and to prevent misuse. Of course the limit of 60 seconds in AEM CS is a bit arbitrary, but personally I would not wait 60 seconds for a webpage to start rendering. So it’s as good as any higher number.
  • There is another limit, which is determined by the availability of the backend system which is actually processing this request. In a highly available and autoscaling environment systems start and stop in an automated fashion, managed by a control-plane which operates on a set of rules. And these rules can enforce that any (AEM) system will be forced to shut down at most 10 minutes after it has stopped receiving new requests. And that means that a request which would constantly take 30+ minutes might be terminated without finishing successfully. And it’s unclear if your curl would even realize it (especially if you are streaming the results).
  • (And technically you can also add that the whole network connection needs to be kept open for that long, and AEM CS itself is just a single factor in there. Also the internet is not always stable; you can experience network hiccups at any point in time. They are normally just well hidden by retrying failing requests. Which is not an option here, because it won’t solve the problem at all.)

In short: If your task can take long (say: 60+ seconds), then a request is not necessarily the best option to implement it.

So, what options do you have then? Well, the following approach works also in AEM CS:

  1. Use a request to create and initiate your task (let’s call it a “job”);
  2. And then poll the system until this job is completed, then return the result.

This is an asynchronous pattern, and it’s much more scalable when it comes to the amount of processing you can do in there.

Of course you cannot use a single curl command anymore, but now you need to write a program to execute this logic (don’t write it in a shell-script please!); but on the AEM side you can now use either sling jobs or AEM workflows and perform the operation.
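
As an illustration of this asynchronous pattern, here is a minimal sketch (the job topic, paths and property names are hypothetical and not from the original post): a servlet only queues a Sling job and returns its id, and a JobConsumer does the actual batch work outside of any request.

import java.io.IOException;
import java.util.Map;

import javax.servlet.Servlet;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.JobManager;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

// Step 1: a servlet which only *initiates* the export and returns immediately.
@Component(service = Servlet.class,
    property = {"sling.servlet.paths=/bin/start-export", "sling.servlet.methods=POST"})
public class StartExportServlet extends SlingAllMethodsServlet {

  @Reference
  private transient JobManager jobManager;

  @Override
  protected void doPost(SlingHttpServletRequest request, SlingHttpServletResponse response)
      throws IOException {
    Job job = jobManager.addJob("com/example/export", Map.of("root", "/content/dam"));
    // return the job id; the client polls later (or checks a well-known result location)
    response.setContentType("text/plain");
    response.getWriter().write(job.getId());
  }
}

// Step 2: the actual batch work runs asynchronously, outside of any request.
@Component(service = JobConsumer.class,
    property = {JobConsumer.PROPERTY_TOPICS + "=com/example/export"})
class ExportJobConsumer implements JobConsumer {

  @Override
  public JobResult process(Job job) {
    String root = job.getProperty("root", String.class);
    // ... collect the data and write the CSV to a well-known location ...
    return JobResult.OK;
  }
}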

But this avoids the 60-second restriction, and it can handle restarts of AEM transparently, at least on the author side. And you have the huge benefit that you can collect all your errors during the runtime of this job and decide afterwards if the execution was a success or a failure (which you cannot do in HTTP).

So when you have long-running operations, check if you really need to do them within a request. In many cases it’s not required, and then please switch gears to some asynchronous pattern. And that’s something you can do even before the situation starts to become a problem.

by Jörg at November 19, 2023 04:32 PM

November 09, 2023

Things on a content management system - Jörg Hoh

Identify repository access

Performance tuning in AEM is typically a tough job. The most obvious and widely known aspect is the tuning of JCR queries, but that’s about it; if your code is not doing any JCR query and is still slow, it gets hard. For requests my standard approach is to use “Recent requests” and identify slow components, but that’s it. And then you have threaddumps, but these hardly help here. There is no standard way to diagnose further without relying on gut feeling and luck.

When I had to optimize a request last year, I thought again about this problem. And I asked myself the question:
Whenever I check this request in the threaddumps, I see the code accessing the repository. Why is this the case? Is the repository slow or is it just accessing the repository very frequently?

The available tools cannot answer this question. So I had to write myself something which can do that. In the end I committed it to the Sling codebase with SLING-11654.

The result is an additional logger (“org.apache.sling.jcr.resource.AccessLogger.operation” on loglevel TRACE) which you can enable and which logs every single (Sling) repository access, including the operation, the path and the full stacktrace. That is a huge amount of data, but it answered my question quite thoroughly.

  • The repository itself is very fast, because a request (taking 500ms in my local setup) performs 10’000 repository accesses. So the problem is rather the total number of repository accesses.
  • Looking at the list of accessed resources it became very obvious that there is a huge number of redundant accesses. For example these are the top 10 accessed paths while rendering a simple WKND page (/content/wknd/language-masters/en/adventures/beervana-portland):
    • 1017 /conf/wknd/settings/wcm/templates/adventure-page-template/structure
    • 263 /
    • 237 /conf/wknd/settings/wcm/templates
    • 237 /conf/wknd/settings/wcm
    • 227 /content
    • 204 /content/wknd/language-masters/en
    • 199 /content/wknd
    • 194 /content/wknd/language-masters/en/adventures/beervana-portland/jcr:content
    • 192 /content/wknd/jcr:content
    • 186 /conf/wknd/settings

But now with that logger, I was able to identify access patterns and map them to code. And suddenly you see a much bigger picture, and you can spot a lot of redundant repository access.

With that help I identified the bottleneck in the code, and the posting “Sling Model performance” was the direct result of this finding. Another result was the topic for my talk at AdaptTo() 2023; checkout the recording for more numbers, details and learnings.

But with these experiences I made an important observation: You can use the number of repository accesses as a proxy metric for performance. The more repository accesses you do, the slower your application will get. So you don’t need to rely so much on performance tests anymore (although they definitely have their value), but you can validate changes in the code by counting the number of repository accesses performed by it. Fewer repository accesses are always more performant, no matter the environmental conditions.

And with an additional logger (“org.apache.sling.jcr.AccessLogger.statistics” on TRACE) you can get just the raw numbers without details, so you can easily validate any improvement.

Equipped with that knowledge you should be able to investigate the performance of your application on your local machine. Looking forward for the results 🙂

(This is currently only available on AEM CS / AEM CS SDK, I will see to get it into an upcoming AEM 6.5 servicepack.)

by Jörg at November 09, 2023 01:28 PM

November 04, 2023

Things on a content management system - Jörg Hoh

The Explain Query tool

If there’s one topic which has been challenging forever in the AEM world, it’s JCR queries and indexes. It can feel like an arcane science, where it’s quite easy to mess up and end up with a slow query. I learned it the hard way as well, and a printout of the JCR query cheatsheet is always below my keyboard.

But there were some recent changes, which made the work with query performance easier. First, in AEM CS the Explain Query tool has been added, which is also available via the AEM Developer Console. It displays queries, slow queries, number of rows read, the used index, execution plan etc. But even with that tool alone it’s still hard to understand what makes a query performant or slow.

Last week there was a larger update to the AEM documentation (thanks a lot, Tom!), which added a detailed explanation of the Explain Query tool. Especially it drills down into the details of the query execution plan and how to interpret it.

With this information and the good examples given there you should be able to analyze the query plan of your queries and optimize the indexes and queries before you execute them the first time in production.

by Jörg at November 04, 2023 06:22 PM

October 16, 2023

Things on a content management system - Jörg Hoh

3 rules how to use an HttpClient in AEM

Many AEM applications consume data from other systems, and in the last decade the protocol of choice turned out to be HTTP(S). And there are a number of very mature HTTP clients out there which can be used together with AEM. The most frequently used variant is the Apache HttpClient, which is shipped with AEM.

But although the HttpClient is quite easy to use, I came across a number of problems, many of which resulted in service outages. In this post I want to list the 3 biggest mistakes you can make when you use the Apache HttpClient. While I observed the results in AEM as a Cloud Service, the underlying effects are the same on-prem and in AMS; the resulting symptoms can be a bit different.

Reuse the HttpClient instance

I often see that an HttpClient instance is created for a single HTTP request, and in many cases it’s not even closed properly afterwards. This can lead to these consequences:

  • If you don’t close the HttpClient instance properly, the underlying network connection(s) will not be closed properly, but eventually time out. And until then the network connections stay open. If you are using a proxy with a connection limit (many proxies do that) this proxy can reject new requests.
  • If you re-create a HttpClient for every request, the underlying network connection will get re-established every time with the latency of the 3-way handshake.

The reuse of the HttpClient object and its state is also recommended by its documentation.

The best way to make that happen is to wrap the HttpClient into an OSGI service, create it on activation and stop it when the service is deactivated.

Set aggressive connection- and read-timeouts

Especially when an outbound HTTP request is executed within the context of an AEM request, performance really matters. Every millisecond which is spent in that external call makes the AEM request slower. This increases the risk of exhausting the Jetty thread pool, which then leads to non-availability of that instance, because it cannot accept any new requests. I have often seen AEM CS outages because a backend was responding slowly or not at all. All requests should finish quickly, and in case of errors must also return fast.

That means that timeouts should not exceed 2 seconds (personally I would prefer even 1 second). And if your backend cannot respond that fast, you should reconsider its fitness for interactive traffic, and try not to connect to it in a synchronous request.
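
A minimal sketch of how both rules could look in practice (the service name and the concrete timeout and pool values are assumptions, not a definitive implementation): one shared HttpClient wrapped in an OSGi component, created on activation, closed on deactivation, with aggressive timeouts.

import java.io.IOException;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;

@Component(service = BackendClient.class)
public class BackendClient {

  private CloseableHttpClient httpClient;

  @Activate
  protected void activate() {
    RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(1_000)            // 1s to establish the TCP connection
        .setSocketTimeout(2_000)             // 2s max idle time between data packets
        .setConnectionRequestTimeout(1_000)  // 1s to lease a connection from the pool
        .build();
    // one shared, pooled client for the whole lifetime of the service
    httpClient = HttpClients.custom()
        .setDefaultRequestConfig(config)
        .setMaxConnPerRoute(20)
        .setMaxConnTotal(50)
        .build();
  }

  @Deactivate
  protected void deactivate() throws IOException {
    if (httpClient != null) {
      httpClient.close();  // closes pooled connections cleanly
    }
  }

  public CloseableHttpClient getClient() {
    return httpClient;
  }
}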

Implement a degraded mode

When your backend application responds slowly, returns errors or is not available at all, your AEM application should react accordingly. I have seen a number of times that any problem on the backend had an immediate effect on the AEM application, often resulting in downtimes because either the application was not able to handle the results of the HttpClient (so the response rendering failed with an exception), or because the Jetty threadpool was totally consumed by those requests.

Instead your AEM application should be able to fallback into a degraded mode, which allows you to display at least a message, that something is not working. In the best case the rest of the site continues to work as usual.

If you implement these 3 rules when you do your backend connections, and especially if you test the degraded mode, your AEM application will be much more resilient when it comes to network or backend hiccups, resulting in less service outages. And isn’t that something we all want?

by Jörg at October 16, 2023 01:43 PM

October 14, 2023

Things on a content management system - Jörg Hoh

Recap: AdaptTo 2023

It was adaptTo() time again, the first time again in an in-person format since 2019. And it’s definitely much different from the virtual formats we experienced during the pandemic. More personal, and allowing me to get away from the daily work routine; I remember that in 2020 and 2021 I constantly had work-related topics (mostly Slack) on the other screen while I was attending the virtual conference. That’s definitely different when you are at the venue 🙂

And it was great to see all the people again. Many of them have been part of the community for years, but there were also many new faces. Nice to see that the community can still attract new people, although I think that the golden time of backend-heavy web development is over. And that was reflected on stage as well, with Edge Delivery Services being quite a topic.

As in the past years, the conference itself isn’t that large (this year maybe 200 attendees) and it gives you plenty of chances to get in touch and chat about projects, new features, bugs and everything else you can imagine. The location is nice, and Berlin gives you plenty of opportunities to go out for dinner. So while 3 days of conference can definitely be exhausting, I would have liked to spend many more dinners with attendees.

I got the chance to come on stage again with one of my favorite topics: Performance improvement in AEM, a classic backend topic. According to the talk feedback, people liked it 🙂
Also, the folks of the adaptTo() recorded all the talks and you can find both the recording and the slide deck on the talk’s page.

The next call for papers is already announced to start in February ’24, and I will definitely submit a talk again. Maybe you as well?

by Jörg at October 14, 2023 04:07 PM

July 13, 2023

Things on a content management system - Jörg Hoh

AEM CS & dedicated egress IP

Many customers of AEM as a Cloud Service are used to performing a first level of access control by allowing just a certain set of IP addresses to access a system. For that reason they want their AEM instances to use a static IP address or network range to access their backend systems. AEM CS supports this with the feature called “dedicated egress IP address“.

But when testing that feature there is often the feedback that this is not working, and that the incoming requests on backend systems come from a different network range. This is expected, because this feature does not change the default routing for outgoing traffic of the AEM instances.

The documentation also says

Http or https traffic will go through a preconfigured proxy, provided they use standard Java system properties for proxy configurations.

The thing is that if traffic is supposed to use this dedicated egress IP, you have to explicitly make it use this proxy. This is important, because by default not all HTTP Clients do this.

For example, in the Apache HttpClient library 4.x, the HttpClients.createDefault() method does not read the proxy-related system properties, but HttpClients.createSystem() does. The same applies to java.net.http.HttpClient, for which you need to configure the Builder to use a proxy. Also okhttp requires you to configure the proxy explicitly.
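
A small sketch to illustrate the difference (the class name is hypothetical; the calls themselves are standard API):

import java.net.ProxySelector;
import java.net.http.HttpClient;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ProxyAwareClients {

  // Apache HttpClient 4.x: createSystem() honours the standard Java proxy
  // system properties (http.proxyHost, http.proxyPort, ...), createDefault() does not.
  CloseableHttpClient apacheClient = HttpClients.createSystem();

  // java.net.http.HttpClient: explicitly point the builder at the default
  // ProxySelector, which reads the same system properties.
  HttpClient jdkClient = HttpClient.newBuilder()
      .proxy(ProxySelector.getDefault())
      .build();
}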

So if requests from your AEM instance are coming from the wrong IP address, check that your code is actually using the configured proxy.

by Jörg at July 13, 2023 07:40 AM

July 02, 2023

Things on a content management system - Jörg Hoh

Sling Model Exporter: What is exported into the JSON?

Last week we came across a strange phenomenon, when the AEM release validation process broke in an unexpected situation. Which is indeed a good thing, because it covered an aspect I had never thought of.

The validation broke because during a request the serialization of a Sling Model failed with an exception. The short version: It tried to serialize a ResourceResolver(!) into JSON (more details in SLING-11924). Why would anyone serialize a ResourceResolver into a JSON to be consumed by an SPA? I clearly believe that this was not done intentionally, but happened by accident. But nevertheless, it broke the improvement we intended to make, so we had to roll it back and wait for SLING-11924 to be implemented.

But it gives me the opportunity to explain, which fields of a Sling Model are exported by the SlingModelExporter. As it is backed by the Jackson data-bind framework, the same rules apply:

  • All public fields are serialized
  • all publicly available getter methods which do not expect a parameter are serialized.

It is not too hard to check this, but there are a few subtle aspects to consider in the context of Sling Models.

  • Injections: make sure that you only make those injections public which you want to be handled by the SlingModelExporter. Make everything else private.
  • I often see Lombok used to create getters for SlingModels (because you need them for the use in HTL). This is especially problematic when the annotation @Getter is used on class level, because then a getter is created for every field (no matter the visibility), which is then picked up by the SlingModelExporter.

My call to action: Validate your SlingModels and check that you don’t export a ResourceResolver by accident. (If you are an AEM as a Cloud Service customer and affected by this problem, you will probably get an email from us, telling you to do exactly that.)

by Jörg at July 02, 2023 06:38 PM

January 12, 2023

Things on a content management system - Jörg Hoh

Sling models performance (part 3)

In the first and second part of this series “Sling Models performance” I covered aspects which can degrade the performance of your Sling Models, be it by not specifying the correct injector or by re-using complex models (with heavy @PostConstruct methods) for very simple cases.

And there is another aspect when it comes to performance degradation, and it starts with a very cool convenience function: Sling Models can create a whole tree of objects. Imagine this code as part of a Sling Model:

@ChildResource
AnotherModel child;

It will adapt the child-resource named “child” into the class “AnotherModel” and inject it. This nesting is a cool feature and can be a time-saver if you have a more complex resource structure to model your content.

But it also comes with a price, because it will create another Sling Model object; and even that Sling Model can trigger the creation of more Sling Models, and so on. And as I have outlined in my previous posts, the creation of these Sling Models does not come for free. So if your “main Sling Model” internally creates a whole tree of Sling Models, the required time will increase. Which can be justified, but not if you just need a fraction of the data of these Sling Models. So is it worth spending 10 milliseconds to create a complex Sling Model just to call a simple getter on it, if you could retrieve this information alone in just 10 microseconds?

So this is a situation, where I need to repeat what I have written already in part 2:


When you build your Sling Models, try to resolve all data lazily, when it is requested the first time.

Sling Model Performance (part 2)

But unfortunately, injectors do not work lazily but eagerly; injections are executed as part of construction of the model. Having a lazy injection would be a cool feature …

So until this is available, you should check the re-use of Sling Models quite carefully; always consider how much work is actually done in the background, and whether the value of reusing that Sling Model is worth the time spent in rendering.
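
One way to keep the convenience without paying the price up front is to inject only the child Resource and adapt it lazily. This is a sketch only; the class and child names follow the example above, and depending on your content you may want to mark the injection as optional:

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.ChildResource;

@Model(adaptables = Resource.class)
public class ParentModel {

  // injecting the plain Resource is cheap
  @ChildResource(name = "child")
  private Resource childResource;

  private AnotherModel child;

  // the (potentially expensive) child model is only created when somebody asks for it
  public AnotherModel getChild() {
    if (child == null && childResource != null) {
      child = childResource.adaptTo(AnotherModel.class);
    }
    return child;
  }
}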

by Jörg at January 12, 2023 04:38 PM

January 02, 2023

Things on a content management system - Jörg Hoh

The most expensive HTTP request

TL;DR: When you do a performance test for your application, also test a situation where you just fire a large number of invalid requests, because you need to know if your error-handling is good enough to withstand this often unplanned load.

In my opinion the most expensive HTTP requests are the ones which return with a 404. Because they don’t bring any value, are not as easily cacheable as others and are very easy to generate. If you are looking into AEM logs, you will often find requests from random parties which fire a lot of requests, obviously trying to find vulnerable software. But in AEM these always fail, because there are no resources with these names, returning a statuscode 404. But this turns into a problem if these 404 pages are complex to render, taking 1 second or more. In that case requesting 1000 non-existing URLs can turn into a denial of service.

This can even get more complex if you work with suffixes, and the end user can just request the suffix, because you prepend the actual resource via mod_rewrite on the dispatcher. In such situations the requested resource is present (the page you configured), but the suffix can be invalid (for example pointing to a non-existing resource). Depending on the implementation you find out about this situation very late; and then you have already rendered a major part of the page just to find out that the suffix is invalid. This can also lead to a denial of service, but is much harder to mitigate than the plain 404 case.
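
One mitigation is to validate the suffix as early as possible and fail cheaply; a minimal sketch (the servlet itself is hypothetical, the suffix check is standard Sling API):

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServletResponse;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

public class SuffixAwareServlet extends SlingSafeMethodsServlet {

  @Override
  protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
      throws ServletException, IOException {
    // validate the suffix before doing any expensive rendering work
    Resource suffixResource = request.getRequestPathInfo().getSuffixResource();
    if (suffixResource == null) {
      response.sendError(HttpServletResponse.SC_NOT_FOUND);
      return;
    }
    // ... only now start the actual rendering ...
  }
}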

So what’s the best way to handle such situations? You should test for such a situation explicitly. Build a simple performance test which just fires a few hundred requests triggering a 404, and observe the response time of the regular requests. It should not degrade! If you need to simplify your 404 pages, then do that! Many popular websites have very stripped-down 404 pages for just that reason.

And when you design your URLs you should always have in mind these robots, which just show up with (more or less) random strings.

by Jörg at January 02, 2023 01:56 PM

December 21, 2022

Things on a content management system - Jörg Hoh

AEM article review December 2022

I have been doing this blog for quite some time now (the first article in this blog dates back to December 2008! That was the time of CQ 5.0! OMG), and of course I am not the only one writing on AEM. Actually the number of articles which are produced every month is quite large, but I am often a bit disappointed, because many just reproduce some very basic aspects of AEM which can be found in many places. The amount of new content which describes aspects barely covered by other blog posts or the official product documentation is small.

For myself I try to focus on such topics, offer unique views on the product and provide recommendations how things can be done (better), all based on my personal experiences. I think that this type of content is appreciated by the community, and I get good feedback on it. To encourage the broader community to come up with more content covering new aspects I will do a little experiment and promote a few selected articles of others. I think that these articles show new aspects or offer a unique view on certain areas of AEM.

Depending on the feedback I will decide if I continue with this experiment. If you think that your content also offers new views, uncovers hidden features or suggests best practices, please let me know (see my contact data here). I will judge these proposals on the above mentioned criteria. But of course it will still be my personal decision.

Let’s start with Theo Pendle, who has written an article on how to write your own custom injector for Sling Models. The example he uses is a really good one, and he walks you through all the steps and explains very well why all that is necessary. I like the general approach of Theo’s writing and consider the case of safely injecting cookie values a valid one for such an injector. But in general I think that there are not many other cases out there where it makes sense to write custom injectors.

Also on a technical level John Mitchell has his article “Using Sling Feature Flags to Manage Continous Releases“, published on the Adobe Tech Blog. He introduces Sling Features and how you can use them to implement Feature Flags. And that’s something I have not seen used yet in the wild, and also the documentation is quite sparse on it. But he gives a good starting point, although a more practical example would be great 🙂

The third article I like the most. Kevin Nenning writes on “CRXDE Lite, the plague of AEM“. He outlines why CRXDE Lite has gained such a bad reputation within Adobe, that disabling CRXDE Lite is part of the golive checklist for quite some time. But on the other hand he loves the tool because it’s a great way for quick hacks on your local development instance and for a general read-only tool. This is an article every AEM developer should read.
And in case you haven’t seen it yet: AEM as a Cloud Service offers the repository browser in the developer console for a read-only view on your repo!

And finally there is Yuri Simione (an Adobe AEM champion), who published 2 articles discussing the question “Is AEM a valid Content Services Plattform?” (article 1, article 2). He discusses an implementation which is based on Jackrabbit/Oak and Sling (but not AEM) to replace an aging Documentum system. And finally he offers an interesting perspective on the future of Jackrabbit. Definitely a read if you are interested in a more broader use of AEM and its foundational pieces.

That’s it for December. I hope you enjoy these articles as much as I did, and that you can learn from them and get some new inspiration and insights.

by Jörg at December 21, 2022 05:29 PM

December 12, 2022

Things on a content management system - Jörg Hoh

Sling Models performance, part 2

In the last blog post I demonstrated the impact of the correct type of annotations on performance of Sling Models. But there is another aspect of Sling Models, which should not be underestimated. And that’s the impact of the method which is annotated with @PostConstruct.

If you are not interested in the details, just skip to the conclusion at the bottom of this article.

To illustrate this aspect, let me give you an example. Assume that you have a navigation (or list component) in which you want to display only pages of the type “product page” which are specifically marked to be displayed. Because you are a developer who favors clean code, you already have a “ProductPageModel” Sling Model which also offers a “showInNav()” method. So your code will look like this:

List<Page> pagesToDisplay = new ArrayList<>();
Iterator<Page> children = page.listChildren();
while (children.hasNext()) {
  Page child = children.next();
  ProductPageModel ppm = child.adaptTo(ProductPageModel.class);
  if (ppm != null && ppm.showInNav()) {
    pagesToDisplay.add(child);
  }
}

This works perfectly fine; but I have seen this approach as the root cause of severe performance problems. Mostly because the ProductPageModel is designed as the one and only Sling Model backing a product page; the @PostConstruct method of the ProductPageModel contains all the logic to retrieve and calculate all required information, for example product information, datalayer information, etc.

But in this case only a simple property is required, all other properties are not used at all. That means that the majority of the operations in the @PostConstruct method are pure overhead in this situation and consuming time. It would not be necessary to execute them at all in this case.

Many Sling Models are designed for a single purpose, for example rendering a page, where such a Sling Model is used extensively by an HTL scriptlet. But there are cases where the very same Sling Model class is used for different purposes, where only a subset of this information is required. But also in this case the whole set of properties is resolved, as if you needed it for the rendering of the complete page.
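
As an illustration (a sketch only, loosely following the benchmark classes mentioned below; the property name is hypothetical), the inherited lookup can be moved out of @PostConstruct into the getter, so it only runs when that particular piece of information is actually needed:

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.Self;

import com.day.cq.commons.inherit.HierarchyNodeInheritanceValueMap;
import com.day.cq.commons.inherit.InheritanceValueMap;

@Model(adaptables = Resource.class)
public class LazyProductPageModel {

  @Self
  private Resource resource;

  private Boolean showInNav;

  // no @PostConstruct: the (potentially expensive) inherited lookup only
  // happens the first time somebody actually calls this getter
  public boolean showInNav() {
    if (showInNav == null) {
      InheritanceValueMap ivm = new HierarchyNodeInheritanceValueMap(resource);
      showInNav = ivm.getInherited("showInNav", Boolean.FALSE);
    }
    return showInNav;
  }
}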

I prepared a small test-case on my github account to illustrate the performance impact of such code on the performance of the adaption:

  • ModelWithPostConstruct contains a method annotated with @PostConstruct, which resolves another property via an InheritanceValueMap.
  • ModelWithoutPostConstruct provides the same semantics, but executes the calculation lazily, only when the information is required.

The benchmark is implemented in a simple servlet (SlingModelPostConstructServlet), which you can invoke on the path “/bin/slingmodelpostconstruct”

$ curl -u admin:admin http://localhost:4502/bin/slingmodelpostconstruct
test data created below /content/cqdump/performance
de.joerghoh.cqdump.performance.core.models.ModelWithPostconstruct: single adaption took 50 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithoutPostconstruct: single adaption took 11 microseconds

The overhead is quite obvious, almost 40 microseconds per adaption; of course it depends on the amount of logic within this @PostConstruct method. And this @PostConstruct method is quite small compared to other SlingModels I have seen. And in the cases where only a minimal subset of the information is required, this is pure overhead. Of course the overhead is often minimal if you just consider a single adaption, but given the large number of Sling Models in typical AEM projects, the chance is quite high that this turns into a problem sooner or later.

So you should pay attention to the different situations in which you use your Sling Models. Especially if you have such vastly different cases (rendering the full page vs just getting one property) you should invest a bit of time and optimize them for these usecases. Which leads me to the following:

Conclusion

When you build your Sling Models, try to resolve all data lazily, when it is requested the first time. Keep the @PostConstruct method as small as possible.

by Jörg at December 12, 2022 08:17 AM

November 28, 2022

Things on a content management system - Jörg Hoh

Sling Model Performance

In my daily job as an SRE for AEM as a Cloud Service I often have to deal with performance questions, especially in the context of migrations of customer applications. Applications sometimes perform differently on AEM CS than they did on AEM 6.x, and a part of my job is to look into these cases.

This often leads to interesting deep dives and learnings; you might have seen this reflected in the postings of this blog 🙂 The problem this time was a tight loop like this:

for (Resource child : resource.getChildren()) {
  SlingModel model = child.adaptTo(SlingModel.class);
  if (model != null && model.hasSomeCondition()) {
    // some very lightweight work
  }
}

This code performed well with 1000 child resources on an AEM 6.x authoring instance, but quite poorly on an AEM CS authoring instance with the same number of child nodes. And the problem is not the large number of child nodes …

After wading knee-deep through TRACE logs I found the problem at an unexpected location. But before I present you the solution and some recommendations, let me explain some background. Of course you can skip the next section and jump directly to the TL;DR at the bottom of this article.

SlingModels and parameter injection

One of the beauties of Sling Models is that these are simple PoJos, and properties are injected by the Sling Models framework. You just have to add matching annotations to mark them accordingly. See the full story in the official documentation.

The simple example in the documentation looks like this:

@Inject
String title;

which (typically) injects the property named “title” from the resource this model was adapted from. In the same way you can inject services, child-nodes and many other useful things.

To make this work, the framework uses an ordered list of Injectors, which are able to retrieve values to be injected (see the list of available injectors). The first injector which returns a non-null value is taken and its result is injected. In this example the ValueMapInjector is supposed to return a property called “title” from the valueMap of the resource, which is quite early in the list of injectors.

Ok, now let’s understand what the system does here:

@Inject
@Optional
String doesNotExist;

Here an optional field is declared, and if there is no property called “doesNotExist” in the valueMap of the resource, other injectors are queried whether they can handle that injection. Assuming that no injector can do that, the value of the field “doesNotExist” remains null. No problem at first sight.

But indeed there is a problem, and it’s performance. Even the lookup of a non-existing property (or node) in the JCR takes time, and doing this a few hundred or even thousand times in a loop can slow down your code. And a slower repository (like the clustered MongoDB persistence of the AEM as a Cloud Service authoring instances) even more.

To demonstrate it, I wrote a small benchmark (source code on my github account), which does a lot of adaptions to Sling Models. When deployed to AEM 6.5.5 or later (or a recent version of the AEM CS SDK) you can run it via curl -u admin:admin http://localhost:4502/bin/slingmodelcompare

This is its output:

de.joerghoh.cqdump.performance.core.models.ModelWith3Injects: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith3ValueMaps: single adaption took 16 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalValueMap: single adaption took 18 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalValueMaps: single adaption took 20 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWithOptionalInject: single adaption took 83 microseconds
de.joerghoh.cqdump.performance.core.models.ModelWith2OptionalInjects: single adaption took 137 microseconds

It’s a benchmark which, on a very simple list of resources, tries adaptions to a number of model classes which differ in their type of annotations. Adapting to a model which injects 3 properties takes approximately 20 microseconds, but as soon as a model has a failing injection (which is declared with “@Optional” to avoid failing the adaption), the duration increases massively to 83 microseconds, and even 137 microseconds when 2 of these failed injections are there.

Ok, so having a few of such failed injections does not pose a problem per se (you could do 2’000 within 100 milliseconds), but this test setup is a bit artificial, which makes these 2’000 a really optimistic number:

  • It is running on a system with a fast repository (the SDK on my M1 Macbook); so for example the ChildResourceInjector has almost no overhead to test for the presence of a childResource called “doesNotExist”. This can be different, for example on AEM CS Author the Mongo storage has a higher latency than the segmentStore on the SDK or a publish. If that (non-existing) child-resource is not in the cache, there is an additional latency in the range of 1ms to load that information. What for? Well, basically for nothing.
  • The OsgiInjector is queried as well, which tries to access the OSGI ServiceRegistry; this registry is a central piece of OSGI, and its consistency is heavily guarded by locks. I have seen this injector being blocked by these locks, which also adds latency.

That means that these 50-60 microseconds could easily multiply, and then performance becomes a problem. And this is the problem which initially sparked this investigation.

So what can we do to avoid this situation? That is quite easy: Do not use @Inject, but use the specialized injectors directly (see them in the documentation). While the benefit is probably quite small when it comes to properties which are present (ModelWith3Injects took 18 microseconds vs 16 microseconds for ModelWith3ValueMaps), the difference gets dramatic as soon as we consider failed injections:

Even in my local benchmark the improvement can be seen quite easily: there is almost no overhead for such a failed injection if I explicitly mark it as an injection via the ValueMapInjector. And as mentioned, this overhead can be even larger in reality.

Still, this is a micro-optimization in the majority of all cases; but as mentioned already, implementing many of these optimizations can definitely make a difference.

TL;DR Use injector-specific annotations

Instead of @Inject, use the correct injector directly. You normally know exactly where you want that injected value to come from.
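As an illustration, a minimal sketch of a Sling Model using the injector-specific annotations (the class and property names are made up for this example):

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.ChildResource;
import org.apache.sling.models.annotations.injectorspecific.InjectionStrategy;
import org.apache.sling.models.annotations.injectorspecific.ValueMapValue;

@Model(adaptables = Resource.class)
public class TeaserModel {

    // only the ValueMap injector is consulted; a missing property is cheap to detect
    @ValueMapValue(injectionStrategy = InjectionStrategy.OPTIONAL)
    private String title;

    // only the ChildResource injector is consulted
    @ChildResource(injectionStrategy = InjectionStrategy.OPTIONAL)
    private Resource image;

    public String getTitle() {
        return title;
    }

    public Resource getImage() {
        return image;
    }
}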
And by the way: did you know that the use of @Inject is discouraged in favor of these injector-specific annotations?

Update: The Sling Models documentation has been updated and explicitly discourages the use of @Inject now.

by Jörg at November 28, 2022 08:25 AM

October 31, 2022

Things on a content management system - Jörg Hoh

Limits of dispatcher caching with AEM as a Cloud Service

In the last blog post I proposed 5 rules for Caching with AEM, how you should design your caching strategy. Today I want to show another aspect of rule 1: Prefer caching at the CDN over caching at the dispatcher.

I already explained that the CDN is always located closer to the consumer, so the latency is lower and the experience will be better. But when we limit the scope to AEM as a Cloud Service, the situation gets a bit complicated, because the dispatcher is not able to cache files for more than 24 hours.

This is caused by a few architectural decisions done for AEM as a Cloud Service:

These decisions lead to the fact that no dispatcher cache can hold files for more than 24 hours, because the instance is terminated after that time. And there are other situations where the publishes are re-created, for example during deployments and up/down-scaling situations; then the cache does not even hold files for 24 hours, but maybe just for 3 hours.

This naturally can limit the cache-hit ratio in cases where you have content which is requested frequently but not changed for days, weeks or even months. In an AEM as a Cloud Service setup these files are then rendered at least once per day (or more often, see above) per publish/dispatcher, while in other setups (for example AMS or on-prem setups, where long-living dispatcher caches are pretty much the default) they can be delivered from the dispatcher cache without the need to re-render them every day.

The CDN does not have this limitation. It can hold files for days and weeks and deliver them, if the TTL settings allow this. But as you can control the CDN only via TTLs, you have to make a tradeoff between the cache-hit ratio on the CDN and the accuracy of the delivered content with regard to potential changes.

That means:

  • If you have files which do not change, you just set a large TTL on them and let the CDN handle them. A good example are clientlibs (JS and CSS files), because they have a unique name (an additional selector which is created as a hash over the content of the file).
  • If there’s a chance that you make changes to such content (mostly pages), you should set a reasonable TTL (and of course “stale-while-revalidate”) and accept that your publishes need to re-render these pages when the time has passed.

That’s a bit of a drawback of the AEM as a Cloud Service setup, but on the other hand your dispatcher caches are regularly cleared.

by Jörg at October 31, 2022 02:02 PM

October 17, 2022

Things on a content management system - Jörg Hoh

Dispatcher, CDN and Caching

In today’s web performance discussions, there is a lot of focus on the browser as the most important factor. Google defines Core Web Vitals, and there are many other aspects which are important for a fast site. Plus then SEO …

While many developers focus on these, I see that many sites often neglect the importance of proper caching. While many of these sites already use a CDN (in AEM CS a CDN is part of any offering), they often do not use the CDN in an optimal way; this can result in slow pages (because of the network latency) and also unnecessary load on the backend systems.

In this blog post I want to outline some ways how you can optimize your site for caching, with a focus on AEM in combination with a CDN. It does not really matter if it is AEM as a Cloud Service or AEM on AMS or on-premises, these recommendations can be applied to all of them.

Rule 1: Prefer caching at the CDN over caching at the dispatcher

The dispatcher is typically co-located with your AEM instances. That means there can be a high latency from the dispatcher to the end-user, especially if your end-users are spread across the globe. For example the average latency between Frankfurt/Germany and Sydney/Australia is approximately 250ms, and that makes browsing a website not really fast. Using a decent CDN can reduce these numbers dramatically.

Also a CDN is better suited to handle millions of requests per minute than a bunch of VMs running dispatcher instances, both from a cost perspective and from a perspective of knowhow required to operate at that scale.

That means that your caching strategy should aim for optimal caching at the CDN level. The dispatcher is fine as a secondary cache to handle cache misses or expired cache items. But ideally no end-user request should ever make it through to the dispatcher.

Rule 2: Use TTL-based invalidation

The big advantage of the dispatcher is the direct control of the caching. You deliver your content from the cache until you change that content. Immediately after the change the cache is actively invalidated, and your changed content is delivered. But you cannot use the same approach for CDNs; and while CDNs have made reasonable improvements to reduce the time to actively invalidate content, it can still take minutes.

A better approach is to use a TTL-based (time-to-live) invalidation (or rather: expiration), where every CDN node can decide on its own if a file in the cache is still valid or not. And if the content is too old, it’s getting refetched from the origin (your dispatchers).

Although this approach introduces some latency from the time of content activation to the time all users world-wide are able to see it, such a latency is acceptable in general.

Rule 3: Staleness is not (necessarily) a problem

When you optimize your site, you need not only optimize that every request is requested from the CDN (instead from your dispatchers); but you also should think about what happens if a requested file is expired on the CDN. Ideally it should not matter much.

Imagine that you have a file which is configured with a TTL of 300 seconds. What should happen if this file is requested 301 seconds after it has been stored in the CDN cache? Should the CDN still deliver it (and accept that the user receives a file which can be a bit older than specified), or do you want the user to wait until the CDN has obtained a fresh copy of that file?
Typically you accept that staleness for a moment and deliver the old copy for a while, until the CDN has obtained a fresh copy in the background. Use the “stale-while-revalidate” caching directive to configure this behavior.
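For illustration, a possible Cache-Control header combining both settings (the concrete values are just examples and need to be tuned per content type):

Cache-Control: max-age=300, stale-while-revalidate=3600

This tells the cache to consider the file fresh for 300 seconds, and to keep serving the stale copy for up to another hour while it refetches a new version from the origin in the background.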

Rule 4: Pay attention to the 404s

An HTTP status 404 (“File not found”) is tricky to handle, because by default a 404 is not cached at the CDN. That means that all those requests will hit your dispatcher and eventually even your AEM instances, which are the authoritative source to answer whether such a file exists. But the number of requests an AEM instance can handle is much smaller than the number the dispatchers or even the CDN can handle. And you should reserve these precious resources for doing something more useful than responding with “sorry, the resource you requested is not here”.

For that reason check the 404s and handle them appropriately; you have a number of options for that:

  • Fix incorrect links which are under your control.
  • Create dispatcher rules or CDN settings which handle request patterns which you don’t control, and return a 404 from there.
  • You also have the option to allow the CDN to cache a 404 response.

In any case you should manage the 404s, because they are the most expensive type of requests: you spend resources to deliver “nothing”.

Rule 5: Know your query strings

Query strings have traditionally been used a lot to provide parameters to the server-side rendering process, and you might use that approach as well in your AEM application. But query strings are also used a lot to tag campaign traffic for correct attribution; you might have seen such requests already, they often contain parameters like “utm_source”, “fbclid” etc. These parameters have no impact on the server-side rendering!
Because requests with query strings cannot be cached by default, the CDN and the dispatcher will forward every request containing any query string to AEM. And AEM is again the most scarce resource, and rendering there will again impose the latency hit on your site visitors.

The dispatcher has the ability to remove named query strings from the request, which enables it to serve such requests from the dispatcher cache; that’s not as good as serving these requests from the CDN, but much better than handling them on AEM. You should use that as much as possible.
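As a sketch (the concrete parameter names are just examples), this is configured via the ignoreUrlParams section of the dispatcher farm configuration; parameters matching an “allow” rule are ignored for caching purposes:

/ignoreUrlParams
  {
  /0001 { /glob "*" /type "deny" }
  /0002 { /glob "utm_*" /type "allow" }
  /0003 { /glob "fbclid" /type "allow" }
  }

With such a setup, requests which only differ in the listed tracking parameters can be served from the dispatcher cache instead of being rendered by AEM.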

If you follow these rules, you have the chance not only to improve the user experience for your visitors, but at the same time to make your site much more scalable and resilient against attacks and outages.

by Jörg at October 17, 2022 06:58 AM

July 05, 2022

Things on a content management system - Jörg Hoh

What’s the maximum size of a node in JCR/AEM?

An interesting question which comes up every now and then is: “Is there a limit to how large a JCR node can get?”. And as always in IT, the answer is not that simple.

In this post I will answer that question and also outline why this limit is hardly a constraint in AEM development. Also I will show ways how you can design your application so that this limit is not a problem at all.

(Allow me a personal note here: For me the most interesting part of that question is the motivation behind it. When this question is asked I typically have the impression that the folks know that they are a bit off-limits here, because this is a topic which is discussed very rarely (if at all). That means they know that they (plan to) do something which violates some good practices, and for that reason they seek re-assurance. For me this always leaves the question: Why do they do it then? Because when you follow the recommended ways and content architecture patterns, you will never hit such a limit.)

We first have to distinguish between binaries and non-binaries. For binaries there is no real limit as they are stored in the blobstore. You can put files with 50GB in size there, not a problem. Such binaries are represented either using the nodetype “nt:file” (used most often) or using binary properties (rarely used).

And then there is the non-binary data. This data comprises all other node- and property-types, where the information is stored within the nodestore (often also as multi-value properties). Here there are limits.

In AEM CS MongoDB is used as data storage, and the maximum size of a MongoDB document is 16 Megabytes. As an approximation (it’s not always the case), you can assume that a single JCR node with all its properties is stored in a single MongoDB document, which directly results in a maximum size per node: 16 Megabytes.

In reality a node cannot get that large, because other data is also stored inside that document. I recommend never storing more than 1 Megabyte of non-binary properties inside a single node. Technically you don’t have that limit in a TarMK/SegmentTar-only setup, but I would not exceed it there either: you will run into all kinds of interesting problems, and there is barely any experience with such large nodes in the AEM world.

If you actually violate this limit in the size of a document, you get this very nasty exception and your content will not be stored:

javax.jcr.RepositoryException: OakOak0001: Command failed with error 10334 (BSONObjectTooLarge): 'BSONObj size: 17907734 (0x1114016) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: "7:/var/workflow/instances/server840/2022-06-01/xxx_reviewassetsworkflow1_114/history"' on server cmgbr9sharedcluster51rt-shard-00-01.xxxxx:27017. The full response is {"operationTime": {"$timestamp": {"t": 1656435709, "i": 87}}, "ok": 0.0, "errmsg": "BSONObj size: 17907734 (0x1114016) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"7:/var/workflow/instances/server840/2022-06-01/xxx_reviewassetsworkflow1_114/history\"", "code": 10334, "codeName": "BSONObjectTooLarge", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1656435709, "i": 87}}, "signature": {"hash": {"$binary": "MXahc2R2arLq+rc41fRzIFKzRAw=", "$type": "00"}, "keyId": {"$numberLong": "7059363699751911425"}}}} [7:/var/workflow/instances/server840/2022-06-01/xxx_reviewassetsworkflow1_114/history]
at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:250) [org.apache.jackrabbit.oak-api:1.42.0.T20220608154910-4c59b36]

But is this really a limit which is hurting AEM developers and customers? Actually I don’t think so. And there are at least 2 good reasons why I believe this:

  • Pages barely have that much content stored in a single component (be it the jcr:content node or any component beneath it), and the same applies to assets. The few instances where I have seen this exception happened because a lot of “data” was stored inside properties (e.g. complete files), which would have better been stored in “nt:file” nodes as binaries.
  • Since version 1.0 Oak logs a warning if it needs to index properties larger than 100 Kilobytes, and I have rarely seen this warning in the wild. There are a few prominent examples in AEM itself where this warning is written for nodes in /libs.

So the best way to find out if you are close to run into this problem with the total size of the documents is to check the logs for this warning:

05.07.2022 09:31:57.326 WARN [async-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.LuceneDocumentMaker String length: 116946 for property: imageData at Node: /libs/wcm/core/content/editors/template/tour/content/items/third is greater than configured value 102400

Having these warnings in the logs means that you should pay attention to them; in this case it’s not a problem, because this property is unlikely to get any larger over time. But you should pay attention to those properties which can grow over time.
(Note that there is no warning if you have many smaller properties which in sum hit the limit of the MongoDB document.)

How to mitigate?

As mentioned above it’s hard to come up with cases where this is actually a problem, especially if you are developing in line with the AEM guidelines. The only situation where I can imagine this limit to be a problem is when a lot of data is stored within a node which is to be consumed by custom logic. But in this case you own both the data and the logic, and therefore you have the chance to change the implementation in a way that this situation does not occur anymore.

When you design your content and data structure, you should be aware of this limit and not store more than 1 Megabyte within a single node. There is no workaround once you get that exception; the only way to make it work again is to fix the data structure and the code for it. There are 2 approaches:

  • Split the data across more nodes, ideally in a tree-ish way where you can also use application knowledge to store it in an intuitive (and often faster) way.
  • If you just have a single property which is that large, you could also try to convert it into a binary property. This is much simpler, as in the majority of cases you just need to change the type of the property from String to Binary. The type conversions are done implicitly, but if you store actual string data you should take care of the encoding (see the sketch after this list).
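As a rough sketch of the second approach (the path, property name and class name are made up for this example), storing string data explicitly as a binary with a defined encoding could look like this:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.Session;

public class LargePropertyExample {

    public void storeAsBinary(Session session, String largeString) throws Exception {
        Node node = session.getNode("/content/myapp/data");
        // create the binary explicitly from UTF-8 bytes, so the encoding is under control
        Binary binary = session.getValueFactory()
                .createBinary(new ByteArrayInputStream(largeString.getBytes(StandardCharsets.UTF_8)));
        node.setProperty("payload", binary);
        session.save();
    }
}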

Now you know the answer to the question “What’s the maximum size of a node in JCR/AEM” and why it should never be a problem for you. I also outlined ways how you can avoid hitting this limit at all, by choosing an appropriate node structure or by storing large data in binaries instead of string properties.

Happy developing and I hope you never encounter this situation!

by Jörg at July 05, 2022 12:03 PM

June 20, 2022

Things on a content management system - Jörg Hoh

Sling Scheduled Jobs vs Sling Scheduler

Apache Sling and AEM provide 2 different approaches to start processes at a given time or in a given interval. It is not always trivial to make the right decision between these two, and I have seen a few cases of misuse already. Let’s dive into this topic and I will outline in what situation to use the Scheduler and when to use Scheduled Jobs.

Let me outline the differences between the two with a simple comparison:

  • Timing is persisted across restarts: Scheduled Job: Yes / Scheduler: No
  • Start a job via OSGI annotations: Scheduled Job: No / Scheduler: Yes
  • Start a job via API: Scheduled Job: Yes / Scheduler: Yes
  • Trigger on every cluster node: Scheduled Job: Yes (job execution then follows regular Sling Jobs rules) / Scheduler: Yes
  • Trigger just once per cluster: Scheduled Job: Yes (job execution then follows regular Sling Jobs rules) / Scheduler: Yes (only on cluster leader possible)

Comparison between Scheduled Jobs and Scheduler

To get a better understanding of these 2 distinct features, I cover 2 use cases and list which feature is the better match for each.

Execute code exactly once at a given time

That’s a case for the Scheduled Job. Even if the job fails because the executing Sling instance goes down, it will be re-scheduled and tried again.

Here the exactly once semantics means that this is a single job with a global scope. Missing it is not an option. It might be delayed if the triggering date is missed or the execution is aborted, but it will be executed as soon as possible after the scheduled time has passed.
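A minimal sketch of scheduling such a job via the Sling Jobs API (the job topic, property names and class name are made up for this example):

import java.util.Date;
import java.util.Map;
import org.apache.sling.event.jobs.JobManager;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(service = NewsletterScheduler.class)
public class NewsletterScheduler {

    @Reference
    private JobManager jobManager;

    public void scheduleSendout(Date when) {
        // the scheduled job is persisted and survives instance restarts
        jobManager.createJob("com/acme/newsletter/send")
                .properties(Map.of("newsletterId", "summer-campaign"))
                .schedule()
                .at(when)
                .add();
    }
}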

Periodic jobs which affect just a single AEM instance

Use the Scheduler whenever you execute periodic/incremental jobs like cleanups, data imports etc. It’s not a problem if an execution is missed, as long as the job runs at the next scheduled time (or is triggered during startup if necessary).
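As an illustration, a minimal sketch of a periodic task registered via the Sling Commons Scheduler whiteboard pattern (the cron expression and class name are made up for this example):

import org.osgi.service.component.annotations.Component;

@Component(service = Runnable.class,
        property = {
                "scheduler.expression=0 0 2 * * ?",        // every night at 02:00
                "scheduler.concurrent:Boolean=false"       // never run two executions in parallel
        })
public class NightlyCleanupTask implements Runnable {

    @Override
    public void run() {
        // perform the periodic cleanup; a missed run is simply picked up at the next execution
    }
}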

Another note on choosing the right approach: You should not need to create a Scheduled Job during startup (that means on the startup of services); for these cases it’s normally better to use the Scheduler. There might be rare cases where a Scheduled Job is the right solution here, but in the majority of cases you should just use the Scheduler.

A word of caution when you use Scheduled Jobs with a periodic pattern: As they are persisted, you need to un-register them when you don’t need them anymore.

by Jörg at June 20, 2022 05:39 PM

March 17, 2022

Things on a content management system - Jörg Hoh

How to analyze “Authentication support missing”

Errors and problems in running software often manifest in very interesting and non-obvious ways. A problem in location A manifests itself only with a seemingly unrelated error message in a different location B.

We also have one example of such a situation in AEM, and that’s the famous “Authentication support missing” error message. I often see the question “I got this error message; what should I do now?”, and so I decided: it’s time to write a blog post about it. Here you are.

“Authentication support missing” is actually not even the real problem: there is no authentication module available, so you cannot authenticate, but in 99.99% of the cases this is just a symptom. The default AEM authentication depends on a running SlingRepository service, and a running Sling repository has a number of dependencies itself.

I want to highlight 2 of these dependencies, because they tend to cause problems most often: the Oak repository and the RepositoryInitializer service. Both must start and run successfully before the SlingRepository service can be registered. Let’s look into each of these dependencies.

The Oak repository

The Oak repository is a quite complex system in itself, and there are many reasons why it might not start. To name a few:

  • Consistency problems with the repository files on disk (for whatever reasons), permission problems on the filesystem, full disks, …
  • Connectivity issues towards the storage (especially if you use a database or mongodb as storage)
  • Messed up configuration

If you have an “Authentication support missing” message, your first check should be on the Oak repository, typically via the AEM error.log. If you find an ERROR message logged by any “org.apache.jackrabbit.oak” class during the startup, this is most likely the culprit. Investigate from there.

Sling Repository Initializer (a.k.a. “repoinit”)

Repoinit is designed to ensure that a certain structure in the repository is provided even before any consumer accesses it. All of the available scripts must be executed, and any failure will immediately terminate the startup of the SlingRepository service. Check also my latest blog post on the Sling Repository Initializer for details on how to prevent such problems.

Repoinit failures are typically quite prominent in the AEM error.log, just search for an ERROR message starting with this:

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted …

These are the 2 biggest contributors to the “Authentication support missing” error message. Of course there are more reasons why it could appear, but to be honest, I have only seen these 2 cases in the last years.

I hope that this article helps you to investigate such situations more swiftly.

by Jörg at March 17, 2022 04:41 PM

March 11, 2022

Things on a content management system - Jörg Hoh

How to deal with RepoInit failures in Cloud Service

Some years ago, even before AEM as a Cloud Service, the RepoInit language was implemented as part of Sling (and AEM) to create repository structures directly on the startup of the JCR repository. With it your application can rely on some well-defined structures always being available.

In this blog post I want to walk you through a way how you can test repoinit statements locally and avoid pipeline failures because of it.

Repoinit statements are deployed as part of OSGI configurations; and that means that during the development phase you can work in an almost interactive way with it. Also exceptions are not a problem; you can fix the statement and retry.

The situation is much different when you already have repoinit statements deployed and you start up your AEM (to be exact: the Sling Repository service) again. In this case all repoinit statements are executed as part of the startup of the repository, and any exception in the execution of repoinit will stop the startup of the repository service and render your AEM unusable. In the case of CloudManager and AEM as a Cloud Service this will break your deployment.

Let me walk you through 2 examples of such an exception and how you can deal with it.

*ERROR* [Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted java.lang.RuntimeException: Session.save failed: javax.jcr.nodetype.ConstraintViolationException: OakConstraint0025: /conf/site/configuration/favicon.ico[[nt:file]]: Mandatory child node jcr:content not found in a new node 
at org.apache.sling.jcr.repoinit.impl.AclVisitor.visitCreatePath(AclVisitor.java:167) [org.apache.sling.jcr.repoinit:1.1.36] 
at org.apache.sling.repoinit.parser.operations.CreatePath.accept(CreatePath.java:71)

In this case the exception describes quite in detail what actually went wrong. It failed when saving, and it says that /conf/site/configuration/favicon.ico (of type nt:file) was affected. The problem is that the mandatory child node “jcr:content” is missing.

Why is it a problem? Because every node of nodetype “nt:file” requires a “jcr:content” child node which actually holds the binary.

This is a case which you can detect very easily also on a local environment.

Which leads to the first recommendation:

When you develop in your local environment, you should apply all repoinit statements to a fresh environment, in which there are no manual changes. Because otherwise your repoinit statements rely on the presence of some things which are not provided by the repoinit scripts.

Having a mix of manual changes and repoinit on a local development environment and then moving the repoinit statements over untested often leads to failures in the CloudManager pipelines.

The second example is a very prominent one, and I see it very often:

[Apache SlingRepositoryStartup Thread #1] com.adobe.granite.repository.impl.SlingRepositoryManager Exception in a SlingRepositoryInitializer, SlingRepositoryservice registration aborted java.lang.RuntimeException: Failed to set ACL (java.lang.UnsupportedOperationException: This builder is read-only.) AclLine DENY {paths=[/libs/cq/core/content/tools], privileges=[jcr:read]} 
at org.apache.sling.jcr.repoinit.impl.AclVisitor.setAcl(AclVisitor.java:85)

It’s the well-known “This builder is read-only” version. To understand the problem and its resolution, I need to explain a bit the way the build process assembles AEM images in the CloudManager pipeline.

In AEM as a Cloud Service you have an immutable part of the repository, which consists of the trees “/libs” and “/apps”. They are immutable because they cannot be modified at runtime, not even with admin permissions.

During build time this immutable part of the image is built. This process merges both product side parts (/libs) and custom application parts (/apps) together. After that also all repoinit scripts run, both the ones provided by the product as well as any custom one. And of course during that part of the build these parts are writable, thus writing into /apps using repoinit is not a problem.

So why do you actually get this exception, when /libs and /apps are writable at that point? Because repoinit is executed a second time: during the “final” startup, when /apps and /libs are immutable.

Repoinit is designed around that idea, that all activities are idempotent. This means that if you want to create an ACL on /apps/myapp/foo/bar the repoinit statement is a no-op if that specific ACL already exists. A second run of repoinit will do nothing, but find everything still in place.

But if in the second run the system has to execute this action again, it’s not a no-op anymore. This means that the ACL is not there as expected, or whatever else the goal of that repoinit statement was.

And there is only one reason why this happens: there was some other action between these 2 executions of repoinit which changed the repository. The only other thing which modifies the repository is the installation of content packages.

Let’s illustrate this problem with an example. Imagine you have this repoinit script:

create path /apps/myapp/foo/bar
set acl on /apps/myapp/foo/bar
  allow jcr:read for mygroup
end

And you have a content package which comes with content for /apps/myapp, with its filter set to “overwrite”, but which does not contain this ACL.

In this case the operations leading to this error are these:

  • Repoinit sets the ACL on /apps/myapp/foo/bar
  • the deployment overwrites /apps/myapp with the content package, so the ACL is wiped
  • AEM starts up
  • Repoinit wants to set the ACL on /apps/myapp/foo/bar, which is now immutable. It fails and breaks your deployment.

The solution to this problem is simple: You need to adjust the repoinit statements and the package definitions (especially the filter definitions) in a way that the package installation does not wipe and/or overwrite any structure created by repoinit. And with “structure” I do not refer only to nodes, but also to nodetypes, properties etc. All must be identical, and in the best case they don’t interfere at all.

It is hard to validate this locally, as you don’t have an immutable /apps and /libs, but there is a test approach which comes very close to it:

  • Run all your repoinit statements in your local test environment
  • Install all your content packages
  • Enable write tracing (see my blog post)
  • Re-run all your repo-init statements.
  • Disable write tracing again

During the second run of the repoinit statements you should not see any write in the trace log. If you have any write operation, it’s a sign that your packages overwrite structures created by repoinit. You should fix these asap, because they will later break your CloudManager pipeline.

With this information at hand you should be able to troubleshoot any repoinit problems already on your local test environment, avoiding pipeline failures because of it.

by Jörg at March 11, 2022 11:28 AM

February 03, 2022

Things on a content management system - Jörg Hoh

The deprecation of Sling Resource Events

Sling events are used for many aspects of the system, and initially JCR changes were sent with them as well. But the OSGi eventing (which Sling events are built on top of) is not designed for huge volumes of events (thousands per second), a situation which can happen with AEM; and one of the most compelling reasons to get away from this approach is that all these event handlers (both resource change events and all others) share a single thread-pool.

For that reason the ResourceChangeListeners have been introduced. Here each listener provides detailed information about which changes it is interested in (restricted by path and type of the change); therefore Sling is able to optimise the listeners on the JCR level and does not listen for changes no one is interested in. This can reduce the load on the system and improve the performance.
For this reason the usage of OSGi resource event listeners is deprecated (although they still work as expected).

How can I find all the ResourceChangeEventListeners in my codebase?

That’s easy, because on startup for each of these ResourceChangeEventListeners you will find a WARN message in the logs like this:

Found OSGi Event Handler for deprecated resource bridge: com.acme.myservice

This will help you to identify all these listeners.

How do I rewrite them to ResourceChangeListeners?

In the majority of cases this should be straightforward. Make your service implement the ResourceChangeListener interface and provide these additional OSGi properties:

@Component(
    service = ResourceChangeListener.class,
    configurationPolicy = ConfigurationPolicy.IGNORE,
    property = {
        ResourceChangeListener.PATHS + "=/content/dam/asset-imports",
        ResourceChangeListener.CHANGES + "=ADDED",
        ResourceChangeListener.CHANGES + "=CHANGED",
        ResourceChangeListener.CHANGES + "=REMOVED"
})
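
The listener implementation itself then receives the changes via the onChange callback; a minimal sketch (the class name is made up, the path matches the properties above):

import java.util.List;
import org.apache.sling.api.resource.observation.ResourceChange;
import org.apache.sling.api.resource.observation.ResourceChangeListener;

public class AssetImportListener implements ResourceChangeListener {

    @Override
    public void onChange(List<ResourceChange> changes) {
        for (ResourceChange change : changes) {
            // react to ADDED/CHANGED/REMOVED events below /content/dam/asset-imports
            handle(change.getType(), change.getPath());
        }
    }

    private void handle(ResourceChange.ChangeType type, String path) {
        // application-specific handling goes here
    }
}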

With this switch you allow resource events to be processed separately in an optimised way; they no longer block other OSGi events.

by Jörg at February 03, 2022 06:58 PM

January 21, 2022

Things on a content management system - Jörg Hoh

How to handle errors in Sling Servlet requests

Error handling is a topic which developers rarely pay much attention to. It is done when the API forces them to handle an exception. And the most common pattern I see is the “log and throw” pattern, which means that the exception is logged and then re-thrown.

When you develop in the context of HTTP requests, error handling can get tricky, because you need to signal to the consumer of the response that an error happened and the request was not successful. Frameworks are designed in a way that they handle any exception internally and set the correct error code if necessary. And Sling is no different: if your code throws an exception (for example in the postConstruct of a Sling Model), the Sling framework catches it and sets the correct status code 500 (Internal Server Error).

I’ve seen code which catches exceptions itself and sets the status code on the response itself. But this is not the right approach, because with every exception handled this way the developer implicitly states: “These are my exceptions and I know best how to handle them”; almost as if the developer takes ownership of these exceptions and their root causes, and as if there were nothing which could handle this situation better.

This approach to handle exceptions on its own is not best practice, and I see 2 problems with it:

  • Setting the status code alone is not enough; the remaining parts of the request processing need to be stopped as well. Otherwise the processing continues as if nothing happened, which is normally not useful or even allowed. It’s hard to ensure this when the exception is caught.
  • Owning the exception handling removes the responsibility from others. In AEM as a Cloud Service Adobe monitors response codes and the exceptions causing them. And if there’s only a status code 500 but no exception reaching the SlingMainServlet, then it’s likely that this is ignored, because the developer claimed ownership of the exception (handling).

If you write a Sling servlet or code operating in the context of a request, it is best practice not to catch exceptions, but to let them bubble up to the Sling Main Servlet, which is able to handle them appropriately. Handle exceptions yourself only if you have a better way to deal with them than just logging them.
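For illustration, a minimal sketch of a servlet following this practice (the resource type and class name are made up for this example); a runtime exception is wrapped and re-thrown so that Sling’s error handling can respond with the proper status code:

import java.io.IOException;
import javax.servlet.Servlet;
import javax.servlet.ServletException;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

@Component(service = Servlet.class,
        property = {"sling.servlet.resourceTypes=myapp/components/export"})
public class ExportServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        try {
            response.getWriter().write(buildExport(request));
        } catch (RuntimeException e) {
            // do not swallow the exception; let the Sling error handling produce the 500
            throw new ServletException("Export failed for " + request.getResource().getPath(), e);
        }
    }

    private String buildExport(SlingHttpServletRequest request) {
        return request.getResource().getPath();
    }
}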

by Jörg at January 21, 2022 07:27 PM

January 05, 2022

Things on a content management system - Jörg Hoh

How to deal with the “TooManyCallsException”

Every now and then I see the question “We get the TooManyCallsException while rendering pages, and we need to increase the threshold for the number of inclusions to 5000. Is this a problem? What can we do so we don’t run into this issue at all?”

Before I answer this question, I want to explain the background of this setting, why it was introduced and when such a “Call” is made.

Sling rendering is based on servlets; and while a single servlet can handle the rendering of the complete response body, that is not that common in AEM. AEM pages normally consist of a variety of different components, which internally can consist of distinct subcomponents as well; this depends on the design approach the development team has chosen.
(It should be mentioned that all JSPs and all HTL scripts are compiled into regular Java servlets.)

That means that the rendering process can be considered a tree of servlets, with servlets calling other servlets (and the DefaultGetServlet being the root of such a tree when rendering pages). This tree is structured along the resource tree of the page, but it can include servlets which render content from different areas of the repository, for example when dealing with content fragments or when including images whose metadata needs to be respected.

It is possible to turn this tree into a cyclic graph, which means that the process of traversing this tree of servlets turns into an endless recursion. In that case request processing will never terminate, the Jetty thread pool will quickly fill up to its limit, and the system will become unavailable. To avoid this situation only a limited number of servlet calls per request is allowed. And that’s this magic number of 1000 allowed calls (which is configured in the Sling Main Servlet).

Knowing this, let me try to answer the question “Is it safe to increase this value of 1000 to 5000?“. Yes, it is safe. In case your page rendering process goes recursive it just terminates later, which slightly increases the risk of your AEM instance becoming unavailable.


“Are there any drawbacks? Why is the default 1000 and not 5000 (or 10000 or any higher value)?” From experience 1000 is sufficient for the majority of applications. It might be too low for applications where the components are designed in a very granular way, which in turn requires a lot of servlet calls to properly render a page.
And every servlet call comes with a small overhead (mostly for running the component-level filters); and even if this overhead is just 100 microseconds, 1000 invocations are 100 ms just for the invocation overhead. That means you should find a good balance between a clean application modularization and the runtime performance overhead of it.

Which leads to the next question: “What are the problematic calls we should think of?“. Good one.
From a high-level view of AEM page rendering, you cannot avoid the servlet calls which render the components. That means that you as an AEM application developer cannot influence the overall page rendering process; you can only try to optimise the rendering of individual (custom) components.
To optimise these, you should be aware that the following things trigger the invocation of a servlet during page rendering:

  • the <cq:include>, <sling:include> and <sling:forward> JSP tags
  • the data-sly-include statement of HTL
  • and every method which invokes directly or indirectly the service() method of a servlet.

A good way to check this for some pages is the “Recent requests” functionality of the OSGI Webconsole.

by Jörg at January 05, 2022 04:28 PM

December 01, 2021

Things on a content management system - Jörg Hoh

The web, an eventually consistent system

For many large websites, CDNs are the foundation for delivering content quickly to their customers around the world. The ability of CDNs to cache responses close to consumers also allows these sites to operate on a much smaller hardware footprint than if they operated without a CDN and delivered all content through their own systems. But this comes at a cost: your CDN may now deliver content that is out of sync with your origin, because you changed the content on your origin system and that change does not propagate in an atomic fashion. This is the same “atomic” as in the ACID principle of database implementations.
This is a conscious decision, and it is caused primarily by the CAP theorem. It states that in a distributed data storage system, you can only achieve 2 of these 3 guarantees:

  • Consistency
  • Availability
  • Partition tolerance

And in the case of a CDN (which is a highly distributed data storage system), its developers usually opt for availability and partition tolerance over consistency. That is, they accept delivering content that is out of date because the originating system has already updated it.

The HTTP protocol has features built in which help to mitigate this problem at least partially. Check out the latest RFC draft on HTTP caching, it is a really good read. The main feature is called “TTL” (time-to-live) and means that the CDN delivers a version of the content only for a configured time; afterwards the CDN fetches a new version from the origin system. The technical term for this is “eventually consistent”, because at that point the state of the system with respect to that content is consistent again.

This is the approach all CDNs support, and it works very reliably, but only if you accept that content changed on the origin system will reach your consumers with this delay. The delay is usually set to a period of time that is empirically determined by the website operators, trying to balance the need to deliver fresh content (which requires a very low or no TTL) with the number of requests that the CDN can answer instead of the origin system (for which the TTL should be as high as possible). Usually it is in the range of a few minutes.

(Even if you don’t use a CDN for your origin systems, you need these caching instructions, otherwise browsers will make assumptions and cache the requested files on their own. Browsing the web without caching is slow, even on very fast connections. Not to mention what happens when using a mobile device over a slow 3G line … Eventual consistency is an issue you can’t avoid when working on the web.)

Caching is an issue you will always have to deal with when creating web presences. Try to cache as much as possible without neglecting the need to refresh or update content at a random time.

You need to constantly address eventual consistency. Atomic changes (that means changes are immediately available to all consumers) are possible, but they come at a price. You can’t use CDNs for this content; you must deliver it all directly from your origin system. In this case, you need to design your origin system so that it can function without eventual consistency at all (and that’s built into many systems). Not to mention the additional load it will have to handle.

And for this reason I would always recommend not relying on atomic updates or consistency across your web presence. Always factor in eventual consistency in the delivery of your content. And in most cases even business requirements where “immediate updates” are required can be solved with a TTL of 1 minute. Still not “immediate”, but good enough in 99% of all cases. For the remaining 1% where consistency is mandatory (e.g. real-time stock trading) you need to find a different solution. And I am not sure if the web is always the right technology then.

And as an afterthought regarding TTL: Of course many CDNs offer you the chance to actively invalidate the content, but it often comes with a price. In many cases you can invalidate only single files. Often it is not an immediate action, but takes seconds up to many minutes. And the price is always that you have to have the capacity to handle the load when the CDN needs to refetch a larger chunk of content from your origin system.

by Jörg at December 01, 2021 01:16 PM

November 01, 2021

Things on a content management system - Jörg Hoh

Understanding AEM request processing using the OSGI “Recent Request” console

During some recent work on performance improvements in request processing I used a tool which has been part of AEM for a very long time; I cannot recall a time when it was NOT there. It’s very simple, but nevertheless powerful, and it can help you to understand the processing of requests in AEM much better.

I am talking about the “Recent Requests Console” in the OSGI webconsole, which is a gem in the “AEM performance tuning” toolbox.

In this blog post I use this tool to explain the details of the request rendering process of AEM. You can find the detailed description of this process in the pages linked from this page (Sling documentation).

Screenshot “Recent requests”

With this Recent Requests screen (go to /system/console/requests) you can drill down into the rendering process of the last 20 requests handled by this AEM instance; these are listed at the top of the screen. Be aware that if you have a lot of concurrent requests you might often miss the request you are looking for, so if you really rely on it, you should increase the number of requests which are retained. This can be done via the OSGi configuration of the Sling Main Servlet.

When you have opened a request, you will see a huge number of single log entries. Each log entry contains as its first element a timestamp (in microseconds, 1000 microseconds = 1 millisecond) relative to the start of the request. With this information you can easily calculate how much time passed between 2 entries.

And each request has a typical structure, so let’s go through it using the AEM Start page (/aem/start.html). So just use a different browser window and request that page. Then check back on the “Recent requests console” and select the “start.html”.
In the following I will go through the lines, starting from the top.

      0 TIMER_START{Request Processing}
      1 COMMENT timer_end format is {<elapsed microseconds>,<timer name>} <optional message>
     13 LOG Method=GET, PathInfo=null
     17 TIMER_START{handleSecurity}
   2599 TIMER_END{2577,handleSecurity} authenticator org.apache.sling.auth.core.impl.SlingAuthenticator@5838b613 returns true

This is a standard header for each request. We can see here that the authentication (handleSecurity) took 2577 microseconds.

   2981 TIMER_START{ResourceResolution}
   4915 TIMER_END{1932,ResourceResolution} URI=/aem/start.html resolves to Resource=JcrNodeResource, type=granite/ui/components/shell/page, superType=null, path=/libs/granite/ui/content/shell/start
   4922 LOG Resource Path Info: SlingRequestPathInfo: path='granite/ui/components/shell/page', selectorString='null', extension='html', suffix='null'

Here we see the 2 log lines for the resource resolution process. It took 1932 microseconds to map the request “/aem/start.html” to the resourcetype “granite/ui/components/shell/page”, with the resource path being /libs/granite/ui/content/shell/start. Additionally we see information about the selector, extension and suffix elements.

   4923 TIMER_START{ServletResolution}
   4925 TIMER_START{resolveServlet(/libs/granite/ui/content/shell/start)}
   4941 TIMER_END{14,resolveServlet(/libs/granite/ui/content/shell/start)} Using servlet BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)
   4945 TIMER_END{21,ServletResolution} URI=/aem/start.html handled by Servlet=BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)

That’s a nested servlet resolution, which takes 14 and 21 microseconds respectively. Up to now that’s mostly standard and hard to influence performance-wise, but it already gives you a lot of information, especially regarding the resourcetype which is managing the complete response processing.

   4948 LOG Applying Requestfilters
   4952 LOG Calling filter: com.adobe.granite.resourceresolverhelper.impl.ResourceResolverHelperImpl
   4958 LOG Calling filter: org.apache.sling.security.impl.ContentDispositionFilter
   4961 LOG Calling filter: com.adobe.granite.csrf.impl.CSRFFilter
   4966 LOG Calling filter: org.apache.sling.i18n.impl.I18NFilter
   4970 LOG Calling filter: com.adobe.granite.httpcache.impl.InnerCacheFilter
   4979 LOG Calling filter: org.apache.sling.rewriter.impl.RewriterFilter
   4982 LOG Calling filter: com.adobe.cq.history.impl.HistoryRequestFilter
   7870 LOG Calling filter: com.day.cq.wcm.core.impl.WCMRequestFilter
   7908 LOG Calling filter: com.adobe.cq.wcm.core.components.internal.servlets.CoreFormHandlingServlet
   7912 LOG Calling filter: com.adobe.granite.optout.impl.OptOutFilter
   7921 LOG Calling filter: com.day.cq.wcm.foundation.forms.impl.FormsHandlingServlet
   7932 LOG Calling filter: com.day.cq.dam.core.impl.servlet.DisableLegacyServletFilter
   7935 LOG Calling filter: org.apache.sling.engine.impl.debug.RequestProgressTrackerLogFilter
   7938 LOG Calling filter: com.day.cq.wcm.mobile.core.impl.redirect.RedirectFilter
   7940 LOG Calling filter: com.day.cq.wcm.core.impl.AuthoringUIModeServiceImpl
   8185 LOG Calling filter: com.adobe.granite.rest.assets.impl.AssetContentDispositionFilter
   8201 LOG Calling filter: com.adobe.granite.requests.logging.impl.RequestLoggerImpl
   8212 LOG Calling filter: com.adobe.granite.rest.impl.servlet.ApiResourceFilter
   8302 LOG Calling filter: com.day.cq.dam.core.impl.servlet.ActivityRecordHandler
   8321 LOG Calling filter: com.day.cq.wcm.core.impl.warp.TimeWarpFilter
   8328 LOG Calling filter: com.day.cq.dam.core.impl.assetlinkshare.AdhocAssetShareAuthHandler

These are all request-level filters, which are executed just once per request.

And now the interesting part starts: the rendering of the page itself. The building blocks are called “components” (that term is probably familiar to you) and it always follows the same pattern:

  • Calling Component Filters
  • Executing the Component
  • Return from the Component Filters (in reverse order of the calling)

This pattern can be clearly seen in the output, but most often it is more complicated because many components include other components, and so you end up in a tree of components being rendered.

As an example for the straightforward case we can take the “head” component of the page:

  25849 LOG Including resource MergedResource [path=/mnt/overlay/granite/ui/content/globalhead/experiencelog, resources=[/libs/granite/ui/content/globalhead/experiencelog]] (SlingRequestPathInfo: path='/mnt/overlay/granite/ui/content/globalhead/experiencelog', selectorString='null', extension='html', suffix='null')
  25892 TIMER_START{resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)}
  25934 TIMER_END{40,resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)} Using servlet BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)
  25939 LOG Applying Includefilters
  25943 LOG Calling filter: com.adobe.granite.csrf.impl.CSRFFilter
  25951 LOG Calling filter: com.day.cq.personalization.impl.TargetComponentFilter
  25955 LOG Calling filter: com.day.cq.wcm.core.impl.page.PageLockFilter
  25959 LOG Calling filter: com.day.cq.wcm.core.impl.WCMComponentFilter
  26885 LOG Calling filter: com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter
  26893 LOG Calling filter: com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter
  26896 LOG Calling filter: com.day.cq.wcm.core.impl.WCMDebugFilter
  26899 LOG Calling filter: com.day.cq.wcm.core.impl.WCMDeveloperModeFilter
  28125 TIMER_START{BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)#1}
  46702 TIMER_END{18576,BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)#1}
  46734 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDeveloperModeFilter, inner=18624, total=19806, outer=1182
  46742 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDebugFilter, inner=19806, total=19810, outer=4
  46749 LOG Filter timing: filter=com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter, inner=19810, total=19816, outer=6
  46756 LOG Filter timing: filter=com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter, inner=19816, total=19830, outer=14
  46761 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMComponentFilter, inner=19830, total=20750, outer=920
  46767 LOG Filter timing: filter=com.day.cq.wcm.core.impl.page.PageLockFilter, inner=20750, total=20754, outer=4
  46772 LOG Filter timing: filter=com.day.cq.personalization.impl.TargetComponentFilter, inner=20754, total=20758, outer=4

At the top you see the LOG statement “Including resource …”, which tells you which resource is rendered, including additional information like selector, extension and suffix.

As the next statement we have the resolution of the render script which is used to render this resource, plus the time it took (40 microseconds).

Then we have the invocation of all component filters, the execution of the render script itself, which is using a TIMER to record start time, end time and duration (18576 microseconds), and the unwinding of the component filters.

If you use a recent version of the SDK for AEM as a Cloud Service, all timestamps are in microseconds, but in AEM 6.5 and older the durations measured for the filters (inner=…, outer=…) were printed in milliseconds (which is an inconsistency I just fixed recently).

If a component includes another component, it looks like this:

8350 LOG Applying Componentfilters
   8358 LOG Calling filter: com.day.cq.personalization.impl.TargetComponentFilter
   8361 LOG Calling filter: com.day.cq.wcm.core.impl.page.PageLockFilter
   8365 LOG Calling filter: com.day.cq.wcm.core.impl.WCMComponentFilter
   8697 LOG Calling filter: com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter
   8703 LOG Calling filter: com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter
   8733 LOG Calling filter: com.day.cq.wcm.core.impl.WCMDebugFilter
   8750 TIMER_START{BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)#0}
  25849 LOG Including resource MergedResource [path=/mnt/overlay/granite/ui/content/globalhead/experiencelog, resources=[/libs/granite/ui/content/globalhead/experiencelog]] (SlingRequestPathInfo: path='/mnt/overlay/granite/ui/content/globalhead/experiencelog', selectorString='null', extension='html', suffix='null')
  25892 TIMER_START{resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)}
  25934 TIMER_END{40,resolveServlet(/mnt/overlay/granite/ui/content/globalhead/experiencelog)} Using servlet BundledScriptServlet (/libs/cq/experiencelog/components/head/head.jsp)
  25939 LOG Applying Includefilters
[...]
148489 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDeveloperModeFilter, inner=1698, total=1712, outer=14
 148500 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMDebugFilter, inner=1712, total=1717, outer=5
 148509 LOG Filter timing: filter=com.adobe.granite.metrics.knownerrors.impl.ErrorLoggingComponentFilter, inner=1717, total=1722, outer=5
 148519 LOG Filter timing: filter=com.day.cq.wcm.core.impl.monitoring.PageComponentRequestFilter, inner=1722, total=1735, outer=13
 148527 LOG Filter timing: filter=com.day.cq.wcm.core.impl.WCMComponentFilter, inner=1735, total=2144, outer=409
 148534 LOG Filter timing: filter=com.day.cq.wcm.core.impl.page.PageLockFilter, inner=2144, total=2150, outer=6
 148543 LOG Filter timing: filter=com.day.cq.personalization.impl.TargetComponentFilter, inner=2150, total=2154, outer=4
 148832 TIMER_END{140080,BundledScriptServlet (/libs/granite/ui/components/shell/page/page.jsp)#0}

You see the component filters, but then after the TIMER_START line for the page.jsp (check the trailing timer number: #0, every timer has a unique ID!) you see the inclusion of a new resource. For this again the render script is resolved, and instead of the ComponentFilters the IncludeFilters are called, but in the majority of cases the list of filters is identical. Depending on the resource structure and the scripts, the rendering tree can get really deep. But eventually you can see that the rendering of the page.jsp is completed; you can easily find it by looking for the respective timer ID.

Equipped with this knowledge you can now easily dig into the page rendering process and see which resources and resource types are part of the rendering of a page. And if you are interested in the bottlenecks of the page rendering process, you can check the TIMER_END lines, which include both the rendering script and the time in microseconds it took to render it (be aware that this time also includes the time to render all scripts invoked from this render script).

But the really cool part is that this is extensible. Via the RequestProgressTracker you can easily write your own LOG statements, start timers etc. So if you want to debug requests to better understand the timing, you can easily use something like this:

slingRequest.getRequestProgressTracker().log("Checkpoint A");

And then you can find this log message in this screen when this component is rendered. You can use it to output useful (debugging) information or just use its timestamp to identify performance problems. This can be superior to normal logging (to a logfile), because you can leave these statements in production code and they won’t pollute the log files. You just need access to the OSGi webconsole, search for the request you are interested in, and check the rendering process.
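The same API also offers named timers, which show up as TIMER_START/TIMER_END entries; a small sketch (the timer name is made up for this example):

RequestProgressTracker tracker = slingRequest.getRequestProgressTracker();
tracker.startTimer("myComponent-expensiveLookup");
// ... the expensive part of the component logic ...
tracker.logTimer("myComponent-expensiveLookup");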

And if you are interested, you can also get all entries shown in this screen programmatically and do whatever you like with them. For example you can write a (request-level) filter which first calls the next filter, and afterwards logs all entries of the RequestProgressTracker to the logfile if the request processing took more than 1 second.
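A minimal sketch of such a filter (the threshold, logger usage and class name are illustrative):

import java.io.IOException;
import java.util.Iterator;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import org.apache.sling.api.SlingHttpServletRequest;
import org.osgi.service.component.annotations.Component;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Component(service = Filter.class, property = {"sling.filter.scope=REQUEST"})
public class SlowRequestLoggingFilter implements Filter {

    private static final Logger LOG = LoggerFactory.getLogger(SlowRequestLoggingFilter.class);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis();
        chain.doFilter(request, response);
        long duration = System.currentTimeMillis() - start;
        if (duration > 1000 && request instanceof SlingHttpServletRequest) {
            // dump all RequestProgressTracker entries for this slow request
            Iterator<String> messages =
                    ((SlingHttpServletRequest) request).getRequestProgressTracker().getMessages();
            while (messages.hasNext()) {
                LOG.info("slow request: {}", messages.next());
            }
        }
    }

    @Override
    public void init(FilterConfig filterConfig) {
        // nothing to do
    }

    @Override
    public void destroy() {
        // nothing to do
    }
}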

The RequestProgressTracker plus the “Recent Requests” screen of the OSGi webconsole are a really cool combination: they help you to understand the inner workings of the Sling request processing, and they are also a huge help to analyze and understand its performance.

I hope that this technical deep dive into the Sling page rendering process was helpful for you, and that you are able to spot many interesting aspects of an AEM system just by using this tool. If you have questions, please leave me a comment below.

by Jörg at November 01, 2021 03:22 PM

October 25, 2021

CQ5 Blog - Inside Solutions

Cloud Manager: Deploy and Operate AEM Cloud Service

Cloud Manager: Deploy and Operate AEM Cloud Service

Cloud Manager is an integral part of Adobe’s AEM as a Cloud Service (AEMaaCS) offering. 

Cloud Manager provides a fully-featured Continuous Integration / Continuous Delivery (CI/CD) pipeline, enabling organisations to build, test, and deploy their AEM applications to the Adobe Cloud automatically. 

Hosting, operation, and scaling of Adobe Experience Manager are all managed by Adobe in the background, including an SLA. Maintenance of Cloud Manager and upgrading of AEM are taken care of by Adobe as well.

Cloud Manager benefits smaller projects with its extensive out-of-the-box build pipeline and stable deployments that promise zero downtime. Larger projects can free up resources in their devops and operations teams, which no longer have to focus on the intricacies of deploying and hosting AEM. 

Lastly, overall system performance, stability and availability are improved since no one will know how to build and host Adobe Experience Manager better than Adobe.

Overall, Cloud Manager is a great cost and time saver due to a lot of functionality which is provided and maintained by Adobe. 

We will explore and highlight the main functionalities so that you understand the tool and the reasons why we, at One Inside, think it’s so great.

What is Adobe Cloud Manager?

Adobe Cloud Manager allows self-managed deployments and operation of AEM Cloud Service.

It consists of a CI/CD pipeline, various environments, code repositories and further information about the system like logs or SLA reports.

Log in to Adobe Cloud Manager

To log in to Cloud Manager, go to experience.adobe.com (Experience Manager / Launch Cloud Manager).

If you do not have access, either your company does not yet have the AEM Cloud Service licenses or your account is lacking the required permissions.

You can find the most important information about the environments and pipelines on the Startpage and have access to more detailed information. 

Cloud Manager Core Features

Cloud Manager has the following main features:

  • Self-Service web interface for the deployments and AEM operation
  • Cloud Manager functionality can also be accessed programmatically via API
  • Fully automated and configurable CI/CD Pipeline
  • Provisioning and configuration of productive and test environments
  • Adobe hosted git repositories
  • Automated quality assurance of the application (code quality, security and performance)
  • Autoscaling of both AEM author as well as publish Instances
  • Multitier Caching Architecture including global Akamai Cache

Benefits and Disadvantages of Adobe Cloud Manager

These outlined core features result in a great set of benefits when using Cloud Manager:

  • Performance – Great performance can be expected thanks to the global CDN and the possibility to run the AEM servers in one of the globally distributed Azure datacenters (support for AWS is on the roadmap). AEM hosting by Adobe helps guarantee optimal AEM performance and continuous improvements.
  • Autoscaling – When subjected to unusually high load, Cloud Manager detects the need for additional resources and automatically brings additional instances online via autoscaling. This works for both authoring and publishing instances.
  • Confidence in deployments – since the same pipeline is executed by all AEMaaCS customers, Adobe can optimise the reliability of the pipeline and deployments. After ten successful deployments, the customer can usually independently carry out deployments without involving Adobe Customer Service at all.
  • Extensibility of the pipeline – Cloud Manager is integrated into the Experience Cloud APIs and is therefore easy to connect or integrate with other or custom services.
  • Backup – Cloud Manager automatically creates a backup before every release. If any issue is noticed after the deployment, the release can be rolled back with the press of a button. The production instances are backed up as well (24h point-in-time recovery, up to 7 days with Adobe-defined timestamps).
  • “Zero” Downtime – Adobe has a lot of experience in hosting AEM for its large customer base. This allows Adobe to achieve great availability and you can expect basically zero downtime. Need proof? Adobe’s SLA of 99.9%.
  • Very low initial setup time – Basically a “1 click setup” for environments and the default pipeline. Certificates and domains are also set up quickly via the UI.
  • Very low maintenance and operation costs – Adobe takes care of maintaining the pipeline, upgrading AEM, providing security fixes for the OS, and operating all systems (Cloud Manager, Apache / Dispatcher, AEM instances, Akamai CDN etc).
  • Always up to date AEM – Adobe releases new versions almost weekly or even more frequently if there is a very urgent security fix. The moment the new features or fixes are available you will see them on your AEM instances! Your security team will be very happy to hear that.
    For on-premises versions, new features will only be available approximately 6 months after their release (except security fixes which usually come with service packs).

As always there are some drawbacks, but the benefits far outweigh them and there are ways to work around them:

  • Less flexibility – The pipeline and architecture is, to a certain degree, predefined. For example, it’s no longer possible to install additional OS level applications or use a different caching solution like Varnish. The Adobe I/O runtime or an external environment has to be used to provide additional services instead.
  • Limited customisability of AEM – It’s no longer possible to extend AEM freely. Some customisation is still possible, especially if the developers get creative, but not everything. Since this will be a win for maintainability, this could almost be regarded as an advantage.
  • Less control – Since Adobe takes responsibility for running the services and provides an SLA there are certain limitations.
    For example, it’s not possible to log in on the publisher’s website, no admin password is available, and Felix Console access is blocked. Especially the last one is a concern for AEM developers and hinders the ability to debug issues on productive systems.
    These issues are somewhat alleviated since Cloud Manager allows you to extract certain information like log files. On the authoring instance, some tools are still accessible (e.g. /crx/de). Cloud Manager also provides additional utilities, like viewing the bundle status (those will probably be expanded on in the future).

Release a new version of AEM in the cloud (CI/CD pipeline)

The Cloud Manager CI/CD pipeline brings the code from the repository to a built application on your productive Adobe Experience Manager environment.

There are two types of environments; each environment consists of the full AEM stack (author, publish, dispatcher). The “Production” type is a single pair consisting of a “Stage” and a “Prod” environment. 

Every “Production” deployment first goes to “Stage” where it is analysed and can be further inspected manually before being approved and deployed to “Production”.

All other environments are referred to as “Non-Production” and are used as test environments. New test environments can be provisioned as needed.

The pipeline runs are shown in the UI and additional logs of each execution step can be downloaded to debug any issues. There are several steps in the pipeline explained as follows:

1 – Code in Adobe Git

Cloud Manager allows git repositories hosted by Adobe. The pipeline can only fetch code from those repositories. The pipeline can be triggered on commit on a certain branch or triggered manually.

To use an external non-Adobe repository, the changes have to be synchronised with the Adobe repository (this can easily be automated with various CI/CD tools like Github Actions, Bitbucket Pipelines or Jenkins).

2 – Build Code and Unit Tests

The project is built by executing the Maven build, including executing the unit tests. The result is the “Release” build.

3 – Code Scanning

This inspects the whole code base and applies static code analysis.

There are several rule sets for different topics like test coverage, potential security issues or maintainability in the context of AEM. Each topic is rated and a recommendation is given by Adobe. 

These recommendations are set up quite reasonably at 50% coverage. The goal of each project should be to reach those numbers.

4 – Deploy to Stage

The code is now deployed to the stage environment. Internally, Cloud Manager creates a copy of the whole stage environment before deploying it. If there is any issue or failure with the deployment, the Stage can be reverted to the previous state.

5 – Stage Testing: Security Tests, Performance & Load Tests, UI Tests

Various tests are executed by default by Adobe to test if AEM itself is still working as expected and some tools to measure the performance of the website. Additionally, custom tests can be added to further test the integration of the application into AEM or the website itself.

Deploy to Production

If not disabled, the pipeline halts at this point before deploying to Production. This allows us to inspect the performance tests and code audit test results.

Any further manual testing can now be done on Stage. If everything looks good, the build can be approved and will be deployed to Production. Otherwise, the build is cancelled and reverted.

Testing with Cloud Manager

Four different categories of tests are executed in the pipeline.

Unit Tests

These are executed before the deployment step. They test the application on a code level and in isolation.

Since unit tests are pretty much industry standard, there is mainly one interesting question: how high should the coverage be?

There are different viewpoints on this topic. 

Adobe is quite defensive, or realistic, with the expectation of 50%. From our perspective, for web-based projects and especially content-focused logic, the integration tests discussed afterwards provide a lot of value. 

For each project it has to be decided how much time is spent on each type of test.

Code Scanning

This is executed before the deployment step. It inspects the code itself by doing static analysis. It gives various metrics indicating the quality of the code base by a set of code quality rules defined by Adobe.

Internally, SonarQube is used to analyse the code. Additionally, OakPAL scans the built content package to catch various potential issues which could break the deployment.

There are three categories of criticality:

  • Critical: Pipeline stops immediately
  • Important: Pipeline pauses, can be manually continued if the issue is not urgent for the current release and fixed later
  • Info: Purely informational

There are several types of ratings, each with different failure thresholds (check the code quality rules documentation for details).

Over 100 SonarQube rules are applied. If a specific issue is a false positive and should be ignored, an Excel file from the link above can be downloaded to look up the rule key. 

The key can then be used in the Java code to make sure SonarQube will skip the warning. An example for "Credentials should not be hard-coded" is "squid:S2068". 

In the Java code, add the annotation: @SuppressWarnings("squid:S2068")
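Purely as an illustration (the class and constant names are made up), the suppression then looks like this in context:

public class ConnectionDefaults {

    // false positive: this is only the name of a configuration property, not a credential
    @SuppressWarnings("squid:S2068")
    private static final String PASSWORD_PROPERTY = "connection.password";
}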

Experience Audit (Performance Testing)

This executes the well known Google Lighthouse Tool, the same that is available in Chrome Dev Tools. 

It indicates changes compared to the last release. It’s also possible to inspect and download the full Lighthouse report. 

This is a great feature to have, especially because the tests are executed on every run and in an isolated, repeatable environment. 

What is missing is a view of the audit results over time; that would be really helpful to track performance.

Product Functional Testing, Custom Functional Testing, Custom UI Testing

There are several integration tests.

Adobe provides a set of tests to verify the basic functionality of AEM, for example, if content can still be replicated from Author to Publish. Adobe might add additional tests in the future as well – who doesn’t like free integration tests!

In addition, custom tests can be written to further verify the functionality of AEM. This is especially useful if there are custom AEM modifications.

UI Tests are intended to test the website itself on the publisher instance via dispatcher. The idea is to provide test content with the code which will be deployed to the stage and then execute the integration tests in various browsers to verify functionality. 

There is a default setup in the Maven Archetype for Integration tests based on webdriver.io and Selenium. 

Docker is used to build and execute the integration tests in the cloud. It’s possible to modify those to adjust the test setup. Important to note is that UI tests are disabled by default.

Follow the documentation for “Customer Opt-In” to understand how to enable them.

Team and Roles

There is a set of predefined roles by Adobe, each with a corresponding permission profile and access restrictions defining who is allowed to run or modify the pipeline and other features. 

Most of those roles probably match the existing roles of a project.

The most important ones are Business Owner, Deployment Manager, Program Manager, Developer. Content Authors do not have to interact with Cloud Manager. Permissions for Authors are set up in AEM itself.

We believe it’s not necessary to use this many roles for most projects; if you can trust your developers, it’s probably enough if the lead is “Business Owner” and developers are “Deployment Managers“. 

Have a look at the user permission table to decide what makes sense for your project.

A notable role that is missing is “DevOps” or “Operations“. Deployment Manager is the role that comes closest to a DevOps person, since it is allowed to edit the pipeline; however, any experienced developer should be able to configure Cloud Manager.

If there are integrations of Cloud Manager planned, a person with DevOps experience might become helpful.

Integrate Cloud Manager programmatically in your current Solution (Advanced topic)

Adobe is aware that every customer has its unique application landscape. There are various ways to integrate AEM Cloud Service and Cloud Manager.

Adobe Cloud Manager API

All capabilities available in the UI can also be programmatically accessed with the Adobe Cloud Manager API.

This allows you to integrate the AEM Cloud Service pipeline into an existing custom CI/CD infrastructure and also to enhance the pipeline with additional custom features.

Webhooks are also supported by the API, which is a great way to integrate with other services.

Some example use cases are triggering of the pipeline from an external action, monitoring and notification (e.g. Slack channel) of the pipeline runs, externally executing additional tests, or adding actions after the deployment like clearing of the cache.

Identity Management System (IMS) integration

Provisioning and access control for Cloud Manager and AEM can be handled manually, but also via integration of an external IMS.

Synchronising user accounts with group permissions is supported to automate provisioning. SAML with an external IDP is supported to enable Single Sign On.

Firewall

By default, the AEM Cloud Service instances do not have access to external systems due to security reasons. Simple IP whitelisting can be configured directly in Cloud Manager. 

For anything more complex, a solution can be worked out with the Adobe Cloud Manager engineers.

Forwarding Splunk logs

AEM Cloud Service internally uses Splunk to aggregate the logs. 

Forwarding of the Splunk logs to a custom Splunk instance can be requested via support. This is a great way to extract as much information from the system as possible.

Conclusion

As we can see, Adobe Cloud Manager provides out of the box enterprise-grade CI/CD and hosting of AEM applications in the cloud. 

There is a big cost-saving potential for both the initial setup as well as maintenance, thanks to the simple configuration in the web UI and all the operation efforts being taken care of by Adobe.

Combined with the powerful capability of AEM itself and the Adobe Experience Cloud as a whole, AEM Cloud Service is the best cloud-native CMS offering on the market.

This article is part of a series of content about AEM Cloud Service, where we explain how to move to AEM Cloud.

Learn how to design an AEM website with Core Components. Finally, once your website is live, start optimising it and improving the customer experience.

Basil Kohler

Basil Kohler

AEM Architect

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about AEM Cloud Service.

The post Cloud Manager: Deploy and Operate AEM Cloud Service appeared first on One Inside.

by Samuel Schmitt at October 25, 2021 08:31 AM

October 17, 2021

Things on a content management system - Jörg Hoh

AEM micro-optimization (part 4) – define allowed templates

This time I want to discuss a different type of micro-optimization. It’s not something you as a developer can implement in your code, but rather a question of the application design, which has some surprising impact. I came across it when I recently investigated poor performance in the Siteadmin navigation. And although I did this investigation in AEM as a Cloud Service, the logic in AEM 6.5 behaves the same way.

When you click through your pages in the siteadmin navigation, AEM collects a lot of information about pages and folders to display them in the proper context. For example, when you click on a page with child pages, it collects information about which actions should be displayed if a specific child node is selected (copy, paste, publish, …).

An important piece of information is whether the “Create page” action should be made available. And that’s the thing I want to outline in this article.

Screenshot: “Create” dialog

Assuming that you have the required write permissions on that folder, the most important check is whether any templates are allowed to be created as children of the current page. The logic is described in the documentation and is quite complex.

In short:

  • On the content, the template must be allowed (via the cq:allowedTemplates property, if present) AND
  • The template must be allowed to be used as a child page of the current page

Both conditions must be met for a template to be eligible as a source for a new page. To display the entry “Page” it’s sufficient if at least one template is allowed.

Now let’s think about the runtime performance of this check, and that’s mostly determined by the total number of templates in the system. AEM determines all templates by this JCR query:

//jcr:content/element(*,cq:Template)

And that query returns 92 results on my local SDK instance with WKND installed. If we look a bit more closely at the results, we can distinguish 3 different types of templates:

  • Static templates
  • Editable templates
  • Content Fragment models

So depending on your use-case it’s easy to end up with hundreds of templates, and not all of them are applicable at the location you are currently in. In fact, typically just very few templates can be used to create a page here. That means that the check most likely needs to iterate a lot to eventually encounter a template which is a match.
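If you want to see how many templates your own instance returns, a quick sketch using the JCR query API could look like this (the XPath query language is deprecated in JCR 2.0 but still supported by Oak; the class name is made up):

import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;

public class TemplateCountExample {

    // Runs the same XPath query AEM uses to collect all templates and counts the results.
    public static long countTemplates(Session session) throws RepositoryException {
        Query query = session.getWorkspace().getQueryManager()
                .createQuery("//jcr:content/element(*,cq:Template)", Query.XPATH);
        NodeIterator nodes = query.execute().getNodes();
        long count = 0;
        while (nodes.hasNext()) {
            nodes.nextNode();
            count++;
        }
        return count;
    }
}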

Let’s come back to the evaluation of whether that entry should be displayed. If you have defined the cq:allowedTemplates property on the page or its ancestors, it’s sufficient to check the templates listed there. Typically it’s just a handful of templates, and it’s very likely that you find a “hit” early on, which immediately terminates this check with a positive result. I want to explicitly mention that not every template listed there can be created here, because there are also other constraints (e.g. the parent template must be of a certain type) which must match.

If template A is allowed to be used below /content/wknd/en, then we just need to check this single template A to get that hit. We don’t care where it is in the list of templates returned by the above query, because we know exactly which one(s) to look at.

If that property is not present, AEM needs to go through all templates and check the conditions for each and every one until it finds a positive result. And the list of templates is identical to the order in which they are returned by the JCR query, which means the order is not deterministic. Also it is not possible to order the result in a helpful way, because the semantics of our check (which include regular expressions) cannot be expressed as part of the JCR query.

So you are very lucky if the JCR query returns a matching template already at position 1 of the list, but that’s very unlikely. Typically you need to iterate tens of templates to get a hit.

So, what’s the impact of this iteration and the checks on performance? In a synthetic check with 200 templates and no match at all, it took around 3-5 ms to iterate over and check all of the results.

You might ask, “I really don’t feel a 3-5 ms delay”, but when the list view in siteadmin performs this check for up to 40 pages in a single request, it quickly adds up to a 120-200 millisecond difference. And that is a significant delay for requests where bad performance is visible immediately. Especially since there’s a simple way to mitigate this.

And for that reason I recommend providing “cq:allowedTemplates” properties in your content structure. In many cases this is possible, and it will speed up the siteadmin navigation performance.
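As a minimal sketch, this is how the property could be set programmatically via the JCR API; the paths and template names are made up for illustration and need to be adapted to your project (the same property can of course also be maintained in CRXDE Lite or shipped as part of a content package):

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class AllowedTemplatesExample {

    // Sets cq:allowedTemplates on the jcr:content node of a site root, so that the
    // siteadmin check only needs to look at these few entries instead of iterating
    // over all templates returned by the JCR query.
    public static void restrictTemplates(Session session) throws RepositoryException {
        Node siteRoot = session.getNode("/content/wknd/en/jcr:content");
        siteRoot.setProperty("cq:allowedTemplates", new String[] {
                "/conf/wknd/settings/wcm/templates/content-page",  // a single template path
                "/conf/wknd/settings/wcm/templates/landing-.*"     // or a pattern matching several
        });
        session.save();
    }
}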

And for those who cannot change that: I am currently working on changing the logic to speed up the processing for the cases where no cq:allowedTemplates property is applicable. And if you are on AEM as a Cloud Service, you’ll get this improvement automatically.

by Jörg at October 17, 2021 02:13 PM

September 21, 2021

CQ5 Blog - Inside Solutions

5 tips to help you maintain and improve your chatbot

5 tips to help you maintain and improve your chatbot

Chatbot solutions are a flexible approach to connecting with customers in many situations. 

They simplify access to your brand for your customers. Plus, they help you gain deep insights into your customers’ needs.  

However, a chatbot is only successful when it manages to answer customer requests.   

To reach this goal, optimizations after go-live are key. 

In this post, we will explain the chatbot lifecycle and the five things you can do to improve your solution over time.  

Finally, we will explain the effort your marketing team needs to invest for your solution to be successful. 

What is the lifecycle of a chatbot? 

We divide the chatbot lifecycle into the following five phases: 

  1. From ideas to roadmap: First, you have to understand what chatbot solutions have to offer and define your vision.  
  2. Turning a roadmap into a plan:  Once your vision and strategy are defined, you will make a concrete plan to build your chatbot. 
  3. Build chatbot and conversations: Implement, create content, and design conversations and begin training your system 
  4. From training to go-live: You need to test your system, prepare the go-live, and bring your chatbot to life. 
  5. Scale and optimize the chatbot experience: Go-live is just the start for your chatbot. Now it’s time to grow and optimize! Customers are impatient – don’t plan to optimize your chatbot in the future, do it continuously, from day one.  

We will focus on the last step of the lifecycle, which is often misunderstood or put aside. 

Do I really need to optimize an AI chatbot? 

Chatbots use AI to perform.  

Why is continuous optimization required? Current systems offer some flexibility in understanding the intention of the user.  

However, if the language deviates too much from the expected (trained) phrases and/or the intention is not known, the system reaches its limits.  

Chatbot systems are not fully self-learning. You must assist them for them to improve.

The chatbot’s answers are not yet driven by AI at all. They are typically driven by hand-crafted content which is optimized for the expected/wanted user journey instead. 

The following challenges will occur for each chatbot over time – independent of how well you planned it: 

  • The user’s language is unexpected and therefore a wrong intention is assumed. 
  • Customers have questions concerning your business which you have not foreseen. 
  • You offer new services which the chatbot wasn’t trained for. 
  • Your chatbot system should get smarter with the integration of further systems, e.g., from a CRM or PIM 
  • The world we live in changes over time. This can force you to adapt to keep offering a pleasing and up-to-date experience for your customers – even though your services haven’t changed. 

So what are the concrete steps to keep your chatbot up to date? 

How do you optimize a chatbot? 

Optimization of an AI chatbot is crucial, and we have listed some activities that your marketing team will have to take care of, with some support from your IT team.

1 – Understanding more and more questions 

Likely you started with a limited basis for natural language understanding. To understand all the needs of your customers – independent of whether you answer them or not – you will have to add more and more intents over time. 

In short: You have to teach your system new phrases. 

2 – Keeping your content up to date

As you optimize the content on your website over time, the same is required for your chatbot. Change wordings, influence the user journey, or improve linked assets. 

In short: You have to optimize chatbot answers continuously. 

3 – Learn from your customers 

Your customers offer deep insights into their needs. Benefit from it.  

To do so, monitor how they interact with your system and which questions they ask – especially for topics that you do not answer with your chatbot or even your website yet. 

In short: You must monitor chatbot interactions to gain insights. 

4 – Validate the success of your chatbot 

Chatbots are awesome. They support your customers and can have a great return on investment.  

However, don’t just believe that your chatbot is successful. Track if the intended goals are reached and try to collect additional data to identify further goals.  

Set your success factors in relation to other systems and identify cross-system benefits. 

In short: You have to monitor the success of your chatbot.

5  – Move your chatbot to the next level 

Likely you restricted the list of features for your initial chatbot.

After go-live, with the first experience in conversational marketing gained, it’s absolutely the right time to introduce additional features to your chatbot.  

For example you could integrate further business information systems, try out new UX/UI concepts, or close feature gaps.  

You should also make sure you stay up to date with features offered by your competitors’ chatbots. 

In short: You have to enrich your chatbot with features to continue to benefit and stay attractive for customers. 

Who optimizes the chatbot and how often? 

For successful chatbot optimization, a proper process is required.  

Identify the regular tasks and assign responsibilities. Plan optimization reviews to ensure success in the long run. 

We described the main tasks earlier. They include: 

  • Increasing training data for existing intents
  • Adding new intents for unexpected topics
  • Improving the flow and content of answers

All these tasks are mainly content-driven. Therefore, the maintenance must be driven by the business stakeholders and their team.  

Team members must be trained to write content for chatbots and must understand the basic principles of NLU (Natural Language Understanding) to improve training data.  

As these tasks impact customers directly, it’s important to do them regularly: improve and add intents e.g. once a week, and update content monthly.  

Beside chatbot-specific maintenance tasks, other optimizations are needed as well: bug fixes and new features. 

These tasks are typically solved in collaboration with the development team. Business stakeholders know what they want, developers know how to implement. Don’t just wait until you need features but plan regular improvements.  

Lastly, ensure your system keeps its technology up to date. APIs may change and provide new features. 

In summary, optimization must be planned, responsibilities assigned, and tasks distributed to team members. 

Get the most out of your chatbot 

A chatbot is a great tool to open another channel for your customers. It’s not a replacement but an additional opportunity to offer the best service.  

It offers you deep insights into customers’ needs.  

On top of that, it forces you to rethink your processes and goals for service and marketing tasks. Take the opportunity and improve your services with a conversational channel. 

We summarized all aspects of the journey to a successful chatbot here (Whitepaper).  

From the first vision, through planning, to go-live. Find out more about chatbots and how they benefit your customers and your organization.

Clemens Blumer

Clemens Blumer

Senior Software Architect

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about Conversational Marketing and Chatbot.

The post 5 tips to help you maintain and improve your chatbot appeared first on One Inside.

by Samuel Schmitt at September 21, 2021 02:06 PM

September 14, 2021

CQ5 Blog - Inside Solutions

Designing your AEM Cloud Service Website with Core Components

Designing your AEM Cloud Service Website with Core Components

Adobe brings a set of reusable and production-ready components for its content management system, AEM. 

Their name: The Adobe Core Components.

Their purpose: Speeding up development time.

But how do you take advantage of these Core Components to deliver a website fast? 

What is the best design approach, and how should the team members work together?

If your company is using AEM or AEM Cloud Service and you are looking to fast-track the design process while following the best practices of AEM’s styling system, you came to the right place.

Today, we are going to explain how to avoid a long design phase, and too much back and forth.

We will explain the best workflow to design  user experiences and build websites at scale:

  • You’ll learn what Adobe Core Components are…
  • and how your design team and development team should collaborate!

AEM Styling process in a nutshell

Let’s go through the styling process with Adobe Experience Manager. 

The process suits a new project with AEM Cloud Service.

However, you could follow a similar design approach while working with an on-premise version of AEM.

Indeed, the basis of the styling process will remain the same for any version of AEM and leverage Adobe Core Components, a library of best practice components.

The main idea of the styling process can be summarised in three concepts:

  • Using standardised components
  • Low code, a software development approach that requires little to no coding
  • Reusability 

The standards are represented by the Core Components. More than just a library of web elements, they will drive the design of the user experience from the very beginning and help the designers, the website owner, and developers to work together on a common frame.

Low code is a key aspect of the styling process. Why always reinvent the wheel?

The Core Components offer a foundation. By leveraging them, the development effort is drastically reduced and the main task is all about styling the components. In other words, it’s focused on adapting CSS and JS. HTML won’t be adapted.

Finally, the design elements must be reusable for other websites, landing pages, or intranets.

When companies decide to invest in AEM and build a consistent user experience across many websites and channels, they must consider having proper information architecture.

A well-designed component library helps them implement websites at scale.

We will now go into detail of these concepts, explain what core components are, and what it means to build the user experience with core components in mind to get your project design up to speed.

What are the Core Components in AEM?

The Core Components are a set of standardised Web Content Management (WCM) components for Adobe Experience Manager (AEM).

They were introduced with AEM 6.3, and their aim is to speed up development time while reducing maintenance costs and ensuring better upgradability. 

The use of Core Components is the best and recommended approach to start a new project with AEM as a Cloud Service. Core Components are cloud-ready. Using them will help you deliver your new website faster.

If you have experience with building websites, you might have noticed that some elements or UI patterns are quite common. For example, we often build text and image elements, or teasers highlighting content of related pages. 

This is what the Core Components library has to offer. A list of thirty versatile components including: 

  • Title
  • Text
  • Image
  • List
  • Teaser
  • Download
  • Button
  • And more components…
Adobe Core Components

These components are built to be flexible and can be assembled to produce nearly any kind of layout.

In our experience, 80% to 90% of the usual components on a website can be implemented with a Core Component and a bit of styling. 

As you can see below, the teaser components come with four variations that cover most use cases. 

If something is missing, you can still customise a core component, extend a core component with additional functionality, or for complex scenarios, create a component from scratch (the old school way).

Adobe Core Components are open source, and you can find them on Github.

To summarise, the Core Components offer a standard approach and have many advantages:

  • Design-agnostic: data, logic and design are completely separated 
  • Stylisation: they can be styled in different ways
  • Flexibility: they offer a wide range of functionality
  • Future-proof: they guarantee compatibility with future versions of AEM

How to use the AEM Core Components

To understand the design process and the people, profiles, and roles involved in this process, it’s important to get a better understanding on how to use, extend, and customise Core Components.

We won’t go too much into the technical details, but it’s important that the anatomy of an AEM Core Component is understood.

Understanding the architecture of a Core Component

To make it simple, a core component can be split into two distinct parts: the backend part and the frontend part.

The backend contains:

  • The content model. It defines the structure of the content that can be stored in a component: for example, a teaser might consist of a title, an image, a short description text and a link to the target. 
  • The configuration of the components and the edit dialog. These elements let you define what to display, what an editor can edit, and the options they can use.
  • The logic behind the preparation of the content for frontend (also called view).

The frontend part will be in charge of generating the output in HTML:

  • A markup language (HTL) is used to bring together content from the backend and HTML elements.
  • CSS and JS are used for the styling and effects applied on the elements.

(We deliberately use layman’s terms; if you want more specific information, jump to Adobe’s official technical documentation.)

Customising a Core Component

Yes, Core Components can be customised. You can extend them to match your requirements and avoid starting custom development from scratch.

However, a word of advice. To keep all the benefits and guarantee upgrade compatibility, some best practices and customisation patterns must be followed: 

1) Never modify the code directly. Instead, extend the existing logic:

The architecture of a Core Component allows you to extend the content model, dialog, and logic of a component and allow an editor to use additional content. 

For example, you might want to add a “category” field to a certain teaser. All you have to do is extend the teaser with a text field “category“, define how the editor shall use it in the dialog, and how it shall be represented in the HTML output.
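On the backend side this is typically done with the Sling Model delegation pattern. The following is only a minimal sketch, assuming your project component uses the teaser Core Component as its sling:resourceSuperType; class and method names are illustrative, and the dialog and HTL changes are not shown:

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.Via;
import org.apache.sling.models.annotations.injectorspecific.InjectionStrategy;
import org.apache.sling.models.annotations.injectorspecific.Self;
import org.apache.sling.models.annotations.injectorspecific.ValueMapValue;
import org.apache.sling.models.annotations.via.ResourceSuperType;
import com.adobe.cq.wcm.core.components.models.Teaser;

// Sling Model for a project teaser that adds a "category" field while keeping
// the out-of-the-box Core Component logic untouched.
@Model(adaptables = SlingHttpServletRequest.class)
public class CategoryTeaser {

    @Self
    @Via(type = ResourceSuperType.class)
    private Teaser coreTeaser; // the unmodified Core Component model

    @ValueMapValue(injectionStrategy = InjectionStrategy.OPTIONAL)
    private String category;   // the additional authored field

    public String getCategory() {
        return category;
    }

    public String getTitle() {
        // everything that is not project-specific is simply delegated
        return coreTeaser.getTitle();
    }
}

The HTL script of the project component would then render the category next to the markup it inherits from the Core Component.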

2) Style the components by applying your own CSS styles: 

Core Components follow a standard naming convention inspired by Bootstrap to make it easy for an experienced frontend developer to apply the website’s branding.

You can read more about customization patterns here.

Roles in your team

The customisation patterns tell us what kind of profiles or roles you need in your website project team to guarantee smooth operation.

Basically, you will need two types of roles:

  • An AEM backend developer in charge of the configuration and extension of the backend logic of the core components.
  • A frontend developer mastering CSS and JS who could apply any look and feel to the HTML structure offered by the Core Components.

We do believe that customisation can be reduced to its minimum and even be avoided if you start designing your website with Core Components in mind. More about this later.

Managing design at scale with a flexible system​

The frontend developers will play a key role in the implementation process. Once they master the Core Components and style system, the sky will be the limit.

Just by adapting the style of the component, multiple themes could be created for various websites, microsites, landing pages, and more. 

And the beauty is: the Core Components stay the same.

You simply adapt their style and assemble them in a new manner for different websites. By setting up a versatile set of components for your digital presence together with various themes, you will be able to manage design and website at scale.

But how can this be achieved? This is what we tackle now, and outline how to design with the Core Components in mind.

AEM design workflow with Core Components

Now that the concept of Core Components is clearer, we can detail the design workflow. 

We will answer the following questions:

  • How to design a website with AEM using Core Components
  • How to design a website with AEM without compromise
  • How to design a website with AEM and go live fast 

We will cover two scenarios: one where the UI/UX of the website is not done yet, and a second where you already have the design of the website ready.

You may be doing a migration of an existing website to AEM, and therefore want to migrate your existing website first and apply changes later.

Step 1 – Map the mockup to Core Components

In our first scenario, the design of the website is not defined yet. We start with a blank page.

The two main recommendations are:

  • Plan the design based on the Core Components
  • Set up a team made of designers and AEM consultants

It’s crucial to take the Core Components into account from the beginning. We recommend you involve AEM experts from day one, as they will guide you through this process.

In other words, don’t leave the designer alone in a room and once the design is ready, hand it over to the AEM experts and developers for implementation. 

This often leads to unworkable user experiences that don’t leverage the solution, resulting in additional cost, escalations and frustration.

A common misunderstanding is that a framework restricts the design process. But in fact, it’s the opposite. Talking to an AEM expert will open new perspectives and will unleash the full potential of AEM.

Together the designer and the AEM Expert will define a mockup including the main page templates and components to use. This guarantees that you get the best AEM has to offer.

In the scenario where the design of your website is ready, for instance if you are migrating to AEM as a Cloud Service, you should start with component mapping.

An AEM expert will analyse the building blocks of your current website and map it to Core Components.

With this scenario, there might be some trade-offs:

  • Changing the current blocks on your website to map to the Core Components’ layout and features. Let’s imagine that you have a teaser with 4 CTAs while the teaser Core Component offers only 2. Here you could decide to adapt your requirements to the Core Components.
  • Or, if your requirements are not adaptable, the solution would be to create a custom component, extending the Core Component that fits your current UX and UI.

Anyway, for both scenarios, the goal is to have a mockup of the website where all elements are represented by Core Components.

Step 2 –  Design in Adobe XD

Adobe XD is a design tool. 

With Adobe XD, designers can now design based on the out-of-the-box AEM Core Components, and consider how different styles can be implemented via AEM’s style system.​ 

Adobe created a UI Kit for AEM Core Components.

By using the premade UI Kit based on AEM Core Components, unnecessary design deviations are avoided, namely the kind of deviation that requires more development effort and involves extra cost. 

The steps for the designer are the following:

  1. They will assemble the Core Components based on the mockup which will create the layout for the different pages.
  2. They will then start styling each component based on the visual identity of the website and branding guidelines.

By following this approach, the backend developer will have an easier time configuring everything in AEM. Layout structure will represent a page template in the CMS.

It’s crucial that the website design stays in sync with the Core Components.

A similar process can be done in the case of scenario 2.

Step 3 –  Configuration and style in AEM matching the mockup

While the designer adapts the look and feel of the Core Components in Adobe XD, an AEM backend developer can start the configuration of the page templates and components in AEM. 

This can be done in parallel as both work on the same basis – the defined mockup.

As soon as the design is ready and validated, a frontend developer can start working and applying the right style, CSS and JS to the core components.

Everything is bundled into AEM and ready to be deployed.

Overview of the AEM styling workflow 

To recap, here are the main steps of the design workflow with AEM and Core Components:

1 – Define a mockup based on the Core Components

2 – Create the UI and theme in Adobe XD

3 – Configure the page templates and components, then style the Core Components in AEM 

And do not forget, a critical aspect is to have a mixed team made of designers and AEM experts from the very beginning.

Core Components: AEM Cloud Service’s best companions

Even though the Adobe Core Components were created with AEM 6.3 before the release of AEM Cloud Service, they perform best when used together with the cloud version of Adobe CMS.

One of the main purposes of AEM Cloud Service is to enable fast innovation and help you focus on what matters most: building outstanding customer experiences.

With AEM Cloud Service you don’t need to care about servers, IT operations, network, security etc. anymore. 

Leveraging Core Components gives you additional benefits by speeding up the design and development phase.

Quickly assemble the building blocks to realise a mockup, and then style them with limited backend development. 

This is the best way to tackle any AEM as a Cloud Service project, and will enable you to go live fast, while guaranteeing upgradability.

Finally, build enterprise websites faster with AEM Cloud Service and Core Components

Designing for AEM with Core Components is close to what you are already doing for other projects. You leverage a framework that enables you to build something faster.

The key element is to design with the Core Components in mind and to involve an AEM expert at the very first stage of the design process. 

The expert will guide you through the process and indicate the best way to use the components to avoid potential limitations. 

This article is part of a series of content about AEM Cloud Service, where we explain how to move to AEM Cloud Service.

Samuel Schmitt

Samuel Schmitt

Digital Solution Expert

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about AEM Cloud Service.

The post Designing your AEM Cloud Service Website with Core Components appeared first on One Inside.

by Samuel Schmitt at September 14, 2021 09:33 AM

August 31, 2021

CQ5 Blog - Inside Solutions

The Marketing Technology Landscape of Swiss Insurers

The Marketing Technology Landscape of Swiss Insurers

In this study, we analysed the top 25 Swiss insurance companies.

We focused on their marketing technology stacks. In essence, what technologies are these companies using for their website, mobile application, analytics, marketing automation and more?

How do they leverage these technologies to build an appealing customer experience and engage with their clients?

The top Swiss insurers are Zurich Financial Services, Swiss Re, Swiss Life, Helvetia Insurance, Axa Group, Bâloise, Helsana, CSS, Groupe Mutuel, Suva and 15 others. 

What is a Marketing Technology Stack? 

To simplify things, we didn’t analyse every single marketing tool used by Swiss insurance companies. 

We focused on the main elements of their marketing stack, which include:

  • The website and content management system as the baseline for customer experience and content
  • Analytics to collect data about website visitor’s behaviours, as well as personalisation engines to personalise the user experience
  • The marketing automation solution chosen to connect and engage with customers via email or other channels
  • The mobile app offering services directly via smartphone
  • The chatbot to engage in a conversational way

Disclaimer

The collected data is publicly available. We used tools such as builtwith.com or Wappalyzer to gather data about the technology used. Some information was found within public case studies published by other Swiss agencies and technology vendors. 

We cannot guarantee the accuracy of all the data as we rely on third parties for it. The trends represent our own view of the market.

Now, let’s dive into the details of the marketing technology landscape of Swiss insurers.

Adobe Experience Manager is the leading CMS for Swiss Insurers

Adobe Experience Manager (AEM) is the most used Content Management System by Swiss insurers. Adobe’s CMS is the leading solution for content management systems and customer experience. 

Ranking first for several years in the Gartner Magic Quadrant and The Forrester Wave, it makes sense to see Adobe’s CMS flagship widely used by large enterprises. 

Of course, we, as a Swiss Adobe and AEM partner, are a bit biased at One Inside, but the market proves that it’s one of the best digital experience platforms.

AEM is used by 50% of the top 10 Swiss insurers and by 36% overall in the top 25. Sitecore is also present. The .NET CMS is one of the top three solutions of the CMS segment and is often a choice for large enterprises. 

Magnolia CMS is the little one of the trio with good market share. Magnolia is a Swiss-made CMS and has an important footprint in its home country. AEM actually also has Swiss roots (if you remember the days of CQ).

A February 2021 EY survey, which interviewed the IT leadership of more than 70 European insurers, found that insurers have a bold ambition to move to cloud solutions.

The survey says: “most insurers aim to move at least 80% of their business to the cloud in the coming years to meet the primary objectives of increased agility and digital transformation.”

We could argue that this trend is valid for Swiss insurers, and that the challenges they face are similar to other European companies: agility and time to market.

What does this mean for their choice of content management system?

At the core of the customer experience lies the CMS. New cloud CMS solutions offer all the benefits insurance companies are looking for: cost optimisation and more agility and speed of execution.

In the upcoming years, insurance companies might adopt such a solution and give up their on-premise solutions. They will get the benefits of delivering new customer experiences and releasing new offerings faster.

The current CMS landscape will evolve in the next 3 to 5 years, and insurance companies might then use native cloud CMS more commonly. 

Adobe and its AEM Cloud Service already have an excellent chance to gain market share. Indeed, its CMS is, for the moment, one of the only cloud-native enterprise CMSs on the market.

Sitecore and Magnolia are technically behind and offer a “cloud” CMS that is closer to managed cloud hosting or PaaS than the actual SaaS model that Adobe offers. New players might enter the market, such as Headless CMS vendors.

Analytics and Personalisation: Google vs Adobe

Google Analytics is the leading analytics solution and is used by two-thirds of the 25 top Swiss insurers. The only contender is Adobe Analytics. 

Adobe Analytics is used by half of the top ten insurers. It is often used together with AEM and other solutions of the Adobe Experience Cloud, such as Adobe Target, for personalisation purposes.

Both analytics solutions answer the needs of enterprises. While Google Analytics might be more straightforward to set up, Adobe Analytics offers more web analytics features.

Analytics and personalisation often go hand in hand, but don’t always come together. While 100% of the companies use an analytics solution, not all of them combine it with a personalisation solution.

A personalisation solution can adapt the user experience based on behavioural information and helps the marketing team run A/B Tests.

Only ten of 25 Swiss insurers are using a personalisation solution. The leading vendors are Google, with Google Optimize 360, and Adobe, with Adobe Target. 

Both products offer a set of similar features with A/B testing and personalisation. 

Optimize 360 is natively integrated with Analytics 360, so you can use Analytics 360 reporting to understand where to improve your site quickly. 

Adobe Target offers a native connection with Adobe Analytics, Adobe Experience Manager and other Adobe Experience Cloud solutions. The native integration with the CMS offers great opportunities, such as adapting the display of web components based on behavioural rules or even based on complex evaluation by AI.

For instance, the Swiss insurance company CSS offers a personalised experience by leveraging Adobe Experience Manager with Target. 

While browsing the website and visiting different offerings and insurance products, the visitor will see teasers and other web elements aligned with their interests. Find more information about the CSS project here.

A fragmented Marketing Automation landscape

For the marketing automation part, we focused mainly on lead generation and nurturing aspects via email.

Identifying the marketing automation solutions and email providers used by the 25 top Swiss insurers was not the most straightforward task. Indeed, this information is not always publicly available. 

From the information we gathered, we noticed that the technology landscape of marketing automation is fragmented.

No vendor is leading in this segment, and various software vendors are used, such as Salesforce, Adobe Campaign, Sendinblue, Emarsys, Campaign Monitor, XMPie, Mailjet and even Mailchimp.

It’s pretty surprising to find Mailchimp on the list 5 times total.

Why surprising?

Mailchimp is marketing automation software targeting SMBs and e-commerce websites, and its marketing automation features are less appropriate for large enterprises. But it seems that Mailchimp fulfils the needs of a few insurance companies anyway.

Use cases, in regard to requirements, are limited. The main lead capture tactics are executed through premium calculators or newsletter subscriptions. We believe the first one drives the most leads. Also, we noticed that the nurturing campaigns are not very advanced, at least for the ones we tested.

Few marketing automation success stories are publicly available.

Suva, the Swiss National Accident Insurance Fund, uses XMPie for its email campaigns and achieved a 127% increase in signatories to the Suva Safety Charter, from 1,500 to 3,414 companies, thanks to an omnichannel onboarding mixing email and print documents (Source).

Chatbot, an under-exploited channel

As we already noticed with the use of Marketing Automation software, chatbot solutions are also not very present on the public websites of Swiss insurance companies.

Chatbots offer similar benefits as Marketing Automation software. They are an excellent tool to capture lead information and engage with the client in the channel of their choice, if done right.

On top of this, chatbots have the advantage of offering immediate answers to clients while giving the company great insights into the questions and concerns of their audience. 

Of the 25 companies from the study, 15 don’t offer any chatbot on their public website.

For the ten companies with a chatbot, again, the solutions are disparate. Some offer WhatsApp or Facebook bots; others offer simple rule-based chatbots. Very few companies fully embrace the conversational channel and offer advanced AI chatbots.

At One Inside, we had the chance to collaborate with CSS and build a fully integrated AI chatbot.

Firstly, the chatbot is seamlessly integrated into the layout of the website and offers an outstanding user interface. Secondly, the chatbot is completely integrated into the Adobe Experience Manager CMS. 

Integration in AEM allows marketing teams to manage chatbot questions and answers directly from the same back-end they use to administer their web content: it facilitates the reuse of web content. Plus, the chatbot is intelligent and learns over time.

The below video explains the AEM Chatbot Module.

It’s hard to explain why Swiss insurers are reluctant to deploy AI Chatbots.

To compare with the US market: according to a 2019 LexisNexis survey, more than 80% of large U.S. insurers have fully deployed AI solutions in place, including the research and development of chatbots – and this was two years ago.

It could be that the priority has shifted to other areas of the marketing stack, or that chatbot projects for enterprises are still hard to execute.

Mobile App, the customer portal in your pocket

84% of Swiss insurers offer a mobile app for iOS and Android smartphones. The few companies that didn’t invest in a mobile application are the ones that don’t have a customer portal. 

Indeed, the primary mobile application use case for insurers is to offer access to all customer information directly from a smartphone. The mobile app is an extension of the customer portal. 

Some insurers, such as Visana, found innovative use cases with their mobile application to increase customer engagement.

The solution built by Visana is called myPoints, and it’s a bonus programme. By doing more physical activity every day, you get rewarded with up to CHF 120 per year. An excellent way to stay healthy.

A step toward digital maturity

From the single prism of marketing technology, we were already able to judge the digital maturity of Swiss insurance companies. 

The foundation of the customer experience is already in place, at least for the top Swiss insurers, and they make great use of their content management system to support their content strategy.

More improvement could be gained on automation, conversational channels, and artificial intelligence.

Especially in marketing automation, the maturity could be higher, for example, by introducing a modern solution, such as Adobe Journey Optimizer, the new cloud-based marketing automation solution from Adobe that is seamlessly integrated in Adobe’s real-time CDP. 

We assume that certain insurers are already working on such solutions, as they are aware that they always need to modernise their marketing stack.

To conclude, the leading insurer, Zurich, sets ambitious targets to meet customers’ needs. As mentioned in this media release, Zurich will continue “to further transform insurance, using technology to meet changing needs and create rewarding experiences“. 

We’re very excited about the results, and are curious to see if the other market contenders will follow suit.

Samuel Schmitt

Samuel Schmitt

Digital Solution Expert

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about marketing technologies.

The post The Marketing Technology Landscape of Swiss Insurers appeared first on One Inside.

by Samuel Schmitt at August 31, 2021 04:30 PM

July 19, 2021

CQ5 Blog - Inside Solutions

Headless CMS with AEM: A Complete Guide

Headless CMS with AEM: A Complete Guide

You might have already heard about Headless CMS and you may be wondering if you should go “all-in” with this new model.

In our complete guide, we are going to answer the most common questions, such as

  • What is the difference between Headless and traditional CMS?
  • Is headless the best choice for your next website implementation?
  • How to use a headless CMS for your next project

It’s best to understand what Headless CMS means before making any decision to start developing your next web project on a content delivery model that won’t fit.

At One Inside, our expertise lies in the implementation of the Adobe CMS, Adobe Experience Manager (AEM). We can show you what AEM can do in regard to content delivery — and in which cases headless is recommended.

What is a traditional CMS?

This is likely the one you are familiar with. Traditional CMS uses a “server-side” approach to deliver content to the web.

The main characteristics of a traditional CMS are:

  • Authors generate content with WYSIWYG editors and use predefined templates.
  • HTML is rendered on the server
  • Static HTML is then cached and delivered
  • The management of the content and the publication and rendering of it are tightly coupled

Let’s define what a headless CMS is now.

What is a headless CMS?

A headless CMS decouples the management of the content from its presentation completely. Headless CMS can also be called an API-first content platform.

The authors create content in the backend, often without a WYSIWYG editor. The content created is not linked to a predefined template, meaning the author cannot preview the content.

The content is then distributed via an API. 

The presentation of the content on the website, mobile app, or any other channels, is done independently. Each channel fetches the content and defines the presentation logic.

A headless CMS is mainly made up of:

  • A backend to create structured forms of content
  • An API to distribute content

Let’s speak about the last category of CMS supporting both traditional and headless. 

What is a hybrid CMS?

A Hybrid CMS is a CMS supporting both content delivery models: the Headless and the traditional.

You can create content in a classic way, by creating a page in the backend, previewing the page, and publishing it. 

On the other hand, the content created can be distributed via an API as well.

This hybrid approach is offered by traditional CMSs. It’s straightforward for them to deliver the content from their repository via an API.

With a hybrid approach you get the best of both worlds:

  • Single source for content and assets
  • Multichannel delivery
  • Authors only need to learn one tool for all content authoring
  • The administration is simplified (one login, one server, one technology for content etc)

Is AEM a Headless CMS?

Yes, with Adobe Experience Manager you can create content in a headless fashion.

The content can be fully decoupled from the presentation layer and served via an API to any channels.

You might know that AEM offers a great interface for authors enabling them to create content by using predefined templates and web components.

The content can be organised hierarchically and published immediately to websites or any other channels.

As AEM offers the (very) best of both worlds, it supports the traditional approach and the headless way. AEM is considered a Hybrid CMS.

The Headless features of AEM go far beyond what “traditional” Headless CMS can offer.

How does AEM work in headless mode for SPAs?

Since version 6.4, AEM supports the Single Page Application (SPA) paradigm with the SPA Editor.

This enables content authors to build dynamic as well as content-focused applications in the same way they are used to creating pages.

SPAs are currently mostly used for static applications. 

Enabling dynamic page creation, layouts and components in a SPA with a visual content editor shows how valuable AEM’s Hybrid CMS approach is.

With the SPA Editor 2.0 it’s possible to only deliver content to specific areas or snippets in the app.

We are going to look into several aspects of how AEM implements the headless CMS approach:

  • What is the difference between rendering HTML in the backend vs SPA?
  • How does the SPA WYSIWYG content editor work?
  • What are Content APIs?
  • How to develop SPAs with AEM
  • What is Server Side Rendering (SSR)?
  • How to use Content Fragments and the GraphQL API?

We will use the technical insight gained in this section to conclude what the pros and cons of SPAs with AEM are and in which cases this approach is best.

Rendering HTML in the backend vs Single Page Application

Traditionally, the HTML of a web page is rendered by a backend server.

The browser loads the HTML and linked resources. Javascript is then used to enhance the user experience with dynamic functionality. 

When a user navigates to another page, this one loads again and the process is repeated. 

This approach works well for simpler and static pages. However, websites have become more and more complex and feature rich.

Websites now often behave like full-fledged applications such as a social media platform or a banking portal. 

The complex interactions, state management and consumption of many APIs make it difficult to develop and maintain frontend code.

That is why the Single Page Application method of developing dynamic web pages gained a lot of traction in the last decade. The responsibility of the view layer is shifted to the location where it is displayed: the browser.

With the SPA approach, the first page load delivers an (almost) empty HTML document together with JavaScript and CSS. JavaScript then dynamically assembles the web page.

This is a simplified example of the HTML delivered by a SPA:

<html>
  <head>
    <script src="/spa/main.js"></script>
    <link rel="stylesheet" href="/spa/style.css">
  </head>
  <body>
    <!-- empty container into which the SPA renders the page -->
    <div id="app"></div>
  </body>
</html>

Any additional data or content like text, products, user account info, images etc. are requested from backend APIs.

If the user navigates to a different link, only the content area is replaced and re-rendered; the rest of the UI does not change. This makes the site feel more like an application than a traditional website.

The SPA WYSIWYG content editor

The AEM “what you see is what you get” editor was extended to support SPAs.

Seeing the content created directly in the app is a blessing for anyone who has worked with a form-based editor (of a traditional Headless CMS). 

Even better, an author that is familiar with AEM will immediately feel at home and be able to create content without learning a new tool – while also reusing any other AEM content or assets from the traditional web page.

How did Adobe implement this? Here’s an overview of the main elements:

If this all looks very familiar to you – it is! Except for steps 3, 7 and 8, it’s all the same as with a backend-rendered page. You can find a more detailed view of this here.

Content APIs

We referred to Content APIs, also called AEM Content Services, a couple of times already. 

What are they and how do they work?

AEM is built on the RESTful Sling framework. Architecturally, the visualisation layer is already completely decoupled from the data through the Java Content Repository. So in this regard, AEM already was a Headless CMS.

This shows that on any AEM page you can change the extension from .html to .json (or .infinity.json to be more correct) and AEM will return all the content for the requested page. If you currently use AEM, check the sidenote below.
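
For illustration, such a request from a client could look roughly like this – a minimal sketch, with a hypothetical host and content path:

// Swap .html for .infinity.json to get the raw repository data of a page.
// Host and content path are made-up examples.
async function fetchRawPageData(): Promise<unknown> {
  const url = 'https://aem-publish.example.com/content/mysite/en/home.infinity.json';
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }
  // The response is the unfiltered node tree, including technical properties
  // such as jcr:createdBy – which leads to the issues listed below.
  return response.json();
}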

For this request AEM will return the raw data stored in the repository for the requested path. Even though this could be considered “Content as an API”, there are several issues with this approach:

  • The format of the data is unstructured
  • It’s difficult to work with for clients
  • Contains unneeded or unwanted data like the username of the author
  • Not stable in regards to changes
  • Paths will not be externalised
  • It is not possible to inject further logic, like resolving additional information for an image path

Therefore an additional layer, called the Sling Exporter Framework, was introduced.

It allows us to easily define how existing Sling Models should be transformed and serialised to certain data formats like JSON or XML. It is basically the mapping from your internal data to the data exposed in the API.

Since Sling Models are already the basis of any modern AEM project this makes it straightforward to provide a transformation of the web content to JSON.

Adjusting existing Sling Models to support the Sling Exporter Framework usually just requires a single line:

      @Model(adaptables = SlingHttpServletRequest.class)
@Exporter(name = "jackson", extensions = "json")
public class Text {
	@ValueMapValue
	@Getter
	private String text;

	@ValueMapValue
	@Getter
	private boolean isRichText;
}
    

If this component is rendered with the .model selector, the following JSON will be generated by the exporter framework:

      {
	"id": "text-2d9d50c5a7",
	"text": "<p>Lorem ipsum dolor sit amet.</p>",
	"richText": true,
	":type":  ".../components/content/text"
}
    

Actual examples of how the data would be transformed can be found on the core components dev page.

In the example of the image component you can note the following:

  • User data is not exposed in the JSON
  • The asset path is rewritten
  • Models can be automatically enhanced with auxiliary data useful for clients

In this case, AEM Core Components inject the required fields for Adobe Analytics with the standardized data layer for modern event-driven tracking.

With the corresponding Adobe Launch Extension this enables zero configuration Adobe Analytics Integration for Core Components.

Repository Data:

      jcr:primaryType: nt:unstructured
jcr:createdBy: admin
fileReference: /content/dam/core-components-examples/library/sample-assets/lava-into-ocean.jpg
jcr:lastModifiedBy: admin
jcr:created:
displayPopupTitle: true
jcr:lastModified:
titleValueFromDAM: true
sling:resourceType: core-components-examples/components/image
isDecorative: false
altValueFromDAM: true
    

JSON Exported Sling Model:

      {
  "id": "image-f4b958f398",
  "alt": "Lava flowing into the ocean",
  "title": "Lava flowing into the ocean",
  "src": "/content/core-components-examples/library/page-authoring/image/_jcr_content/root/responsivegrid/demo_554582955/component/image.coreimg.jpeg/1550672497829/lava-into-ocean.jpeg",
  "srcUriTemplate": "/content/core-components-examples/library/page-authoring/image/_jcr_content/root/responsivegrid/demo_554582955/component/image.coreimg{.width}.jpeg/1550672497829/lava-into-ocean.jpeg",
  "areas": [],
  "lazyThreshold": 0,
  "dmImage": false,
  "uuid": "0f54e1b5-535b-45f7-a46b-35abb19dd6bc",
  "widths": [],
  "lazyEnabled": false,
  ":type": "core-components-examples/components/image",
  "dataLayer": {
    "image-f4b958f398": {
      "@type": "core-components-examples/components/image",
      "repo:modifyDate": "2019-01-22T17:31:15Z",
      "dc:title": "Lava flowing into the ocean",
      "image": {
        "repo:id": "0f54e1b5-535b-45f7-a46b-35abb19dd6bc",
        "repo:modifyDate": "2019-02-20T14:21:37Z",
        "@type": "image/jpeg",
        "repo:path": "/content/dam/core-components-examples/library/sample-assets/lava-into-ocean.jpg",
        "xdm:tags": [],
        "xdm:smartTags": {}
      }
    }
  }
}
    

A complete example of a content structure which supports the Sling Exporter Framework might look like this:

Pages with their properties, the editable template structure for static components, responsive grid (“parsys”) components and the content of the components – basically all information which describes the content of the page – are exported in a well-defined, consistent API intended to be consumed and rendered by clients.
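
As a rough orientation of what this well-defined API looks like, the exported page model follows a shape along these lines – a simplified sketch; the ":"-prefixed keys follow the core components JSON convention, everything else is illustrative:

// Simplified shape of the JSON a page exposes through the Sling Exporter Framework.
interface ComponentModel {
  ':type': string;               // resource type; the client picks the matching component
  [property: string]: unknown;   // exported component properties, e.g. "text", "richText"
}

interface ContainerModel extends ComponentModel {
  ':items': Record<string, ComponentModel>;  // child components keyed by name
  ':itemsOrder': string[];                   // order in which the children are rendered
}

interface PageModel extends ContainerModel {
  ':path': string;   // content path of the page
  title?: string;    // page properties exported alongside the structure
}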

Sidenote for AEM users

Are you using AEM for your website? 

If so, try replacing the .html with .json. Do you get any JSON back? If yes, you can try to add another child path named "jcr:content" before .json: /de/home.html -> /de/home/jcr:content.json.

If any JSON is returned you will probably also see some technical metadata and a username. Depending on how users are created, this may be a cryptic number, but it could also be a readable name or an email address.

In any case, you might want to discuss this with your team. Adobe recommends disabling the default Sling GET servlet on the productive publish instances.

Benefits of developing Single Page Application with AEM

Adobe put a lot of effort into making it as simple as possible to get up and running and develop SPAs with AEM. 

All the mechanisms are also tightly integrated into existing AEM technology, making SPAs a first-class citizen in AEM.

In the following, we will give a small overview of how the setup looks to give some further insights on what impact this has on developer teams working with AEM.

Setup & Onboarding

Adobe provides a reference implementation called Core Components with a large set of components containing all the current best practices in regards to AEM development (SPA or non-SPA). 

The projects can be set up locally with the project templating tool AEM Maven Archetype for on-premise or AEM as a Cloud Service installations. It supports creating a React or Angular SPA project template with the following:

  • AEM base setup
  • Core Components
  • Setup for Sling Exporter Framework
  • A frontend build chain that builds and deploys all assets directly into AEM
  • Angular / React libraries for the AEM integration
  • A static preview server for local, AEM-independent frontend development

Further, there is a very good starting tutorial for React and Angular that will get developers up and running quickly. 

Of course, this doesn’t mean that a developer is ready to deliver production-ready AEM SPA solutions the next day, but it’s good to know that Adobe is committed to simplifying the onboarding of developers.

AEM SPA Backend Development

AEM backend developers will have less work to do, because they no longer have to integrate the frontend by taking the static HTML and migrating it to HTL templates. This usually is a big pain point of any larger AEM project, introducing bugs and costing time.

With the SPA approach, the interface between backend and frontend is no longer the HTML markup but instead the Content API, which can be predefined, specified and more easily adjusted.

Further, the responsibility of rendering the UI in the browser goes to the frontend team, where the expertise in these areas usually lies anyway.

AEM developers can focus on what they know best: building the solid backbone of the application and ensuring content authors have the required tools and enjoyable interfaces to create content.

AEM SPA Frontend Setup

In AEM projects, frontend developers usually build a static prototype with a set of static components which are handed to the backend.

This is, as mentioned, usually a very inefficient process. It is hard to tackle this problem without requiring frontend developers to install AEM, which comes with its own set of problems.

Therefore, teams often just accept this aspect of AEM development, which restricts frontend developers in building great user experiences.

With the SPA approach, frontend developers get the full power and responsibility of the frontend – without having to know a lot about or install AEM.

This is all due to the fact that the only communication with the backend is via the Content API and the clear separation of providing data and the presentation layer.

We know how Content APIs deliver the content. But how are they consumed from the frontend? There are three setups, each valuable depending on the context.

JSON Mock

A JSON mock file is basically a copy of an example output directly from AEM Content API.

The frontend developer can adjust this file, switch content, add components and more. They can simulate authoring in AEM by editing a file, as long as they move within the specification of the Content API.

Point to remote AEM

The frontend setup can be configured to point to a certain remote AEM instance like QA, pre-production or even production! Gone are the days when frontend developers had to struggle to reproduce an issue in their local environment.

This enables many more possibilities, like running a whole integration test suite in the frontend on productive content without having to move large amounts of data to a different stage.
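
A minimal sketch of how the first two setups can be combined in the frontend code – the environment variable, host and paths are assumptions; the actual wiring generated by the archetype differs from project to project:

// Decide at build time whether the SPA reads a local JSON mock
// or the Content API of a remote AEM instance (QA, pre-production, ...).
// API_HOST and the mock path are hypothetical names.
const API_HOST = process.env['API_HOST'] ?? ''; // empty -> use the local mock

export async function loadPageModel(pagePath: string): Promise<unknown> {
  const url = API_HOST
    ? `${API_HOST}${pagePath}.model.json`  // point to a remote AEM instance
    : '/mocks/home.model.json';            // local copy of an example Content API response
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Could not load page model from ${url}: ${response.status}`);
  }
  return response.json();
}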

Running as part of the AEM installation

As might be expected, the frontend can be deployed to an AEM instance.

It is important to locally test the integration with the authoring environment and the SPA editor during development.

This part could also be taken care of by the backend developers to avoid forcing the frontend developers to install AEM locally.


Of course, this will also be the setup used on stages or on production.

AEM Frontend development

After understanding how the content comes to the SPA – how does the frontend code know how to render the content?

Similarly, how does the AEM editor know how to communicate with the SPA when the author changes content?

This is handled by the frontend libraries provided by Adobe for React and Angular.

Other frameworks are under consideration, but nothing has been announced regarding, for example, Vue or Svelte support. The only option in these cases would be to build a similar library.

These libraries contain functionality to parse the Content API output, instantiate the required components, fill the content properties and dynamically put them in the required order into the application context so that they are rendered on the page.

The same goes for other aspects like switching a page, routing and so on. Most of this works out of the box.

The main task the frontend developer has is to map the frontend components to the resource types of the backend Sling Models.

This will also enable the AEM editor to inform the frontend about which component needs to refetch its content when editing.

To demonstrate this in practice, consider a standard Angular component, for example a simple text component.

The component code then has to be extended with a mapping and an edit configuration (see the sketch below).

This “maps” the frontend code to an AEM resource type, which corresponds to a Sling Model, which in turn maps to the content in the repository.

Additionally, an “Edit Config” is provided to give hints to the AEM editor, for example when a component is considered “empty”, so that a placeholder can be displayed.
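
To give an idea of what this looks like in code, here is a minimal sketch for a simple text component, assuming Adobe’s Angular editable-components library; the package name, component and resource type are illustrative and will differ in a real project:

// text.component.ts – a plain Angular component rendering the exported "text" property.
import { Component, Input } from '@angular/core';
// MapTo comes from Adobe's SPA editor integration library for Angular
// (package name assumed here; check the version used in your project).
import { MapTo } from '@adobe/cq-angular-editable-components';

@Component({
  selector: 'app-text',
  template: `<p [innerHTML]="text"></p>`
})
export class TextComponent {
  @Input() text: string;
  @Input() richText: boolean;
}

// Edit config: tells the AEM editor when the component is "empty",
// so that a placeholder can be shown to the author.
const TextEditConfig = {
  emptyLabel: 'Text',
  isEmpty: (props: { text?: string }) => !props || !props.text || props.text.trim().length === 0
};

// Map the frontend component to the resource type of the backend Sling Model.
MapTo('my-project/components/content/text')(TextComponent, TextEditConfig);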

Server Side Rendering (SSR)

There are two major concerns when using the SPA approach: initial page loading speed and SEO. 

Both can be solved by adding SSR to the architecture. We will also briefly discuss how to actually implement this architecture (source: Adobe).


Initial page load issue with SPA

As discussed, with SPAs only an HTML document with an empty body is sent to the browser.

Therefore, the browser can initially not display any content and the user has to wait until the Javascript – which for SPAs is usually quite a lot compared to more traditional sites – is done loading, parsing and executing.

In the meanwhile, only a blank screen or a loading spinner is visible.

For a complex website like a banking application this is less of an issue since the users probably accept some loading time.

For a content-focused page that the user expects to load within a couple hundred milliseconds, this is usually not acceptable.

Server Side Rendering solves this by instead executing the Javascript in the backend in a headless browser or a NodeJS server and returning the populated initial HTML to the client.

The browser can directly start rendering the HTML and the users get immediate feedback. The browser will still fetch the Javascript and the SPA will inject itself into the rendered HTML (this is usually referred to as “rehydration”).

From there on, the execution flow continues as if the page was initially rendered in the browser. For more details and considerations we recommend this article by Google.
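
As a very condensed sketch of such an SSR step for a React-based SPA, using a small Express handler – host, paths and component names are assumptions, and Adobe’s recommended setup on Adobe I/O will look different:

import express from 'express';
import { createElement } from 'react';
import { renderToString } from 'react-dom/server';
import { App } from './App'; // hypothetical root component of the SPA

const AEM_HOST = 'https://aem-publish.example.com'; // assumed host
const app = express();

app.get('*', async (req, res) => {
  // 1. Fetch the page model from the AEM Content API for the requested path.
  const model = await (await fetch(`${AEM_HOST}${req.path}.model.json`)).json();

  // 2. Render the SPA to an HTML string on the server.
  const html = renderToString(createElement(App, { model }));

  // 3. Return populated HTML; the browser can render it immediately, and the SPA
  //    "rehydrates" itself once the JavaScript bundle has loaded.
  res.send(`<!doctype html>
<html>
  <head><script defer src="/spa/main.js"></script><link rel="stylesheet" href="/spa/style.css"></head>
  <body><div id="app">${html}</div></body>
</html>`);
});

app.listen(3000);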

SEO and SPA

When a search engine crawls a SPA it will only see a blank HTML without content to parse. This can cause issues with SEO and ranking.

Some crawlers, like Google, support executing Javascript. Similar to SSR they will execute the Javascript to then crawl and index the content.

However, Google does not treat HTML and Javascript rendered pages equally. There are always two passes: first raw HTML and then processed Javascript.

These passes are not treated equally: the first pass, which only reads and processes the HTML, has priority. The consequence of this can be a (much) longer timeframe until a page is indexed and ranks on Google.


Even though the disadvantages regarding Google seem to be decreasing, not all search engines support executing Javascript.

There are also other contexts where non-SSR SPAs become problematic, for example generating a preview to share a page on social media.

Server Side Rendering in AEM

The groundwork and architectural pattern for SSR with AEM is proposed by Adobe and the recommendation is to use Adobe I/O to build the infrastructure.

We also heard that Adobe is working on an extended tutorial or possibly even a reference implementation for SSR with Adobe I/O.

But for now this would have to be developed with a custom implementation on the basis of this sample code base.

How does Headless AEM work for clients that are not web-based?

So far this article focused on content-focused web pages or mobile hybrid SPAs.

The headless capabilities of AEM and the decoupling of content from HTML rendering enable many more use cases and applications where content needs to be displayed, from native Android or iOS apps, social media snippets and digital signage systems to small IoT devices.

To accommodate such a vast ecosystem, loosely structured web content is problematic.

However, the rich feature set of AEM also allows you to create structured content according to a predefined model by using the "AEM Content Fragments" feature.

Content fragments are predefined form-based or simple rich text pieces of content which can be linked and structured. Other content from AEM like text, assets and tags can of course be reused in content fragments as well.

Fetching structured data with GraphQL

Recently AEM was extended to allow consuming content fragments with GraphQL (besides the already existing simpler JSON APIs). 

In concept, GraphQL can be compared to a SQL database query, the difference being that the query is not used for a database but instead an API. 

This allows different clients to query the API according to their own needs instead of the API having to provide different endpoints returning different amounts or sets of data for different clients. 

For example, a smartwatch might want to display less content than the corresponding app and would query only what is needed without the backend having to support this use case.

Since GraphQL requires a predefined data structure, it would not work that well with loosely structured web content, so content fragments were the obvious choice.

Adobe is working on GraphQL support, and additional features like subscriptions, mutations and pagination are on their way. Due to the flexibility of the query possibilities, performance is a key topic for any GraphQL API.

Adobe plans to tackle this by using “persisted queries”.

A client will first “register” a query, and AEM will return a handle for it. This query handle can then be invoked with a simple GET call which can be cached, making any following query fast and scalable.
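
A rough sketch of that flow from a client’s perspective; the endpoint paths, project name and content fragment model are assumptions for illustration, not the final AEM API:

const AEM_HOST = 'https://aem-author.example.com'; // assumed host

// Only the fields this client actually needs are selected; a smartwatch client
// could register an even smaller query against the same content.
const articleTeaserQuery = `{
  articleList {
    items {
      title
      author
    }
  }
}`;

// 1. Register ("persist") the query once; AEM stores it under a handle.
async function persistQuery(): Promise<void> {
  await fetch(`${AEM_HOST}/graphql/persist.json/my-project/article-teasers`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: articleTeaserQuery
  });
}

// 2. Execute the persisted query with a simple, cacheable GET call.
async function fetchArticleTeasers(): Promise<unknown> {
  const response = await fetch(`${AEM_HOST}/graphql/execute.json/my-project/article-teasers`);
  return response.json();
}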

When to implement Adobe Experience Manager in a headless way

As discussed in a previous article about Headless CMS vs Hybrid CMS, you should not go headless at any price. There are some limitations and it might not be the best fit for every use case.

Let’s go through the pros and the cons of the headless approach.

The pros of the headless approach:

  • Enables implementation of content-focused SPAs with dynamic pages, layouts and components (for web or hybrid mobile apps)
  • Delivers a unique user experience for content pages achievable with SPAs (e.g. no page reload)
  • Well known WYSIWYG authoring environment for SPAs
  • “Full” integration for content focused SPAs, (upcoming) “light” integration for existing projects or limited content areas
  • Content & Asset reuse – all content from one CMS
  • Omnichannel delivery of content to any type of client from smart watches to digital signage to IOT
  • Supported by Adobe (editor, backend technology, frontend libraries, project setup)
  • Separation of concern: AEM developers build the backend, frontend developers the frontend – no “integrating of markup” anymore
  • Probably fewer AEM developers and more React / Angular frontend developers required, simplifies hiring

The cons of the headless approach:

  • SSR setup required, especially for content focused web pages (initial page load & SEO)
  • No clear guidance on how to implement SSR from Adobe (yet)
  • Traditional approach well established, architectures with many successful projects scaled to large user bases 
  • Many unknowns: learning curve for developers, some bumps in the road are to be expected with any new technology
  • Frontend developers still have to consider the AEM editor with some limitations e.g. no use of the CSS viewport height (vh) units
  • Not all AEM features are supported (yet)
  • Only Adobe support for Angular and React, no Vue or Svelte

From this we conclude the following three key points:

  1. SPAs for content-focused web pages are now a valid approach that makes it possible to deliver a new UX without any sacrifices in content creation, with (almost) the full capabilities of AEM supported. This requires the implementation of an architecture with SSR, which has to be custom-built until there is direct guidance from Adobe.
  2. AEM is a fully capable headless CMS that can deliver content to any device or screen with modern technologies and standards (JSON API, GraphQL etc) which should be able to scale to large user bases due to performance optimisations by Adobe.
  3. Separation of concerns in regards to providing data and presenting data on the technical level means great improvements for the developer teams. No integration of markup. Frontend developers get the freedom they need and will be happier coders. Backend developers can focus on what they know best. A side benefit of this is that probably fewer AEM developers will be needed which are much more difficult to hire than React or Angular developers.

From these takeaways we can recommend AEM headless or hybrid to be considered when the following points are met:

  • You aim to deliver the same experience and code base for a content-focused page on the web and a hybrid mobile app.
  • You struggle to find enough AEM developers for web-based projects but have a strong team of frontend developers.
  • You have an existing SPA and want to display content in limited areas of the app (wait for the upcoming SPA editor 2.0).
  • You want to deliver content from AEM to platforms that are not web technology based (headless).

Finally, is Hybrid CMS the best solution?

Yes, absolutely!

Hybrid CMS is the future, because it makes it possible to keep the established traditional approaches while being able to deliver content to any other device or platform, all from a single, consistent user interface using modern technologies like SPAs and GraphQL.

A single platform for all your content also means reusing content across all platforms!

Text, assets, tags, Content Fragments, Experience Fragments – all can be reused on your traditional site, SPA, native iOS or Android App, digital signage (AEM Screens) or on a toaster with a display.

Even better, through the tight integration of AEM into the Adobe Experience Cloud, content can be further reused in emails (Adobe Campaign or Marketo), personalisation (Adobe Target) and many more tools and technologies. 

For web technology based projects, it also allows you to split the teams according to the separation of concern.

This means that developers can focus on what they know and do best. Plus, hiring might be simpler because potentially fewer AEM developers will be needed.

The drawback? All this technology is quite new and there are no established best practices yet, so there naturally are some unknowns and risks for new projects.

Basil Kohler

AEM Architect

The post Headless CMS with AEM: A Complete Guide appeared first on One Inside.

by Samuel Schmitt at July 19, 2021 11:30 AM

June 07, 2021

CQ5 Blog - Inside Solutions

AEM Screens: Questions, Answers and Lessons Learned

AEM Screens: Questions, Answers and Lessons Learned

At One Inside we have the chance to collaborate with our customer SBB, the Swiss Federal Railway, on innovative projects. 

These projects blend many channels and sources of information to provide travellers with up-to-the-minute information throughout their journeys.

These are next-generation omnichannel experiences. 

There is one particular project that we want to highlight today, the “CMS customer information”. 

The aim of this project is to deliver valuable customer information to different screens at train stations, among them Smart Information Displays – touch-screen based kiosks that will be available on several hundred train stations all over Switzerland.

On the solution side, Adobe Experience Manager is used as the base content management system, and AEM Screens, Adobe’s Digital Signage solution, is used to deliver content to the various screens.

You may be wondering:

  • How to leverage a digital signage solution? 
  • What is AEM Screens? 
  • What lessons did we learn from the project? 

We will answer all those questions and more below.

What is AEM Screens?

Adobe Experience Manager (AEM) Screens is a digital signage solution allowing you to publish dynamic content and digital experiences to a wide variety of screens and displays at your store, your premises or at train stations.

The beauty of AEM Screens is that it is part of the Adobe Experience Manager solution, Adobe’s powerful CMS. 

It enables marketers and content editors to work from a single place, create content for the website and mobile channels, and push assets to displays as well.

The aim is to manage all assets for any channel from one simple and intuitive interface.

The video below is a replay of a webinar held in January 2021, where we detail our project with the Swiss Railways.

Later in this article, we share the audience questions, together with more insights about AEM Screens and its integration in a large enterprise ecosystem.

What kind of displays and screens can you manage with AEM Screens?

Any kind of screen – menu boards, touchscreens, widescreens and more – can be used with AEM Screens. You can then deliver unified and useful experiences into physical spaces.

In the project done for SBB, the Swiss Railway, we introduced 3 kinds of screens:

  • Smart Information Displays (SID): Their main purpose is to replace the paper timetables and network plans and offer information in an interactive way at the train station.
  • E-Panels (ad screens): During large disruptions, ad screens can be used to provide customer information. 
  • Inspiration Desks: Located in Traveler Information Centers, inspiration desks inspire customers to buy leisure trips.

What are the key features of AEM Screens that you used in this project?

There are several features that we used and would not want to miss: 

  • Configuration: It’s very easy to connect a new screen to the system and provide it with a new configuration. For example, this is how we configure the Smart Information Displays (touch-screen based kiosks). 
  • Content: AEM Screens is directly integrated into AEM Sites and AEM Assets and can use the same content that is used on the website or in other channels. It’s easy to navigate to a specific screen (just choose the train station and then the corresponding screen(s)) and add content to it. 
  • AEM Screens Player: AEM Screens has its own player that runs on the media player hardware that delivers the signal to the screen and acts as a client for the AEM Screens CMS. This setup allows us to provide any kind of information to a screen – be it a loop of nice images and videos, a single-page-application that interacts with the user through a touch display, or service disruption information pushed to the screen within seconds. 
  • Flexibility: AEM is basically a hybrid CMS (a mixture between a traditional and a headless CMS) with a digital signage package attached to it. Because it’s heavily based on open interfaces and standards, we were able to find an interface to connect to every backend system and every screen we wanted to. This was crucial more than once during the project. 

What is the impact on content creation, or how easy is it to use existing assets with AEM Screens?

AEM Screens is part of Adobe Experience Manager, the famous content management system for enterprise. 

AEM Screens uses the same user interface and functionality for content management as AEM Sites – the functionality that allows creating websites – so that authors feel at home immediately. 

Existing assets can be used either by dragging and dropping them manually into channels where they get immediately published to the screens.

Or existing web content can be reused by re-rendering it to a screen-optimized design (using specific stylesheets) – for example, you don’t want links to external sites that a user at a touch-based display can tap on. 

For the project with SBB, most of the content we push to screens is automatically consumed from backend systems, enriched with additional assets, and rendered for the specific screen, all automatically.

Is it possible to build interactive and personalized experiences with AEM Screens?

Yes and Yes. Actually, we already built an interactive experience with the Inspiration Desks to be rolled out in all-new travel information centers at large train stations in Switzerland.

It’s based on a single page application (SPA) that uses Angular and loads data (from AEM Sites) asynchronously.

The SPA runs in an AEM Screen channel and can be replaced by an idle channel that shows images and videos when nobody is at the desk. 

Interaction with the SPA happens through a touch-based screen limiting some functionalities (e. g. no keyboard entries), but this is not a huge problem. QR codes are used to get content to the user’s mobile phone.

If we can identify a user standing at a screen, for example through NFC tags or another method, all personalization functions that are available within AEM can be used, including personalization and optimization provided by Adobe Target. 

Basically, anything that can be done on a website can be done with AEM Screens as well.


Is it possible to integrate analytics with AEM Screens?

Yes. Actually, that is exactly what we will do, at least for the interactive screens, such as the Inspiration Desks (touch-screen enabled desks with travel tips and useful information about Switzerland and the SBB). 

For the Inspiration Desks, we want to learn how users use them. Which content is interesting to them, and which isn’t.

In addition, we want to display QR codes that bring travel tips to mobile phones. If someone then buys a ticket after reading a travel tip, we will see their first interaction on the Inspiration Desk.

Did you use AEM Content Fragment and/or Experience Fragments with AEM Screens?

Yes we did. We used AEM Content Fragments to configure centrally managed information blocks displayed on various screens.

It’s a great out-of-the-box mechanism offered by AEM Sites. Our team easily configured this mechanism for the purposes of AEM Screens.

As you can set up specific permissions on the Content Fragment feature, we allowed a few content editors to create fragments and edit content. 

The screens manager can then pick the content (fragment) they want to display on their screens.

Is it possible to use an API Gateway with AEM Screens?

We don’t use an API gateway, we use the APIs that Adobe Experience Manager provides.

AEM provides a lot of interfaces, and mostly we use REST APIs exchanging JSON objects, as all other systems are able to understand them. It’s easy to implement such an interface.

How to manage multi-language displays with AEM Screens?

As you are probably well aware, Switzerland has 4 official languages; 3 of them are used by SBB, together with English at airport train stations.

Basically, each station has its “home” language. All communication on said station shall be in its home language, and maybe a second language.

However, it’s not that easy to find out in which languages to communicate on each station.

Our solution is able to support all languages (de, en, it, fr). How many languages a message is submitted in doesn’t matter – we will rebroadcast the message in all languages available. 

The SBB employee in charge of service disruption information at the Traffic Control Center decides which languages to use. 

Have you used a CDN?

Yes, we have, but only because Adobe Experience Manager is already used for the website, and therefore has a CDN (Content Delivery Network) already. We can use already existing setups and workflows. 

If we had had to build the system from scratch, we probably wouldn’t have used a CDN, as all the consuming systems are basically located within the SBB network, and only a small amount of images are used. 

In fact, the Smart Information Displays use AEM Screens’ functionality to cache images and other content locally, so that they can provide content even if the connection to the network is lost (offline functionality). 

What’s the acceptable latency between someone making a change in the CMS and the corresponding displays being updated? 

Speed is essential, especially during a service disruption. Not only because everybody is waiting for information, but also as information can and will be updated quite often, for example when busses are available or other connections are re-routed. 

Once the SBB employee at the Traffic Control Center sends out a message, it’s processed in the CMS in less than a second.

The E-Panels, for example, poll the CMS every 20 seconds for a new message and need about 5 to 10 seconds to switch from advertisements to a disruption message.

All together it takes about 30 seconds from the publication of the message to its presentation on the screens. 

This is much faster than the current system, and in the near future, we might bring it down to a few seconds for other screens.
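
Conceptually, the polling on the E-Panel side boils down to something like this sketch – the endpoint, the message format and the channel-switching functions are purely illustrative, not the actual SBB implementation:

// Poll the CMS for disruption messages and switch the displayed channel accordingly.
interface DisruptionMessage {
  id: string;
  text: string;
  languages: string[];
}

// Hypothetical functions of the player that switch between channels.
declare function showDisruptionChannel(messages: DisruptionMessage[]): void;
declare function showAdvertisementChannel(): void;

async function pollDisruptions(): Promise<void> {
  const response = await fetch('/screens/api/disruptions.json'); // assumed endpoint
  const messages: DisruptionMessage[] = await response.json();

  if (messages.length > 0) {
    showDisruptionChannel(messages);   // switching takes roughly 5 to 10 seconds
  } else {
    showAdvertisementChannel();        // back to the regular advertisement loop
  }
}

// Poll every 20 seconds, as described above.
setInterval(() => { void pollDisruptions(); }, 20_000);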

Who is the main target audience for the digital signage solution considering that most Swiss people have smartphones with the SBB mobile app installed?

Most of the travellers have the SBB mobile app or at least a smartphone where they can get information from the website.

But imagine a larger train station full of people during a large service disruption. There needs to be information, and the information needs to be accurate, fast, and consistent. 

That’s what we’re working on. You’ll always be faster looking at a monitor than searching for your next train on your mobile phone. 

Furthermore, we plan to integrate the website and the mobile app as additional channels into the solution to ensure that the same information is served to all users.

For what kind of other use cases could AEM Screens be used?

There are many use cases imaginable – basically all use cases that involve a display, be it a touch-based interactive screen or a simple, passive display. 

Any information can be put onto a screen, be it advertisements, information about the operative situation of a railway network, or a timetable.

It gets interesting when one starts to combine sources: a (big) screen is still quite an investment, therefore being able to use it for multiple purposes – e.g. ads during idle times, a timetable during rush hours, and a service disruption message during disruptions – basically multiplies the value of the screen.

AEM Screens lets you configure different channels with specific content and allows you to manage these channels based on schedules or other rules, even manually, so that you are always in control of what is displayed on your screens.

If you want to know more, we’ll gladly tell you everything about AEM Screens and discuss your scenario. Don’t hesitate to contact us.

Michael Grob

Senior Consultant Digital Marketing

The post AEM Screens: Questions, Answers and Lessons Learned appeared first on One Inside.

by Samuel Schmitt at June 07, 2021 08:44 AM

June 03, 2021

Things on a content management system - Jörg Hoh

AEM micro-optimizations (part 3)

Welcome to my third post on AEM micro-optimizations. Again with some interesting ways how you can improve your AEM application performance, sometimes with little improvements, but sometimes with significant ones.

During some recent performance optimization I came across code, which felt a bit odd. Technically it was quite easy:

for (Item item : manyItems) {
  processSingleItem(resolver, item);
}

void processSingleItem(ResourceResolver resolver, Item item) {
  // do something with the resourceResolver
  resolver.commit();
}

That is indeed a very common pattern, especially in software, which evolved over time: You have code, which deals with a single item. And later, if you need to do it for multiple items, you execute this code in a loop. Works perfectly, and the pattern is widely used.

And it can be problematic.

If you have an operation in that processSingleItem() method which comes with some static overhead, maybe you are not aware of that overhead, so it goes unnoticed. Maybe you expect that if that processSingleItem() method takes 5 ms for one item, requiring 50 ms for 10 items is ok. Well, an O(n) algorithm isn’t too bad, is it?

But what if I tell you that the static overhead of that method is so large that providing 10 items as parameters instead of just one will increase its runtime not by a factor of 10, but only by a factor of 1.1?

Imagine you need to go grocery shopping for your Sunday dinner. You get yourself ready, take the bike to the grocery store, get the potatoes you need. Pay, and get back home. Drop the potatoes there. Then again, taking the bike to the grocery store, getting some meat. Back home. Again to the grocery store, this time for paprika (grilled paprika are delicious …). And so on and so on, until you have everything you need for your barbecue on Sunday. You have now spent 6 hours, mostly on the bike and waiting at the counter.

Are you doing that? No, of course not. You drive once to the grocery store, get all the things and pack them onto your bike, and get home. Takes maybe 90 minutes. Having the static overhead (cycling, waiting at the counter) just once saves a lot of it.

It’s the same in coding. You have static overhead (acquiring locks, getting database connections, network latency, calling through thick framework layers which just copy references to the data), which is not determined by the amount of data you process. But unlike in the example of grocery shopping, it’s not directly visible at which points there is such a static overhead, and unfortunately documentation rarely points that out.

Writing to the repository comes with such a static overhead; and it can be like a 20-minute ride to the grocery store. Saving 10 small batches definitely takes more time than saving once with a batch of 10 times the size. At least if you keep the size of the changeset limited; for details check this earlier posting of mine.

Check this great presentation by Georg Henzler at adaptTo() 2019 (starting at 17:00 min) (slides) for some benchmark data on how the size of the changeset influences the time to save (spoiler: for realistic sizes it does not really increase).

So I changed the above code to something like this:

for (Item item : manyItems) {
  processSingleItem(resolver, item);
}
resolver.commit();

void processSingleItem(ResourceResolver resolver, Item item) {
  // do something with the resourceResolver but no commit
}

Switching to this approach improved the performance for ~ 100 items by a factor of more than 10! And that’s an impressive number for such a minimal change.

So check your code for this specific coding pattern, find out if the parameters are good (that means small changes) and add some performance logging. And then convert to this batching mode and see what your numbers are doing.

Of course, very often this saving operates in the context of a much larger operation, and a 10 times improvement in this area will only speed up the larger operation from 12 seconds to 11 seconds. But hey, when you get this 1 second for almost free, just do it (and we are still talking about micro-optimizations). But nothing prevents you from taking a deeper look into what the system is doing in the remaining 11 seconds.

Leave me a comment if you have some interesting story to share, where such small changes resulted in big improvements.

by Jörg at June 03, 2021 02:41 PM

May 24, 2021

Things on a content management system - Jörg Hoh

AEM micro-optimization (part 2)

Micro optimizations are important, and their importance is described by a LWN posting about the linux kernel:

Most users are unlikely to notice any amazing speed improvements resulting from these changes. But they are an important part of the ongoing effort to optimize the kernel’s behavior wherever possible; a long list of changes like this is the reason why Linux performs as well as it does.

And this is not specific to the Linux kernel; you can apply the same strategy to every piece of software. The very same applies to AEM as a complex (and admittedly, sometimes really slow) beast.

There are a number of cases in AEM where you operate not only on single objects (pages, assets, resources, nodes), but apply the same operation to many of these objects.

The naive approach of just iterating over the list and executing the operation on each single element can be quite inefficient, especially if this operation comes with a static overhead.

Some examples:

  • For replication there are some pre-checks, then the creation of the package, the creation of the Sling jobs (or sending the package to the pipeline when running on AEM as a Cloud Service), the update of the replication status, and writing the audit log entries.
  • When determining the replication status of a page, the replication queues need to be checked to see if this page is still subject to a pending replication, which can get slow when the queues are full.
  • Committing changes to the JCR repository; there is a certain overhead in it (validating all changes, committing them to permanent storage, invoking the synchronous listeners, locking etc).

And in many cases these bottlenecks have been known for a while, and there is an API which allows you to perform the action in batch mode for a multitude of elements.

The ReplicationStatusProvider, for example, was introduced some years back when we had to deal with large workflow packages being replicated, which resulted in a lot of traversals of the replication queue entries. Adding this optimized version improved the performance by at least a factor of 10; so even in less intense operations I expect an improvement.

So if you have a hand-crafted loop to execute a certain activity on many elements, check if a more efficient batch API is available. There’s a good chance that it is already there.

If you have more cases where a batch mode should be available but isn’t, leave a comment here. I am happy to help either find the right API or potentially kickstart a product improvement.

by Jörg at May 24, 2021 02:28 PM

May 17, 2021

CQ5 Blog - Inside Solutions

The Headless CMS Adventure: More than a trend?

The CMS world goes through a revolution during which it reinvents itself every few years. 

Currently, the latest trend is the so-called ‘headless’ CMS, which completely separates the management of content from its presentation. 

As a result, a question arises from website owners and web developers: do we need a headless CMS?

Headless CMS offer many advantages compared to traditional CMS, but the added value is also dependent on the context, the number of channels, and what you are trying to achieve with your content.

In one of our projects, we followed a headless approach and learned a lot from it. 

Today, we would like to share our learnings with you and hopefully, this will help you decide which direction is best for your future web project.

What are the advantages of a headless CMS?

A headless CMS differs from a traditional one, often described as ‘monolithic’ CMS. 

In a headless fashion, the content and presentation layer of the website are completely separated.

Moreover, the presentation is outside of the responsibility of the CMS and must be implemented elsewhere.


The headless strategy does have advantages: the CMS only has to take care of the content, and its functionalities can be completely optimised for just that.

As the content does not contain any information about how it is to be presented, the same content can be used for several output channels – such as the website, a mobile app, social media and even print media at best. 

In contrast, a monolithic CMS takes care of both the creation and the presentation of the content in ‘pages’. The page structure ultimately corresponds to the navigation of the website. 

Depending on the architecture of the CMS, content may be placed directly on pages, which makes content reuse difficult on other channels. In other architectures, there may exist a certain logical decoupling between content and design.


How does Adobe Experience Manager manage content?

In Adobe Experience Manager (AEM), the CMS within the Adobe Experience Cloud, content is usually assembled directly by the authors into pages that then correspond (more or less) to the web pages 1:1, even if, technically speaking, the content management and the presentation of the pages are separated. 

Why am I telling you about AEM? Pretty simple.

Firstly, because One Inside is a certified Adobe Partner and our teams are experts in the implementation of AEM projects.

Secondly, because it was the CMS used for the project (as in almost all of our projects).

Now let’s talk about this project.

The project: going headless at any price

The goal of the project was to display content (already available on the website) in another channel: a digital kiosk.

A digital kiosk is a ‘vending machine’ consisting of a computer and a touch screen. The users of the digital kiosk can get informed about the products offered in a self-service way.

Our team joined the project relatively late, after the technical solution had already been designed.

The kiosk leveraged Angular to display information to its users.

The (front-end) application of the kiosk was already halfway implemented. The architectural decision had been taken as well: the kiosk should store all content locally, on the device. 

The way the solution was designed, it was not possible to benefit directly from the main website created in AEM. Otherwise, it would have been possible to create a version of the website adapted to the kiosk with relatively little effort.

Instead, REST APIs had to be used to provide content to the kiosk. A library of JSON objects had to be defined for the content and the configuration of the kiosk. Thanks to the openness of AEM and its direct support of REST and JSON, this could be implemented with reasonable effort.

Is headless always best?

During the project, another advantage of separating content and presentation was highlighted: the frontend team and the backend team could work independently from each other.

The backend team was in charge of managing the content in AEM while the frontend team took care of an Angular app for the content presentation.

Each team can 

  • concentrate fully on their area of expertise, 
  • work with their usual technologies and frameworks, 
  • and hardly has to consider the other party.

A weekly coordination meeting plus an interface definition (properly documented) is sufficient for the team’s collaboration.

(Architecture diagram: headless CMS architecture)

The architecture diagram above shows that the headless approach is justified for this use case: in the course of the project, other screens and devices will be connected, each device via its own interface. 

These can be implemented in a short time with little consultation between the teams – and new channels or devices can be added at any time. 

Welcome to the wonderful world of headless CMS!

Now all existing monolithic CMS can be replaced!

Or maybe not. 

With the first mentioned device, the kiosk, we could have saved a lot of time in our project if we had simply implemented a slimmed-down website directly in AEM. 

Headless was an overhead and introduced unnecessary complexity.

A lot of functionality had to be recreated using different technologies, even though it would already have been available.

AEM’s CMS is powerful and offers the possibilities to create a light version of a website, for example based on Experience Fragments.

Too much effort for simple websites

Especially for websites or web-related channels, the headless approach might not be a better solution than a traditional approach. 

While the backend with the content repository is provided by the CMS, the frontend is created completely independently (with Angular, React or a similar framework).

On the front-end side, everything has to be developed from scratch. Every component, every page template, including the entire navigation. 

This is an additional effort that should not be underestimated compared to design adaptations of the existing website with AEM and Core Components.

Think content first

What is overlooked is what I believe to be the biggest problem with the headless approach: how does content fit into the website, or rather how is a page structure defined?

On a highly structured site such as a news portal, where articles are assigned to categories (e.g. politics, sports) and displayed by topics, this can be done automatically, and similar solutions can be found for e-commerce.

In such cases, when the navigation can be automated, a headless approach might be appropriate. 

For an average corporate website where content is displayed in a semi-structured hierarchy, the frontend has to request the right content.

This can go as far as the whole navigation having to be created in the frontend, making it extremely difficult to support it from the backend.

All functions for structuring content, which are taken for granted in a conventional content management system, must be created in the frontend with great effort, if at all possible. 

In our example with the kiosk, this problem had to be solved in such a way that the entire content structure per device is maintained in the CMS and then made available to the respective device as a JSON object.

The kiosk then procures the content itself and ensures the appropriate presentation. 
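
To illustrate that setup, the kiosk-side consumption could look roughly like the following sketch; the structure of the configuration object and the paths are invented for the example:

// The kiosk loads its per-device content structure as a single JSON object from the CMS
// and builds its navigation from it. All field names are hypothetical.
interface KioskConfig {
  deviceId: string;
  navigation: Array<{ label: string; contentPath: string }>;
}

async function loadKioskConfig(deviceId: string): Promise<KioskConfig> {
  const response = await fetch(`/content/kiosk/config/${deviceId}.json`); // assumed path
  if (!response.ok) {
    // Even small errors in the configuration surface here, as described below.
    throw new Error(`Invalid kiosk configuration for ${deviceId}: ${response.status}`);
  }
  return response.json();
}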

In the long run, the coordination between frontend and backend becomes problematic: even the smallest errors in the JSON configuration file lead to errors in the frontend. 

The advantages of a traditional CMS outweigh the headless approach in this case.

Hybrid CMS: the best of both worlds

Fortunately, Adobe has not stopped at the monolithic approach, and has shown why Adobe Experience Manager is considered one of the most innovative products on the market.

Content Services and related technologies such as Content Fragments and Experience Fragments allow content management and presentation to be separated. Plus, all content can also be retrieved as JSON objects via REST APIs. 

This approach, known as ‘hybrid CMS’, offers a headless CMS in addition to the traditional, page-based CMS optimised for websites. 

Content prepared for the website can be made available to other systems such as displays, kiosks, apps or the Internet of Things (IoT) via the corresponding APIs.

The question is no longer whether headless or not, but only: when do I go for headless and when for the traditional approach? 

Making the right choice can decide the success of a project. 

We are happy to help you choose the optimal content management strategy for your project and show you an efficient, cost-effective way to succeed. 

Talk to us.

Michael Grob

Senior Consultant Digital Marketing

The post The Headless CMS Adventure: More than a trend? appeared first on One Inside.

by Samuel Schmitt at May 17, 2021 09:51 AM

May 11, 2021

CQ5 Blog - Inside Solutions

Introducing a step-by-step approach to build chatbots for enterprise

Introducing a step-by-step approach to build chatbots for enterprise

Chatbots and other conversational marketing solutions have been around for a while now. Most of us are used to seeing little chat boxes at the bottom of a website.

But how do they work? 

How can enterprises tackle these projects and offer a conversational channel to their customers?

How do you leverage the power of AI and natural language processing (NLP) to build proper dialog and conversations within a chatbot solution? 

After discussing and interviewing customers and prospects, we came to the conclusion that the project’s steps are not always obvious. Chatbot projects are a rather new challenge compared to website projects.  

Today, we want to share our knowledge in a complete guide on building a chatbot for enterprise. The whitepaper details our methodology in five steps. It explains the challenges, risks, and shares best practices.

What you will find in the Chatbot Journey whitepaper

The whitepaper has been created by our chatbot experts and consultants. They explain the steps to guide you towards your first conversational experience: from ideation to implementation.

We have highlighted the tasks to be done as well as the effort and team members required.

To make things easier, we created a five-step plan to successfully run chatbot projects and to share insights about tasks, required skills and challenges:

  • Step 1 – From idea to roadmap: make a list of ideas and define business cases
  • Step 2 – Turning a roadmap into a plan: bring all the experts together and define the solution end to end
  • Step 3 – Designing the chatbot and conversations: build your ideas into the chatbot
  • Step 4 – From training to go-live
  • Step 5 – Scale and optimize the chatbot experience: enhance the conversation
chatbot project plan

Our approach was designed with the requirements and challenges of enterprises in mind. 

You should read this whitepaper if you plan to create a chatbot for your enterprise and have a key role in your organization, such as:

  • Marketing Manager: you are in charge of digital marketing activities and customer experience. Your goal is to build a new channel to engage with your customers.
  • Digital Project Manager: you are in charge of your organization’s digital projects and a chatbot is a brand new channel that you have to tackle soon.
  • Executive: you might be heading marketing or the IT department and oversee the digital transformation at your company. It’s important for you to understand the challenges and risks that come with such a project.

Do you want to read the whitepaper?

You can check out the brand new Chatbot Journey Whitepaper right here:

Get started with our chatbot for enterprise guide

Read our complete Guide to Enterprise Chatbots

If you have additional questions about our methodology, our experts are here to answer all your questions.

Clemens Blumer

Senior Software Architect

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about Conversational Marketing and Chatbot.

The post Introducing a step-by-step approach to build chatbots for enterprise appeared first on One Inside.

by Samuel Schmitt at May 11, 2021 08:02 AM

May 04, 2021

CQ5 Blog - Inside Solutions

Front-end performance: How we optimized our website

Front-end performance: How we optimized our website (from the living style guide)

The front-end is the first place where you meet your users. But if your website takes too long to load, they won’t wait around.

No one is going to visit and appreciate a slow website. Not your customers, and especially not Google.  

The page experience and overall performance of your website, such as page speed, are important ranking factors for search engines. Better performance means more chances to be found by your audience.

Optimizing your website for speed is very important to improve customer experience. 

We’ve gathered our expertise on front-end and website optimization and are now sharing how we actually improved our own website performance from the living style guide.

We hope it will give you hints on how to tackle your own website performance review and update.

What is front-end optimization?

Front-end optimization is the process of fine-tuning the HTML, CSS and Javascript of your website to make it faster to load by a web browser.

A key parameter to consider is the user experience, and how fast the main content of the page is rendered.

Nobody likes to see a blank screen for seconds. 

As a website owner, you have to understand that website loading time is a critical ranking factor for Google and is measured as one of the key metrics of the Core Web Vitals: Largest Contentful Paint (LCP).

At One Inside, we take care of our customers and want to offer them a great experience when they visit our website as well. 

Our approach to front-end development is to build a front-end living style guide. This is how we can fix front-end issues fast and in a streamlined way as well. 

Before telling you how we improved our website, we need to introduce the concept of a living style guide – as it is where the improvement will take place.

What is a front-end living style guide?

The front-end living style guide is a website that defines the corporate design and user interface for your company websites or other web or mobile applications you may use. 

Each component required to build your website such as the navigation, teasers, carousel, and banner, are displayed individually. 

It helps the designer and developer understand how the website or mobile application will look, so they can focus on one single element at a time.

At One Inside, we built a living style guide for our corporate website and have also introduced this front-end best practice to customer projects.

What is a front-end living style guide

Below we’ll show you how we improved our website by optimizing the living style guide.

Use the living style guide to improve page performance

As the name implies, the living style guide’s main purpose is to “guide” your design. It guarantees consistency among all the components of your website and for different devices.

The living style guide for our website was used as expected and drove the UI implementation. However, it had one issue: a bad score with Google Lighthouse.

(Google Lighthouse is available via the Chrome developer console and generates reports about a web page’s speed and performance.)

The living style guide was considered slow to load and slowed down interactions. The user had to wait to see the page and to be able to click on buttons and icons.

If the living style guide is slow, the resulting website will suffer from the same issue.

Analysis of the frontend code

The very first thing to do is to understand how frontend code is built.

The technical pattern of the living style guide was the following:

  • All frontend components were compiled into one application (one large JS file)
  • All CSS was compiled into one large CSS file

This pattern used to be applied everywhere and is optimized for the HTTP1 protocol.

The recommendations for the HTTP1 protocol are to minimize the number of requests to the server and to load a few large files instead of many small ones.

Why? 

Because establishing a new request (a connection to the server) can be time-consuming (independently of the size of the file), and browsers only allow a limited number of simultaneous requests.

This is the reason so many web designs encapsulated their Javascript and CSS in one single file.

This approach has some advantages:

  • A small number of requests to the server
  • Servers get less load and run faster
  • Better compression ratio (the compression is better on one big file than multiple individual small files)

The disadvantages are:

  • Slow to download and decompress by the browser
  • Slow to transform into usable information (technically: parsing and compiling JavaScript).
  • The browser is busy until all of this data has been processed into usable information; meanwhile the main thread of the browser is blocked (see resources: the cost of JavaScript). This may not be a big problem on fast desktop computers, but it can be on mobile devices.
  • This also means interaction on the page is blocked, because the browser doesn’t yet know which elements are interactive.
  • Rendering is blocked until the stylesheet has finished loading, for every element of the page.

The question we asked ourselves was: is there a way to get away from this current pattern?

Looking at the improvement brought by the new HTTP2 protocol gave us a solution.

HTTP2 to help the frontend developer

Today, modern servers provide the new version of the HTTP Protocol (HTTP2) and web browsers are using multithreaded and multicore capabilities of modern computers and mobile devices more and more. 

This means we can process even more data simultaneously.

The HTTP2 protocol comes with great improvements:

  • The protocol can accept multiple simultaneous requests by re-using the existing connection.
  • It doesn’t need to negotiate and re-create a connection for each request, so that lost time disappears.
  • It can send many files (from many requests) in one answer (multiplexing).
  • And it offers better compression (30% better).

Thanks to HTTP2, frontend development recommendations got an update and today’s best practice is to split every file into small pieces.

Here we load the needed components of a page simultaneously.

It comes with many advantages for the web browser:

  • Javascript components are loaded only when needed by the client.
  • They are rendered as soon as they are available, without having to wait for the other files.
  • Each UI component is interactive and has its definitive design and size.

On top of this, HTTP2 also supports a “push” function.

The server can send a file that is not yet requested but is known to be necessary. This is done by using “preload” in HTML.

How we boosted website performance

To improve the website we applied several changes to the living style guide such as:

  • Splitting each Javascript and CSS into smaller files
  • Loading critical files first
  • Changing the code pattern

Let’s have a detailed look.

Splitting into small components   

Our improved frontend is now divided into many components. 

Each component is a small part of the page with a specific functionality: displaying the image gallery, the main menu, the main content, and so on.

Each component can be simple (just a static template) or more complicated, with interactions and JavaScript: reacting to clicks, managing the state of an element (displayed or hidden), loading and preparing a new element (a new slide or image in a gallery).

Critical mandatory files first

On every website, there are a few Javascript and CSS files which are absolutely mandatory to allow the browser to start displaying something. 

These usually correspond to the first parts that we see: the header, the global styling, the main menu, and approximately the first third of the screen.

In order to display these parts almost instantaneously, for an astonishing “perceived speed” and best user-experience, we need to provide them as fast as we can.

To do so, we need to extract this code into small files which will load faster.

As soon as the browser receives them, it can start to render the page without waiting for other parts of the page to load. 

For the user, the page displays and the loading seems finished.

From one code to multiple codes

I highly encourage you to read the excellent article the cost of Javascript. These are the main takeaways:

When the browser receives a JavaScript file,

  • it needs to extract it (decompress it),
  • parse the file as code and check that it conforms (no syntax errors, etc.),
  • compile it for the machine (transform it into something that the computer can execute),
  • and check which elements of the webpage it must be applied to (e.g. when the user clicks on a button or an icon).

If the entire webpage is managed by one JavaScript file (like one big global component), the browser will take a lot of time for each step of the process.

During this time, the main thread of the browser will be busy and can’t do anything else. The browser is blocked and nothing happens. 

The display is also suspended and clicking on items does nothing. The webpage is not yet interactive. The user can’t interact with it.

By going away from this pattern (and paradigm), we can take advantage of the modern architecture of browsers and of our modern computers and mobiles: really efficient multitasking.

The solution is to split everything into smaller pieces of code or components. 

In the end, the result is the same: the browser has processed the same amount of code, the same amount of bytes has been downloaded, but the browser did it by processing one small piece of code at a time.

As soon as one piece is processed, this part of the webpage can go live, be displayed, rendered and made interactive, while other parts are still loading, or parsing, or compiling.

Asynchronous loading

Async or defer? This is the question.

The existing solution to load a script and to not block the main thread is to specify it as asynchronous. 

When we include it in the HTML page, we add the async attribute. 

Setting scripts as async simply tells the browser that they are not blocking, so it doesn’t need to wait for them and can go on.

The scripts will be started as soon as they are received (without any notion of order).

An older defer attribute exists too: it keeps the order of the scripts, even if the files are received in a different order.

For example: 

  • A first script is big, a second script is small. Both are set with defer. 
  • The browser will first receive the second file, as it is smaller.
  • But it will wait for the first script, and start the first before starting the second.

Javascript loaded on demand

Async & Defer are already a good solution, but in our case, it was a bit more complicated.

The components (and so, the composition of the page) can be freely chosen by the editor of the webpage in the CMS (content management system). 

That means that we, as frontend developers, don’t know which components (and their corresponding JavaScript files) will be included and loaded on the page.

Of course, we don’t want to load everything, just in case it may be chosen to be used on that page. Otherwise, we are back to the monolithic, one global component pattern.

This is exactly what we want to avoid.

Our solution is to add a specific attribute in the HTML of each component, telling the page that this part is not a simple HTML, but a specific component.

As we have seen, we compiled each component (with Webpack) as an independent JavaScript file.

A small mandatory JavaScript is loaded first: it checks all the HTML elements to detect if a component is present on the page, and then launches a request to load its corresponding JavaScript file. Once done, it initializes the component to make it interactive.

About the styles

We separate the global mandatory stylesheets that are meant to be used on all pages: header, teaser, navigation-menus and more, and compile them into a separate CSS-file. 

This is quick to load and the browser can start to render the DOM (styles and sizes) without waiting for the rest of the page.

Subsequently, the page can be rendered and viewed, even if other chunks are still loading at this moment.

By just changing this pattern, the Google performance score increased to over 90/100!

Further improvements

If we wanted to achieve a better score with our website, here’s what we could do additionally:

  • Deliver mobile styles only when the browser runs on a mobile device; this means these styles must be delivered globally for the website and not within the payload of each component. This is difficult to manage in a project.
  • Deliver desktop styles only on desktop devices.
  • Keep global styles in one small global file (not duplicated in the critical CSS AND in each component).

What is this all about? To summarize front-end performance optimization in one sentence: load only what is needed and when it is needed. 

You will see drastic changes in your website performance and a direct positive impact on the customer experience.

And in the end, this is what we all want: happy customers.

Sébastien Closs

Senior Software Engineer Frontend

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about website optimization.

The post Front-end performance: How we optimized our website appeared first on One Inside.

by Samuel Schmitt at May 04, 2021 08:27 AM

May 03, 2021

Things on a content management system - Jörg Hoh

AEM micro-optimization (part 1)

As a follow-up to the previous article I want to show you what a micro-optimization can look like. My colleague Miroslav Smiljanic found that there is a significant difference in the time it takes to compute statements (1) and (2).

Node node = …
Session session = node.getSession();
String parentPath = node.getParent().getPath();

Node p1 = node.getParent(); // (1)
Node p2 = session.getNode(parentPath); // (2)

assertEquals(p1,p2);

He did the whole writeup in the context of a suggested improvement in Sling, and proved it with impressive numbers.

Is this change important? Just by itself it is not, because going up the resource/node tree is not that common compared to going down the tree. So replacing a single call might only yield an improvement of a fraction of a millisecond, even if case (2) is up to 200 times faster than (1)!

But if we can replace getParent() with the more performant variant in all places where it is used, especially in the low-level areas of AEM and Sling, all areas might benefit from it. And then we don’t execute it only once per page rendering, but maybe a hundred times. And then we might end up with tens of milliseconds of improvement already, for any request!

And in special usecases the effect might be even higher (for example if your code is constantly traversing the tree upwards).

Another example of such a micro-optimization, which is normally quite insignificant but can yield huge benefits in special cases, can be found in SLING-10269, where I found that a built-in caching of the isResourceType() results reduces the rendering times of some special requests by 50%, because it is called thousands of times.
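The actual change sits deep inside the Sling resource resolving code; purely to illustrate the pattern, a hedged sketch of memoizing such a repeated check (class name and key layout are invented, this is not the Sling implementation) could look like this:

import java.util.HashMap;
import java.util.Map;

import org.apache.sling.api.resource.Resource;

// Invented illustration of the idea behind such a cache -- not the actual Sling code.
// The result of an expensive check that is called thousands of times with the same
// arguments during a single request is memoized; the plain HashMap implies that the
// cache lives within one request/thread only.
public class ResourceTypeCheckCache {

    private final Map<String, Boolean> cache = new HashMap<>();

    public boolean isResourceType(Resource resource, String resourceType) {
        String key = resource.getPath() + "|" + resourceType;
        return cache.computeIfAbsent(key,
                k -> resource.getResourceResolver().isResourceType(resource, resourceType));
    }
}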

Typically micro-optimizations have these properties:

  • In the general case the improvement is barely visible (< 1% improvement of performance)
  • In edge cases they can be a life saver, because they reduce execution time by a much larger percentage.

These improvements accumulate over time, and that’s where it gets interesting. When you have implemented 10 of these in low-level routines, the chances are high that your usecase benefits from them as well. Maybe by 10 times 0.5% performance improvement, but maybe also by 20%, because you hit the sweet spot of one of these.

So it is definitely worth paying attention to these improvements.

My recommendation for you: Read the entry in the Oak “Do’s and Don’ts” page and try to implement this learning in your codebase. And if you find more such cases in the Sling codebase, the community appreciates a ticket.

(Photo by KAL VISUALS on Unsplash)

by Jörg at May 03, 2021 08:01 AM

April 28, 2021

CQ5 Blog - Inside Solutions

Adobe Summit 2021: Personalization drives Business Growth

Personalization drives Business Growth

Adobe Summit 2021

Again, it is this time of the year: It is time for Adobe Summit. 

Unfortunately, the Summit is online. Fortunately, it is free for everybody to watch the 458 sessions. 

Nevertheless, Summit started with a keynote hosted by Adobe President and CEO Shantanu Narayen, which held quite a few surprises.

Instead of his living room like last year, Mr Narayen welcomed the audience from the lobby of an Adobe corporate building. 

The style reminded us of Apple keynotes, but Narayen did not take us on a walk through Apple Park but came directly to the topic.

Of course, the pandemic was responsible for one of the biggest disruptions corporations faced: the fast move to digital.

Digital experiences shape how we live, learn and play, Shantanu said and added that there is no going back: Every business has to be a digital business. 

E-Commerce made a huge step forward. In 2020, 844 Billion Dollars were spent through e-commerce, almost twice as much as in the year before.

Therefore, it is simple to understand that the new winners in this digital economy are companies that can drive business growth with personalization because the digital economy runs on customer connections. 

Exciting innovations at the Adobe Summit 2021

Anil Chakravarthy took over to present exciting technology innovations in the Experience Cloud after reiterating that every business had to move to digital, fast. And Adobe can help. 

Adobe has recently acquired Workfront, a marketing planning and work management solution (resulting in a new product called Adobe Workfront).

Another significant change is the declining relevance of third party cookies: First-party data has never been more relevant. 

This fact is the motivation behind many smaller and bigger innovations in all solutions of the Adobe Experience Cloud: 

Adobe Experience Platform

There are several improvements to the Adobe Experience Platform, including 

  • AEP Collection Enterprise, a new client-side tagging mechanism (and the new name/replacement of AEP Launch); 
  • AEP Segment Match that allows multiple companies to share their data while staying compliant with the data protection laws; 
  • and the announcement of the beta release of the real-time CDP – for B2B.

Adobe Journey Optimizer

The biggest announcement is a new Adobe Journey Optimizer solution that looks like a child of Adobe Campaign and Adobe Analytics.

Adobe Journey Optimizer is a new implementation based on the Adobe Experience Platform that allows personalized experiences across the entire customer journey (including multiple channels) based on real-time CDP data – all in a single application. 

We will give more details about Adobe Journey Optimizer as soon as they become available. 

AEM as a Cloud Service

Some of the newly announced features of Adobe Experience Manager (AEM) are already available in AEM as a Cloud Service:

  • The possibility to use AEM as a headless CMS based on the API-first Content Services, support for GraphQL and content optimization functionality based on Adobe Sensei, Adobe’s AI. 
  • AEM Forms is now available as a Cloud Service, with AEM Screens coming to the cloud in the foreseeable future.
  • Furthermore, AEM Assets Essentials is a new service available in all of Adobe’s solutions, based on AEM Assets, of course.

The new features will be available in the on-premise variant of the CMS later on. So if you cannot wait, it’s time to move your AEM to the Cloud.

What are the main benefits of Adobe Experience Manager as a Cloud Service?

Adobe Commerce

Last but not least, significant advances have been made in Adobe Commerce (a.k.a. Magento): 

  • Intelligent visual recommendations recommend products based on images; 
  • Sensei-based new intelligent Live Search
  • And the integration with Adobe Sign that lets shoppers digitally sign their orders. 

Similar to AEM, Adobe Commerce now supports headless e-commerce based on an API-first approach using GraphQL.

These new functions reflect the fact that digital is “the new normal” – our lives have changed dramatically, and we as marketers have to take advantage of it, whether we want to or not. Now, every day is Black Friday.

Inspiration from Leaders

Adobe would not be Adobe if they did not give a lot of room to business experts and thought leaders to provide insights and inspiration that are not primarily product related.

Adobe’s CEO interviewed one of the most important CEOs at the moment, Pfizer’s Albert Bourla.

Bourla said that Pfizer would not have created the COVID vaccine if they were not prepared: They started in May 2019 – half a year before the pandemic outbreak – with a radical digital transformation: One can only transform a company if one transforms the culture. 

By using digital analytics and big data, a study that would typically take 18 months could be finished within 18 hours – this speed was the key to invent the vaccine: “It is the culture that invented the vaccine.”

Adobe summit keynote speaker

Deborah Wahl, CMO of General Motors, laid out a strategy to phase out gas and diesel engines by 2035 and to be carbon neutral by 2040.

This shall be done by a combination of Ultium, the new battery-chassis concept that shall become the foundation of 30 or more different car models across all of the company’s brands; and VIP, their concept for personalized objects that shall bring a completely new experience to car ownership.

GM relies heavily on the Adobe Experience Cloud to provide an omnichannel customer experience. It includes AI, Machine Learning and GM’s dealers – independent franchisees – in the marketing concept, as these dealers know their community way better than GM does. 

FedEx has digital in its DNA, being the inventor of tracking and tracing packages in 1978, and today scanning each of the more than 20 million packages shipped each day 20 to 30 times.

They had to become a digital company more than a logistic company and had to take advantage of what they call “logistic intelligence”. 

Not only had FedEx scaled their business from a 5-day network to a 7-day network, but the e-commerce boom of last year put a heavy load on the more than 600’000 employees. 

In addition, FedEx is responsible for transporting COVID vaccines – an IoT device is packaged into each vaccine package, which allows for a 99.9% accuracy in worldwide delivery. 

Adobe and FedEx announced a partnership that shall result in “logistic intelligence” functionality for small and medium enterprises that use Adobe Commerce. 

And now we’re back to digital marketing, which – according to Adobe – consists equally of science and art. 

And if you want to know more about Adobe’s current and new offerings, let us know. 

Michael Grob

Senior Consultant Digital Marketing

Would you like to receive the next article?

Subscribe to our newsletter and we will send you the next article about Adobe Experience Manager.

The post Adobe Summit 2021: Personalization drives Business Growth appeared first on One Inside.

by Samuel Schmitt at April 28, 2021 08:41 AM

April 12, 2021

Things on a content management system - Jörg Hoh

The effect of micro-optimizations

Optimizing software for speed is a delicate topic. Often you hear the saying “Make it work, make it right, make it fast”, implying performance optimization should be the last step you should do when you code. Which is true to a very large extent.

But in many cases you are happy if your budget allows you to get to the “make it right” phase, and you rarely get the chance to kick off a decent performance optimization phase. That situation is true in many areas of the software industry, and performance optimization is often only done when absolutely necessary. Which is unfortunate, because it leaves us with a lot of software that has performance problems. And in many cases a large part of the problem could be avoided if only a few optimizations were done (at the right spot, of course).

But this whole notion of a “performance improvement phase” assumes that it requires huge effort to make software more performant. In general that is true, but there are typically a number of actions which can be implemented quite easily and which can be beneficial. Of course these rarely boost your overall application performance by 50%; most often they just speed up certain operations. But depending on how frequently these operations are called, they can add up to a substantial improvement.

I once did a performance tuning session on an AEM publish instance to improve the raw page rendering performance of an application. The goal was to squeeze more page responses out of the given hardware. Using a performance test and a profiler I found that the creation of JCR sessions and Sling ResourceResolvers took 1-2 milliseconds, which was worth investigating. Armed with this knowledge I combed through the codebase, reviewed all cases where a new session is created and removed all cases where it was not necessary. This was really a micro-optimization, because I focused on tiny pieces of the code (not even the areas which are called many times), and the regular page rendering (on a developer machine) did not improve at all. But in production this optimization turned out to help a lot, because it allowed us to deliver 20% more pages per second out of the publish instances at peak.
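To make the anti-pattern concrete, here is a hedged sketch (servlet and resource type names are invented) of the kind of change such a review typically produces: reuse the resolver that already comes with the request instead of opening a dedicated one.

import javax.servlet.Servlet;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.service.component.annotations.Component;

// Invented example: instead of asking a ResourceResolverFactory (and thereby opening a
// new JCR session) on every request, reuse the resolver the request already carries.
@Component(service = Servlet.class, property = {
        "sling.servlet.resourceTypes=myproject/components/example",
        "sling.servlet.methods=GET"
})
public class ExampleServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) {
        // the request already owns a resolver (and session) with the right permissions
        ResourceResolver resolver = request.getResourceResolver();
        Resource content = resolver.getResource(request.getResource(), "jcr:content");
        // ... render something based on 'content' ...
    }
}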

In this case I spent quite some time to come to the conclusion that opening sessions can be expensive under load. But now I know that, and I spread that knowledge via code reviews and blog posts.

Most often you don’t see the negative effect of these anti-patterns (unless you overdo it and every Sling Model opens a new ResourceResolver), and therefore the positive effects of applying these micro-optimizations are not immediately visible. And in the end, applying 10 micro-optimizations with a ~1% speedup each adds up to a pretty nice number.

And of course: If you can apply such a micro-optimization in a codepath which is heavily used, the effects can be even larger!

So my recommendation to you: If you come across such a piece of code, optimize it. Even if you cannot quantify and measure the immediate performance benefit of it, do it.

Same as:

for (int i = 0; i <= 100; i++) {
  othernumber += i;
}

I cannot quantify the improvement, but I know, that

othernumber += 5050;

is faster than the loop, no questions asked. (Although that’s a bad example, because hopefully the compiler would do it for me.)

In the upcoming blog posts I want to show you a few cases of such micro-optimizations in AEM, which I personally used with good success. Stay tuned.

(Photo by Michael Longmire on Unsplash)

by Jörg at April 12, 2021 08:24 AM

March 27, 2021

Things on a content management system - Jörg Hoh

Writing integration tests for AEM, part 5

This a part of my ongoing series about writing integration tests with AEM.

Integration tests help you to keep control
Photo by Chris Leipelt on Unsplash

Writing tests seems to be a recurring topic 🙂 This week I wrote some integration tests which included one of the most important workflows in AEM: the activation of pages. Up to now I haven’t blogged about handling both author and publish in an integration test, so I will show you how to do it.

So let’s assume that you want to do some product testing and validate that replication is working and also writes correct audit log entries. This should be covered with an integration test. You can find the complete sourcecode in the ActivatePageIT at the integrationtests github project.

Before we dig into the code itself, a small hint for the development phase of tests: If you want to execute only a single integration test, you can instruct Maven to do this with the parameter “-Dit.test=<Name of the testclass>”. So in our case the complete Maven command line looks like this:

mvn clean install -Peaas-local -Dit.test=ActivatePageIT -Dit.author.url=http://localhost:4502

(assuming that you don’t run your AEM author on the same port as I do … if you want to change that, modify the parameters in the pom.xml).

On the coding side, the approach follows that of every integration test: we need to get the correct clients first.

As we want to use replication, we use a ReplicationClient, which is provided by the testing client library.

Next we define a custom Page class, which allows us to define the parentPath.

Then the actual test case is straight forward.

I used some more features of the testing clients to just test the existence or absence of the page, plus the doGetJson() method to get the JSON representation of the pages (in the getAuditEntries() method).

So, writing integration tests with this tooling at hand is easy and actually fun. Especially if the test code is as straightforward to implement as here.

by Jörg at March 27, 2021 04:12 PM

March 08, 2021

Things on a content management system - Jörg Hoh

AEM as a Cloud Service and the handling of binaries

When you are a long-time user of AEM 6.x (and even CQ5), you are probably familiar with the Asset Update workflow. Its primary task is the extraction of metadata from the binary asset and the creation of (smaller) renditions for it. This workflow is normally executed on the AEM authoring instance.

“Never underestimate the bandwidth …!” (symbolic photo)
Photo by Massimo Botturi on Unsplash

But since the beginning, this approach has been plagued with problems:

  • The question of supported filetypes. Given the almost unlimited amount of file formats and their often proprietary implementation, it’s not always possible to perform these operations. In many cases, the support of these file types within Java is poor.
  • Additionally, depending on the size and the type of the asset and the quality of the library which provides support for this filetype, the processing can be very time consuming and also consume a lot of heap. Imagine that you want to create renditions of a TIFF file with dimensions of 10k * 10k pixels; assuming a 24-bit color depth, this requires 300 megabytes of contiguous heap to store an uncompressed version of it. You have to size the heap accordingly, otherwise you will run out of memory (OOM).
  • To avoid these issues, external tools like imagemagick were used for many filetypes. They come with support for various image types (in many cases much better than the Java image library), plus the ability not to blow up the AEM process when the processing fails (because imagemagick runs in a dedicated process). But the capabilities of imagemagick are also limited, and the support for more exotic (non-image) file types could be better.
  • In all cases you need to size your hardware for a worst case scenario. For example you need to provision a lot of heap, if your authors might start to ingest large images. And you need to provision enough CPU to mitigate negative impacts on all other operations.
  • Another big problem is the latency. Assuming that your asset is very large (it’s not uncommon to have assets larger than 1 gigabyte), it takes time to copy the binary from the (remote) datastore to a location where the processing takes place. Even if you can transfer 100 MiB per second, it takes 10 seconds to have the file transferred to the local disk; normally this process runs through the AEM JVM, which is problematic in terms of heap usage and can also cause performance problems. Not to mention code which is not aware of the possible sizes and tries to load the complete stream into memory.

In AEM as a Cloud Service this is offloaded, and that’s what AssetCompute is for. It performs all these steps on its own; also not using imagemagick for image handling, but high quality and optimized routines which also power other Adobe products.

But what does that mean for you as developer for AEM as a Cloud Service? In the first place, it does not have any impact. But you should learn a few things from it:

  • Do not create any renditions on your own, use AssetCompute instead. This service is extensible (check out Project Firefly), so you can do all kinds of asset operations there. There is no need anymore to use the Java image library code.
  • Avoid streaming binary data through AEM. AEM as a Cloud Service itself (the JVM) should not be bothered with streaming binary data into and out of the JVM. If you want to upload files into AEM, you should use the aem-upload library.

In general, think twice before you open an InputStream in AEM (either via Rendition.getStream() or also via the JCR API). Normally you never know how much data is behind it, and for almost all transformation cases it makes sense to use AssetCompute to perform these.
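If you do have a legitimate reason to read a binary yourself, at least keep the stream handling tight; a hedged sketch (error and null handling as well as the actual processing are left out) could look like this:

import java.io.IOException;
import java.io.InputStream;

import org.apache.sling.api.resource.Resource;

import com.day.cq.dam.api.Asset;
import com.day.cq.dam.api.Rendition;

// Sketch: if a rendition really has to be read, process it in chunks and let
// try-with-resources close the stream -- never buffer the whole binary in memory.
public class RenditionReader {

    public void process(Resource assetResource) throws IOException {
        Asset asset = assetResource.adaptTo(Asset.class);
        Rendition original = asset.getOriginal();
        try (InputStream in = original.getStream()) {
            byte[] buffer = new byte[8192];
            for (int read = in.read(buffer); read != -1; read = in.read(buffer)) {
                // hand the chunk over to wherever it needs to go, e.g. an output stream
            }
        }
    }
}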

by Jörg at March 08, 2021 11:16 AM

March 02, 2021

Things on a content management system - Jörg Hoh

META: domain switch

After some 12 years I finally switched over the domain name of this blog to something which is more closely attached to me. Don’t be surprised if you end up on “cqdump.joerghoh.de”. But of course the old domain name will continue to work, and I don’t plan to remove it.

by Jörg at March 02, 2021 07:43 AM

February 08, 2021

Things on a content management system - Jörg Hoh

CRX DE driven development

A recurring problem I see in AEM project implementations is the problem of missing abstraction. A lot of code passes around resources, ValueMaps and even Strings (paths). And because we are supposed to build software the proper way, the called method checks (or more often: does not check) that the provided resource parameter is not null, and that the resource is of the correct type.

But instead of dealing with resources, the class names and comments suggest that the code is actually dealing with products. Or website structures. Or assets. But instead of using a “product” class (or a website class, or the provided asset class), resources are still used. The abstraction is missing!

For me the root cause of this problem is CRXDE Lite. Exactly that thing which you can open on your local AEM instance at /crx/de/. Because it shows you a very nice hierarchical view of the repository, it shows you paths and properties. And if a developer starts to build a mental model of something, this tool comes in quite handy. Because you can reach everything via a path, which is a String! So instead of expressing relations between concepts I often see this:

String path = …
Resource pathResource = resourceResolver.getResource(path);

And because we know it’s an existing resource, and we want to determine the parent resource, I see

String path = …
int lastSlash = path.lastIndexOf("/");
String parentPath = path.substring(0,lastSlash);
Resource parentResource = resourceResolver.getResource(parentPath);

Which is hilarious, because

pathResource.getParent();

is much easier to use (and did you spot the off-by-one bug in the String operation example? And what happens if the path already ends with a slash?). But that still leaves the question why you need to get the parent resource at all. Maybe a

ProductCategory category = myProduct.getCategory();

is a more expressive way to describe the same. I would definitely prefer it.
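To make this a bit more tangible, here is a hedged sketch of such an abstraction as Sling Models; both classes and the content structure are invented purely for illustration:

import org.apache.sling.api.resource.Resource;
import org.apache.sling.models.annotations.Model;
import org.apache.sling.models.annotations.injectorspecific.Self;

// Invented classes for illustration: calling code now talks about products and
// categories instead of resources and paths.
@Model(adaptables = Resource.class)
public class Product {

    @Self
    private Resource resource;

    public String getName() {
        return resource.getValueMap().get("jcr:title", resource.getName());
    }

    public ProductCategory getCategory() {
        // in this invented content structure the category is simply the parent resource
        Resource parent = resource.getParent();
        return parent != null ? parent.adaptTo(ProductCategory.class) : null;
    }
}

@Model(adaptables = Resource.class)
class ProductCategory {

    @Self
    private Resource resource;

    public String getTitle() {
        return resource.getValueMap().get("jcr:title", resource.getName());
    }
}

A consumer then adapts the resource once (productResource.adaptTo(Product.class)) and works with products and categories from there on, without any path juggling.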

So CRXDE is your biggest enemy when designing your application. If you are a seasoned AEM developer, my recommendation to you: Don’t explain your application with CRXDE. Rather use proper abstractions. Don’t do CRXDE driven development!

If that topic sounds familiar to you: I gave a talk at the adaptTo() conference 2020 on this topic, and you can find the recording here. There I explain the problem in more detail, also including some better examples 🙂

by Jörg at February 08, 2021 01:39 PM

January 18, 2021

Things on a content management system - Jörg Hoh

Writing integration tests for AEM (part 4)

This a part of my ongoing series about writing integration tests with AEM.

In the last post I mentioned that the URL provided to our integration tests allows us to test our dispatcher rules as well, a kind of “unit testing” the dispatcher setup. That’s what we do now.

This is the German way of saying “Stop here if you don’t have the right user-agent^Wvehicle”
Photo by Julian Hochgesang on Unsplash

As a first step we need to create a new RequestValidationClient, because we need to customize the underlying HTTP client so it does not automatically follow HTTP redirects; otherwise it would be impossible for us to test redirects. And while we are at it, we want to customize the user-agent header as well, so it’s easier to spot the requests we do during the integration tests. The way to customize the underlying HTTP client is documented, but a bit clumsy. Besides that, this RequestValidationClient is no different from the SlingClient it’s derived from. Maybe we change that later.

The actual integration tests are in PublishRedirectsIT. Here I use this RequestValidationClient to perform unauthenticated requests (as end-users typically do) against the publish instance. To illustrate the testing of the client, there are 3 tests:

  • In the testInitialRedirectAndHomepage method it is validated that a request to “/” results in a permanent redirect to /us/en.html. Additionally it is made sure that /us/en.html is actually present and returns a 200.
  • A second test is hitting /system/console, which must never be exposed to the internet.
  • A third test ensures that the default GET servlet is properly secured, so that the infamous “infinity” selector for the JSON extension returns a 404.

With this approach it is possible to validate that the complete security checklist of the dispatcher is actually implemented and that all “invalid” URLs are properly blocked.
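For illustration, here is a hedged, trimmed-down sketch of the first test case (imports, rules and the surrounding class are omitted); it assumes that the RequestValidationClient (called anonymousPublish below, as in the actual test class) inherits the usual doGet() from SlingClient and does not follow redirects:

// Trimmed-down sketch only; the authoritative version is the PublishRedirectsIT in the
// linked repository.
@Test
public void testInitialRedirectAndHomepage() throws Exception {
    // "/" must answer with a permanent redirect to the homepage ...
    SlingHttpResponse response = anonymousPublish.doGet("/", 301);
    assertTrue(response.getFirstHeader("Location").getValue().endsWith("/us/en.html"));

    // ... and the homepage itself must be delivered
    anonymousPublish.doGet("/us/en.html", 200);
}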

Some remarks to the PublishRedirectIT implementation itself:

  • Also here the tests are a bit clumsier than they could be. First, because the recommended ways to perform an HTTP request always have an “expectedReturnCode” parameter, which is unfortunate because we want to perform this check ourselves. For that reason I built a small workaround to accept all status codes. The testing clients should offer that natively though.
  • And secondly, I encountered problems with the authentication on the publish. And that’s the reason why the creation of the anonymousPublish is how it is.

But anyway, that’s a neat approach to validate that your dispatcher setup is properly done. And of course you could also use the JsoupClient to test a page on publish as well.

Some remarks if you want to execute these tests in your system: I adjusted the configuration of the “dispatcher” module of the repository as well, so you can easily use it together with the dispatcher docker image (check out this fantastic documentation).

That’s it for today, happy testing!

by Jörg at January 18, 2021 03:03 PM

December 18, 2020

Things on a content management system - Jörg Hoh

Writing integration tests for AEM (part 3)

This a part of my ongoing series about writing integration tests with AEM.

In the last post on writing integration tests with AEM I quickly walked you through a simple test case for authoring instances, but I didn’t provide much context, what is going on exactly, and how it will be executed in Cloud Manager. That’s what I want to talk about today.

As we have seen, some relevant parameters for integration tests are provided are provided externally, most notable the URLs for the environment plus credentials.

In the pom.xml it looks like this:

Here you can see defaults, but you can simply override them by providing the exact values with the command line, as you already did in the previous post with overriding the URL of the authoring instance. The POM just introduces another indirection via properties which is technically not really necessary.

CloudManager works the same way: It invokes the maven-failsafe-plugin to execute the integration tests and overrides these default values with the correct data specific to that environment (including the admin credentials).

In detail, the urls are configured like this:

This means that your tests access the load-balanced author cluster and the load-balanced publish farm (including the dispatcher!).

This has 2 implications:

  • On your local installation you should also have a dispatcher configured in front of the publish instance, so you have an identical setup
  • You can use integration tests also to validate your publish dispatcher rules!

And armed with this knowledge I will show you in the next post how you can validate with integration tests that your domain setup is configured correctly.

by Jörg at December 18, 2020 06:27 PM

December 15, 2020

Things on a content management system - Jörg Hoh

Writing integration tests for AEM (part 2)

This a part of my ongoing series about writing integration tests with AEM.

In the last blog post I gave you a quick overview over the integration test framework you have at hand and what chances it gives you.

Now let’s get our hands dirty and create our first integration test. We will write a simple test which connects to the local author instance and tests that the wknd homepage is completely loaded and that all referenced files (images, javascript, css, …) are present.

This is where we start — just us and a lot of space to fill with good tests
Photo by Neven Krcmarek on Unsplash

Prerequisite is that you have the wknd package fully installed (cloning the wknd github repo, building it and installing the package from the “all” module should do the trick). There is no specific requirement on AEM itself, so AEM 6.4 or newer should suffice.

Basic structure

When you have started with the maven archetype for AEM, you should have an it.tests maven module, which contains all integration tests. Although they are tests, they are stored in src/main/java. That means that the whole test suite is created as a build artifact, and thus can easily be executed also outside of the maven build process.

Another special thing to remember: All test class names must end with “IT” (like “IntegrationTest”), otherwise they are ignored.

A custom client

(I have all that code ready on github, so you can just clone it and start playing.)

As a first step we will create a custom test client, which is able to parse a rendered page. As a basis I started with HtmlUnit, but that turned out to be a bit inflexible regarding multiple calls, so I switched over to jsoup for that.
That means our first piece of code is a JsoupClient. It extends the standard CQClient, and therefore we are able to use the “doRequest()” method to fetch the page content.
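The original post embeds that client from the repository; as a hedged sketch of the idea (constructor signatures follow the usual pattern of the Sling/CQ testing clients and should be double-checked against the library version you use, and doGet() stands in here for the doRequest() call of the real implementation), it could look like this:

import java.net.URI;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.sling.testing.clients.ClientException;
import org.apache.sling.testing.clients.SlingClientConfig;
import org.apache.sling.testing.clients.SlingHttpResponse;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.adobe.cq.testing.client.CQClient;

// Hedged sketch -- the authoritative version lives in the github repo linked below.
public class JsoupClient extends CQClient {

    public JsoupClient(CloseableHttpClient http, SlingClientConfig config) throws ClientException {
        super(http, config);
    }

    public JsoupClient(URI serverUrl, String user, String password) throws ClientException {
        super(serverUrl, user, password);
    }

    /** Fetches a page and hands the markup over to jsoup for parsing. */
    public Document getPage(String path) throws ClientException {
        SlingHttpResponse response = doGet(path, 200);
        return Jsoup.parse(response.getContent());
    }
}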

That’s the basis, because from now on we just deal with jsoup-specific structures (Document, Node). Then we add the actual test class (AuthorHomepageValidationIT), which starts with some boilerplate code:

The basis for everything is the CQAuthorClassRule, and based on that we create a jsoupClient object, which itself uses an “AdminClient” (that means the admin user is used for the tests). And now we can easily start to create simple tests with this jsoupClient instance.
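As a hedged sketch (imports omitted; field and method names follow the pattern of the AEM archetype’s it.tests module and are assumptions here), that boilerplate boils down to something like:

// Sketch of the test class boilerplate; the repository linked below is authoritative.
@ClassRule
public static final CQAuthorClassRule cqBaseClassRule = new CQAuthorClassRule();

@Rule
public CQRule cqBaseRule = new CQRule(cqBaseClassRule.authorRule);

static JsoupClient jsoupClient;

@BeforeClass
public static void initClient() throws ClientException {
    // obtain the custom client as admin user from the class rule
    jsoupClient = cqBaseClassRule.authorRule.getAdminClient(JsoupClient.class);
}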

(Please check the files in the github repo to get the complete picture, I omitted here quite a bit for brevity.)

We are using the standard tooling for unit tests here to create an integration test, that means the @Test annotation plus the usual set of asserts. But we are doing integration tests, which means that we are just validating the operations which are executed by AEM. If you start to use a mocking framework here, you are wrong!

OK, how do I run this integration test?

Now, as we have written our integration test, we need to execute it. To do that, use your command line and execute this command in the it.tests module:

mvn clean install -Peaas-local -Dit.author.url=http://localhost:4502

(You need to specify the author url as parameter because my personal default of port 6602 for my local authoring instance might not work on your local instance. Check the pom.xml for all details, it is not that complicated.)

The output will look like this:

[INFO] --- maven-failsafe-plugin:2.21.0:integration-test (default-integration-test) @ de.joerghoh.aem.it.tests ---
[INFO]
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running integrationtests.it.tests.AuthorHomepageValidationIT
[main] INFO com.adobe.cq.testing.junit.rules.ConfigurableInstance - Using Basic Auth as default. Index lane detection: false
[main] INFO org.apache.sling.testing.junit.rules.instance.util.ConfigurationPool - Reading initial configurations from the system properties
[main] INFO org.apache.sling.testing.junit.rules.instance.util.ConfigurationPool - Found 1 instance configuration(s) from the system properties
[main] INFO org.apache.sling.testing.junit.rules.instance.ExistingInstanceStatement - InstanceConfiguration (URL: http://localhost:6602, runmode: author) found for test integrationtests.it.tests.AuthorHomepageValidationIT
[main] WARN com.adobe.cq.testing.client.CQClient - Cannot resolve path //fonts.googleapis.com/css?family=Source+Sans+Pro:400,600|Asar&display=swap: Illegal character in query at index 57: //fonts.googleapis.com/css?family=Source+Sans+Pro:400,600|Asar&display=swap
[main] INFO integrationtests.it.tests.AuthorHomepageValidationIT - skipping linked resource from another domain: https://wknd.site/content/wknd/language-masters/en.html
[main] INFO integrationtests.it.tests.AuthorHomepageValidationIT - validated 148 linked resources
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.787 s - in integrationtests.it.tests.AuthorHomepageValidationIT
[INFO] Running integrationtests.it.tests.GetPageIT
[main] INFO com.adobe.cq.testing.junit.rules.ConfigurableInstance - Using LoginToken Auth as default. Index lane detection: false
[main] INFO com.adobe.cq.testing.junit.rules.ConfigurableInstance - Using LoginToken Auth as default. Index lane detection: false
[main] INFO org.apache.sling.testing.junit.rules.instance.ExistingInstanceStatement - InstanceConfiguration (URL: http://localhost:6602, runmode: author) found for test integrationtests.it.tests.GetPageIT
[WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.002 s - in integrationtests.it.tests.GetPageIT
[INFO] Running integrationtests.it.tests.CreatePageIT
[main] INFO com.adobe.cq.testing.junit.rules.ConfigurableInstance - Using LoginToken Auth as default. Index lane detection: false
[main] INFO org.apache.sling.testing.junit.rules.instance.ExistingInstanceStatement - InstanceConfiguration (URL: http://localhost:6602, runmode: author) found for test integrationtests.it.tests.CreatePageIT
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.3 s - in integrationtests.it.tests.CreatePageIT
[INFO]
[INFO] Results:
[INFO]
[WARNING] Tests run: 3, Failures: 0, Errors: 0, Skipped: 1

The relevant lines of the output show that my test reached out to my local AEM instance at port 6602 and validated 148 resources in total. If you want more details about what exactly was validated, add an info log message there.

Congratulations, you have just run your first integration test!

I leave it to you to provoke a failure of that integration test; all you have to do is have an image or a clientlib referenced on the wknd homepage (specified here) which does not return an HTTP status code 200. And of course this test is quite generic, as it does not mandate that a specific client library is there or even that the page footer is working. But as you have the power of jsoup at hand, it should not be too hard to write even more assertions to check these additional requirements.

In the next blog post I will elaborate a bit more around running integration tests and configuring them properly, before we start to explore the possibilities offered to us by the AEM testing clients.

(Update 2020-12-18: Changed the profile name to match CloudManager behavior)

by Jörg at December 15, 2020 08:30 PM

December 14, 2020

Things on a content management system - Jörg Hoh

Writing integration tests for AEM (part 1)

This a part of my ongoing series about writing integration tests with AEM.

Building tests is an integral part of software development, and it does not only include unit tests but also integration and frontend tests. With AEM as a Cloud Service integration tests are getting more and more important, as it allows you to run automated tests on “real” cloud service instances as part of the Cloudmanager Pipeline. See the documentation of CloudManager.

If you check the details, you will find that the overall structure for integration tests has been part of all projects which are created based on the AEM Project Archetype since at least version 11 (April 2017). So technically everyone has been able to implement integration tests based on that structure for a while, but I haven’t seen them receive proper attention. I also ignored these integration tests for most of the time…

A vintage implementation of a HTTP Client with 3 threads (symbol photo)
Photo by Pavan Trikutam on Unsplash

Recently I worked with my colleague Valentin Olteanu on creating a small integration test suite, and I was honestly surprised how easy it can be. And integration tests are now an official part of the Cloud Manager pipeline, and the first place where your code can be tested on a real CM instance.

So I want to give you a short overview of the capabilities of the Integration-Test framework for AEM. In the next blog post I will show a real-life usecase where such Integration tests can really help.

Ok, what are these integration tests and what can we do with these tests?

Integration tests run outside of AEM, as part of the deployment/test pipeline. They test the interaction of your custom application (which you have validated with your unit tests) with everything else, most prominently AEM itself. You can test the complete page rendering, you can test custom integrations, background processes and everything where you need the full AEM stack, and where mocks are not sufficient.

The test framework itself provides you proper abstractions to perform a lot of operations in a very convenient way. For example:

  • There is an AssetClient which allows you to upload assets into AEM
  • Functionality to create/delete/modify pages (as part of the CQClient)
  • Functionality to replicate content
  • and much more (see the whole list of clients)

And everything is wrapped in Java, so you don’t have to deal with the underlying HTTP requests. So this is an effective way to remote-control AEM from within Java code. But of course there’s also a raw preconfigured HTTP client (with hostnames, authentication etc. already set) which you can use to perform custom actions. And the surrounding testing framework is still the JUnit framework we are all used to.

But be aware: This integration test suite cannot directly access the JCR and Sling API, because it is running externally. If you want to create nodes or read their status, you have to rely on other means.

It is also not a Selenium test! If you want to do proper UI testing, please check the documentation on UI testing (still in beta, expect general availability soon). I plan to create a blog post about it.

A very simple integration test (basically just a validation of a page which has been created with a Page rule) can look like this (the full code):

    @Test
    public void testCreatePageAsAuthor() throws InterruptedException {
        // This shows that it exists for the author user
        userRule.getClient().pageExistsWithRetry(pageRule.getPath(), TIMEOUT);
    }

This integration test class itself comes with a bit of boilerplate code, mostly JUnit rules to set up the connection and prepare the environment (for example to create the page for which we test the existence).

And the best thing: You don’t need to take care of URLs and authentication, because these parameters are specified outside of your code and are normally provided via Maven properties. This keeps the code very portable, and gives you the chance to execute it both locally and as part of the Cloudmanager pipeline.

In the next blog post I want to demonstrate how easy it can be to validate that a page on the AEM author renders correctly.

by Jörg at December 14, 2020 08:45 PM

September 02, 2020

Things on a content management system - Jörg Hoh

Long running sessions and SegmentNotFoundExceptions

If you search this blog, you find one recurring theme over the years: The lifecycle of JCR sessions and Sling ResourceResolvers. That you should not keep them open for a long time. And that you definitely have to close them. But I never gave you an example what can happen if you don’t follow this recommendation. Until now.

These days I learned about an actual problem which can arise because of it. And the problem is called “SegmentNotFoundException”.

In the past a SegmentNotFoundException was a clear indication of a corrupt JCR repository. The recommendation was always either to fix it or to restore from backup. Both operations are tedious, require downtimes and possibly also mean a loss of data. That’s probably also the reason why this specific problem is often taken as a sign of such a repository corruption. So let’s look at it systematically.

The root cause

With AEM 6.4 the feature of “tail-compaction” was introduced, which is a version of the online compaction feature. It is less efficient but takes less time than the full compaction. By default in AEM the tail compaction runs daily and the online compaction once a week.

But from what I understood, this tail compaction has a problem with long-running sessions: it can happen that tar files which are still referenced get compacted and removed. That means that it’s not really an on-disk corruption which needs to be fixed, but rather that some “old sessions” (read about MVCC in the previous post) are referencing data which is not there (anymore).

An unclosed session – a symbol photo (by engin akyurt on Unsplash)

Validate the symptoms

The problem I describe in this post happens under some special circumstances, which you should check before you start hunting for long-running sessions:

  • You get SegmentNotFoundExceptions (always with the same segment ID).
  • A repository check doesn’t find any inconsistency.
  • If you restart the instance, the error is gone, but appears again after some time (mostly at least a day).
  • You are running AEM 6.4 or AEM 6.5 (SP doesn’t seem to matter).

In the case I observed, only a single workflow step was affected, but not all the time and only after some time, which made me believe that it was related to the compaction. But it was very hard to track down the error, because the workflow step itself was complex, but safe.

The solution

Fix every long-running session in your application (unless you are registering an ObservationListener in there, which takes care of the refreshes by design). Really all of them. Use the JMX web console plugin and check the list of registered session MBeans every day on a production instance. Count them. Look at the timestamps when the sessions were opened.
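The preferred pattern is to open a resolver or session only for the duration of the work and close it immediately afterwards. A minimal sketch; the subservice name "my-service" is purely illustrative and has to match a service user mapping in your project:

    import java.util.Collections;
    import java.util.Map;
    import org.apache.sling.api.resource.LoginException;
    import org.apache.sling.api.resource.ResourceResolver;
    import org.apache.sling.api.resource.ResourceResolverFactory;

    public class ShortLivedSessionExample {

        // "my-service" is a hypothetical subservice name; use your project's mapping
        private static final Map<String, Object> AUTH_INFO =
                Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "my-service");

        public void doWork(ResourceResolverFactory resolverFactory) throws LoginException {
            // try-with-resources closes the resolver (and its underlying JCR session) right away
            try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(AUTH_INFO)) {
                // ... read or write the repository here ...
            }
        }
    }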

In the case I observed, the long-running session was in a different area of the application, but was working on the same data (user profiles) as the failing workflow step. The two areas in the code were totally unrelated to each other, so this was the only way to track it down.

Final words

Some other notes, which I consider as important in this context:

  • When you encounter a SegmentNotFoundException, please always open a support ticket, just in case. If it’s a different issue than described here, it’s better if you have that ticket open already.
  • If you see exactly this issue, and changing your application code makes the problem go away, please also raise a support ticket. That bug should get fixed (even though long-running sessions have not been recommended for years).
  • As mentioned, when you encounter this issue, it’s not a persisted corruption. Restarting will cause the issue not to appear for some time, but that should only buy you time to identify and fix the long running sessions.
  • And AEM as a Cloud Service is not affected by this problem, because neither Online Compaction nor Tail Compaction is used. Instead the Golden Master is offline-compacted before cloning.

by Jörg at September 02, 2020 07:13 PM

August 24, 2020

Things on a content management system - Jörg Hoh

Long running sessions and clustering

In the last blog post I briefly talked about the basics what to consider when you are writing cluster-aware code. The essence is to be aware of your write activities, and make sure that the scheduled activities are running only on a single cluster node and not on many or all of them.

Today’s focus is on the behavior of JCR sessions with respect to clustering. From a conceptual point of view there is hardly a difference to a single-node cluster (or standalone instance), but the presence of more cluster nodes adds a new angle of potential problems to it.

When I talk about JCR, I am thinking of the Apache Oak implementation, which is implemented on top of the MVCC pattern. (The previous Jackrabbit implementation used a different approach, so this whole blog post does not apply there.) The basic principle of MVCC is that each session is clearly separated from any other session which is open in parallel. Also, any changes performed in a session are not visible to other sessions unless

  • the other session is invoking session.refresh() or
  • the other session is opened after the mentioned session is closed.

This behavior applies to all sessions of a JCR repository, no matter if they are opened on the same cluster node or not. The following diagram visualizes this:

Diagram showing how 2 sessions are performing changes to the repository without seeing the changes of the other as long as they don’t use session.refresh()

We have 2 sessions A1 and B1 which are initiated at the same time t0, and which perform changes independently of each other on the repository, so session B1 cannot see the changes performed with A1_1 (and vice versa). At time t1 session A1 is refreshed, and now it can see the changes B1_1 and B1_2. And afterwards B1 is refreshed as well, and can now see the changes A1_1 and A1_2 as well.

But if a session is not refreshed (or closed and a new session is used), it will never see the changes which happened on the repository after the session has been opened.
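In code, this visibility behavior looks roughly like the following sketch using the plain JCR API. The credentials and the /tmp path are purely illustrative, and the comments describe the MVCC model explained above:

    import javax.jcr.Repository;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    public class MvccVisibilityExample {

        public static void demo(Repository repository) throws RepositoryException {
            // two independent sessions, comparable to A1 and B1 in the diagram
            Session sessionA = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
            Session sessionB = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                // change A1_1: written and saved in session A
                sessionA.getNode("/tmp").addNode("changeA1_1");
                sessionA.save();

                // session B still works on its own snapshot and does not see the new node yet
                sessionB.nodeExists("/tmp/changeA1_1"); // false (per the model above)

                // after a refresh (false = discard pending local changes) session B sees the new state
                sessionB.refresh(false);
                sessionB.nodeExists("/tmp/changeA1_1"); // true
            } finally {
                sessionA.logout();
                sessionB.logout();
            }
        }
    }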

As said before, these sessions do not need to run on 2 separate cluster nodes, you get the same behavior on a single cluster node as well. But I mentioned that multiple cluster nodes are a special problem here. Why is that the case?

The problem is OSGi services running in the background, which perform a certain job and write data to the JCR repository. In a single-node cluster this is not a problem, because all of these activities go through that single service; and if that service uses a long-running JCR session for it, that will never be a problem, because this service is responsible for all changes, and the service can read and write all the relevant data. In a cluster with more than one node, each cluster node might have that service running, and the invocations of the services might be random. And as in the diagram above, on cluster node A the data A1_1 is written, and on cluster node B the data point B1_1 is written. But they don’t see each other’s changes if they don’t refresh the session! And in most applications, which are written for single-node AEM instances, session.refresh() is barely used, because in such situations there’s simply no need for it, as this problem never occurred.

So when you are migrating your application to AEM as a Cloud Service, review your application and make sure that you find all long-running ResourceResolvers and JCR sessions. The best option is then to remove these long-running sessions and replace them with short-living ones, which are closed when the job is done. The second-best option is to introduce a session.refresh(), so the session sees any updates which happened to the repository in the meantime. (And btw: if you are registering an ObservationListener in that session, you don’t need a manual refresh, as this refresh is done by the ObservationListener anyway; what would it be for if not for reporting changes to the repository which happen after opening the session?)
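If you cannot remove a long-running resolver right away, refreshing it at the start of each unit of work is that second-best stop-gap. A small sketch; the resolver field and the content path are hypothetical:

    import org.apache.sling.api.resource.PersistenceException;
    import org.apache.sling.api.resource.Resource;
    import org.apache.sling.api.resource.ResourceResolver;

    public class LongRunningResolverStopGap {

        // a long-lived resolver held by a background service (the anti-pattern discussed above)
        private ResourceResolver resolver;

        public void runPeriodicWork() throws PersistenceException {
            // stop-gap: refresh first, so this run sees everything committed since the last run
            resolver.refresh();

            Resource target = resolver.getResource("/content/my-site/some-data"); // illustrative path
            // ... process and modify the resource ...
            resolver.commit();
        }
    }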

That’s all right now regarding cluster-aware coding. But I am sure that there is more to come 🙂

by Jörg at August 24, 2020 04:24 PM

August 14, 2020

Things on a content management system - Jörg Hoh

Cluster aware coding in AEM

With AEM as a Cloud Service quite a number of small things have changed; among others you also get real clustering support in the authoring environment. Which is nice, because it gives you downtime-less authoring during deployments.

But this cluster also comes with a few gotchas, and one of them is that your application code needs to be cluster-aware. But what does that mean? What consequences does it have and what code do you have to change if you have never paid attention to this aspect?

The most important aspect is to do “every change only once”. It doesn’t make sense that 2 cluster nodes are importing the same set of data. A special version of this aspect is “avoid concurrent writes to the same node”, which can happen when a scheduled job is kicked off at the same time on all nodes, and this job is trying to change something in the repository. In this case you not only have overhead, but very likely also a lot of exceptions.

And there is a similar aspect which you should pay attention to: connections to external systems. If you have a cluster running the same code and configs, it’s not always wanted that each cluster node reaches out to that external system. Maybe you need to update it with the latest content only once, because it triggers some expensive processing on their side, and you don’t want to have that triggered two or three times, probably pretty much at the same time.

I have mentioned 2 cases where a clustered application can behave differently than in a single-node environment; now let me show you how you can make your application cluster-aware.

Scheduled jobs

Scheduled jobs are a classic tool to execute certain jobs at a certain time. Of course we could use the Sling Scheduler directly, but to make the execution more robust, you should wrap it into a Scheduled Sling Job.

See the Sling Jobs website for the documentation and some examples (although the Javadocs are missing the ScheduleBuilder class, but here’s the code). And of course you should check out Kaushal Mall’s post with even more examples.

Jobs give you the guarantee that a job is going to be executed at least once.
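For illustration, scheduling such a recurring Sling Job via the JobManager API might look roughly like this. The topic name and the schedule are made up, and a JobConsumer registered for the same topic would then do the actual work:

    import org.apache.sling.event.jobs.JobManager;
    import org.apache.sling.event.jobs.ScheduledJobInfo;

    public class NightlyJobScheduler {

        // "my/example/topic" is only an illustrative topic name
        private static final String TOPIC = "my/example/topic";

        public ScheduledJobInfo scheduleNightly(JobManager jobManager) {
            // creates a scheduled job that fires once per day at 02:00;
            // the job handling takes care that each execution is processed on one node only
            return jobManager.createJob(TOPIC)
                    .schedule()
                    .daily(2, 0)
                    .add();
        }
    }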

Use the Sling Scheduler only for very frequent jobs (e.g. once every 5 minutes), where it doesn’t matter if one execution is skipped, e.g. because the instance was just restarting. To limit the execution of such a job to a single node, you can annotate the job runner with this annotation:

@Property (name="scheduler.runOn", value="SINGLE")

(see the docs)
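With current OSGi Declarative Services annotations (instead of the older Felix SCR @Property shown above), the equivalent could look roughly like this sketch; the cron expression is only an example:

    import org.osgi.service.component.annotations.Component;

    // runs every 5 minutes; "scheduler.runOn=SINGLE" restricts the execution to one cluster node
    @Component(
        service = Runnable.class,
        property = {
            "scheduler.expression=0 0/5 * * * ?",
            "scheduler.runOn=SINGLE"
        }
    )
    public class SingleNodeScheduledTask implements Runnable {

        @Override
        public void run() {
            // ... the actual work ...
        }
    }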

What about caches?

In-memory caches are often used to speed up operations. Most often they contain the results of previous operations which are then reused; cache elements are either actively purged or expire using a time-to-live.

Normally such caches are not affected by clustering. They might contain different items with potentially different values on the cluster nodes, but that must never be a problem. If that is a problem, you have to look for a different approach, e.g. persisting the data to the repository (if they are not already coming from there) or externalizing the cache (e.g. to a Redis or memcached instance).

Also, having a simpler application instead of the highest cache-hit ratio possible is often a good trade-off.

Ok, these were the topics I wanted to discuss here. But expect a blog post about one of my favorite topics: “Long running sessions and clustering”.

by Jörg at August 14, 2020 02:53 PM